**WHAT IS AN R2 STATISTIC?**

An **R2 statistic** is a measure of goodness-of-fit, also known as the coefficient of determination. It is the proportion of variability in a data set that is accounted for by the chosen model. In a simple linear regression (*y* = *mx* + *b*), the R2 statistic is the percentage of variability in *y* that can be explained by movements in *x*. Note that in a linear regression, R2 equals the square of the sample correlation coefficient between the outcomes (*y*) and their predicted values, and can be thought of as a percentage from 0 to 100.
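A minimal sketch of both definitions, using made-up data: R2 computed as one minus the ratio of residual to total variability, checked against the squared correlation between the outcomes and their fitted values.

```python
import numpy as np

# Made-up data that is roughly linear in x.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.2])

m, b = np.polyfit(x, y, 1)            # least-squares slope and intercept
y_hat = m * x + b                     # fitted (predicted) values

ss_res = np.sum((y - y_hat) ** 2)     # residual sum of squares
ss_tot = np.sum((y - y.mean()) ** 2)  # total sum of squares
r2 = 1 - ss_res / ss_tot              # proportion of variability explained

corr = np.corrcoef(y, y_hat)[0, 1]
# r2 equals corr ** 2 (up to floating-point error), as stated above.
```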

**WHAT IS A RANDOM WALK? IS IT STATIONARY OR NON-STATIONARY?**

A random walk is a path that consists of taking successive random steps. The movements of a stock price, for example, are often assumed to follow a random walk. Random walks are not stationary: where you “are” in the walk depends on where you were immediately prior to your last step. Non-stationary behaviors can include trends, cycles, random walks, or combinations of the three. Note that while non-stationary data cannot be reliably modeled or forecast directly, the data can usually be transformed so that they can be modeled.

In a pure random walk (*Yt* = *Yt−1* + *εt*), the value or location at time *t* equals the last period's value plus a stochastic (non-systematic) component that is white noise, meaning *εt* is independent and identically distributed with mean 0 and variance σ². The random walk is a non-mean-reverting process that can move away from the mean in either a positive or negative direction. Put another way, the variances and covariances of the walk change over time. Another characteristic of a random walk is that its variance evolves over time and goes to infinity as time goes to infinity; therefore, a random walk cannot be predicted.
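A quick simulation illustrates the growing variance. The sketch below generates many independent pure random walks and measures the cross-sectional variance at two points in time; for i.i.d. steps with variance σ², the variance at time *t* is roughly *t*σ².

```python
import numpy as np

rng = np.random.default_rng(0)
n_paths, n_steps, sigma = 10_000, 200, 1.0

# Each row is one pure random walk Y_t = Y_{t-1} + eps_t starting at 0.
eps = rng.normal(0.0, sigma, size=(n_paths, n_steps))
walks = np.cumsum(eps, axis=1)

var_t50 = walks[:, 49].var()    # variance across paths at t = 50
var_t200 = walks[:, 199].var()  # variance across paths at t = 200
# var_t50 is near 50 and var_t200 near 200: the variance grows linearly
# with t, so the process is non-stationary and never settles down.
```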

**WHAT HAPPENS IF YOU CREATE A REGRESSION BASED ON 2 VARIABLES THAT EACH ARE CONTINUOUSLY INCREASING WITH TIME?**

If you attempt to model two series that are both time-dependent (say, consumption and income), you will get a spurious regression, i.e., a model with a high R2, but poor predictive properties. This is because both series are non-stationary. With non-stationary variables, one needs to transform them into stationary series; the easiest way to do this is by differencing, or looking at changes in the series. Changes in a non-stationary series are usually stationary.

What to do if two series are non-stationary: rather than differencing each series, one may be able to create a better model by finding a *cointegrating* relationship. With cointegration, the aim is to detect any common stochastic trends in the underlying data; although the two series may each be non-stationary, the difference between the two non-stationary series may itself be stationary.
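The spurious-regression problem, and the differencing fix, can be demonstrated with two independent simulated random walks: in levels the regression can show a deceptively high R2, while after differencing the apparent relationship disappears.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 500
x = np.cumsum(rng.normal(size=n))  # non-stationary series 1
y = np.cumsum(rng.normal(size=n))  # non-stationary series 2, independent of x

def r2(a, b):
    """R2 from a simple least-squares regression of b on a."""
    m, c = np.polyfit(a, b, 1)
    resid = b - (m * a + c)
    return 1 - resid.var() / b.var()

r2_levels = r2(x, y)                   # often sizable despite no true relationship
r2_diffs = r2(np.diff(x), np.diff(y))  # near zero: the changes are stationary
```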

**IF I HAVE A REGRESSION BETWEEN X AND Y, WHAT TEST STATISTICS SHOULD I LOOK AT TO DETERMINE WHETHER I HAVE A GOOD MODEL?**

**R2:** See above.

**t-Statistic:** In a least squares regression, the *t*-statistic is the estimated regression coefficient of a given independent variable divided by its standard error. If the *t*-statistic is more than 2 (i.e., the coefficient is at least twice as large as its standard error), one can conclude that the variable in question has a significant impact on the dependent variable.

***F*-Test:** In a regression model, the *F*-test allows one to test the null hypothesis. *H*0: all non-constant coefficients in the regression equation are zero (i.e., the model has no explanatory power). *Ha*: at least one of the non-constant coefficients in the regression equation is non-zero. The *F*-test assesses the joint explanatory power of the variables, while the *t*-tests assess their explanatory power individually. One rejects the null hypothesis when the *F*-statistic is greater than its critical value.
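A sketch of the *t*-statistic computed from first principles on made-up data; with a single regressor, the *F*-statistic is simply the square of the slope's *t*-statistic.

```python
import numpy as np

# Made-up, strongly linear data.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([2.3, 2.9, 4.2, 4.8, 6.1, 6.8, 8.2, 8.7])
n = len(x)

m, b = np.polyfit(x, y, 1)                        # least-squares slope and intercept
resid = y - (m * x + b)
s2 = np.sum(resid ** 2) / (n - 2)                 # residual variance estimate
se_m = np.sqrt(s2 / np.sum((x - x.mean()) ** 2))  # standard error of the slope

t_stat = m / se_m    # coefficient divided by its standard error
f_stat = t_stat ** 2  # with one regressor, the F-statistic equals t^2
# t_stat is well above 2 here, so the slope is significant by the rule of thumb.
```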

**Durbin-Watson statistic:** The Durbin-Watson (*d*) statistic measures the presence of autocorrelation in the residuals. The value of *d* always lies between 0 and 4. A value of 2 indicates no autocorrelation. If the Durbin-Watson statistic is substantially less than 2, there is evidence of positive serial correlation.
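The statistic is easy to compute directly from residuals: the sum of squared successive differences divided by the sum of squared residuals. The sketch below contrasts white-noise residuals (*d* near 2) with strongly autocorrelated residuals (*d* near 0), using simulated data.

```python
import numpy as np

def durbin_watson(residuals):
    """Durbin-Watson d: sum of squared first differences over sum of squares."""
    residuals = np.asarray(residuals, dtype=float)
    diff = np.diff(residuals)
    return np.sum(diff ** 2) / np.sum(residuals ** 2)

rng = np.random.default_rng(1)
white = rng.normal(size=1000)                # uncorrelated residuals
trending = np.cumsum(rng.normal(size=1000))  # highly autocorrelated residuals

d_white = durbin_watson(white)        # close to 2: no autocorrelation
d_trending = durbin_watson(trending)  # close to 0: strong positive serial correlation
```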

**Akaike Information Criterion:** From among a set of models, the Akaike Information Criterion (AIC) identifies the preferred model as the one with the minimum AIC value. AIC = 2*k* − 2 ln(*L*), where *k* is the number of parameters in the statistical model and *L* is the maximized value of the likelihood function for the estimated model. The AIC rewards goodness of fit, but also includes a penalty that is an increasing function of the number of estimated parameters, which discourages overfitting.
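A sketch of AIC-based model comparison on made-up data. For a Gaussian regression model, −2 ln(*L*) equals *n* ln(SSres/*n*) up to an additive constant, so AIC = 2*k* + *n* ln(SSres/*n*) is sufficient for ranking models fitted to the same data.

```python
import numpy as np

def aic(y, y_hat, k):
    """Gaussian AIC up to an additive constant: 2k + n*ln(ss_res/n)."""
    n = len(y)
    ss_res = np.sum((y - y_hat) ** 2)
    return 2 * k + n * np.log(ss_res / n)

rng = np.random.default_rng(7)
x = np.linspace(-1, 1, 100)
y = 1.5 * x + 2.0 + rng.normal(0, 1.0, size=100)  # the true relationship is linear

fit_lin = np.polyval(np.polyfit(x, y, 1), x)  # linear model: k = 2 parameters
fit_p9 = np.polyval(np.polyfit(x, y, 9), x)   # degree-9 model: k = 10 parameters

aic_lin = aic(y, fit_lin, 2)
aic_p9 = aic(y, fit_p9, 10)
# The degree-9 model fits slightly better in-sample, but the 2k penalty
# typically leaves it with the higher (worse) AIC, discouraging overfitting.
```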

**IF I HAVE DATA FROM 1912-2012, WHAT’S THE DANGER IF I BUILD A MODEL FORECASTING VALUES IN 2012 USING ALL OF THE HISTORICAL DATA?**

Fitting a model with all available data is called in-sample testing. A model can be constructed that performs exceptionally well during a selected period of history, but structural changes over time may mean that a model that worked well in the past will not work well in the future. Rather than back-testing a model using only in-sample data, create the model using part of the historical data and then test how well it works when applied to data that *wasn't* used to construct it (out-of-sample testing). While back-testing can provide valuable information regarding a model's potential, back-testing alone often produces deceptive results.
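The danger is easy to reproduce with simulated data: an overly flexible model fitted to the first part of a series shows a tiny in-sample error but a far larger error on the held-out later period.

```python
import numpy as np

rng = np.random.default_rng(3)
x = np.linspace(0, 1, 120)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, size=120)

train, test = slice(0, 80), slice(80, 120)   # earlier data vs. held-out later data
coeffs = np.polyfit(x[train], y[train], 12)  # deliberately overfit polynomial

mse_in = np.mean((y[train] - np.polyval(coeffs, x[train])) ** 2)
mse_out = np.mean((y[test] - np.polyval(coeffs, x[test])) ** 2)
# mse_out is far larger than mse_in: the model memorized in-sample noise
# and fails on data it wasn't built from.
```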

**WHY SHOULD I CARE ABOUT RESIDUALS IN A REGRESSION?**

If a plot of residuals versus time does not look like white noise, the model is likely misspecified. Correlation of residuals over time is known as autocorrelation, and it can be checked by calculating the Durbin-Watson statistic (above). With autocorrelated residuals, the estimated regression coefficients are still unbiased, but they no longer have the minimum variance among all unbiased estimates (they are inefficient). Moreover, the estimated standard errors are distorted, so confidence intervals and significance tests such as the *F*-test become unreliable, potentially leading to incorrect conclusions about whether the set of explanatory variables is significant as a whole.
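A sketch of why residuals matter, using simulated data: fitting a straight line to data with a curved trend leaves residuals that are strongly autocorrelated rather than white noise, flagging the misspecification.

```python
import numpy as np

rng = np.random.default_rng(9)
x = np.linspace(0, 1, 200)
y = 4.0 * (x - 0.5) ** 2 + rng.normal(0, 0.05, size=200)  # true shape is quadratic

m, b = np.polyfit(x, y, 1)  # misspecified: a straight-line fit
resid = y - (m * x + b)

def lag1_autocorr(r):
    """Sample lag-1 autocorrelation of a series."""
    r = r - r.mean()
    return np.sum(r[1:] * r[:-1]) / np.sum(r ** 2)

rho = lag1_autocorr(resid)
# rho is close to 1: consecutive residuals move together, so the residual
# plot does not look like white noise and the linear model is misspecified.
```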

**WHAT IS A MONTE CARLO SIMULATION?**

A Monte Carlo simulation is a computerized mathematical procedure for sampling random outcomes of a given process. It provides a range of possible outcomes (and their associated probabilities) rather than a single point estimate. Monte Carlo simulation performs risk analysis by building models of possible results, substituting a *probability distribution* for any factor that has inherent uncertainty. The simulation then recalculates results over and over, each time using a different set of random values drawn from the input probability distributions; the recorded results together form a probability distribution of possible outcomes. Depending on the number of uncertainties and the ranges specified for them, a Monte Carlo simulation can involve thousands or tens of thousands of recalculations.
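A minimal sketch with an invented example: the total cost of a hypothetical project with two uncertain inputs, each modeled by a probability distribution. The output is a distribution of outcomes summarized by percentiles, not a single point estimate.

```python
import numpy as np

rng = np.random.default_rng(123)
n_trials = 100_000

# Hypothetical uncertain inputs, each drawn from a chosen distribution.
labor = rng.normal(loc=50.0, scale=5.0, size=n_trials)       # symmetric uncertainty
materials = rng.triangular(20.0, 30.0, 55.0, size=n_trials)  # right-skewed uncertainty

total = labor + materials  # recalculated once per random trial
p5, p50, p95 = np.percentile(total, [5, 50, 95])
prob_over_100 = np.mean(total > 100.0)  # chance of exceeding a budget of 100
```

A deterministic analysis would report only the single base-case total (50 + 30 = 80, say); the simulation instead shows the full range of outcomes and how likely each is.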

Monte Carlo simulation provides a number of advantages over deterministic, or “single-point estimate” analysis:

1) **Probabilistic Results.** Results show not only what could happen, but also how likely each outcome is.

2) **Sensitivity Analysis.** In a Monte Carlo simulation, it’s easy to see which inputs had the biggest effect on the results. In deterministic models, it’s difficult to model different combinations of values for different inputs to see the effects of truly different scenarios.

3) **Correlation of Inputs.** In a Monte Carlo simulation, it’s possible to model relationships among input variables.