
Model Selection Criteria

All models are wrong, but some are useful. Model selection criteria are rules used to select a (statistical) model among competing models, based on given data.

There are several model selection criteria used to choose among a set of candidate models and/or to compare models for forecasting purposes.

All model selection criteria aim at minimizing the residual sum of squares (equivalently, increasing the coefficient of determination). The adjusted $R^2$, the Akaike Information Criterion (AIC), the Schwarz/Bayesian Information Criterion (SIC/BIC), and Mallow's $C_p$ impose a penalty for including an increasingly large number of regressors. There is therefore a trade-off between the goodness of fit of the model and its complexity, where complexity refers to the number of parameters in the model.

Coefficient of Determination ($R^2$)

$$R^2=\frac{\text{Explained Sum of Squares}}{\text{Total Sum of Squares}}=1-\frac{\text{Residual Sum of Squares}}{\text{Total Sum of Squares}}$$

Adding more variables to the model may increase $R^2$ but it may also increase the variance of forecast error.
There are some problems with $R^2$:

  • It measures in-sample goodness of fit (how close the estimated $Y$ values are to the actual values) in the given sample. There is no guarantee that a model with a high $R^2$ will forecast out-of-sample observations well.
  • In comparing two or more $R^2$’s, the dependent variable must be the same.
  • $R^2$ cannot fall when more variables are added to the model.
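The last point can be checked numerically. The sketch below (not from the article; the data and variable names are made up for illustration) fits two nested OLS models with numpy and confirms that $R^2$ does not fall when a pure-noise regressor is added:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)                    # pure noise, unrelated to y
y = 2.0 + 1.5 * x1 + rng.normal(size=n)

def r_squared(regressors, y):
    """R^2 = 1 - RSS/TSS for an OLS fit with intercept."""
    X = np.column_stack([np.ones(len(y))] + list(regressors))
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ beta) ** 2)
    tss = np.sum((y - y.mean()) ** 2)
    return 1 - rss / tss

r2_small = r_squared([x1], y)              # model with x1 only
r2_big = r_squared([x1, x2], y)            # x2 adds nothing real
assert r2_big >= r2_small                  # R^2 never decreases
```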

Adjusted Coefficient of Determination ($\overline{R}^2$)


$$\overline{R}^2 = 1-(1-R^2)\frac{n-1}{n-k}$$
where $k$ is the number of parameters including the intercept. Since $\overline{R}^2 \le R^2$, the adjusted $R^2$ penalizes adding more regressors (explanatory variables). Unlike $R^2$, the adjusted $R^2$ will increase only if the absolute $t$-value of the added variable is greater than 1. For comparative purposes, $\overline{R}^2$ is a better measure than $R^2$. The regressand (dependent variable) must be the same for the comparison of models to be valid.
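A small numerical sketch of the penalty (the figures $n=50$, $R^2 = 0.700$ and $0.702$ are invented for illustration): a useless extra regressor nudges $R^2$ up slightly, yet the adjusted $R^2$ goes down.

```python
def adj_r_squared(r2, n, k):
    """Adjusted R^2 = 1 - (1 - R^2)(n - 1)/(n - k), k counting the intercept."""
    return 1 - (1 - r2) * (n - 1) / (n - k)

# With n = 50 observations: adding one regressor lifts R^2 from 0.700
# to 0.702, but the adjusted R^2 falls because the gain is too small.
a_small = adj_r_squared(0.700, 50, 2)
a_big = adj_r_squared(0.702, 50, 3)
assert a_big < a_small
```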

Akaike’s Information Criterion (AIC)

$$AIC=e^{\frac{2k}{n}}\frac{\sum \hat{u}^2_i}{n}=e^{\frac{2k}{n}}\frac{RSS}{n}$$
where $k$ is the number of regressors including the intercept. Taking logs,

$$\ln AIC = \left(\frac{2k}{n}\right) + \ln \left(\frac{RSS}{n}\right)$$
where $\ln AIC$ is the natural log of AIC and $\frac{2k}{n}$ is the penalty factor.

AIC imposes a harsher penalty than the adjusted coefficient of determination for adding more regressors. In comparing two or more models, the model with the lowest AIC value is preferred. AIC is useful for assessing both the in-sample and out-of-sample forecasting performance of a regression model. AIC is also used to determine the lag length in an AR($p$) model.
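As a sketch of the comparison rule above (the RSS and $k$ values are hypothetical, not from the article): model B fits slightly better but pays for its two extra regressors, so the simpler model A wins on $\ln AIC$.

```python
import numpy as np

def ln_aic(rss, n, k):
    """ln AIC = 2k/n + ln(RSS/n), as in the text."""
    return 2 * k / n + np.log(rss / n)

n = 100
# Hypothetical fits: model A with k = 3 leaves RSS = 240;
# model B with k = 5 only trims it to RSS = 235.
ln_aic_A = ln_aic(240.0, n, 3)
ln_aic_B = ln_aic(235.0, n, 5)
# The lower value is preferred; here the extra regressors do not pay off.
assert ln_aic_A < ln_aic_B
```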

Schwarz’s Information Criterion (SIC)

$$SIC=n^{\frac{k}{n}}\frac{\sum \hat{u}_i^2}{n}=n^{\frac{k}{n}}\frac{RSS}{n}$$

$$\ln SIC = \frac{k}{n} \ln n + \ln \left(\frac{RSS}{n}\right)$$
where $\frac{k}{n}\ln n$ is the penalty factor. SIC imposes a harsher penalty than AIC.

Like AIC, SIC can be used to compare the in-sample or out-of-sample forecasting performance of a model. The lower the value of SIC, the better the model.
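A sketch of why SIC's penalty is the harsher one (numbers invented for illustration): each regressor costs $\frac{1}{n}\ln n$ under SIC versus $\frac{2}{n}$ under AIC, so SIC penalizes more whenever $\ln n > 2$, i.e. for $n \ge 8$.

```python
import numpy as np

def ln_sic(rss, n, k):
    """ln SIC = (k/n) ln n + ln(RSS/n), as in the text."""
    return (k / n) * np.log(n) + np.log(rss / n)

n, k, rss = 100, 5, 235.0
# SIC's penalty (k/n) ln n exceeds AIC's 2k/n whenever ln n > 2 (n >= 8):
assert (k / n) * np.log(n) > 2 * k / n
# And more parameters always mean a larger SIC penalty at fixed RSS:
assert ln_sic(rss, n, 5) > ln_sic(rss, n, 3)
```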

Mallow’s $C_p$ Criterion

For model selection, Mallow's criterion is
$$C_p=\frac{RSS_p}{\hat{\sigma}^2}-(n-2p)$$
where $RSS_p$ is the residual sum of squares from a model with $p$ parameters and $\hat{\sigma}^2$ is an estimate of the error variance, usually obtained from the full model. If the $p$-parameter model is adequate,
$$E(C_p)\approx \frac{(n-p)\sigma^2}{\sigma^2}-(n-2p)\approx p$$
A model with a low $C_p$ value, approximately equal to $p$, is preferable.
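A worked sketch of the rule (all figures are invented for illustration): with $n=50$, an error-variance estimate of $4.0$ from the full model, and a subset model with $p=3$ parameters leaving $RSS_p=188$, the criterion lands right at $p$.

```python
def mallows_cp(rss_p, sigma2_hat, n, p):
    """Mallow's C_p = RSS_p / sigma_hat^2 - (n - 2p)."""
    return rss_p / sigma2_hat - (n - 2 * p)

n = 50
sigma2_hat = 4.0          # error-variance estimate from the full model (assumed)
cp = mallows_cp(188.0, sigma2_hat, n, p=3)
# cp = 188/4 - (50 - 6) = 47 - 44 = 3, about equal to p = 3, so this
# subset model would be judged adequate.
assert cp == 3.0
```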

Bayesian Information Criterion (BIC)

The Bayesian Information Criterion is based on the likelihood function and is closely related to the AIC. The penalty term in BIC is larger than in AIC:
$$BIC = k\ln n - 2\ln \hat{L}$$
where $\hat{L}$ is the maximized value of the likelihood function of the regression model.
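A sketch connecting BIC to the likelihood (the $n$ and RSS figures are invented; the Gaussian log-likelihood formula is the standard one for a normal-errors regression): since each extra parameter costs $\ln n$ under BIC but only $2$ under AIC, BIC's penalty is the larger one once $n > e^2 \approx 7.4$.

```python
import math

def gaussian_log_lik(rss, n):
    """Maximized Gaussian log-likelihood: -(n/2)(ln 2*pi + ln(RSS/n) + 1)."""
    return -0.5 * n * (math.log(2 * math.pi) + math.log(rss / n) + 1)

def bic(rss, n, k):
    return k * math.log(n) - 2 * gaussian_log_lik(rss, n)

def aic(rss, n, k):
    return 2 * k - 2 * gaussian_log_lik(rss, n)

n, rss = 100, 235.0
# The marginal cost of a 5th parameter: ln(100) ~ 4.61 under BIC vs 2 under AIC.
assert bic(rss, n, 5) - bic(rss, n, 4) > aic(rss, n, 5) - aic(rss, n, 4)
```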

Note that none of these criteria is necessarily superior to the others.


Coefficient of Determination: A Model Selection Criterion

$R^2$, pronounced R-squared (the coefficient of determination), is a useful statistic for checking the fit of a regression. $R^2$ measures the proportion of the total variation about the mean $\bar{Y}$ explained by the regression. $R$ is the correlation between $Y$ and $\hat{Y}$ and is usually the multiple correlation coefficient. The coefficient of determination can take values as high as 1 (100%), i.e. $0\le R^2\le 1$. When repeat runs exist in the data, the value of $R^2$ cannot attain 1 no matter how well the model fits, because no model can explain the variation in the data due to pure error. For a perfect fit, in which $\hat{Y}_i=Y_i$ for all $i$, $R^2=1$. If $\hat{Y}_i=\bar{Y}$, that is, if $\beta_1=\beta_2=\cdots=\beta_{p-1}=0$, or if the model $Y=\beta_0 +\varepsilon$ alone has been fitted, then $R^2=0$. Therefore, $R^2$ is a measure of the usefulness of the terms other than $\beta_0$ in the model.

Note that we must make sure that an improvement/increase in the $R^2$ value due to adding a new term (variable) to the model has some real significance and is not merely due to the number of parameters in the model getting close to the saturation point. If there is no pure error, $R^2$ can be made unity.

$$\begin{aligned}
R^2 &= \frac{\text{SS due to regression given}\; b_0}{\text{Total SS corrected for mean}\; \bar{Y}} \\
&= \frac{SS\,(b_1 \mid b_0)}{S_{YY}} \\
&= \frac{\sum(\hat{Y}_i-\bar{Y})^2}{\sum(Y_i-\bar{Y})^2} \\
&= \frac{S^2_{XY}}{S_{XX}\,S_{YY}}
\end{aligned}$$

where the summations are over $i=1,2,\ldots,n$.
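The equivalent expressions can be checked numerically for a simple (one-regressor) fit. This is a sketch with made-up data, not from the article:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=30)
y = 1.0 + 2.0 * x + rng.normal(size=30)

# Corrected sums of squares and cross-products.
Sxx = np.sum((x - x.mean()) ** 2)
Syy = np.sum((y - y.mean()) ** 2)
Sxy = np.sum((x - x.mean()) * (y - y.mean()))

# Least-squares slope and fitted values for the simple regression.
b1 = Sxy / Sxx
y_hat = y.mean() + b1 * (x - x.mean())

r2_fitted = np.sum((y_hat - y.mean()) ** 2) / Syy   # regression SS / total SS
r2_moments = Sxy**2 / (Sxx * Syy)                   # S_XY^2 / (S_XX S_YY)
assert np.isclose(r2_fitted, r2_moments)            # the two forms agree
```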

Note that, when interpreting it, $R^2$ does not indicate whether:

  • the independent variables (explanatory variables) are a cause of the changes in the dependent variable;
  • omitted-variable bias exists;
  • the correct regression was used;
  • the most appropriate set of explanatory variables has been selected;
  • there is collinearity (or multicollinearity) present in the data;
  • the model might be improved by using transformed versions of the existing set of explanatory variables.
