# Model Selection Criteria

## Coefficient of Determination Formula

Coefficient of Determination as a Link between Regression and Correlation Analysis

### Coefficient of Determination Formula in Statistics

R-squared ($r^2$, the square of the correlation coefficient) gives the proportion of the total variation in the dependent variable ($Y$) that is explained by the independent (explanatory) variable ($X$). For this reason, $r^2$ is called the coefficient of determination.

Since

$r=\frac{\sum x_i y_i}{\sqrt{\sum x_i^2} \sqrt{\sum y_i^2}},$

the coefficient of determination is

\begin{align*}
r^2&=\frac{(\sum x_iy_i)^2}{(\sum x_i^2)(\sum y_i^2)}=\frac{\sum \hat{y}^2}{\sum y^2}\\
&=\frac{\text{Explained Variation}}{\text{Total Variation}}
\end{align*}

where $r$ shows the degree of covariability of $X$ and $Y$. Note that the formula used here is in deviation form, that is, $x=X-\bar{X}$ and $y=Y-\bar{Y}$.
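The two forms of $r^2$ above can be checked numerically. The following is a minimal sketch, assuming NumPy is available; the data values are made up purely for illustration and do not come from the text.

```python
import numpy as np

# Hypothetical sample data (an assumption, chosen only for illustration).
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

# Deviation form: subtract each variable's own sample mean.
x = X - X.mean()
y = Y - Y.mean()

# r^2 from the squared correlation formula.
r2_corr = (x @ y) ** 2 / ((x @ x) * (y @ y))

# Least-squares slope in deviation form, fitted deviations, and residuals.
b1 = (x @ y) / (x @ x)
y_hat = b1 * x
e = y - y_hat

# r^2 as explained / total variation, and as 1 - unexplained / total.
r2_explained = (y_hat @ y_hat) / (y @ y)
r2_unexplained = 1.0 - (e @ e) / (y @ y)

print(r2_corr, r2_explained, r2_unexplained)
```

For this data all three quantities agree (about 0.6, up to floating-point rounding), which is what the formula predicts.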

The link of $r^2$ between regression and correlation analysis can be considered from these points.

• If all the observations lie on the regression line, then there is no scatter about the line. In other words, the total variation of variable $Y$ is explained completely by the estimated regression line, and there is no unexplained variation. That is,
$\frac{\sum e^2}{\sum y^2}=\frac{\text{Unexplained Variation}}{\text{Total Variation}}=0$
Hence, $r^2=1$ and $r=\pm 1$.
• If the regression line explains only part of the variation in variable $Y$, then there will be some unexplained variation, that is,
$\frac{\sum e^2}{\sum y^2}=\frac{\text{Unexplained Variation}}{\text{Total Variation}}>0$
then, $r^2$ will be smaller than 1.
• If the regression line does not explain any part of the variation of variable $Y$, that is,
$\frac{\sum e^2}{\sum y^2}=\frac{\text{Unexplained Variation}}{\text{Total Variation}}=1\Rightarrow \sum y^2 = \sum e^2$
then, $r^2=0$.

All three cases follow because $r^2=1-\frac{\text{unexplained variation}}{\text{total variation}}$.
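As a sketch of where this identity comes from, assume a least-squares fit with an intercept, so that the residuals $e=y-\hat{y}$ satisfy $\sum \hat{y}e=0$. The total variation then splits into an explained and an unexplained part:

\begin{align*}
\sum y^2 &= \sum (\hat{y}+e)^2 = \sum \hat{y}^2 + 2\sum \hat{y}e + \sum e^2 = \sum \hat{y}^2 + \sum e^2\\
\Rightarrow\quad r^2 &= \frac{\sum \hat{y}^2}{\sum y^2} = 1-\frac{\sum e^2}{\sum y^2}
\end{align*}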


## Coefficient of Determination: A Model Selection Criterion

$R^2$, pronounced R-squared (the coefficient of determination), is a useful statistic for checking the fit of a regression. $R^2$ measures the proportion of total variation about the mean $\bar{Y}$ explained by the regression. $R$ is the correlation between $Y$ and $\hat{Y}$ and is usually called the multiple correlation coefficient. In general $0\le R^2\le 1$, and $R^2$ can take values as high as 1 (or 100%) when all the $X$ values are different.

#### Coefficient of Determination

When repeat runs exist in the data, the value of $R^2$ cannot attain 1, no matter how well the model fits, because no model can explain the variation in the data due to pure error. For a perfect fit to the data, where $\hat{Y}_i=Y_i$ for every observation, $R^2=1$. If $\hat{Y}_i=\bar{Y}$, that is, if $\beta_1=\beta_2=\cdots=\beta_{p-1}=0$ or if the model $Y=\beta_0 +\varepsilon$ alone has been fitted, then $R^2=0$. Therefore we can say that $R^2$ is a measure of the usefulness of the terms other than $\beta_0$ in the model.

Note that an increase in $R^2$ due to adding a new term (variable) to the model under study should have some real significance, and should not arise merely because the number of parameters in the model is getting close to the saturation point (the number of observations). If there is no pure error, $R^2$ can be made unity by adding enough terms.
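Both effects, the climb of $R^2$ toward 1 as the model nears saturation and the ceiling imposed by pure error, can be illustrated numerically. This is a minimal sketch, assuming NumPy; the helper `r_squared`, the polynomial models, and all data values are hypothetical choices made only for illustration.

```python
import numpy as np

def r_squared(design, Y):
    """R^2 = 1 - SS_residual / S_YY for a least-squares fit of Y on design."""
    beta, *_ = np.linalg.lstsq(design, Y, rcond=None)
    resid = Y - design @ beta
    s_yy = np.sum((Y - Y.mean()) ** 2)
    return 1.0 - (resid @ resid) / s_yy

rng = np.random.default_rng(0)

# Case 1: six distinct X values, Y is pure noise unrelated to X.
X = np.arange(6, dtype=float)
Y = rng.normal(size=6)
r2_by_p = []
for p in range(1, 7):
    # Design matrix with columns 1, X, ..., X^(p-1), i.e. p parameters.
    design = np.vander(X, p, increasing=True)
    r2_by_p.append(r_squared(design, Y))
print(np.round(r2_by_p, 4))  # rises from 0 (intercept only) to 1 (p = n)

# Case 2: repeat runs (each X value observed twice).
X_rep = np.array([0.0, 0.0, 1.0, 1.0, 2.0, 2.0])
Y_rep = np.array([1.0, 1.4, 2.1, 2.5, 2.9, 3.5])

# Pure-error SS: variation of Y within each group of repeated X values.
ss_pe = sum(
    np.sum((Y_rep[X_rep == v] - Y_rep[X_rep == v].mean()) ** 2)
    for v in np.unique(X_rep)
)
s_yy = np.sum((Y_rep - Y_rep.mean()) ** 2)
ceiling = 1.0 - ss_pe / s_yy

# A quadratic is saturated for three distinct X levels: it fits the group means.
r2_sat = r_squared(np.vander(X_rep, 3, increasing=True), Y_rep)
print(r2_sat, ceiling)  # equal, and strictly below 1
```

In the first case $R^2$ reaches 1 even though $Y$ is unrelated to $X$, which is why a rise in $R^2$ alone is weak evidence for a new term. In the second case the saturated model only reaches $1 - \text{SS(pure error)}/S_{YY}$.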

\begin{align*}
R^2 &= \frac{\text {SS due to regression given}\, b_0}{\text{Total SS corrected for mean} \, \bar{Y}} \\
&= \frac{SS \, (b_1 | b_0)}{S_{YY}} \\
&= \frac{\sum(\hat{Y}_i-\bar{Y})^2} {\sum(Y_i-\bar{Y})^2} \\
&= \frac{S^2_{XY}}{(S_{XX})(S_{YY})}
\end{align*}

where the summations are over $i=1,2,\cdots, n$.

### Interpreting R-Square

$R^2$ does not indicate whether:

• the independent variables (explanatory variables) are a cause of the changes in the dependent variable;
• omitted-variable bias exists;
• the correct regression was used;
• the most appropriate set of explanatory variables has been selected;
• there is collinearity (or multicollinearity) present in the data;
• the model might be improved using transformed versions of the existing explanatory variables.