# Basic Statistics and Data Analysis

## Why Do Correlation Coefficient Values Lie Between +1 and -1?

We know that the ratio of the explained variation to the total variation is called the coefficient of determination. Since this ratio is non-negative, it is denoted by $r^2$, thus

\begin{align*}
r^2&=\frac{\text{Explained Variation}}{\text{Total Variation}}\\
&=\frac{\sum (\hat{Y}-\overline{Y})^2}{\sum (Y-\overline{Y})^2}
\end{align*}

It can be seen that if the total variation is all explained, the ratio $r^2$ (coefficient of determination) is one, and if the total variation is all unexplained, the explained variation is zero and the ratio $r^2$ is zero.

The square root of the coefficient of determination is called the correlation coefficient, given by

\begin{align*}
r&=\sqrt{ \frac{\text{Explained Variation}}{\text{Total Variation}} }\\
&=\pm \sqrt{\frac{\sum (\hat{Y}-\overline{Y})^2}{\sum (Y-\overline{Y})^2}}
\end{align*}

and

$\sum (\hat{Y}-\overline{Y})^2=\sum(Y-\overline{Y})^2-\sum (Y-\hat{Y})^2$

therefore

\begin{align*}
r&=\sqrt{ \frac{\sum(Y-\overline{Y})^2-\sum (Y-\hat{Y})^2} {\sum(Y-\overline{Y})^2} }\\
&=\sqrt{1-\frac{\sum (Y-\hat{Y})^2}{\sum(Y-\overline{Y})^2}}\\
&=\sqrt{1-\frac{\text{Unexplained Variation}}{\text{Total Variation}}}=\sqrt{1-\frac{s_{y.x}^2}{s_y^2}}
\end{align*}

where $s_{y.x}^2=\frac{1}{n} \sum (Y-\hat{Y})^2$ and $s_y^2=\frac{1}{n} \sum (Y-\overline{Y})^2$

\begin{align*}
\Rightarrow r^2&=1-\frac{s_{y.x}^2}{s_y^2}\\
\Rightarrow s_{y.x}^2&=s_y^2(1-r^2)
\end{align*}
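As a quick numerical sketch of this derivation (using NumPy and made-up data), we can fit a least-squares line and confirm both the decomposition of the total variation and the identity $r^2 = 1 - s_{y.x}^2/s_y^2$:

```python
import numpy as np

# Hypothetical data; fit a least-squares line Y_hat = b0 + b1*X
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

b1 = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
b0 = Y.mean() - b1 * X.mean()
Y_hat = b0 + b1 * X

total = np.sum((Y - Y.mean()) ** 2)          # total variation
explained = np.sum((Y_hat - Y.mean()) ** 2)  # explained variation
unexplained = np.sum((Y - Y_hat) ** 2)       # unexplained variation

# For a least-squares fit: total = explained + unexplained
assert np.isclose(total, explained + unexplained)

# r^2 from the explained-variation ratio equals 1 - unexplained/total
r2 = explained / total
assert np.isclose(r2, 1 - unexplained / total)
```

The decomposition in the middle holds exactly for least-squares fits, which is what makes the two expressions for $r^2$ interchangeable.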

Since variances are non-negative,

$\frac{s_{y.x}^2}{s_y^2}=1-r^2 \geq 0$

Solving this inequality, we have

\begin{align*}
1-r^2 & \geq 0\\
\Rightarrow\; & r^2 \leq 1 \quad \text{or} \quad |r| \leq 1\\
\Rightarrow\; & -1 \leq r \leq 1
\end{align*}
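An empirical sanity check of this bound (a sketch using NumPy with randomly generated pairs, not part of the proof): the sample correlation never escapes $[-1, 1]$ regardless of the data.

```python
import numpy as np

# Check |r| <= 1 on many arbitrary paired samples
rng = np.random.default_rng(0)
for _ in range(1000):
    x = rng.normal(size=20)
    y = rng.normal(size=20)
    r = np.corrcoef(x, y)[0, 1]  # sample correlation coefficient
    assert -1.0 <= r <= 1.0
```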

## Alternative Proof

Since $\rho(X,Y)=\rho(X^*,Y^*)$ where $X^*=\frac{X-\mu_X}{\sigma_X}$ and $Y^*=\frac{Y-\mu_Y}{\sigma_Y}$,

and as covariance is bilinear and $X^*$, $Y^*$ have zero mean and unit variance, therefore

\begin{align*}
\rho(X^*,Y^*)&=Cov(X^*,Y^*)=Cov\{\frac{X-\mu_X}{\sigma_X},\frac{Y-\mu_Y}{\sigma_Y}\}\\
&=\frac{Cov(X-\mu_X,Y-\mu_Y)}{\sigma_X\sigma_Y}\\
&=\frac{Cov(X,Y)}{\sigma_X \sigma_Y}=\rho(X,Y)
\end{align*}
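This identity is easy to verify numerically. A minimal sketch, assuming simulated data and using sample moments in place of the population quantities: the plain covariance of the standardized variables equals the correlation of the originals.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(5.0, 2.0, size=10_000)
Y = 0.5 * X + rng.normal(size=10_000)  # Y correlated with X

# Standardize: X* = (X - mu_X)/sigma_X, Y* = (Y - mu_Y)/sigma_Y
Xs = (X - X.mean()) / X.std()
Ys = (Y - Y.mean()) / Y.std()

# Cov(X*, Y*) equals rho(X, Y)
cov_std = np.mean(Xs * Ys)        # covariance of the standardized pair
rho = np.corrcoef(X, Y)[0, 1]     # correlation of the original pair
assert np.isclose(cov_std, rho)
```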

We also know that the variance of any random variable is non-negative; it is zero (i.e., $Var(X)=0$) if and only if $X$ is a constant (almost surely). Therefore

$V(X^* \pm Y^*)=V(X^*)+V(Y^*)\pm2Cov(X^*,Y^*)$

As $Var(X^*)=1$ and $Var(Y^*)=1$, the right-hand side would be negative if $Cov(X^*,Y^*)$ were either greater than 1 or less than $-1$, which is impossible for a variance. Hence $-1 \leq \rho(X,Y)=\rho(X^*,Y^*) \leq 1$.

If $\rho(X,Y )=Cov(X^*,Y^*)=1$ then $Var(X^*- Y ^*)=0$ making X* =Y* almost surely. Similarly, if $\rho(X,Y )=Cov(X^*,Y^*)=-1$ then X*=−Y* almost surely. In either case, Y would be a linear function of X almost surely.

We can see that correlation coefficient values lie between $-1$ and $+1$.

## Coefficient of Determination: A Model Selection Criterion

$R^2$, pronounced R-squared (the coefficient of determination), is a useful statistic for assessing the quality of a regression fit. $R^2$ measures the proportion of the total variation about the mean $\bar{Y}$ that is explained by the regression. $R$ is the correlation between $Y$ and $\hat{Y}$ and is usually called the multiple correlation coefficient. The coefficient of determination ($R^2$) can take values as high as 1 (or 100%) when all the $X$ values are different, i.e. $0\le R^2\le 1$. When repeat runs exist in the data, the value of $R^2$ cannot attain 1 no matter how well the model fits, because no model can explain the variation in the data due to pure error. For a perfect fit to the data, in which $\hat{Y}_i=Y_i$ for every observation, $R^2=1$. If $\hat{Y}_i=\bar{Y}$, that is, if $\beta_1=\beta_2=\cdots=\beta_{p-1}=0$, or if the model $Y=\beta_0 +\varepsilon$ alone has been fitted, then $R^2=0$. Therefore, we can say that $R^2$ is a measure of the usefulness of the terms in the model other than $\beta_0$.
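The two extreme cases can be sketched directly. In this illustration (with made-up data and a small hypothetical helper `r_squared`), a perfect fit gives $R^2=1$ and the intercept-only model $\hat{Y}_i=\bar{Y}$ gives $R^2=0$:

```python
import numpy as np

Y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

def r_squared(Y, Y_hat):
    # R^2 = 1 - (unexplained variation) / (total variation)
    return 1 - np.sum((Y - Y_hat) ** 2) / np.sum((Y - Y.mean()) ** 2)

# Perfect fit: predictions equal observations -> R^2 = 1
r2_perfect = r_squared(Y, Y.copy())
assert np.isclose(r2_perfect, 1.0)

# Intercept-only model: every prediction is mean(Y) -> R^2 = 0
r2_mean = r_squared(Y, np.full_like(Y, Y.mean()))
assert np.isclose(r2_mean, 0.0)
```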

Note that we must make sure that an improvement (increase) in the $R^2$ value due to adding a new term (variable) to the model under study has some real significance and is not merely due to the number of parameters in the model getting close to the saturation point. If there is no pure error, $R^2$ can be made unity.

\begin{align*}
R^2 &= \frac{\text{SS due to regression given}\, b_0}{\text{Total SS corrected for mean} \, \bar{Y}} \\
&= \frac{SS \, (b_1 | b_0)}{S_{YY}} \\
&= \frac{\sum(\hat{Y}_i-\bar{Y})^2} {\sum(Y_i-\bar{Y})^2} \\
&= \frac{S_{XY}^2}{S_{XX}\,S_{YY}}
\end{align*}

where the summations are over $i=1,2,\ldots,n$.
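These expressions are all equivalent for simple linear regression, which a short numerical sketch (NumPy, made-up data) can confirm: the sum-of-squares ratio, the $S_{XY}^2/(S_{XX}S_{YY})$ form, and the squared correlation between $Y$ and $\hat{Y}$ all agree.

```python
import numpy as np

# Hypothetical data; compare equivalent expressions for R^2
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
Y = np.array([1.2, 1.9, 3.1, 4.2, 4.8, 6.1])

Sxx = np.sum((X - X.mean()) ** 2)
Syy = np.sum((Y - Y.mean()) ** 2)
Sxy = np.sum((X - X.mean()) * (Y - Y.mean()))

b1 = Sxy / Sxx
Y_hat = Y.mean() + b1 * (X - X.mean())  # least-squares fitted values

R2_ss = np.sum((Y_hat - Y.mean()) ** 2) / Syy  # SS-ratio form
R2_s = Sxy ** 2 / (Sxx * Syy)                  # S_XY^2 / (S_XX S_YY)
R2_corr = np.corrcoef(Y, Y_hat)[0, 1] ** 2     # squared corr(Y, Y_hat)

assert np.isclose(R2_ss, R2_s) and np.isclose(R2_ss, R2_corr)
```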

## When Interpreting R-Squared, Note That $R^2$ Does Not Indicate Whether:

- the independent variables (explanatory variables) are a cause of the changes in the dependent variable;
- omitted-variable bias exists;
- the correct regression was used;
- the most appropriate set of explanatory variables has been selected;
- there is collinearity (or multicollinearity) present in the data;
- the model might be improved by using transformed versions of the existing set of explanatory variables.