# Category: Pearson’s Correlation Coefficient

## Coefficient of Determination

Coefficient of Determination as a Link between Regression and Correlation Analysis

The R squared ($r^2$; the square of the correlation coefficient) shows the percentage of the total variation of the dependent variable ($Y$) that can be explained by the independent (explanatory) variable ($X$). For this reason, $r^2$ (r-squared) is sometimes called the coefficient of determination.

Since

$r=\frac{\sum x_i y_y}{\sqrt{\sum x_i^2} \sqrt{\sum y_i^2}},$

then

\begin{align*}
r^2&=\frac{(\sum x_iy_i)^2}{(\sum x_i^2)(\sum y_i^2)}=\frac{\sum \hat{y}^2}{\sum y^2}\\
&=\frac{\text{Explained Variation}}{\text{Total Variation}}
\end{align*}

where $r$ shows the degree of covariability of $X$ and $Y$. Note that in the formula used here is in deviation form, that is, $x=X-\mu$ and $y=Y-\mu$.

The link of $r^2$ between regression and correlation analysis can be considered from these points.

• If all the observations lie on the regression line then there will be no scattered of points. In other words, the total variation of variable $Y$ is explained completely by the estimated regression line, which shows that there would be no scatterness in the data points(or no unexplained variation). That is
$\frac{\sum e^2}{\sum y^2}=\frac{\text{Unexplained Variation}}{\text{Total Variation}}=0$
Hence, $r^2=r=1$.
• If the regression line explains only part of the variation in variable $Y$ then there will be some explained variation, that is,
$\frac{\sum e^2}{\sum y^2}=\frac{\text{Unexplained Variation}}{\text{Total Variation}}>0$
then, $r^2$ will be smaller than 1.
• If the regression line does not explain any part of the variation of variable $Y$, that is,
$\frac{\sum e^2}{\sum y^2}=\frac{\text{Unexplained Variation}}{\text{Total Variation}}=1\Rightarrow=\sum y^2 = \sum e^2$
then, $r^2=0$.

Because $r^2=1-\frac{\text{unexlained variation}}{\text{total variation}}$

## Correlation Coeficient values lies between +1 and -1?

We know that the ratio of the explained variation to the total variation is called the coefficient of determination which is the square of the correlation coefficient. This ratio is non-negative, therefore denoted by $r^2$, thus

\begin{align*}
r^2&=\frac{\text{Explained Variation}}{\text{Total Variation}}\\
&=\frac{\sum (\hat{Y}-\overline{Y})^2}{\sum (Y-\overline{Y})^2}
\end{align*}

It can be seen that if the total variation is all explained, the ratio $r^2$ (Coefficient of Determination) is one and if the total variation is all unexplained then the explained variation and the ratio $r^2$ is zero.

The square root of the coefficient of determination is called the correlation coefficient, given by

\begin{align*}
r&=\sqrt{ \frac{\text{Explained Variation}}{\text{Total Variation}} }\\
&=\pm \sqrt{\frac{\sum (\hat{Y}-\overline{Y})^2}{\sum (Y-\overline{Y})^2}}
\end{align*}

and

$\sum (\hat{Y}-\overline{Y})^2=\sum(Y-\overline{Y})^2-\sum (Y-\hat{Y})^2$

therefore

\begin{align*}
r&=\sqrt{ \frac{\sum(Y-\overline{Y})^2-\sum (Y-\hat{Y})^2} {\sum(Y-\overline{Y})^2} }\\
&=\sqrt{1-\frac{\sum (Y-\hat{Y})^2}{\sum(Y-\overline{Y})^2}}\\
&=\sqrt{1-\frac{\text{Unexplained Variation}}{\text{Total Variation}}}=\sqrt{1-\frac{S_{y.x}^2}{s_y^2}}
\end{align*}

where $s_{y.x}^2=\frac{1}{n} \sum (Y-\hat{Y})^2$ and $s_y^2=\frac{1}{n} \sum (Y-\overline{Y})^2$

\begin{align*}
\Rightarrow r^2&=1-\frac{s_{y.x}^2}{s_y^2}\\
\Rightarrow s_{y.x}^2&=s_y^2(1-r^2)
\end{align*}

Since variances are non-negative

$\frac{s_{y.x}^2}{s_y^2}=1-r^2 \geq 0$

Solving for inequality we have

\begin{align*}
1-r^2 & \geq 0\\
\Rightarrow r^2 \leq 1\, \text{or}\, |r| &\leq 1\\
\Rightarrow & -1 \leq r\leq 1
\end{align*}

## Alternative Proof

Since $\rho(X,Y)=\rho(X^*,Y^*)$ where $X^*=\frac{X-\mu_X}{\sigma_X}$ and $Y^*=\frac{Y-Y^*}{\sigma_Y}$

and as covariance is bi-linear and X* ,Y* have zero mean and variance 1, therefore

\begin{align*}
\rho(X^*,Y^*)&=Cov(X^*,Y^*)=Cov\{\frac{X-\mu_X}{\sigma_X},\frac{Y-\mu_Y}{\sigma_Y}\}\\
&=\frac{Cov(X-\mu_X,Y-\mu_Y)}{\sigma_X\sigma_Y}\\
&=\frac{Cov(X,Y)}{\sigma_X \sigma_Y}=\rho(X,Y)
\end{align*}

We also know that the variance of any random variable is ≥0, it could be zero i.e .(Var(X)=0) if and only if X is a constant (almost surely), therefore

$V(X^* \pm Y^*)=V(X^*)+V(Y^*)\pm2Cov(X^*,Y^*)$

As Var(X*)=1 and Var(Y*)=1, the above equation would be negative if $Cov(X^*,Y^*)$ is either greater than 1 or less than -1. Hence $1\geq \rho(X,Y)=\rho(X^*,Y^*)\geq -1$.

If $\rho(X,Y )=Cov(X^*,Y^*)=1$ then $Var(X^*- Y ^*)=0$ making X* =Y* almost surely. Similarly, if $\rho(X,Y )=Cov(X^*,Y^*)=-1$ then X*=−Y* almost surely. In either case, Y would be a linear function of X almost surely.

We can see that the Correlation Coefficient values lie between -1 and +1.

## Pearson Correlation Coefficient use, Interpretation, Properties

The correlation coefficient or Pearson’s Correlation Coefficient was originated by Karl Pearson in the 1900s. The Pearson’s Correlation Coefficient is a measure of the (degree of) strength of the linear relationship between two continuous random variables denote by $\rho_{XY}$ for population and for sample it is denoted by $r_{XY}$.

The Correlation coefficient can take values that occur in the interval [1,-1]. If the coefficient value is 1 or -1, there will be a perfect linear relationship between the variables. A positive sign with a coefficient value shows a positive (direct, or supportive), while a negative sign with a coefficient value shows the negative (indirect, opposite) relationship between the variables. The zero-value implies the absence of a linear relation and it also shows that variables are independent. Zero value also shows that there may be some other sort of relationship between the variables of interest such as a systematic or circular relationship between the variables.

Mathematically, if two random variables such as $X$ and $Y$ follow an unknown joint distribution then the simple linear correlation coefficient is equal to covariance between $X$ and $Y$ divided by the product of their standard deviations i.e

$\rho=\frac{Cov(X, Y)}{\sigma_X \sigma_Y}$

where $Cov(X, Y)$ is a measure of covariance between $X$ and $Y$, $\sigma_X$ and $\sigma_Y$ are the respective standard deviation of the random variables.

For a sample of size $n$, $(X_1, Y_1),(X_2, Y_2),\cdots,(X_n, Y_n)$ from the joint distribution, the quantity given bellow is an estimate of $\rho$, called sampling correlation and denoted by r.

\begin{eqnarray*}
r&=&\frac{\sum_{i=1}^{n}(X_i-\bar{X})(Y_i-\bar{Y})}{\sqrt{\sum_{i=1}^{n}(X_i-\bar{X})^2 \times \sum_{i=1}^{n}(Y_i-\bar{Y})^2}}\\
&=& \frac{Cov(X,Y)}{S_X  X_Y}
\end{eqnarray*}

Note that

• The existence of a statistical correlation does not mean that there exists a cause and effect relation between the variables. Cause and effect mean that change in one variable does cause a change in the other variable.
• The changes in the variables may be due to a common cause or random variations.
• There are many kinds of correlation coefficients. The choice of which to use for a particular set of data depends on different factors such as
• Type of Scale (Measurement Scale) used to express the variables
• Nature of the underlying distribution (continuous or discrete)
• Characteristics of the distribution of the scores (linear or non-linear)
• Correlation is perfectly linear if a constant change in $X$ is accompanied by a constant change in $Y$. In this case, all the points in the scatter diagram will lie in a straight line.
• A high correlation coefficient does not necessarily imply a direct dependence of the variables. For example, there may be a high correlation between the number of crimes and shoe prices. Such a kind of correlation referred to as non-sense or spurious correlations.

## Properties of the Correlation Coefficient

1. The correlation coefficient is symmetrical with respect to $X$ and $Y$ i.e. $r_{XY}=r_{YX}$.
2. The Correlation coefficient is a pure number and it does not depend upon the units in which the variables are measure.
3. The correlation coefficient is the geometric mean of the two regression coefficients. Thus if the two regression lines of $Y$ on $X$ and $X$ on $Y$ are written as $Y=a+bX$ and $X=c+dy$ respectively then $bd=r^2$.
4. The correlation coefficient is independent of the choice of origin and scale of measurement of the variables, i.e. $r$ remains unchanged if constants are added to or subtracted from the variables and if the variables having the same size are multiplied or divided by the class interval size.
5. The correlation coefficient lies between -1 and +1, symbolically $-1\le r \le 1$.

# High Correlation does not Indicate Cause and Effect

The correlation is a measure of the co-variability of variables. It is used to measure the strength between two quantitative variables. It also tells the direction of a relationship between the variables. The positive value of the correlation coefficient indicates that there is a direct (supportive or positive) relationship between the variables while the negative value indicates there is negative (opposite or indirect) relationship between the variables.

By definition, the correlation is interdependence between two quantitative variables. The causation (known as) cause and effect, is when an observed event or action appears to have caused a second event or action. Therefore, It does not necessarily imply any functional relationship between variables concerned. Correlation theory does not establish any causal relationship between the variables as it is interdependence between the variables. Knowledge of the value of the coefficient of correlation r alone will not enable us to predict the value of Y from X.

Sometimes there is the high correlation between unrelated variable such as the number of births and numbers of murders in a country. This is a spurious correlation.

For example, suppose there is a positive correlation between watching violent movies and violent behavior in adolescence. The cause of both these could be a third variable (extraneous variable) say, growing up in a violent environment which causes the adolescence to watch violence related movies and to have violent behavior.

Other Examples

• The number of absences from class lecture decreases the grades.
• As the weather gets colder, air conditioning costs decrease.
• As the speed of the train (car, bus, or any other vehicle) is increased the length of time to get to the final point will also decrease.
• As the age of a chicken increases the number of eggs it produces also decreases.