# Tagged: correlation

## Correlation Coefficient Values Lie between +1 and -1

We know that the ratio of the explained variation to the total variation is called the coefficient of determination, which is the square of the correlation coefficient. This ratio is non-negative and is therefore denoted by $r^2$; thus

\begin{align*}
r^2&=\frac{\text{Explained Variation}}{\text{Total Variation}}\\
&=\frac{\sum (\hat{Y}-\overline{Y})^2}{\sum (Y-\overline{Y})^2}
\end{align*}

It can be seen that if the total variation is all explained, the ratio $r^2$ (the coefficient of determination) is one, and if the total variation is all unexplained, then the explained variation and the ratio $r^2$ are both zero.

The square root of the coefficient of determination is called the correlation coefficient, given by

\begin{align*}
r&=\pm\sqrt{ \frac{\text{Explained Variation}}{\text{Total Variation}} }\\
&=\pm \sqrt{\frac{\sum (\hat{Y}-\overline{Y})^2}{\sum (Y-\overline{Y})^2}}
\end{align*}

and

$\sum (\hat{Y}-\overline{Y})^2=\sum(Y-\overline{Y})^2-\sum (Y-\hat{Y})^2$

therefore

\begin{align*}
r&=\pm\sqrt{ \frac{\sum(Y-\overline{Y})^2-\sum (Y-\hat{Y})^2} {\sum(Y-\overline{Y})^2} }\\
&=\pm\sqrt{1-\frac{\sum (Y-\hat{Y})^2}{\sum(Y-\overline{Y})^2}}\\
&=\pm\sqrt{1-\frac{\text{Unexplained Variation}}{\text{Total Variation}}}=\pm\sqrt{1-\frac{s_{y.x}^2}{s_y^2}}
\end{align*}

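where $s_{y.x}^2=\frac{1}{n} \sum (Y-\hat{Y})^2$ and $s_y^2=\frac{1}{n} \sum (Y-\overline{Y})^2$ are the residual variance and the variance of $Y$, respectively. Dividing both sums by $n$ leaves their ratio unchanged.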

\begin{align*}
\Rightarrow r^2&=1-\frac{s_{y.x}^2}{s_y^2}\\
\Rightarrow s_{y.x}^2&=s_y^2(1-r^2)
\end{align*}

Since variances are non-negative,

$\frac{s_{y.x}^2}{s_y^2}=1-r^2 \geq 0$

Solving the inequality, we have

\begin{align*}
1-r^2 & \geq 0\\
\Rightarrow r^2 \leq 1\, \text{or}\, |r| &\leq 1\\
\Rightarrow & -1 \leq r\leq 1
\end{align*}
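The decomposition used in this proof can be checked numerically. The following is a minimal sketch, assuming a small made-up data set and a simple least-squares line; the data values and variable names are illustrative and not part of the original derivation.

```python
from math import sqrt

# Hypothetical illustrative data (not from the original text).
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Least-squares slope b and intercept a of the line Y-hat = a + b X.
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
b = s_xy / s_xx
a = y_bar - b * x_bar
y_hat = [a + b * xi for xi in x]

# Total, explained, and unexplained variation.
total = sum((yi - y_bar) ** 2 for yi in y)
explained = sum((yh - y_bar) ** 2 for yh in y_hat)
unexplained = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))

# Coefficient of determination and correlation coefficient.
r_squared = explained / total
r = (1 if b >= 0 else -1) * sqrt(r_squared)  # sign taken from the slope

print(f"total = {total:.4f}")
print(f"explained + unexplained = {explained + unexplained:.4f}")
print(f"r^2 = {r_squared:.4f}, r = {r:.4f}")
```

Because the unexplained variation is a sum of squares, it cannot be negative, so the explained variation never exceeds the total and $r^2$ stays within $[0, 1]$.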

## Alternative Proof

Since $\rho(X,Y)=\rho(X^*,Y^*)$, where $X^*=\frac{X-\mu_X}{\sigma_X}$ and $Y^*=\frac{Y-\mu_Y}{\sigma_Y}$,

and as covariance is bilinear and $X^*$, $Y^*$ have zero mean and unit variance, therefore

\begin{align*}
\rho(X^*,Y^*)&=Cov(X^*,Y^*)=Cov\{\frac{X-\mu_X}{\sigma_X},\frac{Y-\mu_Y}{\sigma_Y}\}\\
&=\frac{Cov(X-\mu_X,Y-\mu_Y)}{\sigma_X\sigma_Y}\\
&=\frac{Cov(X,Y)}{\sigma_X \sigma_Y}=\rho(X,Y)
\end{align*}

We also know that the variance of any random variable is non-negative; it is zero, i.e. $Var(X)=0$, if and only if $X$ is a constant (almost surely). Therefore

$V(X^* \pm Y^*)=V(X^*)+V(Y^*)\pm2Cov(X^*,Y^*) \geq 0$

As $Var(X^*)=1$ and $Var(Y^*)=1$, the right-hand side equals $2 \pm 2Cov(X^*,Y^*)$, which would be negative if $Cov(X^*,Y^*)$ were either greater than 1 or less than -1. Hence $1\geq \rho(X,Y)=\rho(X^*,Y^*)\geq -1$.

If $\rho(X,Y)=Cov(X^*,Y^*)=1$, then $Var(X^*-Y^*)=0$, making $X^*=Y^*$ almost surely. Similarly, if $\rho(X,Y)=Cov(X^*,Y^*)=-1$, then $Var(X^*+Y^*)=0$ and $X^*=-Y^*$ almost surely. In either case, $Y$ is a linear function of $X$ almost surely.

We can see that the Correlation Coefficient values lie between -1 and +1.
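The variance identity behind the alternative proof can be illustrated with simulated data. This is only a sketch: the sample, the seed, and the coefficient 0.6 are arbitrary assumptions, and sample moments (divisor $n$) stand in for the population quantities.

```python
import random
from math import sqrt

random.seed(0)
n = 1000
# Hypothetical simulated sample; the 0.6 coefficient is arbitrary.
x = [random.gauss(0, 1) for _ in range(n)]
y = [0.6 * xi + random.gauss(0, 1) for xi in x]


def mean(v):
    return sum(v) / len(v)


def var(v):
    """Variance with divisor n."""
    m = mean(v)
    return sum((vi - m) ** 2 for vi in v) / len(v)


def standardize(v):
    """Return (v - mean) / sd, which has zero mean and unit variance."""
    m, s = mean(v), sqrt(var(v))
    return [(vi - m) / s for vi in v]


xs, ys = standardize(x), standardize(y)

# Cov(X*, Y*) equals the correlation, since both means are zero.
rho = mean([a * b for a, b in zip(xs, ys)])

var_sum = var([a + b for a, b in zip(xs, ys)])   # V(X* + Y*) = 2 + 2 rho
var_diff = var([a - b for a, b in zip(xs, ys)])  # V(X* - Y*) = 2 - 2 rho

print(f"rho = {rho:.4f}")
print(f"V(X*+Y*) = {var_sum:.4f}, 2 + 2 rho = {2 + 2 * rho:.4f}")
print(f"V(X*-Y*) = {var_diff:.4f}, 2 - 2 rho = {2 - 2 * rho:.4f}")
```

Both variances are non-negative, which is only possible when `rho` lies between -1 and +1.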

## Descriptive Statistics of a Multivariate Data Set

Much of the information contained in the data can be assessed by calculating certain summary numbers, known as descriptive statistics, such as the arithmetic mean (a measure of location) and the average of the squared distances of all the numbers from the mean (a measure of spread or variation). Here we discuss descriptive statistics for a multivariate data set.

We shall rely most heavily on descriptive statistics that measure location, variation, and linear association. For a multivariate data set, let us start with a measure of location, a measure of spread, the sample covariance, and the sample correlation coefficient.

### Measure of Location

The arithmetic average of the $n$ measurements $(x_{11}, x_{21}, \cdots, x_{n1})$ on the first variable (defined in Multivariate Analysis: An Introduction) is

Sample Mean = $\bar{x}_{1}=\frac{1}{n} \sum _{j=1}^{n}x_{j1} \mbox{ where } j =1, 2,3,\cdots , n$

The sample mean for the $n$ measurements on each of the $p$ variables (there will be $p$ sample means) is

$\bar{x}_{k} =\frac{1}{n} \sum _{j=1}^{n}x_{jk} \mbox{ where } k = 1, 2, \cdots , p$

The measure of spread (variance) for the $n$ measurements on the first variable can be found as
$s_{1}^{2} =\frac{1}{n} \sum _{j=1}^{n}(x_{j1} -\bar{x}_{1} )^{2}$, where $\bar{x}_{1}$ is the sample mean of the $x_{j1}$'s.

The measure of spread (variance) for the $n$ measurements on each of the $p$ variables can be found as

$s_{k}^{2} =\frac{1}{n} \sum _{j=1}^{n}(x_{jk} -\bar{x}_{k} )^{2} \mbox{ where } k=1,2,\cdots ,p \mbox{ and } j=1,2,\cdots ,n$

The sample variance $s_{k}^{2}$ is also written as $s_{kk}$, and its square root $\sqrt{s_{kk}}$ is the sample standard deviation, i.e.

$s_{k}^{2} =s_{kk} =\frac{1}{n} \sum _{j=1}^{n}(x_{jk} -\bar{x}_{k} )^{2} \mbox{ where } k=1,2,\cdots ,p$
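A short sketch of these formulas follows, assuming a made-up data matrix with $n=4$ rows (items) and $p=2$ columns (variables). The values are hypothetical, and the divisor is $n$ as in the formulas above.

```python
from math import sqrt

# Hypothetical data matrix: each row is one item, each column one variable.
X = [
    [42.0, 4.0],
    [52.0, 5.0],
    [48.0, 4.0],
    [58.0, 3.0],
]
n = len(X)       # number of measurements
p = len(X[0])    # number of variables

# Sample mean of each variable: x_bar_k = (1/n) * sum_j x_jk
means = [sum(row[k] for row in X) / n for k in range(p)]

# Sample variance of each variable with divisor n: s_kk
variances = [sum((row[k] - means[k]) ** 2 for row in X) / n for k in range(p)]

# Sample standard deviation: sqrt(s_kk)
std_devs = [sqrt(v) for v in variances]

print("means:", means)
print("variances:", variances)
print("standard deviations:", std_devs)
```

For this data the means are 50 and 4, and the variances are 34 and 0.5. Many statistical packages default to the divisor $n-1$ instead, so results may differ slightly from the divisor-$n$ values used here.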

### Sample Covariance

Consider $n$ pairs of measurements on Variable 1 and Variable 2
$\left[\begin{array}{c} {x_{11} } \\ {x_{12} } \end{array}\right],\left[\begin{array}{c} {x_{21} } \\ {x_{22} } \end{array}\right],\cdots ,\left[\begin{array}{c} {x_{n1} } \\ {x_{n2} } \end{array}\right]$
That is, $x_{j1}$ and $x_{j2}$ are observed on the $j$th experimental item $(j=1,2,\cdots ,n)$. A measure of linear association between the measurements of $V_1$ and $V_2$ is provided by the sample covariance
$s_{12} =\frac{1}{n} \sum _{j=1}^{n}(x_{j1} -\bar{x}_{1} )(x_{j2} -\bar{x}_{2} )$
(the average product of the deviations from their respective means). In general,

$s_{ik} =\frac{1}{n} \sum _{j=1}^{n}(x_{ji} -\bar{x}_{i} )(x_{jk} -\bar{x}_{k} )$; $i=1,2,\cdots,p$ and $k=1,2,\cdots,p$.

It measures the linear association between the $i$th and $k$th variables. When $i=k$, the sample covariance reduces to the sample variance $s_{kk}$.
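The full set of sample covariances $s_{ik}$ can be arranged in a $p \times p$ matrix. Below is a minimal sketch, assuming a hypothetical data matrix and a helper named `covariance_matrix` introduced here only for illustration.

```python
def covariance_matrix(X):
    """Return the p x p matrix of sample covariances s_ik (divisor n).

    X is a list of n rows, each holding p measurements.
    """
    n = len(X)
    p = len(X[0])
    means = [sum(row[k] for row in X) / n for k in range(p)]
    return [
        [
            sum((row[i] - means[i]) * (row[k] - means[k]) for row in X) / n
            for k in range(p)
        ]
        for i in range(p)
    ]


# Hypothetical data: 4 items measured on 2 variables.
X = [
    [42.0, 4.0],
    [52.0, 5.0],
    [48.0, 4.0],
    [58.0, 3.0],
]

S = covariance_matrix(X)
for row in S:
    print(row)
```

For this data $s_{12} = s_{21} = -1.5$, while the diagonal entries $s_{11} = 34$ and $s_{22} = 0.5$ are the sample variances.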

Variance is the most commonly used measure of dispersion (variation) in the data and it is directly proportional to the amount of variation or information available in the data.

### Sample Correlation Coefficient

The sample correlation coefficient for the ith and kth variable is

$r_{ik} =\frac{s_{ik} }{\sqrt{s_{ii} } \sqrt{s_{kk} } } =\frac{\sum _{j=1}^{n}(x_{ji} -\bar{x}_{i} )(x_{jk} -\bar{x}_{k} ) }{\sqrt{\sum _{j=1}^{n}(x_{ji} -\bar{x}_{i} )^{2} } \sqrt{\sum _{j=1}^{n}(x_{jk} -\bar{x}_{k} )^{2} } }$
$\mbox{ where } i=1,2,\cdots,p \mbox{ and } k=1,2,\cdots ,p$

Note that $r_{ik} =r_{ki}$ for all $i$ and $k$, and $r$ lies between -1 and +1. $r$ measures the strength of the linear association. If $r=0$, there is no linear association between the components. The sign of $r$ indicates the direction of the association.
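The conversion from covariances to correlations can be sketched as follows, assuming a hypothetical $2 \times 2$ covariance matrix `S` whose entries are chosen only for illustration.

```python
from math import sqrt

# Hypothetical sample covariance matrix (assumed values, divisor n).
S = [
    [34.0, -1.5],
    [-1.5, 0.5],
]
p = len(S)

# r_ik = s_ik / (sqrt(s_ii) * sqrt(s_kk))
R = [
    [S[i][k] / (sqrt(S[i][i]) * sqrt(S[k][k])) for k in range(p)]
    for i in range(p)
]

for row in R:
    print([round(value, 4) for value in row])
```

Here $r_{12}$ is about $-0.36$: a weak negative linear association. The diagonal entries equal 1, and the matrix is symmetric because $r_{ik}=r_{ki}$.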

## High Correlation does not Indicate Cause and Effect

Correlation is a measure of the co-variability of variables. It measures the strength of the relationship between two quantitative variables and also tells the direction of that relationship. A positive value of the correlation coefficient indicates a direct (positive) relationship between the variables, while a negative value indicates an inverse (negative or indirect) relationship.

By definition, correlation is the interdependence between two quantitative variables. Causation (also known as cause and effect) occurs when an observed event or action appears to have caused a second event or action. Correlation therefore does not necessarily imply any functional relationship between the variables concerned. Correlation theory does not establish any causal relationship between the variables, as it only describes their interdependence. Knowledge of the value of the coefficient of correlation $r$ alone will not enable us to predict the value of $Y$ from $X$.

Sometimes there is a high correlation between unrelated variables, such as the number of births and the number of murders in a country. This is a spurious correlation.

For example, suppose there is a positive correlation between watching violent movies and violent behavior in adolescents. The cause of both could be a third variable (an extraneous variable), say, growing up in a violent environment, which causes adolescents both to watch violent movies and to behave violently.
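The third-variable mechanism can be illustrated with a small simulation. This is a sketch under invented assumptions: `z` plays the role of the extraneous variable, `x` and `y` each depend on `z` plus independent noise, and neither influences the other. The coefficients and seed are arbitrary.

```python
import random
from math import sqrt

random.seed(42)
n = 5000

# Hypothetical third variable driving both x and y.
z = [random.gauss(0, 1) for _ in range(n)]
# x and y depend on z only; there is no direct link between them.
x = [0.8 * zi + random.gauss(0, 0.6) for zi in z]
y = [0.8 * zi + random.gauss(0, 0.6) for zi in z]


def pearson(u, v):
    """Sample correlation coefficient between two equal-length lists."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    s_uv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    s_uu = sum((a - mu) ** 2 for a in u)
    s_vv = sum((b - mv) ** 2 for b in v)
    return s_uv / sqrt(s_uu * s_vv)


r_xy = pearson(x, y)
print(f"correlation between x and y: {r_xy:.3f}")
```

Under these assumed coefficients the population correlation between `x` and `y` is 0.64, even though neither variable causes the other.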

Other Examples

• As the number of absences from class lectures increases, grades decrease.
• As the weather gets colder, air conditioning costs decrease.
• As the speed of a train (car, bus, or any other vehicle) increases, the time needed to reach the destination decreases.
• As the age of a chicken increases, the number of eggs it produces decreases.

## Partial Correlation: Another measure of relationship

It measures the relationship between any two variables when all other variables are kept constant, i.e. by controlling for all other variables or removing their influence. The purpose of partial correlation is to find the unique variance shared by two variables while eliminating the variance due to a third variable. The technique of partial correlation is commonly used in “causal” modeling with a small number of variables. The partial correlation coefficient is determined in terms of the simple correlation coefficients among the various variables involved in a multiple relationship. The assumptions for partial correlation are the usual assumptions of Pearson correlation:

1. Linearity of relationships
2. The same level of relationship throughout the range of the independent variable i.e. homoscedasticity
3. Interval or near-interval data, and
4. Data whose range is not truncated.

We typically conduct correlation analysis on all variables to see whether there are significant relationships among them, including any “third variables” that may have a significant relationship to the variables under investigation.

This type of analysis helps to find spurious correlations (i.e. correlations that are explained by the effect of some other variables) as well as to reveal hidden correlations, i.e. correlations masked by the effect of other variables. The partial correlation coefficient $r_{xy.z}$ can also be defined as the correlation coefficient between the residuals $d_x$ and $d_y$ obtained by regressing $x$ and $y$, respectively, on $z$.

Suppose we have a sample of $n$ observations $(x1_1,x2_1,x3_1),(x1_2,x2_2,x3_2),\cdots,(x1_n,x2_n,x3_n)$ from an unknown distribution of three random variables, and we want to find the coefficient of partial correlation between $X_1$ and $X_2$ keeping $X_3$ constant. This coefficient, denoted by $r_{12.3}$, is the correlation between the residuals $x_{1.3}$ and $x_{2.3}$ that remain after regressing $X_1$ and $X_2$ on $X_3$. The coefficient $r_{12.3}$ is a partial correlation of the first order.

$r_{12.3}=\frac{r_{12}-r_{13} r_{23}}{\sqrt{1-r_{13}^2 } \sqrt{1-r_{23}^2 } }$
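The first-order formula translates directly into a small function. The sketch below uses a hypothetical helper name, `partial_corr`, and made-up correlation values chosen only to illustrate the arithmetic.

```python
from math import sqrt


def partial_corr(r12, r13, r23):
    """First-order partial correlation r_12.3 from simple correlations."""
    return (r12 - r13 * r23) / (sqrt(1 - r13 ** 2) * sqrt(1 - r23 ** 2))


# Hypothetical case 1: the correlation between variables 1 and 2 is
# fully accounted for by variable 3 (r12 equals r13 * r23).
print(partial_corr(0.64, 0.8, 0.8))

# Hypothetical case 2: part of the association remains after
# controlling for variable 3.
print(partial_corr(0.5, 0.3, 0.4))
```

In the first case the partial correlation is zero up to floating-point rounding, even though the simple correlation $r_{12}=0.64$ is fairly high. The formula is undefined when $r_{13}$ or $r_{23}$ equals $\pm 1$, since the denominator is then zero.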

The coefficient of partial correlation between $X$ and $Y$ controlling for a third variable $Z$ can be denoted by $r_{xy.z}$ and can also be defined as the coefficient of correlation between the residuals $d_{x,i}=x_i-\hat{x}_i$ and $d_{y,i}=y_i-\hat{y}_i$, with
\begin{align*}
\hat{x}_i&=\hat{\beta}_{0x}+\hat{\beta}_{1x}z_i\\
\hat{y}_i&=\hat{\beta}_{0y}+\hat{\beta}_{1y}z_i
\end{align*}
where $\hat{\beta}_{0x}$ and $\hat{\beta}_{1x}$ are the least squares estimators obtained by regressing $x_i$ on $z_i$, and $\hat{\beta}_{0y}$ and $\hat{\beta}_{1y}$ are the least squares estimators obtained by regressing $y_i$ on $z_i$. Since least squares residuals have zero mean, the partial correlation between $x$ and $y$ controlling for $z$ is, by definition, $r_{xy.z}=\frac{\sum d_{x,i}\,d_{y,i}}{\sqrt{\sum d_{x,i}^2}\sqrt{\sum d_{y,i}^2}}$
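As a rough numerical check of the residual-based definition, the sketch below regresses $x$ and $y$ on $z$, correlates the two sets of residuals, and compares the result with the first-order formula. The simulated data, seed, and coefficients are arbitrary assumptions, and the helper names are introduced only for illustration.

```python
import random
from math import sqrt

random.seed(1)
n = 500
# Hypothetical simulated data with arbitrary coefficients.
z = [random.gauss(0, 1) for _ in range(n)]
x = [0.7 * zi + random.gauss(0, 1) for zi in z]
y = [-0.5 * zi + 0.4 * xi + random.gauss(0, 1) for zi, xi in zip(z, x)]


def mean(v):
    return sum(v) / len(v)


def pearson(u, v):
    """Sample correlation coefficient."""
    mu, mv = mean(u), mean(v)
    s_uv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    s_uu = sum((a - mu) ** 2 for a in u)
    s_vv = sum((b - mv) ** 2 for b in v)
    return s_uv / sqrt(s_uu * s_vv)


def residuals(v, z):
    """Residuals from the least-squares regression of v on z."""
    mv, mz = mean(v), mean(z)
    slope = sum((zi - mz) * (vi - mv) for zi, vi in zip(z, v)) / sum(
        (zi - mz) ** 2 for zi in z
    )
    intercept = mv - slope * mz
    return [vi - (intercept + slope * zi) for vi, zi in zip(v, z)]


# Partial correlation as the correlation between residuals.
r_resid = pearson(residuals(x, z), residuals(y, z))

# Partial correlation from the simple correlations.
r_xy, r_xz, r_yz = pearson(x, y), pearson(x, z), pearson(y, z)
r_formula = (r_xy - r_xz * r_yz) / (sqrt(1 - r_xz ** 2) * sqrt(1 - r_yz ** 2))

print(f"residual-based: {r_resid:.6f}")
print(f"formula-based:  {r_formula:.6f}")
```

The two values agree up to floating-point rounding, since the formula is an algebraic rearrangement of the residual correlation.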

