# Basic Statistics and Data Analysis

## Why do Correlation Coefficient Values Lie Between -1 and +1?

We know that the ratio of the explained variation to the total variation is called the coefficient of determination. This ratio is non-negative and is denoted by $r^2$, thus

\begin{align*}
r^2&=\frac{\text{Explained Variation}}{\text{Total Variation}}\\
&=\frac{\sum (\hat{Y}-\overline{Y})^2}{\sum (Y-\overline{Y})^2}
\end{align*}

It can be seen that if the total variation is all explained, the ratio $r^2$ (the coefficient of determination) is one, and if the total variation is all unexplained, then the explained variation is zero and the ratio $r^2$ is zero.

The square root of the coefficient of determination is called the correlation coefficient, given by

\begin{align*}
r&=\sqrt{ \frac{\text{Explained Variation}}{\text{Total Variation}} }\\
&=\pm \sqrt{\frac{\sum (\hat{Y}-\overline{Y})^2}{\sum (Y-\overline{Y})^2}}
\end{align*}

and

$\sum (\hat{Y}-\overline{Y})^2=\sum(Y-\overline{Y})^2-\sum (Y-\hat{Y})^2$

therefore

\begin{align*}
r&=\sqrt{ \frac{\sum(Y-\overline{Y})^2-\sum (Y-\hat{Y})^2} {\sum(Y-\overline{Y})^2} }\\
&=\sqrt{1-\frac{\sum (Y-\hat{Y})^2}{\sum(Y-\overline{Y})^2}}\\
&=\sqrt{1-\frac{\text{Unexplained Variation}}{\text{Total Variation}}}=\sqrt{1-\frac{s_{y.x}^2}{s_y^2}}
\end{align*}

where $s_{y.x}^2=\frac{1}{n} \sum (Y-\hat{Y})^2$ and $s_y^2=\frac{1}{n} \sum (Y-\overline{Y})^2$

\begin{align*}
\Rightarrow r^2&=1-\frac{s_{y.x}^2}{s_y^2}\\
\Rightarrow s_{y.x}^2&=s_y^2(1-r^2)
\end{align*}

Since variances are non-negative

$\frac{s_{y.x}^2}{s_y^2}=1-r^2 \geq 0$

Solving this inequality we have

\begin{align*}
1-r^2 & \geq 0\\
\Rightarrow r^2 \leq 1\, \text{or}\, |r| &\leq 1\\
\Rightarrow & -1 \leq r\leq 1
\end{align*}
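The derivation above can be checked numerically. Below is a minimal Python sketch (the data values are made up for illustration), fitting a least-squares line and verifying that $r^2$ equals the ratio of explained to total variation:

```python
import numpy as np

# Illustrative data (made-up values)
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
Y = np.array([2.1, 2.9, 4.2, 4.8, 6.1, 6.7])

# Least-squares fit: Y-hat = a + b*X
b, a = np.polyfit(X, Y, 1)
Y_hat = a + b * X
Y_bar = Y.mean()

total = np.sum((Y - Y_bar) ** 2)          # total variation
explained = np.sum((Y_hat - Y_bar) ** 2)  # explained variation
unexplained = np.sum((Y - Y_hat) ** 2)    # unexplained variation

# The decomposition used above: total = explained + unexplained
assert np.isclose(total, explained + unexplained)

r = np.corrcoef(X, Y)[0, 1]
# r^2 equals explained/total, and also 1 - unexplained/total
print(r ** 2, explained / total, 1 - unexplained / total)
```

The decomposition of the total sum of squares into explained and unexplained parts holds exactly for a least-squares line with an intercept, which is why the three printed values agree.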

## Alternative Proof

Since $\rho(X,Y)=\rho(X^*,Y^*)$ where $X^*=\frac{X-\mu_X}{\sigma_X}$ and $Y^*=\frac{Y-\mu_Y}{\sigma_Y}$,

and as covariance is bilinear and $X^*$, $Y^*$ have zero mean and unit variance, therefore

\begin{align*}
\rho(X^*,Y^*)&=Cov(X^*,Y^*)=Cov\{\frac{X-\mu_X}{\sigma_X},\frac{Y-\mu_Y}{\sigma_Y}\}\\
&=\frac{Cov(X-\mu_X,Y-\mu_Y)}{\sigma_X\sigma_Y}\\
&=\frac{Cov(X,Y)}{\sigma_X \sigma_Y}=\rho(X,Y)
\end{align*}
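The standardization step can be illustrated numerically. Below is a sketch (with simulated data, assuming NumPy) showing that the covariance of the standardized variables equals the correlation of the originals:

```python
import numpy as np

rng = np.random.default_rng(0)
# Simulated sample from a correlated bivariate distribution
X = rng.normal(size=10_000)
Y = 0.6 * X + 0.8 * rng.normal(size=10_000)

# Standardize: zero mean, unit standard deviation
Xs = (X - X.mean()) / X.std()
Ys = (Y - Y.mean()) / Y.std()

# Covariance of the standardized variables equals the correlation of the originals
cov_std = np.mean(Xs * Ys)
r = np.corrcoef(X, Y)[0, 1]
print(cov_std, r)  # the two values agree
</imports>```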

We also know that the variance of any random variable is non-negative; it is zero, i.e. $Var(X)=0$, if and only if X is a constant (almost surely). Therefore

$V(X^* \pm Y^*)=V(X^*)+V(Y^*)\pm2Cov(X^*,Y^*)$

As $Var(X^*)=1$ and $Var(Y^*)=1$, the right-hand side equals $2 \pm 2\,Cov(X^*,Y^*)$, which would be negative if $Cov(X^*,Y^*)$ were greater than 1 or less than -1. Since a variance cannot be negative, $1\geq \rho(X,Y)=\rho(X^*,Y^*)\geq -1$.

If $\rho(X,Y )=Cov(X^*,Y^*)=1$ then $Var(X^*- Y ^*)=0$ making X* =Y* almost surely. Similarly, if $\rho(X,Y )=Cov(X^*,Y^*)=-1$ then X*=−Y* almost surely. In either case, Y would be a linear function of X almost surely.

This shows that correlation coefficient values lie between -1 and +1.

# Pearson Correlation Coefficient

The correlation coefficient, or Pearson’s Correlation Coefficient, was developed by Karl Pearson around 1900. The correlation coefficient is a measure of the (degree of) strength of the linear relationship between two continuous random variables, denoted by $\rho_{XY}$ for a population and by $r_{XY}$ for a sample.

The correlation coefficient can take values in the interval $[-1, 1]$. If the coefficient value is 1 or -1, there is a perfect linear relationship between the variables. A positive sign indicates a positive (direct, or supportive) relationship, while a negative sign indicates a negative (indirect, or opposite) relationship between the variables. A zero value implies the absence of a linear relationship, but it does not imply that the variables are independent: there may still be some other sort of relationship between the variables of interest, such as a systematic or circular relation.

Mathematically, if two random variables such as X and Y follow an unknown joint distribution then the simple linear correlation coefficient is equal to covariance between X and Y divided by the product of their standard deviations i.e

$\rho=\frac{Cov(X, Y)}{\sigma_X \sigma_Y}$

where $Cov(X, Y)$ is the covariance between X and Y, and $\sigma_X$ and $\sigma_Y$ are the respective standard deviations of the random variables.

For a sample of size n, $(X_1, Y_1), (X_2, Y_2), \ldots, (X_n, Y_n)$, from the joint distribution, the quantity given below is an estimate of $\rho$, called the sample correlation coefficient and denoted by r.

\begin{eqnarray*}
r&=&\frac{\sum_{i=1}^{n}(X_i-\bar{X})(Y_i-\bar{Y})}{\sqrt{\sum_{i=1}^{n}(X_i-\bar{X})^2 \times \sum_{i=1}^{n}(Y_i-\bar{Y})^2}}\\
&=& \frac{Cov(X,Y)}{S_X S_Y}
\end{eqnarray*}
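The defining formula translates directly into code. Below is a minimal sketch (with made-up data) comparing a hand-rolled implementation against NumPy’s `corrcoef`:

```python
import numpy as np

def pearson_r(x, y):
    """Sample correlation coefficient computed from the defining formula."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    dx, dy = x - x.mean(), y - y.mean()
    return np.sum(dx * dy) / np.sqrt(np.sum(dx ** 2) * np.sum(dy ** 2))

# Illustrative data (made-up values)
x = [2, 4, 5, 7, 8, 10]
y = [3, 5, 4, 8, 9, 12]
print(pearson_r(x, y), np.corrcoef(x, y)[0, 1])  # the two values agree
```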

Note that

• The existence of a statistical correlation does not mean that there exists a cause-and-effect relation between the variables. Cause and effect means that a change in one variable causes a change in the other variable.
• The changes in the variables may be due to a common cause or to random variations.
• There are many kinds of correlation coefficients. The choice of which to use for a particular set of data depends on different factors, such as:
  • the type of scale (measurement scale) used to express the variables,
  • the nature of the underlying distribution (continuous or discrete),
  • the characteristics of the distribution of the scores (linear or non-linear).
• Correlation is perfectly linear if a constant change in X is accompanied by a constant change in Y. In this case all the points in the scatter diagram lie on a straight line.
• A high correlation coefficient does not necessarily imply a direct dependence between the variables. For example, there may be a high correlation between the number of crimes and shoe prices. Such correlations are referred to as non-sense or spurious correlations.

## Properties of the Correlation Coefficient

1. The correlation coefficient is symmetrical with respect to X and Y, i.e. $r_{XY}=r_{YX}$.
2. The correlation coefficient is a pure number and does not depend upon the units in which the variables are measured.
3. The correlation coefficient is the geometric mean of the two regression coefficients. Thus if the two regression lines of Y on X and X on Y are written as $Y=a+bX$ and $X=c+dY$ respectively, then $bd=r^2$.
4. The correlation coefficient is independent of the choice of origin and scale of measurement of the variables, i.e. r remains unchanged if constants are added to or subtracted from the variables, or if the variables are multiplied or divided by positive constants.
5. The correlation coefficient lies between -1 and +1, symbolically $-1\leq r\leq 1$.
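Properties 1, 3, and 4 can be checked numerically. Below is a sketch on simulated data (assuming NumPy; the shift and scale constants are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=500)
Y = 2.0 * X + rng.normal(size=500)

r = np.corrcoef(X, Y)[0, 1]

# Property 1: symmetry, r_XY = r_YX
assert np.isclose(r, np.corrcoef(Y, X)[0, 1])

# Property 4: unchanged by adding constants or scaling by positive constants
assert np.isclose(r, np.corrcoef(10 + 5 * X, 3 + 0.5 * Y)[0, 1])

# Property 3: r^2 equals the product of the two regression slopes
b_yx = np.polyfit(X, Y, 1)[0]  # slope of Y on X
b_xy = np.polyfit(Y, X, 1)[0]  # slope of X on Y
assert np.isclose(r ** 2, b_yx * b_xy)
print(r)
```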

# High Correlation does not Indicate Cause and Effect

The correlation coefficient is a measure of the co-variability of variables. It does not necessarily imply any functional relationship between the variables concerned. Correlation theory does not establish any causal relationship between the variables. Knowledge of the value of the coefficient of correlation r alone will not enable us to predict the value of Y from X.

Sometimes there is a high correlation between unrelated variables, such as the number of births and the number of murders in a country. This is spurious correlation.

For example, suppose there is a positive correlation between watching violent movies and violent behavior in adolescence. The cause of both of these could be a third variable (an extraneous variable), say, growing up in a violent environment, which causes adolescents both to watch violence-related movies and to behave violently.
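This confounding mechanism is easy to simulate. Below is a sketch with a hypothetical confounder Z driving both observed variables (all names and coefficients are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5_000

# Hypothetical confounder Z (e.g. a "violent environment" score)
Z = rng.normal(size=n)

# Both observed variables are driven by Z, not by each other
movies = 0.7 * Z + rng.normal(size=n)    # watching violent movies
behavior = 0.7 * Z + rng.normal(size=n)  # violent behavior

r = np.corrcoef(movies, behavior)[0, 1]
print(r)  # clearly positive even though neither variable causes the other
```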

# Pearson’s Correlation Coefficient SPSS

Pearson’s correlation, or the correlation coefficient, or simply correlation, is used to find the degree of linear relationship between two continuous variables. The absolute value of the correlation coefficient lies between 0.00 (no correlation) and 1.00 (perfect correlation). Generally, correlations above 0.80 in absolute value are considered quite high.

Remember:

1. Correlation is the interdependence of continuous variables; it does not refer to any cause and effect.
2. Correlation is used to determine the linear relationship between variables.
3. Draw a scatter plot before performing/calculating the correlation (to check the assumption of linearity).

## How to Compute the Correlation Coefficient in SPSS

The command for correlation is found at Analyze –> Correlate –> Bivariate.

The Bivariate Correlations dialog box will appear.

Select one of the variables that you want to correlate in the left-hand pane of the Bivariate Correlations dialog box and move it into the Variables pane on the right-hand side by clicking the arrow button. Now click on the other variable that you want to correlate in the left-hand pane and move it into the Variables pane by clicking the arrow button.

## Output

The Correlations table in the output gives the values of the specified correlation tests, such as Pearson’s correlation. Each row of the table corresponds to one of the variables, and similarly each column corresponds to one of the variables.

## Interpreting the Correlation Coefficient

In the example, the cell at the bottom row of the right column represents the correlation of depression with depression, with a correlation equal to 1.0. Likewise, the cell at the middle row of the middle column represents the correlation of anxiety with anxiety, also with a correlation value of 1.0. In both cases this shows that anxiety is perfectly related to anxiety and depression is perfectly related to depression.

The cell at the middle row and right column (or the cell at the bottom row and middle column) is more interesting. This cell represents the correlation of anxiety with depression (or depression with anxiety). There are three numbers in these cells.

1. The top number is the correlation coefficient value, which is 0.310.
2. The middle number is the significance of this correlation, which is 0.018.
3. The bottom number, 46, is the number of observations that were used to calculate the correlation coefficient.

Note that the significance tells us how likely we would be to obtain a correlation this large purely due to chance factors rather than an actual relation. In this case, it is improbable that we would get an r (correlation coefficient) this big if there were no relation between the variables.
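Outside SPSS, the same three quantities (correlation, significance, and sample size) can be obtained programmatically. Below is a sketch using SciPy’s `pearsonr` on simulated scores (the data here are made up for illustration; they are not the data behind the SPSS output described above):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n = 46  # same sample size as in the SPSS example

# Simulated anxiety/depression-style scores (made-up data)
anxiety = rng.normal(50, 10, size=n)
depression = 0.3 * anxiety + rng.normal(35, 9, size=n)

# pearsonr returns the correlation coefficient and its two-sided p-value
r, p = stats.pearsonr(anxiety, depression)
print(f"r = {r:.3f}, p = {p:.3f}, n = {n}")
```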