## Pearson Correlation Coefficient (2012)

### Introduction to Pearson Correlation Coefficient

The correlation coefficient, or Pearson Correlation Coefficient, was developed by Karl Pearson in the early 1900s. The Pearson Correlation Coefficient measures the strength of the linear relationship between two continuous random variables; it is denoted by $\rho_{XY}$ for a population and by $r_{XY}$ for a sample.

The Pearson Correlation Coefficient takes values in the interval $[-1,1]$. If the coefficient value is $1$ or $-1$, there is a perfect linear relationship between the variables. A positive sign on the coefficient indicates a positive (direct, or supportive) relationship, while a negative sign indicates a negative (indirect, or opposite) relationship between the variables.

A zero value implies the absence of a linear relationship; it does not, by itself, show that the variables are independent. A zero value may also occur when there is some other sort of relationship between the variables of interest, such as a curvilinear (for example, circular) relationship.
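As a quick sketch of the point above, the snippet below uses made-up data in which $Y$ is completely determined by $X$ (namely $Y=X^2$), yet the Pearson correlation is essentially zero because the dependence is non-linear and symmetric about $x=0$:

```python
import numpy as np

# Y is a deterministic function of X (Y = X^2), so the variables
# are certainly not independent -- but the relationship is not
# linear, and on a grid symmetric about zero the covariance vanishes.
x = np.linspace(-1.0, 1.0, 101)   # symmetric grid around zero
y = x ** 2                        # perfect non-linear dependence

r = np.corrcoef(x, y)[0, 1]
print(abs(r) < 1e-9)              # r is (numerically) zero
```

A scatter plot of these data would show a parabola, which is why plotting before computing $r$ (as recommended later in this article) matters.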

### Pearson’s Correlation Formula

Mathematically, if two random variables $X$ and $Y$ follow an unknown joint distribution, then the simple linear correlation coefficient is equal to the covariance between $X$ and $Y$ divided by the product of their standard deviations, i.e.

$\rho=\frac{Cov(X, Y)}{\sigma_X \sigma_Y}$

where $Cov(X, Y)$ is a measure of covariance between $X$ and $Y$, and $\sigma_X$ and $\sigma_Y$ are the respective standard deviations of the random variables.

For a sample of size $n$, $(X_1, Y_1),(X_2, Y_2),\cdots,(X_n, Y_n)$, from the joint distribution, the quantity given below is an estimate of $\rho$, called the sample correlation coefficient and denoted by $r$.

\begin{eqnarray*}
r&=&\frac{\sum_{i=1}^{n}(X_i-\bar{X})(Y_i-\bar{Y})}{\sqrt{\sum_{i=1}^{n}(X_i-\bar{X})^2 \times \sum_{i=1}^{n}(Y_i-\bar{Y})^2}}\\
&=& \frac{Cov(X,Y)}{S_X S_Y}
\end{eqnarray*}
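The deviation-from-mean formula above can be computed directly. The sketch below uses a small made-up sample and cross-checks the hand computation against NumPy's built-in correlation:

```python
import numpy as np

# Compute the sample correlation r from the deviation-from-mean
# formula, then cross-check with np.corrcoef. Data are illustrative.
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

dx, dy = X - X.mean(), Y - Y.mean()
r = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

print(round(r, 4))                      # -> 0.7746
print(round(np.corrcoef(X, Y)[0, 1], 4))  # -> 0.7746 (same value)
```

Both routes give the same number, since `np.corrcoef` implements exactly this formula.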

Note that

• The existence of a statistical correlation does not mean that there exists a cause-and-effect relation between the variables. Cause and effect mean that a change in one variable does cause a change in the other variable.
• The changes in the variables may be due to a common cause or random variations.
• There are many kinds of correlation coefficients. The choice of which to use for a particular set of data depends on different factors such as
  • Type of scale (level of measurement, or measurement scale) used to express the variables
  • Nature of the underlying distribution (continuous or discrete)
  • Characteristics of the distribution of the scores (linear or non-linear)
• Correlation is perfectly linear if a constant change in $X$ is accompanied by a constant change in $Y$. In this case, all the points in the scatter diagram lie on a straight line.
• A high correlation coefficient does not necessarily imply a direct dependence between the variables. For example, there may be a high correlation between the number of crimes and shoe prices. Such a correlation is referred to as a non-sense or spurious correlation.

### Properties of Pearson Correlation Coefficient

The following are important properties of the Pearson correlation coefficient:

1. The Pearson correlation coefficient is symmetrical for $X$ and $Y$ i.e. $r_{XY}=r_{YX}$.
2. The Correlation coefficient is a pure number and it does not depend upon the units in which the variables are measured.
3. The correlation coefficient is the geometric mean of the two regression coefficients. Thus if the two regression lines of $Y$ on $X$ and $X$ on $Y$ are written as $Y=a+bX$ and $X=c+dY$ respectively, then $bd=r^2$.
4. The correlation coefficient is independent of the choice of origin and scale of measurement of the variables, i.e. $r$ remains unchanged if constants are added to or subtracted from the variables, or if the variables are multiplied or divided by positive constants.
5. The correlation coefficient lies between -1 and +1, symbolically $-1\le r \le 1$.
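Several of these properties are easy to verify numerically. The sketch below checks symmetry (property 1), invariance under change of origin and scale (property 4), and the geometric-mean relation $bd=r^2$ (property 3) on a small made-up sample:

```python
import numpy as np

# Verify three listed properties on illustrative data.
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.0, 1.0, 4.0, 3.0, 5.0])

r_xy = np.corrcoef(X, Y)[0, 1]
r_yx = np.corrcoef(Y, X)[0, 1]
assert np.isclose(r_xy, r_yx)            # property 1: r_XY = r_YX

# Shift and rescale both variables by positive constants.
r_shifted = np.corrcoef(10 + 2 * X, 5 + 3 * Y)[0, 1]
assert np.isclose(r_xy, r_shifted)       # property 4: origin/scale invariance

b = np.polyfit(X, Y, 1)[0]               # slope of the regression of Y on X
d = np.polyfit(Y, X, 1)[0]               # slope of the regression of X on Y
assert np.isclose(b * d, r_xy ** 2)      # property 3: bd = r^2

print("all three properties hold")
```

The last check works because $b=S_{XY}/S_{XX}$ and $d=S_{XY}/S_{YY}$, so $bd=S_{XY}^2/(S_{XX}S_{YY})=r^2$.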


## Correlation Coefficient: A Comprehensive Guide

Correlation is a measure of the co-variability of variables. It is used to measure the strength of the relationship between two quantitative variables, and it also tells the direction of that relationship. A positive value of the correlation coefficient indicates a direct (supportive or positive) relationship between the variables, while a negative value indicates a negative (opposite or indirect) relationship.

### Correlation as Interdependence Between Variables

By definition, Pearson’s correlation measures the interdependence between two quantitative variables. Causation (cause and effect) is when an observed event or action appears to have caused a second event or action. Correlation does not necessarily imply any functional relationship between the variables concerned: correlation theory does not establish any causal relationship, because it measures only interdependence. Knowledge of the value of Pearson’s correlation coefficient $r$ alone will not enable us to predict the value of $Y$ from $X$.

### High Correlation Coefficient does not Indicate Cause and Effect

Sometimes there is a high correlation between unrelated variables, such as the number of births and the number of murders in a country. This is a spurious correlation.

For example, suppose there is a positive correlation between watching violent movies and violent behavior in adolescence. The cause of both these could be a third variable (extraneous variable) say, growing up in a violent environment which causes the adolescents to watch violence-related movies and to have violent behavior.

Other Examples

• As the number of absences from class lectures increases, grades decrease.
• As the weather gets colder, air conditioning costs decrease.
• As the speed of a train (car, bus, or any other vehicle) increases, the length of time to get to the final point decreases.
• As the age of a chicken increases, the number of eggs it produces decreases.

## Partial Correlation Coefficient (2012)

The Partial Correlation Coefficient measures the relationship between any two variables while all other variables are kept constant, i.e. controlling for, or removing the influence of, all other variables. Partial correlation aims to find the unique variance shared by two variables while eliminating the variance due to a third variable. The partial correlation technique is commonly used in “causal” modeling of a small number of variables. The coefficient is determined in terms of the simple correlation coefficients among the various variables involved in the multiple relationship.

#### Assumptions for computing the Partial Correlation Coefficient

The assumptions for partial correlation are the usual assumptions of Pearson correlation:

1. Linearity of relationships
2. The same level of relationship throughout the range of the independent variable i.e. homoscedasticity
3. Interval or near-interval data, and
4. Data whose range is not truncated.

We typically conduct correlation analysis on all variables, so that one can see whether there are significant relationships amongst the variables, including any “third variables” that may have a significant relationship to the variables under investigation.

This type of analysis helps to find spurious correlations (i.e. correlations that are explained by the effect of some other variables) as well as to reveal hidden correlations, i.e. correlations masked by the effect of other variables. The partial correlation coefficient $r_{xy.z}$ can also be defined as the correlation coefficient between the residuals from regressing $x$ on $z$ and from regressing $y$ on $z$, as described below.

Suppose we have a sample of $n$ observations $(x_{11},x_{21},x_{31}),(x_{12},x_{22},x_{32}),\cdots,(x_{1n},x_{2n},x_{3n})$ from an unknown joint distribution of three random variables $X_1$, $X_2$, and $X_3$. The coefficient of partial correlation between $X_1$ and $X_2$ keeping $X_3$ constant, denoted by $r_{12.3}$, is the correlation between the residuals $x_{1.3}$ and $x_{2.3}$. The coefficient $r_{12.3}$ is a partial correlation of the first order.

$r_{12.3}=\frac{r_{12}-r_{13} r_{23}}{\sqrt{1-r_{13}^2 } \sqrt{1-r_{23}^2 } }$

The coefficient of partial correlation between two random variables $X$ and $Y$ controlling for a third variable $Z$, denoted by $r_{xy.z}$, can also be defined as the coefficient of correlation between the residuals $e_{x,i}=x_i-\hat{x}_i$ and $e_{y,i}=y_i-\hat{y}_i$, with
\begin{align*}
\hat{x}_i&=\hat{\beta}_{0x}+\hat{\beta}_{1x}z_i\\
\hat{y}_i&=\hat{\beta}_{0y}+\hat{\beta}_{1y}z_i\\
\end{align*}
where $\hat{\beta}_{0x}$ and $\hat{\beta}_{1x}$ are the least squares estimates obtained by regressing $x_i$ on $z_i$, and $\hat{\beta}_{0y}$ and $\hat{\beta}_{1y}$ are the least squares estimates obtained by regressing $y_i$ on $z_i$. Therefore, by definition, the partial correlation between $x$ and $y$ controlling for $z$ is
$r_{xy.z}=\frac{\sum_{i=1}^{n} e_{x,i}\,e_{y,i}}{\sqrt{\sum_{i=1}^{n} e_{x,i}^2}\sqrt{\sum_{i=1}^{n} e_{y,i}^2}}$
(no centering of the residuals is needed, since least squares residuals have mean zero).

The coefficient of partial correlation is determined in terms of the simple correlation coefficients among the various variables involved in a multiple relationship. It is a very helpful tool in the field of statistics for understanding the true underlying relationships between variables, especially when you are dealing with potentially confounding factors.
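The two routes to the partial correlation described above — the formula in terms of simple correlations, and the correlation of the regression residuals — can be checked against each other. The sketch below uses synthetic data in which a third variable influences both of the others:

```python
import numpy as np

# Synthetic data: x3 is a common cause of x1 and x2, inducing a
# marginal correlation between them that partialling on x3 reduces.
rng = np.random.default_rng(0)
x3 = rng.normal(size=200)
x1 = 0.6 * x3 + rng.normal(size=200)
x2 = 0.4 * x3 + rng.normal(size=200)

R = np.corrcoef([x1, x2, x3])
r12, r13, r23 = R[0, 1], R[0, 2], R[1, 2]

# Route 1: formula in terms of the simple correlations
r12_3 = (r12 - r13 * r23) / np.sqrt((1 - r13**2) * (1 - r23**2))

# Route 2: regress x1 and x2 on x3, correlate the residuals
e1 = x1 - np.polyval(np.polyfit(x3, x1, 1), x3)
e2 = x2 - np.polyval(np.polyfit(x3, x2, 1), x3)
r_resid = np.corrcoef(e1, e2)[0, 1]

print(np.isclose(r12_3, r_resid))  # the two definitions agree
```

The agreement is exact (up to floating point), which is the sense in which the residual definition and the simple-correlation formula are the same coefficient.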

Reference
Yule, G. U. (1926). Why do we sometimes get non-sense correlations between time series? A study in sampling and the nature of time series. Journal of the Royal Statistical Society, 89, 1–64.

## Pearson’s Correlation Coefficient SPSS (2012)

### Pearson’s Correlation Coefficient SPSS

Pearson’s correlation coefficient (or correlation, or simply correlation) is used to find the degree of linear relationship between two continuous variables. The value of the correlation coefficient lies between $-1$ (perfect negative correlation) and $+1$ (perfect positive correlation), with $0$ indicating no linear correlation. Generally, correlations above 0.80 in absolute value are considered pretty high.

Remember:

1. Correlation is the interdependence of continuous variables; it does not refer to cause and effect.
2. Correlation is used to determine the linear relationship between variables.
3. Draw a scatter plot before performing/calculating the correlation (to check the assumption of linearity).

#### How to Perform Pearson’s Correlation Coefficient SPSS

The command for correlation is found at Analyze –> Correlate –> Bivariate i.e.

The Bivariate Correlations dialog box will appear:

Select one of the variables that you want to correlate in the left-hand pane of the Bivariate Correlations dialog box and move it into the Variables pane on the right-hand side by clicking the arrow button. Now click on the other variable that you want to correlate in the left-hand pane and move it into the Variables pane by clicking the arrow button.

#### Correlation Coefficient SPSS Output

The Correlations table in the output gives the values of the specified correlation tests, such as Pearson’s correlation. Each row of the table corresponds to one of the variables, and each column likewise corresponds to one of the variables.

#### Interpreting Correlation Coefficient

For example, the cell at the bottom row of the right column represents the correlation of depression with depression, which is equal to 1.0. Likewise, the cell at the middle row of the middle column represents the correlation of anxiety with anxiety, which is also 1.0. In both cases, a variable is perfectly correlated with itself.

The cell in the middle row and right column (or the cell in the bottom row at the middle column) is more interesting. This cell represents the correlation between anxiety and depression (or depression with anxiety). There are three numbers in these cells.

1. The top number is the correlation coefficient value, which is 0.310.
2. The middle number is the significance of this correlation, which is 0.018.
3. The bottom number, 46, is the number of observations that were used to calculate the correlation coefficient.

Note that the significance tells us how likely it is that a correlation this large would arise purely by chance when there is no actual relation between the variables. In this case, it is improbable that we would get an $r$ (correlation coefficient) this big if there were no relation between the variables.
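The three numbers SPSS reports (the coefficient, its two-tailed significance, and $n$) can also be obtained outside SPSS. As a sketch, the snippet below computes them with SciPy; the `anxiety` and `depression` arrays here are made-up stand-ins for your own paired data, not the data behind the SPSS output discussed above.

```python
import numpy as np
from scipy import stats

# Illustrative data only: two paired samples of size 46, loosely
# mimicking the anxiety/depression example in the text.
rng = np.random.default_rng(1)
anxiety = rng.normal(50, 10, size=46)
depression = 0.3 * anxiety + rng.normal(0, 10, size=46)

# pearsonr returns the coefficient and its two-tailed p-value,
# the same two quantities SPSS prints in each off-diagonal cell.
r, p = stats.pearsonr(anxiety, depression)
n = len(anxiety)
print(f"r = {r:.3f}, p = {p:.3f}, n = {n}")
```

As in the SPSS table, a small $p$ relative to the chosen significance level (e.g. 0.05) indicates that a correlation of this size is unlikely under chance alone.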