Pearson Correlation Coefficient (2012)

Introduction to Pearson Correlation Coefficient

The correlation coefficient, or Pearson Correlation Coefficient, was developed by Karl Pearson in the 1890s. It measures the strength and direction of the linear relationship between two continuous random variables. It is denoted by $\rho_{XY}$ for a population and by $r_{XY}$ for a sample.

The Pearson Correlation Coefficient takes values in the interval $[-1, 1]$. If the coefficient is $1$ or $-1$, there is a perfect linear relationship between the variables. A positive sign indicates a positive (direct) relationship, in which the variables tend to increase together, while a negative sign indicates a negative (inverse) relationship, in which one variable tends to decrease as the other increases.

A value of zero implies the absence of a linear relationship, but it does not by itself show that the variables are independent. Independent variables always have zero correlation, but the converse does not hold: when the coefficient is zero, there may still be some other, non-linear relationship between the variables of interest, such as a curved or circular one.
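As a small illustration of zero correlation without independence, the sketch below uses made-up data in which $Y$ is completely determined by $X$ (here $Y = X^2$), yet the numerator of the correlation coefficient is zero. The data values and variable names are assumptions chosen for illustration only.

```python
# Hypothetical data: Y is an exact (non-linear) function of X.
xs = [-2, -1, 0, 1, 2]
ys = [x * x for x in xs]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Sum of cross-products of deviations: the numerator of Pearson's r.
cross = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))

# The numerator is zero, so r is zero, although Y depends entirely on X.
print(cross)
```

Because the positive and negative deviations of $X$ cancel symmetrically, the linear association vanishes even though the relationship is perfectly deterministic.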

Figure: Pearson Correlation Coefficient Scatter Diagram

Pearson’s Correlation Formula

Mathematically, if two random variables $X$ and $Y$ follow some (possibly unknown) joint distribution, then the simple linear correlation coefficient is equal to the covariance between $X$ and $Y$ divided by the product of their standard deviations, i.e.,

\[\rho=\frac{Cov(X, Y)}{\sigma_X \sigma_Y}\]

where $Cov(X, Y)$ is the covariance between $X$ and $Y$, and $\sigma_X$ and $\sigma_Y$ are the respective standard deviations of the random variables.

For a sample of size $n$, $(X_1, Y_1),(X_2, Y_2),\cdots,(X_n, Y_n)$, drawn from the joint distribution, the quantity given below is an estimate of $\rho$. It is called the sample correlation coefficient and is denoted by $r$.

\begin{eqnarray*}
r&=&\frac{\sum_{i=1}^{n}(X_i-\bar{X})(Y_i-\bar{Y})}{\sqrt{\sum_{i=1}^{n}(X_i-\bar{X})^2 \times \sum_{i=1}^{n}(Y_i-\bar{Y})^2}}\\
&=& \frac{Cov(X,Y)}{S_X S_Y}
\end{eqnarray*}
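The two forms of $r$ can be checked numerically. The following is a minimal sketch, assuming a small made-up data set; the function name `pearson_r` and the data values are illustrative assumptions. It computes $r$ from the sums of deviations and again as the sample covariance divided by the product of the sample standard deviations (both using the divisor $n-1$).

```python
import math

def pearson_r(xs, ys):
    """Sample correlation from sums of deviations about the means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical sample of n = 5 paired observations.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

r = pearson_r(xs, ys)

# Second form: sample covariance over the product of sample standard deviations.
n = len(xs)
mean_x, mean_y = sum(xs) / n, sum(ys) / n
cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)
s_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / (n - 1))
s_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / (n - 1))
r_from_cov = cov_xy / (s_x * s_y)

print(round(r, 4), round(r_from_cov, 4))
```

The divisor $n-1$ cancels between the numerator and the denominator, which is why both forms give the same value.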

Note that

  • The existence of a statistical correlation does not mean that there is a cause-and-effect relation between the variables. Cause and effect means that a change in one variable actually produces a change in the other variable.
  • The changes in the variables may be due to a common cause or random variations.
  • There are many kinds of correlation coefficients. The choice of which to use for a particular set of data depends on different factors such as
    • Type of Scale (Level of Measurement or Measurement Scale) used to express the variables
    • Nature of the underlying distribution (continuous or discrete)
    • Characteristics of the relationship between the scores (linear or non-linear)
  • Correlation is perfectly linear if a constant change in $X$ is accompanied by a constant change in $Y$. In this case, all the points in the scatter diagram will lie on a straight line.
  • A high correlation coefficient does not necessarily imply a direct dependence between the variables. For example, there may be a high correlation between the number of crimes and shoe prices. This kind of correlation is referred to as a nonsense or spurious correlation.

Properties of Pearson Correlation Coefficient

The following are important properties of the Pearson correlation coefficient:

  1. The Pearson correlation coefficient is symmetrical for $X$ and $Y$ i.e. $r_{XY}=r_{YX}$.
  2. The correlation coefficient is a pure number and does not depend upon the units in which the variables are measured.
  3. The correlation coefficient is the geometric mean of the two regression coefficients. Thus, if the two regression lines of $Y$ on $X$ and $X$ on $Y$ are written as $Y=a+bX$ and $X=c+dY$ respectively, then $bd=r^2$, i.e., $r=\pm\sqrt{bd}$, where $r$ takes the common sign of $b$ and $d$.
  4. The correlation coefficient is independent of the choice of origin and scale of measurement of the variables, i.e., $r$ remains unchanged if constants are added to or subtracted from the variables, or if the variables are multiplied or divided by positive constants (for grouped data, for example, by the class interval size). Multiplying exactly one of the variables by a negative constant reverses the sign of $r$ but not its magnitude.
  5. The correlation coefficient lies between -1 and +1, symbolically $-1\le r \le 1$.
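The properties listed above can be verified numerically. The sketch below is only an illustration on made-up data; the helper names `deviation_sums` and `pearson_r`, the data, and the particular constants used for shifting and scaling are all assumptions.

```python
import math

def deviation_sums(xs, ys):
    """Return (Sxy, Sxx, Syy), the sums of products and squares of deviations."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy, sxx, syy

def pearson_r(xs, ys):
    sxy, sxx, syy = deviation_sums(xs, ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical paired observations.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

# Property 1: symmetry.
r_xy = pearson_r(xs, ys)
r_yx = pearson_r(ys, xs)

# Properties 2 and 4: change of origin and (positive) scale leaves r unchanged.
r_shifted = pearson_r([3 * x + 10 for x in xs], [0.5 * y - 2 for y in ys])

# A negative scale factor on one variable flips the sign only.
r_flipped = pearson_r([-2 * x for x in xs], ys)

# Property 3: product of the two least-squares slopes equals r squared.
sxy, sxx, syy = deviation_sums(xs, ys)
b = sxy / sxx  # slope of the regression of Y on X
d = sxy / syy  # slope of the regression of X on Y

print(r_xy, r_yx, r_shifted, r_flipped, b * d, r_xy ** 2)
```

Here the slopes are the usual least-squares estimates, $b = S_{XY}/S_{XX}$ and $d = S_{XY}/S_{YY}$, so their product is $S_{XY}^2/(S_{XX}S_{YY}) = r^2$.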
