Introduction to Pearson Correlation Coefficient
The correlation coefficient, or Pearson correlation coefficient, was developed by Karl Pearson in the 1890s. It measures the strength and direction of the linear relationship between two continuous random variables. It is denoted by $\rho_{XY}$ for a population and by $r_{XY}$ for a sample.
The Pearson correlation coefficient takes values in the interval $[-1, 1]$. A value of $1$ or $-1$ indicates a perfect linear relationship between the variables. A positive sign indicates a positive (direct) relationship, while a negative sign indicates a negative (inverse, or opposite) relationship between the variables.
A value of zero implies the absence of a linear relationship, but it does not by itself show that the variables are independent. Independent variables have zero correlation, yet the converse does not hold in general: there may still be some other kind of relationship between the variables of interest, such as a curved or circular one.
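A small sketch can make the zero-coefficient case concrete. The data below are made up for illustration, and `pearson_r` is a hypothetical helper name, not part of any library. Here $Y$ is completely determined by $X$, yet the coefficient comes out as zero because the relationship is symmetric and non-linear.

```python
import math

def pearson_r(xs, ys):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Illustrative data: Y = X^2 on values of X that are symmetric about zero.
x = [-3, -2, -1, 0, 1, 2, 3]
y = [v ** 2 for v in x]

r = pearson_r(x, y)
print(r)  # 0.0, although Y depends entirely on X
```

The positive and negative halves of the parabola cancel in the numerator, so the coefficient only reports that there is no linear trend.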
Pearson’s Correlation Formula
Mathematically, if two random variables $X$ and $Y$ follow an unknown joint distribution, then the simple linear correlation coefficient is equal to the covariance between $X$ and $Y$ divided by the product of their standard deviations, i.e.
\[\rho=\frac{Cov(X, Y)}{\sigma_X \sigma_Y}\]
where $Cov(X, Y)$ is the covariance between $X$ and $Y$, and $\sigma_X$ and $\sigma_Y$ are the respective standard deviations of the random variables.
For a sample of size $n$, $(X_1, Y_1),(X_2, Y_2),\cdots,(X_n, Y_n)$, from the joint distribution, the quantity given below is an estimate of $\rho$, called the sample correlation coefficient and denoted by $r$.
\begin{eqnarray*}
r&=&\frac{\sum_{i=1}^{n}(X_i-\bar{X})(Y_i-\bar{Y})}{\sqrt{\sum_{i=1}^{n}(X_i-\bar{X})^2 \times \sum_{i=1}^{n}(Y_i-\bar{Y})^2}}\\
&=& \frac{Cov(X,Y)}{S_X S_Y}
\end{eqnarray*}
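The two forms of $r$ above can be checked against each other numerically. The following is a minimal sketch on made-up data. It assumes the sample covariance and the sample standard deviations use the same divisor ($n-1$ here), so the divisor cancels and both forms agree.

```python
import math

# Made-up illustrative sample of n = 5 pairs.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Sums of cross-products and squares of deviations from the means.
sum_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
sum_xx = sum((xi - x_bar) ** 2 for xi in x)
sum_yy = sum((yi - y_bar) ** 2 for yi in y)

# First form: ratio of sums.
r_from_sums = sum_xy / math.sqrt(sum_xx * sum_yy)

# Second form: sample covariance over the product of sample standard deviations
# (assumption: divisor n - 1 for both).
cov_xy = sum_xy / (n - 1)
s_x = math.sqrt(sum_xx / (n - 1))
s_y = math.sqrt(sum_yy / (n - 1))
r_from_cov = cov_xy / (s_x * s_y)

print(round(r_from_sums, 4), round(r_from_cov, 4))  # 0.7746 0.7746
```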
Note that
- The existence of a statistical correlation does not mean that there is a cause-and-effect relationship between the variables. Cause and effect means that a change in one variable causes a change in the other variable.
- The changes in the variables may be due to a common cause or random variations.
- There are many kinds of correlation coefficients. The choice of which to use for a particular set of data depends on factors such as:
  - Type of scale (level of measurement or measurement scale) used to express the variables
  - Nature of the underlying distribution (continuous or discrete)
  - Characteristics of the distribution of the scores (linear or non-linear)
- Correlation is perfectly linear if a constant change in $X$ is accompanied by a constant change in $Y$. In this case, all the points in the scatter diagram lie on a straight line.
- A high correlation coefficient does not necessarily imply a direct dependence between the variables. For example, there may be a high correlation between the number of crimes and shoe prices. This kind of correlation is referred to as a nonsense or spurious correlation.
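The common-cause point in the notes above can be sketched with invented numbers. In the example below, both series are hypothetical and constructed to drift upward over the same years. Neither influences the other, yet their coefficient is close to $1$. The variable names and values are assumptions made purely for illustration.

```python
import math

def pearson_r(xs, ys):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Invented yearly figures: both series rise over time (the common cause),
# each with its own small, unrelated fluctuations.
year_index = [1, 2, 3, 4, 5, 6, 7, 8]
crimes = [13, 18, 34, 39, 52, 57, 71, 76]
shoe_price = [4, 12, 13, 21, 28, 27, 37, 38]

r_crime_shoes = pearson_r(crimes, shoe_price)
print(r_crime_shoes)  # high, although neither series drives the other
```

A high value here reflects only the shared time trend, which is why such a coefficient should not be read as evidence of dependence.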
Properties of Pearson Correlation Coefficient
The following are important properties of the Pearson correlation coefficient:
- The Pearson correlation coefficient is symmetric in $X$ and $Y$, i.e. $r_{XY}=r_{YX}$.
- The correlation coefficient is a pure number and does not depend upon the units in which the variables are measured.
- The correlation coefficient is the geometric mean of the two regression coefficients. Thus, if the two regression lines of $Y$ on $X$ and $X$ on $Y$ are written as $Y=a+bX$ and $X=c+dY$ respectively, then $bd=r^2$, i.e. $r=\pm\sqrt{bd}$, with $r$ taking the common sign of $b$ and $d$.
- The correlation coefficient is independent of the choice of origin and scale of measurement of the variables, i.e. $r$ remains unchanged if constants are added to or subtracted from the variables, or if the variables are multiplied or divided by constants of the same sign (for example, a class interval size).
- The correlation coefficient lies between -1 and +1, symbolically $-1\le r \le 1$.
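The properties listed above can be checked numerically. The following is a minimal sketch on made-up data. The helper names `centered_sums` and `pearson_r` are hypothetical, and the regression slopes are assumed to be the usual least-squares slopes.

```python
import math

def centered_sums(xs, ys):
    """Return the sums of cross-products and squares of deviations from the means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy, sxx, syy

def pearson_r(xs, ys):
    sxy, sxx, syy = centered_sums(xs, ys)
    return sxy / math.sqrt(sxx * syy)

x = [1, 2, 3, 4, 5]   # made-up illustrative data
y = [2, 4, 5, 4, 5]
r = pearson_r(x, y)

# Symmetry: r_XY equals r_YX.
symmetric = math.isclose(r, pearson_r(y, x))

# Change of origin and scale (positive multipliers) leaves r unchanged.
x_coded = [2 * v + 10 for v in x]
y_coded = [0.5 * v - 3 for v in y]
invariant = math.isclose(r, pearson_r(x_coded, y_coded))

# Least-squares slopes: b for Y on X, d for X on Y; their product is r^2.
sxy, sxx, syy = centered_sums(x, y)
b = sxy / sxx
d = sxy / syy
geometric_mean = math.isclose(b * d, r ** 2)

# Bounds: r lies between -1 and +1.
bounded = -1 <= r <= 1

print(symmetric, invariant, geometric_mean, bounded)  # True True True True
```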