# Pearson Correlation Coefficient: Use, Interpretation, Properties

The correlation coefficient, or Pearson’s correlation coefficient, was introduced by Karl Pearson in the 1900s. It measures the strength of the linear relationship between two continuous random variables. It is denoted by $\rho_{XY}$ for a population and by $r_{XY}$ for a sample.

The correlation coefficient takes values in the interval $[-1, 1]$. If the coefficient is 1 or -1, there is a perfect linear relationship between the variables. A positive sign indicates a positive (direct) relationship, while a negative sign indicates a negative (inverse) relationship between the variables. A value of zero implies the absence of a linear relationship. It does not, by itself, show that the variables are independent: there may be some other sort of relationship between the variables of interest, such as a curvilinear or circular one.
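The three boundary cases can be checked on a few made-up points. The sketch below is illustrative only: the data and the helper name `pearson_r` are assumptions, not part of the original text. It uses an increasing line, a decreasing line, and a parabola. For the parabola, $Y$ is completely determined by $X$, yet the coefficient is zero because the relationship is not linear.

```python
import math

def pearson_r(xs, ys):
    """Sample Pearson correlation from sums of squares and cross-products."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data, chosen symmetric around zero.
xs = [-2, -1, 0, 1, 2]
ys_linear = [2 * x + 1 for x in xs]    # perfect increasing line
ys_inverse = [-3 * x + 5 for x in xs]  # perfect decreasing line
ys_square = [x ** 2 for x in xs]       # exact, but non-linear, dependence

r_linear = pearson_r(xs, ys_linear)
r_inverse = pearson_r(xs, ys_inverse)
r_square = pearson_r(xs, ys_square)

print(r_linear, r_inverse, r_square)
```

The first two values sit at the ends of the interval, while the third is zero even though $Y = X^2$ holds exactly.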

Mathematically, if two random variables $X$ and $Y$ follow an unknown joint distribution, then the simple linear correlation coefficient equals the covariance between $X$ and $Y$ divided by the product of their standard deviations, i.e.

\[\rho=\frac{Cov(X, Y)}{\sigma_X \sigma_Y}\]

where $Cov(X, Y)$ is the covariance between $X$ and $Y$, and $\sigma_X$ and $\sigma_Y$ are the respective standard deviations of the random variables.

For a sample of size $n$, $(X_1, Y_1),(X_2, Y_2),\cdots,(X_n, Y_n)$, from the joint distribution, the quantity given below is an estimate of $\rho$, called the sample correlation coefficient and denoted by $r$.

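\begin{eqnarray*}
r&=&\frac{\sum_{i=1}^{n}(X_i-\bar{X})(Y_i-\bar{Y})}{\sqrt{\sum_{i=1}^{n}(X_i-\bar{X})^2 \times \sum_{i=1}^{n}(Y_i-\bar{Y})^2}}\\
&=& \frac{Cov(X,Y)}{S_X S_Y}
\end{eqnarray*}

Here $Cov(X,Y)$ denotes the sample covariance, and $S_X$ and $S_Y$ are the sample standard deviations of $X$ and $Y$, all computed with the same divisor.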
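As a rough check that the two forms of the sample formula agree, the sketch below computes $r$ both ways on a small made-up data set. The data and variable names are assumptions for illustration. The covariance uses the divisor $n-1$, which matches `statistics.stdev` from the Python standard library.

```python
import math
import statistics

# Hypothetical paired observations.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
n = len(xs)

mean_x = statistics.mean(xs)
mean_y = statistics.mean(ys)

# First form: sums of cross-products and squared deviations.
sum_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
sum_xx = sum((x - mean_x) ** 2 for x in xs)
sum_yy = sum((y - mean_y) ** 2 for y in ys)
r_sums = sum_xy / math.sqrt(sum_xx * sum_yy)

# Second form: sample covariance over the product of sample standard deviations.
cov_xy = sum_xy / (n - 1)
s_x = statistics.stdev(xs)  # divisor n - 1
s_y = statistics.stdev(ys)  # divisor n - 1
r_cov = cov_xy / (s_x * s_y)

print(r_sums, r_cov)
```

For this data the sums are $6$, $10$ and $6$, so both forms give $6/\sqrt{60}\approx 0.775$. The divisor cancels between numerator and denominator, so using $n$ throughout would give the same $r$.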

Note that

- The existence of a statistical correlation does not mean that there is a cause-and-effect relationship between the variables. Cause and effect means that a change in one variable causes a change in the other variable.
- The changes in the variables may be due to a common cause or random variations.
- There are many kinds of correlation coefficients. The choice of which to use for a particular set of data depends on different factors such as
    - Type of scale (measurement scale) used to express the variables
    - Nature of the underlying distribution (continuous or discrete)
    - Characteristics of the distribution of the scores (linear or non-linear)

- Correlation is perfectly linear if a constant change in $X$ is accompanied by a constant change in $Y$. In this case, all the points in the scatter diagram lie on a straight line.
- A high correlation coefficient does not necessarily imply a direct dependence between the variables. For example, there may be a high correlation between the number of crimes and shoe prices. This kind of correlation is referred to as nonsense or spurious correlation.
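One way a high correlation can arise without any direct dependence is a common cause. The simulation below is a minimal sketch under made-up assumptions: a hypothetical variable `z` drives both `x` and `y`, each with its own independent noise, and neither of `x` and `y` influences the other. The names, the noise levels, and the seed are all illustrative choices.

```python
import math
import random

def pearson_r(xs, ys):
    """Sample Pearson correlation from sums of squares and cross-products."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

random.seed(0)  # fixed seed so the run is repeatable
n = 2000

# Hypothetical common cause and two variables that each depend only on it.
z = [random.gauss(0, 1) for _ in range(n)]
x = [zi + random.gauss(0, 0.5) for zi in z]
y = [zi + random.gauss(0, 0.5) for zi in z]

r_xy = pearson_r(x, y)

# Remove the common cause: what is left is independent noise.
x_rest = [xi - zi for xi, zi in zip(x, z)]
y_rest = [yi - zi for yi, zi in zip(y, z)]
r_rest = pearson_r(x_rest, y_rest)

print(r_xy, r_rest)
```

Under these assumptions the correlation between `x` and `y` should be high (its theoretical value is $1/1.25 = 0.8$), while the correlation of the parts not explained by `z` should be close to zero.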

**Properties of the Correlation Coefficient**

- The correlation coefficient is symmetrical with respect to $X$ and $Y$ i.e. $r_{XY}=r_{YX}$.
- The correlation coefficient is a pure number and it does not depend upon the units in which the variables are measured.
- The correlation coefficient is the geometric mean of the two regression coefficients. Thus, if the two regression lines of $Y$ on $X$ and $X$ on $Y$ are written as $Y=a+bX$ and $X=c+dY$ respectively, then $bd=r^2$, so $r=\pm\sqrt{bd}$ with the sign shared by $b$ and $d$.
- The correlation coefficient is independent of the choice of origin and scale of measurement of the variables, i.e. $r$ remains unchanged if constants are added to or subtracted from the variables, and if the variables are multiplied or divided by constants having the same sign (for example, the class interval size).
- The correlation coefficient lies between -1 and +1, symbolically $-1\le r \le 1$.
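The properties listed above can be checked numerically on a small made-up data set. The sketch below is illustrative only: the data, the helper `pearson_r`, and the particular shifts and scale factors are assumptions. The regression slopes are computed as $b = S_{XY}/S_{XX}$ and $d = S_{XY}/S_{YY}$, the usual least-squares formulas.

```python
import math

def pearson_r(xs, ys):
    """Sample Pearson correlation from sums of squares and cross-products."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical paired observations.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
r = pearson_r(xs, ys)

# Symmetry: swapping the roles of X and Y.
r_swapped = pearson_r(ys, xs)

# Change of origin and scale with positive scale factors.
us = [(x - 10) / 2 for x in xs]
vs = [3 * y + 7 for y in ys]
r_shifted = pearson_r(us, vs)

# A scale factor of opposite sign on one variable flips the sign of r.
r_flipped = pearson_r([-x for x in xs], ys)

# Least-squares slopes of Y on X (b) and X on Y (d).
mean_x = sum(xs) / len(xs)
mean_y = sum(ys) / len(ys)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
sxx = sum((x - mean_x) ** 2 for x in xs)
syy = sum((y - mean_y) ** 2 for y in ys)
b = sxy / sxx
d = sxy / syy

print(r, r_swapped, r_shifted, r_flipped, b * d, r ** 2)
```

For this data $b = 0.6$ and $d = 1$, so $bd = 0.6 = r^2$. The flipped case shows why the scale constants need the same sign for $r$ to be unchanged.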