Correlation Analysis - Strength of Relationship

Spearman Rank Correlation Test

Mar 25, 2025 | Jan 25, 2021 by Muhammad Imdad Ullah

Consider the following data to illustrate the detection of heteroscedasticity using the Spearman rank correlation test. The data file is available to download.

Y	X2	X3
11	20	8.1
16	18	8.4
11	22	8.5
14	21	8.5
13	27	8.8
17	26	9
14	25	8.9
15	27	9.4
12	30	9.5
18	28	9.5

The estimated multiple linear regression model is:

$$\hat{Y}_i = -34.936 - 0.75X_{2i} + 7.611X_{3i}$$
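As a rough check on the fitted model, the following sketch estimates the three coefficients by ordinary least squares using the deviation-form normal equations for two regressors. The helper names (`cross_dev`, `fit_two_regressors`) are illustrative choices, not part of the original post, and the output should agree with the printed coefficients only up to rounding.

```python
# Data from the table above.
Y = [11, 16, 11, 14, 13, 17, 14, 15, 12, 18]
X2 = [20, 18, 22, 21, 27, 26, 25, 27, 30, 28]
X3 = [8.1, 8.4, 8.5, 8.5, 8.8, 9, 8.9, 9.4, 9.5, 9.5]


def mean(v):
    return sum(v) / len(v)


def cross_dev(a, b):
    """Sum of cross-products of deviations from the means."""
    ma, mb = mean(a), mean(b)
    return sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b))


def fit_two_regressors(y, x2, x3):
    """OLS for y = b1 + b2*x2 + b3*x3 via the deviation-form normal equations."""
    s22, s33, s23 = cross_dev(x2, x2), cross_dev(x3, x3), cross_dev(x2, x3)
    s2y, s3y = cross_dev(x2, y), cross_dev(x3, y)
    det = s22 * s33 - s23 ** 2
    b2 = (s2y * s33 - s3y * s23) / det
    b3 = (s3y * s22 - s2y * s23) / det
    b1 = mean(y) - b2 * mean(x2) - b3 * mean(x3)
    return b1, b2, b3


b1, b2, b3 = fit_two_regressors(Y, X2, X3)
# Residual = observed value minus fitted value.
residuals = [y - (b1 + b2 * x2 + b3 * x3) for y, x2, x3 in zip(Y, X2, X3)]

print(round(b1, 3), round(b2, 3), round(b3, 3))
print([round(u, 5) for u in residuals])
```

Because the model includes an intercept, the residuals should sum to zero apart from floating-point error, which is a quick sanity check on any OLS fit.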

The residuals, alongside the data, are:

Y	X2	X3	Residuals
11	20	8.1	-0.63302
16	18	8.4	0.575564
11	22	8.5	-2.16954
14	21	8.5	0.076455
13	27	8.8	1.317102
17	26	9	3.040825
14	25	8.9	0.047951
15	27	9.4	-1.2497
12	30	9.5	-2.74881
18	28	9.5	1.743171

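Next, rank the absolute values of the residuals, $|u_i|$, and the variable suspected of causing heteroscedasticity, $X_2$. Tied values receive the average of their ranks, and $d$ is the difference between the two ranks.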

$Y$	$X_2$	$X_3$	Residuals	Rank of $|u_i|$	Rank of $X_2$	$d$	$d^2$
11	20	8.1	-0.633	4	2	2	4
16	18	8.4	0.576	3	1	2	4
11	22	8.5	-2.170	8	4	4	16
14	21	8.5	0.076	2	3	-1	1
13	27	8.8	1.317	6	7.5	-1.5	2.25
17	26	9	3.041	10	6	4	16
14	25	8.9	0.048	1	5	-4	16
15	27	9.4	-1.250	5	7.5	-2.5	6.25
12	30	9.5	-2.749	9	10	-1	1
18	28	9.5	1.743	7	9	-2	4
					Total =	0	70.5
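The ranking step can be reproduced with a short sketch like the one below. It assumes average ranks for ties (as the table does for the two values $X_2 = 27$); the function name `average_ranks` is a hypothetical helper introduced only for this illustration.

```python
def average_ranks(values):
    """Rank from smallest (1) to largest; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the block while the next sorted value equals the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1  # mean of the 1-based positions pos+1 .. end+1
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks


# Residuals and X2 copied from the tables above.
residuals = [-0.63302, 0.575564, -2.16954, 0.076455, 1.317102,
             3.040825, 0.047951, -1.2497, -2.74881, 1.743171]
X2 = [20, 18, 22, 21, 27, 26, 25, 27, 30, 28]

rank_u = average_ranks([abs(u) for u in residuals])
rank_x2 = average_ranks(X2)
d = [ru - rx for ru, rx in zip(rank_u, rank_x2)]
sum_d2 = sum(di ** 2 for di in d)

print(rank_u)
print(rank_x2)
print(sum(d), sum_d2)
```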

Calculating the Spearman Rank Correlation

\begin{align}
r_s&=1-\frac{6\sum d^2}{n(n^2-1)}\\
&=1-\frac{6\times 70.5}{10(100-1)}=0.5727
\end{align}
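The arithmetic can be checked with a minimal sketch of the shortcut formula $r_s = 1 - 6\sum d^2 / (n(n^2-1))$. The function name `spearman_from_d2` is illustrative. Note that this shortcut is exact only when there are no ties; with the single tie in $X_2$ it is a close approximation to the correlation of the ranks.

```python
def spearman_from_d2(sum_d2, n):
    """Spearman rank correlation from the sum of squared rank differences."""
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))


# Sum of d^2 and sample size taken from the table above.
r_s = spearman_from_d2(70.5, 10)
print(round(r_s, 4))
```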

Let us test the statistical significance of $r_s$ using the t-test with $n-2$ degrees of freedom:

\begin{align}
t&=\frac{r_s \sqrt{n-2}}{\sqrt{1-r_s^2}}\\
&=\frac{0.5727\sqrt{8}}{\sqrt{1-(0.5727)^2}}\approx 1.976
\end{align}

The tabulated value of $t$ at the 5% level of significance (two-tailed) with 8 degrees of freedom is 2.306.

Since $t_{cal} < t_{tab}$, there is no evidence of a systematic relationship between the explanatory variable $X_2$ and the absolute values of the residuals ($|u_i|$), and hence there is no evidence of heteroscedasticity.
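The decision rule can be sketched as follows, assuming the statistic $t = r_s\sqrt{n-2}/\sqrt{1-r_s^2}$ with $n-2$ degrees of freedom. The critical value 2.306 is copied from the text rather than computed, and `spearman_t` is a hypothetical helper name.

```python
import math


def spearman_t(r_s, n):
    """t statistic for testing H0: no rank correlation, with n - 2 degrees of freedom."""
    return r_s * math.sqrt(n - 2) / math.sqrt(1 - r_s ** 2)


t_cal = spearman_t(0.5727, 10)
t_tab = 2.306  # two-tailed 5% critical value for 8 df, taken from the text

# Reject H0 (i.e., conclude heteroscedasticity) only if |t_cal| exceeds t_tab.
reject_h0 = abs(t_cal) > t_tab
print(round(t_cal, 3), reject_h0)
```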

Since there is more than one regressor (the example is from a multiple regression model), Spearman’s rank correlation test should be repeated for each of the explanatory variables.

As an assignment, perform the Spearman Rank Correlation between $|u_i|$ and $X_3$ for the data above. Test the statistical significance of the coefficient in the above manner to explore evidence about heteroscedasticity.


Covariance and Correlation

Mar 25, 2025 | Aug 20, 2015 by Muhammad Imdad Ullah

Introduction to Covariance and Correlation

Covariance and correlation are two important concepts in statistics. Covariance measures the degree to which two variables co-vary (i.e., change together). If greater values of one variable (say, $X_i$) correspond to greater values of the other variable (say, $X_j$), i.e., if the variables tend to show similar behavior, then the covariance between the two variables ($X_i$, $X_j$) is positive.

Similarly, if smaller values of one variable correspond to smaller values of the other variable, then the covariance between the two variables is positive. In contrast, if greater values of one variable (say, $X_i$) mainly correspond to smaller values of the other variable (say, $X_j$), i.e., the two variables tend to show opposite behavior, then the covariance is negative.

In other words, a positive covariance means that the two variables change together in the same direction relative to their expected values (averages): if one variable moves above its average value, the other variable tends to be above its average value as well.

Similarly, if the covariance between the two variables is negative, then when one variable is above its expected value, the other variable tends to be below its expected value. If the covariance is zero, there is no linear dependency between the two variables.

Mathematical Representation of Covariance

Mathematically, the covariance between two random variables $X_i$ and $X_j$ can be represented as
\[COV(X_i, X_j)=E[(X_i-\mu_i)(X_j-\mu_j)]\]
where
$\mu_i=E(X_i)$ is the average of the first variable
$\mu_j=E(X_j)$ is the average of the second variable

\begin{align*}
COV(X_i, X_j)&=E[(X_i-\mu_i)(X_j-\mu_j)]\\
&=E[X_i X_j - X_i E(X_j)-X_j E(X_i)+E(X_i)E(X_j)]\\
&=E(X_i X_j)-E(X_i)E(X_j) - E(X_j)E(X_i)+E(X_i)E(X_j)\\
&=E(X_i X_j)-E(X_i)E(X_j)
\end{align*}

Note that the covariance of a random variable with itself is the variance of the random variable, i.e., $COV(X_i, X_i)=VAR(X_i)$. If $X_i$ and $X_j$ are independent, then $E(X_i X_j)=E(X_i)E(X_j)$ and $COV(X_i, X_j)=E(X_i X_j)-E(X_i) E(X_j)=0$.
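To make the derivation concrete, here is a small sketch that evaluates both forms of the covariance on a made-up data set (the values of `x` and `y` are hypothetical). Expectations are replaced by averages over the data, i.e., the data are treated as the whole population and sums are divided by $n$.

```python
def mean(v):
    return sum(v) / len(v)


def cov_definition(x, y):
    """E[(X - mu_x)(Y - mu_y)], with the expectation taken as an average."""
    mx, my = mean(x), mean(y)
    return mean([(a - mx) * (b - my) for a, b in zip(x, y)])


def cov_shortcut(x, y):
    """E(XY) - E(X)E(Y), the last line of the derivation."""
    return mean([a * b for a, b in zip(x, y)]) - mean(x) * mean(y)


# Hypothetical data, chosen only for illustration.
x = [2, 4, 6, 8, 10]
y = [1, 3, 2, 5, 4]

print(cov_definition(x, y), cov_shortcut(x, y))  # the two forms agree
print(cov_definition(x, x))                      # covariance of x with itself = variance of x
```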

Covariance and Correlation

Correlation and covariance are related measures but not equivalent statistical measures.

Equation of Correlation (Normalized Covariance)

The correlation between two variables (say, $X_i$ and $X_j$) is their normalized covariance, defined as
\[\rho_{i,j}=\frac{E[(X_i-\mu_i)(X_j-\mu_j)]}{\sigma_i \sigma_j}\]
where $\sigma_i$ is the standard deviation of $X_i$ and $\sigma_j$ is the standard deviation of $X_j$. For a sample of $n$ paired observations $(X, Y)$, the corresponding computing formula is
\[r=\frac{n \sum XY - \sum X \sum Y}{\sqrt{(n \sum X^2 -(\sum X)^2)(n \sum Y^2 - (\sum Y)^2)}}\]
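As an illustration, the sketch below computes the correlation in two ways on hypothetical data: from the raw sums ($n\sum XY - \sum X \sum Y$, and so on) and as the covariance divided by the product of the standard deviations. Both function names are illustrative, and the two routes should give the same value.

```python
import math


def pearson_from_sums(x, y):
    """Correlation from raw sums of X, Y, XY, X^2, and Y^2."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    syy = sum(b * b for b in y)
    return (n * sxy - sx * sy) / math.sqrt((n * sxx - sx ** 2) * (n * syy - sy ** 2))


def pearson_from_covariance(x, y):
    """Correlation as covariance divided by the product of standard deviations."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    sd_x = math.sqrt(sum((a - mx) ** 2 for a in x) / n)
    sd_y = math.sqrt(sum((b - my) ** 2 for b in y) / n)
    return cov / (sd_x * sd_y)


# Hypothetical data, chosen only for illustration.
x = [2, 4, 6, 8, 10]
y = [1, 3, 2, 5, 4]

r_sums = pearson_from_sums(x, y)
r_cov = pearson_from_covariance(x, y)
print(round(r_sums, 4), round(r_cov, 4))
```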

Note that correlation is dimensionless, i.e., a number that is free of the measurement units, and its value lies between $-1$ and $+1$ inclusive. In contrast, covariance is expressed in the product of the units of the two variables.

When to Use Covariance and Correlation

The covariance and correlation should be used as described below:

Covariance: Useful in portfolio theory (finance).
Correlation: Preferred in most cases (e.g., psychology, medicine, ML) due to standardized interpretation.

For example, the correlation between study hours and exam scores can be used to measure the strength of the relationship (e.g., $r = 0.7$ shows a strong positive link between study hours and exam scores).

Similarly, the covariance between stock returns helps in diversification.

The Sign of Covariance

The sign of the covariance matters:

Positive Covariance: Variables move together (↑X → ↑Y).
Negative Covariance: Variables move inversely (↑X → ↓Y).

Limitation of Covariance

The value of covariance depends on the units of measurement (for example, the covariance of “hours vs. scores” $\ne$ the covariance of “minutes vs. scores”). For a unit-free measure with a standardized interpretation, use correlation.
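The unit dependence can be seen in a small sketch. The study-hours and score values below are made up for illustration; converting hours to minutes multiplies the covariance by 60 but leaves the correlation unchanged.

```python
import math


def mean(v):
    return sum(v) / len(v)


def covariance(x, y):
    mx, my = mean(x), mean(y)
    return mean([(a - mx) * (b - my) for a, b in zip(x, y)])


def correlation(x, y):
    return covariance(x, y) / math.sqrt(covariance(x, x) * covariance(y, y))


# Hypothetical data: study time and exam scores for five students.
hours = [1, 2, 3, 4, 5]
scores = [52, 60, 61, 70, 77]
minutes = [60 * h for h in hours]  # same study time in different units

print(covariance(hours, scores), covariance(minutes, scores))    # differ by a factor of 60
print(correlation(hours, scores), correlation(minutes, scores))  # identical up to rounding
```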


Correlation Coefficient Range

Jul 6, 2025 | Oct 12, 2012 by Muhammad Imdad Ullah

The coefficient of correlation ($r$) measures the strength and direction of a linear relationship between two variables. In this post, we will discuss the coefficient of correlation and the coefficient of determination.

Correlation Coefficient Ranges

The correlation coefficient ranges from -1 to +1. A value of +1 indicates a perfect positive correlation (as one variable increases, the other increases linearly), a value of -1 indicates a perfect negative correlation (as one variable increases, the other decreases linearly), and a value of 0 indicates no linear correlation (a nonlinear relationship may still exist).

Values between -1 and +1 indicate the strength and direction of the relationship. The sign gives the direction, and the strength depends on the absolute value of $r$:

Range of Correlation Value	Interpretation
0.90 to 1.00	Very strong correlation
0.70 to 0.89	Strong correlation
0.40 to 0.69	Moderate correlation
0.10 to 0.39	Weak correlation
0.00 to 0.09	No or negligible correlation

The closer the value of the correlation coefficient is to ±1, the stronger the linear relationship.
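The interpretation table can be expressed as a simple lookup. This is only a sketch: the cut-offs are the ones listed above (conventions vary between fields), and `correlation_strength` is a hypothetical helper name.

```python
def correlation_strength(r):
    """Label the strength of a correlation using the cut-offs in the table above."""
    if not -1 <= r <= 1:
        raise ValueError("a correlation coefficient must lie between -1 and +1")
    size = abs(r)  # strength depends on the absolute value; the sign gives direction
    if size >= 0.90:
        return "Very strong correlation"
    if size >= 0.70:
        return "Strong correlation"
    if size >= 0.40:
        return "Moderate correlation"
    if size >= 0.10:
        return "Weak correlation"
    return "No or negligible correlation"


for r in (0.95, -0.70, 0.45, -0.20, 0.0):
    direction = "positive" if r > 0 else "negative" if r < 0 else "none"
    print(r, correlation_strength(r), direction)
```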

Coefficient of Determination

The ratio of the explained variation to the total variation is called the coefficient of determination. It is the square of the correlation coefficient and lies between $0$ and $1$. Because this ratio is non-negative, it is denoted by $r^2$; thus

\begin{align*}
r^2&=\frac{\text{Explained Variation}}{\text{Total Variation}}\\
&=\frac{\sum (\hat{Y}-\overline{Y})^2}{\sum (Y-\overline{Y})^2}
\end{align*}

It can be seen that if the total variation is all explained, the ratio $r^2$ (Coefficient of Determination) is one, and if the total variation is all unexplained, then the explained variation and the ratio $r^2$ are zero.
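The two extremes can be illustrated with a short sketch that fits a simple regression line by least squares and returns the ratio of explained to total variation. All three data sets are hypothetical and chosen so that the fit is perfect, useless, and partial, respectively.

```python
def mean(v):
    return sum(v) / len(v)


def r_squared(x, y):
    """Explained variation divided by total variation for a least-squares line."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    intercept = my - slope * mx
    fitted = [intercept + slope * a for a in x]
    explained = sum((f - my) ** 2 for f in fitted)
    total = sum((b - my) ** 2 for b in y)
    return explained / total


x = [1, 2, 3, 4]
print(r_squared(x, [3, 5, 7, 9]))  # y is exactly linear in x: all variation explained
print(r_squared(x, [2, 4, 4, 2]))  # fitted slope is zero: nothing explained
print(r_squared([2, 4, 6, 8, 10], [1, 3, 2, 5, 4]))  # in between
```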

The square root of the coefficient of determination is called the correlation coefficient, given by

\begin{align*}
r&=\pm\sqrt{ \frac{\text{Explained Variation}}{\text{Total Variation}} }\\
&=\pm \sqrt{\frac{\sum (\hat{Y}-\overline{Y})^2}{\sum (Y-\overline{Y})^2}}
\end{align*}

and

\[\sum (\hat{Y}-\overline{Y})^2=\sum(Y-\overline{Y})^2-\sum (Y-\hat{Y})^2\]

Therefore

\begin{align*}
r&=\pm\sqrt{ \frac{\sum(Y-\overline{Y})^2-\sum (Y-\hat{Y})^2} {\sum(Y-\overline{Y})^2} }\\
&=\pm\sqrt{1-\frac{\sum (Y-\hat{Y})^2}{\sum(Y-\overline{Y})^2}}\\
&=\pm\sqrt{1-\frac{\text{Unexplained Variation}}{\text{Total Variation}}}=\pm\sqrt{1-\frac{s_{y.x}^2}{s_y^2}}
\end{align*}

where $s_{y.x}^2=\frac{1}{n} \sum (Y-\hat{Y})^2$ and $s_y^2=\frac{1}{n} \sum (Y-\overline{Y})^2$

\begin{align*}
\Rightarrow r^2&=1-\frac{s_{y.x}^2}{s_y^2}\\
\Rightarrow s_{y.x}^2&=s_y^2(1-r^2)
\end{align*}
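The identity $s_{y.x}^2 = s_y^2(1-r^2)$ can be verified numerically. The sketch below uses hypothetical data and the same divisor $n$ as the definitions above; the variable names are illustrative.

```python
import math

# Hypothetical data, chosen only for illustration.
x = [2, 4, 6, 8, 10]
y = [1, 3, 2, 5, 4]
n = len(x)

mx, my = sum(x) / n, sum(y) / n
sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
sxx = sum((a - mx) ** 2 for a in x)
syy = sum((b - my) ** 2 for b in y)

# Least-squares line and its fitted values.
slope = sxy / sxx
intercept = my - slope * mx
fitted = [intercept + slope * a for a in x]

s_yx2 = sum((b - f) ** 2 for b, f in zip(y, fitted)) / n  # unexplained variance
s_y2 = syy / n                                            # total variance of y
r = sxy / math.sqrt(sxx * syy)                            # correlation coefficient

print(s_yx2, s_y2 * (1 - r ** 2))  # the two sides of the identity
```

Since the left-hand side is a variance and cannot be negative, $1-r^2$ cannot be negative either, which is the step the argument relies on next.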

Since variances are non-negative

\[\frac{s_{y.x}^2}{s_y^2}=1-r^2 \geq 0\]

Solving the inequality, we have

\begin{align*}
1-r^2 & \geq 0\\
\Rightarrow r^2 \leq 1\, \text{or}\, |r| &\leq 1\\
\Rightarrow & -1 \leq r\leq 1
\end{align*}

Therefore, the correlation coefficient lies between $-1$ and $+1$ inclusive.

Alternative Proof: Correlation Coefficient Range

Let $X^*=\frac{X-\mu_X}{\sigma_X}$ and $Y^*=\frac{Y-\mu_Y}{\sigma_Y}$ be the standardized variables.

As covariance is bilinear and $X^*, Y^*$ have zero mean and unit variance, $\rho(X,Y)=\rho(X^*,Y^*)$:

\begin{align*}
\rho(X^*,Y^*)&=Cov(X^*,Y^*)=Cov\{\frac{X-\mu_X}{\sigma_X},\frac{Y-\mu_Y}{\sigma_Y}\}\\
&=\frac{Cov(X-\mu_X,Y-\mu_Y)}{\sigma_X\sigma_Y}\\
&=\frac{Cov(X,Y)}{\sigma_X \sigma_Y}=\rho(X,Y)
\end{align*}

We also know that the variance of any random variable is non-negative, and that it equals zero if and only if the variable is constant (almost surely). Applying this to $X^* \pm Y^*$,

\[Var(X^* \pm Y^*)=Var(X^*)+Var(Y^*)\pm 2Cov(X^*,Y^*) \geq 0\]

As $Var(X^*)=1$ and $Var(Y^*)=1$, the right-hand side equals $2 \pm 2Cov(X^*,Y^*)$, which would be negative if $Cov(X^*,Y^*)$ were either greater than 1 or less than $-1$. Hence

\[1\geq \rho(X,Y)=\rho(X^*,Y^*)\geq -1\]

If $\rho(X,Y)=Cov(X^*,Y^*)=1$ then $Var(X^*- Y^*)=0$, making $X^*= Y^*$ almost surely. Similarly, if $\rho(X,Y)=Cov(X^*,Y^*)=-1$ then $X^* = -Y^*$ almost surely. In either case, $Y$ is a linear function of $X$ almost surely.
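A numerical sketch of this argument, on hypothetical data: after standardizing both variables, the variance of their sum and of their difference should equal $2 + 2\rho$ and $2 - 2\rho$, and both must be non-negative. Population formulas (divisor $n$) are assumed throughout.

```python
import math


def mean(v):
    return sum(v) / len(v)


def variance(v):
    m = mean(v)
    return mean([(a - m) ** 2 for a in v])


def standardize(v):
    """Subtract the mean and divide by the standard deviation."""
    m, s = mean(v), math.sqrt(variance(v))
    return [(a - m) / s for a in v]


# Hypothetical data, chosen only for illustration.
x = [2, 4, 6, 8, 10]
y = [1, 3, 2, 5, 4]

xs, ys = standardize(x), standardize(y)
rho = mean([a * b for a, b in zip(xs, ys)])  # Cov(X*, Y*), since both means are zero

var_sum = variance([a + b for a, b in zip(xs, ys)])
var_diff = variance([a - b for a, b in zip(xs, ys)])

print(var_sum, 2 + 2 * rho)   # Var(X* + Y*) = 2 + 2*rho
print(var_diff, 2 - 2 * rho)  # Var(X* - Y*) = 2 - 2*rho
```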


Thus, the correlation coefficient lies between $-1$ and $+1$.

Real-Life Example

Variable 1	Variable 2	Coefficient Value	Interpretation
Study hours	Exam scores	+0.85	Strong positive
Screen time	Sleep duration	-0.70	Strong negative
Age	Shoe size	~0.00	No linear correlation

FAQs about Correlation Coefficient

What is a coefficient of correlation?
What does a positive or negative correlation mean?
What is a strong or weak correlation?
Can correlation imply causation?
What are the types of correlation coefficients?
When should I use Pearson vs. Spearman correlation?
What are the assumptions of the Pearson correlation?
Can correlation be used for more than two variables?
How is correlation different from regression?
How is the correlation coefficient calculated?
What does a zero correlation mean?
Can correlation be misleading?
