Heteroscedasticity in Regression (2020)

Heteroscedasticity in Regression

The term heteroscedasticity refers to the violation of the assumption of homoscedasticity in linear regression models (LRM). Under heteroscedasticity, the errors have unequal variances at different levels of the regressors; the OLS coefficient estimators remain unbiased but become inefficient, and the usual estimates of their standard errors are biased, invalidating the standard tests of significance. The disturbances in the Classical Linear Regression Model (CLRM) appearing in the population regression function should be homoscedastic; that is, they should all have the same variance.

Mathematical Proof of $E(\hat{\sigma}^2)\ne \sigma^2$ in the Presence of Heteroscedasticity

For the proof of $E(\hat{\sigma}^2)\ne \sigma^2$, consider the two-variable linear regression model in the presence of heteroscedasticity,

\begin{align}
Y_i=\beta_1 + \beta_2 X_i + u_i, \quad\quad (eq1)
\end{align}

where $Var(u_i)=\sigma_i^2$ (the case of heteroscedasticity).

The usual estimator of the error variance is

\begin{align}
\hat{\sigma}^2 &= \frac{\sum \hat{u}_i^2 }{n-2}\\
&= \frac{\sum (Y_i - \hat{Y}_i)^2 }{n-2}\\
&=\frac{\sum\left(\beta_1 + \beta_2 X_i + u_i - \hat{\beta}_1 -\hat{\beta}_2 X_i \right)^2}{n-2}\\
&=\frac{\sum \left( -(\hat{\beta}_1-\beta_1) - (\hat{\beta}_2 - \beta_2)X_i + u_i \right)^2 }{n-2}\quad\quad (eq2)
\end{align}

Noting that the OLS residuals always sum to zero,

\begin{align*}
\sum \hat{u}_i = \sum(Y_i-\hat{Y}_i)&=0\\
\sum\left(\beta_1 + \beta_2 X_i + u_i\, - \,\hat{\beta}_1 - \hat{\beta}_2X_i\right) &=0\\
\sum\left(-(\hat{\beta}_1 -\beta_1) - (\hat{\beta}_2-\beta_2)X_i + u_i\right) & =0\\
n(\hat{\beta}_1-\beta_1) &= -(\hat{\beta}_2-\beta_2)\sum X_i + \sum u_i\\
\text{Dividing both sides by } n,&\\
(\hat{\beta}_1 - \beta_1) &= -(\hat{\beta}_2-\beta_2)\overline{X}+\overline{u}
\end{align*}

Substituting this into (eq2) and taking expectations on both sides:

\begin{align}
E(\hat{\sigma}^2) &= \frac{1}{n-2} E\left[ \sum\left( -\left(-(\hat{\beta}_2 - \beta_2) \overline{X} + \overline{u}\right) - (\hat{\beta}_2-\beta_2)X_i + u_i  \right)^2\right]\\
&=\frac{1}{n-2}E\left[\sum\left((\hat{\beta}_2-\beta_2)\overline{X} -\overline{u} - (\hat{\beta}_2-\beta_2)X_i+u_i \right)^2\right]\\
&=\frac{1}{n-2} E\left[ \sum\left(-(\hat{\beta}_2 - \beta_2)(X_i-\overline{X}) + (u_i-\overline{u})\right)^2\right]\\
&= \frac{1}{n-2}\left[\sum x_i^2\, Var(\hat{\beta}_2) - 2E\left[(\hat{\beta}_2-\beta_2)\sum x_i u_i\right] + E\left[\sum(u_i-\overline{u})^2\right] \right]\\
&=\frac{1}{n-2} \left[ \frac{\sum x_i^2 \sigma_i^2}{\sum x_i^2} - \frac{2\sum x_i^2 \sigma_i^2}{\sum x_i^2} + \frac{(n-1)\sum \sigma_i^2}{n} \right]\\
&=\frac{1}{n-2} \left[ -\frac{\sum x_i^2 \sigma_i^2}{\sum x_i^2} + \frac{(n-1)\sum \sigma_i^2}{n} \right]
\end{align}

where $x_i=X_i-\overline{X}$ and $\sum x_i=0$. The second-to-last step uses $\hat{\beta}_2-\beta_2=\frac{\sum x_i u_i}{\sum x_i^2}$, so that both $\sum x_i^2\,Var(\hat{\beta}_2)$ and $E\left[(\hat{\beta}_2-\beta_2)\sum x_i u_i\right]$ equal $\frac{\sum x_i^2\sigma_i^2}{\sum x_i^2}$ under heteroscedasticity, together with $E\left[\sum(u_i-\overline{u})^2\right]=\frac{(n-1)\sum\sigma_i^2}{n}$.

If there is homoscedasticity, then $\sigma_i^2=\sigma^2$ for each $i$, and the expression reduces to $E(\hat{\sigma}^2)=\frac{1}{n-2}\left[-\sigma^2+(n-1)\sigma^2\right]=\sigma^2$, so $\hat{\sigma}^2$ is unbiased.

In the presence of heteroscedasticity, however, the expected value of $\hat{\sigma}^2=\frac{\sum\hat{u}_i^2}{n-2}$ will not be equal to the true $\sigma^2$; the usual estimator of the error variance is biased.
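To see the bias numerically, here is a minimal Monte Carlo sketch in Python. The sample size, coefficients, and variance function below are illustrative assumptions, not values taken from the text; the sketch compares the average of $\hat{\sigma}^2$ over many simulated samples with the theoretical expression derived above.

```python
# Hypothetical Monte Carlo sketch (all parameter values are illustrative
# assumptions): average of sigma2_hat over many samples vs the formula above.
import numpy as np

rng = np.random.default_rng(42)
n, reps = 50, 20000
beta1, beta2 = 2.0, 0.5
X = np.linspace(1, 10, n)
sigma_i = 0.5 * X                        # error SD grows with X: heteroscedasticity

sigma2_hat = np.empty(reps)
for r in range(reps):
    u = rng.normal(0.0, sigma_i)         # heteroscedastic disturbances
    Y = beta1 + beta2 * X + u
    b2 = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
    b1 = Y.mean() - b2 * X.mean()
    resid = Y - (b1 + b2 * X)
    sigma2_hat[r] = np.sum(resid ** 2) / (n - 2)

x = X - X.mean()
term1 = np.sum(x ** 2 * sigma_i ** 2) / np.sum(x ** 2)   # sum(x_i^2 sigma_i^2)/sum(x_i^2)
term2 = (n - 1) * np.sum(sigma_i ** 2) / n               # (n-1) sum(sigma_i^2)/n
print("Monte Carlo mean of sigma2_hat:", sigma2_hat.mean())
print("Theoretical E(sigma2_hat):     ", (term2 - term1) / (n - 2))
print("Average of the true sigma_i^2: ", np.mean(sigma_i ** 2))
```

The Monte Carlo mean should match the theoretical expectation, and both should differ from the average of the true $\sigma_i^2$, illustrating the bias.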


Addressing Heteroscedasticity in Regression

To address heteroscedasticity in regression analysis, several techniques can be used to stabilize the variance of the errors:

  1. Transformations: Transforming the variables (for example, with logarithmic or square root transformations) can sometimes help stabilize the variance of the errors.
  2. Weighted Least Squares (WLS): WLS assigns different weights to observations based on their error variances, giving more weight to observations with smaller variances. This helps mitigate the impact of heteroscedasticity on the estimation of the parameters.
  3. Robust Standard Errors: Heteroscedasticity-consistent standard errors, also known as robust standard errors, provide a way to correct the standard errors and hypothesis tests in the presence of heteroscedasticity without requiring assumptions about its specific form.
  4. Generalized Least Squares (GLS): The GLS method allows estimation of the regression coefficients under a broader range of assumptions about the variance-covariance structure of the errors, including heteroscedasticity. (A short sketch of remedies 2 and 3 follows this list.)
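As a rough illustration of remedies 2 and 3, the following Python sketch uses statsmodels on simulated data; the data-generating process and the WLS weights are assumptions chosen for illustration, not part of the original post.

```python
# Minimal sketch of WLS and robust standard errors (statsmodels assumed;
# data are simulated purely for illustration, with error SD proportional to x).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 200
x = rng.uniform(1, 10, n)
y = 2 + 0.5 * x + rng.normal(0, 0.4 * x)      # heteroscedastic errors
X = sm.add_constant(x)

ols = sm.OLS(y, X).fit()                      # usual (here invalid) standard errors
robust = sm.OLS(y, X).fit(cov_type="HC1")     # heteroscedasticity-consistent SEs
wls = sm.WLS(y, X, weights=1.0 / x ** 2).fit()  # weights = 1/Var(u_i), up to scale

print("OLS SEs:   ", ols.bse)
print("Robust SEs:", robust.bse)
print("WLS SEs:   ", wls.bse)
```

When the form of the variance is unknown, robust standard errors are the safer default; WLS gains efficiency only if the assumed weights approximate the true error variances.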

Overall, detecting and addressing heteroscedasticity is important for ensuring the validity and reliability of regression analysis results.

Read more on the Remedy of Heteroscedasticity

More on heteroscedasticity on Wikipedia


Goldfeld-Quandt Test Example (2020)

Data is taken from the Economic Survey of Pakistan 1991-1992. The data file link is at the end of the post “Goldfeld-Quandt Test Example for the Detection of Heteroscedasticity”.

Read about the Goldfeld-Quandt Test in detail by clicking the link “Goldfeld-Quandt Test: Comparison of Variances of Error Terms“.

Goldfeld-Quandt Test Example

For an illustration of the Goldfeld-Quandt Test Example, the data given in the file should be divided into two sub-samples after dropping the middle five observations.

Sub-sample 1 consists of data from 1959-60 to 1970-71.

Sub-sample 2 consists of data from 1976-77 to 1987-88.

Sub-sample 1 is highlighted in green, and sub-sample 2 is highlighted in blue, while the middle observations that have to be deleted are highlighted in red.

[Data table for the Goldfeld-Quandt Test Example, with the two sub-samples and the dropped middle observations highlighted]

The Step-by-Step Procedure to Conduct the Goldfeld-Quandt Test

Step 1: Order or Rank the observations according to the value of $X_i$. (Note that observations are already ranked.)

Step 2: Omit $c$ central observations. Here about 1/6 of the observations ($c=5$) are removed from the middle of the data.

Step 3: Fit OLS regression on both samples separately and obtain the Residual Sum of Squares (RSS) for each sub-sample.

The estimated regressions for the two sub-samples are:

Sub-sample 1: $\hat{C}_1 = 1010.096 + 0.849 \text{Income}$

Sub-sample 2: $\hat{C}_2 = -244.003 + 0.88067 \text{Income}$

Now compute the Residual Sum of Squares for both sub-samples.

The Residual Sum of Squares for sub-sample 1 is $RSS_1=2532224$.

The Residual Sum of Squares for sub-sample 2 is $RSS_2=10339356$.

Step 4: Compute the F-statistic $\lambda=\frac{RSS_2/df_2}{RSS_1/df_1}=\frac{10339356/10}{2532224/10}=4.083$, where $df_i=n_i-k=12-2=10$ is the degrees of freedom of each sub-sample regression.
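The arithmetic can be checked with a few lines of Python (scipy assumed available):

```python
# Quick arithmetic check of the test statistic and its critical value.
from scipy.stats import f

rss1, rss2 = 2532224, 10339356
df1 = df2 = 10                          # n_i - k = 12 - 2 for each sub-sample
lam = (rss2 / df2) / (rss1 / df1)
print(round(lam, 3))                    # 4.083
print(round(f.ppf(0.95, df2, df1), 2))  # 5% critical value of F(10, 10): 2.98
```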

The critical value of $F(10, 10)$ at the 5% level of significance is 2.98.

Since the computed $F$ value (4.083) is greater than the critical value (2.98), heteroscedasticity exists in this case; that is, the variance of the error term is not constant but depends on the independent variable, income (GNP).

Your assignment is to perform the Goldfeld-Quandt Test Example using any statistical software and confirm the results.
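One possible way to do this in Python is sketched below, assuming the downloaded data have been saved as a CSV file named gnp_consumption.csv with columns income and consumption (hypothetical names; adjust them to the actual file). The statsmodels function het_goldfeldquandt implements the test directly.

```python
# Sketch of the assignment (hypothetical file and column names; adjust to the
# actual data file linked below).
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_goldfeldquandt

df = pd.read_csv("gnp_consumption.csv").sort_values("income")
y = df["consumption"]
X = sm.add_constant(df["income"])

# drop=5 removes the five central observations, as in the worked example
fstat, pvalue, order = het_goldfeldquandt(y, X, drop=5, alternative="increasing")
print(fstat, pvalue)    # fstat should be close to 4.083
```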

Download the data file by clicking the link “GNP and consumption expenditure data“.

Learn about White’s Test of Heteroscedasticity


Heteroscedasticity Residual Plot (2020)

This post is about the Heteroscedasticity Residual Plot.

Heteroscedasticity and Heteroscedasticity Residual Plot

One of the assumptions of the classical linear regression model is that there is no heteroscedasticity (the error terms have constant variance), which ensures that the ordinary least squares (OLS) estimators are BLUE (best linear unbiased estimators): their variances are the lowest among all unbiased linear estimators (Gauss-Markov Theorem).

If the assumption of constant variance does not hold, the Gauss-Markov Theorem no longer applies. For heteroscedastic data, regression analysis still provides an unbiased estimate of the relationship between the predictors and the outcome variable, but the estimates are no longer efficient.

As we have discussed, heteroscedasticity occurs when the error term has non-constant variance. In this case, we can think of the disturbance for each observation as being drawn from a different distribution with a different variance. Stated equivalently, the variance of the observed value of the dependent variable around the regression line is non-constant.

We can think of each observed value of the dependent variable as being drawn from a different conditional probability distribution with a different conditional variance. A general linear regression model with the assumption of heteroscedasticity can be expressed as follows

\begin{align*}
y_i & = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \cdots + \beta_p X_{ip} + \varepsilon_i\\
Var(\varepsilon_i)&=E(\varepsilon_i^2)=\sigma_i^2, \quad i=1,2,\cdots, n
\end{align*}

Note that there is an $i$ subscript attached to sigma squared. This indicates that the disturbance for each of the $n$ units is drawn from a probability distribution that has a different variance.

If the error term has non-constant variance, but all other assumptions of the classical linear regression model are satisfied, then the consequences of using the OLS estimator to obtain estimates of the population parameters are:

  • The OLS estimator is still unbiased
  • The OLS estimator is inefficient; that is, it is not BLUE
  • The estimated variances and covariances of the OLS estimates are biased and inconsistent
  • Hypothesis tests are not valid

Detection of Heteroscedasticity by Residual Plot

The residual for the $i$th observation, $\hat{\varepsilon}_i$, is an unbiased estimate of the unknown and unobservable error for that observation, $\varepsilon_i$. Thus the squared residuals, $\hat{\varepsilon}_i^2$, can be used as estimates of the unknown and unobservable error variances, $\sigma_i^2=E(\varepsilon_i^2)$.

One can calculate the squared residuals and then plot them against an explanatory variable that you believe might be related to the error variance.  If you believe that the error variance may be related to more than one of the explanatory variables, you can plot the squared residuals against each one of these variables.  Alternatively, you could plot the squared residuals against the fitted value of the dependent variable obtained from the OLS estimates.  Most statistical programs (software) have a command to do these residual plots.  It must be emphasized that this is not a formal test for heteroscedasticity.  It would only suggest whether heteroscedasticity may exist.
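A minimal sketch of this diagnostic plot in Python (statsmodels and matplotlib, with simulated heteroscedastic data as an illustrative assumption):

```python
# Minimal sketch of the informal check: squared OLS residuals against fitted
# values (simulated heteroscedastic data, illustrative only).
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

rng = np.random.default_rng(7)
x = rng.uniform(1, 10, 100)
y = 2 + 0.5 * x + rng.normal(0, 0.4 * x)   # error SD grows with x

fit = sm.OLS(y, sm.add_constant(x)).fit()
plt.scatter(fit.fittedvalues, fit.resid ** 2)
plt.xlabel("Fitted values")
plt.ylabel("Squared residuals")
plt.title("Squared residuals vs fitted values")
plt.show()
```

A fan or funnel shape in such a plot suggests that the error variance changes with the fitted values.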

Below are residual plots showing the three typical patterns. The first plot shows a random pattern, indicating a good fit for a linear model. The other two patterns are non-random (U-shaped and inverted U), suggesting that a non-linear model would fit better than a linear regression model.

[Figures: Heteroscedasticity Residual Plot 1 (random pattern), Heteroscedasticity Residual Plot 2 (U-shaped), Heteroscedasticity Residual Plot 3 (inverted U)]


Heteroscedasticity Consequences

Heteroscedasticity refers to a situation in which the variability of the errors (residuals) in a regression model is not constant across all levels of the independent variable(s); that is, the assumption of homoscedasticity in the linear regression model (LRM) is violated.

Heteroscedasticity Consequences

In brief, the consequences of heteroscedasticity are:

  • The OLS estimators and regression predictions based on them remain unbiased and consistent.
  • The OLS estimators are no longer the BLUE (Best Linear Unbiased Estimators) because they are no longer efficient, so the regression predictions will be inefficient too.
  • Because the covariance matrix of the estimated regression coefficients is inconsistent, the tests of hypotheses (t-test, F-test) are no longer valid.

A more detailed discussion of the consequences of heteroscedasticity follows:
  1. Inefficient Estimates: As a result of the violation of the homoscedasticity assumption, the OLS estimates become inefficient; the estimators are no longer the Best Linear Unbiased Estimators (BLUE) and can have larger standard errors. Large standard errors may lead to incorrect conclusions about the statistical significance of the regression coefficients.
  2. Imprecise Estimates: In the case of heteroscedasticity, the ordinary least squares estimators (OLSE) are still unbiased, but they are no longer the most efficient estimators and may have larger variances, so the estimated coefficients may stray further from the true population parameters in any given sample.
  3. Incorrect Standard Errors: The standard errors of the regression coefficients are biased in the presence of heteroscedasticity, which leads to inaccurate inference in hypothesis testing, including incorrect t-tests, F-tests, and p-values. Researchers may mistakenly conclude that a variable is not statistically significant when it is, or vice versa.
  4. Invalid Inference: Biased standard errors also lead to invalid inferences about the population parameters, because the confidence intervals and hypothesis tests built on them are unreliable.
  5. Model Misspecification: Heteroscedasticity may indicate a misspecification of the underlying model. If the assumption of constant variance is violated, it suggests that there may be unaccounted-for factors or omitted variables influencing the variability of the errors. It suggests that the model may not be capturing all the variability in the data adequately.
  6. Inflated Type I Errors: Heteroscedasticity can lead to inflated Type I errors (false positives) in hypothesis tests. Researchers might mistakenly reject null hypotheses when they should not, leading to incorrect conclusions.
  7. Suboptimal Forecasting: Models affected by heteroscedasticity may provide suboptimal forecasts since the variability of the errors is not accurately captured. This can impact the model’s ability to make reliable predictions.
  8. Robustness Issues: Heteroscedasticity can make regression models less robust, meaning that their performance deteriorates when applied to different datasets or when the underlying assumptions are not met.

To detect heteroscedasticity, apply a formal test such as the Breusch-Pagan test or the White test, and consider corrective measures like weighted least squares regression or transforming the data. A minimal sketch of both tests is given below.
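Both tests are available in statsmodels; the data-generating process in this sketch is an assumption chosen so that heteroscedasticity is present by construction.

```python
# Sketch of the Breusch-Pagan and White tests with statsmodels (simulated
# heteroscedastic data; the data-generating process is an assumption).
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan, het_white

rng = np.random.default_rng(3)
x = rng.uniform(1, 10, 200)
y = 2 + 0.5 * x + rng.normal(0, 0.4 * x)   # heteroscedastic by construction
X = sm.add_constant(x)
fit = sm.OLS(y, X).fit()

bp_lm, bp_pval, bp_f, bp_f_pval = het_breuschpagan(fit.resid, X)
w_lm, w_pval, w_f, w_f_pval = het_white(fit.resid, X)
print("Breusch-Pagan p-value:", bp_pval)   # small p-value => reject homoscedasticity
print("White test p-value:   ", w_pval)
```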

Learn about Remedial Measures of Heteroscedasticity
