# Basic Statistics and Data Analysis

## Breusch-Pagan Test for Heteroscedasticity

The Breusch–Pagan test (named after Trevor Breusch and Adrian Pagan) is used to test for heteroscedasticity in a linear regression model.

Assume our regression model is $Y_i = \beta_1 + \beta_2 X_{2i} + \mu_i$, i.e. we have a simple linear regression model, and $E(\mu_i^2)=\sigma_i^2$, where $\sigma_i^2=f(\alpha_1 + \alpha_2 Z_{2i})$.

That is, $\sigma_i^2$ is some function of the non-stochastic variables $Z$. The function $f(\cdot)$ allows for both linear and non-linear forms of the model. The variable $Z$ may be the independent variable $X$ itself, or it may represent a group of independent variables other than $X$.
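To make the role of $f(\cdot)$ concrete, here are two illustrative choices (assumed here as examples, not prescribed by the text above), together with the hypothesis the test examines:

$$\sigma_i^2 = \alpha_1 + \alpha_2 Z_{2i} \quad \text{(linear)}, \qquad \sigma_i^2 = \exp(\alpha_1 + \alpha_2 Z_{2i}) \quad \text{(exponential)}$$

$$H_0: \alpha_2 = 0 \;\Rightarrow\; \sigma_i^2 = f(\alpha_1) \text{ for all } i$$

In either case, when $\alpha_2 = 0$ the variance no longer depends on $Z_{2i}$, so the errors are homoscedastic.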

Steps to Perform the Breusch-Pagan Test

1. Estimate the model by OLS and obtain the residuals $\hat{\mu}_1, \hat{\mu}_2, \cdots, \hat{\mu}_n$.
2. Estimate the variance of the residuals, i.e. $\hat{\sigma}^2=\frac{\sum \hat{\mu}_i^2}{n}$ (the maximum likelihood estimator of $\sigma^2$).
3. Run the auxiliary regression $\frac{\hat{\mu}_i^2}{\hat{\sigma}^2}=\alpha_1+\alpha_2 Z_{2i} + v_i$, where $v_i$ is the error term of this regression, and compute the explained sum of squares (ESS) from it.
4. Test the statistical significance of ESS/2 by a $\chi^2$-test with 1 df at an appropriate level of significance (α). Under the null hypothesis of homoscedasticity, ESS/2 follows a $\chi^2$ distribution with 1 df in large samples.
5. Reject the hypothesis of homoscedasticity in favour of heteroscedasticity if $\frac{ESS}{2}$ exceeds the critical value $\chi^2_{(1)}$ at the chosen level of α.
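The steps above can be sketched in plain Python for the simple regression case. This is a minimal illustration rather than a reference implementation: the function names (`ols_simple`, `chi2_1_sf`, `breusch_pagan_simple`) and the simulated data are hypothetical, $Z$ is assumed to be $X$ unless supplied, and the variance estimate divides by $n$ (the maximum likelihood form). The p-value uses the fact that a $\chi^2$ variable with 1 df is the square of a standard normal variable.

```python
import math
import random


def ols_simple(x, y):
    """Fit y = a + b*x by OLS; return (intercept, slope, fitted values)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    fitted = [intercept + slope * xi for xi in x]
    return intercept, slope, fitted


def chi2_1_sf(stat):
    """P(chi-square with 1 df > stat), using chi2(1) = (standard normal)^2."""
    return math.erfc(math.sqrt(stat / 2))


def breusch_pagan_simple(x, y, z=None):
    """Return (ESS/2, p-value) for one regressor and one Z variable."""
    z = x if z is None else z
    n = len(y)
    # Step 1: residuals from the original regression
    _, _, fitted = ols_simple(x, y)
    resid = [yi - fi for yi, fi in zip(y, fitted)]
    # Step 2: variance estimate (divides by n; an assumption of this sketch)
    sigma2 = sum(e * e for e in resid) / n
    # Step 3: regress scaled squared residuals on Z and compute ESS
    p = [e * e / sigma2 for e in resid]
    _, _, p_fitted = ols_simple(z, p)
    p_bar = sum(p) / n
    ess = sum((pf - p_bar) ** 2 for pf in p_fitted)
    # Step 4: ESS/2 is compared with the chi-square(1) distribution
    stat = ess / 2
    return stat, chi2_1_sf(stat)


# Simulated data (hypothetical): error spread grows with x in y_het only
random.seed(0)
x = [1 + 9 * i / 199 for i in range(200)]
y_het = [2 + 0.5 * xi + random.gauss(0, 0.3 * xi) for xi in x]
y_hom = [2 + 0.5 * xi + random.gauss(0, 1.5) for xi in x]

stat_het, p_het = breusch_pagan_simple(x, y_het)
stat_hom, p_hom = breusch_pagan_simple(x, y_hom)
print(f"heteroscedastic data: ESS/2 = {stat_het:.2f}, p = {p_het:.4g}")
print(f"homoscedastic data:   ESS/2 = {stat_hom:.2f}, p = {p_hom:.4g}")
```

Step 5 then amounts to rejecting homoscedasticity when the p-value falls below α, or equivalently when ESS/2 exceeds the $\chi^2_{(1)}$ critical value (about 3.841 at α = 0.05).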

Note that:

• The Breusch-Pagan test is valid only if the $\mu_i$'s are normally distributed.
• For k independent variables, ESS/2 has a Chi-square ($\chi^2$) distribution with k degrees of freedom.
• If the $\mu_i$'s (error terms) are not normally distributed, the White test is used.

References:

• Breusch, T.S.; Pagan, A.R. (1979). “A simple test for heteroscedasticity and random coefficient variation”. Econometrica (The Econometric Society) 47 (5): 1287–1294.

# Heteroscedasticity

An important assumption of OLS is that the disturbances $\mu_i$ appearing in the population regression function are homoscedastic (the error terms all have the same variance),
i.e. the variance of each disturbance term $\mu_i$, conditional on the chosen values of the explanatory variables, is some constant number equal to $\sigma^2$: $E(\mu_{i}^{2})=\sigma^2$, where $i=1,2,\cdots, n$.
Homo means equal and scedasticity means spread.

Consider the general linear regression model
$y_i=\beta_1+\beta_2 x_{2i}+ \beta_3 x_{3i} +\cdots + \beta_k x_{ki} + \varepsilon_i$

If $E(\varepsilon_{i}^{2})=\sigma^2$ for all $i=1,2,\cdots, n$ then the assumption of constant variance of the error term or homoscedasticity is satisfied.

If instead $E(\varepsilon_{i}^{2})=\sigma_i^2$, so that the error variance differs across observations, then the assumption of homoscedasticity is violated and heteroscedasticity is said to be present. In the case of heteroscedasticity, the OLS estimators are unbiased but inefficient.
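The difference between the two cases can be illustrated with simulated disturbances. The sketch below is only an illustration under assumed values: the helper `spread_by_half`, the sample size, and the standard deviations are hypothetical. It compares the average squared disturbance in the low-$x$ and high-$x$ halves of a sample.

```python
import random


def mean_square(values):
    """Average of squared values, a sample analogue of E(eps^2)."""
    return sum(v * v for v in values) / len(values)


def spread_by_half(x, eps):
    """Mean squared disturbance in the low-x half and in the high-x half."""
    pairs = sorted(zip(x, eps))  # order observations by x
    half = len(pairs) // 2
    low = [e for _, e in pairs[:half]]
    high = [e for _, e in pairs[half:]]
    return mean_square(low), mean_square(high)


random.seed(1)
n = 1000
x = [random.uniform(1, 10) for _ in range(n)]

# Homoscedastic: the same standard deviation for every observation
eps_hom = [random.gauss(0, 2.0) for _ in x]
# Heteroscedastic: the standard deviation grows in proportion to x
eps_het = [random.gauss(0, 0.4 * xi) for xi in x]

hom_low, hom_high = spread_by_half(x, eps_hom)
het_low, het_high = spread_by_half(x, eps_het)
print(f"homoscedastic:   low-x {hom_low:.2f}, high-x {hom_high:.2f}")
print(f"heteroscedastic: low-x {het_low:.2f}, high-x {het_high:.2f}")
```

In the homoscedastic case the two halves should show a similar spread, while in the heteroscedastic case the high-$x$ half should show a clearly larger one.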

Examples:

1. The range in family income between the poorest and the richest families in a town is the classical example of heteroscedasticity.
2. The range in annual sales between a corner drug store and a general store.

## Reasons for Heteroscedasticity

There are several reasons why the variance of the error term $\mu_i$ may vary, some of which are:

1. Following the error-learning models, as people learn, their errors of behavior become smaller over time. In this case $\sigma_{i}^{2}$ is expected to decrease. For example, consider the number of typing errors made in a given time period on a test in relation to the hours put into typing practice: as practice increases, the errors and their variability tend to decrease.
2. As income grows, people have more discretionary income and more choice in how to spend it, and hence $\sigma_{i}^{2}$ is likely to increase with income.
3. As data collection techniques improve, $\sigma_{i}^{2}$ is likely to decrease.
4. Heteroscedasticity can also arise as a result of the presence of outliers. The inclusion or exclusion of such observations, especially when the sample size is small, can substantially alter the results of regression analysis.
5. Heteroscedasticity can arise from violating the assumption of the CLRM (classical linear regression model) that the regression model is correctly specified.
6. Skewness in the distribution of one or more regressors included in the model is another source of heteroscedasticity.
7. Incorrect data transformation or incorrect functional form (e.g. a linear instead of a log-linear model) is also a source of heteroscedasticity.
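As an illustration of the last point, the following sketch simulates data from an assumed log-linear model with multiplicative errors and fits it twice: once in levels (the wrong functional form) and once in logs. The data-generating values and the helpers `ols_residuals` and `spread_ratio` are hypothetical choices made for this example only.

```python
import math
import random


def ols_residuals(x, y):
    """Residuals from the simple OLS fit of y on x (with intercept)."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    slope = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
             / sum((xi - x_bar) ** 2 for xi in x))
    intercept = y_bar - slope * x_bar
    return [yi - intercept - slope * xi for xi, yi in zip(x, y)]


def spread_ratio(x, resid):
    """Mean squared residual in the high-x half divided by the low-x half."""
    pairs = sorted(zip(x, resid))  # order observations by x
    half = len(pairs) // 2
    low = sum(e * e for _, e in pairs[:half]) / half
    high = sum(e * e for _, e in pairs[half:]) / (len(pairs) - half)
    return high / low


# Assumed true model: log(y) = log(3) + 1.2*log(x) + eps, constant variance
random.seed(7)
x = [random.uniform(1, 10) for _ in range(1000)]
y = [3 * xi ** 1.2 * math.exp(random.gauss(0, 0.3)) for xi in x]

# Linear fit in levels: the functional form is wrong for these data
ratio_levels = spread_ratio(x, ols_residuals(x, y))

# Log-log fit: matches the assumed data-generating process
log_x = [math.log(xi) for xi in x]
log_y = [math.log(yi) for yi in y]
ratio_logs = spread_ratio(log_x, ols_residuals(log_x, log_y))

print(f"levels fit:  high/low residual spread = {ratio_levels:.2f}")
print(f"log-log fit: high/low residual spread = {ratio_logs:.2f}")
```

A ratio well above 1 for the levels fit, alongside a ratio near 1 for the log-log fit, suggests that the apparent heteroscedasticity here comes from the choice of functional form rather than from the underlying errors.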

# Consequences of Heteroscedasticity

1. The OLS estimators and the regression predictions based on them remain unbiased and consistent.
2. The OLS estimators are no longer the BLUE (Best Linear Unbiased Estimators) because they are no longer efficient, so the regression predictions will be inefficient too.
3. Because the usual estimator of the covariance matrix of the estimated regression coefficients is biased and inconsistent, the usual tests of hypotheses (t-test, F-test) are no longer valid.
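The first and third consequences can be checked with a small Monte Carlo experiment. The sketch below is an illustration under assumed values (sample size, number of replications, true coefficients, and an error standard deviation proportional to $x^2$ are all hypothetical choices). It compares the average OLS slope with the true slope, and the actual sampling spread of the slope with the average conventional OLS standard error.

```python
import math
import random
import statistics

random.seed(42)
n, reps = 50, 2000
true_intercept, true_slope = 2.0, 0.5

# Fixed regressor values, evenly spaced on [1, 10]
x = [1 + 9 * i / (n - 1) for i in range(n)]
x_bar = sum(x) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)

slopes, conventional_ses = [], []
for _ in range(reps):
    # Heteroscedastic errors: standard deviation proportional to x squared
    y = [true_intercept + true_slope * xi + random.gauss(0, 0.05 * xi ** 2)
         for xi in x]
    y_bar = sum(y) / n
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    intercept = y_bar - slope * x_bar
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    slopes.append(slope)
    # Conventional OLS standard error, which assumes homoscedasticity
    conventional_ses.append(math.sqrt(rss / (n - 2) / sxx))

mean_slope = statistics.mean(slopes)
empirical_sd = statistics.stdev(slopes)
mean_conventional_se = statistics.mean(conventional_ses)

print(f"true slope {true_slope}, average estimate {mean_slope:.3f}")
print(f"empirical sd of slope estimates: {empirical_sd:.3f}")
print(f"average conventional std. error: {mean_conventional_se:.3f}")
```

In this particular design the average estimate should sit close to the true slope, while the conventional standard error tends to understate the actual variability of the estimates, which is why t-tests based on it can mislead. In other designs the conventional standard error may overstate the variability instead.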

Note: The problem of heteroscedasticity is likely to be more common in cross-sectional data than in time series data.

References:

• Greene, W.H. (1993). Econometric Analysis. Prentice–Hall. ISBN 0-13-013297-7.
• Verbeek, Marno (2004). A Guide to Modern Econometrics, 2nd ed. Chichester: John Wiley & Sons.