# Basic Statistics and Data Analysis

## Assumptions about Linear Regression Models or Error Term

The linear regression model (LRM) is based on certain statistical assumptions. Some relate to the distribution of the random variable (error term) $\mu_i$, some to the relationship between the error term $\mu_i$ and the explanatory variables (independent variables, X’s), and some to the independent variables themselves. We can divide the assumptions into two categories:

1. Stochastic Assumptions
2. Non-Stochastic Assumptions

These assumptions about linear regression models (or the ordinary least squares method, OLS) are critical to the interpretation of the regression coefficients.

• The error term ($\mu_i$) is a random real number, i.e. $\mu_i$ may assume any positive, negative, or zero value by chance. Each value has a certain probability, so the error term is a random variable.
• The mean value of $\mu_i$ is zero, i.e. $E(\mu_i)=0$; more precisely, the mean of $\mu_i$ conditional on the given $X_i$ is zero. For each value of $X_i$, $\mu_i$ may take various values, some greater than zero and some smaller than zero, but over all possible values of $\mu_i$ for any particular value of $X$, the disturbance term has zero mean.
• The variance of $\mu_i$ is constant, i.e. for the given values of X, the variance of $\mu_i$ is the same for all observations: $E(\mu_i^2)=\sigma^2$. The values of the disturbance term show the same dispersion about their mean at all values of X.
• The variable $\mu_i$ has a normal distribution, i.e. $\mu_i\sim N(0,\sigma_{\mu}^2)$. The values of $\mu_i$ (for each $X_i$) have a bell-shaped, symmetrical distribution.
• The random terms of different observations ($\mu_i,\mu_j$) are independent, i.e. $E(\mu_i\mu_j)=0$ for $i\ne j$, so there is no autocorrelation between the disturbances. The value the random term assumes in one period does not depend on its value in any other period.
• $\mu_i$ and $X_i$ have zero covariance, i.e. $Cov(\mu_i, X_i)=0$, which, given $E(\mu_i)=0$, is equivalent to $E(\mu_i X_i)=0$. The disturbance term $\mu$ and the explanatory variable X are uncorrelated: the $\mu$’s and $X$’s do not tend to vary together. This assumption is automatically fulfilled if the X variable is non-random (non-stochastic) and the mean of the random term is zero.
• All the explanatory variables are measured without error. That is, we assume the regressors are error-free, while y (the dependent variable) may or may not include errors of measurement.
• The number of observations n must be greater than the number of parameters to be estimated or, alternatively stated, the number of observations must be greater than the number of explanatory (independent) variables.
• There should be variability in the X values. That is, the X values in a given sample must not all be the same. Statistically, $Var(X)$ must be a finite positive number.
• The regression model must be correctly specified, meaning that there is no specification bias or error in the model used in empirical analysis.
• There is no perfect or near-perfect multicollinearity (collinearity) among two or more explanatory (independent) variables.
• Values taken by the regressors X are considered to be fixed in repeated sampling, i.e. X is assumed to be non-stochastic. Regression analysis is conditional on the given values of the regressor(s) X.
• The linear regression model is linear in the parameters, e.g. $y_i=\beta_1+\beta_2x_i +\mu_i$
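
As a minimal sketch of the model in the last bullet, the following Python snippet simulates data that satisfy the assumptions above (fixed X, normal errors with zero mean and constant variance) and fits $y_i=\beta_1+\beta_2x_i+\mu_i$ by OLS. The parameter values, sample size, seed, and the helper name `ols_fit` are illustrative assumptions, not part of the original text.

```python
import random

def ols_fit(x, y):
    """Return (intercept, slope, residuals) for simple linear regression by OLS."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)  # must be > 0: X needs variability
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    residuals = [yi - intercept - slope * xi for xi, yi in zip(x, y)]
    return intercept, slope, residuals

random.seed(1)
n = 200
beta1, beta2, sigma = 2.0, 0.5, 1.0    # hypothetical true values
x = [i / 10 for i in range(1, n + 1)]  # fixed (non-stochastic) regressor
y = [beta1 + beta2 * xi + random.gauss(0.0, sigma) for xi in x]

b1, b2, e = ols_fit(x, y)
sum_e = sum(e)
sum_ex = sum(ei * xi for ei, xi in zip(e, x))
print(f"estimated intercept = {b1:.3f}, slope = {b2:.3f}")
print(f"sum of residuals     = {sum_e:.2e}")
print(f"sum of residual * x  = {sum_ex:.2e}")
```

With an intercept in the model, the OLS residuals sum to zero and are uncorrelated with X by construction (up to floating-point rounding). The corresponding assumptions about the unobserved errors $\mu_i$ therefore cannot be checked by these two sums alone.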

## Homoscedasticity: Assumption of constant variance of random variable μ (error term)

The assumption about the random variable μ (error term) is that its probability distribution remains the same for all observations of X and, in particular, that the variance of each μ is the same for all values of the explanatory variables, i.e. the variance of the errors is the same across all levels of the independent variables. Symbolically, it can be represented as

$Var(\mu_i) = E\{\mu_i - E(\mu_i)\}^2 = E(\mu_i^2) = \sigma_\mu^2 = \mbox{constant}$

This assumption is known as the assumption of homoscedasticity, or the assumption of constant variance of the error terms (the μ’s). It means that the variation of each μi around its zero mean does not depend on the values of X (the independent variable). In practice the assumption can fail, because the error term expresses the influence on the dependent variable of factors such as:

• Errors in measurement: errors of measurement tend to be cumulative over time, and it is difficult to collect data and check their consistency and reliability. The variance of μi may therefore increase with increasing values of X.
• Omitted variables: variables omitted from the function (regression model) tend to change in the same direction as X, causing an increase in the variance of the observations around the regression line.

Under homoscedasticity, the variance of each μi remains the same irrespective of small or large values of the explanatory variable, i.e. $\sigma_\mu^2$ is not a function of $X_i$: $\sigma_{\mu_i}^2 \ne f(X_i)$.
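
To make the contrast concrete, here is a small simulation sketch. It draws one set of errors with the same standard deviation at every X and another set whose standard deviation is proportional to X, then compares the sample variance of the errors in the lower and upper halves of the X range. The functional form of the changing variance, the constants, and the seed are assumptions chosen only for illustration.

```python
import random
import statistics

random.seed(2)
n = 2000
x = [1 + 9 * i / (n - 1) for i in range(n)]  # X increases from 1 to 10

# Homoscedastic errors: the same standard deviation for every X
u_homo = [random.gauss(0.0, 2.0) for _ in x]
# Heteroscedastic errors: standard deviation proportional to X (assumed form)
u_hetero = [random.gauss(0.0, 0.4 * xi) for xi in x]

half = n // 2

def variance_by_half(u):
    """Sample variance of the errors for the lower and upper half of X."""
    return statistics.variance(u[:half]), statistics.variance(u[half:])

lo_homo, hi_homo = variance_by_half(u_homo)
lo_het, hi_het = variance_by_half(u_hetero)
print(f"homoscedastic:   low-X var = {lo_homo:.2f}, high-X var = {hi_homo:.2f}")
print(f"heteroscedastic: low-X var = {lo_het:.2f}, high-X var = {hi_het:.2f}")
```

In the homoscedastic case the two variances should be close to each other, while in the heteroscedastic case the variance in the high-X half should be several times larger.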

## Consequences if Homoscedasticity is not Met

If the assumption of homoscedastic disturbances (constant variance) is not fulfilled, the consequences are as follows:

1. We cannot apply the usual formulas for the variances of the coefficients to conduct tests of significance and construct confidence intervals, so these tests are inapplicable. The formulas are $Var(\hat{\beta}_0)=\sigma_\mu^2 \left\{\frac{\sum X_i^2}{n \sum x_i^2}\right\}$ and $Var(\hat{\beta}_1) = \sigma_\mu^2 \left\{\frac{1}{\sum x_i^2}\right\}$, where $x_i = X_i - \bar{X}$; both assume a single constant $\sigma_\mu^2$.
2. If μ (the error term) is heteroscedastic, the OLS (ordinary least squares) estimates do not have the minimum-variance property in the class of unbiased estimators, i.e. they are inefficient in small samples. Furthermore, they are inefficient in large samples (asymptotically inefficient).
3. The coefficient estimates are still statistically unbiased even if the μ’s are heteroscedastic: $E(\hat{\beta}_i)=\beta_i$, i.e. the expected values of the coefficient estimates equal the true parameter values.
4. Prediction is inefficient, because the variance of the prediction includes the variance of μ and of the parameter estimates, which are not minimal under heteroscedasticity. That is, the prediction of Y for a given value of X, based on the estimates $\hat{\beta}$’s from the original data, would have a high variance.
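
Points 1 and 3 can be illustrated with a small Monte Carlo sketch. It repeatedly generates data with heteroscedastic errors, estimates the slope by OLS, and compares the actual sampling variance of the slope with the average value of the textbook estimate $\hat{\sigma}_\mu^2/\sum x_i^2$, where $x_i = X_i - \bar{X}$. The error standard deviation $0.05X_i^2$, the true parameters, the number of replications, and the seed are all assumptions made for this illustration.

```python
import random
import statistics

random.seed(3)
n, reps = 50, 2000
beta1, beta2 = 2.0, 0.5                      # hypothetical true parameters
X = [1 + 9 * i / (n - 1) for i in range(n)]  # fixed regressor, 1 to 10
sd = [0.05 * Xi ** 2 for Xi in X]            # error sd grows with X (assumed form)

X_bar = sum(X) / n
x = [Xi - X_bar for Xi in X]                 # deviations from the mean
sum_x2 = sum(xi ** 2 for xi in x)

slopes, formula_vars = [], []
for _ in range(reps):
    Y = [beta1 + beta2 * Xi + random.gauss(0.0, s) for Xi, s in zip(X, sd)]
    Y_bar = sum(Y) / n
    b2 = sum(xi * (Yi - Y_bar) for xi, Yi in zip(x, Y)) / sum_x2
    b1 = Y_bar - b2 * X_bar
    rss = sum((Yi - b1 - b2 * Xi) ** 2 for Xi, Yi in zip(X, Y))
    slopes.append(b2)
    formula_vars.append((rss / (n - 2)) / sum_x2)  # textbook Var(slope) estimate

mean_slope = statistics.mean(slopes)
empirical_var = statistics.variance(slopes)        # actual sampling variance
mean_formula_var = statistics.mean(formula_vars)   # what the formula reports
print(f"mean of estimated slopes   = {mean_slope:.4f} (true value {beta2})")
print(f"empirical Var(slope)       = {empirical_var:.5f}")
print(f"average formula Var(slope) = {mean_formula_var:.5f}")
```

Under this design the average slope stays close to the true value (unbiasedness), while the textbook formula reports a variance noticeably smaller than the one actually observed across replications, so t-tests and confidence intervals built on it would be too optimistic. The direction of the error depends on how the error variance relates to X; other designs can make the formula overstate the variance instead.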

## Tests for Homoscedasticity

Some tests commonly used for testing the assumption of homoscedasticity are:

• Spearman Rank-Correlation test
• Goldfeld and Quandt test
• Glejser test
• Breusch–Pagan test
• Bartlett’s test of Homoscedasticity
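
As one example from this list, the Spearman rank-correlation test can be sketched as follows: fit the regression by OLS, rank the absolute residuals and the X values, compute $r_s = 1 - \frac{6\sum d_i^2}{n(n^2-1)}$ from the rank differences $d_i$, and test it with $t = \frac{r_s\sqrt{n-2}}{\sqrt{1-r_s^2}}$ on $n-2$ degrees of freedom. The simulated data, the helper names, and the seed below are assumptions for illustration, and the ranking helper assumes there are no tied values.

```python
import math
import random

def ranks(values):
    """Ranks 1..n (1 = smallest); assumes no tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result

def spearman_heteroscedasticity_test(x, y):
    """Return (r_s, t) for the rank correlation between |residual| and x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    slope = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
             / sum((xi - x_bar) ** 2 for xi in x))
    intercept = y_bar - slope * x_bar
    abs_resid = [abs(yi - intercept - slope * xi) for xi, yi in zip(x, y)]
    d2 = sum((rx - re) ** 2 for rx, re in zip(ranks(x), ranks(abs_resid)))
    r_s = 1 - 6 * d2 / (n * (n ** 2 - 1))
    t = r_s * math.sqrt(n - 2) / math.sqrt(1 - r_s ** 2)
    return r_s, t

random.seed(4)
n = 400
x = [1 + 9 * i / (n - 1) for i in range(n)]
# Error sd proportional to x, so the data are heteroscedastic by construction
y = [2.0 + 0.5 * xi + random.gauss(0.0, 0.4 * xi) for xi in x]

r_s, t = spearman_heteroscedasticity_test(x, y)
print(f"Spearman r_s = {r_s:.3f}, t = {t:.2f} on {n - 2} df")
```

A value of $t$ well beyond the critical value (roughly 2 in absolute value at the 5% level for large samples) suggests that the spread of the residuals changes systematically with X.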

Reference:
A. Koutsoyiannis (1972). “Theory of Econometrics”. 2nd Ed.

## Breusch-Pagan Test for Heteroscedasticity

The Breusch–Pagan test (named after Trevor Breusch and Adrian Pagan) is used to test for heteroscedasticity in a linear regression model.

Assume our regression model is $Y_i = \beta_1 + \beta_2 X_{2i} + \mu_i$, i.e. we have a simple linear regression model, and $E(\mu_i^2)=\sigma_i^2$, where $\sigma_i^2=f(\alpha_1 + \alpha_2 Z_{2i})$.

That is, $\sigma_i^2$ is some function of the non-stochastic variables Z. The function $f(\cdot)$ allows for both linear and non-linear forms of the model. The variable Z may be the independent variable X itself, or it may represent a group of independent variables other than X.

Steps to perform the Breusch-Pagan test:

1. Estimate the model by OLS and obtain the residuals $e_1, e_2, \cdots, e_n$ (the estimates $\hat{\mu}_i$ of the disturbances).
2. Estimate the variance of the residuals, i.e. $\hat{\sigma}^2=\frac{\sum e_i^2}{n}$ (the maximum likelihood estimator, with divisor $n$ rather than $n-2$).
3. Run the regression $\frac{e_i^2}{\hat{\sigma}^2}=\alpha_1+\alpha_2 Z_i + v_i$, where $v_i$ is the error term of this auxiliary regression, and compute the explained sum of squares (ESS) from it.
4. Test the statistical significance of ESS/2 by the $\chi^2$-test with 1 df at the appropriate level of significance (α).
5. Reject the hypothesis of homoscedasticity in favour of heteroscedasticity if $\frac{ESS}{2} > \chi^2_{(1)}$ at the appropriate level of α.

Note that:

• The Breusch-Pagan test is valid only if the μi’s are normally distributed.
• For k independent variables, ESS/2 has a chi-square ($\chi^2$) distribution with k degrees of freedom.
• If the μi’s (error terms) are not normally distributed, the White test is used.
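
The five steps of the procedure can be sketched in plain Python for the case of a single Z variable, taken here to be X itself. This is a minimal illustration under stated assumptions, not a reference implementation: the residual variance uses divisor $n$ (the maximum likelihood estimate), the data are simulated, and the function names, constants, and seed are hypothetical. For 1 df the p-value follows from the identity $P(\chi^2_{(1)} > s) = \mathrm{erfc}(\sqrt{s/2})$.

```python
import math
import random

def ols(x, y):
    """Simple-regression OLS; returns (intercept, slope)."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    slope = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
             / sum((xi - x_bar) ** 2 for xi in x))
    return y_bar - slope * x_bar, slope

def breusch_pagan(x, y, z=None):
    """Return (ESS/2, p-value) for one Z variable; Z defaults to X."""
    z = x if z is None else z
    n = len(x)
    b1, b2 = ols(x, y)                                  # step 1: OLS residuals
    e2 = [(yi - b1 - b2 * xi) ** 2 for xi, yi in zip(x, y)]
    sigma2 = sum(e2) / n                                # step 2: divisor n
    p = [v / sigma2 for v in e2]
    a1, a2 = ols(z, p)                                  # step 3: auxiliary regression
    p_bar = sum(p) / n
    ess = sum((a1 + a2 * zi - p_bar) ** 2 for zi in z)  # explained sum of squares
    stat = ess / 2                                      # step 4: test statistic
    p_value = math.erfc(math.sqrt(stat / 2))            # P(chi-square, 1 df > stat)
    return stat, p_value

random.seed(5)
n = 400
x = [1 + 9 * i / (n - 1) for i in range(n)]
y_homo = [2.0 + 0.5 * xi + random.gauss(0.0, 2.0) for xi in x]
y_het = [2.0 + 0.5 * xi + random.gauss(0.0, 0.4 * xi) for xi in x]

CHI2_1DF_5PCT = 3.841  # 5% critical value of chi-square with 1 df
for label, y in (("homoscedastic", y_homo), ("heteroscedastic", y_het)):
    stat, p_value = breusch_pagan(x, y)
    verdict = "reject" if stat > CHI2_1DF_5PCT else "do not reject"  # step 5
    print(f"{label}: ESS/2 = {stat:.2f}, p = {p_value:.4f} -> {verdict} homoscedasticity")
```

For the data generated with an error standard deviation proportional to X, the statistic should lie far above the 5% critical value of 3.841. For the constant-variance data it will usually, though not always, fall below it, since a 5% test rejects a true null hypothesis about one time in twenty.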

References:

• Breusch, T.S.; Pagan, A.R. (1979). “Simple test for heteroscedasticity and random coefficient variation”. Econometrica (The Econometric Society) 47 (5): 1287–1294.

# Heteroscedasticity

An important assumption of OLS is that the disturbances μi appearing in the population regression function are homoscedastic (the error terms have the same variance), i.e. the variance of each disturbance term μi, conditional on the chosen values of the explanatory variables, is some constant number equal to $\sigma^2$: $E(\mu_{i}^{2})=\sigma^2$, where $i=1,2,\cdots, n$.
Homo means equal and scedasticity means spread.

Consider the general linear regression model
$y_i=\beta_1+\beta_2 x_{2i}+ \beta_3 x_{3i} +\cdots + \beta_k x_{ki} + \varepsilon_i$

If $E(\varepsilon_{i}^{2})=\sigma^2$ for all $i=1,2,\cdots, n$ then the assumption of constant variance of the error term or homoscedasticity is satisfied.

If instead $E(\varepsilon_{i}^{2})=\sigma_i^2$ varies across observations, so that $E(\varepsilon_{i}^{2})\ne\sigma^2$ for some $i$, then the assumption of homoscedasticity is violated and heteroscedasticity is said to be present. In the case of heteroscedasticity, the OLS estimators are unbiased but inefficient.

Examples:

1. The range in family income between the poorest and richest families in a town is the classic example of heteroscedasticity.
2. The range in annual sales between a corner drug store and a general store.

## Reasons of Heteroscedasticity

There are several reasons why the variance of the error term μi may vary, some of which are:

1. Following error-learning models, as people learn, their errors of behaviour become smaller over time. In this case $\sigma_{i}^{2}$ is expected to decrease. For example, consider the number of typing errors made in a given time period on a test in relation to the hours put into typing practice: as practice increases, the errors become fewer and less variable.
2. As incomes grow, people have more discretionary income, and hence $\sigma_{i}^{2}$ is likely to increase with income.
3. As data-collecting techniques improve, $\sigma_{i}^{2}$ is likely to decrease.
4. Heteroscedasticity can also arise as a result of the presence of outliers. The inclusion or exclusion of such observations, especially when the sample size is small, can substantially alter the results of regression analysis.
5. Heteroscedasticity can arise from violating the CLRM (classical linear regression model) assumption that the regression model is correctly specified.
6. Skewness in the distribution of one or more regressors included in the model is another source of heteroscedasticity.
7. Incorrect data transformation or an incorrect functional form (e.g. a linear versus a log-linear model) is also a source of heteroscedasticity.

# Consequences of Heteroscedasticity

1. The OLS estimators, and the regression predictions based on them, remain unbiased and consistent.
2. The OLS estimators are no longer BLUE (Best Linear Unbiased Estimators) because they are no longer efficient, so the regression predictions will be inefficient too.
3. Because the usual estimator of the covariance matrix of the estimated regression coefficients is inconsistent, the tests of hypotheses (t-test, F-test) are no longer valid.

Note: Problems of heteroscedasticity are likely to be more common in cross-sectional than in time-series data.
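
The loss of efficiency in point 2 can be illustrated by comparing OLS with another linear unbiased estimator over repeated samples. The sketch below uses weighted least squares with weights $1/\sigma_i^2$ as the comparison; this estimator is not discussed in the text and is brought in only for illustration, as is the unrealistic assumption that the error variances are known. The error standard deviation $0.05X_i^2$, the true parameters, and the seed are likewise hypothetical.

```python
import random
import statistics

random.seed(6)
n, reps = 50, 2000
beta1, beta2 = 2.0, 0.5                      # hypothetical true parameters
X = [1 + 9 * i / (n - 1) for i in range(n)]  # fixed regressor, 1 to 10
sd = [0.05 * Xi ** 2 for Xi in X]            # error sd, assumed known here
w = [1 / s ** 2 for s in sd]                 # weights = 1 / error variance
ones = [1.0] * n                             # equal weights reproduce OLS

def weighted_slope(X, Y, w):
    """Slope of a weighted least-squares line; equal weights give OLS."""
    sw = sum(w)
    x_bar = sum(wi * xi for wi, xi in zip(w, X)) / sw
    y_bar = sum(wi * yi for wi, yi in zip(w, Y)) / sw
    num = sum(wi * (xi - x_bar) * (yi - y_bar) for wi, xi, yi in zip(w, X, Y))
    den = sum(wi * (xi - x_bar) ** 2 for wi, xi in zip(w, X))
    return num / den

ols_slopes, wls_slopes = [], []
for _ in range(reps):
    Y = [beta1 + beta2 * Xi + random.gauss(0.0, s) for Xi, s in zip(X, sd)]
    ols_slopes.append(weighted_slope(X, Y, ones))
    wls_slopes.append(weighted_slope(X, Y, w))

ols_mean, wls_mean = statistics.mean(ols_slopes), statistics.mean(wls_slopes)
ols_var, wls_var = statistics.variance(ols_slopes), statistics.variance(wls_slopes)
print(f"OLS slope: mean = {ols_mean:.4f}, variance = {ols_var:.5f}")
print(f"WLS slope: mean = {wls_mean:.4f}, variance = {wls_var:.5f}")
```

Both estimators should average close to the true slope, but the weighted estimator should show a clearly smaller sampling variance, which is what it means for OLS to no longer be the best linear unbiased estimator under heteroscedasticity.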

Reference
Greene, W.H. (1993) Econometric Analysis, Prentice–Hall, ISBN 0-13-013297-7.
Verbeek, Marno (2004) A Guide to Modern Econometrics, 2. ed., Chichester: John Wiley & Sons