Category: OLS Assumptions


In a classical linear regression model with multiple regressors (explanatory variables), there should be no exact linear relationship among the explanatory variables. The term collinearity or multicollinearity is used when one or more such linear relationships exist among the variables.

Multicollinearity is therefore a violation of the assumption of “no exact linear relationship between the regressors”.

Ragnar Frisch introduced the term. Originally, it meant the existence of a “perfect” or “exact” linear relationship among some or all regressors of a regression model.

Consider a $k$-variable regression model involving explanatory variables $X_1, X_2, \cdots, X_k$. An exact linear relationship is said to exist if the following condition is satisfied.

\[\lambda_1 X_1 + \lambda_2  X_2 + \cdots + \lambda_k X_k=0,\]

where $\lambda_1, \lambda_2, \cdots, \lambda_k$ are constants that are not all zero simultaneously, and $X_1=1$ for all observations to allow for the intercept term.

Nowadays, the term multicollinearity is used not only for perfect multicollinearity but also for imperfect collinearity (the case where the $X$ variables are intercorrelated, but not perfectly). In that case,

\[\lambda_1X_1 + \lambda_2X_2 + \cdots + \lambda_kX_k + \upsilon_i = 0,\]

where $\upsilon_i$ is a stochastic error term.

In the case of a perfect linear relationship among the explanatory variables (the correlation coefficient is one in absolute value in this case), the parameters become indeterminate (it is impossible to obtain a value for each parameter separately) and the method of least squares breaks down. However, if the regressors are not intercorrelated at all, the variables are called orthogonal and there is no problem in estimating the coefficients.
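The breakdown under an exact linear relationship can be illustrated with a minimal sketch. It assumes the usual OLS setup in which the estimates require inverting the cross-product matrix $X'X$ (this matrix is not written out in the text above), and it uses made-up data: `x3_exact` is exactly twice `x2`, while `x3_near` departs from that relationship in one observation. The helper names are hypothetical.

```python
def gram_matrix(columns):
    """Return X'X for a design matrix supplied as a list of columns."""
    return [[sum(a * b for a, b in zip(ci, cj)) for cj in columns]
            for ci in columns]

def det3(m):
    """Determinant of a 3x3 matrix by cofactor expansion."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)

ones = [1, 1, 1, 1, 1]            # X_1 = 1 for the intercept
x2 = [1, 2, 3, 4, 5]
x3_exact = [2 * v for v in x2]    # exact relationship: 2*X_2 - X_3 = 0
x3_near = [2, 4, 6, 8, 11]        # nearly, but not exactly, 2*X_2

det_exact = det3(gram_matrix([ones, x2, x3_exact]))
det_near = det3(gram_matrix([ones, x2, x3_near]))

print(det_exact)  # zero: X'X is singular, so it cannot be inverted
print(det_near)   # non-zero: separate estimates can still be computed
```

With the exact relationship the determinant is zero, so no unique solution for the individual coefficients exists. With imperfect collinearity the matrix is still invertible, although the estimates may be imprecise.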

Note that

  • Multicollinearity is not a condition that either exists or does not exist; it is a matter of degree, and some degree of it is inherent in most relationships.
  • Multicollinearity refers only to linear relationships among the $X$ variables. It does not rule out non-linear relationships among them.
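The second note can be illustrated with a small sketch on made-up data: a regressor and its square are perfectly related in a non-linear way, yet their linear (Pearson) correlation can be zero. The function `pearson_r` is a hypothetical helper written for this illustration.

```python
import math

def pearson_r(a, b):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / math.sqrt(var_a * var_b)

x = [-2, -1, 0, 1, 2]
x_squared = [v ** 2 for v in x]   # exact non-linear function of x

r = pearson_r(x, x_squared)
print(r)  # no linear association, despite the exact quadratic relationship
```

Because the values of `x` are symmetric around zero, the linear correlation vanishes, so including both `x` and `x_squared` in a model would not create an exact linear relationship.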

See use of mctest R package for diagnosing collinearity

Checking Normality of the Error Term


In multiple linear regression models, the sum of squared residuals (SSR) divided by $n-p$ (the degrees of freedom, where $n$ is the total number of observations and $p$ is the number of parameters in the model) is a good estimate of the error variance. In the multiple linear regression model, the residual vector is

\[e = (I-H)y,\]

where $H$ is the hat matrix for the regression model.

Since $(I-H)X=0$, the residual vector can also be written as $e=(I-H)\varepsilon$, so each component is $e_i=\varepsilon_i - \sum\limits_{j=1}^n h_{ij} \varepsilon_j$. Therefore, in multiple linear regression models, the normality of the residuals is not simply the normality of the error term.

Note that:

\[Cov(\mathbf{e})=(I-H)\sigma^2 (I-H)' = (I-H)\sigma^2\]

We can write $Var(e_i)=(1-h_{ii})\sigma^2$.
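The equality $(I-H)\sigma^2(I-H)' = (I-H)\sigma^2$ rests on two standard properties of the hat matrix. As a short sketch, assuming the usual OLS definition $H = X(X'X)^{-1}X'$ (which is not written out in this post), $H$ is symmetric ($H'=H$) and idempotent ($H^2=H$), so

\[(I-H)(I-H)' = (I-H)(I-H) = I - 2H + H^2 = I - H.\]

The $i$th diagonal element of $I-H$ is $1-h_{ii}$, which gives the variance of each residual.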

If the sample size ($n$) is much larger than the number of parameters ($p$) in the model (i.e. $n \gg p$), then $h_{ii}$ will be small compared to 1, and $Var(e_i) \approx \sigma^2$.

In multiple regression models, a residual behaves like an error if the sample size is large. However, this is not true for a small sample size.

It is therefore unreliable to check the normality assumption for the error term using residuals from a multiple linear regression model when the sample size is small.
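The effect of sample size on $h_{ii}$ can be sketched for the simplest case. The example below assumes a simple regression with an intercept and one regressor ($p=2$), for which the leverage has the closed form $h_{ii} = 1/n + (x_i-\bar{x})^2/\sum_j (x_j-\bar{x})^2$. The equally spaced $x$ values are made up for illustration, and `leverages` is a hypothetical helper.

```python
def leverages(x):
    """Diagonal elements h_ii of the hat matrix for y = b0 + b1*x."""
    n = len(x)
    x_bar = sum(x) / n
    sxx = sum((v - x_bar) ** 2 for v in x)
    return [1 / n + (v - x_bar) ** 2 / sxx for v in x]

h_small = leverages(list(range(1, 6)))     # n = 5
h_large = leverages(list(range(1, 501)))   # n = 500

# The leverages always sum to p (here 2), so they must shrink as n grows.
print(sum(h_small), max(h_small))
print(sum(h_large), max(h_large))
```

In the small sample the largest $h_{ii}$ is far from negligible, so $Var(e_i)=(1-h_{ii})\sigma^2$ differs noticeably from $\sigma^2$. In the large sample every $h_{ii}$ is close to zero and the residuals behave much more like the errors.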

Learn more about Hat matrix: Role of Hat matrix in Diagnostics of Regression Analysis.

Assumptions about Linear Regression Models or Error Term

The linear regression model (LRM) is based on certain statistical assumptions. Some relate to the distribution of the random variable (error term) $\mu_i$, some concern the relationship between the error term $\mu_i$ and the explanatory variables (independent variables, $X$'s), and some relate to the independent variables themselves. We can divide the assumptions about linear regression into two categories:

  1. Stochastic Assumptions
  2. Non-Stochastic Assumptions

These assumptions about linear regression models (or the ordinary least squares method, OLS) are critical to the interpretation of the regression coefficients.

  • The error term ($\mu_i$) is a random real number, i.e. $\mu_i$ may assume any positive, negative, or zero value by chance. Each value has a certain probability; therefore, the error term is a random variable.
  • The mean value of $\mu$ is zero, i.e. $E(\mu_i)=0$. More precisely, the mean value of $\mu_i$ conditional upon the given $X_i$ is zero. It means that for each value of the variable $X_i$, $\mu$ may take various values, some greater than zero and some smaller than zero. Considering all possible values of $\mu$ for any particular value of $X$, the disturbance term $\mu_i$ has a mean of zero.
  • The variance of $\mu_i$ is constant, i.e. for a given value of X, the variance of $\mu_i$ is the same for all observations: $E(\mu_i^2)=\sigma^2$. At all values of X, the disturbance term $\mu_i$ shows the same dispersion about its mean.
  • The variable $\mu_i$ has a normal distribution, i.e. $\mu_i\sim N(0,\sigma_{\mu}^2)$. The values of $\mu$ (for each $X_i$) have a bell-shaped, symmetrical distribution.
  • The random terms of different observations ($\mu_i,\mu_j$) are independent, i.e. $E(\mu_i \mu_j)=0$ for $i \ne j$; there is no autocorrelation between the disturbances. It means that the random term in one period does not depend on its values in any other period.
  • $\mu_i$ and $X_i$ have zero covariance, i.e. $\mu$ is independent of the explanatory variable: $E(\mu_i X_i)=0$, i.e. $Cov(\mu_i, X_i)=0$. The disturbance term $\mu$ and the explanatory variable X are uncorrelated. The $\mu$'s and $X$'s do not tend to vary together, as their covariance is zero. This assumption is automatically fulfilled if the X variable is non-random (non-stochastic) and the mean of the random term is zero.
  • All the explanatory variables are measured without error. It means that the regressors are assumed to be error-free, while y (the dependent variable) may or may not include errors of measurement.
  • The number of observations $n$ must be greater than the number of parameters to be estimated, or alternatively, the number of observations must be greater than the number of explanatory (independent) variables.
  • There should be variability in the X values. That is, the X values in a given sample must not all be the same. Statistically, $Var(X)$ must be a finite positive number.
  • The regression model must be correctly specified, meaning that there is no specification bias or error in the model used in the empirical analysis.
  • There is no perfect or near-perfect multicollinearity (collinearity) among two or more explanatory (independent) variables.
  • The values taken by the regressors X are considered to be fixed in repeated sampling, i.e. X is assumed to be non-stochastic. Regression analysis is conditional on the given values of the regressor(s) X.
  • The linear regression model is linear in the parameters, e.g. $y_i=\beta_1+\beta_2x_i +\mu_i$.

Homoscedasticity: Assumption of constant variance of a random variable

The assumption about the random variable $\mu$ (error term) is that its probability distribution remains the same for all observations of X and, in particular, that the variance of each $\mu$ is the same for all values of the explanatory variables, i.e. the variance of the errors is the same across all levels of the independent variables. Symbolically, it can be represented as

$Var(\mu) = E\{\mu_i - E(\mu)\}^2 = E(\mu_i^2) = \sigma_\mu^2 = \mbox{Constant}$

This assumption is known as the assumption of homoscedasticity, or the assumption of constant variance of the error terms ($\mu$'s). It means that the variation of each $\mu_i$ around its zero mean does not depend on the values of X (the independent variable). The assumption may fail in practice, because the error term expresses the influence on the dependent variable of

  • Errors in measurement
    The errors of measurement tend to be cumulative over time. It is also difficult to collect the data and check its consistency and reliability. As a result, the variance of $\mu_i$ tends to increase with increasing values of X.
  • Omitted variables
    Omitted variables from the function (regression model) tend to change in the same direction as X, causing an increase in the variance of the observation from the regression line.

Under homoscedasticity, the variance of each $\mu_i$ remains the same irrespective of small or large values of the explanatory variable, i.e. $\sigma_\mu^2$ is not a function of $X_i$: $\sigma_{\mu_i}^2 \ne f(X_i)$.
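The difference between constant and non-constant error variance can be seen in a small simulation. This is only an informal sketch on simulated data, not one of the formal tests: the error standard deviations (2.0 in the homoscedastic case, $0.5 X_i$ in the heteroscedastic case), the range of X, and the helper `split_variances` are all assumptions chosen for illustration.

```python
import random

def sample_variance(values):
    """Unbiased sample variance of a list of numbers."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)

def split_variances(x, errors):
    """Variance of the errors for the lower and upper halves of x."""
    pairs = sorted(zip(x, errors))
    half = len(pairs) // 2
    low = sample_variance([e for _, e in pairs[:half]])
    high = sample_variance([e for _, e in pairs[half:]])
    return low, high

rng = random.Random(0)                            # fixed seed for repeatability
x = [rng.uniform(1, 10) for _ in range(2000)]
homo = [rng.gauss(0, 2.0) for _ in x]             # variance does not depend on x
hetero = [rng.gauss(0, 0.5 * xi) for xi in x]     # variance grows with x

homo_low, homo_high = split_variances(x, homo)
het_low, het_high = split_variances(x, hetero)

print(homo_high / homo_low)   # close to 1 under homoscedasticity
print(het_high / het_low)     # well above 1 when variance depends on x
```

Comparing the spread of the errors for small and large values of X gives a quick visual or numerical check of whether $\sigma_\mu^2$ appears to be a function of $X_i$.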


Consequences if Homoscedasticity Is Not Met

If the assumption of homoscedastic disturbances (constant variance) is not fulfilled, the consequences are as follows:

  1. We cannot apply the usual formulas for the variances of the coefficients to conduct tests of significance and construct confidence intervals. The formulas $Var(\hat{\beta}_0)=\sigma_\mu^2 \{\frac{\sum X^2}{n \sum x^2}\}$ and $Var(\hat{\beta}_1) = \sigma_\mu^2 \{\frac{1}{\sum x^2}\}$, where $x = X - \bar{X}$, are inapplicable, and so are the tests based on them.
  2. If $\mu$ (the error term) is heteroscedastic, the OLS (Ordinary Least Squares) estimates do not have the minimum variance property in the class of unbiased estimators, i.e. they are inefficient in small samples. Furthermore, they are inefficient in large samples (that is, asymptotically inefficient).
  3. The coefficient estimates are still statistically unbiased even if the $\mu$'s are heteroscedastic. The $\hat{\beta}$'s have no statistical bias, i.e. $E(\hat{\beta}_i)=\beta_i$ (the expected values of the coefficient estimates equal the true parameter values).
  4. Prediction is inefficient, because the variance of the prediction includes the variance of $\mu$ and of the parameter estimates, which are not minimal under heteroscedasticity. That is, the prediction of Y for a given value of X, based on the estimates $\hat{\beta}$'s from the original data, has a high variance.
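The third consequence (unbiasedness despite heteroscedasticity) can be checked with a short Monte Carlo sketch. Everything specific here is an assumption made for illustration: the true parameters `beta0` and `beta1`, the fixed regressor values 1 to 20, an error standard deviation of $0.5 X_i$, and the helper `ols_slope` for a simple regression with an intercept.

```python
import random

def ols_slope(x, y):
    """OLS slope estimate for the simple regression of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return sxy / sxx

rng = random.Random(42)          # fixed seed for repeatability
beta0, beta1 = 1.0, 2.0          # hypothetical true parameters
x = list(range(1, 21))           # regressor values fixed in repeated sampling

slopes = []
for _ in range(2000):
    # Heteroscedastic errors: standard deviation proportional to x
    y = [beta0 + beta1 * xi + rng.gauss(0, 0.5 * xi) for xi in x]
    slopes.append(ols_slope(x, y))

mean_slope = sum(slopes) / len(slopes)
print(mean_slope)  # close to the true slope: the estimator stays unbiased
```

The average of the slope estimates stays close to the true value, which is consistent with unbiasedness. The simulation says nothing about efficiency; the loss of the minimum variance property described in the second consequence still applies.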

Tests for Homoscedasticity

Some tests commonly used for testing the assumption of homoscedasticity are:

A. Koutsoyiannis (1972). “Theory of Econometrics”. 2nd Ed.