OLS Assumptions - Statistics for Data Science & Analytics

Homoscedasticity: Constant Variance of a Random Variable (2020)

Sep 11, 2024Aug 21, 2020 by Muhammad Imdad Ullah

The term “Homoscedasticity” is the assumption about the random variable $u$ (error term) that its probability distribution remains the same for all observations of $X$ and in particular that the variance of each $u$ is the same for all values of the explanatory variables, i.e the variance of errors is the same across all levels of the independent variables (Homoscedasticity: assumption about the constant variance of a random variable). Symbolically it can be represented as

$$Var(u) = E\{u_i – E(u)\}^2 = E(u_i)^2 = \sigma_u^2 = \mbox(Constant)$$

This assumption is known as the assumption of homoscedasticity or the assumption of constant variance of the error term $u$’s. It means that the variation of each $u_i$ around its zero means does not depend on the values of $X$ (independent) because the error term expresses the influence on the dependent variables due to

Errors in measurement
The errors of measurement tend to be cumulative over time. It is also difficult to collect the data and check its consistency and reliability. So the variance of $u_i$ increases with increasing the values of $X$.
Omitted variables
Omitted variables from the function (regression model) tend to change in the same direction as $X$, causing an increase in the variance of the observation from the regression line.

The variance of each $u_i$ remains the same irrespective of small or large values of the explanatory variable i.e. $\sigma_u^2$ is not a function of $X_i$ i.e $\sigma_{u_i^2} \ne f(X_i)$.

Consequences if Homoscedasticity is not meet

If the assumption of homoscedastic disturbance (Constant Variance) is not fulfilled, the following are the Heteroscedasticity consequences:

We cannot apply the formula of the variance of the coefficient to conduct tests of significance and construct confidence intervals. The tests are inapplicable $Var(\hat{\beta}_0)=\sigma_u^2 \{\frac{\sum X^2}{n \sum X^2}\}$ and $Var(\hat{\beta}_1) = \sigma_u^2 \{\frac{1}{\sum X^2}\}$
If $u$ (error term) is heteroscedastic the OLS (Ordinary Least Square) estimates do not have minimum variance property in the class of Unbiased Estimators i.e. they are inefficient in small samples. Furthermore, they are inefficient in large samples (that is, asymptotically inefficient).
The coefficient estimates would still be statistically unbiased even if the $u$’s are heteroscedastic. The $\hat{\beta}$’s will have no statistical bias i.e. $E(\beta_i)=\beta_i$ (coefficient’s expected values will be equal to the true parameter value).
The prediction would be inefficient because the variance of prediction includes the variance of $u$ and of the parameter estimates which are not minimal due to the incidence of heteroscedasticity i.e. The prediction of $Y$ for a given value of $X$ based on the estimates $\hat{\beta}$’s from the original data, would have a high variance.

Tests for Homoscedasticity

Some tests commonly used for testing the assumption of homoscedasticity are:

Spearman Rank-Correlation test
Goldfeld and Quandt test
Park Glejser test
Breusch–Pagan test
Bartlett’s test of Homoscedasticity

Reference:
A. Koutsoyiannis (1972). “Theory of Econometrics”. 2nd Ed.

Conducting Statistical Models in R Language

Checking Normality of Error Term (2019)

May 24, 2024Aug 11, 2019 by Muhammad Imdad Ullah

Normality of Error Term

In multiple linear regression models, the sum of squared residuals (SSR) is divided by $n-p$ (degrees of freedom, where $n$ is the total number of observations, and $p$ is the number of the parameter in the model) is a good estimate of the error variance. In the multiple linear regression model, the residual vector is

\begin{align*}
e &=(I-H)y\\
&=(I-H)(X\beta+e)\\
&=(I-H)\varepsilon
\end{align*}

where $H$ is the hat matrix for the regression model.

Each component $e_i=\varepsilon – \sum\limits_{i=1}^n h_{ij} \varepsilon_i$. Therefore, In multiple linear regression models, the normality of the residual is not simply the normality of the error term.

Note that:

\[Cov(\mathbf{e})=(I-H)\sigma^2 (I-H)’ = (I-H)\sigma^2\]

We can write $Var(e_i)=(1-h_{ii})\sigma^2$.

If the sample size ($n$) is much larger than the number of the parameters ($p$) in the model (i.e. $n > > p$), in other words, if sample size ($n$) is large enough, $h_{ii}$ will be small as compared to 1, and $Var(e_i) \approx \sigma^2$.

In multiple regression models, a residual behaves like an error if the sample size is large. However, this is not true for a small sample size.

It is unreliable to check the normality of error term assumption using residuals from multiple linear regression models when the sample size is small.

Learn more about Hat matrix: Role of Hat matrix in Diagnostics of Regression Analysis.

Learn R Programming Language

Regression Model Assumptions

Jun 23, 2024Nov 3, 2013 by Muhammad Imdad Ullah

Linear Regression Model Assumptions

The linear regression model (LRM) is based on certain statistical assumptions, some of which are related to the distribution of a random variable (error term) $u_i$, some are about the relationship between error term $u_i$ and the explanatory variables (Independent variables, $X$‘s) and some are related to the independent variable themselves. The linear regression model assumptions can be classified into two categories

Stochastic Assumption
None Stochastic Assumptions

These linear regression model assumptions (or assumptions about the ordinary least square method: OLS) are extremely critical to interpreting the regression coefficients.

The error term ($u_i$) is a random real number i.e. $u_i$ may assume any positive, negative, or zero value upon chance. Each value has a certain probability, therefore, the error term is a random variable.
The mean value of $u$ is zero, i.e. $E(u_i)=0$ i.e. the mean value of $u_i$ is conditional upon the given $X_i$ is zero. It means that for each value of variable $X_i$, $u$ may take various values, some of them greater than zero and some smaller than zero. Considering all possible values of $u$ for any particular value of $X$, we have zero mean value of disturbance term $u_i$.
The variance of $u_i$ is constant i.e. for the given value of $X$, the variance of $u_i$ is the same for all observations. $E(u_i^2)=\sigma^2$. The variance of disturbance term ($u_i$) about its mean is at all values of $X$ will show the same dispersion about their mean.
The variable $u_i$ has a normal distribution i.e. $u_i\sim N(0,\sigma_{u}^2$. The value of $u$ (for each $X_i$) has a bell-shaped symmetrical distribution.
The random terms of different observations ($u_i,u_j$) are independent i..e $E(u_i,u_j)=0$, i.e. there is no autocorrelation between the disturbances. It means that the random term assumed in one period does not depend on the values in any other period.
$u_i$ and $X_i$ have zero covariance between them i.e. $u$ is independent of the explanatory variable or $E(u_i X_i)=0$ i.e. $Cov(u_i, X_i)=0$. The disturbance term $u$ and explanatory variable $X$ are uncorrelated. The $u$’s and $X$’s do not tend to vary together as their covariance is zero. This assumption is automatically fulfilled if the $X$ variable is nonrandom or non-stochastic or if the mean of the random term is zero.
All the explanatory variables are measured without error. It means that we will assume that the regressors are error-free while $y$ (dependent variable) may or may not include measurement errors.
The number of observations $n$ must be greater than the number of parameters to be estimated or the number of observations must be greater than the number of explanatory (independent) variables.
The should be variability in the $X$ values. That is $X$ values in a given sample must not be the same. Statistically, $Var(X)$ must be a finite positive number.
The regression model must be correctly specified, meaning there is no specification bias or error in the model used in empirical analysis.
No perfect or near-perfect multicollinearity or collinearity exists among the two or more explanatory (independent) variables.
Values taken by the regressors $X$ are considered to be fixed in repeating sampling i.e. $X$ is assumed to be non-stochastic. Regression analysis is conditional on the given values of the regressor(s) $X$.
The linear regression model is linear in the parameters, e.g. $y_i=\beta_1+\beta_2x_i +u_i$

Visit MCQs Site: https://gmstat.com

Consequences if Homoscedasticity is not meet

Tests for Homoscedasticity

Share this:

Normality of Error Term

Share this:

Linear Regression Model Assumptions

Share this: