# Category: Regression Diagnostics

Regression Analysis, Hat Matrix, Residual Analysis, Regression Diagnostics

## Multicollinearity in Linear Regression Models


The objective of multiple regression analysis is to approximate the relationship of individual parameters of a dependency, but not of interdependency. It is assumed that the dependent variable $y$ and the regressors $X$ are linearly related to each other (Graybill, 1980; Johnston, 1963; Malinvaud, 1968). Therefore, the inferences drawn from any regression model are used to

(i) identify the relative influence of the regressors,
(ii) predict and/or estimate, and
(iii) select an appropriate set of regressors for the model.

Among these, one purpose of the regression model is to ascertain to what extent the dependent variable can be predicted by the regressors in the model. To draw suitable inferences, however, the regressors should be orthogonal, i.e., there should be no linear dependencies among them. In most applications of regression analysis the regressors are not orthogonal, which leads to misleading and erroneous inferences, especially when the regressors are perfectly or nearly perfectly collinear with each other. This condition of non-orthogonality is also referred to as the problem of multicollinearity or collinear data (see, for example, Gunst and Mason, 1977; Mason et al., 1975; Ragnar, 1934). Multicollinearity is also synonymous with ill-conditioning of the $X'X$ matrix.

The presence of interdependence, or the lack of independence, is signified by high-order inter-correlation ($R = X'X$) within a set of regressors (Dorsett et al., 1983; Farrar and Glauber, 1967; Gunst and Mason, 1977; Mason et al., 1975). Perfect multicollinearity is a pathological extreme; it can easily be detected and resolved by dropping one of the regressors causing it (Belsley et al., 1980). Under perfect multicollinearity, the regression coefficients remain indeterminate and their standard errors are infinite; perfectly collinear regressors likewise destroy the uniqueness of the least squares estimators (Belsley et al., 1980; Belsley, 1991). When many explanatory variables (regressors/predictors) are highly collinear, it becomes very difficult to infer the separate influence of each collinear regressor on the response variable ($y$): estimating a coefficient is difficult because the coefficient measures the effect of the corresponding regressor while holding all other regressors constant. Near (not perfect) multicollinearity is extremely hard to detect (Chatterjee and Hadi, 2006), as it is not a specification or modeling error but a condition of deficient data (Hadi and Chatterjee, 1988). On the other hand, the existence of multicollinearity has no impact on the overall regression model and its associated statistics, such as $R^2$, the $F$-ratio, and the $p$-value. Nor does multicollinearity lessen the predictive power or reliability of the regression model as a whole; it only affects inference about the individual regressors (Koutsoyiannis, 1977). Note that multicollinearity refers only to the linear relationships among the regressors; it does not rule out nonlinear relationships among them.

To draw suitable inferences from the model, the existence of (multi)collinearity should always be tested as an initial step when examining a data set for multiple regression analysis. Note that high collinearity is rare, but some degree of collinearity always exists.

A distinction between collinearity and multicollinearity should be made. Strictly speaking, multicollinearity refers to the existence of more than one exact linear relationship among regressors, while collinearity refers to the existence of a single linear relationship. Nowadays, however, the term multicollinearity is used for both cases.

There are many methods for the detection/testing of (multi)collinearity among regressors. However, remedying collinearity by removing variables can destroy the usefulness of the model, since relevant regressor(s) may be dropped. Note that if there are only two predictors, the pairwise correlation is sufficient to detect a collinearity problem. To check the severity of the problem, VIF/TOL, eigenvalues, or other diagnostic measures can be used.
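As a concrete illustration of these diagnostics, the sketch below (synthetic data; all variable names and values are assumptions, not from the source) compares pairwise correlations, VIF, and the eigenvalues of the correlation matrix for a design with one nearly collinear regressor:

```python
# Illustrative sketch: pairwise correlation vs. VIF vs. eigenvalues.
# The data are synthetic; names and values are assumptions for this example.
import numpy as np

rng = np.random.default_rng(42)
n = 100
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = 2 * x1 + 3 * x2 + rng.normal(scale=0.1, size=n)  # near-exact linear combination
X = np.column_stack([x1, x2, x3])

# Pairwise correlations can miss dependencies involving three or more regressors.
print(np.corrcoef(X, rowvar=False).round(2))

def vif(X, j):
    """VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing X_j on the other columns."""
    y = X[:, j]
    A = np.column_stack([np.ones(len(y)), np.delete(X, j, axis=1)])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return y.var() / resid.var()  # equals 1 / (1 - R_j^2)

print([round(vif(X, j)) for j in range(3)])  # large values flag severe collinearity

# A near-zero eigenvalue of the correlation matrix signals a near dependency.
print(np.linalg.eigvalsh(np.corrcoef(X, rowvar=False)).round(4))
```

A commonly quoted rule of thumb is that VIF above 10 indicates a serious collinearity problem; TOL is simply 1/VIF.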

For further detail about “Multicollinearity in Linear Regression Models” see:

• Belsley, D., Kuh, E., and Welsch, R. (1980). Regression Diagnostics: Identifying Influential Data and Sources of Collinearity. John Wiley & Sons, New York. Chap. 3.
• Belsley, D. A. (1991). A Guide to Using the Collinearity Diagnostics. Computer Science in Economics and Management, 4(1), 33–50.
• Chatterjee, S. and Hadi, A. S. (2006). Regression Analysis by Example. Wiley and Sons, 4th edition.
• Dorsett, D., Gunst, R. F., and Gartland, E. C. J. (1983). Multicollinear Effects of Weighted Least Squares Regression. Statistics & Probability Letters, 1(4), 207–211.
• Graybill, F. (1980). An Introduction to Linear Statistical Models. McGraw Hill.
• Gunst, R. and Mason, R. (1977). Advantages of Examining Multicollinearities in Regression Analysis. Biometrics, 33, 249–260.
• Hadi, A. and Chatterjee, S. (1988). Sensitivity Analysis in Linear Regression. John Wiley & Sons.
• Imdadullah, M., Aslam, M., and Altaf, S. (2016). mctest: An R Package for Detection of Collinearity Among Regressors.
• Johnston, J. (1963). Econometric Methods. McGraw Hill, New York.
• Koutsoyiannis, A. (1977). Theory of Econometrics. Macmillan Education Limited.
• Malinvaud, E. (1968). Statistical Methods of Econometrics. North Holland, Amsterdam. pp. 187–192.
• Mason, R., Gunst, R., and Webster, J. (1975). Regression Analysis and Problems of Multicollinearity. Communications in Statistics, 4(3), 277–292.
• Ragnar, F. (1934). Statistical Confluence Analysis by Means of Complete Regression Systems. Universitetets Økonomiske Institutt. Publ. No. 5.

## Multicollinearity

For a classical linear regression model with multiple regressors (explanatory variables), there should be no exact linear relationship between the explanatory variables. The term collinearity or multicollinearity is used if one or more linear relationships exist among the variables.

The term multicollinearity denotes a violation of the assumption of “no exact linear relationship between the regressors”.

Ragnar Frisch introduced this term; originally, it meant the existence of a “perfect” or “exact” linear relationship among some or all regressors of a regression model.

Consider a $k$-variable regression model involving the explanatory variables $X_1, X_2, \cdots, X_k$. An exact linear relationship is said to exist if the following condition is satisfied:

$\lambda_1 X_1 + \lambda_2 X_2 + \cdots + \lambda_k X_k=0,$

where $\lambda_1, \lambda_2, \cdots, \lambda_k$ are constants that are not all zero simultaneously, and $X_1=1$ for all observations to allow for the intercept term.

Nowadays, the term multicollinearity is used not only for the case of perfect multicollinearity but also for the case of not-perfect collinearity (where the $X$ variables are intercorrelated, but not perfectly). Therefore,

$\lambda_1 X_1 + \lambda_2 X_2 + \cdots + \lambda_k X_k + \upsilon_i = 0,$

where $\upsilon_i$ is a stochastic error term.

In the case of a perfect linear relationship (the correlation coefficient will be one in this case) among the explanatory variables, the parameters become indeterminate (it is impossible to obtain values for each parameter separately) and the method of least squares breaks down. On the other hand, if the regressors are not intercorrelated at all, they are called orthogonal and there is no problem concerning the estimation of the coefficients.
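The breakdown under a perfect linear relationship can be sketched numerically (assumed toy data, not from the source): with an exactly collinear column, $X'X$ is rank-deficient, so $(X'X)^{-1}$ does not exist and the individual coefficients are indeterminate.

```python
# Sketch: an exactly collinear column makes X'X singular (assumed toy data).
import numpy as np

n = 20
x1 = np.linspace(0.0, 1.0, n)
x2 = 3.0 * x1                      # exact linear dependence between regressors
X = np.column_stack([np.ones(n), x1, x2])

XtX = X.T @ X
print(np.linalg.matrix_rank(XtX))  # 2 rather than 3: rank deficient
# Inverting XtX here is numerically meaningless, mirroring the
# indeterminacy of the individual least squares coefficients.
```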

Note that

• Multicollinearity is not a condition that either exists or does not exist, but rather a phenomenon inherent in most relationships.
• Multicollinearity refers only to linear relationships among the $X$ variables. It does not rule out nonlinear relationships among them.

See the use of the mctest R package for diagnosing collinearity.

## Checking Normality of the Error Term


In multiple linear regression models, the sum of squared residuals (SSR) divided by $n-p$ (the degrees of freedom, where $n$ is the total number of observations and $p$ is the number of parameters in the model) is a good estimate of the error variance. In the multiple linear regression model, the residual vector is

\begin{align*}
e &=(I-H)y\\
&=(I-H)(X\beta+\varepsilon)\\
&=(I-H)\varepsilon,
\end{align*}

where $H$ is the hat matrix for the regression model.

Each component is $e_i=\varepsilon_i - \sum\limits_{j=1}^n h_{ij} \varepsilon_j$. Therefore, in multiple linear regression models, the normality of the residuals is not simply the normality of the error term.

Note that:

$Cov(\mathbf{e})=(I-H)\sigma^2 (I-H)' = (I-H)\sigma^2$

We can write $Var(e_i)=(1-h_{ii})\sigma^2$.

If the sample size ($n$) is much larger than the number of parameters ($p$) in the model (i.e., $n \gg p$), in other words, if the sample size is large enough, then $h_{ii}$ will be small compared to 1, and $Var(e_i) \approx \sigma^2$.

In multiple regression models, a residual behaves like an error if the sample size is large. However, this is not true for a small sample size.

It is therefore unreliable to check the normality-of-errors assumption using residuals from multiple linear regression models when the sample size is small.
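The effect of sample size on the leverages can be sketched numerically (synthetic design; all names are assumptions): as $n$ grows with $p$ fixed, the $h_{ii}$ shrink toward zero, so $Var(e_i)=(1-h_{ii})\sigma^2$ approaches $\sigma^2$.

```python
# Sketch: leverages h_ii shrink as n >> p, so residual variances approach sigma^2.
# Synthetic random design; the function name is an assumption for illustration.
import numpy as np

rng = np.random.default_rng(0)

def max_leverage(n, p):
    """Largest diagonal element of the hat matrix for a random design with intercept."""
    X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
    H = X @ np.linalg.inv(X.T @ X) @ X.T
    return np.diag(H).max()

print(max_leverage(n=20, p=3))    # noticeable in a small sample (max h_ii >= p/n = 0.15)
print(max_leverage(n=2000, p=3))  # close to zero, so Var(e_i) is close to sigma^2
```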

## Role of Hat Matrix in Regression Analysis

The hat matrix is an $n\times n$ symmetric and idempotent matrix with many special properties. It plays an important role in regression diagnostics by transforming the vector of observed responses $Y$ into the vector of fitted responses $\hat{Y}$.

For the model $Y=X\beta+\varepsilon$, the least squares solution is $b=(X'X)^{-1}X'Y$, provided that $X'X$ is non-singular. The fitted values are $\hat{Y}=Xb=X(X'X)^{-1} X'Y=HY$.

Like the fitted values ($\hat{Y}$), the residuals can be expressed as linear combinations of the response variable $Y_i$:

\begin{align*}
e&=Y-\hat{Y}\\
&=Y-HY\\
&=(I-H)Y
\end{align*}

• The hat matrix involves only the observations on the predictor variables $X$, as $H=X(X'X)^{-1}X'$. It plays an important role in diagnostics for regression analysis.
• The hat matrix plays an important role in determining the magnitude of a studentized deleted residual, and therefore in identifying outlying $Y$ observations.
• The hat matrix is also helpful in directly identifying outlying $X$ observations.
• In particular, the diagonal elements of the hat matrix are indicators, in a multi-variable setting, of whether or not a case is outlying with respect to the $X$ values.
• The elements of the hat matrix always have values between 0 and 1, and their sum is $p$, i.e., $0 \le h_{ii}\le 1$ and $\sum_{i=1}^{n}h_{ii} =p$, where $p$ is the number of regression parameters including the intercept term.
• $h_{ii}$ is a measure of the distance between the $X$ values for the $i$th case and the means of the $X$ values for all $n$ cases.
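The bounds on the diagonal elements and their sum can be checked numerically; a minimal sketch with an assumed random design:

```python
# Sketch: verify 0 <= h_ii <= 1 and sum(h_ii) = trace(H) = p on assumed toy data.
import numpy as np

rng = np.random.default_rng(1)
n, p = 30, 4  # p counts the intercept plus three slopes
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
H = X @ np.linalg.inv(X.T @ X) @ X.T

h = np.diag(H)
print(h.min() >= 0, h.max() <= 1)  # True True
print(np.isclose(h.sum(), p))      # True: the leverages sum to p
```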

## Mathematical Properties of Hat Matrix

• $HX=X$
• $(I-H)X=0$
• $HH=H^2=H=H^p$ (idempotent)
• $H(I-H)=0$
• $Cov(e,\hat{Y})=Cov\left\{(I-H)Y, HY\right\}=\sigma^{2} (I-H)H=0$
• $I-H$ is also symmetric and idempotent.
• With an intercept term, $H\mathbf{1}=\mathbf{1}$, i.e., every row of $H$ adds up to 1; also $\mathbf{1}'=\mathbf{1}'H'=\mathbf{1}'H$ and $\mathbf{1}'H\mathbf{1}=n$.
• The elements of $H$ are denoted by $h_{ij}$, i.e.,
$H=\begin{pmatrix}h_{11} & h_{12} & \cdots & h_{1n} \\ h_{21} & h_{22} & \cdots & h_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ h_{n1} & h_{n2} & \cdots & h_{nn}\end{pmatrix}$
A large value of $h_{ii}$ indicates that the $i$th case is distant from the center of all $n$ cases. The diagonal element $h_{ii}$ in this context is called the leverage of the $i$th case. $h_{ii}$ is a function of the $X$ values only, so $h_{ii}$ measures the role of the $X$ values in determining how important $Y_i$ is in affecting the fitted value $\hat{Y}_{i}$.
The larger $h_{ii}$ is, the smaller the variance of the residual $e_i$; for $h_{ii}=1$, $\sigma^2(e_i)=0$.
• Variance and covariance of $e$:
\begin{align*}
e-E(e)&=(I-H)(Y-X\beta )=(I-H)\varepsilon \\
E(\varepsilon \varepsilon ')&=V(\varepsilon )=I\sigma ^{2} \,\,\text{and} \,\, E(\varepsilon )=0\\
(I-H)'&=(I-H)\\
V(e) & = E\left[e-E(e)\right]\left[e-E(e)\right]' \\
& = (I-H)E(\varepsilon \varepsilon ')(I-H)' \\
& = (I-H)I\sigma ^{2} (I-H)' \\
& =(I-H)(I-H)\sigma ^{2} =(I-H)\sigma ^{2}
\end{align*}
$V(e_i)$ is given by the $i$th diagonal element $(1-h_{ii})\sigma^2$, and $Cov(e_i, e_j)$ is given by the $(i,j)$th element $-h_{ij}\sigma^2$ of the matrix $(I-H)\sigma^2$. The correlation between two residuals is
\begin{align*}
\rho _{ij} =\frac{Cov(e_{i} ,e_{j} )}{\sqrt{V(e_{i} )V(e_{j} )} } =\frac{-h_{ij} }{\sqrt{(1-h_{ii} )(1-h_{jj} )} }
\end{align*}
The regression sum of squares can also be expressed through $H$:
\begin{align*}
SS(b) & = SS(\text{all parameters})=b'X'Y \\
& = \hat{Y}'Y=Y'H'Y=Y'HY=Y'H^{2} Y=\hat{Y}'\hat{Y}
\end{align*}
The average of $V(\hat{Y}_{i})$ over all data points is
\begin{align*}
\sum _{i=1}^{n}\frac{V(\hat{Y}_{i} )}{n} =\frac{trace(H\sigma ^{2} )}{n}=\frac{p\sigma ^{2} }{n},
\end{align*}
and each fitted value is a weighted combination of the responses:
\begin{align*}
\hat{Y}_{i} =h_{ii} Y_{i} +\sum _{j\ne i}h_{ij} Y_{j}
\end{align*}
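These identities can be checked numerically; below is a minimal sketch on an assumed random design (all names and values are illustrative, not from the source):

```python
# Sketch: numerical check of the hat matrix's algebraic identities.
# Assumed random design; sigma2 is an illustrative error variance.
import numpy as np

rng = np.random.default_rng(7)
n, p, sigma2 = 50, 3, 2.0
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
H = X @ np.linalg.inv(X.T @ X) @ X.T
I = np.eye(n)

print(np.allclose(H @ H, H))       # idempotent: H^2 = H
print(np.allclose(H @ X, X))       # HX = X, hence (I-H)X = 0
print(np.abs(H @ (I - H)).max())   # essentially zero: Cov(e, Y_hat) = 0
print(np.trace(H) * sigma2 / n)    # average Var(Y_hat_i) = p * sigma2 / n
```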

## Role of Hat Matrix in Regression Diagnostic

### Internally Studentized Residuals

$V(e_i)=(1-h_{ii})\sigma^2$, where $\sigma^2$ is estimated by $s^2$,

i.e., $s^{2} =\frac{e'e}{n-p} =\frac{\Sigma e_{i}^{2} }{n-p}$ (the residual mean square, RMS).

We can studentize the residuals as $s_{i} =\frac{e_{i} }{s\sqrt{(1-h_{ii} )} }$.

These studentized residuals are said to be internally studentized because $s$ has $e_i$ itself within it.
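A short sketch (synthetic data; all names are assumptions for illustration) computing internally studentized residuals from the formulas above:

```python
# Sketch: internally studentized residuals s_i = e_i / (s * sqrt(1 - h_ii)).
# Synthetic data; variable names are assumptions for illustration.
import numpy as np

rng = np.random.default_rng(3)
n, p = 40, 2
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)

X = np.column_stack([np.ones(n), x])
H = X @ np.linalg.inv(X.T @ X) @ X.T
h = np.diag(H)
e = y - H @ y                        # residual vector e = (I - H)y
s2 = e @ e / (n - p)                 # residual mean square, estimates sigma^2
s_int = e / np.sqrt(s2 * (1.0 - h))  # internally studentized residuals

print(s_int.round(2)[:5])
```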

### Extra Sum of Squares attributable to $e_i$

\begin{align*}
e&=(I-H)Y\\
e_{i} &=-h_{i1} Y_{1} -h_{i2} Y_{2} -\cdots +(1-h_{ii} )Y_{i} -\cdots -h_{in} Y_{n} =c'Y\\
c'&=(-h_{i1} ,-h_{i2} ,\cdots ,(1-h_{ii} ),\cdots, -h_{in} )\\
c'c&=\sum _{j=1}^{n}h_{ij}^{2} +1-2h_{ii} =h_{ii} +1-2h_{ii} =1-h_{ii}\\
SS(e_{i})&=\frac{e_{i}^{2} }{1-h_{ii} }\\
S_{(i)}^{2}&=\frac{(n-p)s^{2} -\frac{e_{i}^{2}}{1-h_{ii} }}{n-p-1}
\end{align*}
The last expression provides an estimate of $\sigma^2$ after deleting the contribution of $e_i$.

### Externally Studentized Residuals

$t_{i} =\frac{e_{i} }{s_{(i)}\sqrt{(1-h_{ii} )} }$ are the externally studentized residuals. Here, if $e_i$ is large, it is thrown into emphasis even more by the fact that $s_{(i)}$ has excluded it. Under the usual normality-of-errors assumptions, $t_i$ follows a $t_{n-p-1}$ distribution.
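The deletion formula for $S_{(i)}^2$ lets all $t_i$ be computed from a single fit. A sketch (assumed synthetic data, not from the source), cross-checked against an explicit leave-one-out fit for the first case:

```python
# Sketch: externally studentized residuals t_i = e_i / (s_(i) * sqrt(1 - h_ii)).
# Synthetic data; names are assumptions for illustration.
import numpy as np

rng = np.random.default_rng(5)
n, p = 40, 2
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)

X = np.column_stack([np.ones(n), x])
H = X @ np.linalg.inv(X.T @ X) @ X.T
h = np.diag(H)
e = y - H @ y
s2 = e @ e / (n - p)

# Deletion estimate of sigma^2: s_(i)^2 = ((n-p)s^2 - e_i^2/(1-h_ii)) / (n-p-1)
s2_del = ((n - p) * s2 - e**2 / (1.0 - h)) / (n - p - 1)
t = e / np.sqrt(s2_del * (1.0 - h))  # follows t_{n-p-1} under normal errors

# Cross-check: refit without case 0 and recompute the variance estimate directly.
mask = np.arange(n) != 0
Xd, yd = X[mask], y[mask]
Hd = Xd @ np.linalg.inv(Xd.T @ Xd) @ Xd.T
ed = yd - Hd @ yd
print(np.isclose(ed @ ed / ((n - 1) - p), s2_del[0]))  # True: formulas agree
```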