Eigenvalue Multicollinearity Detection

In this post, we learn about the role of eigenvalues in multicollinearity detection. In the context of multicollinearity detection, eigenvalues are used to assess the degree of linear dependence among the explanatory (regressor, independent) variables in a regression model. By understanding this role, one can take appropriate steps to improve the reliability and interpretability of regression models.

Eigenvalue and Eigenvector Decomposition

The pair-wise correlation matrix of the explanatory variables is decomposed into eigenvalues and eigenvectors. The eigenvalues represent the variance explained by each principal component, while the eigenvectors represent the directions of maximum variance.

The Decomposition Process

Firstly, compute the correlation coefficients between each pair of variables in the dataset.

Secondly, find the eigenvalues and eigenvectors: solve the following equation for each eigenvalue ($\lambda$) and eigenvector ($v$)

$$A v = \lambda v$$

where $A$ is the correlation matrix, $v$ is the eigenvector, and $\lambda$ is the eigenvalue.

The above equation essentially means that multiplying the correlation matrix ($A$) by the eigenvector ($v$) results in a scaled version of the eigenvector, where the scaling factor is the eigenvalue. This can be solved using various numerical methods, such as the power method or QR algorithm.
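
As a minimal sketch in R, the decomposition can be obtained with the base functions cor() and eigen(); the built-in mtcars data and the variables disp, hp, and wt are used purely for illustration.

X  <- mtcars[, c("disp", "hp", "wt")]  # explanatory variables (example only)
A  <- cor(X)                           # pair-wise correlation matrix
ed <- eigen(A)                         # numerically solves A v = lambda v
ed$values                              # eigenvalues: variance of each component
ed$vectors                             # eigenvectors: directions of maximum variance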

Interpreting Eigenvalue Multicollinearity Detection

A set of eigenvalues of relatively equal magnitudes indicates little multicollinearity (Freund and Littell 2000: 99). A small number of large eigenvalues suggests that a small number of component variables describe most of the variability of the original observed variables ($X$). Because the eigenvalues of a correlation matrix must sum to the number of variables, a few large eigenvalues imply that there will also be some small eigenvalues, that is, some component variables with small variances.

A zero eigenvalue means perfect multicollinearity among the independent (explanatory) variables, and very small eigenvalues imply severe multicollinearity. Conventionally, an eigenvalue close to zero (say, less than 0.01) or a condition number greater than 50 (30 for more conservative analysts) indicates notable multicollinearity. The condition index, calculated as the ratio of the largest eigenvalue to the smallest eigenvalue $\left(\frac{\lambda_{max}}{\lambda_{min}}\right)$ (some authors, such as Belsley et al. (1980), use the square root of this ratio), is a more sensitive measure of multicollinearity. A high condition index (often above 30) signals severe multicollinearity.
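
A rough sketch of these rules of thumb in R, again using mtcars only as an example dataset:

ev <- eigen(cor(mtcars[, c("disp", "hp", "wt")]))$values
any(ev < 0.01)          # is any eigenvalue close to zero?
max(ev) / min(ev)       # ratio of largest to smallest eigenvalue
sqrt(max(ev) / min(ev)) # square-root form used by some authors (e.g., Belsley et al.)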

Eigenvalue Multicollinearity Detection

The variance proportions tell what percentage of the variance of each parameter estimate (coefficient) is associated with each eigenvalue. A high variance proportion for an independent variable's coefficient reveals a strong association with the corresponding eigenvalue. If an eigenvalue is small enough and two or more independent variables show high variance proportions for that eigenvalue, one may conclude that these independent variables have a significant linear dependency (correlation).
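
One way to compute such variance-decomposition proportions, in the spirit of Belsley et al. (1980), is sketched below; it assumes the columns of the design matrix (including the intercept) are first scaled to unit length, and mtcars is only an illustrative dataset.

X  <- model.matrix(mpg ~ disp + hp + wt, data = mtcars)
Xs <- scale(X, center = FALSE, scale = sqrt(colSums(X^2)))  # unit-length columns
sv <- svd(Xs)
phi  <- sweep(sv$v^2, 2, sv$d^2, "/")  # phi[j, k] = v[j, k]^2 / d[k]^2
prop <- t(phi / rowSums(phi))          # rows: components, columns: coefficients
dimnames(prop) <- list(paste0("comp", seq_along(sv$d)), colnames(X))
round(prop, 3)   # high proportions on a small component flag a linear dependency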

Presence of Multicollinearity in Regression Model

Multicollinearity is a statistical phenomenon in which two or more independent (explanatory) variables in a regression model are highly correlated. Its presence may result in:

  • Unstable Coefficient Estimates: Estimates of regression coefficients become unstable in the presence of multicollinearity. A small change in the data can lead to large changes in the estimates of the regression coefficients.
  • Inflated Standard Errors: The standard errors of the regression coefficients are inflated in the presence of multicollinearity, making it difficult to assess the statistical significance of the coefficients (see the small simulation after this list).
  • Difficulty in Interpreting Coefficients: It becomes challenging to interpret the individual effects of the independent variables on the dependent variable when they are highly correlated.
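
The following small simulation uses hypothetical variables x1 and x2, where x2 is nearly a copy of x1, to illustrate the first two points.

set.seed(1)
n  <- 100
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.05)           # almost perfectly correlated with x1
y  <- 2 + x1 + x2 + rnorm(n)
summary(lm(y ~ x1 + x2))$coefficients    # inflated standard errors on x1 and x2
summary(lm(y ~ x1))$coefficients         # dropping x2 gives a much more stable estimate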

How to Mitigate the Effects of Multicollinearity

If multicollinearity is detected, several strategies can be employed to mitigate its effects. By examining the distribution of eigenvalues, researchers (statisticians and data analysts) can identify potential issues and take appropriate steps to address them, such as the feature selection and regularization techniques listed below (a brief code sketch follows the list).

  • Feature Selection: Remove redundant or highly correlated variables from the model.
  • Principal Component Regression (PCR): Transform the original variables into a smaller set of uncorrelated principal components.
  • Partial Least Squares Regression (PLSR): Similar to PCR, but it also considers the relationship between the independent variables and the dependent variable.
  • Ridge Regression: Introduces a bias-variance trade-off to stabilize the coefficient estimates.
  • Lasso Regression: Shrinks some coefficients to zero, effectively performing feature selection.
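
As a possible sketch, ridge (alpha = 0) and lasso (alpha = 1) fits could be obtained with the glmnet package (assumed to be installed); the data below are the same kind of simulated, hypothetical collinear predictors used earlier.

library(glmnet)
set.seed(1)
n  <- 100
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.05)
y  <- 2 + x1 + x2 + rnorm(n)
X  <- cbind(x1, x2)                  # glmnet expects a numeric matrix
ridge <- cv.glmnet(X, y, alpha = 0)  # cross-validated ridge regression
lasso <- cv.glmnet(X, y, alpha = 1)  # cross-validated lasso regression
coef(ridge, s = "lambda.min")        # shrunken, stabilized coefficients
coef(lasso, s = "lambda.min")        # some coefficients may be shrunk to zero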


Multicollinearity in Regression Models

This post is about Multicollinearity in Regression Models.

The objective of multiple regression analysis is to approximate the relationship of individual parameters of a dependency, but not of an interdependency. It is assumed that the dependent variable $y$ and the regressors $X$'s are linearly related to each other (Graybill, 1980; Johnston, 1963; and Malinvaud, 1968). Therefore, the inferences drawn from any regression model are to:

(i) identify the relative influence of the regressors,
(ii) predict and/or estimate the response, and
(iii) select an appropriate set of regressors for the model.

Multicollinearity in Regression Models

Given these inferences, one of the purposes of a regression model is to ascertain to what extent the dependent variable can be predicted by the regressors in the model. To draw suitable inferences, however, the regressors should be orthogonal, i.e., there should be no linear dependencies among the regressors. In most applications of regression analysis, the regressors are not orthogonal, which leads to misleading and erroneous inferences, especially when the regressors are perfectly or nearly perfectly collinear with each other.

Regarding multicollinearity in regression, the condition of non-orthogonality is also referred to as the problem of multicollinearity or collinear data (for example, see Gunst and Mason, 1977; Mason et al., 1975; and Ragnar, 1934). Multicollinearity is also synonymous with ill-conditioning of the $X'X$ matrix.

The presence of interdependence or the lack of independence is signified by high-order inter-correlations ($R = X'X$) within a set of regressors (Dorsett et al., 1983; Farrar and Glauber, 1967; Gunst and Mason, 1977; Mason et al., 1975). Perfect multicollinearity is a pathological extreme, and it can easily be detected and resolved by dropping one of the regressors causing the multicollinearity (Belsley et al., 1980). In the case of perfect multicollinearity, the regression coefficients remain indeterminate and their standard errors are infinite. Similarly, perfectly collinear regressors destroy the uniqueness of the least squares estimators (Belsley et al., 1980; and Belsley, 1991).

When many explanatory variables (regressors/predictors) are highly collinear, it becomes very difficult to infer the separate influence of the collinear regressors on the response variable ($y$); that is, estimation of the regression coefficients becomes difficult because each coefficient measures the effect of the corresponding regressor while holding all other regressors constant. The problem of less-than-perfect multicollinearity is extremely hard to detect (Chatterjee and Hadi, 2006), as it is not a specification or modeling error; rather, it is a condition of deficient data (Hadi and Chatterjee, 1988). On the other hand, the existence of multicollinearity has no impact on the overall regression model and associated statistics such as $R^2$, the $F$-ratio, and the $p$-value.

Multicollinearity also does not lessen the predictive power or reliability of the regression model as a whole; it only affects the individual regressors (Koutsoyiannis, 1977). Note that multicollinearity refers only to the linear relationships among the regressors; it does not rule out nonlinear relationships among them. To draw suitable inferences from the model, the existence of (multi)collinearity should always be tested as an initial step when examining a data set in multiple regression analysis. High collinearity is rare, but some degree of collinearity always exists.

Multicollinearity in Linear Regression Models

A distinction between collinearity and multicollinearity should be made. Strictly speaking, multicollinearity usually refers to the existence of more than one exact linear relationship among the regressors, while collinearity refers to the existence of a single linear relationship. Nowadays, however, multicollinearity refers to both cases.

There are many methods for the detection/testing of (multi)collinearity among regressors. However, the remedies suggested by these diagnostics (such as dropping regressors) can destroy the usefulness of the model, since relevant regressor(s) may be removed. Note that if there are only two predictors, the pairwise correlation is sufficient to detect a collinearity problem. However, to check the severity of the collinearity problem, the VIF/TOL, eigenvalues, or other diagnostic measures can be used.
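
For example, VIF/TOL and eigenvalue diagnostics might be computed as sketched below; the vif() function from the car package is assumed to be available, and mtcars serves only as an illustrative dataset.

library(car)                                         # provides vif()
model <- lm(mpg ~ disp + hp + wt, data = mtcars)
vif(model)                                           # variance inflation factors
1 / vif(model)                                       # tolerance (TOL = 1/VIF)
eigen(cor(mtcars[, c("disp", "hp", "wt")]))$values   # eigenvalue diagnostics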

For further details about “Multicollinearity in Regression Models” see:

  • Belsley, D., Kuh, E., and Welsch, R. (1980). Regression Diagnostics: Identifying Influential Data and Sources of Collinearity. John Wiley & Sons, New York. Chap. 3.
  • Belsley, D. A. (1991). A Guide to Using the Collinearity Diagnostics. Computer Science in Economics and Management, 4(1), 33–50.
  • Chatterjee, S. and Hadi, A. S. (2006). Regression Analysis by Example. Wiley and Sons, 4th edition.
  • Dorsett, D., Gunst, R. F., and Gartland, E. C. J. (1983). Multicollinear Effects of Weighted Least Squares Regression. Statistics & Probability Letters, 1(4), 207–211.
  • Graybill, F. (1980). An Introduction to Linear Statistical Models. McGraw Hill.
  • Gunst, R. and Mason, R. (1977). Advantages of Examining Multicollinearities in Regression Analysis. Biometrics, 33, 249–260.
  • Hadi, A. and Chatterjee, S. (1988). Sensitivity Analysis in Linear Regression. John Wiley & Sons.
  • Imdadullah, M., Aslam, M., and Altaf, S. (2016). mctest: An R Package for Detection of Collinearity Among Regressors.
  • Johnston, J. (1963). Econometric Methods. McGraw Hill, New York.
  • Koutsoyiannis, A. (1977). Theory of Econometrics. Macmillan Education Limited.
  • Malinvaud, E. (1968). Statistical Methods of Econometrics. North-Holland, Amsterdam. pp. 187–192.
  • Mason, R., Gunst, R., and Webster, J. (1975). Regression Analysis and Problems of Multicollinearity. Communications in Statistics, 4(3), 277–292.
  • Ragnar, F. (1934). Statistical Confluence Analysis by Means of Complete Regression Systems. Universitetets Økonomiske Institutt. Publ. No. 5.

Learn about Data analysis of Statistical Models in R

Multicollinearity Introduction Explained Easy (2019)

For a classical linear regression model with multiple regressors (explanatory variables), there should be no exact linear relationship between the explanatory variables. The term collinearity or multicollinearity is used if one or more linear relationships exist among the variables.

Multicollinearity Term

The term multicollinearity refers to the violation of the assumption of “no exact linear relationship between the regressors”.

Ragnar Frisch introduced this term. Originally, it meant the existence of a “perfect” or “exact” linear relationship among some or all regressors of a regression model.

Consider a $k$-variable regression model involving explanatory variables $X_1, X_2, \cdots, X_k$. An exact linear relationship is said to exist if the following condition is satisfied.

\[\lambda_1 X_1 + \lambda_2  X_2 + \cdots + \lambda_k X_k=0,\]

where $\lambda_1, \lambda_2, \cdots, \lambda_k$ are constants such that not all of them are zero simultaneously, and $X_1 = 1$ for all observations to allow for the intercept term.

Nowadays, the term multicollinearity is used not only for the case of perfect multicollinearity but also for the case of less-than-perfect collinearity (the case where the $X$ variables are intercorrelated, but not perfectly). Therefore,

\[\lambda_1 X_1 + \lambda_2 X_2 + \cdots + \lambda_k X_k + \upsilon_i = 0,\]

where $\upsilon_i$ is a stochastic error term.

Multicollinearity

In the case of a perfect linear relationship (the correlation coefficient will be one in this case) among the explanatory variables, the parameters become indeterminate (it is impossible to obtain values for each parameter separately) and the method of least squares breaks down. However, if the regressors are not intercorrelated at all, the variables are called orthogonal and there is no problem concerning the estimation of the coefficients.
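
This breakdown is easy to see in R with hypothetical variables: when x2 is an exact multiple of x1, the two coefficients cannot be determined separately, and lm() reports NA for the aliased regressor.

set.seed(1)
x1 <- rnorm(50)
x2 <- 2 * x1                   # exact linear dependence among the regressors
y  <- 1 + 3 * x1 + rnorm(50)
coef(lm(y ~ x1 + x2))          # the coefficient of x2 is returned as NA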

Note that

  • Multicollinearity is not a condition that either exists or does not exist, but rather a phenomenon inherent in most relationships.
  • Multicollinearity refers to only a linear relationship among the $X$ variables. It does not rule out the non-linear relationships among them.

See use of mctest R package for diagnosing collinearity