# Category: Correlation and Regression Analysis

## Multicollinearity in Linear Regression Models


The objective of multiple regression analysis is to estimate the individual parameters of a dependency relationship, not of an interdependency. It is assumed that the dependent variable $y$ and the regressors ($X$'s) are linearly related to each other (Graybill, 1980; Johnston, 1963; Malinvaud, 1968). The inferences drawn from any regression model are therefore to

(i) identify the relative influence of the regressors,
(ii) predict and/or estimate, and
(iii) select an appropriate set of regressors for the model.

Among these inferences, one purpose of the regression model is to ascertain to what extent the dependent variable can be predicted by the regressors in the model. To draw suitable inferences, however, the regressors should be orthogonal, i.e., there should be no linear dependencies among them. In most applications of regression analysis the regressors are not orthogonal, which leads to misleading and erroneous inferences, especially when the regressors are perfectly or nearly perfectly collinear. The condition of non-orthogonality is also referred to as the problem of multicollinearity or collinear data (see, for example, Gunst and Mason, 1977; Mason et al., 1975; Ragnar, 1934). Multicollinearity is also synonymous with ill-conditioning of the $X'X$ matrix.

The presence of interdependence, or the lack of independence, is signified by high-order inter-correlation ($R=X'X$) within a set of regressors (Dorsett et al., 1983; Farrar and Glauber, 1967; Gunst and Mason, 1977; Mason et al., 1975). Perfect multicollinearity is a pathological extreme; it can easily be detected and resolved by dropping one of the regressors causing it (Belsley et al., 1980). Under perfect multicollinearity, the regression coefficients remain indeterminate and their standard errors are infinite. Similarly, perfectly collinear regressors destroy the uniqueness of the least squares estimators (Belsley et al., 1980; Belsley, 1991).

In practice, many explanatory variables (regressors/predictors) are highly collinear, which makes it very difficult to infer the separate influence of each collinear regressor on the response variable ($y$). Estimating the regression coefficients becomes difficult because each coefficient measures the effect of the corresponding regressor while holding all other regressors constant. Imperfect (near) multicollinearity is extremely hard to detect (Chatterjee and Hadi, 2006) because it is not a specification or modeling error; rather, it is a condition of deficient data (Hadi and Chatterjee, 1988). On the other hand, the existence of multicollinearity has no impact on the overall regression model and associated statistics such as $R^2$, the $F$-ratio, and the $p$-value. Multicollinearity also does not lessen the predictive power or reliability of the regression model as a whole; it only affects the individual regressors (Koutsoyiannis, 1977). Note that multicollinearity refers only to linear relationships among the regressors; it does not rule out nonlinear relationships among them.
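The perfect-collinearity case can be illustrated with a small sketch. The data below are simulated and the variable names (`x1`, `x2`, `x3`) are hypothetical; the point is only that an exact linear dependency makes the design matrix rank-deficient (so $X'X$ is singular), and that dropping the redundant regressor restores full column rank.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = 2.0 * x1 - 3.0 * x2            # exact linear combination of x1 and x2

# Design matrix with an intercept column and all three regressors
X = np.column_stack([np.ones(n), x1, x2, x3])
rank_full = np.linalg.matrix_rank(X)
print("columns:", X.shape[1], "rank:", rank_full)    # rank is below the column count

# Dropping the redundant regressor restores full column rank
X_reduced = X[:, :3]
rank_reduced = np.linalg.matrix_rank(X_reduced)
print("columns:", X_reduced.shape[1], "rank:", rank_reduced)
```

When the rank is smaller than the number of columns, the normal equations have infinitely many solutions, which is what "indeterminate coefficients" means in practice.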

To draw suitable inferences from the model, the existence of (multi)collinearity should always be tested as an initial step in multiple regression analysis when examining a data set. Severe collinearity is rare, but some degree of collinearity always exists.

A distinction between collinearity and multicollinearity should be made. Strictly speaking, multicollinearity refers to the existence of more than one exact linear relationship among the regressors, while collinearity refers to the existence of a single linear relationship. Nowadays, however, the term multicollinearity is used for both cases.

There are many methods for detecting and testing (multi)collinearity among regressors. However, these methods can destroy the usefulness of the model, since relevant regressors may be removed by them. Note that if there are only two predictors, pairwise correlation is sufficient to detect collinearity. To check the severity of the collinearity problem, VIF/TOL, eigenvalues, or other diagnostic measures can be used.
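As a rough illustration of the diagnostics mentioned above, the following sketch computes the variance inflation factor (VIF), its reciprocal the tolerance (TOL), and the eigenvalues of the correlation matrix of the regressors. The data are simulated, and the function name `vif_and_tol` is a hypothetical helper, not part of any package cited here. It assumes the standard definition $VIF_j = 1/(1-R_j^2)$, where $R_j^2$ comes from the auxiliary regression of the $j$th regressor on the remaining ones.

```python
import numpy as np

def vif_and_tol(X):
    """Return (VIF, TOL) for each column of X (regressors only, no intercept)."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    vif = np.empty(p)
    for j in range(p):
        target = X[:, j]
        # Auxiliary regression: regressor j on an intercept and all other regressors
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ beta
        r2 = 1.0 - (resid @ resid) / np.sum((target - target.mean()) ** 2)
        vif[j] = 1.0 / (1.0 - r2)
    return vif, 1.0 / vif

# Simulated regressors: x3 is nearly a linear combination of x1 and x2
rng = np.random.default_rng(1)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = x1 + x2 + rng.normal(scale=0.05, size=n)
X = np.column_stack([x1, x2, x3])

vif, tol = vif_and_tol(X)
R = np.corrcoef(X, rowvar=False)          # correlation matrix of the regressors
eigvals = np.linalg.eigvalsh(R)           # eigenvalues in ascending order
cond_number = np.sqrt(eigvals.max() / eigvals.min())

print("VIF:", np.round(vif, 1))
print("TOL:", np.round(tol, 4))
print("eigenvalues:", np.round(eigvals, 4))
print("condition number:", round(float(cond_number), 1))
```

A commonly quoted rule of thumb treats a VIF above 10 (TOL below 0.1) or a near-zero eigenvalue as a sign of serious collinearity; such thresholds are conventions rather than formal tests.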

For further detail about “Multicollinearity in Linear Regression Models” see:

• Belsley, D., Kuh, E., and Welsch, R. (1980). Regression Diagnostics: Identifying Influential Data and Sources of Collinearity. John Wiley & Sons, New York. chap. 3.
• Belsley, D. A. (1991). A Guide to Using the Collinearity Diagnostics. Computer Science in Economics and Management, 4(1), 33-50.
• Chatterjee, S. and Hadi, A. S. (2006). Regression Analysis by Example. Wiley and Sons, 4th edition.
• Dorsett, D., Gunst, R. F., and Gartland, E. C. J. (1983). Multicollinear Effects of Weighted Least Squares Regression. Statistics & Probability Letters, 1(4), 207-211.
• Graybill, F. (1980). An Introduction to Linear Statistical Models. McGraw Hill.
• Gunst, R. and Mason, R. (1977). Advantages of Examining Multicollinearities in Regression Analysis. Biometrics, 33, 249-260.
• Hadi, A. and Chatterjee, S. (1988). Sensitivity Analysis in Linear Regression. John Wiley & Sons.
• Imdadullah, M., Aslam, M., and Altaf, S. (2016). mctest: An R Package for Detection of Collinearity Among Regressors.
• Imdadullah, M. and Aslam, M. (2016). mctest: An R Package for Detection of Collinearity Among Regressors.
• Johnston, J. (1963). Econometric Methods. McGraw Hill, New York.
• Koutsoyiannis, A. (1977). Theory of Econometrics. Macmillan Education Limited.
• Malinvaud, E. (1968). Statistical Methods of Econometrics. Amsterdam, North Holland. pp. 187-192.
• Mason, R., Gunst, R., and Webster, J. (1975). Regression Analysis and Problems of Multicollinearity. Communications in Statistics, 4(3), 277-292.
• Ragnar, F. (1934). Statistical Confluence Analysis by Means of Complete Regression Systems. Universitetets Økonomiske Institutt. Publ. No. 5.

## The Spearman Rank Correlation Test (Numerical Example)

Consider the following data to illustrate the detection of heteroscedasticity using the Spearman rank correlation test. The data file is available to download.

The estimated multiple linear regression model is:

$$Y_i = -34.936 -0.75X_{2i} + 7.611X_{3i}$$

The Residuals with data table are:

We need to rank the absolute values of the residuals, $|u_i|$, and the values of the suspected heteroscedastic variable $X_2$.

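Let us compute Spearman's rank correlation coefficient, where $d$ is the difference between the two ranks of an observation and $n$ is the number of observations:

\begin{align}
r_s&=1-\frac{6\sum d^2}{n(n^2-1)}\\
&=1-\frac{6\times 70.5}{10(10^2-1)}=0.5727
\end{align}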

Let us test the statistical significance of $r_s$ using the $t$-test with $n-2$ degrees of freedom:

\begin{align}
t&=\frac{r_s \sqrt{n-2}}{\sqrt{1-r_s^2}}\\
&=\frac{0.5727\sqrt{8}}{\sqrt{1-(0.5727)^2}}\approx 1.976
\end{align}

The tabulated value of $t$ at the 5% level of significance with 8 degrees of freedom is 2.306.

Since $t_{cal} \ngtr t_{tab}$, there is no evidence of a systematic relationship between the explanatory variable $X_2$ and the absolute values of the residuals ($|u_i|$), and hence there is no evidence of heteroscedasticity.
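The arithmetic of the test can be reproduced with a short sketch. It uses only the summary figures reported in the text ($\sum d^2 = 70.5$ and the tabulated $t$ of 2.306); the sample size $n = 10$ is an assumption inferred from the 8 degrees of freedom, and `average_ranks` is a hypothetical helper for ranking raw data with ties, since the data table itself is not reproduced here.

```python
import math

def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

n = 10            # assumed: inferred from n - 2 = 8 degrees of freedom
sum_d2 = 70.5     # sum of squared rank differences reported in the text
t_tab = 2.306     # tabulated t at the 5% level with 8 d.f., from the text

r_s = 1 - 6 * sum_d2 / (n * (n**2 - 1))
t_cal = r_s * math.sqrt(n - 2) / math.sqrt(1 - r_s**2)
reject = abs(t_cal) > t_tab

print(f"r_s = {r_s:.4f}, t = {t_cal:.3f}")
print("evidence of heteroscedasticity:", reject)
```

With raw data, the ranks of $|u_i|$ and of the regressor could be obtained from `average_ranks`, and $\sum d^2$ computed from their differences before applying the same two formulas.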

Since there is more than one regressor (the example is from a multiple regression model), Spearman's rank correlation test should be repeated for each of the explanatory variables.

As an assignment, compute the Spearman rank correlation between $|u_i|$ and $X_3$ for the data above. Test the statistical significance of the coefficient in the same manner to look for evidence of heteroscedasticity.

## MCQs Econometrics-1

This quiz is about Econometrics and covers regression analysis, correlation, dummy variables, multicollinearity, heteroscedasticity, autocorrelation, and many other topics. Let us start with the MCQs Econometrics test.

MCQs about Multicollinearity, Dummy Variable, Selection of Variables, Error in Variables, Autocorrelation, Time Series, Heteroscedasticity, Simultaneous Equations, and Regression analysis

1. In a regression model with three explanatory variables, there will be _______ auxiliary regressions

2. Autocorrelation may occur due to

3. Choose a true statement about the Durbin-Watson test

4. If a Durbin-Watson statistic takes a value close to zero, what will be the value of the first-order autocorrelation coefficient?

5. When measurement errors are present in the explanatory variable(s) they make

6. The value of the Durbin-Watson $d$ lies between

7. Heteroscedasticity is more common in

8. Which one of the following assumptions is not related to errors in the explanatory variables?

9. Which of the following statements is true about autocorrelation?

10. Which of the following actions does not make sense as a remedy for multicollinearity?

11. Heteroscedasticity can be detected by plotting the estimated $\hat{u}_i^2$ against

12. Negative autocorrelation can be indicated by which of the following?

13. Which one is not the rule of thumb?

14. Valid tests for the presence or absence of first-order autocorrelation are

15. The AR(1) process is stationary if

Econometrics is the application of statistical methods to economic data in order to find empirical relationships between economic variables. In other words, Econometrics is “the quantitative analysis of actual economic phenomena based on the concurrent development of theory and observation, related by appropriate methods of inference”.

## Hierarchical Multiple Regression in SPSS

In this tutorial, we will learn how to perform hierarchical multiple regression analysis in SPSS. It is a variant of basic multiple regression analysis that allows specifying a fixed order of entry for variables (regressors), in order to control for the effects of covariates or to test the effects of certain predictors independent of the influence of others.

The basic command for hierarchical multiple regression analysis in SPSS is “Analyze -> Regression -> Linear”:

In the main dialog box of linear regression (as given below), input the dependent variable, for example the “income” variable from the sample file customer_dbase.sav available in the SPSS installation directory.

Next, enter a set of predictor variables into the “Independent(s)” box. These are the variables that you want SPSS to put into the regression model first (the ones you want to control for when testing the variables of interest). For example, in this analysis, we want to find out whether “Number of people in the house” predicts “Household income in thousands”. We are also concerned that other variables such as age, education, gender, union member, or retired might be associated with both “Number of people in the house” and “Household income in thousands”. To make sure that these variables (age, education, gender, union member, and retired) do not explain away the entire association between “Number of people in the house” and “Household income in thousands”, let us put them into the model first. This ensures that they get credit for any shared variability that they may have with the predictor we are really interested in, “Number of people in the house”. Any observed effect of “Number of people in the house” can then be said to be independent of the effects of the variables that have already been controlled for. See the figure below.

In the next step, enter the variable that we are really interested in, which is “Number of people in the house”. To include it in the model, click the “NEXT” button. You will see all of the predictors that were entered previously disappear. Note that they are still in the model, just not on the current screen (block). You will also see “Block 2 of 2” above the “Independent(s)” box.

Now click the “OK” button to run the analysis.

Note that you can hit the “NEXT” button again if you want to enter a third or fourth (and so on) block of variables.

Researchers often enter variables as related sets, for example demographic variables in the first step, all potentially confounding variables in the second step, and the variables of most interest in the third step. However, this scheme is not mandatory. One can also enter each variable as a separate step if that seems more logical given the design of the experiment.

Using just the default “Enter” method, with all the variables in Block 1 (demographics) entered together, followed by “Number of people in the house” as a predictor in Block 2, we get the following output:

The first table in the output window confirms which variables were entered in each step.

The summary table shows the percentage of explained variation in the dependent variable that can be accounted for by all the predictors together. The change in $R^2$ (R-Squared) is a way to evaluate how much predictive power was added to the model by the addition of another variable in STEP 2. In our example, predictive power does not improve by the addition of another predictor in STEP 2.
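The change in $R^2$ between blocks can be sketched outside SPSS as follows. This is an illustration on simulated data, not on customer_dbase.sav: the arrays `covariates` and `focal` are hypothetical stand-ins for the Block 1 and Block 2 predictors, and the simulated focal predictor has no true effect, which loosely mirrors the situation described above. The $F$ statistic for the change assumes the usual nested-model formula $F = \frac{\Delta R^2 / q}{(1-R^2_{full})/(n-k-1)}$, with $q$ added predictors and $k$ predictors in the full model.

```python
import numpy as np

def r_squared(X, y):
    """R-squared of an OLS fit of y on X plus an intercept."""
    Xc = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(Xc, y, rcond=None)
    resid = y - Xc @ beta
    return 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)

# Simulated data (hypothetical stand-ins for the SPSS variables)
rng = np.random.default_rng(42)
n = 500
covariates = rng.normal(size=(n, 3))    # Block 1: control variables
focal = rng.normal(size=(n, 1))         # Block 2: predictor of interest
y = covariates @ np.array([1.0, 0.5, -0.8]) + rng.normal(size=n)

# Fit the two nested models in the order the blocks are entered
r2_block1 = r_squared(covariates, y)
r2_block2 = r_squared(np.hstack([covariates, focal]), y)
delta_r2 = r2_block2 - r2_block1

# F statistic for the R-squared change
q = focal.shape[1]
k_full = covariates.shape[1] + q
f_change = (delta_r2 / q) / ((1.0 - r2_block2) / (n - k_full - 1))

print(f"R2 block 1 = {r2_block1:.4f}, R2 block 2 = {r2_block2:.4f}")
print(f"R2 change = {delta_r2:.4f}, F change = {f_change:.3f}")
```

The $F$ change would be compared with an $F$ distribution with $q$ and $n-k-1$ degrees of freedom; a small value indicates that the second block adds little predictive power.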

The overall significance of each model can be checked from the ANOVA table. In this case, both models are statistically significant.

The coefficient table is used to check the individual significance of the predictors. For model 2, “Number of people in the household” is statistically non-significant and may therefore be excluded from the model.
