Multiple Regression Analysis

Introduction to Multiple Regression Analysis

Francis Galton (a biometrician) examined the relationship between the heights of fathers and sons. He also analyzed the similarities between the parent and offspring generations of 700 sweet peas. Galton found that the offspring of tall parents tended to be shorter than their parents, while the offspring of short parents tended to be taller; that is, the height of the children ($Y$) depends upon the height of the parents ($X$). When there is more than one independent variable (IV), we need multiple regression analysis (MRA), also called multiple linear regression (MLR).

Multiple Linear Regression Model

The linear regression model (equation) for two independent variables (regressors) is

$$Y_i = \alpha + \beta_1 X_{1i} + \beta_2 X_{2i} + \varepsilon_i$$

The general linear regression model (equation) for $k$ independent variables is

$$Y_i = \alpha + \beta_1 X_{1i} + \beta_2 X_{2i} + \beta_3 X_{3i} + \cdots + \beta_k X_{ki} + \varepsilon_i$$

The $\beta$s are all regression coefficients (partial slopes) and the $\alpha$ is the intercept.

The sample linear regression model is

$$y_i = \hat{\alpha} + \hat{\beta}_1 x_{1i} + \hat{\beta}_2 x_{2i} + e_i$$

where $e_i = y_i - \hat{y}_i$ is the residual.

Multiple Regression Coefficients Formula

To fit the MLR equation for two regressors, one needs to compute the values of $\hat{\beta}_1$, $\hat{\beta}_2$, and $\hat{\alpha}$. The formula for the 1st partial regression coefficient is

$$\hat{\beta}_1 = \frac{(S_{X_1 Y})(S_{X_2^2}) - (S_{X_2 Y})(S_{X_1 X_2})}{(S_{X_1^2})(S_{X_2^2}) - (S_{X_1 X_2})^2}$$

The first term in the numerator of the above formula is the sum of products of the 1st independent and the dependent variable ($S_{X_1 Y}$) multiplied by the sum of squares of the 2nd independent variable ($S_{X_2^2}$).

The second term in the numerator is the sum of products of the 2nd independent and the dependent variable ($S_{X_2 Y}$) multiplied by the sum of products of the two independent variables ($S_{X_1 X_2}$).

The first term in the denominator is the sum of squares of the 1st independent variable ($S_{X_1^2}$) multiplied by the sum of squares of the 2nd independent variable ($S_{X_2^2}$).

The second term in the denominator is the square of the sum of products of the two independent variables ($(S_{X_1 X_2})^2$).

The formula for the 2nd partial regression coefficient is

$$\hat{\beta}_2 = \frac{(S_{X_2 Y})(S_{X_1^2}) - (S_{X_1 Y})(S_{X_1 X_2})}{(S_{X_1^2})(S_{X_2^2}) - (S_{X_1 X_2})^2}$$

In short, note that $S$ stands for the corrected sums of squares and sums of products; in general, $S_{UV} = \sum UV - \frac{\sum U \sum V}{n}$.
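As a sketch of how these formulas translate into code, the following Python function (our own illustration, not from the original article; the name partial_slopes is hypothetical) estimates the intercept and the two partial slopes from the $S$ quantities:

```python
import numpy as np

def partial_slopes(x1, x2, y):
    """Estimate a, b1, b2 in y = a + b1*x1 + b2*x2 using the corrected
    sums of squares and cross-products (the S quantities above)."""
    x1, x2, y = (np.asarray(v, dtype=float) for v in (x1, x2, y))
    n = len(y)

    def S(u, v):
        # Corrected sum of products: sum(uv) - sum(u)*sum(v)/n
        return np.sum(u * v) - u.sum() * v.sum() / n

    det = S(x1, x1) * S(x2, x2) - S(x1, x2) ** 2
    b1 = (S(x1, y) * S(x2, x2) - S(x2, y) * S(x1, x2)) / det
    b2 = (S(x2, y) * S(x1, x1) - S(x1, y) * S(x1, x2)) / det
    a = y.mean() - b1 * x1.mean() - b2 * x2.mean()
    return a, b1, b2
```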

Multiple Linear Regression Example

Consider the following data about two regressors ($X_1, X_2$) and one regressand variable ($Y$).

        $Y$    $X_1$   $X_2$   $X_1Y$   $X_2Y$   $X_1X_2$   $X_1^2$   $X_2^2$
        30     10      15      300      450      150        100       225
        22     5       8       110      176      40         25        64
        16     10      12      160      192      120        100       144
        7      3       7       21       49       21         9         49
        14     2       10      28       140      20         4         100
Sum     89     30      52      619      1007     351        238       582

\begin{align*}
S_{X_1 Y} &= \sum X_1 Y - \frac{\sum X_1 \sum Y}{n} = 619 - \frac{30\times 89}{5} = 85\\
S_{X_1 X_2} &= \sum X_1 X_2 - \frac{\sum X_1 \sum X_2}{n} = 351 - \frac{30 \times 52}{5} = 39\\
S_{X_1^2} &= \sum X_1^2 - \frac{(\sum X_1)^2}{n} = 238 - \frac{30^2}{5} = 58\\
S_{X_2^2} &= \sum X_2^2 - \frac{(\sum X_2)^2}{n} = 582 - \frac{52^2}{5} = 41.2\\
S_{X_2 Y} &= \sum X_2 Y - \frac{\sum X_2 \sum Y}{n} = 1007 - \frac{52 \times 89}{5} = 81.4
\end{align*}

\begin{align*}
\hat{\beta}_1 &= \frac{(S_{X_1 Y})(S_{X_2^2}) - (S_{X_2 Y})(S_{X_1 X_2})}{(S_{X_1^2})(S_{X_2^2}) - (S_{X_1 X_2})^2} = \frac{(85)(41.2) - (81.4)(39)}{(58)(41.2) - (39)^2} = \frac{327.4}{868.6} = 0.377\\
\hat{\beta}_2 &= \frac{(S_{X_2 Y})(S_{X_1^2}) - (S_{X_1 Y})(S_{X_1 X_2})}{(S_{X_1^2})(S_{X_2^2}) - (S_{X_1 X_2})^2} = \frac{(81.4)(58) - (85)(39)}{(58)(41.2) - (39)^2} = \frac{1406.2}{868.6} = 1.619\\
\hat{\alpha} &= \overline{Y} - \hat{\beta}_1 \overline{X}_1 - \hat{\beta}_2 \overline{X}_2 = 17.8 - (0.377)(6) - (1.619)(10.4) = -1.30
\end{align*}

The fitted regression equation is therefore

$$\hat{y} = -1.30 + 0.377\, x_1 + 1.619\, x_2$$
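As a quick check, numpy's least-squares routine reproduces these hand computations. This is a minimal sketch using the five observations from the table above:

```python
import numpy as np

y  = np.array([30, 22, 16, 7, 14], dtype=float)
x1 = np.array([10, 5, 10, 3, 2], dtype=float)
x2 = np.array([15, 8, 12, 7, 10], dtype=float)

# Design matrix: intercept column plus the two regressors
X = np.column_stack([np.ones_like(y), x1, x2])

# Ordinary least squares fit
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print(coef)  # approximately [-1.298, 0.377, 1.619]
```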

Important Key Points of Multiple Regression

  • Independent variables (predictors, regressors): These are the variables believed to influence the dependent variable. A multiple regression model has two or more independent variables.
  • Dependent variable (outcome, response): This is the variable one is trying to predict or explain using the independent variables.
  • Linear relationship: The core assumption is that the relationship between the independent variables and the dependent variable is linear; that is, the dependent variable changes at a constant rate for a unit change in an independent variable, holding all other variables constant.

The main goal of multiple regression analysis is to find a linear equation that best fits the data. The multiple regression analysis also allows one to:

  • Predict the value of the dependent variable based on the values of the independent variables.
  • Understand how changes in the independent variables affect the dependent variable while considering the influence of other independent variables.

Interpreting the Multiple Regression Coefficient

Each partial regression coefficient $\hat{\beta}_j$ gives the expected change in the dependent variable for a one-unit increase in $X_j$, holding all other regressors constant. In the example above, each one-unit increase in $X_1$ increases the predicted $y$ by about 0.377 units when $X_2$ is held fixed, and each one-unit increase in $X_2$ increases it by about 1.619 units when $X_1$ is held fixed.


Hierarchical Multiple Regression SPSS

In this tutorial, we will learn how to perform hierarchical multiple regression analysis in SPSS. Hierarchical regression is a variant of the basic multiple regression analysis that allows one to specify a fixed order of entry for the variables (regressors), either to control for the effects of covariates or to test the effects of certain predictors independent of the influence of others.

Step By Step Procedure of Hierarchical Multiple Regression SPSS

The basic command for hierarchical multiple regression analysis in SPSS is "Analyze -> Regression -> Linear":

Hierarchical Multiple Regression SPSS

In the main Linear Regression dialog box (as given below), enter the dependent variable, for example, the "income" variable from the sample file customer_dbase.sav available in the SPSS installation directory.

Next, enter a set of predictor variables into the "Independent(s)" box. These are the variables that you want SPSS to put into the regression model first (the variables that you want to control for when testing the other variables). For example, in this analysis, we want to find out whether the "Number of people in the house" predicts the "Household income in thousands".

We are also concerned that other variables, like age, education, gender, union membership, or retirement, might be associated with both the "number of people in the house" and the "household income in thousands". To make sure that these variables (age, education, gender, union member, and retired) do not explain away the entire association between the "number of people in the house" and "Household income in thousands", let's put them into the model first.

This ensures that they will get credit for any variability they share with the predictor that we are interested in, "Number of people in the house". Any observed effect of "Number of people in the house" can then be said to be independent of the effects of the variables that have already been controlled for. See the figure below.

Linear Regression Variable

In the next step, enter the variable that we are interested in, the "number of people in the house". To add it to the model, click the "NEXT" button. You will see all of the previously entered predictors disappear; note that they are still in the model, just not on the current screen (block). You will also see "Block 2 of 2" above the "Independent(s)" box.

Hierarchical Regression

Now click the “OK” button to run the analysis.

Note that you can also click the "NEXT" button again if you want to enter a third or fourth (and so on) block of variables.

Often researchers enter variables as related sets: for example, demographic variables in the first step, all potentially confounding variables in the second step, and the variables of greatest interest in the third step. However, this is not a requirement; one can also enter each variable as a separate step if that seems more logical based on the design of the experiment.

Output Hierarchical Multiple Regression Analysis

Using just the default “Enter” method, with all the variables in Block 1 (demographics) entered together, followed by “number of people in the house” as a predictor in Block 2, we get the following output:

Output Hierarchical Regression

The first table of the output confirms which variables were entered at each step.

The model summary table shows the proportion of variation in the dependent variable that can be accounted for by all the predictors together. The change in $R^2$ (R-squared) is a way to evaluate how much predictive power was added to the model by the addition of another variable in Step 2. In our example, the predictive power does not improve with the addition of another predictor in Step 2.
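The same block comparison can be sketched outside SPSS. Below is a minimal Python illustration (using statsmodels) of the $R^2$-change idea. The variable names (income, reside, age, ed, gender, union, retire) are assumptions modeled on the SPSS sample file, and the data here are synthetic stand-ins, so the numbers will not match the screenshots:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 200
# Synthetic stand-ins for the customer_dbase.sav variables
df = pd.DataFrame({
    "age": rng.integers(20, 70, n),
    "ed": rng.integers(8, 20, n),     # years of education
    "gender": rng.integers(0, 2, n),
    "union": rng.integers(0, 2, n),
    "retire": rng.integers(0, 2, n),
    "reside": rng.integers(1, 7, n),  # number of people in the house
})
df["income"] = 5 + 0.8 * df["ed"] + 0.3 * df["age"] + rng.normal(0, 10, n)

# Block 1: control variables only
X1 = sm.add_constant(df[["age", "ed", "gender", "union", "retire"]])
m1 = sm.OLS(df["income"], X1).fit()

# Block 2: add the predictor of interest
X2 = sm.add_constant(df[["age", "ed", "gender", "union", "retire", "reside"]])
m2 = sm.OLS(df["income"], X2).fit()

# R-squared change and the F test for that change
print("R2 change:", m2.rsquared - m1.rsquared)
print("F, p, df:", m2.compare_f_test(m1))
```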

Hierarchical Regression Output

The overall significance of the model can be checked from this ANOVA table. In this case, both models are statistically significant.

Hierarchical Regression Output

The coefficients table is used to check the individual significance of the predictors. For Model 2, the "Number of people in the house" is statistically non-significant and may therefore be excluded from the model.

Learn about Multiple Regression Analysis

R Language Frequently Asked Questions

Multiple Regression Model Introduction (2015)

Introduction to Multiple Regression Model

A multiple regression model (a multivariable regression) is a regression model with more than one predictor (independent or explanatory variable) used to explain a response (dependent) variable. A simple regression model has a single predictor for a single response, while a multiple (multivariable) regression model has more than one predictor. Both simple and multiple (multivariable) regression models can further be categorized as linear or non-linear.

Note that linearity does not depend on the predictors or on the addition of more predictors to a simple regression model; it refers to the parameters attached to the predictors. If the model is linear in its parameters, it is a linear model, whether it is a simple or a multiple (multivariable) regression model. The relationship between the variables is assumed to be linear, although this assumption can never be fully confirmed in the case of multiple linear regression.
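For example, the first model below is still a linear regression model, because it is linear in the parameters even though it contains $x^2$, whereas the second is nonlinear in the parameter $\beta$:

\[y = \alpha + \beta_1 x + \beta_2 x^2 + \varepsilon \qquad \text{(linear in the parameters)}\]

\[y = \alpha e^{\beta x} + \varepsilon \qquad \text{(nonlinear in the parameters)}\]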

However, as a rule, it is better to look at bivariate scatter diagrams of the variables of interest and check that there is no curvature in the relationships. A scatterplot matrix is an even more useful visualization of several variables of interest at once.
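A scatterplot matrix can be drawn in a few lines of Python. This is a minimal sketch with made-up data; with real data, replace df with your own variables:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix

# Made-up illustrative data: one response and two predictors
rng = np.random.default_rng(3)
x1 = rng.normal(size=100)
x2 = rng.normal(size=100)
y = 1 + 2 * x1 - x2 + rng.normal(size=100)
df = pd.DataFrame({"y": y, "x1": x1, "x2": x2})

# Pairwise scatter plots; look for curvature in the y-vs-x panels
scatter_matrix(df, figsize=(6, 6))
plt.show()
```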

The multiple regression model also allows us to determine the overall fit of the model (the total variance explained) and the relative contribution of each of the predictors to that total. For example, one may be interested in how much of the variation in exam performance can be explained by revision time, test anxiety, lecture attendance, and gender "as a whole", but also in the "relative contribution" of each independent variable in explaining the variance.

General Form of Multiple Regression Model

A multiple regression model has the form

\[y=\alpha+\beta_1 x_1+\beta_2 x_2+\cdots+\beta_k x_k+\varepsilon\]

Here $y$ is a continuous variable and the $x$’s are predictors, which may be continuous, categorical, or discrete. The above model is referred to as a linear multiple (multivariable) regression model.


Example of Multiple Regression Model

For example, one may predict college GPA using high school GPA, test scores, time given to study, and a rating of the high school as predictors. Other examples include:

  • How rainfall, temperature, and the amount of fertilizer affect crop growth.
  • The influence of various factors (such as cholesterol, blood pressure, or diabetes) on health outcomes.
  • How blood pressure depends on variables such as gender, age, height, weight, exercise, diet, and medication.
  • How the weight of a person is linearly related to their height and age.
  • The effect of education, gender, and profession on income.
  • How the price of a house depends on the size of the house, the number of rooms, the community, the facilities available, etc.

Assumptions of the Multiple Regression Model

Multiple regression models also have assumptions that need to be fulfilled. For example, the residuals should be normally distributed, there should be no collinearity/multicollinearity among the regressors (independent variables), the variance of the error terms should be homoscedastic, and the error terms should not be correlated (no autocorrelation).
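These assumptions can be checked with standard diagnostics. The sketch below is our own illustration with simulated data, using statsmodels and scipy, showing one common test for each assumption:

```python
import numpy as np
import statsmodels.api as sm
from scipy import stats
from statsmodels.stats.diagnostic import het_breuschpagan
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.stats.stattools import durbin_watson

# Simulated data that satisfy the assumptions
rng = np.random.default_rng(0)
n = 100
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 2 + 1.5 * x1 - 0.5 * x2 + rng.normal(size=n)

X = sm.add_constant(np.column_stack([x1, x2]))
resid = sm.OLS(y, X).fit().resid

# Normality of residuals: Shapiro-Wilk test
w, p = stats.shapiro(resid)
print("Shapiro-Wilk p-value:", p)

# Multicollinearity: variance inflation factors (column 0 is the constant)
for j in (1, 2):
    print(f"VIF for regressor {j}:", variance_inflation_factor(X, j))

# Homoscedasticity: Breusch-Pagan test
lm, lm_p, fval, f_p = het_breuschpagan(resid, X)
print("Breusch-Pagan p-value:", lm_p)

# No autocorrelation: Durbin-Watson statistic (values near 2 are good)
print("Durbin-Watson:", durbin_watson(resid))
```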

Common Applications of Multiple Regression Models

  • Marketing: Predicting customer spending based on factors like income, gender, age, and advertising exposure.
  • Social Science: Analyzing the factors that influence voting behavior, such as gender, education level, income, and political party affiliation.
  • Finance: Estimating stock prices based on company earnings, economic indicators, and market trends.
  • Predicting house prices: One can use factors like square area, number of bedrooms, and location to predict the selling price of a house.
  • Identifying risk factors for diseases: Researchers can use multiple regression to see how lifestyle choices, genetics, and environmental factors contribute to the risk of developing a particular disease.

Read Assumptions of Multiple Regression Model

Learn R Programming Language

Application of Regression in Medicine: A Quick Guide (2024)

The application of regression cannot be ignored, as regression is a powerful statistical tool widely used in medical research to understand relationships between variables. It helps identify risk factors, predict outcomes, and optimize treatment strategies.

Considering the application of regression analysis in the medical sciences, Chan et al. (2006) used multiple linear regression to estimate standard liver weight for assessing the adequacy of graft size in live donor liver transplantation and of the remnant liver in major hepatectomy for cancer. Standard liver weight (SLW) in grams, body weight (BW) in kilograms, gender (male = 1, female = 0), and other anthropometric data of 159 Chinese liver donors who underwent donor right hepatectomy were analyzed. The formula (fitted model)

 \[SLW = 218 + 12.3 \times BW + 51 \times gender\]

was developed with a coefficient of determination of $R^2=0.48$.


These results mean that in Chinese people, on average, for each 1-kg increase of BW, SLW increases about 12.3 g, and, on average, men have a 51-g higher SLW than women. Unfortunately, SEs and CIs for the estimated regression coefficients were not reported. Using Formula 6 in their article, the SLW for Chinese liver donors can be estimated if BW and gender are known. About 50% of the variance of SLW is explained by BW and gender.
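For illustration, the published formula is easy to apply in code. This is a minimal sketch; the function name standard_liver_weight is our own:

```python
def standard_liver_weight(bw_kg: float, male: bool) -> float:
    """SLW in grams from the Chan et al. (2006) fitted model:
    SLW = 218 + 12.3 * BW + 51 * gender (male = 1, female = 0)."""
    return 218 + 12.3 * bw_kg + 51 * (1 if male else 0)

# A 60-kg male donor: 218 + 12.3 * 60 + 51 = 1007 g
print(standard_liver_weight(60, male=True))   # 1007.0
print(standard_liver_weight(60, male=False))  # 956.0
```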

Regression analysis helps in:

  • Identifying risk factors: Determine which factors contribute to the development of a disease (For example, gender, age, smoking, and blood pressure for heart disease).
  • Predicting disease occurrence: Estimate the likelihood of a patient developing a disease based on specific risk factors. For example, logistic regression is used to predict the risk of diabetes based on factors like BMI, age, and family history.

The following types of regression models are widely used in medical sciences:

  • Linear regression: Used when the outcome variable is continuous (e.g., blood pressure, cholesterol levels).
  • Logistic regression: Used when the outcome variable is binary (e.g., disease present/absent, survival/death); a minimal sketch is given after this list.
  • Cox proportional hazards regression: Used for survival analysis (time-to-event data).
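As a minimal illustration of the logistic case, the following Python sketch (our own, with simulated data, using statsmodels) fits a logistic regression of disease status on BMI and age:

```python
import numpy as np
import statsmodels.api as sm

# Simulated data: disease status (0/1) against BMI and age
rng = np.random.default_rng(2)
n = 300
bmi = rng.normal(27, 4, n)
age = rng.normal(50, 12, n)
true_logit = -12 + 0.25 * bmi + 0.08 * age
disease = (rng.random(n) < 1 / (1 + np.exp(-true_logit))).astype(float)

# Logistic regression of disease status on BMI and age
X = sm.add_constant(np.column_stack([bmi, age]))
fit = sm.Logit(disease, X).fit(disp=0)
print(fit.params)  # coefficients are log-odds per unit change
```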

 Some other related articles (Application of Regression Analysis in Medical Sciences)

Reference of Article

  • Chan SC, Liu CL, Lo CM, et al. (2006). Estimating liver weight of adults by body weight and gender. World J Gastroenterol 12, 2217–2222.

R Programming Lectures