Leverage, Influential Point, and Outlier Diagnostics (2024)

This post discusses diagnostics for leverage points, influential points, and outliers. In a regression analysis, certain observations may unduly influence the fitted model and its estimates. Such observations may be classified as outliers, leverage points, or influential points.

Outliers, Leverage Points, and Influential Points

These three kinds of observations are described as follows:

  • Outliers: An outlier is an extreme observation that differs considerably from the other observations. An outlier may be due to a recording error, and the model cannot explain it. However, outliers may also contain important information. An outlier may be in $x$-space, $y$-space, or both.
  • Leverage: An observation with an unusual $x$ value is called a leverage point. A leverage point affects the model summary statistics (such as $R^2$, standard errors, etc.) but has little impact on the estimates of the regression coefficients. A leverage point has an unusual predictor value and is distant from the bulk of the observations.
  • Influence: An observation with an unusual $y$ value (and possibly an extreme $x$ value) is called an influential point. An influential point has a noticeable impact on the estimated regression coefficients and may change the direction of the slope.
[Figure: Diagnostics for outliers, leverage, and influential points. Image taken from: https://www.cbsd.org/]

Diagnostics for Outliers, Leverage Points, and Influential Points

There are several methods to detect/identify outliers, leverage points, and influential points.

Outliers

Outliers must be treated very carefully. Outliers may be detected by examining the following:

  • Normal quantile plots (departure from normality)
  • Residual plots (magnitude of the residuals)
  • Scaled residuals (a potential outlier if the magnitude exceeds 3)
[Figure: Outlier detection using a box plot]
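As an illustrative sketch (not part of the original post), the scaled-residual check can be carried out in Python with statsmodels; the data and the planted outlier below are simulated purely for demonstration:

```python
# A minimal sketch (simulated data) of the scaled-residual check:
# observations whose externally studentized residual exceeds 3 in
# magnitude are flagged as potential outliers.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
x = rng.uniform(0, 10, 50)
y = 2 + 3 * x + rng.normal(0, 1, 50)
y[10] += 12                      # plant an outlier in y-space

X = sm.add_constant(x)           # design matrix with an intercept column
results = sm.OLS(y, X).fit()

student = results.get_influence().resid_studentized_external
print(np.where(np.abs(student) > 3)[0])   # indices of flagged points
```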

Leverage Point

The diagonal elements of the “hat matrix” play an important role in detecting influential observations: $$h_{ii} = x_i' (X'X)^{-1}x_i,$$ where $X$ is the matrix of regressors and $x_i'$ is the $i$th row of the $X$ matrix.

A large diagonal element indicates an observation that is remote in $x$-space. Any observation whose $h_{ii}$ exceeds twice the average diagonal element ($\overline{h} = \frac{p}{n}$), that is, $h_{ii} > \frac{2p}{n}$, is considered a leverage point, where $p$ is the number of parameters in the model.
It is also useful to observe the studentized residuals in conjunction with $h_{ii}$ (that is, look for large hat diagonals together with large residual values).

Note that not all leverage points are influential unless they also have large residuals. Therefore, observations having large $h_{ii}$ values and large residuals are likely to be influential.
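Continuing the simulated example above, a minimal sketch of the leverage check computes the hat-matrix diagonals directly and flags any $h_{ii} > 2p/n$:

```python
# Hat-matrix diagonals computed directly from the design matrix X
# (statsmodels also provides them via results.get_influence().hat_matrix_diag).
import numpy as np

H = X @ np.linalg.inv(X.T @ X) @ X.T   # hat matrix H = X(X'X)^{-1}X'
h = np.diag(H)                         # leverages h_ii

n, p = X.shape                         # p = number of parameters
print(np.where(h > 2 * p / n)[0])      # indices of high-leverage points
```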

Influential Points

  • Cook’s Distance: Cook’s distance is a deletion diagnostic that measures the influence of the $i$th observation by removing it from the regression analysis. It compares the estimates based on all $n$ points, $\hat{\beta}$, with the estimates based on the deletion of the $i$th point, $\hat{\beta}_{(i)}$.
  • DFBETAS is another deletion diagnostic, used to measure how much each $\hat{\beta}_j$ changes due to an influential observation. A large value of DFBETAS indicates that the $i$th observation has considerable influence on the $j$th regression coefficient. If $|DFBETAS_{j,i}| > \frac{2}{\sqrt{n}}$, then the $i$th observation warrants further examination.
  • DFFITS is another deletion diagnostic, used to measure the influence of deleting the $i$th observation on the predicted (fitted) values. DFFITS is the number of standard deviations by which the fitted value changes if the $i$th observation is removed. If $|DFFITS_i| > 2\sqrt{\frac{p}{n}}$, then the $i$th observation warrants further examination.

Note that the case deletion diagnostics do not provide any information about the overall quality of prediction. However, model performance can be measured using the Generalized Variance (GV) and the Covariance Ratio (COVRATIO). The sketch below computes these deletion diagnostics for the model fitted earlier.
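As a hedged sketch (not the original post's code), statsmodels exposes all of these measures through a single influence object; the model `results` and design matrix `X` come from the earlier simulated example:

```python
# Deletion diagnostics from statsmodels' OLSInfluence object,
# continuing the fitted model `results` from the earlier sketch.
import numpy as np

influence = results.get_influence()
cooks_d, _ = influence.cooks_distance    # Cook's distance per observation
dfbetas = influence.dfbetas              # n x p matrix of DFBETAS values
dffits, _ = influence.dffits             # DFFITS per observation
covratio = influence.cov_ratio           # COVRATIO per observation

n, p = X.shape
print(np.where(np.abs(dfbetas) > 2 / np.sqrt(n)))        # DFBETAS cutoff
print(np.where(np.abs(dffits) > 2 * np.sqrt(p / n))[0])  # DFFITS cutoff
```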

In summary, outliers, leverage points, and influential observations are data points that deviate from the expected pattern of the data. Outliers are extreme values that lie far away from the other data points, leverage points have unusual predictor values, and influential points noticeably change the estimated regression model.

Read more about Regression Diagnostics

R Programming Language

Online Correlation and Regression Quiz

This post contains an Online Correlation and Regression Quiz covering Multiple Regression Analysis, the Coefficient of Determination (Explained Variation), Unexplained Variation, Model Selection Criteria, Model Assumptions, Interpretation of Results, Intercept, Slope, Partial Correlation, Significance Tests, Multicollinearity, Heteroscedasticity, Autocorrelation, etc. Click the links below to start with the MCQs on the Online Correlation and Regression Quiz.

MCQs Online Correlation and Regression Quiz

  • Regression Analysis Quiz 12
  • Evaluating Regression Models Quiz 11
  • MCQs Correlation and Regression 10
  • Linear Regression and Correlation Quiz 9
  • MCQs Correlation & Regression – 8
  • MCQs Correlation & Regression – 7
  • MCQs Correlation & Regression – 6
  • MCQs Correlation & Regression – 5
  • MCQs Correlation & Regression – 4
  • MCQs Correlation & Regression – 3
  • MCQs Correlation & Regression – 2
  • MCQs Correlation & Regression – 1
  • Application of Regression

Correlation Analysis

Correlation analysis is a statistical technique used to measure the strength and direction of the mutual relationship between two quantitative variables. The value of the correlation coefficient lies between $-1$ and $+1$. Regression analysis describes how an explanatory variable is numerically related to the dependent variable.

Correlation Coefficient Formula

The formula to compute the correlation coefficient is:

$$r = \frac{n\sum X_i Y_i - \sum X_i \sum Y_i}{\sqrt{[n\sum X_i^2 - (\sum X_i)^2][n\sum Y_i^2 - (\sum Y_i)^2]}}$$
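A quick numeric check of this formula (with simulated data, purely for illustration) confirms that it matches NumPy's built-in correlation:

```python
# Computing r from the raw-sum formula and checking it against numpy.
import numpy as np

rng = np.random.default_rng(1)
X = rng.uniform(0, 10, 30)
Y = 5 + 2 * X + rng.normal(0, 2, 30)
n = len(X)

r = (n * np.sum(X * Y) - np.sum(X) * np.sum(Y)) / np.sqrt(
    (n * np.sum(X**2) - np.sum(X)**2) * (n * np.sum(Y**2) - np.sum(Y)**2)
)
print(r, np.corrcoef(X, Y)[0, 1])   # the two values agree
```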

Regression Model

The general regression equation is $Y_i = a + bX_i$. The slope coefficient and intercept of the regression model can be computed as

$$\begin{align*}
b &= \frac{n\sum X_i Y_i - \sum X_i \sum Y_i}{n\sum X_i^2 - (\sum X_i)^2}\\
a &= \overline{Y} - b\overline{X}
\end{align*}$$

Both tools represent the linear relationship between the two quantitative variables. The relationship between variables can be observed using a graphical representation. We can also compute the strength of the relationship between variables by performing numerical calculations using appropriate computational formulas.


Note that neither regression nor correlation analyses can be interpreted as establishing some cause-and-effect relationships. Both correlation and regression are used to indicate how or to what extent the variables under study are associated (or mutually related) with each other. The correlation coefficient measures only the degree (strength) and direction of linear association between the two variables. Any conclusions about a cause-and-effect relationship must be based on the judgment of the analyst.

Learn the R Programming Language at RFAQS.com

Homoscedasticity: Constant Variance of a Random Variable (2020)

The term “homoscedasticity” refers to the assumption about the random variable $u$ (the error term) that its probability distribution remains the same for all observations of $X$; in particular, that the variance of each $u_i$ is the same for all values of the explanatory variables, i.e., the variance of the errors is constant across all levels of the independent variables. Symbolically, it can be represented as

$$Var(u_i) = E\{u_i - E(u_i)\}^2 = E(u_i^2) = \sigma_u^2 = \text{constant}$$

This assumption is known as the assumption of homoscedasticity, or the assumption of constant variance of the error terms $u_i$. It means that the variation of each $u_i$ around its zero mean does not depend on the values of $X$ (the independent variable). In practice, however, this assumption can fail, because the error term expresses the influence on the dependent variable of factors such as:

  • Errors in measurement
    Errors of measurement tend to be cumulative over time, and it is difficult to collect data and check their consistency and reliability, so the variance of $u_i$ may increase with increasing values of $X$.
  • Omitted variables
    Variables omitted from the function (regression model) tend to change in the same direction as $X$, causing an increase in the variance of the observations around the regression line.

Under homoscedasticity, the variance of each $u_i$ remains the same irrespective of small or large values of the explanatory variable, i.e., $\sigma_u^2$ is not a function of $X_i$: $\sigma_{u_i}^2 \ne f(X_i)$.


Consequences if Homoscedasticity Is Not Met

If the assumption of homoscedastic disturbances (constant variance) is not fulfilled, the following consequences arise:

  1. We cannot apply the usual formulas for the variances of the coefficients to conduct tests of significance and construct confidence intervals. The formulas $Var(\hat{\beta}_0)=\sigma_u^2 \frac{\sum X_i^2}{n \sum x_i^2}$ and $Var(\hat{\beta}_1) = \frac{\sigma_u^2}{\sum x_i^2}$, where $x_i = X_i - \overline{X}$, are inapplicable.
  2. If $u$ (the error term) is heteroscedastic, the OLS (Ordinary Least Squares) estimates do not have the minimum variance property in the class of unbiased estimators, i.e., they are inefficient in small samples. Furthermore, they are inefficient in large samples (that is, asymptotically inefficient).
  3. The coefficient estimates are still statistically unbiased even if the $u$'s are heteroscedastic: the $\hat{\beta}$'s have no statistical bias, i.e., $E(\hat{\beta}_i)=\beta_i$ (each coefficient's expected value equals the true parameter value).
  4. Prediction would be inefficient, because the variance of a prediction includes the variance of $u$ and of the parameter estimates, which are not minimal due to the incidence of heteroscedasticity. That is, the prediction of $Y$ for a given value of $X$, based on the estimates $\hat{\beta}$ from the original data, would have a high variance.
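Consequences 1 and 3 can be illustrated with a small Monte Carlo sketch (simulated data, not from the original post): under heteroscedastic errors the OLS slope remains unbiased, but the conventional OLS standard error misstates the slope's true sampling variability:

```python
# Monte Carlo sketch: heteroscedastic errors leave the OLS slope
# unbiased but make the conventional standard-error formula unreliable.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n, reps, beta1 = 100, 2000, 3.0
x = np.linspace(1, 10, n)
X = sm.add_constant(x)

slopes, reported_se = [], []
for _ in range(reps):
    u = rng.normal(0, 0.5 * x)           # error variance grows with x
    y = 1.0 + beta1 * x + u
    res = sm.OLS(y, X).fit()
    slopes.append(res.params[1])
    reported_se.append(res.bse[1])

print(np.mean(slopes))       # close to 3.0: estimates are still unbiased
print(np.std(slopes))        # the slope's true sampling standard deviation
print(np.mean(reported_se))  # the OLS formula's (misleading) average SE
```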

Tests for Homoscedasticity

Some tests commonly used for testing the assumption of homoscedasticity are:

  • Breusch-Pagan test
  • White test
  • Goldfeld-Quandt test
  • Park test
  • Glejser test
  • Spearman's rank correlation test
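As an illustrative sketch, one of these tests, the Breusch-Pagan test, is available in statsmodels and can be applied to the heteroscedastic simulation above (using its fitted model `res` and design matrix `X`):

```python
# Breusch-Pagan test on the last simulated fit from the sketch above;
# a small p-value rejects the null hypothesis of homoscedasticity.
from statsmodels.stats.diagnostic import het_breuschpagan

lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(res.resid, X)
print(lm_pvalue)
```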

Reference:
A. Koutsoyiannis (1972). “Theory of Econometrics”. 2nd Ed.

https://itfeature.com Statistics Help

Conducting Statistical Models in R Language