Leverage, Influential Point, and Outlier Diagnostics (2024)

This post discusses diagnostics for leverage points, influential points, and outliers. In a regression analysis, certain observations may unduly influence the fitted model and its estimates. Such observations may be classified as outliers, leverage points, or influential points.

Outliers, Leverage Points, and Influential Points

These three kinds of observations are described as follows:

  • Outliers: An outlier is an extreme observation that differs considerably from the other observations. An outlier may be due to a recording error, and the model cannot explain it. However, outliers may also contain important information. An outlier may be in $x$-space, $y$-space, or both.
  • Leverage: An observation with an unusual $x$ value is called a leverage point. A leverage point affects the model summary statistics (such as $R^2$, standard errors, etc.) but has little impact on the estimates of the regression coefficients. A leverage point has an unusual predictor value and is distant from the bulk of the observations.
  • Influence: An observation with an unusual $y$ value (and possibly an extreme $x$ value) is called an influential point. An influential point has a noticeable impact on the estimated regression coefficients and may change the direction of the slope.
[Figure: Diagnostics for outliers, leverage, and influential points. Image taken from: https://www.cbsd.org/]

Diagnostics for Outliers, Leverage Points, and Influential Points

There are several methods to detect/identify outliers, leverage points, and influential points.

Outliers

Outliers must be treated very carefully. Outliers may be detected by examining the following:

  • Normal quantile plots (departure from normality)
  • Residual plots (magnitude of the residuals)
  • Scaled residuals (a potential outlier if the magnitude exceeds 3)
[Figure: Outlier detection using a box plot]
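As an illustrative sketch (not part of the original post), the scaled-residual check can be carried out in Python with statsmodels; the data and the planted outlier below are simulated purely for demonstration:

```python
# A minimal sketch (simulated data) of the scaled-residual check:
# observations whose externally studentized residual exceeds 3 in
# magnitude are flagged as potential outliers.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
x = rng.uniform(0, 10, 50)
y = 2 + 3 * x + rng.normal(0, 1, 50)
y[10] += 12                      # plant an outlier in y-space

X = sm.add_constant(x)           # design matrix with an intercept column
results = sm.OLS(y, X).fit()

student = results.get_influence().resid_studentized_external
print(np.where(np.abs(student) > 3)[0])   # indices of flagged points
```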

Leverage Point

The diagonal elements of the “hat matrix” play an important role in detecting influential observations: $$h_{ii} = x_i' (X'X)^{-1}x_i,$$ where $X$ is the matrix of regressors and $x_i'$ is the $i$th row of the $X$ matrix.

A large diagonal element indicates an observation that is remote in $x$-space. Any observation whose $h_{ii}$ exceeds twice the average diagonal element ($\overline{h} = \frac{p}{n}$), that is, $h_{ii} > \frac{2p}{n}$, is considered a leverage point, where $p$ is the number of parameters in the model.
It is also useful to observe the studentized residuals in conjunction with $h_{ii}$ (that is, look for large hat diagonals together with large residual values).

Note that not all leverage points are influential unless they also have large residuals. Therefore, observations having large $h_{ii}$ values and large residuals are likely to be influential.
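Continuing the simulated example above, a minimal sketch of the leverage check computes the hat-matrix diagonals directly and flags any $h_{ii} > 2p/n$:

```python
# Hat-matrix diagonals computed directly from the design matrix X
# (statsmodels also provides them via results.get_influence().hat_matrix_diag).
import numpy as np

H = X @ np.linalg.inv(X.T @ X) @ X.T   # hat matrix H = X(X'X)^{-1}X'
h = np.diag(H)                         # leverages h_ii

n, p = X.shape                         # p = number of parameters
print(np.where(h > 2 * p / n)[0])      # indices of high-leverage points
```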

Influential Points

  • Cook’s Distance: Cook’s distance is a deletion diagnostic that measures the influence of the $i$th observation by removing it from the regression analysis. It compares the estimates based on all $n$ points, $\hat{\beta}$, with the estimates based on the deletion of the $i$th point, $\hat{\beta}_{(i)}$.
  • DFBETAS is another deletion diagnostic, used to measure how much each $\hat{\beta}_j$ changes due to an influential observation. A large value of DFBETAS indicates that the $i$th observation has considerable influence on the $j$th regression coefficient. If $|DFBETAS_{j,i}| > \frac{2}{\sqrt{n}}$, then the $i$th observation warrants further examination.
  • DFFITS is another deletion diagnostic, used to measure the influence of deleting the $i$th observation on the predicted (fitted) values. DFFITS is the number of standard deviations by which the fitted value changes if the $i$th observation is removed. If $|DFFITS_i| > 2\sqrt{\frac{p}{n}}$, then the $i$th observation warrants further examination.

Note that the case deletion diagnostics do not provide any information about the overall quality of prediction. However, model performance can be measured using the Generalized Variance (GV) and the Covariance Ratio (COVRATIO). The sketch below computes these deletion diagnostics for the model fitted earlier.
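As a hedged sketch (not the original post's code), statsmodels exposes all of these measures through a single influence object; the model `results` and design matrix `X` come from the earlier simulated example:

```python
# Deletion diagnostics from statsmodels' OLSInfluence object,
# continuing the fitted model `results` from the earlier sketch.
import numpy as np

influence = results.get_influence()
cooks_d, _ = influence.cooks_distance    # Cook's distance per observation
dfbetas = influence.dfbetas              # n x p matrix of DFBETAS values
dffits, _ = influence.dffits             # DFFITS per observation
covratio = influence.cov_ratio           # COVRATIO per observation

n, p = X.shape
print(np.where(np.abs(dfbetas) > 2 / np.sqrt(n)))        # DFBETAS cutoff
print(np.where(np.abs(dffits) > 2 * np.sqrt(p / n))[0])  # DFFITS cutoff
```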

In summary, outliers, leverage points, and influential observations are data points that deviate from the expected pattern of the data. Outliers are extreme values that lie far away from the other data points, leverage points have unusual predictor values, and influential points noticeably change the estimated regression model.

Read more about Regression Diagnostics

R Programming Language

Online Correlation and Regression Quiz

This post contains an Online Correlation and Regression Quiz covering Multiple Regression Analysis, the Coefficient of Determination (Explained Variation), Unexplained Variation, Model Selection Criteria, Model Assumptions, Interpretation of Results, Intercept, Slope, Partial Correlation, Significance Tests, Multicollinearity, Heteroscedasticity, Autocorrelation, etc. Click the links below to start with the MCQs on the Online Correlation and Regression Quiz.

MCQs Online Correlation and Regression Quiz

  • Regression Analysis Quiz 12
  • Evaluating Regression Models Quiz 11
  • MCQs Correlation and Regression 10
  • Linear Regression and Correlation Quiz 9
  • MCQs Correlation & Regression – 8
  • MCQs Correlation & Regression – 7
  • MCQs Correlation & Regression – 6
  • MCQs Correlation & Regression – 5
  • MCQs Correlation & Regression – 4
  • MCQs Correlation & Regression – 3
  • MCQs Correlation & Regression – 2
  • MCQs Correlation & Regression – 1
  • Application of Regression

Correlation Analysis

Correlation analysis is a statistical technique used to measure the strength and direction of the mutual relationship between two quantitative variables. The value of the correlation coefficient lies between $-1$ and $+1$. Regression analysis describes how an explanatory variable is numerically related to the dependent variable.

Correlation Coefficient Formula

The formula to compute the correlation coefficient is:

$$r = \frac{n\sum X_i Y_i - \sum X_i \sum Y_i}{\sqrt{[n\sum X_i^2 - (\sum X_i)^2][n\sum Y_i^2 - (\sum Y_i)^2]}}$$
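A quick numeric check of this formula (with simulated data, purely for illustration) confirms that it matches NumPy's built-in correlation:

```python
# Computing r from the raw-sum formula and checking it against numpy.
import numpy as np

rng = np.random.default_rng(1)
X = rng.uniform(0, 10, 30)
Y = 5 + 2 * X + rng.normal(0, 2, 30)
n = len(X)

r = (n * np.sum(X * Y) - np.sum(X) * np.sum(Y)) / np.sqrt(
    (n * np.sum(X**2) - np.sum(X)**2) * (n * np.sum(Y**2) - np.sum(Y)**2)
)
print(r, np.corrcoef(X, Y)[0, 1])   # the two values agree
```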

Regression Model

The general regression equation is $Y_i = a + bX_i$. The slope coefficient and intercept of the regression model can be computed as

$$\begin{align*}
b &= \frac{n\sum X_i Y_i - \sum X_i \sum Y_i}{n\sum X_i^2 - (\sum X_i)^2}\\
a &= \overline{Y} - b\overline{X}
\end{align*}$$

Both tools represent the linear relationship between the two quantitative variables. The relationship between variables can be observed using a graphical representation. We can also compute the strength of the relationship between variables by performing numerical calculations using appropriate computational formulas.


Note that neither regression nor correlation analyses can be interpreted as establishing some cause-and-effect relationships. Both correlation and regression are used to indicate how or to what extent the variables under study are associated (or mutually related) with each other. The correlation coefficient measures only the degree (strength) and direction of linear association between the two variables. Any conclusions about a cause-and-effect relationship must be based on the judgment of the analyst.

Learn the R Programming Language at RFAQS.com

Homoscedasticity: Constant Variance of a Random Variable (2020)

The term “homoscedasticity” refers to the assumption about the random variable $u$ (the error term) that its probability distribution remains the same for all observations of $X$; in particular, that the variance of each $u_i$ is the same for all values of the explanatory variables, i.e., the variance of the errors is constant across all levels of the independent variables. Symbolically, it can be represented as

$$Var(u_i) = E\{u_i - E(u_i)\}^2 = E(u_i^2) = \sigma_u^2 = \text{constant}$$

This assumption is known as the assumption of homoscedasticity, or the assumption of constant variance of the error terms $u_i$. It means that the variation of each $u_i$ around its zero mean does not depend on the values of $X$ (the independent variable). In practice, however, this assumption can fail, because the error term expresses the influence on the dependent variable of factors such as:

  • Errors in measurement
    Errors of measurement tend to be cumulative over time, and it is difficult to collect data and check their consistency and reliability, so the variance of $u_i$ may increase with increasing values of $X$.
  • Omitted variables
    Variables omitted from the function (regression model) tend to change in the same direction as $X$, causing an increase in the variance of the observations around the regression line.

Under homoscedasticity, the variance of each $u_i$ remains the same irrespective of small or large values of the explanatory variable, i.e., $\sigma_u^2$ is not a function of $X_i$: $\sigma_{u_i}^2 \ne f(X_i)$.


Consequences if Homoscedasticity Is Not Met

If the assumption of homoscedastic disturbances (constant variance) is not fulfilled, the following consequences arise:

  1. We cannot apply the usual formulas for the variances of the coefficients to conduct tests of significance and construct confidence intervals. The formulas $Var(\hat{\beta}_0)=\sigma_u^2 \frac{\sum X_i^2}{n \sum x_i^2}$ and $Var(\hat{\beta}_1) = \frac{\sigma_u^2}{\sum x_i^2}$, where $x_i = X_i - \overline{X}$, are inapplicable.
  2. If $u$ (the error term) is heteroscedastic, the OLS (Ordinary Least Squares) estimates do not have the minimum variance property in the class of unbiased estimators, i.e., they are inefficient in small samples. Furthermore, they are inefficient in large samples (that is, asymptotically inefficient).
  3. The coefficient estimates are still statistically unbiased even if the $u$'s are heteroscedastic: the $\hat{\beta}$'s have no statistical bias, i.e., $E(\hat{\beta}_i)=\beta_i$ (each coefficient's expected value equals the true parameter value).
  4. Prediction would be inefficient, because the variance of a prediction includes the variance of $u$ and of the parameter estimates, which are not minimal due to the incidence of heteroscedasticity. That is, the prediction of $Y$ for a given value of $X$, based on the estimates $\hat{\beta}$ from the original data, would have a high variance.
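Consequences 1 and 3 can be illustrated with a small Monte Carlo sketch (simulated data, not from the original post): under heteroscedastic errors the OLS slope remains unbiased, but the conventional OLS standard error misstates the slope's true sampling variability:

```python
# Monte Carlo sketch: heteroscedastic errors leave the OLS slope
# unbiased but make the conventional standard-error formula unreliable.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n, reps, beta1 = 100, 2000, 3.0
x = np.linspace(1, 10, n)
X = sm.add_constant(x)

slopes, reported_se = [], []
for _ in range(reps):
    u = rng.normal(0, 0.5 * x)           # error variance grows with x
    y = 1.0 + beta1 * x + u
    res = sm.OLS(y, X).fit()
    slopes.append(res.params[1])
    reported_se.append(res.bse[1])

print(np.mean(slopes))       # close to 3.0: estimates are still unbiased
print(np.std(slopes))        # the slope's true sampling standard deviation
print(np.mean(reported_se))  # the OLS formula's (misleading) average SE
```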

Tests for Homoscedasticity

Some tests commonly used for testing the assumption of homoscedasticity are:

  • Breusch-Pagan test
  • White test
  • Goldfeld-Quandt test
  • Park test
  • Glejser test
  • Spearman's rank correlation test
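As an illustrative sketch, one of these tests, the Breusch-Pagan test, is available in statsmodels and can be applied to the heteroscedastic simulation above (using its fitted model `res` and design matrix `X`):

```python
# Breusch-Pagan test on the last simulated fit from the sketch above;
# a small p-value rejects the null hypothesis of homoscedasticity.
from statsmodels.stats.diagnostic import het_breuschpagan

lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(res.resid, X)
print(lm_pvalue)
```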

Reference:
A. Koutsoyiannis (1972). “Theory of Econometrics”. 2nd Ed.

https://itfeature.com Statistics Help

Conducting Statistical Models in R Language