Outliers and Influential Observations

This section explains the difference between outliers and influential observations.

Outliers

Cases (observations or data points) that do not follow the model fitted to the rest of the data are called outliers. In regression, cases with large residuals are candidates for outliers. An outlier is thus a data point that diverges from the overall pattern in a sample. An outlier can distort the estimated relationship between the variables and may also change the slope of the regression line.
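The idea of flagging cases with large residuals can be sketched in a few lines. This is a minimal illustration, not a definitive procedure: the data are made up, the contaminated fifth case is an assumption introduced for the example, and the cutoff of 2 on the standardized residual is a common rule of thumb rather than a formal test.

```python
import numpy as np

# Hypothetical data: y is roughly 2 + 3x, except one case that breaks the pattern.
x = np.arange(1.0, 11.0)  # 1, 2, ..., 10
noise = np.array([0.1, -0.2, 0.15, -0.1, 0.05, 0.2, -0.15, 0.1, -0.05, -0.1])
y = 2.0 + 3.0 * x + noise
y[4] += 8.0  # contaminate the fifth case (index 4)

# Ordinary least squares fit with an intercept.
X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ beta

# Leverages (diagonal of the hat matrix) are needed to standardize the residuals.
h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)
n, p = X.shape
s2 = residuals @ residuals / (n - p)  # residual variance estimate
std_resid = residuals / np.sqrt(s2 * (1.0 - h))

# Rule of thumb (an assumption): flag |standardized residual| > 2.
outlier_idx = np.flatnonzero(np.abs(std_resid) > 2.0)
print("Candidate outliers at index:", outlier_idx)
```

A flagged case is only a candidate. Whether it is a recording error or a genuine feature of the data still has to be checked against the source.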

An outlier can be created by a shift in the location (mean) or in the scale (variability) of the process. It may be due to a recording error (which may be correctable) or to the sample not coming entirely from the same population. It may also arise when the values come from the same population but that population is non-normal (heavy-tailed). That is, outliers may result from a model specification based on wrong distributional assumptions.

Influential Observations

An influential observation is often an outlier in the x-direction. An observation may be influential because:

  1. It is unusually large or otherwise deviates in an unusually extreme way from the center of a reference distribution.
  2. It is associated with a unit that has a low selection probability and thus a high probability weight.
  3. It has a very large weight (relative to the weights of other units in the specified sub-population) due to problems with stratum jumping; sampling of birth units or highly seasonal units; large nonresponse adjustment factors arising from unusually low response rates within a given adjustment cell; unusual calibration-weighting effects; or other factors.
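The link between a low probability and a high weight can be illustrated with a small sketch. It assumes that the probability weight is the inverse of the unit's selection probability, which is the usual design-weight convention. The probabilities and values below are invented for illustration.

```python
import numpy as np

# Hypothetical selection probabilities and observed values for five sampled units.
selection_prob = np.array([0.5, 0.5, 0.4, 0.5, 0.01])
values = np.array([10.0, 12.0, 11.0, 9.0, 15.0])

# Assumed convention: probability weight = 1 / selection probability.
weights = 1.0 / selection_prob

# Contribution of each unit to the weighted total, and its share of that total.
contribution = weights * values
share = contribution / contribution.sum()

for w, s in zip(weights, share):
    print(f"weight = {w:6.1f}   share of weighted total = {s:.3f}")
```

The last unit has an unremarkable value, yet its weight lets it dominate the weighted total. That is the sense in which a single heavily weighted observation can be influential.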

Importance of Outliers and Influential Observations

Outliers and influential observations are important because:

  • Both outliers and influential observations can potentially mislead the interpretation of the regression model.
  • Outliers might indicate errors in the data or a non-linear relationship that the model cannot capture.
  • Influential observations can make the model seem more accurate than it is, masking underlying issues.

How to identify them?

Both outliers and influential observations can be identified by using:

  • Visual inspection: Scatterplots can reveal outliers as distant points.
  • Residual plots: Plotting residuals against predicted values or independent variables can reveal unusually large residuals and patterns that hint at outliers or influential observations.
  • Statistical diagnostics: Measures like Cook’s distance or leverage can quantify the influence of each data point.
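The statistical diagnostics mentioned above can be computed directly from a fitted simple linear regression. The following is a sketch under stated assumptions: the helper name `influence_measures` and the data are hypothetical, and the cutoffs (leverage above 2p/n, Cook's distance above 1) are common rules of thumb, not fixed standards.

```python
import numpy as np

def influence_measures(x, y):
    """Return leverages and Cook's distances for a simple linear regression."""
    X = np.column_stack([np.ones_like(x), x])  # intercept + slope
    n, p = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    e = y - X @ beta  # raw residuals
    h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)  # leverages
    s2 = e @ e / (n - p)  # residual variance estimate
    cooks = (e**2 / (p * s2)) * h / (1.0 - h) ** 2  # Cook's distance
    return h, cooks

# Hypothetical data: the last case is far out in the x-direction and off the trend.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 20.0])
y = np.array([1.1, 1.9, 3.2, 3.9, 5.1, 6.0, 6.9, 8.1, 10.0])

leverage, cooks_d = influence_measures(x, y)
n, p = len(x), 2

# Rules of thumb (assumptions): leverage > 2p/n, Cook's distance > 1.
high_leverage = np.flatnonzero(leverage > 2 * p / n)
influential = np.flatnonzero(cooks_d > 1.0)
print("High leverage:", high_leverage)
print("Influential (Cook's distance > 1):", influential)
```

In this made-up data the last case pulls the fitted line toward itself, so its raw residual is not the largest in the sample even though its Cook's distance is by far the largest. This is why residuals alone can miss influential observations.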

By being aware of outliers and influential observations, one can ensure that the regression analysis provides a more reliable picture of the relationship between variables.
