Outliers and Influential Observations

This post focuses on the difference between outliers and influential observations.

Outliers

Cases (observations or data points) that do not follow the model fitted to the rest of the data are called outliers. In regression, cases with large residuals are candidates for outliers. An outlier is thus a data point that diverges from the overall pattern in a sample. Such a point can influence the estimated relationship between the variables and may change the slope of the regression line.
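The idea that large residuals mark candidate outliers can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed procedure: the data are made up, the helper names `fit_line` and `flag_large_residuals` are hypothetical, and the cutoff of two residual standard errors is only a common rule of thumb assumed here.

```python
from statistics import mean

def fit_line(x, y):
    """Ordinary least squares for y = a + b*x; returns (intercept, slope)."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

def flag_large_residuals(x, y, threshold=2.0):
    """Return indices whose |residual| exceeds `threshold` residual standard errors."""
    intercept, slope = fit_line(x, y)
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    # Residual standard error with n - 2 degrees of freedom (two fitted parameters)
    s = (sum(e ** 2 for e in residuals) / (len(x) - 2)) ** 0.5
    return [i for i, e in enumerate(residuals) if abs(e) > threshold * s]

# Made-up data: y is roughly 2*x, except for the point at index 5
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 20.0, 14.2, 15.9, 18.1, 19.8]

candidates = flag_large_residuals(x, y)
print(candidates)
```

A flagged index is only a candidate: whether the case is a recording error or a genuine value still has to be checked against the data source.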

An outlier can arise from a shift in the location (mean) or in the scale (variability) of the process. It may be due to a recording error (which may be correctable) or to the sample not coming entirely from the same population. It may also occur when the values do come from a single population, but one that is non-normal (heavy-tailed). That is, apparent outliers may result from a misspecified model that rests on wrong distributional assumptions.
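To see why a heavy-tailed population produces apparent outliers under a normal assumption, one can compare how often each distribution yields a value more than 3 units from its center. The sketch below assumes a Student's t distribution with 3 degrees of freedom as the heavy-tailed example (a choice made here for illustration, since its CDF has a simple closed form); the function names are hypothetical.

```python
import math

def normal_two_sided_tail(k):
    """P(|Z| > k) for a standard normal Z."""
    return math.erfc(k / math.sqrt(2))

def t3_two_sided_tail(k):
    """P(|T| > k) for Student's t with 3 degrees of freedom (closed-form CDF)."""
    u = k / math.sqrt(3)
    cdf = 0.5 + (u / (1 + u * u) + math.atan(u)) / math.pi
    return 2 * (1 - cdf)

p_normal = normal_two_sided_tail(3.0)  # about 0.0027
p_heavy = t3_two_sided_tail(3.0)       # about 0.058

print(f"normal: {p_normal:.4f}, t(3): {p_heavy:.4f}")
```

Values that would be rare under normality are fairly routine under the heavy-tailed alternative, so flagging them as outliers reflects the wrong distributional assumption rather than bad data. Note that the two distributions are not matched on variance here; the comparison is only meant to show the difference in tail behavior.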

Influential Observations

An influential observation is a case whose removal would noticeably change the fitted model; it is often an outlier in the x-direction. Influential observations may arise from:

  1. observations that are unusually large or that otherwise deviate in unusually extreme ways from the center of a reference distribution;
  2. observations associated with a unit that has a low selection probability and thus a high probability weight;
  3. observations that have a very large weight (relative to the weights of other units in the specified sub-population) due to problems with stratum jumping; sampling of birth units or highly seasonal units; large nonresponse adjustment factors arising from unusually low response rates within a given adjustment cell; unusual calibration-weighting effects; or other factors.
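In a regression setting, being an outlier in the x-direction is usually measured by leverage, and overall influence by a diagnostic such as Cook's distance. The sketch below computes both for simple linear regression using the standard textbook formulas; the data are made up and the function name `leverage_and_cooks` is hypothetical. It does not cover the survey-weighting sources of influence listed above.

```python
from statistics import mean

def leverage_and_cooks(x, y):
    """Leverage h_i and Cook's distance D_i for the fit y = a + b*x."""
    n, p = len(x), 2  # p = number of fitted parameters (intercept and slope)
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    intercept = y_bar - slope * x_bar
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    s2 = sum(e ** 2 for e in residuals) / (n - p)  # residual variance
    # Leverage grows with the squared distance of x_i from the mean of x
    leverage = [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]
    # Cook's distance combines residual size with leverage
    cooks = [(e ** 2 / (p * s2)) * h / (1 - h) ** 2
             for e, h in zip(residuals, leverage)]
    return residuals, leverage, cooks

# Made-up data: the last point lies far out in x and off the trend of the rest
x = [1, 2, 3, 4, 5, 6, 7, 8, 20]
y = [1.1, 2.0, 2.9, 4.2, 5.1, 5.9, 7.0, 8.1, 10.0]

residuals, leverage, cooks = leverage_and_cooks(x, y)
most_influential = max(range(len(x)), key=lambda i: cooks[i])
print(most_influential, round(leverage[most_influential], 3))
```

In this example the last point has by far the highest leverage and Cook's distance, yet its raw residual is not the largest in the sample, because it pulls the fitted line toward itself. This is why screening residuals alone can miss influential observations.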
