Truth about Bias in Statistics

In Statistics, bias is defined as the difference between the expected value of a statistic and the true value of the corresponding parameter. The bias is therefore a measure of the systematic error of an estimator: it indicates how far, on average, the estimator lies from the true value of the parameter. For example, if we average a large number of estimates from an unbiased estimator, the result will be close to the correct value.

Bias in Statistics: The Difference between Expected and True Value

In other words, the bias is a systematic error in measurement or sampling; it tells how far off, on average, the estimator is from the truth.

Gauss, C.F. (1821) introduced the concept of an unbiased estimator during his work on the least-squares method.

The bias of an estimator of a parameter should not be confused with its degree of precision, as the degree of precision is a measure of the sampling error. Bias refers to favoring one group or outcome, intentionally or unintentionally, over the other groups or outcomes available in the population under study. Unlike random error, bias is a serious problem because it cannot be reduced simply by increasing the sample size and averaging the outcomes.


Several types of bias exist; they should not be considered mutually exclusive:

  • Selection Bias (arises due to systematic differences between the groups compared)
  • Exclusion Bias (arises due to the systematic exclusion of certain individuals from the study)
  • Analytical Bias (arises due to the way the results are evaluated)

Mathematically, bias can be defined as follows:

Let a statistic $T$ be used to estimate a parameter $\theta$. If $E(T) = \theta + \text{bias}(\theta)$, then $\text{bias}(\theta)$ is called the bias of the statistic $T$, where $E(T)$ represents the expected value of the statistic $T$.

Note that if $\text{bias}(\theta) = 0$, then $E(T) = \theta$, so $T$ is an unbiased estimator of the true parameter $\theta$.
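
As a minimal illustration of this definition in R (assuming a normal population with true variance $\sigma^2 = 4$ and samples of size 10, both arbitrary choices), the sample variance that divides by $n$ is a biased estimator of $\sigma^2$, while the usual $n-1$ version is unbiased:

```r
# Illustrating bias(theta) = E(T) - theta, using the sample variance as T
set.seed(123)
n      <- 10       # sample size (arbitrary)
sigma2 <- 4        # true population variance, the parameter theta (assumed)
reps   <- 10000    # number of simulated samples

biased <- replicate(reps, {
  x <- rnorm(n, mean = 0, sd = sqrt(sigma2))
  sum((x - mean(x))^2) / n          # divides by n: biased estimator
})
unbiased <- replicate(reps, {
  x <- rnorm(n, mean = 0, sd = sqrt(sigma2))
  var(x)                            # divides by n - 1: unbiased estimator
})

mean(biased)   - sigma2   # approximately -sigma2/n = -0.4 (the bias)
mean(unbiased) - sigma2   # approximately 0
```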

Types of Sample Selection Bias

Reference:
Gauss, C.F. (1821, 1823, 1826). Theoria Combinationis Observationum Erroribus Minimis Obnoxiae, Parts 1, 2 and Supplement. Werke 4, 1–108.

For further reading about Statistical Bias visit: Bias in Statistics.

Learn about Estimation and Types of Estimation

Outliers and Influential Observations

Here we will focus on the difference between outliers and influential observations.

Outliers

The cases (observations or data points) that do not follow the same pattern as the rest of the data under the model are called outliers. In Regression, cases with large residuals are candidates for outliers. An outlier is thus a data point that diverges from the overall pattern in a sample; it can therefore influence the estimated relationship between the variables and may also exert an influence on the slope of the regression line.

An outlier can be created by a shift in the location (mean) or in the scale (variability) of the process. An outlier may be due to recording errors (which may be correctable) or due to the sample not being drawn entirely from the same population. Outliers may also occur when the values come from the same population but that population is non-normal (heavy-tailed); that is, outliers may result from an incorrect specification based on the wrong distributional assumptions.
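
To see the effect on the slope concretely, here is a small R sketch with simulated data (the values, including the single hand-added outlying point, are arbitrary choices):

```r
# Effect of a single outlier on the estimated regression slope
set.seed(42)
x <- 1:20
y <- 2 + 0.5 * x + rnorm(20, sd = 1)   # data that follow the linear pattern

fit_clean <- lm(y ~ x)

x_out <- c(x, 30)                      # add one outlying point by hand
y_out <- c(y, 2)
fit_out <- lm(y_out ~ x_out)

coef(fit_clean)   # slope close to the true value 0.5
coef(fit_out)     # slope pulled away from 0.5 by the single outlier
```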


Influential Observations

An influential observation is often an outlier in the x-direction. Influential observations may arise from:

  1. observations that are unusually large or otherwise deviate in unusually extreme ways from the center of a reference distribution;
  2. observations associated with a unit that has a low probability of selection and therefore a high sampling weight;
  3. observations with a very large weight (relative to the weights of other units in the specified sub-population) due to problems with stratum jumping; sampling of birth units or highly seasonal units; large nonresponse adjustment factors arising from unusually low response rates within a given adjustment cell; unusual calibration-weighting effects; or other factors.

Importance of Outliers and Influential Observations

Outliers and Influential observations are important because:

  • Both outliers and influential observations can potentially mislead the interpretation of the regression model.
  • Outliers might indicate errors in the data or a non-linear relationship that the model cannot capture.
  • Influential observations can make the model seem more accurate than it is, masking underlying issues.

How to Identify Outliers and Influential Observations

Both outliers and influential observations can be identified by using:

  • Visual inspection: Scatterplots can reveal outliers as distant points.
  • Residual plots: Plotting residuals against predicted values or independent variables can show patterns indicative of influential observations.
  • Statistical diagnostics: Measures like Cook’s distance or leverage can quantify the influence of each data point.
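
A brief R sketch of these diagnostics, using simulated data with one unusual case (`rstandard()`, `hatvalues()`, and `cooks.distance()` are functions in base R's stats package):

```r
# Diagnostics for outliers and influential observations (simulated data)
set.seed(1)
dat <- data.frame(x = 1:25)
dat$y <- 3 + 0.8 * dat$x + rnorm(25, sd = 2)
dat$y[25] <- 40                   # make the last case unusual

fit <- lm(y ~ x, data = dat)

rstandard(fit)                    # standardized residuals (large values flag outliers)
hatvalues(fit)                    # leverage of each case (outlying in the x-direction)
cooks.distance(fit)               # Cook's distance: influence of each case on the fit

plot(fitted(fit), resid(fit),     # residual plot: residuals versus fitted values
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)
```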

By being aware of outliers and influential observations, one can ensure that the regression analysis provides a more reliable picture of the relationship between variables.

Learn R Programming Language

Error and Residual in Regression


In Statistics and Optimization, statistical errors and residuals are two closely related and easily confused measures of the deviation of an observation from its mean.

The term "error" is something of a misnomer: a statistical error is not a mistake, but the amount by which an observation differs from its expected value. The errors $e$ are unobservable random variables, assumed to have zero mean and to be uncorrelated, each with common variance $\sigma^2$.

A residual, on the other hand, is an observable estimate of the unobservable error. The residuals $\hat{e}$ are computed quantities with mean $E(\hat{e})=0$ and variance $V(\hat{e})=\sigma^2 (I-H)$, where $H$ is the hat matrix.

Like the errors, each of the residuals has zero mean, but each residual may have a different variance. Unlike the errors, the residuals are correlated, because the residuals are linear combinations of the errors. If the errors are normally distributed, so are the residuals.


Note that in a model fitted with an intercept the sum of the residuals is necessarily zero, and thus the residuals are necessarily not independent. The sum of the errors need not be zero; the errors are independent random variables if the individuals are chosen from the population independently.
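
A short simulated example in R makes the contrast visible, because in a simulation the errors are known by construction (the model, coefficients, and error variance below are arbitrary choices):

```r
# Errors are constructed here; residuals come from the fitted model
set.seed(7)
n <- 50
x <- runif(n, 0, 10)
e <- rnorm(n, mean = 0, sd = 2)   # true errors (unobservable in practice)
y <- 1 + 2 * x + e

fit  <- lm(y ~ x)
ehat <- resid(fit)                # residuals: observable estimates of the errors

sum(ehat)                         # essentially zero (the model has an intercept)
sum(e)                            # need not be zero

hatvalues(fit)                    # diagonal h_ii of the hat matrix H;
                                  # Var(ehat_i) = sigma^2 * (1 - h_ii), so the
                                  # residuals have unequal variances
```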

The differences between errors and residuals in Regression are:

| Sr. No. | Errors | Residuals |
|---------|--------|-----------|
| 1 | Error represents the unobservable difference between an actual value $y$ of the dependent variable and its true population mean. | Residuals represent the observable difference between an actual value $y$ of the dependent variable and its predicted value according to the regression model. |
| 2 | Error is a theoretical concept because the true population mean is usually unknown. | One can calculate residuals because we have the data and the fitted model. |
| 3 | Errors are assumed to be random and independent, with a mean of zero. | Residuals are considered estimates of the errors for each data point. |

Residuals are used in various ways to evaluate the regression model, including:

  • Residual plots: Plots of the residuals against the independent variable or the predicted values, used to check the fit visually.
  • Mean Squared Error (MSE): The MSE measures the average of the squared residuals, that is, the average squared difference between the observed and predicted values.
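
As a hedged sketch (reusing the kind of simulated fit shown earlier), the MSE can be obtained directly from the residuals; note that some texts divide by the residual degrees of freedom rather than by $n$:

```r
# MSE computed from the residuals of a fitted regression model
set.seed(7)
x   <- runif(50, 0, 10)
y   <- 1 + 2 * x + rnorm(50, sd = 2)
fit <- lm(y ~ x)

mean(resid(fit)^2)                      # MSE: average of the squared residuals
sum(resid(fit)^2) / df.residual(fit)    # estimate of sigma^2 (divides by n - p)
```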

In essence, understanding errors and residuals helps the researcher gauge how well the regression model captures the underlying relationship between variables, despite the inherent randomness or “noise” in real-world data.

FAQs about Errors and Residuals

  1. What is an Error?
  2. What are residuals in regression?
  3. What is the purpose of residual plots?
  4. What is a mean squared error (MSE)?
  5. Differentiate between error and residual.
  6. Discuss the sum of residuals and the sum of errors.

Statistics Help: https://itfeature.com

Learn about Simple Linear Regression Models

Statistical Models in R Language

P-value Interpretation and Misinterpretation of P-value

The P-value is a probability, with a value ranging from zero to one. It is a measure of how much evidence the data provide against the null hypothesis: the smaller the P-value, the more evidence we have against $H_0$. Here we will discuss the P-value and its interpretation.

P-value Definition

The largest significance level at which we would fail to reject the null hypothesis (equivalently, the smallest level at which it would be rejected). It enables us to test the hypothesis without first specifying a value for $\alpha$. OR

The probability of observing a sample value as extreme as, or more extreme than, the value observed, given that the null hypothesis is true.
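
As a concrete illustration of this definition (the sample data and the hypothesized mean below are invented), a two-sided one-sample t-test P-value can be computed in R as follows:

```r
# P-value: probability of a test statistic as extreme as, or more extreme
# than, the observed one, computed assuming the null hypothesis is true
set.seed(11)
x   <- rnorm(20, mean = 10.8, sd = 2)   # sample data (illustrative)
mu0 <- 10                               # hypothesized mean under H0

t_obs <- (mean(x) - mu0) / (sd(x) / sqrt(length(x)))
p_val <- 2 * pt(-abs(t_obs), df = length(x) - 1)   # two-sided p-value

p_val
t.test(x, mu = mu0)$p.value             # same value from the built-in test
```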


P-value Interpretation

In general, the P-value is interpreted as follows: if the P-value is smaller than the chosen significance level, $H_0$ (the null hypothesis) is rejected, accepting the risk that a true $H_0$ is being rejected; if the P-value is larger than the significance level, $H_0$ is not rejected.


If the P-value is less than

  • 0.10, we have some evidence that $H_0$ is not true
  • 0.05, we have strong evidence that $H_0$ is not true
  • 0.01, we have very strong evidence that $H_0$ is not true
  • 0.001, we have extremely strong evidence that $H_0$ is not true

Misinterpretation of a P-value

Many people misunderstand P-values. For example, if the P-value is 0.03, it means that there is a 3% chance of observing a difference as large as the one you observed even if the two population means are identical (i.e., the null hypothesis is true). It is tempting to conclude that there is therefore a 97% chance that the difference you observed reflects a real difference between the populations and a 3% chance that it is due to chance; however, that conclusion would be incorrect. What you can say is that random sampling from identical populations would lead to a difference smaller than you observed in 97% of experiments and larger than you observed in 3% of experiments.
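
This reading of the P-value can be checked by simulation. The R sketch below assumes two identical normal populations (so the null hypothesis is true) and an "observed" difference chosen so that its two-sided P-value is about 0.03; roughly 3% of simulated experiments then show a larger difference and 97% a smaller one:

```r
# Simulate many two-sample experiments from identical populations (H0 true)
set.seed(2024)
n     <- 30
reps  <- 10000
diffs <- replicate(reps, mean(rnorm(n)) - mean(rnorm(n)))

# Pick an "observed" difference whose two-sided p-value is about 0.03
obs_diff <- quantile(abs(diffs), 0.97)

mean(abs(diffs) >= obs_diff)   # about 0.03: experiments with a larger difference
mean(abs(diffs) <  obs_diff)   # about 0.97: experiments with a smaller difference
```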

Note that p-values are a valuable tool in hypothesis testing, but they should be used thoughtfully and in conjunction with other analyses.

Statistics Help

Read More about P-value Interpretation

Read More on Wikipedia

R Frequently Asked Questions