Error and Residual in Regression


In statistics and optimization, statistical errors and residuals are two closely related and easily confused measures of the deviation of an observed value from a reference value (the true mean or the fitted value).

The term error is something of a misnomer: an error is not a mistake, but the amount by which an observation differs from its expected value (the true population mean). The errors $e$ are unobservable random variables, assumed to have zero mean and uncorrelated elements, each with common variance $\sigma^2$.

A residual, on the other hand, is an observable estimate of the unobservable error. The residuals $\hat{e}$ are computed quantities with mean $E(\hat{e})=0$ and variance $V(\hat{e})=\sigma^2 (I-H)$, where $H$ is the hat matrix; elementwise, $V(\hat{e}_i)=\sigma^2 (1-h_{ii})$.

Like the errors, each of the residuals has zero mean, but each residual may have a different variance. Unlike the errors, the residuals are correlated. The residuals are linear combinations of the errors, so if the errors are normally distributed, so are the residuals.


Note that, in a model with an intercept, the sum of the residuals is necessarily zero, and thus the residuals are necessarily not independent. The sum of the errors need not be zero; the errors are independent random variables if the individuals are chosen from the population independently.
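A minimal R sketch of these properties, using the built-in `cars` data set purely as an illustrative example, verifies that the residuals of a model with an intercept sum to zero and that each residual has variance $\sigma^2(1-h_{ii})$:

```r
# Minimal sketch: properties of residuals from a fitted linear model.
# The built-in 'cars' data set is used only for illustration.
fit <- lm(dist ~ speed, data = cars)

res <- residuals(fit)        # observable residuals (estimates of the errors)
sum(res)                     # numerically zero: residuals sum to zero

h <- hatvalues(fit)          # leverages h_ii, the diagonal of the hat matrix H
sigma2_hat <- summary(fit)$sigma^2
sigma2_hat * (1 - h)         # estimated variance of each residual: sigma^2 * (1 - h_ii)
```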

The differences between errors and residuals in Regression are:

1) Errors represent the unobservable difference between an actual value $y$ of the dependent variable and its true population mean, whereas residuals represent the observable difference between an actual value $y$ of the dependent variable and its predicted value according to the regression model.
2) Errors are a theoretical concept because the true population mean is usually unknown, whereas residuals can be calculated because we have the data and the fitted model.
3) Errors are assumed to be random and independent, with a mean of zero, whereas residuals serve as estimates of the errors for each data point.

Residuals are used in various ways to evaluate the regression model, including:

  • Residual plots: plots of the residuals against the independent variable or the predicted values, used to check the model assumptions visually.
  • Mean Squared Error (MSE): the average of the squared residuals, which measures how far, on average, the observations fall from the fitted values (a brief R sketch of both diagnostics follows this list).
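Both diagnostics are straightforward to produce in R; the sketch below reuses an illustrative `lm` fit on the built-in `cars` data, chosen only for demonstration:

```r
# Sketch: residual plot and Mean Squared Error for a fitted linear model.
fit <- lm(dist ~ speed, data = cars)   # illustrative model on the built-in 'cars' data

# Residual plot: residuals versus fitted (predicted) values
plot(fitted(fit), residuals(fit),
     xlab = "Fitted values", ylab = "Residuals",
     main = "Residuals vs Fitted")
abline(h = 0, lty = 2)                 # reference line at zero

# Mean Squared Error: the average of the squared residuals
mse <- mean(residuals(fit)^2)
mse
```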

In essence, understanding errors and residuals helps the researcher gauge how well the regression model captures the underlying relationship between variables, despite the inherent randomness or “noise” in real-world data.

FAQS about Errors and Residuals

  1. What is an Error?
  2. What are residuals in regression?
  3. What is the purpose of residual plots?
  4. What is a mean squared error (MSE)?
  5. Differentiate between error and residual.
  6. Discuss the sum of residuals and the sum of errors.
Statistics Help: https://itfeature.com

Learn about Simple Linear Regression Models

Statistical Models in R Language

P-value Interpretation and Misinterpretation of P-value

The P-value is a probability, with a value ranging from zero to one. It is a measure of how much evidence the data provide against the null hypothesis, not the probability that $H_0$ is false: the smaller the p-value, the more evidence we have against $H_0$. Here we will discuss the P-value and its interpretation.

P-value Definition

The largest significance level at which we would fail to reject the null hypothesis. It enables us to test a hypothesis without first specifying a value for $\alpha$. Alternatively:

The probability of observing a sample value as extreme as, or more extreme than, the value observed, given that the null hypothesis is true.
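For instance, a small R sketch (assuming, purely for illustration, a two-sided $z$-test with an observed statistic of $z = 2.1$) computes the p-value as the probability of a value at least this extreme under $H_0$:

```r
# Sketch: two-sided p-value P(|Z| >= |observed z|) under H0, for an illustrative z = 2.1
z_obs <- 2.1
p_value <- 2 * pnorm(-abs(z_obs))   # tail probability from the standard normal
p_value                             # roughly 0.036
```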


P-value Interpretation

In general, the P-value is interpreted as follows: if the P-value is smaller than the chosen significance level, $H_0$ (the null hypothesis) is rejected, accepting the risk that we may be rejecting a true $H_0$ (a Type I error); if the P-value is larger than the significance level, $H_0$ is not rejected.


If the P-value is less than

  • 0.10, we have some evidence that $H_0$ is not true
  • 0.05, we have strong evidence that $H_0$ is not true
  • 0.01, we have very strong evidence that $H_0$ is not true
  • 0.001, we have extremely strong evidence that $H_0$ is not true
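This decision rule and rough evidence scale can be applied directly to the p-value reported by any test. The R sketch below uses a two-sample t-test on simulated data; the data and the 0.05 level are assumptions made only for illustration:

```r
# Sketch: compare a test's p-value with a chosen significance level.
set.seed(123)                          # simulated samples, for illustration only
x <- rnorm(30, mean = 10, sd = 2)
y <- rnorm(30, mean = 11, sd = 2)

result <- t.test(x, y)                 # two-sample t-test
result$p.value                         # the reported p-value

alpha <- 0.05                          # chosen significance level
if (result$p.value < alpha) "Reject H0" else "Do not reject H0"
```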

Misinterpretation of a P-value

Many people misunderstand P-values. For example, if the P-value is 0.03, it means that there is a 3% chance of observing a difference at least as large as the one you observed even if the two population means are the same (i.e., the null hypothesis is true). It is tempting to conclude, therefore, that there is a 97% chance that the difference you observed reflects a real difference between populations and a 3% chance that the difference is due to chance. However, this would be an incorrect conclusion. What you can say is that random sampling from identical populations would lead to a difference smaller than you observed in 97% of experiments and a difference at least as large as you observed in 3% of experiments.

Note that p-values are a valuable tool in hypothesis testing, but they should be used thoughtfully and in conjunction with other analyses.

Statistics Help

Read More about P-value Interpretation

Read More on Wiki-Pedia

R Frequently Asked Questions

Sampling and Non Sampling Errors

Before differentiating between sampling and non-sampling errors, let us first define the term error.

The difference between an estimated value and the population’s true value is called an error. A sample estimate is used to describe a characteristic of a population, but a sample, being only a part of the population, cannot provide a perfect representation of it, no matter how carefully the sample is selected. In general, an estimate is rarely equal to the true value, and we may ask how close the sample estimate will be to the population’s true value.

Two Kinds of Errors: Sampling and Non Sampling Errors

There are two kinds of errors, namely:

  1. Sampling Errors (random errors)
  2. Non-Sampling Errors (non-random errors)

  1. Sampling Errors
    A sampling error is the difference between the value of a statistic obtained from an observed random sample and the value of the corresponding population parameter being estimated. Let $T$ be the sample statistic used to estimate the population parameter $\theta$; the sampling error, denoted by $E$, is $E = T - \theta$. The magnitude of the sampling error reveals the precision of the estimate: the smaller the sampling error, the greater the precision. The sampling error can be reduced (a small simulation sketch follows this list):

    i)   by increasing the sample size
    ii)  by improving the sampling design
    iii) by using supplementary information

  2. Non-Sampling Errors
    The errors that are caused by sampling the wrong population of interest and by response bias, as well as those made by an investigator in collecting, analyzing, and reporting the data, are all classified as non-sampling errors or non-random errors. These errors are present in a complete census as well as in a sample survey.
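A small simulation sketch in R illustrates the sampling error $E = T - \theta$; the population, sample sizes, and number of repetitions are assumptions chosen only for demonstration:

```r
# Sketch: the sampling error E = T - theta shrinks, on average, as the sample size grows.
set.seed(42)
population <- rnorm(100000, mean = 50, sd = 10)   # hypothetical population
theta <- mean(population)                         # true population mean

sampling_error <- function(n) {
  T_stat <- mean(sample(population, n))           # sample mean used as the estimator T
  T_stat - theta                                  # sampling error E = T - theta
}

# Average absolute sampling error over repeated samples, for two sample sizes
mean(abs(replicate(1000, sampling_error(10))))    # larger error for n = 10
mean(abs(replicate(1000, sampling_error(1000))))  # smaller error for n = 1000
```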

Learn R Programming Language


Common Log and Natural Log

Difference between Common Log and Natural Log

In this post, we will learn about the difference between Common Log and Natural Log.

The logarithm of a number is the exponent by which another fixed value, the base, has to be raised to produce that number. For example, the logarithm of 1000 to base 10 is 3, since $1000 = 10^3$. Logarithms were introduced by John Napier in the early 17th century to simplify calculation and were widely adopted by scientists, engineers, and others to perform computations more easily using logarithm tables. The logarithm to base $b=10$ is called the common logarithm and has many applications in science and engineering, while the natural logarithm has the constant $e$ ($\approx 2.718281828$) as its base and is written as $\ln(x)$ or $\log_e(x)$.

The common log is used in most exponential scales, such as the pH scale in chemistry (for measuring acidity and alkalinity), the Richter scale (for measuring the intensity of earthquakes), and so on. It is so common that if you find no base written, you can assume the base is 10, that is, $\log x$ denotes the common log.


The natural logarithm is widely used in pure mathematics, especially calculus. The natural logarithm of a number $x$ is the power to which $e$ has to be raised to equal $x$. For example, $\ln(7.389\ldots)$ is 2, because $e^2 = 7.389\ldots$. The natural log of $e$ itself, $\ln(e)$, is 1 because $e^1=e$, while the natural logarithm of 1, $\ln(1)$, is 0, since $e^0=1$.
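These relationships are easy to verify numerically; in R, `log()` is the natural log and `log10()` the common log:

```r
# Sketch: common log (base 10) versus natural log (base e) in R
log10(1000)   # 3, since 1000 = 10^3
log(exp(2))   # 2, the natural log of e^2 = 7.389...
log(exp(1))   # 1, since ln(e) = 1
log(1)        # 0, since e^0 = 1
```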

The question is: “The reason for choosing 10 is obvious, but why $e=2.718\ldots$?”

The answer goes back 300 years or more, to Euler (from whose name the symbol $e$ comes). The function $e^x$ is, up to a constant multiple, the only function whose derivative (and consequently whose integral) is itself: $(e^x)' = e^x$; no other function has this characteristic. The number $e$ can be obtained by several numerical and analytical methods, most often infinite summations. This number also plays an important role in complex analysis.

Suppose you have a hundred rupees and the interest rate is 10%. After one compounding you will have Rs. 110, and the next 10%, now of Rs. 110, raises your amount to Rs. 121, and so on. What happens when the interest is compounded continuously (all the time)? You might think you will soon have an infinite amount of money, but actually you have your initial deposit times $e$ raised to the power of the interest rate times the amount of time:

$$P=P_0 e^{kt}$$

where $k$ is the growth rate or interest rate, $t$ is the time period, $P$ is the value at time $t$, and $P_0$ is the value at time $t=0$.

The intuitive explanation is this: $e^x$ is the amount of continuous growth after a certain amount of time, and the natural log gives you the time needed to reach a certain level of continuous growth.
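A brief R sketch of this growth/time duality, where the 10% rate, the one-year horizon, and the doubling target are assumptions for illustration:

```r
# Sketch: continuous growth P = P0 * exp(k * t), and ln() recovering the time needed.
P0 <- 100                  # initial deposit (Rs.)
k  <- 0.10                 # continuous growth (interest) rate

P0 * exp(k * 1)            # amount after 1 year of continuous compounding (about 110.52)

t_needed <- log(2) / k     # time needed for the amount to double
t_needed                   # about 6.93 years, since exp(0.10 * 6.93) is about 2
```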

Learn more about Natural Logarithms

R Frequently Asked Questions
