# Bias (Statistical Bias)

Bias is defined as the difference between the expected value of a statistic and the true value of the corresponding parameter. The bias is therefore a measure of the systematic error of an estimator: it indicates how far, on average, the estimator lies from the true value of the parameter. For example, if we average the estimates produced by an unbiased estimator over a large number of samples, the average will be close to the true value.

Gauss, C.F. (1821) introduced the concept of an unbiased estimator during his work on the least squares method.

The bias of an estimator of a parameter should not be confused with its degree of precision, since the degree of precision is a measure of the sampling error.

There are several types of bias, which should not be considered mutually exclusive:

• Selection Bias (arises due to systematic differences between the groups compared)
• Exclusion Bias (arises due to the systematic exclusion of certain individuals from the study)
• Analytical Bias (arises due to the way that the results are evaluated)

Mathematically, bias can be defined as follows.

Let T be a statistic used to estimate a parameter θ. If E(T) = θ + b(θ), then b(θ) is called the bias of the statistic T, where E(T) represents the expected value of T. Note that if b(θ) = 0, then E(T) = θ, so T is an unbiased estimator of θ.
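The definition E(T) = θ + b(θ) can be checked numerically. The following is a minimal sketch, not part of the original text, that assumes a standard normal population (so θ, the true variance, is 1) and compares two variance estimators: one dividing by n and one dividing by n − 1. The function names, sample size, and trial count are illustrative choices.

```python
import random
from statistics import fmean

def variance_divide_by_n(sample):
    """Variance estimator that divides the sum of squares by n."""
    m = sum(sample) / len(sample)
    return sum((x - m) ** 2 for x in sample) / len(sample)

def variance_divide_by_n_minus_1(sample):
    """Variance estimator that divides the sum of squares by n - 1."""
    m = sum(sample) / len(sample)
    return sum((x - m) ** 2 for x in sample) / (len(sample) - 1)

def estimate_bias(estimator, true_value, sample_size, trials, rng):
    """Approximate b(theta) = E(T) - theta by averaging T over many samples."""
    estimates = []
    for _ in range(trials):
        # Assumed population: N(0, 1), so the true variance is 1.
        sample = [rng.gauss(0.0, 1.0) for _ in range(sample_size)]
        estimates.append(estimator(sample))
    return fmean(estimates) - true_value

rng = random.Random(42)
bias_n = estimate_bias(variance_divide_by_n, 1.0, 5, 20000, rng)
bias_n_minus_1 = estimate_bias(variance_divide_by_n_minus_1, 1.0, 5, 20000, rng)

print(f"divide by n:     estimated bias {bias_n:.3f}")          # theory: -1/n = -0.2
print(f"divide by n - 1: estimated bias {bias_n_minus_1:.3f}")  # theory: 0
```

The simulated values are only approximations of E(T) − θ; they approach the theoretical bias as the number of trials grows.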

Reference:
Gauss, C.F. (1821, 1823, 1826). Theoria Combinationis Observationum Erroribus Minimis Obnoxiae, Parts 1, 2 and suppl. Werke 4, 1-108.

# Difference between a probability value and the significance level?

Basically, in hypothesis testing the goal is to see whether the probability value is less than or equal to the significance level (i.e., is p ≤ alpha). The significance level is also called the size of the test or the size of the critical region. It is generally specified before any samples are drawn, so that the results obtained will not influence our choice.

• The probability value (also called the p-value) is the probability of obtaining the result observed in your research study (or an even more extreme result), under the assumption that the null hypothesis is true (i.e., if the null were true).
• In hypothesis testing, the researcher assumes that the null hypothesis is true and then sees how often the observed finding would occur if this assumption were true (i.e., the researcher determines the p-value).
• The significance level (also called the alpha level) is the cutoff value the researcher selects and then uses to decide when to reject the null hypothesis.
• Most researchers select the significance or alpha level of .05 to use in their research; hence, they reject the null hypothesis when the p-value is less than or equal to .05.
• The key idea of hypothesis testing is that you reject the null hypothesis when the p-value is less than or equal to the significance level (commonly .05).
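The comparison of a p-value with the significance level can be sketched as follows. This example is an illustration under stated assumptions: a two-sided z-test for a mean with a known population standard deviation, and made-up numbers (sample mean 103, null mean 100, sigma 15, n = 100). The function name `two_sided_p_value` is hypothetical.

```python
from statistics import NormalDist

def two_sided_p_value(sample_mean, null_mean, sigma, n):
    """P-value of a two-sided z-test for a mean, assuming sigma is known."""
    z = (sample_mean - null_mean) / (sigma / n ** 0.5)
    # Probability of a result at least this extreme if the null is true.
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))

alpha = 0.05  # significance level, chosen before looking at the data
p_value = two_sided_p_value(sample_mean=103.0, null_mean=100.0, sigma=15.0, n=100)
reject_null = p_value <= alpha

print(f"p-value = {p_value:.4f}, alpha = {alpha}, reject null: {reject_null}")
```

Here the test statistic is z = 2, so the p-value is slightly below .05 and the null hypothesis is rejected at the .05 level; with alpha = .01 the same data would not lead to rejection.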

# Type I and Type II Errors

In hypothesis testing there are two possible errors we can make: Type I and Type II errors.

• A Type I error occurs when you reject a true null hypothesis (remember that when the null hypothesis is true you hope to retain it).
α = P(Type I error) = P(rejecting the null hypothesis when it is true)
A Type I error is conventionally treated as more serious than a Type II error, and therefore as more important to avoid.
• A Type II error occurs when you fail to reject a false null hypothesis (remember that when the null hypothesis is false you hope to reject it).
β = P(Type II error) = P(failing to reject the null hypothesis when the alternative hypothesis is true)
• The best way to allow yourself to set a low alpha level (i.e., to have a small chance of making a Type I error) and to have a good chance of rejecting the null when it is false (i.e., to have a small chance of making a Type II error) is to increase the sample size.
• The key in hypothesis testing is to use a large sample in your research study rather than a small sample!
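The effect of sample size on the Type II error can be illustrated with a small calculation. This is a sketch under assumptions not stated in the text: a one-sided z-test with known sigma, a fixed alpha of .05, and a hypothetical true effect of 2 units with sigma = 10. The function name `type_ii_error` is illustrative.

```python
from statistics import NormalDist

def type_ii_error(alpha, effect, sigma, n):
    """Beta for a one-sided z-test of H0: mu = mu0 against mu = mu0 + effect."""
    std_normal = NormalDist()
    critical_z = std_normal.inv_cdf(1.0 - alpha)  # rejection cutoff under H0
    shift = effect / (sigma / n ** 0.5)           # true effect in standard-error units
    # Probability that the test statistic falls below the cutoff when H1 is true.
    return std_normal.cdf(critical_z - shift)

betas = {n: type_ii_error(alpha=0.05, effect=2.0, sigma=10.0, n=n)
         for n in (10, 50, 200, 800)}

for n, beta in betas.items():
    print(f"n = {n:4d}: alpha = 0.05, beta = {beta:.3f}, power = {1 - beta:.3f}")
```

With alpha held fixed, beta shrinks as n grows, which is why a larger sample lets you keep a low Type I error rate without giving up the chance of detecting a false null hypothesis.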

If you do reject your null hypothesis, then it is also essential that you determine whether the size of the relationship is practically significant.
The hypothesis test procedure is therefore adjusted so that there is a guaranteed “low” probability of rejecting the null hypothesis wrongly; this probability is never zero.

# Type I Error

The .05 significance level has become part of the statistical hypothesis testing culture.

• It is a longstanding convention.
• It reflects a concern over making type I errors (i.e., wanting to avoid the situation where you reject the null when it is true, that is, wanting to avoid “false positive” errors).
• If you set the significance level at .05, then you will only reject a true null hypothesis 5% of the time (i.e., you will only make a type I error 5% of the time) in the long run.
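The "5% of the time in the long run" statement can be checked by simulation. The sketch below assumes a setting chosen only for illustration: samples of size 30 from a standard normal population, so that the null hypothesis (mean = 0) is true, tested with a two-sided z-test with known sigma = 1. The function name `rejects_true_null` is hypothetical.

```python
import random
from statistics import NormalDist

def rejects_true_null(rng, n, alpha):
    """Sample from N(0, 1), where H0: mu = 0 holds, and run a two-sided z-test."""
    sample_mean = sum(rng.gauss(0.0, 1.0) for _ in range(n)) / n
    z = sample_mean * n ** 0.5  # sigma = 1 is assumed known
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return p_value <= alpha

rng = random.Random(0)
trials = 20000
rejections = sum(rejects_true_null(rng, n=30, alpha=0.05) for _ in range(trials))
type_i_rate = rejections / trials

print(f"observed type I error rate: {type_i_rate:.3f}")
```

Every rejection in this simulation is a false positive, so the observed rejection rate should settle near the chosen significance level as the number of trials increases.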

# Estimation: Point and Interval Estimation

## Estimation

The procedure of making a judgement or decision about a population parameter is referred to as statistical estimation, or simply estimation. Statistical estimation procedures provide estimates of population parameters with a desired degree of confidence. The degree of confidence can be controlled in part (i) by the size of the sample (a larger sample gives greater accuracy of the estimate) and (ii) by the type of estimate made. Population parameters are estimated from sample data because it is usually impracticable to examine the entire population in order to make an exact determination. Statistical estimation of a population parameter is further divided into two types: (i) point estimation and (ii) interval estimation.

## Point Estimation

The objective of point estimation is to obtain a single number from the sample which will represent the unknown value of the population parameter. Population parameters (population mean, variance, etc.) are estimated from the corresponding sample statistics (sample mean, variance, etc.).
A statistic used to estimate a parameter is called a point estimator, or simply an estimator; the actual numerical value obtained from the estimator is called an estimate.
The population parameter is denoted by θ, which is an unknown constant. The available information is in the form of a random sample $x_1, x_2, \ldots, x_n$ of size n drawn from the population. We formulate a function of the sample observations $x_1, x_2, \ldots, x_n$. The estimator of θ is denoted by $\hat{\theta}$. Different random samples provide different values of the statistic $\hat{\theta}$. Thus $\hat{\theta}$ is a random variable with its own sampling distribution.
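The idea that an estimator is a random variable can be seen by computing the same estimator on several independent samples. This is a minimal sketch with assumed values: a normal population with mean 10 and standard deviation 2 (fixed here only so that samples can be simulated), samples of size 25, and the sample mean as the estimator.

```python
import random
from statistics import fmean

rng = random.Random(1)
population_mean = 10.0  # theta; unknown in practice, fixed here to simulate data

estimates = []
for _ in range(5):
    # Each random sample of size 25 yields its own estimate of theta.
    sample = [rng.gauss(population_mean, 2.0) for _ in range(25)]
    estimates.append(fmean(sample))

for i, estimate in enumerate(estimates, start=1):
    print(f"sample {i}: estimate of the mean = {estimate:.3f}")
```

Each run of the loop gives a different estimate, and the spread of these values is what the sampling distribution of the estimator describes.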

## Interval Estimation

A point estimator (such as the sample mean) calculated from the sample data provides a single number as an estimate of the population parameter. This number cannot be expected to be exactly equal to the population parameter, because the mean of a sample taken from a population may assume different values for different samples. Therefore we estimate an interval (a range of values) within which the population parameter is expected to lie with a certain degree of confidence. This range of values used to estimate a population parameter is known as an interval estimate, or an estimate by confidence interval, and is defined by two numbers between which the population parameter is expected to lie. For example, $a<\mu<b$ is an interval estimate of the population mean μ, indicating that the population mean is greater than a but less than b. The purpose of an interval estimate is to provide information about how close the point estimate is to the true parameter.

Note that the information developed about the shape of the sampling distribution of the sample mean, i.e. the sampling distribution of $\bar{x}$, allows us to locate an interval that has some specified probability of containing the population mean $\mu$.
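An interval estimate for a mean can be sketched as follows. This example assumes a large-sample normal approximation to the sampling distribution of $\bar{x}$, with the sample standard deviation used in place of the population value; the simulated data (mean 50, standard deviation 8, n = 200) and the function name `mean_confidence_interval` are illustrative assumptions.

```python
import random
from statistics import NormalDist, fmean, stdev

def mean_confidence_interval(sample, confidence=0.95):
    """Point estimate and approximate confidence interval for a population mean."""
    n = len(sample)
    point_estimate = fmean(sample)
    standard_error = stdev(sample) / n ** 0.5
    # Normal quantile for the requested confidence (about 1.96 for 95%).
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    margin = z * standard_error
    return point_estimate, (point_estimate - margin, point_estimate + margin)

rng = random.Random(7)
sample = [rng.gauss(50.0, 8.0) for _ in range(200)]  # simulated data
x_bar, (a, b) = mean_confidence_interval(sample)

print(f"point estimate: {x_bar:.2f}")
print(f"95% interval estimate: {a:.2f} < mu < {b:.2f}")
```

The interval is centred on the point estimate, and its width reflects the standard error: a larger sample or a lower confidence level gives a narrower interval. For small samples, a t-based interval would usually be preferred over this normal approximation.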

## Which of the two types of estimation do you like the most, and why?

• Point estimation is nice because it provides an exact point estimate of the population value. It provides you with the single best guess of the value of the population parameter.
• Interval estimation is nice because it allows you to make statements of confidence that an interval will include the true population value.