# Basic Statistics and Data Analysis

## Standard Error of Estimate

Standard error (SE) is a statistical term that measures how accurately a sample statistic estimates the corresponding parameter of the population of interest. The standard error of the mean measures the variation in the sampling distribution of the sample mean. It is usually denoted by $\sigma_{\overline{x}}$ and is calculated as

$\sigma_{\overline{x}}=\frac{\sigma}{\sqrt{n}}$

Drawing (obtaining) different samples from the same population of interest usually results in different values of the sample mean, indicating that the sample means have a distribution of their own, with its own mean (average value) and variance. The standard error of the mean is the standard deviation of the means of all possible samples of a given size drawn from the same population.
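To see this definition at work, the following minimal sketch enumerates every possible sample of size 2 drawn with replacement from a tiny population and compares the standard deviation of the resulting sample means with $\sigma/\sqrt{n}$. The population values and the sample size are assumptions chosen only for illustration.

```python
import itertools
import math
import statistics

# Hypothetical population, chosen only to keep the enumeration small.
population = [2, 4, 6, 8]
n = 2  # sample size

# Every ordered sample of size n drawn with replacement (4**2 = 16 samples).
samples = list(itertools.product(population, repeat=n))
sample_means = [statistics.fmean(s) for s in samples]

# Standard deviation of the sampling distribution of the mean.
sd_of_means = statistics.pstdev(sample_means)

# Formula: population standard deviation divided by sqrt(n).
sigma = statistics.pstdev(population)
se_formula = sigma / math.sqrt(n)

print(f"SD of all sample means: {sd_of_means:.4f}")
print(f"sigma / sqrt(n):        {se_formula:.4f}")
```

Both quantities agree (about 1.58 here), as the formula predicts for sampling with replacement.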

The size of the standard error is affected by the standard deviation of the population and by the number of observations in a sample, called the sample size. The larger the standard deviation of the population ($\sigma$), the larger the standard error will be, indicating more variability in the sample means. However, the larger the number of observations in a sample, the smaller the standard error will be, indicating less variability in the sample means, where less variability means that the sample is more representative of the population of interest.

If the sampled population is not very large, we need to make an adjustment when computing the SE of the sample mean. For a finite population in which the total number of objects (observations) is $N$ and the number of objects (observations) in a sample is $n$, the adjustment is $\sqrt{\frac{N-n}{N-1}}$. This adjustment is called the finite population correction factor. The adjusted standard error is

$\frac{\sigma}{\sqrt{n}} \sqrt{\frac{N-n}{N-1}}$
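The effect of the finite population correction can be checked on a small example. The sketch below, which assumes a made-up population of four values and samples of size 2 drawn without replacement, enumerates all possible samples and compares the standard deviation of their means with the adjusted and unadjusted formulas. The function name `se_mean_finite` is hypothetical.

```python
import itertools
import math
import statistics

def se_mean_finite(sigma, n, N):
    """Standard error of the mean with the finite population correction."""
    return (sigma / math.sqrt(n)) * math.sqrt((N - n) / (N - 1))

# Hypothetical finite population and sample size.
population = [2, 4, 6, 8]
N, n = len(population), 2

# All distinct samples of size n drawn without replacement (6 samples).
sample_means = [statistics.fmean(s)
                for s in itertools.combinations(population, n)]
sd_of_means = statistics.pstdev(sample_means)

sigma = statistics.pstdev(population)
se_adjusted = se_mean_finite(sigma, n, N)
se_unadjusted = sigma / math.sqrt(n)

print(f"SD of sample means (enumerated): {sd_of_means:.4f}")
print(f"Adjusted SE (with correction):   {se_adjusted:.4f}")
print(f"Unadjusted SE (sigma/sqrt(n)):   {se_unadjusted:.4f}")
```

The enumerated value matches the adjusted standard error, while the unadjusted formula overstates the variability because half of this small population is sampled each time.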

The SE is used to:

1. measure the spread of the values of a statistic about the expected value of that statistic
2. construct confidence intervals
3. test the null hypothesis about population parameter(s)

The standard error is computed from sample statistics. For simple random samples, assuming that the population size ($N$) is at least 20 times the sample size ($n$), the SE is computed as follows, where $s$ denotes a sample standard deviation:
\begin{align*}
\text{Sample mean, } \overline{x} &\Rightarrow SE_{\overline{x}} = \frac{s}{\sqrt{n}}\\
\text{Sample proportion, } p &\Rightarrow SE_{p} = \sqrt{\frac{p(1-p)}{n}}\\
\text{Difference between means, } \overline{x}_1 - \overline{x}_2 &\Rightarrow SE_{\overline{x}_1-\overline{x}_2}=\sqrt{\frac{s_1^2}{n_1}+\frac{s_2^2}{n_2}}\\
\text{Difference between proportions, } p_1-p_2 &\Rightarrow SE_{p_1-p_2}=\sqrt{\frac{p_1(1-p_1)}{n_1}+\frac{p_2(1-p_2)}{n_2}}
\end{align*}
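These four formulas translate directly into code. The following is a minimal sketch, not a library API: the function names and the input values are hypothetical, and `s`, `s1`, `s2` stand for sample standard deviations.

```python
import math

def se_mean(s, n):
    """SE of a sample mean from the sample standard deviation s."""
    return s / math.sqrt(n)

def se_proportion(p, n):
    """SE of a sample proportion p."""
    return math.sqrt(p * (1 - p) / n)

def se_diff_means(s1, n1, s2, n2):
    """SE of the difference between two independent sample means."""
    return math.sqrt(s1**2 / n1 + s2**2 / n2)

def se_diff_proportions(p1, n1, p2, n2):
    """SE of the difference between two independent sample proportions."""
    return math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)

# Illustrative inputs only.
print(f"SE of mean:                   {se_mean(12, 36):.4f}")
print(f"SE of proportion:             {se_proportion(0.5, 100):.4f}")
print(f"SE of difference of means:    {se_diff_means(3, 9, 4, 16):.4f}")
print(f"SE of difference of props.:   {se_diff_proportions(0.5, 100, 0.5, 100):.4f}")
```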

The standard error is computed in the same way as the standard deviation of the sampling distribution, except that it uses sample statistics, whereas the standard deviation uses the population parameters.

# Sampling theory, Introduction and Reasons to Sample

Often we are interested in drawing valid conclusions (inferences) about a large group of individuals or objects (called a population in statistics). Instead of examining the entire population, which may be difficult or even impossible, we may examine only a small part of it, called a sample. Our objective is to draw valid inferences about certain facts for the population from results found in the sample, a process known as statistical inference. The process of obtaining samples is called sampling, and the theory concerning sampling is called sampling theory.

Example: We may wish to draw conclusions about the percentage of defective bolts produced in a factory during a given 6-day week by examining 20 bolts each day, produced at various times during the day. In this case, all bolts produced during the week comprise the population, while the 120 bolts selected during the 6 days constitute the sample.

Sampling theory is widely used in business, medical, social and psychological research, among other fields, for gathering information about a population. The sampling process comprises several stages:

• Defining the population of concern
• Specifying the sampling frame (set of items or events possible to measure)
• Specifying a sampling method for selecting the items or events from the sampling frame
• Determining the appropriate sample size
• Implementing the sampling plan
• Sampling and data collecting
• Data which can be selected
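As a rough illustration of these stages, the sketch below builds a hypothetical sampling frame of 1,000 unit identifiers and draws a simple random sample of 50 units from it without replacement. Simple random sampling, the frame, the sample size, and the seed are all assumptions made for the example; a real survey may call for a different sampling method.

```python
import random

# Hypothetical sampling frame: one identifier per sampling unit.
sampling_frame = [f"unit-{i:04d}" for i in range(1, 1001)]

# Sample size, assumed to have been determined beforehand.
sample_size = 50

# A fixed seed makes the draw reproducible for illustration.
rng = random.Random(42)

# Simple random sampling without replacement from the frame.
sample = rng.sample(sampling_frame, sample_size)

print(f"Frame size:  {len(sampling_frame)}")
print(f"Sample size: {len(sample)}")
print("First five selected units:", sample[:5])
```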

When studying the characteristics of a population, there are many reasons to study a sample (drawn from the population under study) instead of the entire population, such as:

1. Time: It is difficult and time-consuming to contact each and every individual of the whole population.
2. Cost: The cost or expense of studying all the items (objects or individuals) in a population may be prohibitive.
3. Physically impossible: Some populations are infinite, so it is physically impossible to check all the items in the population, such as populations of fish, birds, snakes, or mosquitoes. Similarly, it is difficult to study populations that are constantly moving, being born, or dying.
4. Destructive nature of items: Some items are destroyed during testing (or checking). For example, a steel wire is stretched until it breaks, and the breaking point is recorded to establish its minimum tensile strength. Similarly, various electric and electronic components are destroyed during testing, which makes it impossible to study the entire population.
5. Qualified and expert staff: Enumeration requires highly qualified and expert staff, which is sometimes impossible to arrange. National and international research organizations and agencies hire staff for enumeration purposes, which is sometimes costly and time-consuming (as rehearsal of the activity is required), and it is sometimes not easy to recruit highly qualified staff.
6. Reliability: Using a scientific sampling technique, the sampling error can be minimized, and the non-sampling error committed in a sample survey is also kept to a minimum because qualified investigators are involved.

Every sampling system is used to obtain estimates of certain properties of the population under study. The sampling system should be judged by how good the estimates obtained are. Individual estimates, by chance, may be very close to or may differ greatly from the true value (population parameter), and so may give a poor measure of the merits of the system.

A sampling system is better judged by the frequency distribution of many estimates obtained by repeated sampling; a good system gives a frequency distribution with a small variance and a mean estimate equal to the true value.

## Sampling Unit

When the population is divided into a finite number of distinct and identifiable units, these units are called sampling units. OR

The individuals whose characteristics are to be measured in the analysis are called elementary or sampling units. OR

Before selecting the sample, the population must be divided into parts called sampling units or simply sample units.

## Sampling Frame

The list of all the sampling units with proper identification, which represents the population to be covered, is called the sampling frame. The frame may consist of either a list of units or a map of areas (in case a sample of areas is being taken), such that every element in the population belongs to one and only one unit.

The frame should be accurate, free from omission and duplication (overlapping), adequate, and up to date; the units must cover the whole of the population and should be well identified.

In improving the sampling design, supplementary information for the field covered by the sampling frame may also be valuable.

Examples: Sampling Frame and Sampling Unit

1. A list of households (and persons) enumerated in a population census.
2. A map of areas of a country showing the boundaries of area units.
3. In sampling an agricultural crop, the unit might be a field, a farm or an area of land whose shape and dimensions are at our disposal.

An ideal sampling frame will have the following qualities/characteristics:

• all sampling units have a logical, numerical identifier
• all sampling units can be found i.e. contact information, map location or other relevant information about sampling units is present
• the frame is organized in a logical and systematic manner
• the sampling frame has some additional information about the units that allows the use of more advanced sampling frames
• every element of the population of interest is present in the frame
• every element of the population is present only once in the frame
• no elements from outside the population of interest are present in the frame
• the data is up-to-date

A sampling frame may be subject to several types of defect, classified as follows:

A frame may be inaccurate: where some of the sampling units of the population are listed inaccurately or some units which do not actually exist are included in the list.

A frame may be inadequate: when it does not include all classes of the population which are to be covered by the survey.

A frame may be incomplete: when some of the sampling units of the population are either completely omitted or included more than once.

A frame may be out of date: when it has not been updated according to the demand of the occasion, although it was accurate, complete and adequate at the time of construction.
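Some of these defects can be detected mechanically when a reference list of the population is available. The sketch below is a hypothetical helper (`frame_defects` is not a standard function) that reports omitted units, units listed more than once, and listed units that do not belong to the population. It assumes such a reference list exists, which is often not the case in practice, and it cannot detect an inadequate or out-of-date frame.

```python
from collections import Counter

def frame_defects(frame, population):
    """Compare a sampling frame with a reference list of population units."""
    counts = Counter(frame)
    pop = set(population)
    return {
        # Units of the population missing from the frame.
        "omitted": sorted(pop - set(counts)),
        # Units that appear more than once in the frame.
        "duplicated": sorted(u for u, c in counts.items() if c > 1),
        # Listed units that are not part of the population.
        "not_in_population": sorted(set(counts) - pop),
    }

# Made-up example data.
population = ["A", "B", "C", "D", "E"]
frame = ["A", "B", "B", "D", "X"]

defects = frame_defects(frame, population)
print(defects)
# {'omitted': ['C', 'E'], 'duplicated': ['B'], 'not_in_population': ['X']}
```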

## Unbiasedness of estimator

Unbiasedness is probably the most important property that a good estimator should possess. In statistics, the bias (or bias function) of an estimator is the difference between this estimator’s expected value and the true value of the parameter being estimated. An estimator is said to be unbiased if its expected value equals the corresponding population parameter; otherwise it is said to be biased.

## Unbiased Estimator

Suppose that, in the realization of a random variable $X$ taking values in a probability space $(\chi, \mathfrak{F},P_\theta)$ with $\theta \in \Theta$, a function $f:\Theta \rightarrow \Omega$ has to be estimated, mapping the parameter set $\Theta$ into a certain set $\Omega$, and that a statistic $T=T(X)$ is chosen as an estimator of $f(\theta)$. If $T$ is such that
$E_\theta[T]=\int_\chi T(x)\, dP_\theta(x)=f(\theta)$
holds for all $\theta \in \Theta$, then $T$ is called an unbiased estimator of $f(\theta)$. An unbiased estimator is frequently said to be free of systematic errors.

Let $\hat{\theta}$ be an estimator of a parameter $\theta$. Then $\hat{\theta}$ is said to be an unbiased estimator if $E(\hat{\theta})=\theta$.

• If $E(\hat{\theta})=\theta$ then $\hat{\theta}$ is an unbiased estimator of a parameter $\theta$.
• If $E(\hat{\theta})<\theta$ then $\hat{\theta}$ is a negatively biased estimator of a parameter $\theta$.
• If $E(\hat{\theta})>\theta$ then $\hat{\theta}$ is a positively biased estimator of a parameter $\theta$.

The bias of an estimator $\hat{\theta}$ can be found by $[E(\hat{\theta})-\theta]$.

$\overline{X}$ is an unbiased estimator of the mean of a population (whose mean exists). $\overline{X}$ is an unbiased estimator of $\mu$ in a Normal distribution i.e. $N(\mu, \sigma^2)$. $\overline{X}$ is an unbiased estimator of the parameter $p$ of the Bernoulli distribution. $\overline{X}$ is an unbiased estimator of the parameter $\lambda$ of the Poisson distribution. In each of these cases, the parameter $\mu, p$ or $\lambda$ is the mean of the respective population being sampled.

However, the sample variance computed with divisor $n$, $S^2=\frac{1}{n}\sum_{i=1}^{n}(X_i-\overline{X})^2$, is not an unbiased estimator of the population variance $\sigma^2$, although it is consistent.
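The bias of the sample variance can be verified exactly on a small example. The sketch below, assuming a made-up population of four values and samples of size 2 drawn with replacement, averages the sample variance over all possible samples, once with divisor $n$ and once with divisor $n-1$, and compares both averages with the population variance.

```python
import itertools
import statistics

# Hypothetical population and sample size, kept small for full enumeration.
population = [2, 4, 6, 8]
n = 2
sigma_sq = statistics.pvariance(population)  # population variance

# Every ordered sample of size n drawn with replacement.
samples = list(itertools.product(population, repeat=n))

# Expected value of the variance estimator with divisor n.
mean_var_div_n = statistics.fmean([statistics.pvariance(s) for s in samples])

# Expected value of the variance estimator with divisor n - 1.
mean_var_div_n_minus_1 = statistics.fmean(
    [statistics.variance(s) for s in samples]
)

print(f"Population variance:          {sigma_sq}")
print(f"E[variance with divisor n]:   {mean_var_div_n}")
print(f"E[variance with divisor n-1]: {mean_var_div_n_minus_1}")
print(f"Bias with divisor n:          {mean_var_div_n - sigma_sq}")
```

With divisor $n$ the expected value is $\frac{n-1}{n}\sigma^2$ (2.5 instead of 5 here), so the estimator is negatively biased; with divisor $n-1$ the expected value equals $\sigma^2$.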

It is possible to have more than one unbiased estimator for an unknown parameter. The sample mean and the sample median are both unbiased estimators of the population mean $\mu$ if the population distribution is symmetrical.

# Standard Error

The standard error of a statistic is the standard deviation of the sampling distribution of that statistic. Standard errors reflect how much sampling fluctuation a statistic will show. The inferential statistics involved in the construction of confidence intervals and significance testing are based on standard errors. As the sample size increases, the standard error decreases.

In practical applications, the true value of the standard deviation of the error is unknown. As a result, the term standard error is often used to refer to an estimate of this unknown quantity.

The size of the standard error is affected by two values.

1. The standard deviation of the population. The larger the population’s standard deviation (σ), the larger the standard error, i.e. $\frac{\sigma}{\sqrt{n}}$. If the population is homogeneous (which results in a small population standard deviation), the standard error will also be small.
2. The number of observations in a sample. A large sample will result in a small standard error of estimate (indicating less variability in the sample means).
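Both effects can be seen by tabulating $\frac{\sigma}{\sqrt{n}}$ for a few values. The values of $\sigma$ and $n$ below are arbitrary illustrations.

```python
import math

def standard_error(sigma, n):
    """Standard error of the mean for population SD sigma and sample size n."""
    return sigma / math.sqrt(n)

# Arbitrary illustrative values of sigma and n.
for sigma in (5, 10):
    for n in (25, 100, 400):
        se = standard_error(sigma, n)
        print(f"sigma={sigma:>2}, n={n:>3}: SE = {se:.2f}")
```

Doubling $\sigma$ doubles the standard error, whereas the sample size must be quadrupled to halve it, because $n$ enters through its square root.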

## Application of Standard Errors

Standard errors are used in different statistical procedures, for example to

• measure the spread of the distribution of the sample means
• build confidence intervals for means, proportions, differences between means, etc., for cases when the population standard deviation is known or unknown
• determine the sample size
• set control limits for means in control charts
• carry out comparison tests such as the z-test, t-test, Analysis of Variance, and Correlation and Regression Analysis (standard error of regression)
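As one example of these applications, the sketch below builds a confidence interval for a mean as $\overline{x} \pm z \cdot SE$. It assumes that the population standard deviation is known and that the sampling distribution of the mean is approximately normal; the function name and the input values are hypothetical.

```python
import math
from statistics import NormalDist

def confidence_interval_mean(x_bar, sigma, n, confidence=0.95):
    """z-based confidence interval for a mean (sigma assumed known)."""
    se = sigma / math.sqrt(n)
    # Two-sided critical value of the standard normal distribution.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return x_bar - z * se, x_bar + z * se

# Illustrative inputs: sample mean 50, sigma 10, sample size 100.
lower, upper = confidence_interval_mean(x_bar=50, sigma=10, n=100)
print(f"95% CI: ({lower:.2f}, {upper:.2f})")
# 95% CI: (48.04, 51.96)
```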

## (1) Standard Error of Means

The standard error of the mean, i.e. the standard deviation of the sampling distribution of the mean, measures the variation in the sampling distribution of the sample mean. It is denoted by $\sigma_{\bar{x}}$ and calculated as a function of the standard deviation of the population and the size of the sample, i.e.

$\sigma_{\bar{x}}=\frac{\sigma}{\sqrt{n}}$                      (used when the population is infinite)

If the population is finite with size $N$, then ${\sigma_{\bar{x}}=\frac{\sigma}{\sqrt{n}} \times \sqrt{\frac{N-n}{N-1}}}$. The correction factor $\sqrt{\frac{N-n}{N-1}}$ tends towards 1 as $N$ tends to infinity, which is why it is omitted for an infinite population.

When the standard deviation (σ) of the population is unknown, we estimate it by the sample standard deviation $S$. In this case, the standard error formula is $\sigma_{\bar{x}}=\frac{S}{\sqrt{n}}$
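In practice this estimate is computed directly from the observed data. The following minimal sketch uses a made-up sample of five observations and the sample standard deviation with divisor $n-1$.

```python
import math
import statistics

# Made-up sample of five observations.
sample = [4, 8, 6, 5, 7]
n = len(sample)

x_bar = statistics.fmean(sample)   # sample mean
s = statistics.stdev(sample)       # sample standard deviation (divisor n - 1)
se = s / math.sqrt(n)              # estimated standard error of the mean

print(f"mean = {x_bar:.2f}, S = {s:.4f}, SE = {se:.4f}")
```

Here the sum of squared deviations is 10, so $S^2 = 10/4 = 2.5$ and the estimated standard error is $\sqrt{2.5/5} \approx 0.7071$.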

## (2) Standard Error for Proportion

The standard error for a proportion can be calculated in the same manner as the standard error of the mean. It is denoted by $\sigma_p$.

In case of an infinite population, $\sigma_p=\frac{\sigma}{\sqrt{n}}$
in case of a finite population, $\sigma_p=\frac{\sigma}{\sqrt{n}}\sqrt{\frac{N-n}{N-1}}$, where $\sigma=\sqrt{p(1-p)}=\sqrt{pq}$, p is the probability that an element possesses the studied trait and q=1-p is the probability that it does not.

## (3) Standard Error for Difference between Means

The standard error for the difference between two independent quantities is the square root of the sum of the squared standard errors of both quantities, i.e. $\sigma_{\bar{x}_1-\bar{x}_2}=\sqrt{\frac{\sigma_1^2}{n_1}+\frac{\sigma_2^2}{n_2}}$, where $\sigma_1^2$ and $\sigma_2^2$ are the respective variances of the two independent populations to be compared and $n_1$ and $n_2$ are the respective sizes of the two samples drawn from their respective populations.

Unknown Population Variances
If the variances of the two populations are unknown, we estimate them from the two samples, i.e. $\sigma_{\bar{x}_1-\bar{x}_2}=\sqrt{\frac{S_1^2}{n_1}+\frac{S_2^2}{n_2}}$, where $S_1^2$ and $S_2^2$ are the respective variances of the two samples drawn from their respective populations.

Equal Variances are assumed
When it is assumed that the variances of the two populations are equal, we can estimate the common variance with a pooled variance $S_p^2$, calculated as a function of $S_1^2$ and $S_2^2$, i.e.

$S_p^2=\frac{(n_1-1)S_1^2+(n_2-1)S_2^2}{n_1+n_2-2}$
$\sigma_{\bar{x}_1-\bar{x}_2}=S_p \sqrt{\frac{1}{n_1}+\frac{1}{n_2}}$
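The pooled variance and the resulting standard error of the difference between means can be computed as in the sketch below. The function names and the sample summaries ($S_1^2=4$, $n_1=10$, $S_2^2=9$, $n_2=12$) are hypothetical.

```python
import math

def pooled_variance(s1_sq, n1, s2_sq, n2):
    """Pooled variance of two samples, weighted by degrees of freedom."""
    return ((n1 - 1) * s1_sq + (n2 - 1) * s2_sq) / (n1 + n2 - 2)

def se_diff_means_pooled(s1_sq, n1, s2_sq, n2):
    """SE of the difference between means under equal population variances."""
    sp = math.sqrt(pooled_variance(s1_sq, n1, s2_sq, n2))
    return sp * math.sqrt(1 / n1 + 1 / n2)

# Hypothetical sample variances and sample sizes.
sp_sq = pooled_variance(4.0, 10, 9.0, 12)
se = se_diff_means_pooled(4.0, 10, 9.0, 12)

print(f"Pooled variance: {sp_sq:.4f}")
print(f"SE of the difference between means: {se:.4f}")
```

For these inputs the pooled variance is $(9 \cdot 4 + 11 \cdot 9)/20 = 6.75$, which lies between the two sample variances and closer to that of the larger sample.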

## (4) Standard Error for Difference between Proportions

The standard error of the difference between two proportions is calculated in the same way as the standard error of the difference between means is calculated i.e.
\begin{eqnarray*}
\sigma_{p_1-p_2}&=&\sqrt{\sigma_{p_1}^2+\sigma_{p_2}^2}\\
&=& \sqrt{\frac{p_1(1-p_1)}{n_1}+\frac{p_2(1-p_2)}{n_2}}
\end{eqnarray*}
where $p_1$ and $p_2$ are the proportions calculated from the two samples of sizes $n_1$ and $n_2$, drawn from infinite populations.