# Basic Statistics and Data Analysis

## Sufficient statistics and Sufficient Estimators

An estimator $\hat{\theta}$ is sufficient if it makes so much use of the information in the sample that no other estimator could extract additional information from the sample about the population parameter being estimated.

The sample mean $\overline{X}$ uses every value in the sample and, for example for a normal population with known variance, is a sufficient estimator of the population mean $\mu$.

Sufficient estimators are often used to develop the estimator that has minimum variance among all unbiased estimators (MVUE).

If a sufficient estimator exists, no other estimator from the sample can provide additional information about the population parameter being estimated.

If there is a sufficient estimator, then there is no need to consider any of the non-sufficient estimators. Good estimators are functions of sufficient statistics.

Let $X_1,X_2,\cdots,X_n$ be a random sample from a probability distribution with unknown parameter $\theta$. A statistic (estimator) $U=g(X_1,X_2,\cdots,X_n)$ is sufficient for $\theta$ if the conditional distribution of the sample, given the observed value $U=u$, does not depend upon the population parameter $\theta$.

## Sufficient Statistic Example

The sample mean $\overline{X}$ is sufficient for the population mean $\mu$ of a normal distribution with known variance. Once the sample mean is known, no further information about the population mean $\mu$ can be obtained from the sample itself. The median, in contrast, is not sufficient for the mean: even if the median of the sample is known, knowing the sample itself would provide further information about the population mean $\mu$.

## Mathematical Definition of Sufficiency

Suppose that $X_1,X_2,\cdots,X_n \sim p(x;\theta)$. $T$ is sufficient for $\theta$ if the conditional distribution of $X_1,X_2,\cdots, X_n|T$ does not depend upon $\theta$. Thus
$p(x_1,x_2,\cdots,x_n|t;\theta)=p(x_1,x_2,\cdots,x_n|t)$
This means that we can replace $X_1,X_2,\cdots,X_n$ with $T(X_1,X_2,\cdots,X_n)$ without losing information.
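
As a small numerical check of this definition, the sketch below assumes a Bernoulli($p$) sample with $T=\sum X_i$ (an illustrative choice that does not come from the text above; the function name `conditional_probs` is hypothetical). It enumerates every 0/1 sample of length $n$ with total $t$ and computes $p(x_1,\cdots,x_n|t;p)$ for two different values of $p$.

```python
from itertools import product
from math import comb, isclose


def conditional_probs(n, t, p):
    """Return P(X = x | T = t; p) for every 0/1 sample x of length n with sum t."""
    # Joint probability of each particular sample that has exactly t successes.
    joint = {
        x: p ** sum(x) * (1 - p) ** (n - sum(x))
        for x in product((0, 1), repeat=n)
        if sum(x) == t
    }
    # P(T = t; p) is the total probability of all samples with sum t.
    p_t = sum(joint.values())
    return {x: prob / p_t for x, prob in joint.items()}


n, t = 4, 2
for p in (0.2, 0.7):
    probs = conditional_probs(n, t, p)
    # Given T = t, every such sample is equally likely, whatever p is.
    assert all(isclose(v, 1 / comb(n, t)) for v in probs.values())
    print(p, sorted(set(round(v, 6) for v in probs.values())))
```

Under these assumptions the conditional probabilities are the same for both values of $p$, which is what sufficiency of $T$ requires.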

## Consistent Estimator

A statistic is a consistent estimator of a population parameter if “as the sample size increases, it becomes almost certain that the value of the statistic comes close (closer) to the value of the population parameter”. If an estimator is consistent, it becomes more reliable with large samples. This means that the distribution of the estimates becomes more and more concentrated near the value of the population parameter being estimated, such that the probability of the estimator being arbitrarily close to $\theta$ converges to one (sure event).

The estimator $\hat{\theta}_n$ is said to be a consistent estimator of $\theta$ if, for any positive $\varepsilon$,
$\lim_{n \rightarrow \infty} P[|\hat{\theta}_n-\theta| \le \varepsilon]=1$
or
$\lim_{n\rightarrow \infty} P[|\hat{\theta}_n-\theta|> \varepsilon]=0$

Here $\hat{\theta}_n$ denotes the estimator of $\theta$ calculated from a sample of size $n$.
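
The definition can be illustrated with a rough simulation. The sketch below assumes a Uniform(0, 1) population (so $\theta=\mu=0.5$), takes $\hat{\theta}_n$ to be the sample mean, and uses $\varepsilon=0.05$; these choices and the function name `prob_far_from_mean` are illustrative assumptions only.

```python
import random


def prob_far_from_mean(n, eps=0.05, reps=1000, seed=0):
    """Estimate P(|sample mean - mu| > eps) for Uniform(0, 1) samples of size n."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    mu = 0.5                   # true mean of Uniform(0, 1)
    far = 0
    for _ in range(reps):
        mean = sum(rng.random() for _ in range(n)) / n
        if abs(mean - mu) > eps:
            far += 1
    return far / reps


for n in (10, 100, 1000):
    # The estimated probability should shrink towards 0 as n grows.
    print(n, prob_far_from_mean(n))
```

The estimated probability of missing $\mu$ by more than $\varepsilon$ drops as $n$ increases, which is the behaviour the second limit describes.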

The sample median is a consistent estimator of the population mean if the population distribution is symmetrical; otherwise the sample median approaches the population median, not the population mean.

The sample variance computed with divisor $n$, $\hat{\sigma}^2=\frac{1}{n}\sum_{i=1}^{n}(X_i-\overline{X})^2$, is biased but consistent, as the distribution of $\hat{\sigma}^2$ becomes more and more concentrated at $\sigma^2$ as the sample size increases.

A sample statistic can be an inconsistent estimator. A consistent estimator is usually unbiased in the limit (asymptotically unbiased), but an unbiased estimator may or may not be a consistent estimator.

Note that these two properties are not equivalent: (i) unbiasedness is a statement about the expected value of the sampling distribution of the estimator, while (ii) consistency is a statement about “where the sampling distribution of the estimator is going” as the sample size increases.

## Point Estimation of Parameters

The objective of point estimation of parameters is to obtain a single number from the sample which will represent the unknown value of the parameter.

In practice we do not know the population parameters, such as the population mean and standard deviation. Our goal is to estimate the mean and standard deviation of the population of interest from sample information, which saves time, cost, etc. This can be done by using the sample mean and standard deviation as a best guess for the true population mean and standard deviation. Such a “best guess” is termed a “point estimate”, as it is a single number that summarizes the sample.

A point estimate is a statistic (a statistical measure from a sample) that gives a plausible estimate (or possibly a best guess) for the value in question.

$\overline{x}$ is a point estimate for $\mu$ and $s$ is a point estimate for $\sigma$.

Or we can say that

A statistic used to estimate a parameter is called a point estimator or simply an estimator. The actual numerical value which we obtain for an estimator in a given problem is called an estimate.

Generally, the symbol $\theta$ (an unknown constant) is used to denote a population parameter, which may be a proportion, a mean, or some measure of variability. The available information is in the form of a random sample $X_1,X_2,\cdots, X_n$ of size $n$ drawn from the population. We wish to formulate a function of the sample observations $X_1,X_2,\cdots,X_n$; that is, we look for a statistic whose value computed from the sample data reflects the value of the population parameter as closely as possible. The estimator of $\theta$ is commonly denoted by $\hat{\theta}$. Different random samples usually provide different values of the statistic $\hat{\theta}$, so $\hat{\theta}$ has its own sampling distribution.

Note that unbiasedness, efficiency, consistency, and sufficiency are the criteria (statistical properties of an estimator) used to judge whether a statistic is a “good” estimator.

## Application of Point Estimators: Confidence Intervals

We are interested not only in finding the point estimate for the mean, but also in determining how accurate that point estimate is, so we build an interval with a stated level of confidence. Here the Central Limit Theorem plays a very important role in building the confidence interval. We assume that the sample standard deviation is close to the population standard deviation (which will almost always be true for large samples). The standard deviation of the sampling distribution of the estimator (here, the mean) is

$\sigma_{\overline{x}} = \frac{\sigma}{\sqrt{n}} \approx \frac{s}{\sqrt{n}}$

Our interest is to find an interval around $\overline{x}$ such that there is a large probability that the actual (true) mean falls inside the computed interval.  This interval is called a confidence interval and the large probability is called the confidence level.

Example

Suppose that we check for clarity at 50 locations in a lake and discover that the average depth of clarity of the lake is 14 feet with a standard deviation of 2 feet. What can we conclude about the average clarity of the lake with a 95% confidence level?

Solution

The sample mean $\overline{x}$ (the average depth of clarity at the 50 locations) can be used to provide a point estimate for $\mu$, and $s$ to provide a point estimate for $\sigma$. To answer how accurate $\overline{x}$ is as a point estimate, we can construct a 95% confidence interval for $\mu$ as follows.

Sketch the standard normal curve and use the standard normal table to find the z-score associated with a tail probability of 0.025 (there is 0.025 to the left and 0.025 to the right, i.e., the two-tailed case).

z-score for 95% confidence level is about ±1.96.

\begin{align*}
Z&=\frac{\overline{x}-\mu}{\frac{\sigma}{\sqrt{n}}}\\
\pm 1.96&=\frac{14-\mu}{\frac{2}{\sqrt{50}}}\\
14-\mu&=\pm 0.5544
\end{align*}

Note that $Z\frac{\sigma}{\sqrt{n}}$ is called the margin of error.

The 95% confidence interval for the mean clarity will be (13.45, 14.55)

In other words, we are 95% confident that the mean clarity of the lake is between 13.45 and 14.55 feet.

In general, if $Z_{\alpha/2}$ is the standard normal table value that leaves probability $\alpha/2$ in the upper tail, then a $100(1-\alpha)$% confidence interval for the mean is

$\overline{x} \pm Z_{\alpha/2}\frac{\sigma}{\sqrt{n}}$
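
The interval can also be computed directly. The following is a minimal sketch, assuming a large-sample z interval in which the sample standard deviation stands in for $\sigma$; the function name `z_confidence_interval` and its parameters are hypothetical, and the two-sided critical value is taken from Python's `statistics.NormalDist` rather than a printed table.

```python
from math import sqrt
from statistics import NormalDist


def z_confidence_interval(mean, sd, n, confidence=0.95):
    """Return (lower, upper) for a large-sample z interval for the mean."""
    alpha = 1 - confidence
    # Two-sided critical value: about 1.96 when confidence is 0.95.
    z = NormalDist().inv_cdf(1 - alpha / 2)
    margin = z * sd / sqrt(n)  # margin of error
    return mean - margin, mean + margin


# Lake example from the text: 50 locations, mean 14 feet, SD 2 feet.
lower, upper = z_confidence_interval(mean=14, sd=2, n=50)
print(round(lower, 2), round(upper, 2))
```

For the lake data this gives roughly 13.45 and 14.55, matching the interval worked out by hand.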

## Unbiasedness of estimator

Unbiasedness of estimator is probably the most important property that a good estimator should possess. In statistics, the bias (or bias function) of an estimator is the difference between this estimator’s expected value and the true value of the parameter being estimated. An estimator is said to be unbiased if its expected value equals the corresponding population parameter; otherwise it is said to be biased.

## Unbiased Estimator

Suppose that, in the realization of a random variable $X$ taking values in a probability space $(\chi, \mathfrak{F},P_\theta)$ with $\theta \in \Theta$, a function $f:\Theta \rightarrow \Omega$ has to be estimated, mapping the parameter set $\Theta$ into a certain set $\Omega$, and that a statistic $T=T(X)$ is chosen as an estimator of $f(\theta)$. If $T$ is such that
$E_\theta[T]=\int_\chi T(x) dP_\theta(x)=f(\theta)$
holds for all $\theta\in \Theta$, then $T$ is called an unbiased estimator of $f(\theta)$. An unbiased estimator is frequently said to be free of systematic errors.

Let $\hat{\theta}$ be an estimator of a parameter $\theta$; then $\hat{\theta}$ is said to be an unbiased estimator if $E(\hat{\theta})=\theta$, that is, if its bias $E(\hat{\theta})-\theta$ is $0$.

• If $E(\hat{\theta})=\theta$ then $\hat{\theta}$ is an unbiased estimator of a parameter $\theta$.
• If $E(\hat{\theta})<\theta$ then $\hat{\theta}$ is a negatively biased estimator of a parameter $\theta$.
• If $E(\hat{\theta})>\theta$ then $\hat{\theta}$ is a positively biased estimator of a parameter $\theta$.

The bias of an estimator $\hat{\theta}$ can be found by $[E(\hat{\theta})-\theta]$.

$\overline{X}$ is an unbiased estimator of the mean of a population (whose mean exists). $\overline{X}$ is an unbiased estimator of $\mu$ in a Normal distribution i.e. $N(\mu, \sigma^2)$. $\overline{X}$ is an unbiased estimator of the parameter $p$ of the Bernoulli distribution. $\overline{X}$ is an unbiased estimator of the parameter $\lambda$ of the Poisson distribution. In each of these cases, the parameter $\mu, p$ or $\lambda$ is the mean of the respective population being sampled.

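However, the sample variance computed with divisor $n$, $\hat{\sigma}^2=\frac{1}{n}\sum_{i=1}^{n}(X_i-\overline{X})^2$, is not an unbiased estimator of the population variance $\sigma^2$, although it is consistent.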
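
The bias of the divisor-$n$ sample variance can be checked exactly on a toy case. The sketch below assumes a hypothetical population $\{1, 2, 3\}$ and all equally likely samples of size 2 drawn with replacement; it uses exact fractions, so no simulation error is involved. The helper name `variance` is illustrative.

```python
from fractions import Fraction
from itertools import product


def variance(values, divisor):
    """Sum of squared deviations from the mean, divided by `divisor`."""
    mean = Fraction(sum(values), len(values))
    return sum((v - mean) ** 2 for v in values) / divisor


population = [1, 2, 3]  # hypothetical toy population
n = 2
sigma2 = variance(population, len(population))  # population variance

# All equally likely samples of size n drawn with replacement.
samples = list(product(population, repeat=n))
mean_div_n = sum(variance(s, n) for s in samples) / len(samples)
mean_div_n_minus_1 = sum(variance(s, n - 1) for s in samples) / len(samples)

print(sigma2, mean_div_n, mean_div_n_minus_1)
```

In this toy case the expected value with divisor $n$ is $\frac{n-1}{n}\sigma^2$, which is below $\sigma^2$ (negative bias), while the divisor $n-1$ version averages exactly to $\sigma^2$.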

It is possible to have more than one unbiased estimator for an unknown parameter. The sample mean and the sample median are both unbiased estimators of the population mean $\mu$ if the population distribution is symmetrical.

# Standard Error

The standard error of a statistic is the standard deviation of the sampling distribution of that statistic. Standard errors reflect how much sampling fluctuation a statistic will show. The inferential statistics involved in the construction of confidence intervals and significance testing are based on standard errors. As the sample size increases, the standard error decreases.

In practical applications, the true value of this standard deviation is unknown. As a result, the term standard error is often used to refer to an estimate of this unknown quantity.

The size of standard error is affected by two values.

1. The standard deviation of the population. The larger the population’s standard deviation (σ), the larger the standard error, i.e. $\frac{\sigma}{\sqrt{n}}$. If the population is homogeneous (which results in a small population standard deviation), the standard error will also be small.
2. The number of observations in the sample. A large sample will result in a small standard error of the estimate (indicating less variability in the sample means).

## Application of Standard Errors

Standard errors are used in many statistical procedures, for example

• to measure the spread of the distribution of the sample means
• to build confidence intervals for means, proportions, differences between means, etc., for cases when the population standard deviation is known or unknown
• to determine the sample size
• in control charts, to set control limits for means
• in comparison tests and related analyses such as the z-test, t-test, Analysis of Variance, Correlation and Regression Analysis (standard error of regression), etc.

## (1) Standard Error of Means

The standard error of the mean, or standard deviation of the sampling distribution of the mean, measures the variation in the sampling distribution of the sample mean. It is denoted by $\sigma_{\bar{x}}$ and calculated as a function of the standard deviation of the population and the size of the sample, i.e.

$\sigma_{\bar{x}}=\frac{\sigma}{\sqrt{n}}$                      (used when population is infinite)

If the population is finite with size $N$ and sampling is done without replacement, then ${\sigma_{\bar{x}}=\frac{\sigma}{\sqrt{n}} \times \sqrt{\frac{N-n}{N}}}$ (some texts write $N-1$ in the denominator, depending on how $\sigma$ is defined). The two formulas agree for large populations because $\sqrt{\frac{N-n}{N}}$ tends towards 1 as $N$ tends to infinity.

When the standard deviation (σ) of the population is unknown, we estimate it by the sample standard deviation $S$. In this case the standard error formula is $\sigma_{\bar{x}}=\frac{S}{\sqrt{n}}$
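
These formulas can be wrapped in a short helper. This is a minimal sketch, assuming that the correction factor $\sqrt{\frac{N-n}{N}}$ is applied only when a finite population size is supplied (sampling without replacement); the function name `standard_error_mean` and the example numbers are hypothetical.

```python
from math import sqrt


def standard_error_mean(sd, n, population_size=None):
    """Standard error of the sample mean.

    sd is the population SD (or the sample SD used as its estimate).
    If population_size (N) is given, the factor sqrt((N - n) / N)
    is applied, as in the text's formula for a population of size N.
    """
    se = sd / sqrt(n)
    if population_size is not None:
        se *= sqrt((population_size - n) / population_size)
    return se


# Hypothetical numbers: SD 2, sample of 50, optional population of 200.
print(standard_error_mean(2, 50))        # no correction
print(standard_error_mean(2, 50, 200))   # smaller, because N is finite
print(standard_error_mean(2, 200))       # larger n gives a smaller SE
```

The correction matters only when the sample is a sizeable fraction of the population; for large $N$ the two results are nearly identical.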

## (2) Standard Error for Proportion

The standard error for a proportion can be calculated in the same manner as the standard error of the mean. It is denoted by $\sigma_p$ and, for a finite population of size $N$, calculated as $\sigma_p=\frac{\sigma}{\sqrt{n}}\sqrt{\frac{N-n}{N}}$.

In the case of an infinite population, $\sigma_p=\frac{\sigma}{\sqrt{n}}$.
In both cases $\sigma=\sqrt{p(1-p)}=\sqrt{pq}$, where $p$ is the probability that an element possesses the studied trait and $q=1-p$ is the probability that it does not.

## (3) Standard Error for Difference between Means

The standard error of the difference between two independent quantities is the square root of the sum of the squared standard errors of both quantities, i.e. $\sigma_{\bar{x}_1-\bar{x}_2}=\sqrt{\frac{\sigma_1^2}{n_1}+\frac{\sigma_2^2}{n_2}}$, where $\sigma_1^2$ and $\sigma_2^2$ are the respective variances of the two independent populations to be compared and $n_1$ and $n_2$ are the respective sizes of the two samples drawn from those populations.

Unknown Population Variances
If the variances of the two populations are unknown, we estimate them from the two samples, i.e. $\sigma_{\bar{x}_1-\bar{x}_2}=\sqrt{\frac{S_1^2}{n_1}+\frac{S_2^2}{n_2}}$, where $S_1^2$ and $S_2^2$ are the respective variances of the two samples drawn from their respective populations.

Equal Variances are assumed
When the variances of the two populations are assumed to be equal, we can estimate their common value with a pooled variance $S_p^2$, calculated as a function of $S_1^2$ and $S_2^2$, i.e.

$S_p^2=\frac{(n_1-1)S_1^2+(n_2-1)S_2^2}{n_1+n_2-2}$
$\sigma_{\bar{x}_1-\bar{x}_2}=S_p \sqrt{\frac{1}{n_1}+\frac{1}{n_2}}$
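
Both estimated versions can be compared side by side. The sketch below is illustrative only: the function name `se_diff_means`, the `pooled` flag, and the sample summaries are hypothetical, and the inputs are sample standard deviations rather than known population values.

```python
from math import sqrt


def se_diff_means(s1, n1, s2, n2, pooled=False):
    """Estimated standard error of the difference between two sample means."""
    if pooled:
        # Pooled variance: weighted average of the two sample variances.
        sp2 = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)
        return sqrt(sp2) * sqrt(1 / n1 + 1 / n2)
    # Separate-variance form: sum of the two squared standard errors.
    return sqrt(s1**2 / n1 + s2**2 / n2)


# Hypothetical sample summaries, chosen only for illustration.
print(se_diff_means(3.0, 25, 4.0, 36))
print(se_diff_means(3.0, 25, 4.0, 36, pooled=True))
```

When the two sample sizes are equal, the pooled and separate-variance formulas give the same value; they differ when the sample sizes differ and the sample variances are unequal.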

## (4) Standard Error for Difference between Proportions

The standard error of the difference between two proportions is calculated in the same way as the standard error of the difference between means is calculated i.e.
\begin{eqnarray*}
\sigma_{p_1-p_2}&=&\sqrt{\sigma_{p_1}^2+\sigma_{p_2}^2}\\
&=& \sqrt{\frac{p_1(1-p_1)}{n_1}+\frac{p_2(1-p_2)}{n_2}}
\end{eqnarray*}
where $p_1$ and $p_2$ are the proportions calculated from the two samples of sizes $n_1$ and $n_2$, each drawn from an infinite population.
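
As a quick numerical illustration of this formula, the sketch below evaluates it for two hypothetical samples; the function name `se_diff_proportions` and the proportions and sample sizes are made up for the example.

```python
from math import sqrt


def se_diff_proportions(p1, n1, p2, n2):
    """Standard error of p1 - p2 for two independent samples."""
    return sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)


# Hypothetical samples: 40% of 200 and 30% of 150 show the trait.
print(se_diff_proportions(0.40, 200, 0.30, 150))
```

With these numbers the two squared standard errors are $0.0012$ and $0.0014$, so the result is $\sqrt{0.0026}$, about $0.051$.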