Size of Sampling Error

In this post, we will discuss sampling error and the size of sampling error. Sampling error is the difference between a sample statistic (such as a sample mean) and the true population parameter (the actual population mean). Sampling error arises because a sample is being studied instead of the entire population.

The word “error” in sampling error can be misleading. It does not mean that you made a mistake in your research process. Sampling error is a statistical concept that exists even when your sampling is perfectly random and your execution is flawless.

Cause of Sampling Error

Sampling error is caused by random chance. When you randomly select a subset of a population, that specific subset will almost never have exactly the same characteristics as the entire population. This chance variation is the sampling error. For example:

Suppose you have a large bowl of soup (the population) and you taste a single spoonful (a sample). The flavour of that spoonful will probably be very close to that of the whole bowl, but it might be a tiny bit saltier or contain one more piece of vegetable than the average spoonful. This small, natural difference is the sampling error. It is not a mistake you made; it is an inevitable result of sampling.

How is it measured?

Let $\hat{\theta}$ be a sample statistic and $\theta$ the corresponding true population parameter; then the sampling error is
$$\text{Sampling Error} = \hat{\theta} - \theta$$

For example, if $\overline{x}$ is the sample mean and $\mu$ is the true population mean, then

$$\text{Sampling Error} = \overline{x} - \mu$$

The most common way to quantify sampling error is to compute the standard error (SE). The standard error of the mean (SEM) estimates how much the sample mean is likely to vary from the true population mean. A smaller standard error means less variability and more precision in the estimate.

The standard error formula is

$$SE = \frac{s}{\sqrt{n}}$$

where $s$ is the sample standard deviation and $n$ is the sample size.
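
As a quick illustration, the following R sketch draws a hypothetical random sample (the numbers are made up purely for illustration) and computes the standard error of its mean using the formula above.

```r
# Illustration: standard error of the mean for a single random sample
set.seed(123)                        # for reproducibility
x <- rnorm(50, mean = 100, sd = 15)  # hypothetical sample of n = 50 observations

n  <- length(x)
s  <- sd(x)        # sample standard deviation
SE <- s / sqrt(n)  # standard error of the mean

c(sample_mean = mean(x), sample_sd = s, standard_error = SE)
```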

Factors Affecting the Size of Sampling Error

Two main factors control the size of sampling error:

  1. Sample Size (n): This is the most important factor.
    • Larger Sample Size → Smaller Sampling Error. As the sample size increases, the sample becomes a better and better representation of the population. That is, the sampling error shrinks.
    • This is why national polls survey thousands of people, not just a few dozen.
  2. Population Variability (Standard Deviation s):
    • More Variable Population → Larger Sampling Error. If the individuals in the population are very diverse (e.g., “ages of all people in a country”), any given sample might be less representative. If the population is very homogeneous (e.g., “diameters of ball bearings from the same machine”), a small sample will be very accurate.

This relationship is captured in the formula for the Standard Error above.


Sampling Error vs. Sampling Bias

This is a crucial distinction.

| Feature | Sampling Error | Sampling Bias (a non-sampling error) |
|---|---|---|
| Cause | Random chance | Flawed sampling method |
| Nature | Unavoidable and measurable | Avoidable and problematic |
| Effect | Causes imprecision (scatter) | Causes inaccuracy (shift) |
| Solution | Increase the sample size | Fix the sampling method |

A target-shooting analogy makes the difference concrete:
  • Sampling Error: Firing a rifle multiple times at a target. The shots will cluster tightly (small error) or be spread out (large error) around the bullseye.
  • Sampling Bias: The rifle’s scope is miscalibrated. All your shots are consistently off-target in one direction, missing the true bullseye.

Sampling Error: Real World Example

Suppose you want to know the average height of all 10,000 students at a university (the population), and the true average height is 5’8″ (the population parameter, assumed known here). You take a random sample of 100 students and calculate their average height; it comes out to 5’7.5″. You take another random sample of 100 different students; the average for this sample is 5’8.5″.

The difference between your first sample’s result (5’7.5″) and the true value (5’8″) is -0.5 inches. This is the sampling error for the first sample. The difference for the second sample is +0.5 inches, which is the sampling error for the second sample.

This variation is natural and expected. Similarly, if the sample size is increased to 500 students, the sample averages (e.g., 5’7.9″, 5’8.1″) would likely be much closer to the true 5’8″, meaning that the sampling error would be smaller.
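
The following R simulation sketch illustrates this point. It assumes, purely for illustration, that heights are roughly normal with a true mean of 68 inches (5’8″) and a standard deviation of 3 inches, and compares how much sample means vary for $n=100$ versus $n=500$.

```r
# Simulate how the size of the sampling error depends on the sample size
set.seed(2024)
mu    <- 68   # assumed true mean height in inches (5'8")
sigma <- 3    # assumed population standard deviation in inches

sample_means <- function(n, reps = 1000) {
  replicate(reps, mean(rnorm(n, mean = mu, sd = sigma)))
}

means_100 <- sample_means(100)   # 1000 sample means based on n = 100
means_500 <- sample_means(500)   # 1000 sample means based on n = 500

# Typical size of the sampling error = spread of the sample means
c(se_n100 = sd(means_100), se_n500 = sd(means_500))
# Theory: sigma / sqrt(100) = 0.30 and sigma / sqrt(500) is about 0.13
```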

Sampling Error: Summary

  • What it is: Natural variation between a sample and the population.
  • What it’s not: A mistake or bias in the research design.
  • Why it matters: It tells us the precision of our sample-based estimates.
  • How to reduce it: Increase the sample size.
  • How to measure it: Calculate the Standard Error (SE).

FAQs about Sampling Error and Size of Sampling Error

  • What is sampling error?
  • What is meant by the size of sampling error?
  • How can sampling error be reduced?
  • Give some real-world examples related to sampling error.
  • How is sampling error computed?
  • Describe the causes of sampling error.
  • What is the difference between error, sampling error, and sampling bias?


Sample Size Determination

Sample size determination is one of the most critical steps in designing any research study or experiment. Whether the researcher is conducting clinical trials, market research, or social science studies, the selection of an appropriate sample size ensures that the results are statistically valid while optimizing resources. This guide will walk you through the key concepts and methods for sample size determination.

In planning a study, sample size determination is an important issue, and the chosen size must meet certain conditions. For example, for a study dealing with blood cholesterol levels, these conditions are typically expressed in terms such as “How large a sample do I need to be able to reject the null hypothesis that two population means are equal if the difference between them is $d=10$ mg/dl?”

Why Sample Size Matters

  1. Statistical Power: Adequate sample sizes increase the ability to detect true effects
  2. Precision: Larger samples typically yield more precise estimates
  3. Resource Efficiency: Avoid wasting time/money on unnecessarily large samples
  4. Ethical Considerations: Especially important in clinical research to neither under- nor over-recruit participants

Special Considerations for Estimating Sample Size

  1. Small Populations: May require finite population corrections
  2. Stratified Sampling: Need to calculate for each stratum
  3. Cluster Sampling: Must account for design effect
  4. Longitudinal Studies: Consider repeated measures and attrition

Sample Size Determination Formula

In general, there exists a formula for computing a sample size for the specific test statistic (appropriate to test a specified hypothesis). These formulae require that the user specify the $\alpha$-level and Power = ($1-\beta$) desired, as well as the difference to be detected and the variability of the measure.

Common Approaches to Sample Size Calculation

For Estimating Proportions (Prevalence Studies)

A common approach to calculating the sample size uses the formula:

$$n=\frac{Z^2 p (1-p)}{E^2}$$

where

  • Z = Z-value (1.96 for 95% confidence interval)
  • p = estimated proportion
  • E = margin of error

For a survey with an expected proportion of 50%, a 95% confidence level, and 5% margin of error, the sample size will be

$$n=\frac{1.96^2 \times 0.5 \times 0.5}{0.05^2} \approx 385$$

Note that it is not wise to calculate a single number for the sample size. It is better to calculate a range of values by varying the assumptions so that one can get a sense of their impact on the resulting projected sample size. From this range of sample sizes, a suitable sample may be picked for the research work.
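
A minimal R sketch of this calculation is given below; it also varies the assumed proportion and margin of error (the particular values tried are arbitrary) to produce the kind of range of sample sizes recommended above.

```r
# Sample size for estimating a proportion: n = Z^2 * p * (1 - p) / E^2
n_proportion <- function(p, E, conf = 0.95) {
  Z <- qnorm(1 - (1 - conf) / 2)   # 1.96 for a 95% confidence level
  ceiling(Z^2 * p * (1 - p) / E^2)
}

n_proportion(p = 0.5, E = 0.05)    # the worked example above: 385

# Vary the assumptions to see their impact on the projected sample size
grid   <- expand.grid(p = c(0.3, 0.5, 0.7), E = c(0.03, 0.05, 0.10))
grid$n <- mapply(n_proportion, p = grid$p, E = grid$E)
grid
```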

Common Situations for Sample Size Determination

We consider the process of estimating sample size for three common circumstances:

  • One-Sample t-test and paired t-test
  • Two-Sample t-test
  • Comparison of $P_1$ vs $P_2$ with a Z-test

One-Sample t-test and Paired t-test

For testing the hypothesis:

$H_o:\mu=\mu_o\quad$ vs $\quad H_1:\mu \ne \mu_o$

For a two-tailed test, the sample size formula for the one-sample t-test is

$$n = \left[\frac{(Z_{1-\alpha/2} + Z_{1-\beta})\sigma}{d} \right]^2$$

Example: Suppose we are interested in estimating the required sample size for a study of blood cholesterol levels in a population whose standard deviation is about 25 mg/dl. Consider $\alpha = 0.05$, $\sigma = 25$ mg/dl, $d = 5.0$ mg/dl, and power $= 0.80$.

\begin{align*}
n & = \left[ \frac{(Z_{1-\alpha/2} + Z_{1-\beta})\sigma}{d} \right]^2\\
&= \left[\frac{(1.96 + 0.842)\times 25}{5}\right]^2 = 196.28 \approx 197
\end{align*}
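
The same calculation can be done in R with the normal-approximation formula above; `qnorm()` supplies the z-values 1.96 and 0.842 (the built-in `power.t.test()` uses the t-distribution and would give a slightly larger answer).

```r
# Sample size for a one-sample (or paired) t-test, normal approximation
n_one_sample <- function(sigma, d, alpha = 0.05, power = 0.80) {
  z_alpha <- qnorm(1 - alpha / 2)  # 1.96 for alpha = 0.05 (two-sided)
  z_beta  <- qnorm(power)          # 0.842 for power = 0.80
  ceiling(((z_alpha + z_beta) * sigma / d)^2)
}

n_one_sample(sigma = 25, d = 5)    # about 197, matching the hand calculation
```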

Two Sample t-test

How large a sample would be needed for comparing two approaches to cholesterol lowering using $\alpha=0.05$, to detect a difference of $d=20$ mg/dl or more with power = $1-\beta=0.90$? For the following hypothesis

$H_o:\mu_1 =\mu_2\quad$ vs $\quad H_1:\mu_1 \ne \mu_2$. For a two-tailed t-test, the formula is

$$N=n_1+n_2 = \frac{4\sigma^2(Z_{1-\alpha/2} + Z_{1-\beta})^2}{d^2}, \quad \text{where } d = \mu_1 - \mu_2$$

For $\sigma = 30$ mg/dl, $\alpha = 0.05$ ($Z_{1-\alpha/2}=1.96$), power $= 1-\beta = 0.90$ ($Z_{1-\beta}=1.282$), and $d = 20$ mg/dl:

\begin{align*}
N &= n_1 + n_2 = \frac{4(30)^2 (1.96 + 1.282)^2}{20^2}\\
&= \frac{4\times 900 \times (3.242)^2}{400} = 94.6
\end{align*}

The required total sample size is about 95, i.e., roughly 48 per group (about 50 per group in round figures).
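
A corresponding R sketch for the two-sample case, using the same normal-approximation formula (here $N$ is the total across both groups, assuming equal group sizes):

```r
# Total sample size N = n1 + n2 for a two-sample t-test (equal group sizes),
# normal approximation: N = 4 * sigma^2 * (z_alpha + z_beta)^2 / d^2
N_two_means <- function(sigma, d, alpha = 0.05, power = 0.90) {
  z_alpha <- qnorm(1 - alpha / 2)
  z_beta  <- qnorm(power)
  4 * sigma^2 * (z_alpha + z_beta)^2 / d^2
}

N <- N_two_means(sigma = 30, d = 20)  # about 94.6 in total
ceiling(N / 2)                        # about 48 per group
```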

Two Sample Proportion Test

For testing the two-sample proportions hypothesis,

$H_o:P_1=P_2 \quad$ vs $\quad H_1:P_1\ne P_2$

The formula for the two-sample proportion test is

$$N=n_1+n_2 = \frac{4(Z_{1-\alpha/2} + Z_{1-\beta})^2\left[\left(\frac{P_1+P_2}{2}\right) \left(1-\frac{P_1+P_2}{2}\right) \right]}{d^2}, \quad \text{where } d=P_1-P_2$$

Consider $\alpha = 0.05$ ($Z_{1-\alpha/2} = 1.96$), power $= 1-\beta = 0.90$ ($Z_{1-\beta} = 1.282$), $P_1 = 0.7$, and $P_2=0.5$, so that $d=P_1 - P_2 = 0.7-0.5 = 0.2$. The sample size will be

\begin{align*}
N &= n_1+n_2 = \frac{4(1.96+1.282)^2 [0.6(1-0.6)]}{0.2^2}\\
&= \frac{4(3.242^2)[0.6\times 0.4]}{0.2^2} = 252.25
\end{align*}

Consider using $N=260$, i.e., 130 in each group.
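
And an analogous R sketch for the two-sample proportion formula, using the pooled proportion $\frac{P_1+P_2}{2}$:

```r
# Total sample size N = n1 + n2 for comparing two proportions (equal groups),
# using the pooled proportion (P1 + P2) / 2
N_two_props <- function(p1, p2, alpha = 0.05, power = 0.90) {
  z_alpha <- qnorm(1 - alpha / 2)
  z_beta  <- qnorm(power)
  p_bar   <- (p1 + p2) / 2
  4 * (z_alpha + z_beta)^2 * p_bar * (1 - p_bar) / (p1 - p2)^2
}

N <- N_two_props(p1 = 0.7, p2 = 0.5)  # about 252
ceiling(N / 2)                        # about 127 per group (the text rounds up to 130)
```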

Summary

Proper sample size determination is both an art and a science that balances statistical requirements with practical constraints. While formulas provide a starting point, thoughtful consideration of your specific research context is essential. When in doubt, consult with a statistician to ensure your study is appropriately powered to answer your research questions.

Sample Size Determination FAQs

  • What is meant by sample size?
  • What is the importance of determining the sample size?
  • What are the important considerations in determining the sample size?
  • What are the common situations for sample size determination?
  • What is the formula of a one-sample t-test?
  • What is the formula of a two-sample test?
  • What is the formula of a two-sample proportion test?
  • What is the importance of sample size determination?


Sampling Distribution of Differences

Understand the sampling distribution of differences between means—what it is, why it matters, and how to apply it in hypothesis testing (with examples). Perfect for students, data scientists, and analysts! Ever wondered how statisticians compare two groups (e.g., test scores, sales performance, or medical treatments)? The key lies in the sampling distribution of differences between means—a fundamental concept for hypothesis testing, confidence intervals, and A/B testing.

Sampling Distribution of Differences Between Means

The sampling distribution of differences between means is the probability distribution of the differences between two sample means (e.g., $Mean_A - Mean_B$) that you would obtain if you repeatedly sampled from the two populations.

Let there be two populations of sizes $N_1$ and $N_2$ with means $\mu_1$ and $\mu_2$ and variances $\sigma_1^2$ and $\sigma_2^2$. We draw all possible samples of size $n_1$ from the first population and of size $n_2$ from the second population, either with or without replacement.

Let $\overline{x}_1$ denote the mean of a sample from the first population and $\overline{x}_2$ the mean of a sample from the second population. We then form all possible differences between these sample means, denoted by
$$d =\overline{x}_1 - \overline{x}_2$$

The frequency distribution of these differences is called the frequency distribution of differences between means, while the corresponding probability distribution is called the sampling distribution of differences between means.

Notations for Sampling Distribution of Differences between Means

| Notation | Short Description |
|---|---|
| $\mu_1$ | Mean of the first population |
| $\mu_2$ | Mean of the second population |
| $\sigma_1^2$ | Variance of the first population |
| $\sigma_2^2$ | Variance of the second population |
| $\sigma_1$ | Standard deviation of the first population |
| $\sigma_2$ | Standard deviation of the second population |
| $\mu_{\overline{x}_1 - \overline{x}_2}$ | Mean of the sampling distribution of the difference between means |
| $\sigma^2_{\overline{x}_1 - \overline{x}_2}$ | Variance of the sampling distribution of the difference between means |
| $\sigma_{\overline{x}_1 - \overline{x}_2}$ | Standard deviation of the sampling distribution of the difference between means |

Some Formulas for Sampling with/without Replacement

| Sr. No. | Sampling with Replacement | Sampling without Replacement |
|---|---|---|
| 1. | $\mu_{\overline{x}_1 -\overline{x}_2} = \mu_1-\mu_2$ | $\mu_{\overline{x}_1 -\overline{x}_2} = \mu_1-\mu_2$ |
| 2. | $\sigma^2_{\overline{x}_1 -\overline{x}_2}=\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}$ | $\sigma^2_{\overline{x}_1 -\overline{x}_2}=\frac{\sigma_1^2}{n_1}\left(\frac{N_1-n_1}{N_1-1}\right) + \frac{\sigma_2^2}{n_2}\left(\frac{N_2-n_2}{N_2-1}\right)$ |
| 3. | $\sigma_{\overline{x}_1 -\overline{x}_2}=\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}$ | $\sigma_{\overline{x}_1 -\overline{x}_2}=\sqrt{\frac{\sigma_1^2}{n_1}\left(\frac{N_1-n_1}{N_1-1}\right) + \frac{\sigma_2^2}{n_2}\left(\frac{N_2-n_2}{N_2-1}\right)}$ |
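
As a small illustration, the following R helper sketches these formulas; the finite population correction terms are applied only when sampling is without replacement.

```r
# Standard error of the difference between two sample means,
# with or without the finite population correction
se_diff_means <- function(var1, var2, n1, n2,
                          N1 = Inf, N2 = Inf, replace = TRUE) {
  v1 <- var1 / n1
  v2 <- var2 / n2
  if (!replace) {                      # sampling without replacement
    v1 <- v1 * (N1 - n1) / (N1 - 1)
    v2 <- v2 * (N2 - n2) / (N2 - 1)
  }
  sqrt(v1 + v2)
}

# With replacement, for the example populations below (variance 8/3 each, n = 2)
se_diff_means(8/3, 8/3, n1 = 2, n2 = 2)   # sqrt(8/3), about 1.63
```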

Example

Let $\overline{x}_1$ represent the mean of a sample of size $n_1=2$ selected at random with replacement from a finite population consisting of the values 5, 7, and 9. Similarly, let $\overline{x}_2$ represent the mean of a sample of size $n_2=2$ selected at random with replacement from another finite population consisting of the values 4, 6, and 8. Form the sampling distribution of the random variable $\overline{x}_1 - \overline{x}_2$ and verify that

  • $\mu_{\overline{x}_1 - \overline{x}_2} = \mu_1 - \mu_2$
  • $\sigma^2_{\overline{x}_1 - \overline{x}_2} = \frac{\sigma_1^2}{n_1}+\frac{\sigma_2^2}{n_2}$

Solution

| Population I | Population II |
|---|---|
| Values: 5, 7, 9 | Values: 4, 6, 8 |
| $N_1=3$ | $N_2=3$ |
| $n_1=2$ | $n_2=2$ |
| Possible samples with replacement: $N_1^{n_1}=3^2 =9$ | Possible samples with replacement: $N_2^{n_2} = 3^2 = 9$ |

All Possible Samples

All possible differences between the sample means from the two populations, $d=\overline{x}_1 - \overline{x}_2$, are shown in the following table; the rows correspond to the sample means $\overline{x}_1$ from Population I and the columns to the sample means $\overline{x}_2$ from Population II.

| $\overline{x}_1 \backslash \overline{x}_2$ | 4 | 5 | 5 | 6 | 6 | 6 | 7 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|---|
| 5 | 1 | 0 | 0 | -1 | -1 | -1 | -2 | -2 | -3 |
| 6 | 2 | 1 | 1 | 0 | 0 | 0 | -1 | -1 | -2 |
| 6 | 2 | 1 | 1 | 0 | 0 | 0 | -1 | -1 | -2 |
| 7 | 3 | 2 | 2 | 1 | 1 | 1 | 0 | 0 | -1 |
| 7 | 3 | 2 | 2 | 1 | 1 | 1 | 0 | 0 | -1 |
| 7 | 3 | 2 | 2 | 1 | 1 | 1 | 0 | 0 | -1 |
| 8 | 4 | 3 | 3 | 2 | 2 | 2 | 1 | 1 | 0 |
| 8 | 4 | 3 | 3 | 2 | 2 | 2 | 1 | 1 | 0 |
| 9 | 5 | 4 | 4 | 3 | 3 | 3 | 2 | 2 | 1 |

The Sampling Distribution of Differences Between Means

| $d=\overline{x}_1 - \overline{x}_2$ | $f$ | $P(d)$ | $d\cdot P(d)$ | $d^2$ | $d^2 \cdot P(d)$ |
|---|---|---|---|---|---|
| -3 | 1 | 1/81 | -3/81 | 9 | 9/81 |
| -2 | 4 | 4/81 | -8/81 | 4 | 16/81 |
| -1 | 10 | 10/81 | -10/81 | 1 | 10/81 |
| 0 | 16 | 16/81 | 0/81 | 0 | 0/81 |
| 1 | 19 | 19/81 | 19/81 | 1 | 19/81 |
| 2 | 16 | 16/81 | 32/81 | 4 | 64/81 |
| 3 | 10 | 10/81 | 30/81 | 9 | 90/81 |
| 4 | 4 | 4/81 | 16/81 | 16 | 64/81 |
| 5 | 1 | 1/81 | 5/81 | 25 | 25/81 |
| Total | 81 | 81/81 = 1 | 81/81 = 1 | | 297/81 = 3.67 |

\begin{align*}
\mu_{\overline{x}_1 - \overline{x}_2} &= E(d) = \Sigma(d\cdot P(d)) = \frac{81}{81}=1\\
\sigma^2_{\overline{x}_1 - \overline{x}_2} &= E(d^2) - [E(d)]^2\\
&=\Sigma d^2 P(d) - \left[\Sigma (d\cdot P(d))\right]^2\\
&= 3.67 - 1^2 = 2.67
\end{align*}

For the two populations themselves: $\mu_1 = \frac{5+7+9}{3}=7$, $\mu_2 = \frac{4+6+8}{3}=6$, $\sigma_1^2 = \frac{(5-7)^2+(7-7)^2+(9-7)^2}{3} = 2.67$, and $\sigma_2^2 = \frac{(4-6)^2+(6-6)^2+(8-6)^2}{3} = 2.67$.

Verification

  • $\mu_{\overline{x}_1 - \overline{x}_2} = \mu_1 - \mu_2 = 7-6 = 1$
  • $\sigma_{\overline{x}_1 - \overline{x}_2}^2 = \frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2} = \frac{2.67}{2} + \frac{2.67}{2} = 2.67$
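
The whole example can also be checked by brute force in R, enumerating all $9 \times 9 = 81$ pairs of sample means (a quick sketch):

```r
# Enumerate all samples of size 2 drawn with replacement from each population
pop1 <- c(5, 7, 9)
pop2 <- c(4, 6, 8)

xbar1 <- rowMeans(expand.grid(pop1, pop1))  # 9 sample means from Population I
xbar2 <- rowMeans(expand.grid(pop2, pop2))  # 9 sample means from Population II

d <- outer(xbar1, xbar2, "-")  # all 81 differences xbar1 - xbar2

table(d)                       # frequency distribution of d (1, 4, 10, 16, 19, ...)
mean(d)                        # 1     = mu1 - mu2
mean(d^2) - mean(d)^2          # 2.667 = sigma1^2/n1 + sigma2^2/n2
```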
