Sample Size Determination

Sample size determination is one of the most critical steps in designing any research study or experiment. Whether the researcher is conducting clinical trials, market research, or social science studies, the selection of an appropriate sample size ensures that the results are statistically valid while optimizing resources. This guide will walk you through the key concepts and methods for sample size determination.

In planning a study, sample size determination is an important issue, and the required size depends on the conditions the study must meet. For example, for a study dealing with blood cholesterol levels, these conditions are typically expressed in terms such as “How large a sample do I need to be able to reject the null hypothesis that two population means are equal if the difference between them is $d=10$ mg/dl?”

Why Sample Size Matters

  1. Statistical Power: Adequate sample sizes increase the ability to detect true effects
  2. Precision: Larger samples typically yield more precise estimates
  3. Resource Efficiency: Avoid wasting time/money on unnecessarily large samples
  4. Ethical Considerations: Especially important in clinical research to neither under- nor over-recruit participants

Special Considerations for Estimating Sample Size

  1. Small Populations: May require finite population corrections
  2. Stratified Sampling: Need to calculate for each stratum
  3. Cluster Sampling: Must account for design effect
  4. Longitudinal Studies: Consider repeated measures and attrition

Sample Size Determination Formula

In general, there exists a formula for computing a sample size for the specific test statistic (appropriate to test a specified hypothesis). These formulae require that the user specify the $\alpha$-level and Power = ($1-\beta$) desired, as well as the difference to be detected and the variability of the measure.

Common Approaches to Sample Size Calculation

For Estimating Proportions (Prevalence Studies)

A common approach to calculating the sample size for estimating a proportion uses the formula:

$$n=\frac{Z^2 p (1-p)}{E^2}$$

where

  • Z = Z-value (1.96 for 95% confidence interval)
  • p = estimated proportion
  • E = margin of error

For a survey with an expected proportion of 50%, a 95% confidence level, and 5% margin of error, the sample size will be

$$n=\frac{1.96^2 \times 0.5 \times 0.5}{0.05^2} \approx 385$$

Note that it is not wise to calculate a single number for the sample size. It is better to calculate a range of values by varying the assumptions so that one can get a sense of their impact on the resulting projected sample size. From this range of sample sizes, a suitable sample may be picked for the research work.
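The advice about varying the assumptions can be tried out with a short Python sketch. The function name `sample_size_proportion` and the grid of values for $p$ and $E$ are illustrative assumptions, not part of the original text; the function simply evaluates $n = Z^2 p(1-p)/E^2$ and rounds up.

```python
import math

def sample_size_proportion(p, margin, z=1.96):
    """Evaluate n = Z^2 * p * (1 - p) / E^2 and round up to a whole unit."""
    n = z ** 2 * p * (1 - p) / margin ** 2
    # round() removes floating-point noise before taking the ceiling
    return math.ceil(round(n, 10))

# Vary the assumed proportion and margin of error (hypothetical grid)
for p in (0.3, 0.5):
    for margin in (0.03, 0.05):
        print(f"p={p}, E={margin}: n={sample_size_proportion(p, margin)}")
```

With $p=0.5$ and $E=0.05$ this reproduces the value of 385 computed above; the other combinations show how strongly the margin of error drives the result.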

Common Situations for Sample Size Determination

We consider the process of estimating sample size for three common circumstances:

  • One-Sample t-test and paired t-test
  • Two-Sample t-test
  • Comparison of $P_1$ vs $P_2$ with a Z-test

One Sample t-test and Paired test

For testing the hypothesis:

$H_o:\mu=\mu_o\quad$ vs $\quad H_1:\mu \ne \mu_o$

For a two-tailed test, the sample size formula for the one-sample t-test is

$$n = \left[\frac{(Z_{1-\alpha/2} + Z_{1-\beta})\sigma}{d} \right]^2$$

Example: Suppose we want to determine the sample size needed for a study of blood cholesterol levels, where the population standard deviation is taken to be, say, 25 mg/dl. Consider $\alpha = 0.05$, $\sigma = 25$, $d = 5.0$, and power $= 0.80$, so that $Z_{1-\alpha/2} = 1.96$ and $Z_{1-\beta} = 0.842$.

\begin{align*}
n & = \left[ \frac{(Z_{1-\alpha/2} + Z_{1-\beta})\sigma}{d} \right]^2\\
&= \left[\frac{(1.96 + 0.842)}{5}25\right]^2 = 196.28 \approx 197
\end{align*}
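The one-sample calculation can be checked with a minimal Python sketch. The function name `n_one_sample` is a hypothetical helper; the normal quantiles come from the standard library's `statistics.NormalDist`. Because unrounded quantiles are used, the intermediate value differs slightly in the decimals from the hand calculation, but the rounded-up sample size is the same.

```python
import math
from statistics import NormalDist

def n_one_sample(sigma, d, alpha=0.05, power=0.80):
    """n = [(Z_{1-alpha/2} + Z_{1-beta}) * sigma / d]^2 for a two-tailed test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # about 0.842 for power = 0.80
    return ((z_alpha + z_beta) * sigma / d) ** 2

n = n_one_sample(sigma=25, d=5)
print(round(n, 2), math.ceil(n))  # roughly 196.2 before rounding up, 197 after
```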

Two Sample t-test

How large a sample would be needed for comparing two approaches to cholesterol lowering using $\alpha=0.05$, to detect a difference of $d=20$ mg/dl or more with power = $1-\beta=0.90$? For the following hypothesis

$H_o:\mu_1 =\mu_2\quad$ vs $\quad H_1:\mu_1 \ne \mu_2$. For a two-tailed t-test, the formula is

$$N=n_1+n_2 = \frac{4\sigma^2(Z_{1-\alpha/2} + Z_{1-\beta})^2 } {d^2}, \quad d = \mu_1 - \mu_2$$

For $\sigma = 30$mg/dl, $\beta=0.10, \alpha = 0.05$, $Z_{1-\alpha/2}=1.96$, Power = $1-\beta$, $Z_{1-\beta}=1.282$, d = 20 mg/dl.

\begin{align*}
N &= n_1 + n_2 = \frac{4(30)^2 (1.96 + 1.282)^2}{20^2}\\
&= \frac{4\times 900 \times (3.242)^2}{400} = 94.6
\end{align*}

Rounding up, $N \approx 95$ in total, that is, at least 48 per group; about 50 in each group is a convenient choice.
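As a rough check of the two-sample calculation, the following Python sketch evaluates the same formula with unrounded normal quantiles. The function name `total_n_two_means` is an assumption made for illustration.

```python
import math
from statistics import NormalDist

def total_n_two_means(sigma, d, alpha=0.05, power=0.90):
    """N = n1 + n2 = 4 * sigma^2 * (Z_{1-alpha/2} + Z_{1-beta})^2 / d^2."""
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    return 4 * sigma ** 2 * z ** 2 / d ** 2

total = total_n_two_means(sigma=30, d=20)
per_group = math.ceil(total / 2)  # equal allocation, rounded up
print(round(total, 1), per_group)  # about 94.6 in total, 48 per group
```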

Two Sample Proportion Test

For testing the two-sample proportions hypothesis,

$H_o:P_1=P_2 \quad$ vs $\quad H_1:P_1\ne P_2$

The formula for the two-sample proportion test is

$$N=n_1+n_2 = \frac{4(Z_{1-\alpha/2} + Z_{1-\beta})^2\left[\left(\frac{P_1+P_2}{2}\right) \left(1-\frac{P_1+P_2}{2}\right) \right] }{d^2}, \quad d = P_1-P_2$$

Consider $\alpha = 0.05$ and $\beta=0.10$ (power $= 1-\beta = 0.90$), so that $Z_{1-\alpha/2} = 1.96$ and $Z_{1-\beta} = 1.282$. Let $P_1 = 0.7$ and $P_2=0.5$, so that $d=P_1 - P_2 = 0.2$ and $\frac{P_1+P_2}{2} = 0.6$. The sample size will be

\begin{align*}
N &= n_1+n_2 = \frac{4(1.96+1.282)^2 [0.6(1-0.6)]}{0.2^2}\\
&= \frac{4(3.242^2)[0.6\times 0.4]}{0.2^2} = 252.25
\end{align*}

Consider using $N=260$, or 130 in each group.
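The two-proportion calculation can be sketched in Python in the same way. The function name `total_n_two_proportions` is hypothetical, and the two-tailed quantile $Z_{1-\alpha/2}$ is assumed, matching the value 1.96 used in the worked example.

```python
import math
from statistics import NormalDist

def total_n_two_proportions(p1, p2, alpha=0.05, power=0.90):
    """N = 4 * (Z_{1-alpha/2} + Z_{1-beta})^2 * p_bar * (1 - p_bar) / (p1 - p2)^2."""
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    p_bar = (p1 + p2) / 2  # average of the two proportions
    return 4 * z ** 2 * p_bar * (1 - p_bar) / (p1 - p2) ** 2

total = total_n_two_proportions(0.7, 0.5)
print(round(total, 1), math.ceil(total / 2))  # about 252.2 in total, 127 per group
```

The result is a lower bound under these assumptions; rounding up further, as in the text, leaves some margin.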

Summary

Proper sample size determination is both an art and a science that balances statistical requirements with practical constraints. While formulas provide a starting point, thoughtful consideration of your specific research context is essential. When in doubt, consult with a statistician to ensure your study is appropriately powered to answer your research questions.

Sample Size Determination FAQs

  • What is meant by sample size?
  • What is the importance of determining the sample size?
  • What are the important considerations in determining the sample size?
  • What are the common situations for sample size determination?
  • What is the formula of a one-sample t-test?
  • What is the formula of a two-sample test?
  • What is the formula of a two-sample proportion test?
  • What is the importance of sample size determination?


Sampling Distribution of Differences

Understand the sampling distribution of differences between means: what it is, why it matters, and how to apply it in hypothesis testing (with examples). Perfect for students, data scientists, and analysts! Ever wondered how statisticians compare two groups (e.g., test scores, sales performance, or medical treatments)? The key lies in the sampling distribution of differences between means, a fundamental concept for hypothesis testing, confidence intervals, and A/B testing.

Sampling Distribution of Differences Between Means

The Sampling Distribution of Differences Between Means is the probability distribution of the differences between two sample means (e.g., $\text{Mean}_A - \text{Mean}_B$) that you would obtain if you repeatedly sampled from two populations.

Suppose there are two populations of sizes $N_1$ and $N_2$ having means $\mu_1$ and $\mu_2$ and variances $\sigma_1^2$ and $\sigma_2^2$. We draw all possible samples of size $n_1$ from the first population and of size $n_2$ from the second population, with or without replacement.

Let $\overline{x}_1$ denote the means of the samples from the first population and $\overline{x}_2$ the means of the samples from the second population. We then determine all possible differences between these means, denoted by
$$d =\overline{x}_1 - \overline{x}_2$$

The distribution of these differences, tabulated by frequency, is their frequency distribution; the corresponding probability distribution is called the sampling distribution of differences between means.

Notations for Sampling Distribution of Differences between Means

| Notation | Short Description |
|---|---|
| $\mu_1$ | Mean of the first population |
| $\mu_2$ | Mean of the second population |
| $\sigma_1^2$ | Variance of the first population |
| $\sigma_2^2$ | Variance of the second population |
| $\sigma_1$ | Standard deviation of the first population |
| $\sigma_2$ | Standard deviation of the second population |
| $\mu_{\overline{x}_1 - \overline{x}_2}$ | Mean of the sampling distribution of difference between means |
| $\sigma^2_{\overline{x}_1 - \overline{x}_2}$ | Variance of the sampling distribution of difference between means |
| $\sigma_{\overline{x}_1 - \overline{x}_2}$ | Standard deviation of the sampling distribution of difference between means |

Some Formulas for Sampling with/without Replacement

| Sr. No. | Sampling with Replacement | Sampling without Replacement |
|---|---|---|
| 1. | $\mu_{\overline{x}_1 -\overline{x}_2} = \mu_1-\mu_2$ | $\mu_{\overline{x}_1 -\overline{x}_2} = \mu_1-\mu_2$ |
| 2. | $\sigma^2_{\overline{x}_1 -\overline{x}_2}=\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}$ | $\sigma^2_{\overline{x}_1 -\overline{x}_2}=\frac{\sigma_1^2}{n_1}\left(\frac{N_1-n_1}{N_1-1}\right) + \frac{\sigma_2^2}{n_2}\left(\frac{N_2-n_2}{N_2-1}\right)$ |
| 3. | $\sigma_{\overline{x}_1 -\overline{x}_2}=\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}$ | $\sigma_{\overline{x}_1 -\overline{x}_2}=\sqrt{\frac{\sigma_1^2}{n_1}\left(\frac{N_1-n_1}{N_1-1}\right) + \frac{\sigma_2^2}{n_2}\left(\frac{N_2-n_2}{N_2-1}\right)}$ |
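The two variance formulas differ only by the finite population correction $\frac{N_i-n_i}{N_i-1}$ applied to each term. A minimal Python sketch, assuming a hypothetical helper `var_diff_means` and illustrative inputs (two populations of size 3, each with variance $8/3$, and samples of size 2), shows the effect:

```python
from fractions import Fraction

def var_diff_means(var1, n1, var2, n2, N1=None, N2=None):
    """Variance of (x1_bar - x2_bar).

    Leave N1 and N2 as None for sampling with replacement; pass the
    population sizes to apply the finite population correction
    (N - n) / (N - 1) for sampling without replacement.
    """
    fpc1 = Fraction(1) if N1 is None else Fraction(N1 - n1, N1 - 1)
    fpc2 = Fraction(1) if N2 is None else Fraction(N2 - n2, N2 - 1)
    return var1 / n1 * fpc1 + var2 / n2 * fpc2

sigma_sq = Fraction(8, 3)  # illustrative population variance
with_repl = var_diff_means(sigma_sq, 2, sigma_sq, 2)
without_repl = var_diff_means(sigma_sq, 2, sigma_sq, 2, N1=3, N2=3)
print(with_repl, without_repl)  # 8/3 and 4/3
```

Exact fractions are used so that the two results can be compared without rounding.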

Example

Let $\overline{x}_1$ represent the mean of a sample of size $n_1=2$ selected at random with replacement from a finite population consisting of the values 5, 7, and 9. Similarly, let $\overline{x}_2$ represent the mean of a sample of size $n_2=2$ selected at random with replacement from another finite population consisting of the values 4, 6, and 8. Form the sampling distribution of the random variable $\overline{x}_1 - \overline{x}_2$ and verify that

  • $\mu_{\overline{x}_1 - \overline{x}_2} = \mu_1 - \mu_2$
  • $\sigma^2_{\overline{x}_1 - \overline{x}_2} = \frac{\sigma_1^2}{n_1}+\frac{\sigma_2^2}{n_2}$

Solution

| Population I | Population II |
|---|---|
| 5, 7, 9 | 4, 6, 8 |
| $N_1=3$ | $N_2=3$ |
| $n_1=2$ | $n_2=2$ |
| Possible samples with replacement: $N_1^{n_1}=3^2 =9$ | Possible samples with replacement: $N_2^{n_2} = 3^2 = 9$ |

All Possible Samples

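All possible differences between the sample means of the two populations ($d=\overline{x}_1 - \overline{x}_2$) are shown below. The rows give the values of $\overline{x}_1$ and the columns give the values of $\overline{x}_2$.

| $\overline{x}_1 \backslash \overline{x}_2$ | 4 | 5 | 5 | 6 | 6 | 6 | 7 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|---|
| 5 | 5-4 = 1 | 0 | 0 | -1 | -1 | -1 | -2 | -2 | -3 |
| 6 | 2 | 1 | 1 | 0 | 0 | 0 | -1 | -1 | -2 |
| 6 | 2 | 1 | 1 | 0 | 0 | 0 | -1 | -1 | -2 |
| 7 | 3 | 2 | 2 | 1 | 1 | 1 | 0 | 0 | -1 |
| 7 | 3 | 2 | 2 | 1 | 1 | 1 | 0 | 0 | -1 |
| 7 | 3 | 2 | 2 | 1 | 1 | 1 | 0 | 0 | -1 |
| 8 | 4 | 3 | 3 | 2 | 2 | 2 | 1 | 1 | 0 |
| 8 | 4 | 3 | 3 | 2 | 2 | 2 | 1 | 1 | 0 |
| 9 | 5 | 4 | 4 | 3 | 3 | 3 | 2 | 2 | 1 |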

The Sampling Distribution of Differences Between Means

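| $d=\overline{x}_1 - \overline{x}_2$ | $f$ | $P(d)$ | $d\cdot P(d)$ | $d^2$ | $d^2 \cdot P(d)$ |
|---|---|---|---|---|---|
| -3 | 1 | 1/81 | -3/81 | 9 | 9/81 |
| -2 | 4 | 4/81 | -8/81 | 4 | 16/81 |
| -1 | 10 | 10/81 | -10/81 | 1 | 10/81 |
| 0 | 16 | 16/81 | 0/81 | 0 | 0/81 |
| 1 | 19 | 19/81 | 19/81 | 1 | 19/81 |
| 2 | 16 | 16/81 | 32/81 | 4 | 64/81 |
| 3 | 10 | 10/81 | 30/81 | 9 | 90/81 |
| 4 | 4 | 4/81 | 16/81 | 16 | 64/81 |
| 5 | 1 | 1/81 | 5/81 | 25 | 25/81 |
| Total | 81 | 81/81 = 1 | 81/81 = 1 | | 297/81 = 3.67 |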

\begin{align*}
\mu_{\overline{x}_1 - \overline{x}_2} &= E(d) = \Sigma(d\cdot P(d)) = \frac{81}{81}=1\\
\sigma^2_{\overline{x}_1 - \overline{x}_2} &= E(d^2) - [E(d)]^2\\
&=\Sigma d^2 P(d) - \left[\Sigma (d\cdot P(d))\right]^2\\
&= 3.67 - 1^2 = 2.67
\end{align*}

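Mean and variance of both populations: $\mu_1 = \frac{5+7+9}{3} = 7$ and $\sigma_1^2 = \frac{(5-7)^2+(7-7)^2+(9-7)^2}{3} = \frac{8}{3} \approx 2.67$; $\mu_2 = \frac{4+6+8}{3} = 6$ and $\sigma_2^2 = \frac{(4-6)^2+(6-6)^2+(8-6)^2}{3} = \frac{8}{3} \approx 2.67$.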

Verification

  • $\mu_{\overline{x}_1 - \overline{x}_2} = \mu_1 - \mu_2 = 7-6 = 1$
  • $\sigma_{\overline{x}_1 - \overline{x}_2}^2 = \frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2} = \frac{2.67}{2} + \frac{2.67}{2} = 2.67$
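The whole example can also be verified by brute force. The Python sketch below enumerates every pair of samples with `itertools.product` and uses exact fractions; the variable names are illustrative choices, not part of the original solution.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

pop1, pop2 = (5, 7, 9), (4, 6, 8)
n1 = n2 = 2

# Means of all 9 ordered samples (with replacement) from each population
means1 = [Fraction(sum(s), n1) for s in product(pop1, repeat=n1)]
means2 = [Fraction(sum(s), n2) for s in product(pop2, repeat=n2)]

# All 81 differences d = x1_bar - x2_bar
diffs = [m1 - m2 for m1 in means1 for m2 in means2]

freq = Counter(diffs)
mean_d = sum(diffs) / len(diffs)
var_d = sum(d ** 2 for d in diffs) / len(diffs) - mean_d ** 2

for d in sorted(freq):
    print(d, freq[d])
print("mean:", mean_d, "variance:", var_d)  # mean 1, variance 8/3
```

The exact variance $8/3$ is the value that appears as 2.67 after rounding.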


Sampling with Replacement

In sampling with replacement, each unit drawn is returned to the population before the next unit is drawn. This means the same individual can be chosen more than once in the sampling process. Sampling with replacement can provide valuable insights while maintaining flexibility in selecting samples from a given population.

Key Characteristics of Sampling with Replacement

The following are key characteristics of Sampling with Replacement:

  1. Independence: Each selection is independent, as the same item can be selected multiple times.
  2. Population Size: The effective population size remains the same for each draw since previously selected items are replaced.
  3. Use Cases: This method is commonly used in algorithms, simulations, and bootstrapping techniques in statistics, where it’s important to assess variability or make inferences from a sample.

Example of Sampling with Replacement

As an example of sampling with replacement, suppose you have a bag containing three colored balls (red, blue, and green). If you draw a red ball, you put it back into the bag before the next draw. As a result, you could draw the red ball again in subsequent draws.
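The bag of balls can be simulated with Python's standard `random` module. This is only a sketch: the seed and the number of draws are arbitrary assumptions. `choices` draws with replacement, while `sample` draws without replacement and is shown for contrast.

```python
import random

rng = random.Random(42)  # arbitrary seed so the run is repeatable
balls = ["red", "blue", "green"]

# With replacement: each ball goes back in the bag, so repeats are possible
with_replacement = rng.choices(balls, k=5)

# Without replacement: at most len(balls) draws, and no ball repeats
without_replacement = rng.sample(balls, k=3)

print(with_replacement)
print(without_replacement)
```

Note that five draws from three balls are only possible with replacement; `rng.sample(balls, k=5)` would raise a `ValueError`.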

Drawing All Possible Samples Using Sampling with Replacement

Question: Consider a population with elements A, B, C, and D. Draw all possible samples of size 2 with replacement from this population.

Solution: In this problem, $N=4$ and $n=2$.

Possible number of samples (with replacement) = $N^n = 4^2 = 16$.

The 16 samples of size 2 are

AA  AB  AC  AD
BA  BB  BC  BD
CA  CB  CC  CD
DA  DB  DC  DD
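The same 16 samples can be generated in Python with `itertools.product`, which enumerates ordered selections with replacement. The variable names below are illustrative.

```python
from itertools import product

population = "ABCD"
n = 2

# Every ordered sample of size n drawn with replacement
samples = ["".join(s) for s in product(population, repeat=n)]

print(len(samples))  # N^n = 4^2 = 16
for start in range(0, len(samples), 4):
    print("  ".join(samples[start:start + 4]))
```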

Question: Draw all possible samples of size 3 with replacement from a population having elements 2, 4, and 6.

Solution:

Population size $N=3$, sample size $n=3$

The number of possible samples is $N^n = 3^3 = 27$

There are two ways to list these samples.

First Method:

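First, divide the number of possible samples (27) by the population size repeatedly until a quotient of 1 is obtained: $\frac{27}{3} = 9, \quad \frac{9}{3} = 3, \quad \frac{3}{3}=1$.

We obtain three quotients: 9, 3, and 1. These are the numbers of consecutive repetitions of each population unit in the first, second, and third positions of the samples. In the first position, write every unit 9 times in a row; in the second position, write every unit 3 times in a row and repeat the pattern; in the third position, write every unit once and repeat the pattern.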
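One way to read the first method is as a column-by-column construction, sketched below in Python. The loop and variable names are assumptions made for illustration; the final line checks the result against `itertools.product`.

```python
from itertools import product

units = [2, 4, 6]
n = 3
N = len(units)
total = N ** n  # 27 possible samples

columns = []
repeat = total
for _ in range(n):
    repeat //= N  # successive quotients: 9, 3, 1
    # Each unit is written `repeat` times in a row, cycling through the units
    column = [units[(row // repeat) % N] for row in range(total)]
    columns.append(column)

# Reading across the columns gives the samples, one per row
samples = list(zip(*columns))
print(samples[:4])

# The construction matches a direct enumeration
assert samples == list(product(units, repeat=n))
```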


Second Method:

First, make the samples of size 2, which are easy to draw.

2, 2
2, 4
2, 6
4, 2
4, 4
4, 6
6, 2
6, 4
6, 6

Repeat these nine samples three times to reach the required 27 samples, then add one population unit at the end (or at the start) of the samples in each repetition.

| 2, 2, 2 | 2, 2, 4 | 2, 2, 6 |
| 2, 4, 2 | 2, 4, 4 | 2, 4, 6 |
| 2, 6, 2 | 2, 6, 4 | 2, 6, 6 |
| 4, 2, 2 | 4, 2, 4 | 4, 2, 6 |
| 4, 4, 2 | 4, 4, 4 | 4, 4, 6 |
| 4, 6, 2 | 4, 6, 4 | 4, 6, 6 |
| 6, 2, 2 | 6, 2, 4 | 6, 2, 6 |
| 6, 4, 2 | 6, 4, 4 | 6, 4, 6 |
| 6, 6, 2 | 6, 6, 4 | 6, 6, 6 |

In the listing above, 2 is appended to the first set of nine samples, then 4 to the next set of nine, and finally 6 to the last set of nine.
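The second method amounts to extending each sample of size two by one more unit. A minimal Python sketch, with illustrative variable names:

```python
from itertools import product

units = [2, 4, 6]

# Step 1: the nine samples of size 2
pairs = [(a, b) for a in units for b in units]

# Step 2: append each unit in turn to all nine pairs
# (first nine end in 2, next nine in 4, last nine in 6)
triples = [pair + (last,) for last in units for pair in pairs]

print(len(triples))  # 27
print(triples[:3])
```

The order differs from a direct enumeration, but the set of 27 samples is the same.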

Real-Life Examples of Sampling with Replacement

The following are some real-life examples of sampling with replacement:

  1. Lottery Draws: In some types of lotteries, numbers can be drawn multiple times before the final selection. For example, if a lottery allows for the same number to be drawn again after being selected, this is akin to sampling with replacement.
  2. Quality Control in Manufacturing: In a factory, inspectors might draw samples of products to test for defects. After testing each item, they return it to the production line before drawing the next sample to maintain the same population size and ensure each product has a chance of being selected again.
  3. Genetic Studies: In genetics, researchers might take DNA samples from a population to study traits or disorders. By returning each sampled individual to the population (considering genetic diversity), they can analyze the data while allowing for the possibility of selecting the same individual multiple times.
  4. Surveys: When conducting surveys, researchers might randomly select participants from a population (like voters or consumers) and, after querying each individual, they can include them again in the pool for subsequent selections, especially in larger datasets where the same individuals might provide valuable insights if repeated.
  5. Educational Testing: In standardized testing, students might take multiple attempts at a test where scores from previous attempts can be considered again in analyses to assess trends in learning or improvement.
  6. Customer Behavior Analysis: Companies may analyze customer purchase patterns by repeatedly sampling transactions. For instance, if a customer makes multiple purchases, their transaction data might be included in each analysis to understand their buying behavior over time.
