Properties of Measure of Central Tendency

Understanding the Properties of Measure of Central Tendency helps in selecting the appropriate measure for accurate data interpretation. This blog post explores the key properties of measures of central tendency: mean, median, and mode, along with their advantages and limitations.

Introduction: Properties of Measure of Central Tendency

In statistics, measures of central tendency are crucial for summarizing and interpreting data. They provide a single value that represents the center or typical value of a dataset. The three most common measures of central tendency are the mean, median, and mode. Each measure has unique properties that make it suitable for different types of data and analytical purposes.

Mean (Arithmetic Average)

The mean (the most widely used measure of central tendency) is the sum of all values in a dataset divided by the number of values $\left(\frac{\sum\limits_{i=1}^n X_i}{n}\right)$.

Properties of Mean

  • Sensitive to All Data Points
    The mean considers every value in the dataset, making it highly responsive to changes. A single extreme value (outlier) can significantly affect the mean.
  • Algebraic Manipulability
    The mean is used in further mathematical operations (for example, measures of dispersion such as the variance and standard deviation). The sum of deviations from the mean ($X_i-\overline{X}$) is always zero:
    $$\sum\limits_{i=1}^n (X_i - \overline{X}) = 0$$
  • Applicable to Interval and Ratio Data
    The mean is suitable for continuous numerical data (for example, height, weight, and income). It is not appropriate for nominal or ordinal data.
  • Affected by Skewness
    In skewed distributions, the mean is pulled toward the tail, making it less representative of central tendency.
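
The properties above can be checked on a small example. The following is a minimal sketch using Python's standard library; the data values are hypothetical and chosen only for illustration.

```python
from statistics import mean

values = [10, 12, 14, 16, 18]            # hypothetical sample
m = mean(values)                         # arithmetic mean of the sample

# The deviations from the mean always sum to zero.
total_dev = sum(x - m for x in values)

# A single extreme value pulls the mean strongly toward itself.
with_outlier = values + [200]
m_out = mean(with_outlier)

print(m, total_dev, m_out)
```

Adding one outlier moves the mean from 14 to 45, even though five of the six values are below 20.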

Advantages of the Mean

  • Mean uses all data points, providing a comprehensive measure.
  • It is useful in statistical inferences and parametric tests.

Limitations of the Mean

  • Distorted by outliers.
  • Mean should not be used for highly skewed data.

Median (Middle Value)

The median is the middle value (the most central data value) in an ordered dataset/array. If the dataset has an even number of observations, the median is the average of the two central values.

Properties of Median

  • Resistant to Outliers
    Unlike the mean, the median is not affected by extreme values (outliers), because it depends only on the middle value(s) of the ordered dataset.
  • Applicable to Ordinal, Interval, and Ratio Data
    The median works well for ranked (ordinal) and continuous numerical data. However, it is not suitable for nominal data (categories without order).
  • Unaffected by Skewness
    The median remains stable in skewed distributions, making it a better measure than the mean in such cases.
  • Not Algebraically Manipulable
    Unlike the mean, the median cannot be used in further mathematical computations (for example, standard deviation).
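
To illustrate the resistance to outliers, here is a small sketch comparing the median and the mean before and after an extreme value is added. The income figures are hypothetical.

```python
from statistics import mean, median

incomes = [30, 32, 35, 38, 40]           # hypothetical incomes (in thousands)
incomes_out = incomes + [500]            # same data plus one extreme value

med_before = median(incomes)             # middle value of 5 ordered values
med_after = median(incomes_out)          # average of the two central values
mean_before = mean(incomes)
mean_after = mean(incomes_out)

print(med_before, med_after)             # the median barely moves
print(mean_before, mean_after)           # the mean shifts substantially
```

With an even number of observations, the median is the average of the two central values (here 35 and 38).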

Advantages of the Median

  • Median is robust against outliers.
  • Median better represents the central tendency in skewed distributions.

Limitations of the Median

  • Median does not consider all data points.
  • It is less efficient than the mean for normally distributed data.

Mode (Most Frequent Value)

The mode is the value that appears most frequently in a dataset. A dataset can have one mode (unimodal), two modes (bimodal), or multiple modes (multimodal). It is the only measure of central tendency that can have more than one value.

Properties of Mode

  • Applicable to All Data Types
    The mode works with nominal, ordinal, interval, and ratio data. It is the only measure of central tendency suitable for categorical data (e.g., colors, brands).
  • Unaffected by Outliers
    Since the mode depends on frequency, extreme values do not impact the mode.
  • Not Necessarily Unique
    Some datasets have no mode (if no value repeats), while others have multiple modes.
  • Not Useful for Small Datasets
    In small samples, the mode may not accurately represent central tendency.
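
The following sketch shows how the mode behaves for categorical data, for data with no repeated values, and in the presence of an outlier. The helper `modes` is a hypothetical function written for this illustration; it follows the convention used above that a dataset with no repeated value has no mode.

```python
from collections import Counter

def modes(data):
    """Return all most-frequent values, or [] if no value repeats."""
    counts = Counter(data)
    if not counts:
        return []
    top = max(counts.values())
    if top == 1:                          # every value occurs once: no mode
        return []
    return [value for value, c in counts.items() if c == top]

colors = ["red", "blue", "red", "green", "blue"]   # hypothetical categories
print(modes(colors))            # bimodal: two values tie for most frequent
print(modes([1, 2, 3]))         # no value repeats, so there is no mode
print(modes([1, 2, 2, 3, 100])) # the outlier 100 does not affect the mode
```

Note that Python's built-in `statistics.multimode` uses a different convention and returns every value when none repeats.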

Advantages of the Mode

  • Mode is useful for categorical data.
  • Mode helps identify peaks in frequency distributions.

Limitations of the Mode

  • May not exist in some datasets.
  • Less informative for continuous numerical data with no repeated values.

Comparison of Mean, Median, and Mode

| Property | Mean | Median | Mode |
|---|---|---|---|
| Sensitive to Outliers | Yes | No | No |
| Works with Skewed Data | No | Yes | Sometimes |
| Applicable to Nominal Data | No | No | Yes |
| Mathematical Usability | High | Low | Low |
| Best for Symmetric Data | Yes | Yes | Sometimes |

Choosing the Right Measures of Central Tendency

The choice between mean, median, and mode depends on:

  • Data Type
    • Use the mean for normally distributed numerical data, that is, data points are homogeneous.
    • Use the median for ordinal or skewed numerical data, that is, data points are heterogeneous.
    • Use mode for categorical data, or when data points repeat.
  • Presence of Outliers
    • If outliers are present, the median is preferred.
    • If data is clean and normally distributed, the mean is ideal.
  • Purpose of Analysis
    • For statistical computations (e.g., regression), the mean is necessary.
    • For descriptive summaries (e.g., income distribution), the median is better.

Summary: Properties of Measures of Central Tendency

The measures of central tendency (mean, median, and mode) each have unique properties that determine their suitability for different datasets. The mean is precise but affected by outliers, the median is robust against skewness, and the mode is versatile for categorical data. Understanding these properties ensures accurate data interpretation and informed decision-making in statistical analysis.

By selecting the appropriate measure based on data characteristics, analysts can derive meaningful insights and avoid misleading conclusions. Whether summarizing exam scores, income levels, or survey responses, the right measure of central tendency provides clarity in a world of data.


Sample Size Determination

Sample size determination is one of the most critical steps in designing any research study or experiment. Whether the researcher is conducting clinical trials, market research, or social science studies, the selection of an appropriate sample size ensures that the results are statistically valid while optimizing resources. This guide will walk you through the key concepts and methods for sample size determination.

When planning a study, the sample size must be chosen so that certain conditions are met. For example, for a study dealing with blood cholesterol levels, these conditions are typically expressed in terms such as: “How large a sample do I need to be able to reject the null hypothesis that two population means are equal if the difference between them is $d=10$ mg/dl?”

Why Sample Size Matters

  1. Statistical Power: Adequate sample sizes increase the ability to detect true effects
  2. Precision: Larger samples typically yield more precise estimates
  3. Resource Efficiency: Avoid wasting time/money on unnecessarily large samples
  4. Ethical Considerations: Especially important in clinical research to neither under- nor over-recruit participants

Special Considerations for Estimating Sample Size

  1. Small Populations: May require finite population corrections
  2. Stratified Sampling: Need to calculate for each stratum
  3. Cluster Sampling: Must account for design effect
  4. Longitudinal Studies: Consider repeated measures and attrition

Sample Size Determination Formula

In general, there exists a formula for computing a sample size for the specific test statistic (appropriate to test a specified hypothesis). These formulae require that the user specify the $\alpha$-level and Power = ($1-\beta$) desired, as well as the difference to be detected and the variability of the measure.

Common Approaches to Sample Size Calculation

For Estimating Proportions (Prevalence Studies)

A common approach to calculating the sample size is to use the formula:

$$n=\frac{Z^2 p (1-p)}{E^2}$$

where

  • Z = Z-value (1.96 for 95% confidence interval)
  • p = estimated proportion
  • E = margin of error

For a survey with an expected proportion of 50%, a 95% confidence level, and 5% margin of error, the sample size will be

$$n=\frac{1.96^2 \times 0.5 \times 0.5}{0.05^2} \approx 385$$

Note that it is not wise to calculate a single number for the sample size. It is better to calculate a range of values by varying the assumptions so that one can get a sense of their impact on the resulting projected sample size. From this range of sample sizes, a suitable sample may be picked for the research work.
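
As a minimal sketch of this advice, the proportion formula above can be evaluated over a range of margins of error. The function name `sample_size_proportion` is hypothetical, and the Z-value is taken from the standard normal quantile function rather than the rounded 1.96.

```python
import math
from statistics import NormalDist

def sample_size_proportion(p, margin, confidence=0.95):
    """n = Z^2 * p * (1 - p) / E^2, rounded up to a whole respondent."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)   # about 1.96 for 95%
    return math.ceil(z ** 2 * p * (1 - p) / margin ** 2)

# Vary the margin of error to see its impact on the projected sample size.
for margin in (0.03, 0.05, 0.10):
    print(margin, sample_size_proportion(p=0.5, margin=margin))
```

For $p = 0.5$ and a 5% margin of error, this reproduces the value of 385 computed above; tightening the margin increases the required sample size sharply.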

Common Situations for Sample Size Determination

We consider the process of estimating sample size for three common circumstances:

  • One-Sample t-test and paired t-test
  • Two-Sample t-test
  • Comparison of $P_1$ vs $P_2$ with a Z-test

One-Sample t-test and Paired t-test

For testing the hypothesis:

$H_o:\mu=\mu_o\quad$ vs $\quad H_1:\mu \ne \mu_o$

For a two-tailed test, the sample size formula for the one-sample t-test is

$$n = \left[\frac{(Z_{1-\alpha/2} + Z_{1-\beta})\sigma}{d} \right]^2$$

Example: Suppose we are interested in estimating the size of a sample from a population of blood cholesterol levels. The typical standard deviation of the population is, say, 25 mg/dl. Consider $\alpha = 0.05$, $\sigma = 25$, $d = 5.0$, and power $= 0.80$.

\begin{align*}
n & = \left[ \frac{(Z_{1-\alpha/2} + Z_{1-\beta})\sigma}{d} \right]^2\\
&= \left[\frac{(1.96 + 0.842) \times 25}{5}\right]^2 = 196.28 \approx 197
\end{align*}
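
The calculation above can be sketched in code as follows. The function name `sample_size_one_sample` is hypothetical; the Z-values come from the standard normal quantile function, so intermediate decimals differ slightly from the hand calculation with rounded values.

```python
import math
from statistics import NormalDist

def sample_size_one_sample(sigma, d, alpha=0.05, power=0.80):
    """n = [(Z_{1-alpha/2} + Z_{1-beta}) * sigma / d]^2 for a two-tailed test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # about 1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)            # about 0.842 for power = 0.80
    return math.ceil(((z_alpha + z_beta) * sigma / d) ** 2)

n = sample_size_one_sample(sigma=25, d=5.0)
print(n)
```

With $\sigma = 25$ and $d = 5$, the result rounds up to 197, in agreement with the worked example.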

Two-Sample t-test

How large a sample would be needed for comparing two approaches to cholesterol lowering using $\alpha=0.05$, to detect a difference of $d=20$ mg/dl or more with power = $1-\beta=0.90$? For the following hypothesis

$H_o:\mu_1 =\mu_2\quad$ vs $\quad H_1:\mu_1 \ne \mu_2$. For a two-tailed t-test, the formula is

$$N=n_1+n_2 = \frac{4\sigma^2(Z_{1-\alpha/2} + Z_{1-\beta})^2 } {d^2}, \quad d = \mu_1 - \mu_2$$

For $\sigma = 30$ mg/dl, $\alpha = 0.05$ ($Z_{1-\alpha/2}=1.96$), power $= 1-\beta = 0.90$ ($Z_{1-\beta}=1.282$), and $d = 20$ mg/dl:

\begin{align*}
N &= n_1 + n_2 = \frac{4(30)^2 (1.96 + 1.282)^2}{20^2}\\
&= \frac{4\times 900 \times (3.242)^2}{400} = 94.6
\end{align*}

Since $N \approx 95$ in total, about 48 subjects are needed per group; rounding up, the required sample size is about 50 for each group.
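
A minimal sketch of the two-sample calculation is given below. The function name `sample_size_two_sample` is hypothetical, and equal group sizes are assumed, so the per-group size is half of the total $N$ rounded up.

```python
import math
from statistics import NormalDist

def sample_size_two_sample(sigma, d, alpha=0.05, power=0.90):
    """Return (total N, per-group n) from N = 4*sigma^2*(Z_a + Z_b)^2 / d^2."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # about 1.96
    z_beta = NormalDist().inv_cdf(power)            # about 1.282 for power = 0.90
    total = 4 * sigma ** 2 * (z_alpha + z_beta) ** 2 / d ** 2
    per_group = math.ceil(total / 2)                # assumes equal group sizes
    return total, per_group

total, per_group = sample_size_two_sample(sigma=30, d=20)
print(round(total, 1), per_group)
```

The total is close to the 94.6 obtained by hand, which gives a minimum of 48 subjects per group before any allowance for dropout.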

Two Sample Proportion Test

For testing the two-sample proportions hypothesis,

$H_o:P_1=P_2 \quad$ vs $\quad H_1:P_1\ne P_2$

The formula for the two-sample proportion test is

$$N=n_1+n_2 = \frac{4(Z_{1-\alpha/2} + Z_{1-\beta})^2\left[\left(\frac{P_1+P_2}{2}\right) \left(1-\frac{P_1+P_2}{2}\right) \right] }{d^2}, \quad d = P_1-P_2$$

Consider $\alpha = 0.05$ ($Z_{1-\alpha/2} = 1.96$), power $= 1-\beta = 0.90$ ($Z_{1-\beta} = 1.282$), $P_1 = 0.7$, $P_2=0.5$, and $d=P_1 - P_2 = 0.7-0.5 = 0.2$. The sample size will be

\begin{align*}
N &= n_1+n_2 = \frac{4(1.96+1.282)^2 [0.6(1-0.6)]}{0.2^2}\\
&= \frac{4(3.242^2)[0.6\times 0.4]}{0.2^2} = 252.25
\end{align*}

Consider using $N=260$, or 130 in each group.
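
The two-proportion formula can be sketched in the same way. The function name `sample_size_two_proportions` is hypothetical, and equal group sizes are assumed.

```python
import math
from statistics import NormalDist

def sample_size_two_proportions(p1, p2, alpha=0.05, power=0.90):
    """Return (total N, per-group n) for a two-tailed two-proportion test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # about 1.96
    z_beta = NormalDist().inv_cdf(power)            # about 1.282
    p_bar = (p1 + p2) / 2                           # average of the two proportions
    d = p1 - p2                                     # difference to be detected
    total = 4 * (z_alpha + z_beta) ** 2 * p_bar * (1 - p_bar) / d ** 2
    return total, math.ceil(total / 2)              # assumes equal group sizes

total, per_group = sample_size_two_proportions(0.7, 0.5)
print(round(total, 1), per_group)
```

The computed total is about 252, so the minimum is 127 per group; choosing 130 per group, as suggested above, leaves a small margin.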

Summary

Proper sample size determination is both an art and a science that balances statistical requirements with practical constraints. While formulas provide a starting point, thoughtful consideration of your specific research context is essential. When in doubt, consult with a statistician to ensure your study is appropriately powered to answer your research questions.

Sample Size Determination FAQs

  • What is meant by sample size?
  • What is the importance of determining the sample size?
  • What are the important considerations in determining the sample size?
  • What are the common situations for sample size determination?
  • What is the formula of a one-sample t-test?
  • What is the formula of a two-sample test?
  • What is the formula of a two-sample proportion test?
  • What is the importance of sample size determination?


Formal Hypothesis Test

A formal hypothesis test in statistics is a structured method used to determine whether there is enough evidence in a sample of data to infer that a certain condition holds for the entire population. It involves making an initial assumption (the null hypothesis) and then evaluating whether the observed data provides sufficient evidence to reject that assumption in favor of an alternative hypothesis.

Null and Alternative Hypotheses

In a formal hypothesis test, the null hypothesis is denoted by $H_o$ and the alternative hypothesis is denoted by $H_a$ (also written $H_1$). The null and alternative hypotheses need to be assigned as follows:

Null Hypothesis

The null hypothesis is the hypothesis being tested. $H_o$ must

  • be the hypothesis we want to reject
  • contain the condition of equality (=, $\ge$, or $\le$)

Alternative Hypothesis

The alternative hypothesis is always the opposite of the null hypothesis, $H_o$. $H_a$ must

  • be the hypothesis we want to support
  • not contain the condition of equality (<, >, $\ne$)

A formal hypothesis test will always conclude with a decision to reject $H_o$ based on sample data or the decision that there is not strong enough evidence to reject $H_o$.


Components of a Formal Hypothesis Test

The following are key components of a formal hypothesis test.

  • Null Hypothesis ($H_o$)
    It is a statement of “No Effect” or “No Difference”. For example, $H_o:\mu=\mu_o$ (the population mean $\mu$ equals a specified value $\mu_o$).
  • Alternative Hypothesis ($H_1$)
    It is a statement that contradicts the null hypothesis. An alternative hypothesis can be one-tailed (for example, $H_1:\mu> \mu_o$, or $H_1:\mu<\mu_o$) or two-tailed (for instance, $H_1:\mu\ne\mu_o$).
  • Test Statistic (Test Formula)
    A numerical value calculated from sample data using an appropriate statistic, such as a t-statistic, z-score, F-statistic, or $\chi^2$ statistic.
  • Significance Level ($\alpha$)
    The maximum acceptable probability of rejecting a true null hypothesis (a type-I error) is chosen at the outset of the hypothesis test and is referred to as the level of significance or significance level for the test. The level of significance is denoted by $\alpha$, and the most commonly used values are $\alpha = 0.10$, $0.05$, and $0.01$.
    Note that, for a given sample size and effect size, once $\alpha$ is determined, the value of $\beta$ (the probability of making a type-II error in the hypothesis test) is also fixed.
  • P-value
    The probability of observing a test statistic at least as extreme as the one computed, if $H_o$ is true. If $p\le\alpha$, reject $H_o$; otherwise, fail to reject it.
  • Decision Rule
    Reject $H_o$ if the test statistic falls in the critical region or if $p\le\alpha$
  • Conclusion
    State whether there is sufficient evidence to reject $H_o$ in favour of $H_1$.

Hypothetical Example: One-Sample t-test

  • Null Hypothesis: The population mean $\mu=50$
  • Alternative Hypothesis: The population mean $\mu \ne 50$ (two-tailed test)
  • Test Statistic: $t=\frac{\overline{x}-\mu_o}{\frac{s}{\sqrt{n}}}$, where $\overline{x}$ is the sample mean, $\mu_o = 50$ is the hypothesized mean, $s$ is the sample standard deviation, and $n$ is the sample size
  • Decision: if $|t| > t_{\alpha/2, n-1}$ or $p\le\alpha$, reject $H_o$
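
The steps of this example can be sketched as follows. The sample values are hypothetical, and the critical value $t_{0.025,\,7} = 2.365$ is taken from a standard t-table as an assumption for $\alpha = 0.05$ and $n = 8$.

```python
import math
from statistics import mean, stdev

def one_sample_t(sample, mu0):
    """t = (x_bar - mu0) / (s / sqrt(n)), with s the sample standard deviation."""
    n = len(sample)
    return (mean(sample) - mu0) / (stdev(sample) / math.sqrt(n))

sample = [52, 48, 55, 51, 49, 53, 50, 54]   # hypothetical measurements
t_stat = one_sample_t(sample, mu0=50)

t_critical = 2.365                          # table value for alpha/2 = 0.025, df = 7
reject = abs(t_stat) > t_critical           # two-tailed decision rule

print(round(t_stat, 3), reject)
```

Here the sample mean is 51.5 and $|t|$ is below the critical value, so there is not enough evidence to reject $H_o:\mu=50$.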

Real Life Examples of Formal Hypothesis Tests

The following are a few real-life examples of formal hypothesis tests used in various fields.

  • Medical Testing (Drug Efficacy): Consider a pharmaceutical company that tests whether a new drug lowers blood pressure more effectively than a placebo. It is a real case used in clinical trials for hypertension medications. The hypotheses will be
    • $H_o$: The drug has no effect ($\mu_{drug} = \mu_{placebo}$).
    • $H_1$: The drug reduces blood pressure ($\mu_{drug}<\mu_{placebo}$). It is a one-tailed test.
    • Test Statistic Used: Two-sample t-test will be used for comparing the means of two groups.
  • Social Science (Opinion Polls): Consider a pollster who tests whether support for a political party candidate differs between men and women. The hypothesis may be
    • $H_o$: No gender difference in support ($p_{men} = p_{women}$).
    • $H_1$: Support differs by gender ($p_{men}\ne p_{women}$). It is a two-tailed test.
    • Test Statistic Used: Chi-Square test for independence (categorical data) will be used.
  • Economics (Policy Impact): A government tests whether a tax incentive increased small business growth. The hypotheses will be
    • $H_o$: The policy had no effect ($\mu_{after} - \mu_{before}=0$).
    • $H_1$: The policy increased growth ($\mu_{after} - \mu_{before}>0$). It is a one-tailed test.
    • Test Statistic Used: Regression analysis with a dummy variable or difference-in-differences test.
  • Business and Marketing (A/B Testing): An e-commerce company tests whether a redesigned website increases sales compared to the old version. The hypotheses will be:
    • $H_o$:The new design has no impact on sales ($p_{new}=p_{old}$)
    • $H_1$: The new design increases sales ($p_{new}>p_{old}$). It is a one-tailed test.
    • Test Statistic: For comparing conversion rates, a two-proportion z-test can be used.
  • Manufacturing (Quality Control): Suppose a factory checks if the average weight of cereal boxes meets the advertised weight of 500g. The hypotheses are:
    • $H_o$: The mean weight is 500g ($\mu=500$)
    • $H_1$: The mean weight differs from 500g ($\mu\ne 500$). It is a two-tailed test.
    • Test Statistic: A one-sample t-test can be used for testing against a known standard.
  • Environmental Science (Pollution Levels): Researchers are interested in testing if a river’s pollution level exceeds the safe limit (e.g., lead concentration > 15ppm). The hypotheses may be:
    • $H_o$: Mean lead concentration $\le$ 15 ppm ($\mu\le 15$)
    • $H_1$: Mean lead concentration > 15 ppm ($\mu > 15$). It is a one-tailed test.
    • Test Statistic: One-sample t-test (or non-parametric Wilcoxon test, if data is skewed) can be used
  • Education (Test Score Improvement): A school may be interested in testing whether a new teaching method improves students’ math scores. The hypothesis may be
    • $H_o$: The new method has no effect ($\mu_{after} - \mu_{before}=0$)
    • $H_1$: The new method improves scores ($\mu_{after} > \mu_{before}$). It is a one-tailed test.
    • Test Statistic: A paired sample t-test can be used.
  • Psychology (Behavioural Studies): A researcher may test whether sleep deprivation affects reaction time. The hypotheses are
    • $H_o$: Sleep deprivation has no effect ($\mu_{sleep\,deprived} = \mu_{normal\,sleep}$)
    • $H_1$: Sleep deprivation increases reaction time ($\mu_{sleep\,deprived}>\mu_{normal\,sleep}$). It is a one-tailed test.
    • Test Statistic: An Independent two-sample t-test can be used for comparing two groups.
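
As an illustration of one of the tests listed above, here is a minimal sketch of the one-tailed two-proportion z-test from the A/B testing example. The function name and the conversion counts are hypothetical, and the pooled standard error is assumed, which is the usual choice under $H_o: p_{new}=p_{old}$.

```python
import math
from statistics import NormalDist

def two_proportion_z_test(x_new, n_new, x_old, n_old):
    """One-tailed z-test of H1: p_new > p_old using the pooled proportion."""
    p_new, p_old = x_new / n_new, x_old / n_old
    pooled = (x_new + x_old) / (n_new + n_old)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_new + 1 / n_old))
    z = (p_new - p_old) / se
    p_value = 1 - NormalDist().cdf(z)        # upper-tail probability
    return z, p_value

# Hypothetical data: 120 of 1000 visitors convert on the new design, 100 of 1000 on the old.
z, p_value = two_proportion_z_test(120, 1000, 100, 1000)
alpha = 0.05
reject = p_value <= alpha

print(round(z, 2), round(p_value, 3), reject)
```

With these counts the p-value is above 0.05, so the decision is that there is not strong enough evidence to reject $H_o$, even though the observed conversion rate is higher for the new design.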
