Chebyshev’s Theorem

Chebyshev’s Theorem (also known as Chebyshev’s Inequality) is a statistical rule that applies to any distribution, regardless of its shape (not just normal distributions). It provides a way to estimate the minimum proportion of data points that fall within a certain number of standard deviations of the mean.

Chebyshev’s Theorem Statement

For any dataset (with mean $\mu$ and standard deviation $\sigma$), at least $1-\frac{1}{k^2}$ of the data values will fall within $k$ standard deviations of the mean, where $k>1$. It can be defined in probability form as

$$P\left[|X-\mu| < k\sigma \right] \ge 1 - \frac{1}{k^2}$$

  • At least 75% of the data lies within 2 standard deviations of the mean (since $1-\frac{1}{2^2}=0.75$).
  • At least 89% lies within 3 standard deviations of the mean ($1-\frac{1}{3^2}\approx 0.89$).
  • At least 96% lies within 5 standard deviations of the mean ($1-\frac{1}{5^2}=0.96$).
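These guarantees follow directly from the formula; a minimal Python sketch (the function name `chebyshev_bound` is illustrative, not from any library):

```python
def chebyshev_bound(k: float) -> float:
    """Chebyshev's lower bound on the proportion of data lying
    within k standard deviations of the mean (any distribution)."""
    if k <= 1:
        raise ValueError("the bound is informative only for k > 1")
    return 1 - 1 / k**2

for k in (2, 3, 5):
    print(f"k = {k}: at least {chebyshev_bound(k):.1%} of the data")
```

For $k = 2, 3, 5$ this prints 75.0%, 88.9%, and 96.0%, matching the bullets above.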

Key Points about Chebyshev’s Theorem

  • Works for any distribution (normal, skewed, uniform, etc.).
  • Provides a conservative lower bound (actual proportions may be higher).
  • Useful when the data distribution is unknown.

Unlike the Empirical Rule (which applies only to bell-shaped distributions), Chebyshev’s Theorem is universal—great for skewed or unknown distributions.

Note: Chebyshev’s Theorem gives only lower bounds for the proportion of data values, whereas the Empirical Rule gives approximations. If a data distribution is known to be bell-shaped, the Empirical Rule should be used.

Real-Life Application of Chebyshev’s Theorem

  • Quality Control & Manufacturing: Manufacturers use Chebyshev’s Theorem to determine the minimum percentage of products that fall within acceptable tolerance limits. For example, if a factory produces bolts with a mean length of 5cm and a standard deviation of 0.1cm, Chebyshev’s Theorem guarantees that at least 75% of bolts will be between 4.8 cm and 5.2 cm (within 2 standard deviations).
  • Finance & Risk Management: Investors use Chebyshev’s Theorem to assess the risk of stock returns. For example, if a stock has an average return of 8% with a standard deviation of 2%, Chebyshev’s Theorem ensures that at least 89% of returns will be between 2% and 14% (within 3 standard deviations).
  • Weather Forecasting: Meteorologists use Chebyshev’s Theorem to predict temperature variations. For example, if the average summer temperature in a city is 30${}^\circ$C with a standard deviation of 3${}^\circ$C, at least 75% of days will have temperatures between 24${}^\circ$C and 36${}^\circ$C (within 2 standard deviations).
  • Education & Grading Systems: Teachers can use Chebyshev’s Theorem to estimate grade distributions when the exact distribution of test scores is unknown. For example, if an exam has a mean score of 70 with a standard deviation of 10, at least 75% of students scored between 50 and 90 (within 2 standard deviations). Chebyshev’s Theorem can therefore help assess performance ranges.
  • Healthcare & Medical Studies: Medical researchers use Chebyshev’s Theorem to analyze biological data (e.g., blood pressure, cholesterol levels). For example, if the average blood pressure is 120 mmHg with a standard deviation of 10, at least 75% of patients have blood pressure between 100 and 140 mmHg (within 2 standard deviations).
  • Insurance & Actuarial Science: Insurance companies use Chebyshev’s Theorem to estimate claim payouts. For example, if the average claim is 5,000 with a standard deviation of 1,000, at least 89% of claims will be between 2,000 and 8,000 (within 3 standard deviations).
  • Environmental Studies: When tracking irregular phenomena like daily pollution levels, Chebyshev’s inequality helps understand the concentration of values – even when the data is erratic.

Numerical Example of Chebyshev’s Theorem

Consider the daily delivery times (in minutes) for a courier.
Data: 30, 32, 35, 36, 37, 39, 40, 41, 43, 50

Calculate the mean and standard deviation:

  • Mean $\mu$ = 38.3
  • Standard Deviation $\sigma$ = 5.74 (sample standard deviation)

Let $k=2$ (we want to know how many values will lie within 2 standard deviations of the mean):
\begin{align}
\mu - 2\sigma &= 38.3 - (2\times 5.74) \approx 26.82\\
\mu + 2\sigma &= 38.3 + (2\times 5.74) \approx 49.78
\end{align}

So, the interval from about 26.82 to 49.78 should contain at least 75% of the data, according to Chebyshev’s inequality.
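The guarantee can be checked against the raw data; a short Python sketch using the sample standard deviation:

```python
from statistics import mean, stdev

times = [30, 32, 35, 36, 37, 39, 40, 41, 43, 50]
mu, sigma = mean(times), stdev(times)      # sample standard deviation
lo, hi = mu - 2 * sigma, mu + 2 * sigma    # k = 2 band

within = sum(lo < t < hi for t in times) / len(times)
print(f"mean = {mu:.1f}, sd = {sigma:.2f}")
print(f"2-sigma band = ({lo:.2f}, {hi:.2f}), share inside = {within:.0%}")
```

Nine of the ten delivery times (90%) lie inside the 2-sigma band, comfortably above the guaranteed 75% minimum.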

A visual representation of the data points, mean, and shaded bands for $\pm 1\sigma$, $\pm 2\sigma$, and $\pm 3\sigma$.


From the visual representation of Chebyshev’s Theorem, one can see how most of the data points cluster around the mean value and how the $\pm 2\sigma$ range captures 90% of the data.

Summary

Chebyshev’s Inequality/Theorem is a powerful tool in statistics because it applies to any dataset, making it useful in fields like finance, manufacturing, healthcare, and more. While it doesn’t give exact probabilities like the normal distribution, it provides a worst-case scenario guarantee, which is valuable for risk assessment and decision-making.

FAQs about Chebyshev’s Theorem

  • What is Chebyshev’s Inequality/Theorem?
  • What is the range of values of Chebyshev’s Inequality?
  • Give some real-life application of Chebyshev’s Theorem.
  • What is the Chebyshev Theorem Formula?


Empirical Rule

The Empirical Rule (also known as the 68-95-99.7 Rule) is a statistical principle that applies to normally distributed data (bell-shaped curves). It tells us how data is spread around the mean in such distributions.

Empirical Rule states that:

  • About 68% of the data falls within ±1 standard deviation ($\sigma$) of the mean ($\mu$). Range: $\mu-1\sigma$ to $\mu+1\sigma$.
  • About 95% of the data falls within ±2 standard deviations ($2\sigma$) of the mean. Range: $\mu-2\sigma$ to $\mu+2\sigma$.
  • About 99.7% of the data falls within ±3 standard deviations ($3\sigma$) of the mean. Range: $\mu-3\sigma$ to $\mu+3\sigma$.
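For a truly normal distribution, the exact coverage behind the 68-95-99.7 figures can be computed from the cumulative distribution function; a minimal sketch with Python's standard library:

```python
from statistics import NormalDist

z = NormalDist()  # standard normal: mu = 0, sigma = 1

# P(mu - k*sigma < X < mu + k*sigma) for k = 1, 2, 3
coverage = {k: z.cdf(k) - z.cdf(-k) for k in (1, 2, 3)}

for k, p in coverage.items():
    print(f"within ±{k} sd: {p:.4f}")
# within ±1 sd: 0.6827, ±2 sd: 0.9545, ±3 sd: 0.9973
```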

Visual Representation of Empirical Rule

The empirical rule can be visualized from the following graphical representation:


Key Points

  • Empirical Rule only applies to normal (symmetric, bell-shaped) distributions.
  • It helps estimate probabilities and identify outliers.
  • About 0.3% of data lies beyond ±3σ (considered rare events).

Numerical Example of Empirical Rule

Suppose adult human heights are normally distributed with Mean ($\mu$) = 70 inches and standard deviation ($\sigma$) = 3 inches. Then:

  • 68% of heights are between 67–73 inches ($\mu \pm \sigma \Rightarrow 70 \pm 3$ ).
  • 95% are between 64–76 inches ($\mu \pm 2\sigma\Rightarrow 70 \pm 2\times 3$).
  • 99.7% are between 61–79 inches ($\mu \pm 3\sigma \Rightarrow 70 \pm 3\times 3$).
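Assuming the stated mean and standard deviation, the three ranges can be checked directly with `statistics.NormalDist`:

```python
from statistics import NormalDist

heights = NormalDist(mu=70, sigma=3)  # assumed normal model of adult heights

p1 = heights.cdf(73) - heights.cdf(67)  # within 1 sd: 67-73 inches
p2 = heights.cdf(76) - heights.cdf(64)  # within 2 sd: 64-76 inches
p3 = heights.cdf(79) - heights.cdf(61)  # within 3 sd: 61-79 inches

print(f"{p1:.1%}  {p2:.1%}  {p3:.1%}")  # 68.3%  95.4%  99.7%
```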

This rule is a quick way to understand variability in normally distributed data without complex calculations. For non-normal distributions, other methods (like Chebyshev’s inequality) may be used.

Real-Life Applications & Examples

  • Quality Control in Manufacturing: Manufacturers measure product dimensions (e.g., bottle fill volume, screw lengths). If the process is normally distributed, the Empirical Rule helps detect defects: If soda bottles have a mean volume of 500ml with $\sigma$ = 10ml:
    • 68% of bottles will be between 490ml–510ml.
    • 95% will be between 480ml–520ml.
    • Bottles outside 470ml–530ml (3$\sigma$) are rare and may indicate a production issue.
  • Human Height Distribution: Heights of people in a population often follow a normal distribution. If the average male height is 70 inches (5’10”) with $\sigma$ = 3 inches:
    • 68% of men are between 67–73 inches.
    • 95% are between 64–76 inches.
    • 99.7% are between 61–79 inches.
  • Test Scores (Standardized Exams): Exam scores (SAT, IQ tests) are often approximately normally distributed. If SAT scores have $\mu$ = 1000 and $\sigma$ = 200:
    • 68% of students score between 800–1200.
    • 95% score between 600–1400.
    • Extremely low (<400) or high (>1600) scores are rare.
  • Financial Market Analysis (Stock Returns): Daily stock returns often follow an approximately normal distribution. If a stock has an average daily return of 0.1% with $\sigma$ = 2%:
    • 68% of days will see returns between -1.9% to +2.1%.
    • 95% will be between -3.9% to +4.1%.
    • Extreme crashes or surges beyond ±6% are very rare (0.3%).
  • Medical Data (Blood Pressure, Cholesterol Levels): Many health metrics are normally distributed. If the average systolic blood pressure is 120 mmHg with $\sigma$ = 10:
    • 68% of people have readings between 110–130 mmHg.
    • 95% fall within 100–140 mmHg.
    • Readings above 150 mmHg may indicate hypertension.
  • Weather Data (Temperature Variations): Daily temperatures in a region often follow a normal distribution. If the average July temperature is 85°F with $\sigma$ = 5°F:
    • 68% of days will be between 80°F–90°F.
    • 95% will be between 75°F–95°F.
    • Extremely hot (>100°F) or cold (<70°F) days are rare.

Why the Empirical Rule Matters

  • It helps in predicting probabilities without complex calculations.
  • It is used in risk assessment (finance, insurance).
  • It guides quality control and process improvements.
  • It assists in setting thresholds (e.g., medical diagnostics, passing scores).

FAQs about Empirical Rule

  • What is the empirical rule?
  • For what kind of probability distribution is the Empirical Rule used?
  • What is the area under the curve (or percentage) if data falls within 1, 2, and 3 standard deviations?
  • Represent the rule graphically.
  • Give real-life applications and examples of the rule.
  • Why does the Empirical Rule matter?


Importance of Dispersion in Statistics

The importance of dispersion in statistics cannot be ignored. The term dispersion (also called spread or variability) expresses the variability in a data set. Measures of dispersion are very important in statistics because they quantify, on average, how much the data points differ from the average (or from some other reference measure). A measure of variability also tells us about the consistency of a data set.

Dispersion quantifies how far the data values lie from a central point (such as the average). Data with minimal variability around its center is said to be more consistent: the less the variability in the data, the more consistent the data.

Example of Measure of Dispersion

Suppose the score of three batsmen in three cricket matches:

Player | Match 1 | Match 2 | Match 3 | Average Score
A      | 70      | 80      | 90      | 80
B      | 75      | 80      | 85      | 80
C      | 65      | 80      | 95      | 80

The question is which player is more consistent with his performance.

In the above data set, the player whose deviation from the average is smallest is the most consistent. Player B’s scores deviate least from his average, so Player B is the most consistent; he shows the least variation.
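The consistency claim can be checked numerically; a short Python sketch using sample standard deviations (scores as listed in the table above):

```python
from statistics import mean, stdev

scores = {"A": [70, 80, 90], "B": [75, 80, 85], "C": [65, 80, 95]}

for player, s in scores.items():
    print(player, mean(s), stdev(s))
# All three players average 80, but B's standard deviation (5) is the
# smallest (A: 10, C: 15), so B is the most consistent.
```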

There are two types of measures of dispersion:

Absolute Measure of Dispersion

In absolute measure of dispersion, the measure is expressed in the original units in which the data is collected. For example, if data is collected in grams, the measure of dispersion will also be expressed in grams. The absolute measure of dispersion has the following types:

  • Range
  • Quartile Deviation
  • Average Deviation
  • Standard Deviation
  • Variance

Relative Measures of Dispersion

In the relative measures of dispersion, the measure is expressed in terms of coefficients, percentages, ratios, etc. It has the following types:

  • Coefficient of range
  • Coefficient of Quartile Deviation
  • Coefficient of Average Deviation
  • Coefficient of Variation (CV)

See more about Measures of Dispersion

Range and Coefficient of Range

Range is defined as the difference between the maximum value and minimum value of the data; statistically, it is $R=x_{max} - x_{min}$.

The Coefficient of Range is $\frac{x_{max} - x_{min} }{x_{max} + x_{min} }$. Multiplying it by 100 expresses it as a percentage.

Consider the ungrouped data $x = 32, 36, 36, 37, 39, 41, 45, 46, 48$

The range will be $x_{max} - x_{min} = 48 - 32 = 16$.

The Coefficient of Range will be

\begin{align*}
\text{Coef of Range} &= \frac{x_{max} - x_{min} }{x_{max} + x_{min} } \\
&= \frac{48-32}{48+32} = \frac{16}{80} = 0.2\\
&= 0.2 \times 100 = 20\%
\end{align*}
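The same calculation takes a few lines of Python:

```python
x = [32, 36, 36, 37, 39, 41, 45, 46, 48]

rng = max(x) - min(x)                         # Range (absolute measure)
coef = (max(x) - min(x)) / (max(x) + min(x))  # Coefficient of Range (relative)

print(rng)            # 16
print(f"{coef:.0%}")  # 20%
```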

For the following grouped data, the range and coefficient of the range will be

Classes   | Freq | Class Boundaries
65 – 84   | 9    | 64.5 – 84.5
85 – 104  | 10   | 84.5 – 104.5
105 – 124 | 17   | 104.5 – 124.5
125 – 144 | 10   | 124.5 – 144.5
145 – 164 | 5    | 144.5 – 164.5
165 – 184 | 4    | 164.5 – 184.5
185 – 204 | 5    | 184.5 – 204.5
Total     | 60   |

The upper class boundary of the highest class will be $x_{max}$ and the lower class boundary of the lowest class will be $x_{min}$. Therefore, $x_{max}=204.5$ and $x_{min} = 64.5$, and

$$Range = x_{max} - x_{min} = 204.5 - 64.5 = 140$$

The Coefficient of Range will be

\begin{align*}
\text{Coef of Range} &= \frac{x_{max} - x_{min} }{x_{max} + x_{min} } \\
&= \frac{204.5-64.5}{204.5+64.5} = \frac{140}{269} = 0.5204\\
&= 0.5204 \times 100 = 52.04\%
\end{align*}
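For the grouped data, the same arithmetic applies to the class boundaries read from the table:

```python
# Lower boundary of the lowest class and upper boundary of the highest class
x_min, x_max = 64.5, 204.5

rng = x_max - x_min
coef = (x_max - x_min) / (x_max + x_min)

print(rng, round(coef * 100, 2))  # 140.0 52.04
```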

Average Deviation and Coefficient of Average Deviation

The average deviation is an absolute measure of dispersion. The mean (average) of the absolute deviations, taken from the mean, median, or mode, is called the average deviation. Statistically, it is

$$Mean\,\, Deviation_{\overline{X}} = \frac{\sum\limits_{i=1}^n|x_i-\overline{x}|}{n}$$

$X$   | $x-\overline{x}$ | $|x-\overline{x}|$ | $x-\tilde{x}$ | $|x-\tilde{x}|$ | $x-\hat{x}$ | $|x-\hat{x}|$
32    | $32-40=-8$       | 8  | $32-39=-7$ | 7  | $32-36=-4$ | 4
36    | $36-40=-4$       | 4  | $36-39=-3$ | 3  | $36-36=0$  | 0
36    | $36-40=-4$       | 4  | $36-39=-3$ | 3  | $36-36=0$  | 0
37    | $37-40=-3$       | 3  | $37-39=-2$ | 2  | $37-36=1$  | 1
39    | $39-40=-1$       | 1  | $39-39=0$  | 0  | $39-36=3$  | 3
41    | $41-40=1$        | 1  | $41-39=2$  | 2  | $41-36=5$  | 5
45    | $45-40=5$        | 5  | $45-39=6$  | 6  | $45-36=9$  | 9
46    | $46-40=6$        | 6  | $46-39=7$  | 7  | $46-36=10$ | 10
48    | $48-40=8$        | 8  | $48-39=9$  | 9  | $48-36=12$ | 12
Total | 0                | 40 |            | 39 |            | 44

Where
\begin{align*}
Mean &= \overline{x} = \frac{\sum\limits_{i=1}^n x_i}{n} = \frac{360}{9} = 40\\
Mode &= \hat{x} = 36\\
Median &= \tilde{x} = 39\\
MD_{\overline{x}} &= \frac{\sum\limits_{i=1}^n |x-\overline{x}|}{n} = \frac{40}{9} = 4.44\\
MD_{\tilde{x}} &= \frac{\sum\limits_{i=1}^n |x-\tilde{x}|}{n} = \frac{39}{9} = 4.33\\
MD_{\hat{x}} &= \frac{\sum\limits_{i=1}^n |x-\hat{x}|}{n} = \frac{44}{9} = 4.89
\end{align*}

The relative measure of average deviation is the coefficient of average deviation. It can be calculated as follows:

Coefficient of Average Deviation from Mean (also called Mean Coefficient of Dispersion)

\begin{align*}\text{Mean Coefficient of Dispersion} = \frac{MD_{\overline{x}}}{\overline{x}}\times 100 = \frac{4.44}{40}\times 100 = 11.1\%\end{align*}

Coefficient of Average Deviation from Median (also called Median Coefficient of Dispersion)

\begin{align*}\text{Median Coefficient of Dispersion} = \frac{MD_{\tilde{x}}}{\tilde{x}}\times 100 = \frac{4.33}{39}\times 100 = 11.1\%\end{align*}

Coefficient of Average Deviation from Mode (also called Mode Coefficient of Dispersion)

\begin{align*}\text{Mode Coefficient of Dispersion} = \frac{MD_{\hat{x}}}{\hat{x}}\times 100 = \frac{4.89}{36}\times 100 = 13.6\%\end{align*}
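These mean deviations and coefficients can be verified in Python (the helper `mean_deviation` is illustrative):

```python
from statistics import mean, median, mode

x = [32, 36, 36, 37, 39, 41, 45, 46, 48]

def mean_deviation(data, center):
    """Average absolute deviation of the data from a given center."""
    return sum(abs(v - center) for v in data) / len(data)

for name, c in [("mean", mean(x)), ("median", median(x)), ("mode", mode(x))]:
    md = mean_deviation(x, c)
    print(f"MD about {name} ({c}): {md:.2f}  coefficient: {md / c:.1%}")
```

This reproduces the three mean deviations (4.44, 4.33, 4.89) and coefficients (11.1%, 11.1%, 13.6%) computed above.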

Average Deviation for Grouped Data

One can also compute average deviations for grouped data (Discrete Case) as follows:

$x$ (Mid Point) | $f$ | $fx$ | $|x-\overline{x}|$ | $f|x-\overline{x}|$ | $|x-\tilde{x}|$ | $f|x-\tilde{x}|$
10    | 9  | 90   | $|10-34|=24$ | 216 | 20 | 180
20    | 10 | 200  | $|20-34|=14$ | 140 | 10 | 100
30    | 17 | 510  | $|30-34|=4$  | 68  | 0  | 0
40    | 10 | 400  | $|40-34|=6$  | 60  | 10 | 100
50    | 5  | 250  | $|50-34|=16$ | 80  | 20 | 100
60    | 4  | 240  | $|60-34|=26$ | 104 | 30 | 120
70    | 5  | 350  | $|70-34|=36$ | 180 | 40 | 200
Total | 60 | 2040 |              | 848 |    | 800

\begin{align*}
\overline{x} &= \frac{\sum\limits_{i=1}^n f_ix_i}{\sum\limits_{i=1}^n f_i} = \frac{2040}{60} = 34\\
\tilde{x} &= 30\\
\hat{x} &= 30\\
MD_{\overline{x}} &= \frac{\sum\limits_{i=1}^n f_i|x_i-\overline{x}|}{n} = \frac{848}{60} = 14.13\\
MD_{\tilde{x}} &= \frac{\sum\limits_{i=1}^n f_i|x_i-\tilde{x}|}{n} = \frac{800}{60} = 13.33\\
MD_{\hat{x}} &= \frac{\sum\limits_{i=1}^n f_i|x_i-\hat{x}|}{n} = \frac{800}{60} = 13.33\\
\text{Mean Coefficient of Dispersion} &= \frac{MD_{\overline{x}}}{\overline{x}}\times 100 = \frac{14.13}{34}\times 100 = 41.57\%\\
\text{Median Coefficient of Dispersion} &= \frac{MD_{\tilde{x}}}{\tilde{x}}\times 100 = \frac{13.33}{30}\times 100 = 44.44\%
\end{align*}

Since the mode equals the median here ($\hat{x} = \tilde{x} = 30$), the mean deviation from the mode equals the mean deviation from the median.
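The grouped-data computations can likewise be checked in Python (`grouped_md` is an illustrative helper for the frequency-weighted mean absolute deviation):

```python
x = [10, 20, 30, 40, 50, 60, 70]   # class midpoints
f = [9, 10, 17, 10, 5, 4, 5]       # frequencies, n = sum(f) = 60

n = sum(f)
xbar = sum(fi * xi for fi, xi in zip(f, x)) / n   # grouped mean

def grouped_md(center):
    """Frequency-weighted mean absolute deviation from a center."""
    return sum(fi * abs(xi - center) for fi, xi in zip(f, x)) / n

print(xbar)                        # 34.0
print(round(grouped_md(xbar), 2))  # 14.13
print(round(grouped_md(30), 2))    # 13.33  (median = mode = 30)
```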

Importance of Dispersion in Statistics

As the above discussion and numerical examples show, variability (dispersion) is crucial in statistics. The following are some reasons for the importance of dispersion in statistics:

  • Understanding Data Spread: Variability gives insights into the spread or distribution of data, helping to understand how much individual data points differ from the average or some other measure.
  • Data Reliability: Lower variability in data can indicate higher reliability and consistency, which is key for making sound predictions and decisions.
  • Identifying Outliers: High variability can indicate the presence of outliers or anomalies in the data, which might require further investigation.
  • Comparing Datasets: Dispersion measures, such as variance and standard deviation, allow for the comparison of different datasets. Two datasets might have the same mean but different levels of dispersion, which can imply different data patterns or behaviors.
  • Risk Assessment: In fields like finance, assessing the variability of returns is crucial for understanding and managing risk. Higher variability often implies higher risk.
  • Statistical Inferences: Many statistical methods, such as hypothesis testing and confidence intervals, rely on the variability of data to make accurate inferences about populations from samples.
  • Balanced Decision Making: Understanding variability helps in making more informed decisions by providing a clearer picture of the data’s characteristics and potential fluctuations.

Overall, variability is essential for a comprehensive understanding of data, enabling analysts to draw meaningful conclusions and make informed decisions.
