Standard Deviation: A Measure of Dispersion

The standard deviation is a widely used concept in statistics; it tells how much variation (spread or dispersion) there is in a data set. It can be defined as the positive square root of the mean (average) of the squared deviations of the values from their mean.
To calculate the standard deviation, follow these steps:

Calculation of Standard Deviation

  1. First, find the mean of the data.
  2. Take the difference of each data point from the mean of the given data set (computed in step 1). Note that the sum of these differences must be equal to zero, or very close to zero because of rounding.
  3. Now compute the square of each difference obtained in step 2. Each squared difference is non-negative; it is zero only when the data point equals the mean.
  4. Now add up all the squared quantities obtained in step 3. We call it the sum of squares of differences.
  5. To calculate the population standard deviation ($\sigma$), divide this sum of squares of differences (obtained in step 4) by the total number of observations $n$. To compute the sample standard deviation ($S$), divide the sum of squares of differences by the total number of observations minus one ($n-1$), i.e. the degrees of freedom. Note that $n$ is the number of observations available in the data set.
  6. Find the square root of the quantity obtained in step 5. The resulting quantity is the standard deviation (SD) of the given data set.

The population SD ($\sigma$) and the sample SD ($S$) of a set of $n$ observations $X_1, X_2, \cdots, X_n$ are

\begin{aligned}
\sigma &=\sqrt{\frac{\sum_{i=1}^n (X_i-\overline{X})^2}{n}} \quad \text{(Population SD)}\\
S&=\sqrt{ \frac{\sum_{i=1}^n (X_i-\overline{X})^2}{n-1}} \quad \text{(Sample SD)}
\end{aligned}
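
The six steps can be traced in a few lines of Python. This is only a minimal sketch: the data values are hypothetical and chosen so that the arithmetic is easy to follow, and the standard library functions `statistics.pstdev` and `statistics.stdev` could be used instead of the manual steps.

```python
import math
import statistics

# Hypothetical data, chosen only for illustration.
data = [2, 4, 4, 4, 5, 5, 7, 9]
n = len(data)

# Step 1: mean of the data.
mean = sum(data) / n

# Step 2: deviations from the mean (their sum is zero).
deviations = [x - mean for x in data]

# Steps 3 and 4: square the deviations and add them up.
sum_of_squares = sum(d ** 2 for d in deviations)

# Steps 5 and 6: divide by n (population) or n - 1 (sample), then take the root.
population_sd = math.sqrt(sum_of_squares / n)
sample_sd = math.sqrt(sum_of_squares / (n - 1))

print(population_sd)  # 2.0
print(sample_sd)      # slightly larger, because the divisor is n - 1
```

The sample SD is always a little larger than the population SD computed from the same values, and the gap shrinks as $n$ grows.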

The standard deviation can also be obtained from the variance, as its positive square root.

A practical interpretation of the standard deviation comes from the empirical rule: for data that are approximately normally distributed, about 68% of the data values lie within the range $\overline{X} \pm \sigma$, i.e. within one standard deviation of the mean, or simply within one $\sigma$. Similarly, about 95% of the data values lie within the range $\overline{X} \pm 2 \sigma$ and about 99.7% within $\overline{X} \pm 3 \sigma$. For data that are far from normal, these percentages can differ noticeably.
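
The percentages quoted for one, two, and three standard deviations come from the normal distribution. As a quick check, the following sketch uses `statistics.NormalDist` from the Python standard library to compute the theoretical share of a normal population within $k$ standard deviations of the mean; it says nothing about data that are not normal.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
standard_normal = NormalDist(mu=0.0, sigma=1.0)

# Share of the distribution within k standard deviations of the mean.
shares = {}
for k in (1, 2, 3):
    shares[k] = standard_normal.cdf(k) - standard_normal.cdf(-k)
    print(f"within {k} SD: {shares[k]:.4f}")
# within 1 SD: 0.6827
# within 2 SD: 0.9545
# within 3 SD: 0.9973
```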


Examples

A large value of SD indicates more spread in the data set, which can be interpreted as inconsistent behavior of the data collected: the data points tend to be far from the mean value. With a smaller standard deviation, data points tend to be close to the mean, indicating consistent behavior of the data set.
The standard deviation and variance are used in finance to measure the risk of a particular investment. A mean of 15% and a standard deviation of 2% indicate that the investment is expected to earn a 15% return and, if returns are approximately normally distributed, that there is about a 68% chance that the return will be between 13% and 17%. Similarly, there is about a 95% chance that the return on the investment will be between 11% and 19%.
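
To make the idea of consistency concrete, the sketch below compares two hypothetical investments. The fund names and yearly returns are invented for illustration; both series have the same mean return of 15%, but very different standard deviations.

```python
import statistics

# Hypothetical yearly returns in percent; both series average 15%.
steady_fund = [13, 15, 17, 14, 16]
volatile_fund = [5, 25, 15, 0, 30]

for name, returns in (("steady", steady_fund), ("volatile", volatile_fund)):
    mean_return = statistics.mean(returns)
    sd_return = statistics.stdev(returns)  # sample SD (divisor n - 1)
    print(f"{name}: mean = {mean_return}%, SD = {sd_return:.2f}%")
```

The mean alone cannot distinguish the two funds; the SD shows that the second one carries far more risk.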


Range Measure of Dispersion

A measure of central tendency provides a typical value for the data set, but it does not tell the whole story about the data: the mean, median, and mode are not enough to summarize the data, even though they tell us about its center. In other words, we can measure the center of the data by looking at averages (mean, median, and mode), but these measures tell nothing about the spread of the data. So, for more information about the data, we need some other measure, such as the range measure of dispersion or spread.


The spread of data can be measured by calculating the range of the data; the range tells us how far the data extend. The range is an absolute measure of dispersion that is found by subtracting the smallest value (called the lower bound) from the largest value (called the upper bound), i.e.

Range = Upper Bound – Lower Bound
OR
Range = Largest Value – Smallest Value

This absolute measure of dispersion has disadvantages: the range only describes the width of the data set, measured in the same unit as the data, and does not give a real picture of how the data are distributed. If the data have outliers, using the range to describe the spread can be very misleading, as the range is sensitive to outliers.
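
The sensitivity of the range to outliers is easy to see numerically. In this small sketch, the helper `value_range` and the temperature readings are hypothetical; a single extreme reading changes the range from 3 to 24 although the other values are unchanged.

```python
def value_range(values):
    """Range = largest value - smallest value."""
    return max(values) - min(values)

# Hypothetical daily temperatures in degrees Celsius.
daily_temps = [21, 22, 22, 23, 24]
with_outlier = daily_temps + [45]  # one faulty or extreme reading

print(value_range(daily_temps))   # 3
print(value_range(with_outlier))  # 24
```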

We need to be careful in using the range measure of dispersion as it does not give the full picture of what is going on between the highest and lowest values. It might give a misleading picture of the spread of the data because it is based only on the two extreme values. Therefore, the range is often an unsatisfactory measure of dispersion.


However, because it is very easy to calculate, the range is widely used in applications such as statistical process control (for example, control charts for manufactured products), daily temperatures, and stock prices. The range is an absolute measure of dispersion; its relative counterpart, known as the coefficient of dispersion, is defined by the relation

\[Coefficient\,\, of\,\, Dispersion = \frac{x_m-x_0}{x_m+x_0}\]

where $x_m$ is the largest value and $x_0$ is the smallest value in the data set. The coefficient of dispersion is a pure (dimensionless) number and is used for comparison purposes.
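
As an illustration of such a comparison, the sketch below takes the coefficient of dispersion as the range divided by the sum of the two extreme values, $(x_m-x_0)/(x_m+x_0)$, which is the usual definition of the coefficient of range. The function name and both data sets are hypothetical, and the formula assumes positive data so that the denominator is not zero.

```python
def coefficient_of_range(values):
    """(largest - smallest) / (largest + smallest); assumes positive data."""
    largest, smallest = max(values), min(values)
    return (largest - smallest) / (largest + smallest)

# Hypothetical measurements in different units.
heights_cm = [150, 160, 170, 180]
weights_kg = [50, 60, 70, 90]

print(round(coefficient_of_range(heights_cm), 4))  # 0.0909
print(round(coefficient_of_range(weights_kg), 4))  # 0.2857
```

Because the units cancel, the two results can be compared directly: relative to their size, the weights are more spread out than the heights.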


Absolute Measure of Dispersion

An absolute measure of dispersion gives an idea about the amount of dispersion (spread) in a set of observations. These quantities measure the dispersion in the same units as the original data. An absolute measure of dispersion cannot be used to compare the variation of two or more series or data sets that differ in units or scale, and it does not, in itself, tell whether the variation is large or small.

The absolute measures of dispersion are:

  1. Range
  2. Quartile Deviation
  3. Mean Deviation
  4. Variance or Standard Deviation

Range

The Range is the difference between the largest value and the smallest value in the data set. For ungrouped data, let $X_0$ be the smallest value and $X_n$ be the largest value in a data set; then the range ($R$) is defined as
$R=X_n-X_0$.

For grouped data, the range can be calculated in three different ways:
R = Midpoint of the highest class – Midpoint of the lowest class
R = Upper class limit of the highest class – Lower class limit of the lowest class
R = Upper class boundary of the highest class – Lower class boundary of the lowest class
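
The three variants generally give different answers for the same frequency table. The following sketch uses hypothetical class limits and assumes the usual convention that class boundaries lie halfway across the gap between adjacent class limits.

```python
# Hypothetical classes given as (lower limit, upper limit).
classes = [(10, 19), (20, 29), (30, 39), (40, 49)]
lowest, highest = classes[0], classes[-1]

# Gap between adjacent class limits; boundaries sit half a gap outside the limits.
half_gap = (classes[1][0] - classes[0][1]) / 2  # 0.5

range_midpoints = (highest[0] + highest[1]) / 2 - (lowest[0] + lowest[1]) / 2
range_limits = highest[1] - lowest[0]
range_boundaries = (highest[1] + half_gap) - (lowest[0] - half_gap)

print(range_midpoints)   # 30.0
print(range_limits)      # 39
print(range_boundaries)  # 40.0
```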

Quartile Deviation (Semi-Interquartile Range)

The difference between the third and first quartiles, $Q_3-Q_1$, is called the interquartile range. Half of this range is called the semi-interquartile range (SIQR) or simply the quartile deviation (QD), which is an absolute measure of dispersion. $$QD=\frac{Q_3-Q_1}{2}$$

The Quartile Deviation is superior to the range as it is not affected by extremely large or small observations; however, it does not give any information about the position of observations lying outside the two quartiles. It is not amenable to mathematical treatment and is greatly affected by sampling variability. Although the Quartile Deviation is not widely used as a measure of dispersion, it is used in situations in which extreme observations are thought to be unrepresentative or misleading. Because the Quartile Deviation is not based on all observations, it is not affected by extreme observations.

Note: For a roughly symmetric distribution, the range “Median ± QD” contains approximately 50% of the data.
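
A short sketch of the quartile deviation follows. The data are hypothetical, and the quartiles are computed with `statistics.quantiles` using its `"inclusive"` method; other quartile conventions can give slightly different values of $Q_1$ and $Q_3$.

```python
import statistics

# Hypothetical, symmetric data set.
data = [2, 4, 6, 8, 10, 12, 14, 16, 18]

# Quartiles with the "inclusive" convention (one of several in common use).
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")

iqr = q3 - q1  # interquartile range
qd = iqr / 2   # quartile deviation (semi-interquartile range)

median = statistics.median(data)
inside = [x for x in data if median - qd <= x <= median + qd]

print(q1, q3, qd)               # 6.0 14.0 4.0
print(len(inside) / len(data))  # 5 of the 9 values, roughly half
```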

Mean Deviation (Average Deviation)

The Mean Deviation is another absolute measure of dispersion and is defined as the arithmetic mean of the absolute deviations measured either from the mean or from the median. All deviations are counted as positive to avoid the difficulty arising from the property that the sum of deviations of observations from their mean is zero.

$MD=\frac{\sum|X-\overline{X}|}{n}\quad$ for ungrouped data for mean
$MD=\frac{\sum f|X-\overline{X}|}{\sum f}\quad$ for grouped data for mean
$MD=\frac{\sum|X-\tilde{X}|}{n}\quad$ for ungrouped data for median
$MD=\frac{\sum f|X-\tilde{X}|}{\sum f}\quad$ for grouped data for median
The Mean Deviation can be calculated about other measures of central tendency, but it is smallest when the deviations are taken from the median.

The Mean Deviation gives more information than the range or the Quartile Deviation as it is based on all the observed values. The Mean Deviation does not give undue weight to occasional large deviations, so it is a reasonable choice in situations where such deviations are likely to occur.
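
The ungrouped formulas can be checked with a small sketch. The helper `mean_deviation` and the data are hypothetical; the data include one large value to show that the mean deviation is smaller about the median than about the mean, and that it reacts less strongly to the extreme value than the standard deviation does.

```python
import statistics

def mean_deviation(values, center):
    """Arithmetic mean of the absolute deviations from `center`."""
    return sum(abs(x - center) for x in values) / len(values)

# Hypothetical data with one extreme observation.
data = [1, 2, 3, 4, 100]

md_about_mean = mean_deviation(data, statistics.mean(data))      # center = 22
md_about_median = mean_deviation(data, statistics.median(data))  # center = 3
population_sd = statistics.pstdev(data)

print(md_about_mean)    # 31.2
print(md_about_median)  # 20.2
print(population_sd)    # larger than both mean deviations
```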

Variance and Standard Deviation

This absolute measure of dispersion is defined as the mean of the squares of deviations of all the observations from their mean. Traditionally population variance is denoted by $\sigma^2$ (sigma square) and for sample data denoted by $S^2$ or $s^2$.

Symbolically
$\sigma^2=\frac{\sum(X_i-\mu)^2}{N}\quad$ Population Variance for ungrouped data
$S^2=\frac{\sum(X_i-\overline{X})^2}{n}\quad$ sample Variance for ungrouped data
$\sigma^2=\frac{\sum f(X_i-\mu)^2}{\sum f}\quad$ Population Variance for grouped data
$S^2=\frac{\sum f (X_i-\overline{X})^2}{\sum f}\quad$ Sample Variance for grouped data

The variance is denoted by $Var(X)$ for a random variable $X$. The term variance was introduced by R. A. Fisher (1890-1962) in 1918. The variance is expressed in squared units, so it can be a large number compared with the observations themselves.
Note that there are alternative formulas to compute the Variance or Standard Deviation.
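
One well-known alternative is the computational (shortcut) form of the population variance, $\sigma^2=\frac{\sum X_i^2}{N}-\left(\frac{\sum X_i}{N}\right)^2$. The sketch below, on hypothetical data, checks that it agrees with the definitional formula. The shortcut is convenient for hand calculation, but with floating-point arithmetic it can lose precision when the values are large relative to their spread.

```python
import math

# Hypothetical data, chosen only for illustration.
data = [2, 4, 4, 4, 5, 5, 7, 9]
n = len(data)
mean = sum(data) / n

# Definitional form: mean of squared deviations from the mean.
definitional = sum((x - mean) ** 2 for x in data) / n

# Shortcut form: mean of squares minus square of the mean.
shortcut = sum(x * x for x in data) / n - mean ** 2

print(definitional, shortcut)  # 4.0 4.0
```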

The positive square root of the variance is called Standard Deviation (SD) to express the deviation in the same units as the original observation. It is a measure of the average spread about the mean and is symbolically defined as

$\sigma=\sqrt{\frac{\sum(X_i-\mu)^2}{N}}\quad$ Population Standard Deviation for ungrouped data
$S=\sqrt{\frac{\sum(X_i-\overline{X})^2}{n}}\quad$ Sample Standard Deviation for ungrouped data
$\sigma=\sqrt{\frac{\sum f(X_i-\mu)^2}{\sum f}}\quad$ Population Standard Deviation for grouped data
$S=\sqrt{\frac{\sum f (X_i-\overline{X})^2}{\sum f}}\quad$ Sample Standard Deviation for grouped data
The Standard Deviation is the most useful measure of dispersion; the name Standard Deviation is credited to Karl Pearson (1857-1936).
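
For grouped data, each class is represented by its midpoint $X_i$ and weighted by its frequency $f$. The following sketch uses a hypothetical frequency table and the divisor $\sum f$; the result is checked against `statistics.pvariance` applied to the midpoints repeated according to their frequencies.

```python
import math
import statistics

# Hypothetical frequency table: class midpoints and their frequencies.
midpoints = [14.5, 24.5, 34.5, 44.5]
frequencies = [2, 5, 2, 1]

total_f = sum(frequencies)
grouped_mean = sum(f * x for x, f in zip(midpoints, frequencies)) / total_f

# Frequency-weighted mean of squared deviations, divided by the sum of f.
grouped_variance = (
    sum(f * (x - grouped_mean) ** 2 for x, f in zip(midpoints, frequencies))
    / total_f
)
grouped_sd = math.sqrt(grouped_variance)

print(grouped_mean)      # 26.5
print(grouped_variance)  # 76.0
print(grouped_sd)
```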

In some texts, the sample variance is defined as $S^2=\frac{\sum (X_i-\overline{X})^2}{n-1}$, based on the argument that knowledge of any $n-1$ deviations determines the remaining deviation, as the sum of the $n$ deviations must be zero. This is an unbiased estimator of the population variance $\sigma^2$. The Standard Deviation has a definite mathematical meaning, utilizes all the observed values, and is amenable to mathematical treatment, but it is affected by extreme values.
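
Both points can be illustrated on a small hypothetical data set: the last deviation is fixed by the other $n-1$, and the two variance definitions differ only by the factor $n/(n-1)$. In the Python standard library, `statistics.pvariance` uses the divisor $n$ and `statistics.variance` uses $n-1$.

```python
import math
import statistics

# Hypothetical data, chosen only for illustration.
data = [2, 4, 4, 4, 5, 5, 7, 9]
n = len(data)
mean = sum(data) / n

# Knowing the first n - 1 deviations determines the last one.
deviations = [x - mean for x in data]
last_from_others = -sum(deviations[:-1])

variance_n = statistics.pvariance(data)         # divisor n
variance_n_minus_1 = statistics.variance(data)  # divisor n - 1

print(math.isclose(last_from_others, deviations[-1]))                   # True
print(math.isclose(variance_n_minus_1, variance_n * n / (n - 1)))  # True
```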


Measure of Dispersion or Variability

A measure of location (average or central tendency) is not sufficient to describe the characteristics of a distribution, because two or more distributions may have averages that are exactly alike even though the distributions are dissimilar in other aspects. A measure of central tendency represents only the typical value of the data set. To give a sensible description of data, we also need a numerical quantity, called a measure of dispersion (variability or scatter), that describes the spread of the values in a set of data. There are two types of measures of dispersion or variability:

  1. Absolute Measures
  2. Relative Measures

A measure of central tendency together with a measure of dispersion gives a more adequate description of data than a measure of location alone, because averages only describe the balancing point of the data set; they do not provide any information about the degree to which the data tend to spread or scatter about the average value. So, a measure of dispersion indicates how well the measure of central tendency represents the data. The smaller the variability of a given data set, the more representative the average is of that data set.

Absolute Measure of Dispersion

Absolute measures are defined in such a way that they have the same units (such as meters, grams, etc.) as the original measurements. Absolute measures cannot be used to compare the variation or spread of two or more data sets measured in different units.
The most common absolute measures of variability are:

  • Range
  • Quartile Deviation
  • Mean Deviation
  • Variance or Standard Deviation

Relative Measures of Dispersion

The relative measures have no units as these are ratios, coefficients, or percentages. Relative measures are independent of the units of measurement and are useful for comparing data of different natures. Common relative measures are:

  • Coefficient of Variation
  • Coefficient of Mean Deviation
  • Coefficient of Quartile Deviation
  • Coefficient of Standard Deviation
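
As an example of a relative measure, the sketch below computes the coefficient of variation, taken here in its usual form as the standard deviation divided by the mean and multiplied by 100. The helper name and the two data sets are hypothetical, and the population SD is used for simplicity.

```python
import statistics

def coefficient_of_variation(values):
    """Population SD as a percentage of the mean; assumes a nonzero mean."""
    return 100 * statistics.pstdev(values) / statistics.mean(values)

# Hypothetical measurements in different units.
heights_cm = [150, 160, 170, 180]
weights_kg = [50, 60, 70, 90]

cv_heights = coefficient_of_variation(heights_cm)
cv_weights = coefficient_of_variation(weights_kg)

print(f"heights: {cv_heights:.1f}%")  # heights: 6.8%
print(f"weights: {cv_weights:.1f}%")  # weights: 21.9%
```

The standard deviations themselves (in cm and kg) cannot be compared, but the unit-free percentages can.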

Different terms are used for the measure of dispersion, such as variability, spread, scatter, measure of uncertainty, and deviation.

References:
http://www2.le.ac.uk/offices/careers/ld/resources/numeracy/variability
