# Basic Statistics and Data Analysis

## Range: An Absolute Measure of Dispersion

A measure of central tendency provides a typical value for a data set, but it does not tell the whole story: the mean, median, and mode are not enough to summarize the data, even though they tell us where its center lies. In other words, averages (mean, median, mode) locate the center of the data but say nothing about its spread. For more information about the data we need another kind of measure, a measure of dispersion or spread.

The spread of data can be measured by calculating the range, which tells us over how wide an interval the data extend. The range (an absolute measure of dispersion) is found by subtracting the smallest value in the data (called the lower bound) from the largest value (called the upper bound), i.e.

Range = Upper Bound – Lower Bound
OR
Range = Largest Value – Smallest Value

This absolute measure of dispersion has disadvantages. The range, measured in the same units as the data, describes only the overall width of the data set; it does not give a real picture of how the data are distributed between the highest and lowest values. Because it is based only on the two extreme values, the range is sensitive to outliers, and if the data contain outliers it can give a very misleading picture of the spread. It is therefore often an unsatisfactory measure of dispersion and should be used with care.

However, because it is very easy to calculate, the range is widely used in applications such as statistical process control (e.g. control charts for manufactured products), daily temperatures, and stock prices. The range is an absolute measure of dispersion; its relative counterpart, known as the coefficient of dispersion, is defined by the relation

$Coefficient\,\, of\,\, Dispersion = \frac{x_m-x_0}{x_m+x_0}$

where $x_m$ is the largest value and $x_0$ is the smallest value in the data. The coefficient of dispersion is a pure (dimensionless) number and is used for comparison purposes.
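As a small illustration of the range and its relative counterpart, the sketch below computes both for a made-up list of temperatures and then adds one extreme reading to show how strongly the range reacts to an outlier. The data and the function names are hypothetical, and the coefficient of dispersion is taken here as (largest − smallest) / (largest + smallest).

```python
def data_range(values):
    """Range = largest value - smallest value (same units as the data)."""
    return max(values) - min(values)


def coefficient_of_dispersion(values):
    """Relative range: (x_m - x_0) / (x_m + x_0), a unit-free ratio."""
    x_m, x_0 = max(values), min(values)
    return (x_m - x_0) / (x_m + x_0)


temps = [21, 23, 22, 24, 25]      # hypothetical daily temperatures
temps_outlier = temps + [60]      # same data plus one extreme reading

print(data_range(temps))                    # 4
print(data_range(temps_outlier))            # 39
print(coefficient_of_dispersion(temps))     # 4 / 46
```

A single extreme value changes the range from 4 to 39 even though the other five observations are untouched, which is the sensitivity to outliers described above.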

## Absolute Measure of Dispersion

An absolute measure of dispersion gives an idea of the amount of dispersion (spread) in a set of observations. These quantities measure the dispersion in the same units as the original data. Absolute measures cannot be used to compare the variation of two or more series or data sets. A measure of absolute dispersion does not, in itself, tell whether the variation is large or small.

## Range

Range is the difference between the largest value and the smallest value in the data set. For ungrouped data, let $X_0$ be the smallest value and $X_n$ the largest value in a data set; then the range (R) is defined as
$R=X_n-X_0$.

For grouped data, the range can be calculated in three different ways:

1. R = Midpoint of highest class – Midpoint of lowest class
2. R = Upper class limit of highest class – Lower class limit of lowest class
3. R = Upper class boundary of highest class – Lower class boundary of lowest class
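The three grouped-data definitions generally give different answers. The sketch below evaluates each of them for a hypothetical frequency table with classes 10–19, 20–29, 30–39, and 40–49; the class boundaries are assumed to lie halfway across the gap between adjacent class limits.

```python
# Hypothetical classes given as (lower class limit, upper class limit).
classes = [(10, 19), (20, 29), (30, 39), (40, 49)]

# Gap between the upper limit of one class and the lower limit of the next.
gap = classes[1][0] - classes[0][1]

lowest, highest = classes[0], classes[-1]


def midpoint(cls):
    """Midpoint of a class given as (lower limit, upper limit)."""
    return (cls[0] + cls[1]) / 2


range_midpoints = midpoint(highest) - midpoint(lowest)
range_limits = highest[1] - lowest[0]
range_boundaries = (highest[1] + gap / 2) - (lowest[0] - gap / 2)

print(range_midpoints)     # 30.0
print(range_limits)        # 39
print(range_boundaries)    # 40.0
```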

## Quartile Deviation (Semi-Interquartile Range)

The difference between the third and first quartiles, $Q_3-Q_1$, is called the interquartile range, and half of this range is called the semi-interquartile range (SIQR) or simply the quartile deviation (QD): $QD=\frac{Q_3-Q_1}{2}$

The quartile deviation is superior to the range because it is not based on all the observations and is therefore not affected by extremely large or small observations. However, it gives no information about the position of the observations lying outside the two quartiles, it is not amenable to mathematical treatment, and it is greatly affected by sampling variability. Although the quartile deviation is not widely used as a measure of dispersion, it is useful in situations in which extreme observations are thought to be unrepresentative or misleading.

Note: For a roughly symmetric distribution, the range “Median ± QD” contains approximately 50% of the data.
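A minimal sketch of the quartile deviation, using `statistics.quantiles` from the Python standard library with its default "exclusive" method. Several quartile conventions exist and they can give slightly different values, so the choice of method here is an assumption; the data are hypothetical.

```python
from statistics import quantiles


def quartile_deviation(values):
    """QD = (Q3 - Q1) / 2, with quartiles from statistics.quantiles."""
    q1, _, q3 = quantiles(values, n=4)
    return (q3 - q1) / 2


data = [2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22]   # hypothetical observations
data_outlier = data[:-1] + [220]                  # largest value made extreme

qd = quartile_deviation(data)
qd_outlier = quartile_deviation(data_outlier)

print(qd, qd_outlier)                             # identical for both data sets
print(max(data) - min(data))                      # range without the outlier
print(max(data_outlier) - min(data_outlier))      # range with the outlier
```

The extreme value leaves the quartile deviation unchanged while the range grows from 20 to 218.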

## Mean Deviation (Average Deviation)

The mean deviation is defined as the arithmetic mean of the absolute deviations of the observations, measured either from the mean or from the median. All deviations are counted as positive to avoid the difficulty arising from the property that the sum of the deviations of observations from their mean is zero.
$MD=\frac{\sum|X-\overline{X}|}{n}\quad$ for ungrouped data for mean
$MD=\frac{\sum f|X-\overline{X}|}{\sum f}\quad$ for grouped data for mean
$MD=\frac{\sum|X-\tilde{X}|}{n}\quad$ for ungrouped data for median
$MD=\frac{\sum f|X-\tilde{X}|}{\sum f}\quad$ for grouped data for median
The mean deviation can be calculated about other measures of central tendency, but it is least when the deviations are taken from the median.

The mean deviation gives more information than the range or the quartile deviation because it is based on all the observed values. It does not give undue weight to occasional large deviations, so it is likely to be useful in situations where such deviations are likely to occur.
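The following sketch computes the mean deviation about the mean and about the median for a small hypothetical data set, and illustrates that the value taken about the median is the smaller of the two. The helper name `mean_deviation` is an assumption made for this example.

```python
from statistics import mean, median


def mean_deviation(values, center):
    """Average of the absolute deviations of the values from `center`."""
    return sum(abs(x - center) for x in values) / len(values)


data = [2, 3, 5, 8, 22]                            # hypothetical observations

md_mean = mean_deviation(data, mean(data))         # about the mean (8)
md_median = mean_deviation(data, median(data))     # about the median (5)

print(md_mean)      # 5.6
print(md_median)    # 5.0
```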

## Variance and Standard Deviation

The variance is an absolute measure of dispersion defined as the mean of the squared deviations of all the observations from their mean. Traditionally, the population variance is denoted by $\sigma^2$ (sigma squared) and the sample variance by $S^2$ or $s^2$.
Symbolically
$\sigma^2=\frac{\sum(X_i-\mu)^2}{N}\quad$ Population Variance for ungrouped data
$S^2=\frac{\sum(X_i-\overline{X})^2}{n}\quad$ Sample Variance for ungrouped data
$\sigma^2=\frac{\sum f(X_i-\mu)^2}{\sum f}\quad$ Population Variance for grouped data
$S^2=\frac{\sum f (X_i-\overline{X})^2}{\sum f}\quad$ Sample Variance for grouped data

The variance of a random variable X is denoted by Var(X). The term variance was introduced by R. A. Fisher (1890-1962) in 1918. The variance is expressed in squared units, and it is often a large number compared with the observations themselves.
Note that there are alternative formulas to compute the variance or the standard deviation.

The positive square root of the variance is called the standard deviation (SD); it expresses the deviation in the same units as the original observations. It is a measure of the average spread about the mean and is symbolically defined as
$\sigma=\sqrt{\frac{\sum(X_i-\mu)^2}{N}}\quad$ Population Standard Deviation for ungrouped data
$S=\sqrt{\frac{\sum(X_i-\overline{X})^2}{n}}\quad$ Sample Standard Deviation for ungrouped data
$\sigma=\sqrt{\frac{\sum f(X_i-\mu)^2}{\sum f}}\quad$ Population Standard Deviation for grouped data
$S=\sqrt{\frac{\sum f (X_i-\overline{X})^2}{\sum f}}\quad$ Sample Standard Deviation for grouped data
The standard deviation is the most useful measure of dispersion; the name Standard Deviation is credited to Karl Pearson (1857-1936).
In some texts the sample variance is defined as $S^2=\frac{\sum (X_i-\overline{X})^2}{n-1}$ on the basis of the argument that knowledge of any $n-1$ deviations determines the remaining deviation, as the sum of the n deviations must be zero. In fact, this is an unbiased estimator of the population variance $\sigma^2$. The standard deviation has a definite mathematical meaning, utilizes all the observed values, and is amenable to mathematical treatment, but it is affected by extreme values.
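To make the difference between the divisors n and n − 1 concrete, the sketch below computes the variance both ways for a hypothetical data set and compares the results with `statistics.pvariance` (divisor n) and `statistics.variance` (divisor n − 1) from the Python standard library. The helper `variance_with_divisor` is an illustrative name, not a library function.

```python
from math import sqrt
from statistics import pvariance, variance


def variance_with_divisor(values, divisor):
    """Sum of squared deviations from the mean, divided by `divisor`."""
    m = sum(values) / len(values)
    return sum((x - m) ** 2 for x in values) / divisor


data = [2, 4, 4, 4, 5, 5, 7, 9]     # hypothetical observations, mean = 5
n = len(data)

var_n = variance_with_divisor(data, n)              # 32 / 8 = 4.0
var_n_minus_1 = variance_with_divisor(data, n - 1)  # 32 / 7
sd_n = sqrt(var_n)                                  # 2.0, same units as data

print(var_n, pvariance(data))            # both use divisor n
print(var_n_minus_1, variance(data))     # both use divisor n - 1
print(sd_n)
```

The n − 1 version is always slightly larger than the n version, and the gap shrinks as the sample size grows.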


## Descriptive Statistics Multivariate Data set

Much of the information contained in data can be assessed by calculating certain summary numbers, known as descriptive statistics, such as the arithmetic mean (a measure of location) and the average of the squared distances of all the numbers from the mean (a measure of spread or variation). Here we discuss descriptive statistics for a multivariate data set.

We shall rely most heavily on descriptive statistics that measure location, variation, and linear association.

## Measure of Location

The arithmetic average of the n measurements $(x_{11}, x_{21}, \cdots ,x_{n1})$ on the first variable (defined in Multivariate Analysis: An Introduction) is

Sample Mean = $\bar{x}_{1}=\frac{1}{n} \sum _{j=1}^{n}x_{j1}$

The sample mean of the $n$ measurements on each of the p variables (there will be p sample means) is

$\bar{x}_{k} =\frac{1}{n} \sum _{j=1}^{n}x_{jk} \mbox{ where } k = 1, 2, \cdots , p$

The measure of spread (variance) for the n measurements on the first variable can be found as
$s_{1}^{2} =\frac{1}{n} \sum _{j=1}^{n}(x_{j1} -\bar{x}_{1} )^{2}$ where $\bar{x}_{1}$ is the sample mean of the $x_{j1}$’s.

The measure of spread (variance) for the n measurements on each of the p variables can be found as

$s_{k}^{2} =\frac{1}{n} \sum _{j=1}^{n}(x_{jk} -\bar{x}_{k} )^{2} \mbox{ where } k=1,2,\cdots ,p$

The sample variance is also written as $s_{k}^{2}=s_{kk}$, and its square root is the sample standard deviation, i.e.

$\sqrt{s_{kk}} =\sqrt{\frac{1}{n} \sum _{j=1}^{n}(x_{jk} -\bar{x}_{k} )^{2}} \mbox{ where } k=1,2,\cdots ,p$
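The column-wise sample means, variances (with divisor n), and standard deviations can be sketched in plain Python as follows. The 4 × 2 data matrix is a small illustrative example with n = 4 items and p = 2 variables; the variable names are assumptions made for this sketch.

```python
# Rows are items (j = 1..n), columns are variables (k = 1..p).
X = [
    [42, 4],
    [52, 5],
    [48, 4],
    [58, 3],
]
n, p = len(X), len(X[0])

# Sample mean of each variable: x_bar_k = (1/n) * sum_j x_jk
means = [sum(row[k] for row in X) / n for k in range(p)]

# Sample variance of each variable with divisor n: s_kk
variances = [sum((row[k] - means[k]) ** 2 for row in X) / n for k in range(p)]

# Sample standard deviation of each variable: sqrt(s_kk)
std_devs = [v ** 0.5 for v in variances]

print(means)        # [50.0, 4.0]
print(variances)    # [34.0, 0.5]
print(std_devs)
```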

## Sample Covariance

Consider n pairs of measurements on Variable 1 and Variable 2
$\left[\begin{array}{c} {x_{11} } \\ {x_{12} } \end{array}\right],\left[\begin{array}{c} {x_{21} } \\ {x_{22} } \end{array}\right],\cdots ,\left[\begin{array}{c} {x_{n1} } \\ {x_{n2} } \end{array}\right]$
That is, $x_{j1}$ and $x_{j2}$ are observed on the jth experimental item $(j=1,2,\cdots ,n)$. A measure of linear association between the measurements of Variable 1 and Variable 2 is provided by the sample covariance
$s_{12} =\frac{1}{n} \sum _{j=1}^{n}(x_{j1} -\bar{x}_{1} )(x_{j2} -\bar{x}_{2} )$
(the average of the products of the deviations from their respective means). In general,

$s_{ik} =\frac{1}{n} \sum _{j=1}^{n}(x_{ji} -\bar{x}_{i} )(x_{jk} -\bar{x}_{k} )$; $i=1,2,\cdots ,p$ and $k=1,2,\cdots ,p$.

It measures the association between the ith and kth variables; when $i=k$ the sample covariance reduces to the sample variance.

Variance is the most commonly used measure of dispersion (variation) in data, and it is directly proportional to the amount of variation or information available in the data.
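As a sketch of the sample covariance with divisor n, the code below builds the full p × p matrix of $s_{ik}$ values for a small illustrative data matrix. The diagonal entries are the sample variances and the matrix is symmetric. The data and the function name `sample_cov` are assumptions made for this example.

```python
# Rows are items (j = 1..n), columns are variables (1..p).
X = [
    [42, 4],
    [52, 5],
    [48, 4],
    [58, 3],
]
n, p = len(X), len(X[0])
means = [sum(row[k] for row in X) / n for k in range(p)]


def sample_cov(i, k):
    """s_ik: average product of deviations of variables i and k (0-based)."""
    return sum((row[i] - means[i]) * (row[k] - means[k]) for row in X) / n


# Full p x p matrix of sample variances and covariances.
S = [[sample_cov(i, k) for k in range(p)] for i in range(p)]

print(S)    # [[34.0, -1.5], [-1.5, 0.5]]
```

The negative off-diagonal entry indicates that larger values of the first variable tend to occur with smaller values of the second in this example.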

## Sample Correlation Coefficient

The sample correlation coefficient for the ith and kth variable is

$r_{ik} =\frac{s_{ik} }{\sqrt{s_{ii} } \sqrt{s_{kk} } } =\frac{\sum _{j=1}^{n}(x_{ji} -\bar{x}_{i} )(x_{jk} -\bar{x}_{k} ) }{\sqrt{\sum _{j=1}^{n}(x_{ji} -\bar{x}_{i} )^{2} } \sqrt{\sum _{j=1}^{n}(x_{jk} -\bar{x}_{k} )^{2} } }$
$\mbox{ where } i=1,2,\cdots ,p \mbox{ and } k=1,2,\cdots ,p$

Note that $r_{ik} =r_{ki}$ for all $i$ and $k$, and $r$ lies between -1 and +1. The magnitude of $r$ measures the strength of the linear association; if $r=0$, there is no linear association between the components. The sign of $r$ indicates the direction of the association.
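A minimal sketch of the sample correlation coefficient, written directly from the sums of products of deviations (the divisor n cancels between numerator and denominator). The two variables are hypothetical and `sample_corr` is an illustrative name.

```python
from math import sqrt


def sample_corr(x, y):
    """Sample correlation between two equally long lists of measurements."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    s_xx = sum((a - x_bar) ** 2 for a in x)
    s_yy = sum((b - y_bar) ** 2 for b in y)
    return s_xy / (sqrt(s_xx) * sqrt(s_yy))


x1 = [42, 52, 48, 58]       # hypothetical measurements on variable 1
x2 = [4, 5, 4, 3]           # hypothetical measurements on variable 2

r12 = sample_corr(x1, x2)
r21 = sample_corr(x2, x1)

print(r12)                                        # about -0.364
print(sample_corr(x1, [-2 * v for v in x1]))      # exact negative linear relation
```

The value near -0.36 indicates a weak negative linear association, while a variable paired with an exact negative multiple of itself gives a correlation of -1 up to rounding.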

## Measure of Dispersion or Variability

A measure of location (an average, or measure of central tendency) is not sufficient to describe the characteristics of a distribution: it represents only the typical value of the data set, and two or more distributions may have exactly the same average even though they differ in other respects. To give a sensible description of data we also need a numerical quantity, called a measure of dispersion, variability, or scatter, that describes the spread of the values in a set of data. There are two types of measures of dispersion or variability:

1. Absolute Measures
2. Relative Measures

A measure of central tendency together with a measure of dispersion gives a more adequate description of data than a measure of location alone, because an average describes only the balancing point of the data set; it does not provide any information about the degree to which the data tend to spread or scatter about the average value. A measure of dispersion therefore indicates how well the measure of central tendency represents the data: the smaller the variability of a given set, the more representative the average is of the data set.

1. Absolute Measures
Absolute measures are defined in such a way that they have the same units (such as meters or grams) as the original measurements. Absolute measures cannot be used to compare the variation or spread of two or more sets of data.
The most common absolute measures of variability are:

• Range
• Semi-Interquartile Range or Quartile Deviation
• Mean Deviation
• Variance
• Standard Deviation

2. Relative Measures
Relative measures have no units because they are ratios, coefficients, or percentages. They are independent of the units of measurement and are useful for comparing data of different natures. The most common relative measures of variability are:

• Coefficient of Variation
• Coefficient of Mean Deviation
• Coefficient of Quartile Deviation
• Coefficient of Standard Deviation

Different terms are used for a measure of dispersion or variability, such as variability, spread, scatter, measure of uncertainty, and deviation.
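To illustrate why a relative measure is needed for comparisons, the sketch below uses the coefficient of variation, taken here in its usual form as the standard deviation divided by the mean and expressed as a percentage. The heights and weights are hypothetical, and the population standard deviation (divisor n) is an assumption made for this example.

```python
from statistics import mean, pstdev


def coefficient_of_variation(values):
    """CV = (standard deviation / mean) * 100, a unit-free percentage."""
    return pstdev(values) / mean(values) * 100


heights_cm = [160, 165, 170, 175, 180]    # hypothetical heights in centimeters
weights_kg = [50, 60, 70, 80, 90]         # hypothetical weights in kilograms

cv_heights = coefficient_of_variation(heights_cm)
cv_weights = coefficient_of_variation(weights_kg)

# Changing the unit (cm -> m) changes the standard deviation but not the CV.
heights_m = [h / 100 for h in heights_cm]
cv_heights_m = coefficient_of_variation(heights_m)

print(cv_heights, cv_weights, cv_heights_m)
```

The standard deviations (in cm and kg) are not directly comparable, but the two percentages are: in this example the weights vary more, relative to their mean, than the heights do.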


# Moments

Measures of central tendency (location) and measures of dispersion (variation) are both useful for describing a data set, but both fail to tell anything about the shape of the distribution. We need other measures, called moments, to identify the shape of the distribution, which is described by its skewness and kurtosis.

The moments about the mean are the means of the deviations from the mean after raising them to integer powers. The rth population moment about the mean, denoted by $\mu_r$, is

$\mu_r=\frac{\sum^{N}_{i=1}(y_i - \mu )^r}{N}$

where $\mu$ is the population mean and r = 1, 2, …

The corresponding sample moment, denoted by $m_r$, is

$m_r=\frac{\sum^{n}_{i=1}(y_i - \bar{y} )^r}{n}$

Note that for r = 1 the first moment is always zero, because the deviations from the mean sum to zero: $m_1=\frac{\sum^{n}_{i=1}(y_i - \bar{y} )^1}{n}=0$.

If r = 2 then the second moment is the variance, i.e. $m_2=\frac{\sum^{n}_{i=1}(y_i - \bar{y} )^2}{n}$

Similarly, the 3rd and 4th moments are

$m_3=\frac{\sum^{n}_{i=1}(y_i - \bar{y} )^3}{n}$

$m_4=\frac{\sum^{n}_{i=1}(y_i - \bar{y} )^4}{n}$

For grouped data with k classes, the rth sample moment about the sample mean $\bar{y}$ is

$m_r=\frac{\sum^{k}_{i=1}f_i(y_i - \bar{y} )^r}{\sum^{k}_{i=1}f_i}$

where $y_i$ is the midpoint and $f_i$ the frequency of the ith class, and $\sum^{k}_{i=1}f_i=n$
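The sample moments about the mean can be sketched with one small function that handles both ungrouped data and grouped data (values with frequencies). The data are hypothetical, and the function name `central_moment` and its optional `freqs` argument are assumptions made for this illustration.

```python
def central_moment(values, r, freqs=None):
    """rth sample moment about the mean; `freqs` gives optional frequencies."""
    if freqs is None:
        freqs = [1] * len(values)           # ungrouped data: each value once
    n = sum(freqs)
    y_bar = sum(f * y for f, y in zip(freqs, values)) / n
    return sum(f * (y - y_bar) ** r for f, y in zip(freqs, values)) / n


data = [2, 4, 4, 4, 5, 5, 7, 9]             # hypothetical ungrouped data

m1 = central_moment(data, 1)                # always 0
m2 = central_moment(data, 2)                # the variance (divisor n)
m3 = central_moment(data, 3)
m4 = central_moment(data, 4)
print(m1, m2, m3, m4)                       # 0.0 4.0 5.25 44.5

# The same data written as distinct values with frequencies.
values, freqs = [2, 4, 5, 7, 9], [1, 3, 2, 1, 1]
print(central_moment(values, 2, freqs))     # 4.0, same as the ungrouped m2
```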

The rth sample moment about any arbitrary origin “a”, denoted by $m'_r$, is
$m'_r = \frac{\sum^{n}_{i=1}(y_i - a)^r}{n} = \frac{\sum^{n}_{i=1}D^r_i}{n}$
where $D_i=(y_i -a)$ and r = 1, 2, ….

therefore
\begin{eqnarray*}
m'_1&=&\frac{\sum^{n}_{i=1}(y_i - a)}{n}=\frac{\sum^{n}_{i=1}D_i}{n}\\
m'_2&=&\frac{\sum^{n}_{i=1}(y_i - a)^2}{n}=\frac{\sum^{n}_{i=1}D_i ^2}{n}\\
m'_3&=&\frac{\sum^{n}_{i=1}(y_i - a)^3}{n}=\frac{\sum^{n}_{i=1}D_i ^3}{n}\\
m'_4&=&\frac{\sum^{n}_{i=1}(y_i - a)^4}{n}=\frac{\sum^{n}_{i=1}D_i ^4}{n}
\end{eqnarray*}

The rth sample moment for grouped data with k classes about any arbitrary origin “a” is

$m'_r=\frac{\sum^{k}_{i=1}f_i(y_i - a)^r}{\sum^{k}_{i=1}f_i} = \frac{\sum f_i D_i ^r}{\sum f_i}$

The moments about the mean are usually called central moments, and the moments about any arbitrary origin “a” are called non-central moments or raw moments.

The moments about the mean can be obtained from the moments about an arbitrary value by using the following relations

\begin{eqnarray*}
m_1&=& m'_1 - (m'_1) = 0 \\
m_2 &=& m'_2 - (m'_1)^2\\
m_3 &=& m'_3 - 3m'_2m'_1 +2(m'_1)^3\\
m_4 &=& m'_4 -4 m'_3m'_1 +6m'_2(m'_1)^2 -3(m'_1)^4
\end{eqnarray*}
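These conversion relations can be checked numerically. The sketch below computes the raw moments of a hypothetical data set about an arbitrarily chosen origin a = 4, converts them to central moments with the relations above, and compares the results with central moments computed directly about the mean. The function name `raw_moment` is an assumption made for this example.

```python
def raw_moment(values, r, a=0):
    """rth sample moment about the origin `a`: mean of (y - a)**r."""
    return sum((y - a) ** r for y in values) / len(values)


data = [2, 4, 4, 4, 5, 5, 7, 9]     # hypothetical observations
a = 4                               # arbitrary origin chosen for illustration

# Raw moments m'_1 .. m'_4 about a.
r1, r2, r3, r4 = (raw_moment(data, r, a) for r in (1, 2, 3, 4))

# Central moments obtained from the raw moments.
m2 = r2 - r1 ** 2
m3 = r3 - 3 * r2 * r1 + 2 * r1 ** 3
m4 = r4 - 4 * r3 * r1 + 6 * r2 * r1 ** 2 - 3 * r1 ** 4

# Central moments computed directly about the mean, for comparison.
y_bar = sum(data) / len(data)
direct = [raw_moment(data, r, y_bar) for r in (2, 3, 4)]

print(m2, m3, m4)    # 4.0 5.25 44.5
print(direct)        # [4.0, 5.25, 44.5]
```

Choosing an origin close to the mean keeps the deviations small, which is the practical reason for working with an arbitrary origin in hand calculations.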

If the variable y assumes n values $y_1, y_2, \cdots, y_n$, then the rth moment about zero is obtained by taking a = 0, so the moment about the arbitrary value becomes
$m'_r = \frac{\sum y^r}{n}$

where r = 1, 2, 3, ….

therefore
\begin{eqnarray*}
m'_1&=&\frac{\sum y^1}{n}\\
m'_2 &=&\frac{\sum y^2}{n}\\
m'_3 &=&\frac{\sum y^3}{n}\\
m'_4 &=&\frac{\sum y^4}{n}
\end{eqnarray*}

The third moment is used to define the skewness of a distribution
${\rm Skewness} = \frac{\sum^{n}_{i=1} (y_i - \bar{y})^3}{ns^3}$
where $s$ is the standard deviation.

If the distribution is symmetric then the skewness will be zero. Skewness will be positive if there is a long tail in the positive direction and negative if there is a long tail in the negative direction.

The fourth moment is used to define the kurtosis of a distribution

${\rm Kurtosis} = \frac{\sum^{n}_{i=1} (y_i -\bar{y})^4}{ns^4}$
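A minimal sketch of both shape measures, written as ratios of central moments. It assumes that the standard deviation s in the formulas uses the divisor n, so that $s^2 = m_2$; with the n − 1 divisor the values would differ slightly. The data and function names are hypothetical.

```python
def central_moment(values, r):
    """rth sample moment about the mean (divisor n)."""
    y_bar = sum(values) / len(values)
    return sum((y - y_bar) ** r for y in values) / len(values)


def skewness(values):
    """Third central moment divided by s**3, with s**2 = m_2 (assumption)."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5


def kurtosis(values):
    """Fourth central moment divided by s**4, with s**2 = m_2 (assumption)."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2


right_tailed = [2, 4, 4, 4, 5, 5, 7, 9]    # longer tail toward large values
symmetric = [1, 2, 3, 4, 5]                # symmetric about its mean

print(skewness(right_tailed))    # positive
print(skewness(symmetric))       # 0.0
print(kurtosis(right_tailed))
```

The first data set has its longer tail in the positive direction and so gives a positive skewness, while the symmetric data set gives zero.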