Basic Statistics and Data Analysis

Descriptive Statistics Multivariate Data set

Much of the information contained in the data can be assessed by calculating certain summary numbers, known as descriptive statistics, such as the arithmetic mean (a measure of location) and the average of the squared distances of all the numbers from the mean (a measure of spread or variation). Here we discuss descriptive statistics for a multivariate data set.

We shall rely most heavily on descriptive statistics that measure location, variation, and linear association.

Measure of Location

The arithmetic average of the n measurements $(x_{11}, x_{21}, \cdots ,x_{n1})$ on the first variable (defined in Multivariate Analysis: An Introduction) is

Sample Mean = $\bar{x}_{1}=\frac{1}{n} \sum _{j=1}^{n}x_{j1}$

The sample mean for the $n$ measurements on each of the p variables (there will be p sample means) is

$\bar{x}_{k} =\frac{1}{n} \sum _{j=1}^{n}x_{jk} \mbox{ where } k = 1, 2, \cdots , p$
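As a minimal sketch of the formula above, the p sample means are simply the column averages of an n × p data table. The data values and the helper name `sample_means` are illustrative assumptions, not part of the original text.

```python
# Illustrative data (assumed): n = 4 items (rows), p = 2 variables (columns).
X = [
    [42.0, 4.0],
    [52.0, 5.0],
    [48.0, 4.0],
    [58.0, 3.0],
]


def sample_means(data):
    """Return the p sample means, one per column of the n x p data."""
    n = len(data)
    p = len(data[0])
    return [sum(row[k] for row in data) / n for k in range(p)]


means = sample_means(X)
print(means)  # [50.0, 4.0]
```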

The measure of spread (variance) for the n measurements on the first variable is
$s_{1}^{2} =\frac{1}{n} \sum _{j=1}^{n}(x_{j1} -\bar{x}_{1} )^{2}$ where $\bar{x}_{1}$ is the sample mean of the $x_{j1}$’s.

The measure of spread (variance) for the n measurements on each of the p variables is

$s_{k}^{2} =\frac{1}{n} \sum _{j=1}^{n}(x_{jk} -\bar{x}_{k} )^{2} \mbox{ where } k=1,2,\cdots ,p$

The square root of the sample variance, $\sqrt{s_{kk}}$, is the sample standard deviation, where the sample variance is written in double-subscript notation as

$s_{k}^{2} =s_{kk} =\frac{1}{n} \sum _{j=1}^{n}(x_{jk} -\bar{x}_{k} )^{2} \mbox{ where } k=1,2,\cdots ,p$
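The variance and standard deviation of each variable can be sketched in the same way. The data and the function name `sample_variances` are assumptions made for illustration; the divisor is n, as in the formulas above.

```python
import math

# Illustrative data (assumed): n = 4 items (rows), p = 2 variables (columns).
X = [
    [42.0, 4.0],
    [52.0, 5.0],
    [48.0, 4.0],
    [58.0, 3.0],
]


def sample_variances(data):
    """Return (1/n) * sum_j (x_jk - mean_k)^2 for each column k."""
    n = len(data)
    p = len(data[0])
    means = [sum(row[k] for row in data) / n for k in range(p)]
    return [
        sum((row[k] - means[k]) ** 2 for row in data) / n
        for k in range(p)
    ]


variances = sample_variances(X)
std_devs = [math.sqrt(v) for v in variances]  # square roots of the variances
print(variances)  # [34.0, 0.5]
print(std_devs)
```

In the Python standard library, `statistics.pvariance` uses the same divisor n, whereas `statistics.variance` divides by n - 1.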

Sample Covariance

Consider n pairs of measurements on Variable 1 and Variable 2
$\left[\begin{array}{c} {x_{11} } \\ {x_{12} } \end{array}\right],\left[\begin{array}{c} {x_{21} } \\ {x_{22} } \end{array}\right],\cdots ,\left[\begin{array}{c} {x_{n1} } \\ {x_{n2} } \end{array}\right]$
That is, $x_{j1}$ and $x_{j2}$ are observed on the jth experimental item $(j=1,2,\cdots ,n)$. A measure of linear association between the measurements of $V_1$ and $V_2$ is provided by the sample covariance
$s_{12} =\frac{1}{n} \sum _{j=1}^{n}(x_{j1} -\bar{x}_{1} )(x_{j2} -\bar{x}_{2} )$
(the average of the products of the deviations from their respective means). In general,

$s_{ik} =\frac{1}{n} \sum _{j=1}^{n}(x_{ji} -\bar{x}_{i} )(x_{jk} -\bar{x}_{k} )$, where $i=1,2,\cdots ,p$ and $k=1,2,\cdots ,p$.

It measures the association between the ith and kth variables. When $i=k$, the sample covariance reduces to the sample variance.
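A short sketch of the full set of sample covariances, arranged as a p × p matrix, may help here. The data and the name `covariance_matrix` are illustrative assumptions; the divisor is n, following the text.

```python
# Illustrative data (assumed): n = 4 items (rows), p = 2 variables (columns).
X = [
    [42.0, 4.0],
    [52.0, 5.0],
    [48.0, 4.0],
    [58.0, 3.0],
]


def covariance_matrix(data):
    """Return the p x p matrix of sample covariances s_ik with divisor n."""
    n = len(data)
    p = len(data[0])
    means = [sum(row[k] for row in data) / n for k in range(p)]
    return [
        [
            sum((row[i] - means[i]) * (row[k] - means[k]) for row in data) / n
            for k in range(p)
        ]
        for i in range(p)
    ]


S = covariance_matrix(X)
print(S)  # [[34.0, -1.5], [-1.5, 0.5]]
```

The diagonal entries are the sample variances, and the matrix is symmetric because $s_{ik} = s_{ki}$.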

Variance is the most commonly used measure of dispersion (variation) in the data: the larger the variance, the greater the amount of variation, or information, available in the data.

Sample Correlation Coefficient

The sample correlation coefficient for the ith and kth variable is

$r_{ik} =\frac{s_{ik} }{\sqrt{s_{ii} } \sqrt{s_{kk} } } =\frac{\sum _{j=1}^{n}(x_{ji} -\bar{x}_{i} )(x_{jk} -\bar{x}_{k} ) }{\sqrt{\sum _{j=1}^{n}(x_{ji} -\bar{x}_{i} )^{2} } \sqrt{\sum _{j=1}^{n}(x_{jk} -\bar{x}_{k} )^{2} } }$
$\mbox{ where } i=1,2,\cdots ,p \mbox{ and } k=1,2,\cdots ,p$

Note that $r_{ik} =r_{ki}$ for all $i$ and $k$, and that $r$ lies between -1 and +1. $r$ measures the strength of the linear association: $r=0$ indicates a lack of linear association between the components, and the sign of $r$ indicates the direction of the association.
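The correlation formula can be sketched directly from the sums of squares and cross-products. The data and the function name `correlation` are assumptions for illustration only.

```python
import math

# Illustrative data (assumed): n = 4 items (rows), p = 2 variables (columns).
X = [
    [42.0, 4.0],
    [52.0, 5.0],
    [48.0, 4.0],
    [58.0, 3.0],
]


def correlation(data, i, k):
    """Sample correlation between columns i and k (0-based indices)."""
    n = len(data)
    mean_i = sum(row[i] for row in data) / n
    mean_k = sum(row[k] for row in data) / n
    cross = sum((row[i] - mean_i) * (row[k] - mean_k) for row in data)
    ss_i = sum((row[i] - mean_i) ** 2 for row in data)
    ss_k = sum((row[k] - mean_k) ** 2 for row in data)
    # The 1/n factors cancel between numerator and denominator.
    return cross / (math.sqrt(ss_i) * math.sqrt(ss_k))


r12 = correlation(X, 0, 1)
r21 = correlation(X, 1, 0)
print(round(r12, 4))  # -0.3638
```

Here the negative sign indicates that larger values of the first variable tend to go with smaller values of the second.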

Measure of Dispersion or Variability

A measure of location (an average, or measure of central tendency) represents the typical value of a data set, but it is not sufficient to describe the characteristics of a distribution: two or more distributions may have exactly the same average and still differ in other respects. To give a sensible description of data, we also need a numerical quantity, called a measure of dispersion (variability or scatter), that describes the spread of the values in a set of data. There are two types of measures of dispersion or variability:

1. Absolute Measures
2. Relative Measures

A measure of central tendency together with a measure of dispersion gives a more adequate description of data than a measure of location alone, because an average only describes the balancing point of the data set; it provides no information about the degree to which the data tend to spread or scatter about the average value. A measure of dispersion therefore indicates how well the measure of central tendency represents the data. The smaller the variability of a given set, the more representative the average is of the data set.

1. Absolute Measures
Absolute measures are defined in such a way that they have the same units (meters, grams, etc.) as the original measurements. Because they carry units, absolute measures cannot be used directly to compare the variation of two or more data sets that are measured in different units or that have very different averages.
The most common absolute measures of variability are:

• Range
• Semi-Interquartile Range or Quartile Deviation
• Mean Deviation
• Variance
• Standard Deviation

2. Relative Measures
The relative measures have no units as these are ratios, coefficients, or percentages. Relative measures are independent of units of measurements and are useful for comparing data of different natures.

• Coefficient of Variation
• Coefficient of Mean Deviation
• Coefficient of Quartile Deviation
• Coefficient of Standard Deviation

Several terms are used for a measure of dispersion, such as variability, spread, scatter, measure of uncertainty, and deviation.
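The measures listed above can be sketched with Python's standard `statistics` module. The data are illustrative, and the formulas for the relative measures are the usual textbook definitions, stated here as assumptions because the text only names them. The quartiles use the default convention of `statistics.quantiles`; other conventions give slightly different values.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # illustrative values (assumed)

mean = statistics.mean(data)
q1, _, q3 = statistics.quantiles(data, n=4)  # quartile convention is an assumption

# Absolute measures: expressed in the units of the data.
data_range = max(data) - min(data)
quartile_deviation = (q3 - q1) / 2
mean_deviation = sum(abs(y - mean) for y in data) / len(data)
variance = statistics.pvariance(data)  # divisor n
std_dev = statistics.pstdev(data)

# Relative measures: unit-free ratios (assumed textbook definitions).
coeff_variation = 100 * std_dev / mean  # in percent
coeff_mean_deviation = mean_deviation / mean
coeff_quartile_deviation = (q3 - q1) / (q3 + q1)
coeff_std_deviation = std_dev / mean

print(data_range, quartile_deviation, mean_deviation, variance, std_dev)
print(coeff_variation, coeff_mean_deviation,
      coeff_quartile_deviation, coeff_std_deviation)
```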


Moments

Measures of central tendency (location) and measures of dispersion (variation) are both useful for describing a data set, but neither tells us anything about the shape of the distribution. For that we need further measures, called moments, which are used to identify the shape of the distribution in terms of its skewness and kurtosis.

The moments about the mean are the means of the deviations from the mean after raising them to integer powers. The rth population moment about the mean, denoted by $\mu_r$, is

$\mu_r=\frac{\sum^{N}_{i=1}(y_i - \mu )^r}{N}$

where $\mu$ is the population mean and r=1, 2, …

The corresponding sample moment, denoted by $m_r$, is

$m_r=\frac{\sum^{n}_{i=1}(y_i - \bar{y} )^r}{n}$

Note that for r=1 the first moment about the mean is always zero, because the deviations from the mean sum to zero: $m_1=\frac{\sum^{n}_{i=1}(y_i - \bar{y} )^1}{n}=0$.

For r=2 the second moment is the variance, i.e. $m_2=\frac{\sum^{n}_{i=1}(y_i - \bar{y} )^2}{n}$

Similarly, the 3rd and 4th moments are

$m_3=\frac{\sum^{n}_{i=1}(y_i - \bar{y} )^3}{n}$

$m_4=\frac{\sum^{n}_{i=1}(y_i - \bar{y} )^4}{n}$

For grouped data the rth sample moment about the sample mean $\bar{y}$ is

$m_r=\frac{\sum^{k}_{i=1}f_i(y_i - \bar{y} )^r}{\sum^{k}_{i=1}f_i}$

where $k$ is the number of classes, $y_i$ is the midpoint of the ith class, $f_i$ is its frequency, and $\sum^{k}_{i=1}f_i=n$.
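The sample moments about the mean, for raw and for grouped data, can be sketched as follows. The function names and data are illustrative assumptions; the grouped example uses the same eight values collapsed into distinct values with frequencies, so both calls should agree.

```python
def central_moment(values, r):
    """rth sample moment about the mean, with divisor n."""
    n = len(values)
    mean = sum(values) / n
    return sum((y - mean) ** r for y in values) / n


def grouped_central_moment(midpoints, freqs, r):
    """rth sample moment about the mean for grouped data."""
    total = sum(freqs)
    mean = sum(f * y for y, f in zip(midpoints, freqs)) / total
    return sum(f * (y - mean) ** r for y, f in zip(midpoints, freqs)) / total


data = [2, 4, 4, 4, 5, 5, 7, 9]  # illustrative values (assumed)
moments = [central_moment(data, r) for r in (1, 2, 3, 4)]
print(moments)  # [0.0, 4.0, 5.25, 44.5]

# The same data written as distinct values with their frequencies.
grouped = [
    grouped_central_moment([2, 4, 5, 7, 9], [1, 3, 2, 1, 1], r)
    for r in (1, 2, 3, 4)
]
print(grouped)  # [0.0, 4.0, 5.25, 44.5]
```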

The rth sample moment about any arbitrary origin “a”, denoted by $m'_r$, is
$m'_r = \frac{\sum^{n}_{i=1}(y_i - a)^r}{n} = \frac{\sum^{n}_{i=1}D^r_i}{n}$
where $D_i=(y_i -a)$ and r = 1, 2, ….

therefore
\begin{eqnarray*}
m'_1&=&\frac{\sum^{n}_{i=1}(y_i - a)}{n}=\frac{\sum^{n}_{i=1}D_i}{n}\\
m'_2&=&\frac{\sum^{n}_{i=1}(y_i - a)^2}{n}=\frac{\sum^{n}_{i=1}D_i ^2}{n}\\
m'_3&=&\frac{\sum^{n}_{i=1}(y_i - a)^3}{n}=\frac{\sum^{n}_{i=1}D_i ^3}{n}\\
m'_4&=&\frac{\sum^{n}_{i=1}(y_i - a)^4}{n}=\frac{\sum^{n}_{i=1}D_i ^4}{n}
\end{eqnarray*}

The rth sample moment for grouped data about any arbitrary origin “a” is

$m'_r=\frac{\sum^{k}_{i=1}f_i(y_i - a)^r}{\sum^{k}_{i=1}f_i} = \frac{\sum f_i D_i ^r}{\sum f_i}$

where the sums run over the $k$ classes and $D_i=(y_i -a)$.

The moments about the mean are usually called central moments, and the moments about any arbitrary origin “a” are called non-central moments or raw moments.

The moments about the mean can be calculated from the moments about an arbitrary value using the following relations:

\begin{eqnarray*}
m_1&=& m'_1 - (m'_1) = 0 \\
m_2 &=& m'_2 - (m'_1)^2\\
m_3 &=& m'_3 - 3m'_2m'_1 +2(m'_1)^3\\
m_4 &=& m'_4 -4 m'_3m'_1 +6m'_2(m'_1)^2 -3(m'_1)^4
\end{eqnarray*}
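These relations can be checked numerically. The sketch below computes the raw moments about an arbitrarily chosen origin a = 4 (an assumption, as are the data and the name `raw_moment`) and converts them to moments about the mean.

```python
def raw_moment(values, r, a=0.0):
    """rth sample moment about the arbitrary origin a, with divisor n."""
    n = len(values)
    return sum((y - a) ** r for y in values) / n


data = [2, 4, 4, 4, 5, 5, 7, 9]  # illustrative values (assumed)
a = 4.0                          # arbitrary origin (assumed)

m1p, m2p, m3p, m4p = (raw_moment(data, r, a) for r in (1, 2, 3, 4))

# Convert raw moments to moments about the mean.
m2 = m2p - m1p ** 2
m3 = m3p - 3 * m2p * m1p + 2 * m1p ** 3
m4 = m4p - 4 * m3p * m1p + 6 * m2p * m1p ** 2 - 3 * m1p ** 4

print(m1p, m2p, m3p, m4p)  # 1.0 5.0 18.25 90.5
print(m2, m3, m4)          # 4.0 5.25 44.5
```

The converted values agree with the moments computed directly about the mean (5) of the same data, whichever origin is chosen.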

If the variable y assumes n values $y_1, y_2, \cdots, y_n$, then the rth moment about zero is obtained by taking a=0, so the moment about the arbitrary value becomes
$m'_r = \frac{\sum y^r}{n}$

where r = 1, 2, 3, ….

therefore
\begin{eqnarray*}
m'_1&=&\frac{\sum y^1}{n}\\
m'_2 &=&\frac{\sum y^2}{n}\\
m'_3 &=&\frac{\sum y^3}{n}\\
m'_4 &=&\frac{\sum y^4}{n}
\end{eqnarray*}

The third moment is used to define the skewness of a distribution
${\rm Skewness} = \frac{\sum^{n}_{i=1} (y_i - \bar{y})^3}{ns^3}$

where $s$ is the standard deviation. If the distribution is symmetric, the skewness will be zero. Skewness will be positive if there is a long tail in the positive direction, and negative if there is a long tail in the negative direction.

The fourth moment is used to define the kurtosis of a distribution

${\rm Kurtosis} = \frac{\sum^{n}_{i=1} (y_i -\bar{y})^4}{ns^4}$
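Both quantities are ratios of moments about the mean, which the following sketch makes explicit. It assumes that $s$ is computed with divisor n, so that $s^2$ equals the second moment; the data and function name are illustrative.

```python
def skewness_and_kurtosis(values):
    """Return (skewness, kurtosis) as moment ratios, using divisor n."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((y - mean) ** 2 for y in values) / n
    m3 = sum((y - mean) ** 3 for y in values) / n
    m4 = sum((y - mean) ** 4 for y in values) / n
    s = m2 ** 0.5  # standard deviation with divisor n (assumption)
    return m3 / s ** 3, m4 / s ** 4


skew, kurt = skewness_and_kurtosis([2, 4, 4, 4, 5, 5, 7, 9])
print(skew, kurt)  # 0.65625 2.78125
```

The positive skewness reflects the longer tail toward the larger values (7 and 9) in this small data set.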

Skewness

Skewness is the degree of asymmetry, or departure from symmetry, of the distribution of a real-valued random variable.

Positive Skewed
If the frequency curve of a distribution has a longer tail to the right of the central maximum than to the left, the distribution is said to be skewed to the right or to have positive skewness. In a positively skewed distribution, the mean is typically greater than the median and the median is greater than the mode, i.e. Mean > Median > Mode.

Negative Skewed
If the frequency curve has a longer tail to the left of the central maximum than to the right, the distribution is said to be skewed to the left or to have negative skewness. In a negatively skewed distribution, the mode is typically greater than the median and the median is greater than the mean, i.e. Mode > Median > Mean.

In a symmetrical distribution the mean, median and mode coincide. In a skewed distribution these values are pulled apart.

Pearson’s Coefficient of Skewness
Karl Pearson (1857-1936) introduced a coefficient of skewness to measure the degree of skewness of a distribution or curve. It is denoted by $S_k$ and defined by

\begin{eqnarray*}
S_k &=& \frac{\mbox{Mean} - \mbox{Mode}}{\mbox{Standard Deviation}}\\
S_k &=& \frac{3(\mbox{Mean} - \mbox{Median})}{\mbox{Standard Deviation}}
\end{eqnarray*}
The second form is used when the mode is not well defined; it relies on the empirical relation Mean - Mode ≈ 3(Mean - Median) for moderately skewed distributions. Usually this coefficient varies between -3 (for negative skewness) and +3 (for positive skewness), and the sign indicates the direction of skewness.

Bowley’s Coefficient of Skewness or Quartile Coefficient of Skewness
Arthur Lyon Bowley (1869-1957) proposed a measure of skewness based on the median and the two quartiles.

$S_k=\frac{Q_1+Q_3-2\,\mbox{Median}}{Q_3 - Q_1}$
Its values lie between -1 and +1.
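Pearson's and Bowley's coefficients can be sketched with Python's standard `statistics` module. The data are illustrative, and the quartiles follow the default convention of `statistics.quantiles`, which is an assumption; other quartile conventions change Bowley's value slightly.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # illustrative values (assumed)

mean = statistics.mean(data)
median = statistics.median(data)
mode = statistics.mode(data)
std_dev = statistics.pstdev(data)            # divisor n
q1, _, q3 = statistics.quantiles(data, n=4)  # quartile convention is an assumption

# Pearson's coefficients of skewness (mode-based and median-based forms).
pearson_mode = (mean - mode) / std_dev
pearson_median = 3 * (mean - median) / std_dev

# Bowley's quartile coefficient of skewness.
bowley = (q1 + q3 - 2 * median) / (q3 - q1)

print(pearson_mode, pearson_median)  # 0.5 0.75
print(bowley)
```

All three values are positive here, consistent with Mean (5) > Median (4.5) > Mode (4).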

Moment Coefficient of Skewness
This measure of skewness is the third moment about the mean expressed in standard units (the moment ratio), and is given by

$S_k=\frac{\mu_3}{\sigma^3}$
For most distributions met in practice its values lie between -2 and +2, although it is not bounded in general.

If $S_k$ is greater than zero, the distribution or curve is said to be positively skewed. If $S_k$ is less than zero, the distribution or curve is said to be negatively skewed. If $S_k$ is zero, the distribution or curve is said to be symmetrical.

The skewness of the distribution of a real-valued random variable can easily be seen by drawing a histogram or frequency curve.

When the skewness is very extreme, the distribution is called J-shaped.

Measure of Kurtosis

Kurtosis is a measure of the peakedness of a distribution relative to the normal distribution. A distribution having a relatively high peak is called leptokurtic. A distribution which is flat-topped is called platykurtic. The normal distribution, which is neither very peaked nor very flat-topped, is called mesokurtic. The histogram is an effective graphical technique for showing both the skewness and kurtosis of a data set.

Data sets with high kurtosis tend to have a distinct peak near the mean, decline rather rapidly, and have heavy tails. Data sets with low kurtosis tend to have a flat top near the mean rather than a sharp peak.

The moment ratio and the percentile coefficient of kurtosis are used to measure kurtosis:

Moment Coefficient of Kurtosis = $b_2 = \frac{m_4}{S^4} = \frac{m_4}{m^{2}_{2}}$
where $S^2 = m_2$ is the variance.

Percentile Coefficient of Kurtosis = $k=\frac{Q.D}{P_{90}-P_{10}}$
where Q.D = $\frac{1}{2}(Q_3 - Q_1)$ is the semi-interquartile range. For the normal distribution this has the value 0.263.

A normal random variable has a moment coefficient of kurtosis of 3 irrespective of its mean or standard deviation. If a random variable’s kurtosis is greater than 3, it is said to be leptokurtic; if its kurtosis is less than 3, it is said to be platykurtic.
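As a rough check of the two coefficients, the sketch below computes the moment coefficient for a small illustrative data set and evaluates the percentile coefficient at the exact quantiles of the standard normal distribution, obtained from `statistics.NormalDist`. The function names and data are assumptions made for illustration.

```python
from statistics import NormalDist


def moment_kurtosis(values):
    """Moment coefficient of kurtosis b2 = m4 / m2^2, using divisor n."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((y - mean) ** 2 for y in values) / n
    m4 = sum((y - mean) ** 4 for y in values) / n
    return m4 / m2 ** 2


def percentile_kurtosis(q1, q3, p10, p90):
    """Percentile coefficient: semi-interquartile range over P90 - P10."""
    return 0.5 * (q3 - q1) / (p90 - p10)


z = NormalDist()  # standard normal distribution
k_normal = percentile_kurtosis(
    z.inv_cdf(0.25), z.inv_cdf(0.75), z.inv_cdf(0.10), z.inv_cdf(0.90)
)
print(round(k_normal, 3))  # 0.263

b2 = moment_kurtosis([2, 4, 4, 4, 5, 5, 7, 9])  # illustrative values (assumed)
print(b2)  # 2.78125
```

Since 2.78125 is below 3, this small data set would be classed as platykurtic by the criterion above.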