# Basic Statistics and Data Analysis

## Median Measure of Central Tendency

The median is the middle value of a data set when all of its values (observations) are arranged in ascending or descending order of magnitude. It is a measure of central tendency that divides the data set into two halves: 50% of the observations lie below the median and 50% lie above it. If the data set has an odd number of observations (data points), the median is the single middle value of the sorted data.

Example: Consider the data set 5, 9, 8, 4, 3, 1, 0, 8, 5, 3, 5, 6, 3.
To find the median of this data set, first sort it (in ascending or descending order), that is,
0, 1, 3, 3, 3, 4, 5, 5, 5, 6, 8, 8, 9. The middle value of the sorted data is 5, which is the median of the given data set.

When the number of observations in a data set is even, the median is the average of the two middle values in the sorted data.

Example: Consider the data set 5, 9, 8, 4, 3, 1, 0, 8, 5, 3, 5, 6, 3, 2.
To find the median, first sort the data and then locate the two middle values, that is,
0, 1, 2, 3, 3, 3, 4, 5, 5, 5, 6, 8, 8, 9. The two middle values are 4 and 5, so the median is the average of these two values, i.e. 4.5 in this case.

The median is less affected by extreme values in the data set than the mean, so it is the preferred measure of central tendency when the data set is skewed or not symmetrical.
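The robustness of the median can be illustrated with a small sketch. It reuses the 13 observations from the first example and appends one extreme value; the value 500 is a hypothetical outlier chosen for illustration and is not part of the original data.

```python
from statistics import mean, median

# First example data set from the text (13 observations).
data = [5, 9, 8, 4, 3, 1, 0, 8, 5, 3, 5, 6, 3]

# Hypothetical outlier (an assumption for illustration, not in the source data).
data_with_outlier = data + [500]

mean_before, median_before = mean(data), median(data)
mean_after, median_after = mean(data_with_outlier), median(data_with_outlier)

print(f"without outlier: mean={mean_before:.2f}, median={median_before}")
print(f"with outlier:    mean={mean_after:.2f}, median={median_after}")
```

In this sketch the mean moves from about 4.62 to 40, while the median stays at 5.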

For a large data set it is difficult to locate the median by inspecting the sorted data, so it helps to compute its position with a formula. For an odd number of observations (here $n=13$, as in the first example), the position of the median is
\begin{aligned} \mbox{Median} &=\left(\frac{n+1}{2}\right)\mbox{th value}\\ &=\left(\frac{13+1}{2}\right)\mbox{th value}\\ &=\left(\frac{14}{2}\right)\mbox{th value}=7\mbox{th value} \end{aligned}

The 7th value of the sorted data, which is 5, is the median of the given data.

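The median formula for an even number of observations (here $n=14$, as in the second example) is
\begin{aligned} \mbox{Median}&=\frac{1}{2}\left(\frac{n}{2}\mbox{th value} + \left(\frac{n}{2}+1\right)\mbox{th value}\right)\\ &=\frac{1}{2}\left(\frac{14}{2}\mbox{th value} + \left(\frac{14}{2}+1\right)\mbox{th value}\right)\\ &=\frac{1}{2}(7\mbox{th value} + 8\mbox{th value})\\ &=\frac{1}{2}(4 + 5)= 4.5 \end{aligned}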
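The two positional formulas can be sketched in a few lines of Python. The function name `median_by_position` is a hypothetical helper introduced here for illustration; it converts the 1-based positions used in the formulas to Python's 0-based indices.

```python
def median_by_position(values):
    """Median via the positional formulas: the (n+1)/2-th value for odd n,
    the mean of the n/2-th and (n/2 + 1)-th values for even n."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty data set is undefined")
    if n % 2 == 1:
        position = (n + 1) // 2          # 1-based position of the middle value
        return ordered[position - 1]     # convert to a 0-based index
    lower = ordered[n // 2 - 1]          # the n/2-th value
    upper = ordered[n // 2]              # the (n/2 + 1)-th value
    return (lower + upper) / 2


odd_data = [5, 9, 8, 4, 3, 1, 0, 8, 5, 3, 5, 6, 3]   # n = 13
even_data = odd_data + [2]                           # n = 14

print(median_by_position(odd_data))   # 5
print(median_by_position(even_data))  # 4.5
```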

Note that the median cannot be found for nominal (unordered categorical) data, because its values cannot be sorted.

## Mode Measure of Central Tendency

The mode is the most frequent observation in a data set, i.e. the value that appears most often. A data set may have more than one mode, or no mode at all. The mode is usually used for categorical data (data on a nominal or ordinal scale), but this is not a requirement: it can also be used for interval- and ratio-scale data, provided that some values repeat or the data can be classified into groups. If no value repeats, the mode of the data set does not exist or may not be meaningful. A data set with more than one mode is called multimodal.
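The three cases (one mode, several modes, no mode) can be sketched with the standard library's `statistics.multimode` (Python 3.8 or later). The helper `describe_modes` and the small data sets are hypothetical, introduced only for illustration; returning an empty list when no value repeats is a convention chosen to match the text, not the behaviour of `multimode` itself.

```python
from statistics import multimode


def describe_modes(values):
    """Return the list of modes, or an empty list when no value repeats."""
    values = list(values)
    # multimode() returns every value when each occurs exactly once,
    # so "nothing repeats" is treated here as "no mode".
    if len(set(values)) == len(values):
        return []
    return multimode(values)


print(describe_modes(["A", "B", "B", "C"]))     # one mode
print(describe_modes([2, 3, 3, 5, 7, 7, 9]))    # two modes (multimodal)
print(describe_modes([3, 4, 7, 11]))            # no repeated value
```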

Example 1: Consider the following data set showing the weights (in kg) of children at the age of 10 years: 33, 30, 23, 23, 32, 21, 23, 30, 30, 22, 25, 33, 23, 23, 25. We can find the mode by tabulating the data in a frequency distribution table, whose first column is the weight of the child and whose second column is the number of times that weight appears in the data, i.e. the frequency of each weight in the first column.

| Weight of 10-year-old child (kg) | Frequency |
|---|---|
| 21 | 1 |
| 22 | 1 |
| 23 | 5 |
| 25 | 2 |
| 30 | 3 |
| 32 | 1 |
| 33 | 2 |
| Total | 15 |

From the frequency distribution table above we can easily find the most frequently occurring observation (data point), which is the mode of the data set. The mode of the given data set is therefore 23, meaning that 23 kg is the most common weight among these 10-year-old children. Note that a frequency distribution table is not necessary for finding the mode, but it helps to find the mode quickly, and it can also be used in further calculations such as the percentage and cumulative percentage of each weight group.
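A frequency distribution of this kind can be built with `collections.Counter`. The following is a minimal sketch using the weights from Example 1; the column layout and variable names are choices made here for illustration.

```python
from collections import Counter

# Weights (kg) of 10-year-old children from Example 1.
weights = [33, 30, 23, 23, 32, 21, 23, 30, 30, 22, 25, 33, 23, 23, 25]

frequency = Counter(weights)
total = sum(frequency.values())

print("Weight  Frequency  Percent  Cumulative %")
cumulative = 0
for weight in sorted(frequency):
    count = frequency[weight]
    cumulative += count
    print(f"{weight:>6}  {count:>9}  {100 * count / total:>7.1f}  "
          f"{100 * cumulative / total:>12.1f}")

# most_common(1) returns the (value, count) pair with the highest count.
mode_weight, mode_count = frequency.most_common(1)[0]
print(f"mode = {mode_weight} kg (appears {mode_count} times)")
```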

Example 2: Suppose we record the gender of each person in a group, with M standing for male and F for female. The recorded sequence is: F, F, M, F, F, M, M, M, M, F, M, F, M, F, M, M, M, F, F, M. The frequency distribution table of gender is

| Gender | Frequency |
|---|---|
| Male | 11 |
| Female | 9 |
| Total | 20 |

The mode of the gender data is male, showing that the majority of the people in this data set are male.

The mode can be found by simply sorting the data in ascending or descending order, so that equal values sit next to each other. It can also be found by counting the values without sorting, especially when the data set is small, although it may be difficult to keep track of how many times each observation occurs. Note that the mode is not affected by extreme values (outliers or influential observations).

The mode is also a measure of central tendency, but it may not reflect the center of the data very well. For example, the mean of the data set in Example 1 is 26.4 kg, while the mode is 23 kg.

One should use the mode as a measure of central tendency when the data points are expected to repeat or to fall into classes. For example, in a production process each product can be classified as defective or non-defective. Similarly, student grades can be classified as A, B, C, D, etc. For such data one should use the mode as the measure of central tendency instead of the mean or median.

Example 3: Consider the following data: 3, 4, 7, 11, 15, 20, 23, 22, 26, 33, 25, 13. This data set has no mode, as each value occurs only once. By grouping the data in some useful and meaningful way we can still obtain a mode. For example, the grouped frequency table is

| Group | Values | Frequency |
|---|---|---|
| 0 to 9 | 3, 4, 7 | 3 |
| 10 to 19 | 11, 13, 15 | 3 |
| 20 to 29 | 20, 22, 23, 25, 26 | 5 |
| 30 to 39 | 33 | 1 |
| Total | | 12 |

From this table we cannot find the most frequent value, but we can say that “20 to 29” is the group in which most of the observations occur. This group (the modal class) contains the mode, which can be found by using the mode formula for grouped data.
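The grouping step can be sketched as follows, assuming classes of width 10 starting at 0 as in the table above. The sketch only identifies the modal class; it does not apply the grouped-data mode formula, which the text does not state.

```python
from collections import Counter

data = [3, 4, 7, 11, 15, 20, 23, 22, 26, 33, 25, 13]
width = 10  # class width, assumed from the grouped table

# Map each value to the lower limit of its class, e.g. 23 -> 20.
class_counts = Counter((value // width) * width for value in data)

for lower in sorted(class_counts):
    print(f"{lower} to {lower + width - 1}: {class_counts[lower]}")

modal_lower, modal_count = class_counts.most_common(1)[0]
modal_class = (modal_lower, modal_lower + width - 1)
print(f"modal class: {modal_class[0]} to {modal_class[1]}")
```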

## Descriptive Statistics for Multivariate Data Sets

Much of the information contained in the data can be assessed by calculating certain summary numbers, known as descriptive statistics, such as the arithmetic mean (a measure of location) and the average of the squared distances of all of the numbers from the mean (a measure of spread or variation). Here we discuss descriptive statistics for a multivariate data set.

We shall rely most heavily on descriptive statistics that measure location, variation, and linear association.

## Measure of Location

The arithmetic average of the $n$ measurements $(x_{11}, x_{21}, \cdots ,x_{n1})$ on the first variable (defined in Multivariate Analysis: An Introduction) is

Sample Mean = $\bar{x}_{1}=\frac{1}{n} \sum _{j=1}^{n}x_{j1}$

The sample mean for the $n$ measurements on each of the $p$ variables (there will be $p$ sample means) is

$\bar{x}_{k} =\frac{1}{n} \sum _{j=1}^{n}x_{jk} \mbox{ where } k = 1, 2, \cdots , p$
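As a minimal sketch, the $p$ sample means can be computed column by column from a data matrix whose rows are items and whose columns are variables. The matrix `X` below ($n=4$, $p=2$) holds illustrative values assumed for this example; it does not come from the text.

```python
# Illustrative data matrix (assumed values): n = 4 items (rows),
# p = 2 variables (columns), so X[j][k] plays the role of x_jk.
X = [
    [42.0, 4.0],
    [52.0, 5.0],
    [48.0, 4.0],
    [58.0, 3.0],
]

n = len(X)
p = len(X[0])

# One sample mean per variable: x-bar_k = (1/n) * sum over j of x_jk.
means = [sum(X[j][k] for j in range(n)) / n for k in range(p)]
print(means)  # [50.0, 4.0]
```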

A measure of spread (the sample variance) for the $n$ measurements on the first variable is
$s_{1}^{2} =\frac{1}{n} \sum _{j=1}^{n}(x_{j1} -\bar{x}_{1} )^{2}$ where $\bar{x}_{1}$ is the sample mean of the $x_{j1}$’s.

In general, the measure of spread (variance) for the $n$ measurements on the $k$th variable is

$s_{k}^{2} =\frac{1}{n} \sum _{j=1}^{n}(x_{jk} -\bar{x}_{k} )^{2} \mbox{ where } k=1,2,\cdots ,p$

The sample variance is also written with a double subscript, $s_{kk}$:

$s_{k}^{2} =s_{kk} =\frac{1}{n} \sum _{j=1}^{n}(x_{jk} -\bar{x}_{k} )^{2} \mbox{ where } k=1,2,\cdots ,p$

The square root of the sample variance, $\sqrt{s_{kk}}$, is the sample standard deviation.
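The variance formula above uses the divisor $n$, which corresponds to `statistics.pvariance` and `statistics.pstdev` in Python's standard library (`statistics.variance` and `statistics.stdev` use $n-1$ instead). The sketch below applies them to each column of an illustrative data matrix; the values in `X` are assumed for the example.

```python
from statistics import pstdev, pvariance

# Illustrative data matrix (assumed values): rows are items, columns are variables.
X = [
    [42.0, 4.0],
    [52.0, 5.0],
    [48.0, 4.0],
    [58.0, 3.0],
]

# zip(*X) transposes the rows into one tuple per variable.
columns = list(zip(*X))

variances = [pvariance(column) for column in columns]   # s_kk, divisor n
std_devs = [pstdev(column) for column in columns]       # sqrt(s_kk)

print(variances)  # [34.0, 0.5]
print(std_devs)
```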

## Sample Covariance

Consider $n$ pairs of measurements on Variable 1 and Variable 2
$\left[\begin{array}{c} {x_{11} } \\ {x_{12} } \end{array}\right],\left[\begin{array}{c} {x_{21} } \\ {x_{22} } \end{array}\right],\cdots ,\left[\begin{array}{c} {x_{n1} } \\ {x_{n2} } \end{array}\right]$
That is, $x_{j1}$ and $x_{j2}$ are observed on the $j$th experimental item $(j=1,2,\cdots ,n)$. A measure of linear association between the measurements of Variable 1 and Variable 2 is provided by the sample covariance
$s_{12} =\frac{1}{n} \sum _{j=1}^{n}(x_{j1} -\bar{x}_{1} )(x_{j2} -\bar{x}_{2} )$
(the average product of the deviations from their respective means). In general, for the $i$th and $k$th variables,

$s_{ik} =\frac{1}{n} \sum _{j=1}^{n}(x_{ji} -\bar{x}_{i} )(x_{jk} -\bar{x}_{k} ) \mbox{ where } i=1,2,\cdots ,p \mbox{ and } k=1,2,\cdots ,p$

It measures the linear association between the $i$th and $k$th variables. When $i=k$, the sample covariance reduces to the sample variance $s_{kk}$.

The variance is the most commonly used measure of dispersion (variation) in the data: the larger the variance, the greater the amount of variation (information) in the data.
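The covariance formula can be sketched as a function that returns the full $p \times p$ matrix of the $s_{ik}$, with the variances on the diagonal. The function name `covariance_matrix` and the data in `X` are assumptions made for illustration; the divisor is $n$, as in the formulas above.

```python
def covariance_matrix(rows):
    """Sample covariance matrix (s_ik) with divisor n.

    rows[j][k] is the measurement of variable k on item j.
    """
    n = len(rows)
    p = len(rows[0])
    means = [sum(row[k] for row in rows) / n for k in range(p)]
    return [
        [
            sum((row[i] - means[i]) * (row[k] - means[k]) for row in rows) / n
            for k in range(p)
        ]
        for i in range(p)
    ]


# Illustrative data matrix (assumed values): n = 4 items, p = 2 variables.
X = [[42.0, 4.0], [52.0, 5.0], [48.0, 4.0], [58.0, 3.0]]

S = covariance_matrix(X)
for row in S:
    print(row)
```

For these values the off-diagonal entry is negative, indicating that the second variable tends to decrease as the first increases.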

## Sample Correlation Coefficient

The sample correlation coefficient for the $i$th and $k$th variables is

$r_{ik} =\frac{s_{ik} }{\sqrt{s_{ii} } \sqrt{s_{kk} } } =\frac{\sum _{j=1}^{n}(x_{ji} -\bar{x}_{i} )(x_{jk} -\bar{x}_{k} ) }{\sqrt{\sum _{j=1}^{n}(x_{ji} -\bar{x}_{i} )^{2} } \sqrt{\sum _{j=1}^{n}(x_{jk} -\bar{x}_{k} )^{2} } } \mbox{ where } i=1,2,\cdots ,p \mbox{ and } k=1,2,\cdots ,p$

Note that $r_{ik} =r_{ki}$ for all $i$ and $k$, and $r$ lies between -1 and +1. The magnitude of $r$ measures the strength of the linear association, and its sign indicates the direction of the association. If $r=0$, there is no linear association between the components.
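The second form of the formula (sums of cross-products and squared deviations) can be sketched directly. The function name `correlation` and the two data lists are assumptions made for illustration; the sketch does not handle a variable with zero variance, for which $r$ is undefined.

```python
from math import sqrt


def correlation(x, y):
    """Sample correlation coefficient of two equally long sequences."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same number of observations")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    # Undefined (division by zero) when either variable is constant.
    return cross / (sqrt(ss_x) * sqrt(ss_y))


# Illustrative measurements (assumed values) on two variables.
x = [42.0, 52.0, 48.0, 58.0]
y = [4.0, 5.0, 4.0, 3.0]

r = correlation(x, y)
print(round(r, 4))  # -0.3638
```

The divisor $\frac{1}{n}$ cancels between numerator and denominator, so the same value is obtained from $s_{ik}/(\sqrt{s_{ii}}\sqrt{s_{kk}})$.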

## Measure of Dispersion or Variability

A measure of location (an average, or measure of central tendency) is not sufficient to describe the characteristics of a distribution: it represents only the typical value of the data set, and two or more distributions may have exactly the same average while differing in other respects. To give a sensible description of the data, we also need a numerical quantity, called a measure of dispersion, variability, or scatter, that describes the spread of the values in the data set. There are two types of measures of dispersion or variability:

1. Absolute Measures
2. Relative Measures

A measure of central tendency together with a measure of dispersion describes the data more adequately than a measure of location alone, because an average only describes the balancing point of the data set; it provides no information about the degree to which the data spread or scatter about the average value. A measure of dispersion therefore indicates how representative the average is: the smaller the variability of a data set, the better the average represents its values.

1. Absolute Measures
Absolute measures are expressed in the same units as the original measurements, such as meters or grams (or in squared units, in the case of the variance). Because they depend on the unit and scale of measurement, they cannot be used directly to compare the variation of two or more data sets measured in different units or with very different averages.
The most common absolute measures of variability are:

• Range
• Semi-Interquartile Range or Quartile Deviation
• Mean Deviation
• Variance
• Standard Deviation

2. Relative Measures
Relative measures have no units because they are ratios, coefficients, or percentages. They are independent of the units of measurement and are useful for comparing data of different natures. The most common relative measures are:

• Coefficient of Variation
• Coefficient of Mean Deviation
• Coefficient of Quartile Deviation
• Coefficient of Standard Deviation
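The difference between absolute and relative measures can be sketched with two small data sets in different units. The data, the helper `dispersion_summary`, and its dictionary keys are hypothetical; the coefficient of variation is taken here as the standard deviation divided by the mean, expressed in percent, and the standard deviation uses the divisor $n$.

```python
from statistics import mean, pstdev, pvariance


def dispersion_summary(values):
    """Three absolute measures and one relative measure of dispersion."""
    avg = mean(values)
    sd = pstdev(values)  # standard deviation with divisor n
    return {
        "range": max(values) - min(values),
        "variance": pvariance(values),
        "std_dev": sd,
        # Relative measure; assumes ratio-scale data with a nonzero mean.
        "cv_percent": 100 * sd / avg,
    }


# Hypothetical data sets measured in different units.
heights_cm = [150, 160, 170, 180, 190]
weights_kg = [50, 55, 60, 65, 70]

heights = dispersion_summary(heights_cm)
weights = dispersion_summary(weights_kg)

for name, summary in (("heights (cm)", heights), ("weights (kg)", weights)):
    print(name, {key: round(value, 2) for key, value in summary.items()})
```

In this sketch the heights have the larger standard deviation but the smaller coefficient of variation, which shows why a unit-free measure is needed when comparing data sets in different units.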

Different terms are used for the measure of dispersion or variability, such as spread, scatter, measure of uncertainty, and deviation.


# Measure of Central Tendency or Measure of Location

A measure of central tendency is a statistic that summarizes an entire quantitative data set in a single representative value located somewhere in the center of the data.
The tendency of observations to cluster in the central part of the data is called central tendency, and the summary values that describe it are called measures of central tendency, also known as measures of location or position, or simply averages.
Note that

• A measure of central tendency should lie within the range of the data set.
• It should remain unchanged when the observations are rearranged in a different order.

## Criteria of a Satisfactory Measure of Location or Averages

There are several types of averages available to measure the representative value of a data set or distribution. It is desirable that an average should satisfy all or most of the following conditions.

• It should be well defined, i.e. rigorously defined, with no ambiguity in its definition. For example, “the sum of the values divided by their number” is a well-defined definition of the arithmetic mean.
• It should be based on all the observations made.
• It should be simple to understand and easy to interpret.
• It should be quick and easy to calculate.
• It should be amenable to mathematical treatment.
• It should be relatively stable in repeated sampling experiments.
• It should not be unduly influenced by abnormally large or small observations (i.e. extreme observations).

The mean, median and mode are all valid measures of central tendency, but under different conditions, some measures of central tendency become more appropriate to use than others. There are several different kinds of calculations for central tendency where the kind of calculation depends on the type of the data i.e. level of measurement on which data is measured.

The following measures of central tendency may be used for univariate or multivariate data.

• Arithmetic mean: The sum of all measurements divided by the number of observations in the data set.
• Median: The middle value of the sorted data. The median separates the higher half from the lower half of the data set, i.e. it partitions the data set into two parts.
• Mode: The most frequent or repeated value in the data set.
• Geometric mean: The $n$th root of the product of the $n$ data values.
• Harmonic mean: The reciprocal of the arithmetic mean of the reciprocals of the data values.
• Weighted mean: An arithmetic mean that assigns weights to the elements of the data.
• Distance-weighted estimator: A measure that uses weighting coefficients for $x_i$ that are computed as the inverse mean distance between $x_i$ and the other data points.
• Truncated mean: The arithmetic mean of the data values after a certain number or proportion of the highest and lowest data values have been discarded.
• Midrange: The arithmetic mean of the maximum and minimum values of a data set.
• Midhinge: The arithmetic mean of the first and third quartiles.
• Trimean: The weighted arithmetic mean of the median and the first and third quartiles.
• Winsorized mean: An arithmetic mean in which extreme values are replaced by values closer to the median.
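A few of the averages in this list can be sketched with Python's standard library (`statistics.geometric_mean` requires Python 3.8 or later). The helpers `midrange` and `truncated_mean`, the parameter `drop`, and the data set are hypothetical and introduced only for illustration; quartile-based measures are left out because their values depend on the quartile convention used.

```python
from statistics import geometric_mean, harmonic_mean, mean


def midrange(values):
    """Arithmetic mean of the minimum and maximum values."""
    return (min(values) + max(values)) / 2


def truncated_mean(values, drop):
    """Arithmetic mean after discarding the `drop` lowest and
    the `drop` highest values."""
    if drop < 0 or 2 * drop >= len(values):
        raise ValueError("drop must leave at least one observation")
    ordered = sorted(values)
    return mean(ordered[drop:len(ordered) - drop])


# Hypothetical positive data set (geometric and harmonic means need positive values).
data = [1, 2, 4, 8, 16]

print("arithmetic mean:", mean(data))
print("geometric mean: ", geometric_mean(data))
print("harmonic mean:  ", harmonic_mean(data))
print("midrange:       ", midrange(data))
print("truncated mean: ", truncated_mean(data, drop=1))
```

For positive data that are not all equal, the harmonic mean is smaller than the geometric mean, which in turn is smaller than the arithmetic mean.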
