# Basic Statistics and Data Analysis

## Quartiles: Measure of relative standing of an observation within data

Like percentiles and deciles, quartiles are a type of quantile, a measure of the relative standing of an observation within a data set. The quartiles are the three values that divide the ordered data (the order statistics) into four equal parts, each containing a quarter of the observations: the first quartile $Q_1$, the second quartile $Q_2$ (the median), and the third quartile $Q_3$. The first quartile (also known as the lower quartile) is the value in the order statistics that exceeds 1/4 of the observations and is less than the remaining 3/4. The third quartile (also known as the upper quartile) is the value that exceeds 3/4 of the observations and is less than the remaining 1/4. The second quartile is the median.

## Quartiles for Ungrouped Data

For ungrouped data, the quartiles are calculated by splitting the order statistics at the median and then taking the median of each half. If $n$ is odd, the median can be included in both halves.

Example: Find $Q_1$, $Q_2$ and $Q_3$ for the following ungrouped data set: 88.03, 94.50, 94.90, 95.05, 84.60.

Solution: We split the order statistics at the median and calculate the median of each half. Since $n$ is odd, we include the median in both halves. The ordered data are 84.60, 88.03, 94.50, 94.90, 95.05.

\begin{align*}
Q_2&=\text{median}=Y_{(\frac{n+1}{2})}=Y_{(3)}\\
&=94.50 \quad (\text{the third observation})\\
Q_1&=\text{median of the first three values}=Y_{(\frac{1+3}{2})}\\
&=Y_{(2)}=88.03 \quad (\text{the second observation})\\
Q_3&=\text{median of the last three values}=Y_{(\frac{3+5}{2})}\\
&=Y_{(4)}=94.90 \quad (\text{the fourth observation})
\end{align*}
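The median-of-halves rule can be sketched in Python as follows. This is a minimal illustration rather than a standard library routine: the function name `quartiles_median_of_halves` is hypothetical, and for even $n$ it assumes the data are simply split into two equal halves, a case the rule above does not spell out.

```python
from statistics import median


def quartiles_median_of_halves(values):
    """Return (Q1, Q2, Q3) by taking the median of each half of the sorted data.

    For odd n the overall median is included in both halves, as in the text.
    For even n the data are split into two equal halves (an assumption).
    """
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    if n % 2 == 1:
        lower = ordered[:half + 1]   # includes the median
        upper = ordered[half:]       # includes the median
    else:
        lower = ordered[:half]
        upper = ordered[half:]
    return median(lower), median(ordered), median(upper)


data = [88.03, 94.50, 94.90, 95.05, 84.60]
q1, q2, q3 = quartiles_median_of_halves(data)
print(q1, q2, q3)  # 88.03 94.5 94.9
```

Other quartile conventions exist (many statistical packages interpolate between order statistics), so library functions may return slightly different values for the same data.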

## Quartiles for Grouped Data

For the grouped data (in ascending order) the quartiles are calculated as:
\begin{align*}
Q_1&=l+\frac{h}{f}(\frac{n}{4}-c)\\
Q_2&=l+\frac{h}{f}(\frac{2n}{4}-c)\\
Q_3&=l+\frac{h}{f}(\frac{3n}{4}-c)
\end{align*}
where
• $l$ is the lower class boundary of the class containing $Q_1$, $Q_2$ or $Q_3$.
• $h$ is the width of the class containing $Q_1$, $Q_2$ or $Q_3$.
• $f$ is the frequency of the class containing $Q_1$, $Q_2$ or $Q_3$.
• $n$ is the total number of observations (the sum of the frequencies).
• $c$ is the cumulative frequency of the class immediately preceding the class containing $Q_1$, $Q_2$ or $Q_3$.

The values $\frac{n}{4}$, $\frac{2n}{4}$ and $\frac{3n}{4}$ are used to locate the $Q_1$, $Q_2$ and $Q_3$ groups, respectively.

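Example: Find the quartiles for the following grouped data.

| Class boundaries | Frequency ($f$) | Cumulative frequency |
|---|---|---|
| 85.5–90.5 | 6 | 6 |
| 90.5–95.5 | 4 | 10 |
| 95.5–100.5 | 10 | 20 |
| 100.5–105.5 | 6 | 26 |
| 105.5–110.5 | 3 | 29 |
| 110.5–115.5 | 1 | 30 |

Solution: To locate the class containing $Q_1$, find the $\frac{n}{4}$th observation, which here is the $\frac{30}{4}$th, i.e. the 7.5th observation. The 7.5th observation falls in the group 90.5–95.5 (the $Q_1$ group).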
\begin{align*}
Q_1&=l+\frac{h}{f}(\frac{n}{4}-c)\\
&=90.5+\frac{5}{4}(7.5-6)=90.5+1.875=92.375
\end{align*}

For $Q_2$, the $\frac{2n}{4}$th observation = $\frac{2 \times 30}{4}$th observation = 15th observation, which falls in the group 95.5–100.5.
\begin{align*}
Q_2&=l+\frac{h}{f}(\frac{2n}{4}-c)\\
&=95.5+\frac{5}{10}(15-10)=98
\end{align*}

For $Q_3$, the $\frac{3n}{4}$th observation = $\frac{3\times 30}{4}$th observation = 22.5th observation, which falls in the group 100.5–105.5. So
\begin{align*}
Q_3&=l+\frac{h}{f}(\frac{3n}{4}-c)\\
&=100.5+\frac{5}{6}(22.5-20)=102.5833
\end{align*}
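The grouped-data formula can be sketched in Python as below. The function name `grouped_quartile` is hypothetical, and the frequency table is an assumption: it is read off from the values of $l$, $f$ and $c$ used in the worked solution together with $n=30$, with the last class (110.5–115.5, one observation) inferred so that the frequencies sum to 30. Equal class widths are also assumed.

```python
def grouped_quartile(lower_boundaries, width, frequencies, k):
    """Return Q_k (k = 1, 2 or 3) using Q_k = l + (h / f) * (k * n / 4 - c)."""
    if k not in (1, 2, 3):
        raise ValueError("k must be 1, 2 or 3")
    n = sum(frequencies)
    if n == 0:
        raise ValueError("frequencies must contain at least one observation")
    position = k * n / 4      # locates the quartile class
    c = 0                     # cumulative frequency of the preceding classes
    for l, f in zip(lower_boundaries, frequencies):
        if c + f >= position:  # first class whose cumulative frequency reaches the position
            return l + (width / f) * (position - c)
        c += f
    raise ValueError("could not locate the quartile class")


# Frequency table inferred from the worked example (the last class is an assumption).
lower_boundaries = [85.5, 90.5, 95.5, 100.5, 105.5, 110.5]
frequencies = [6, 4, 10, 6, 3, 1]

for k in (1, 2, 3):
    value = grouped_quartile(lower_boundaries, 5, frequencies, k)
    print(f"Q{k} = {value:.4f}")
```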

## Percentiles: Measure of relative standing of an observation within data

Percentiles are a measure of the relative standing of an observation within a data set. Percentiles divide a set of observations into 100 equal parts, and percentile scores are frequently used to report results from national standardized tests such as NAT, GAT, etc.

The $p$th percentile is the value $Y_{(p)}$ in the order statistics such that $p$ percent of the values are less than $Y_{(p)}$ and $(100-p)$ percent of the values are greater than $Y_{(p)}$. The 5th percentile is denoted by $P_5$, the 10th by $P_{10}$ and the 95th by $P_{95}$.

## Percentiles for the ungrouped data

To calculate percentiles for ungrouped data, adopt the following procedure:

1. Order the observations.
2. For the $m$th percentile, determine the product $\frac{m \times n}{100}$. If it is not an integer, round it up and take the corresponding ordered value. If it is an integer, say $k$, calculate the mean of the $k$th and $(k+1)$th ordered observations.

Example: For the following height data collected from students find the 10th and 95th percentiles. 91, 89, 88, 87, 89, 91, 87, 92, 90, 98, 95, 97, 96, 100, 101, 96, 98, 99, 98, 100, 102, 99, 101, 105, 103, 107, 105, 106, 107, 112.

Solution: The ordered observations of the data are 87, 87, 88, 89, 89, 90, 91, 91, 92, 95, 96, 96, 97, 98, 98, 98, 99, 99, 100, 100, 101, 101, 102, 103, 105, 105, 106, 107, 107, 112.

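For the 10th percentile, the position is $\frac{10 \times 30}{100}=3$, an integer, so $P_{10}$ is the mean of the 3rd and 4th ordered observations:

$P_{10}=\frac{88+89}{2}=88.5$

This means that 10 percent of the observations in the data set are less than 88.5.

For the 95th percentile, the position is $\frac{95 \times 30}{100}=28.5$, which is not an integer, so it is rounded up to 29. The 29th ordered observation is the 95th percentile, i.e. $P_{95}=107$.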
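The two-step procedure for ungrouped data can be sketched in Python as follows. The function name `percentile_ungrouped` is hypothetical, and the sketch assumes $m$ is a whole number with $0 < m < 100$. It tests whether $m \times n$ is a multiple of 100 using integer arithmetic, so the "is it an integer?" decision is not affected by floating-point rounding.

```python
import math


def percentile_ungrouped(values, m):
    """Return the m-th percentile (integer m, 0 < m < 100) using the rule in the text."""
    ordered = sorted(values)
    n = len(ordered)
    product = m * n
    if product % 100 == 0:
        # m*n/100 is an integer k: average the k-th and (k+1)-th ordered values
        k = product // 100
        return (ordered[k - 1] + ordered[k]) / 2
    # otherwise round the position up and take that ordered value
    k = math.ceil(product / 100)
    return ordered[k - 1]


heights = [91, 89, 88, 87, 89, 91, 87, 92, 90, 98, 95, 97, 96, 100, 101,
           96, 98, 99, 98, 100, 102, 99, 101, 105, 103, 107, 105, 106, 107, 112]

print(percentile_ungrouped(heights, 10))
print(percentile_ungrouped(heights, 95))
```

Under this rule the position for the 10th percentile is exactly 3, so the function averages the 3rd and 4th ordered heights. For the 95th percentile it rounds 28.5 up and returns the 29th ordered height.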

## Percentiles for the Grouped data

The $m$th percentile for grouped data is

$P_m=l+\frac{h}{f}\left(\frac{m \times n}{100}-c\right)$

As with the median, $\frac{m \times n}{100}$ is used to locate the $m$th percentile group. Here

• $l$ is the lower class boundary of the class containing $P_m$
• $h$ is the width of the class containing $P_m$
• $f$ is the frequency of the class containing $P_m$
• $n$ is the total number of observations (the sum of the frequencies)
• $c$ is the cumulative frequency of the class immediately preceding the class containing $P_m$

Note that the 50th percentile is the median by definition, as half of the values in the data are smaller than the median and half are larger. Similarly, the 25th and 75th percentiles are the lower quartile ($Q_1$) and upper quartile ($Q_3$), respectively. Quartiles, deciles and percentiles are collectively called quantiles or fractiles.
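For grouped data these identities can be checked by substituting $m=25$, $50$ and $75$ into the percentile formula. In this short sketch $l$, $h$, $f$ and $c$ refer to the class located by the common position, which is the same for the percentile and the matching quartile:

\begin{align*}
P_{25}&=l+\frac{h}{f}\left(\frac{25n}{100}-c\right)=l+\frac{h}{f}\left(\frac{n}{4}-c\right)=Q_1\\
P_{50}&=l+\frac{h}{f}\left(\frac{50n}{100}-c\right)=l+\frac{h}{f}\left(\frac{2n}{4}-c\right)=Q_2\\
P_{75}&=l+\frac{h}{f}\left(\frac{75n}{100}-c\right)=l+\frac{h}{f}\left(\frac{3n}{4}-c\right)=Q_3
\end{align*}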

Example: For the grouped data used in the quartiles example ($n=30$), compute $P_{10}$, $P_{25}$, $P_{50}$ and $P_{95}$.

Solution:

1. Locate the 10th percentile (the first decile, $D_1$) by $\frac{10 \times n}{100}=\frac{10 \times 30}{100}=3$, i.e. the 3rd observation.
So the $P_{10}$ group is 85.5–90.5, which contains the 3rd observation.
\begin{align*}
P_{10}&=l+\frac{h}{f}\left(\frac{10 n}{100}-c\right)\\
&=85.5+\frac{5}{6}(3-0)\\
&=85.5+2.5=88
\end{align*}
2. Locate the 25th percentile (the lower quartile, $Q_1$) by $\frac{25 \times n}{100}=\frac{25 \times 30}{100}=7.5$, i.e. the 7.5th observation.
So the $P_{25}$ group is 90.5–95.5, which contains the 7.5th observation.
\begin{align*}
P_{25}&=l+\frac{h}{f}\left(\frac{25 n}{100}-c\right)\\
&=90.5+\frac{5}{4}(7.5-6)\\
&=90.5+1.875=92.375
\end{align*}
3. Locate the 50th percentile (the median, i.e. the 2nd quartile and 5th decile) by $\frac{50 \times n}{100}=\frac{50 \times 30}{100}=15$, i.e. the 15th observation.
So the $P_{50}$ group is 95.5–100.5, which contains the 15th observation.
\begin{align*}
P_{50}&=l+\frac{h}{f}\left(\frac{50 n}{100}-c\right)\\
&=95.5+\frac{5}{10}(15-10)\\
&=95.5+2.5=98
\end{align*}
4. Locate the 95th percentile by $\frac{95 \times n}{100}=\frac{95 \times 30}{100}=28.5$, i.e. the 28.5th observation.
So the $P_{95}$ group is 105.5–110.5, which contains the 28.5th observation.
\begin{align*}
P_{95}&=l+\frac{h}{f}\left(\frac{95 n}{100}-c\right)\\
&=105.5+\frac{5}{3}(28.5-26)\\
&=105.5+4.1667=109.6667
\end{align*}

The percentiles and quartiles may also be read directly from the graph of the cumulative frequency function.
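Reading a percentile off the cumulative frequency graph amounts to linear interpolation between the points (class boundary, cumulative frequency). The sketch below illustrates this. The function name `percentile_from_ogive` is hypothetical, and the frequency table is an assumption reconstructed from the $l$, $f$ and $c$ values in the worked solution, with a final class 110.5–115.5 holding one observation so that $n=30$.

```python
def percentile_from_ogive(boundaries, frequencies, m):
    """Invert the piecewise-linear cumulative frequency curve at m percent.

    boundaries has one more entry than frequencies: boundaries[i] and
    boundaries[i + 1] are the lower and upper boundaries of class i.
    """
    # Cumulative frequency at each class boundary, starting from 0.
    cumulative = [0]
    for f in frequencies:
        cumulative.append(cumulative[-1] + f)
    n = cumulative[-1]
    target = m * n / 100   # height on the cumulative frequency axis

    for i in range(len(frequencies)):
        low_c, high_c = cumulative[i], cumulative[i + 1]
        if high_c >= target and high_c > low_c:
            x0, x1 = boundaries[i], boundaries[i + 1]
            # Linear interpolation along the segment of the curve for class i.
            return x0 + (x1 - x0) * (target - low_c) / (high_c - low_c)
    raise ValueError("m must lie between 0 and 100 and the table must be non-empty")


# Assumed frequency table (see the note above).
boundaries = [85.5, 90.5, 95.5, 100.5, 105.5, 110.5, 115.5]
frequencies = [6, 4, 10, 6, 3, 1]

for m in (10, 25, 50, 95):
    print(f"P{m} = {percentile_from_ogive(boundaries, frequencies, m):.4f}")
```

Because each segment of the curve has slope $f/h$, this interpolation is algebraically the same as $P_m=l+\frac{h}{f}\left(\frac{m \times n}{100}-c\right)$.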

# Deciles (Measures of Positions)

The deciles are the nine values of the variable that divide an ordered (sorted) data set into ten equal parts, so that each part represents 1/10 of the sample or population. Deciles are denoted by $D_1, D_2, \ldots, D_9$, where the first decile ($D_1$) is the value in the order statistics that exceeds 1/10 of the observations and is less than the remaining 9/10, and the ninth decile ($D_9$) is the value that exceeds 9/10 of the observations and is less than the remaining 1/10. Note that the fifth decile is equal to the median. The deciles mark the values below which 10%, 20%, ..., 90% of the data lie.

## Calculating Deciles for ungrouped Data

To calculate deciles for ungrouped data, first order all the observations by magnitude, then use the following formula for the $m$th decile.

$D_m= m \times \left( \frac{n+1}{10} \right) \mbox{th value,} \qquad \mbox{where } m=1,2,\cdots,9$

Example: Calculate the 2nd and 8th deciles of the following ordered data: 13, 13, 13, 20, 26, 27, 31, 34, 34, 34, 35, 35, 36, 37, 38, 41, 41, 41, 45, 47, 47, 47, 50, 51, 53, 54, 56, 62, 67, 82.

Solution:

\begin{eqnarray*}
D_m &=&m \times \left\{\frac{n+1}{10} \right\} \mbox{th value}\\
D_2 &=& 2 \times \frac{30+1}{10}=6.2 \mbox{th value}
\end{eqnarray*}

We have to locate the sixth value in the ordered array and then move 0.2 of the distance between the sixth and seventh values, i.e. the value of the 2nd decile can be calculated as
$6 \mbox{th observation} + \{7 \mbox{th observation} - 6 \mbox{th observation} \}\times 0.2$

As the 6th observation is 27 and the 7th observation is 31, the second decile is $27+\{31-27\} \times 0.2 = 27.8$.

Similarly, $D_8$ can be calculated. Its position is $8 \times \frac{30+1}{10}=24.8$, and the 24th and 25th observations are 51 and 53, so $D_8 = 51+\{53-51\} \times 0.8 = 52.6$.
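The $(n+1)$ position rule with linear interpolation can be sketched in Python as below. The function name `decile_ungrouped` is hypothetical. Clamping to the smallest or largest observation when the position falls outside $1,\ldots,n$ is an assumption added for small samples and is not part of the rule described above.

```python
import math


def decile_ungrouped(values, m):
    """Return the m-th decile (m = 1..9) using position m * (n + 1) / 10."""
    ordered = sorted(values)
    n = len(ordered)
    position = m * (n + 1) / 10
    k = math.floor(position)      # whole part: the k-th ordered value
    fraction = position - k       # fractional part: how far to move towards the next value
    if k < 1:                     # assumption: clamp below the first observation
        return ordered[0]
    if k >= n:                    # assumption: clamp at or above the last observation
        return ordered[-1]
    return ordered[k - 1] + fraction * (ordered[k] - ordered[k - 1])


data = [13, 13, 13, 20, 26, 27, 31, 34, 34, 34, 35, 35, 36, 37, 38,
        41, 41, 41, 45, 47, 47, 47, 50, 51, 53, 54, 56, 62, 67, 82]

print(round(decile_ungrouped(data, 2), 4))  # 27.8
print(round(decile_ungrouped(data, 8), 4))  # 52.6
```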

## Calculating Deciles for grouped Data

The $m$th decile for grouped data (in ascending order) can be calculated from the following formula.

$D_m=l+\frac{h}{f}\left(\frac{m \times n}{10}-c\right)$

where

• $l$ is the lower class boundary of the class containing the $m$th decile
• $h$ is the width of the class containing the $m$th decile
• $f$ is the frequency of the class containing the $m$th decile
• $n$ is the total number of observations (the sum of the frequencies)
• $c$ is the cumulative frequency of the class immediately preceding the class containing the $m$th decile

Example: Calculate the first and seventh deciles of the grouped data used in the quartiles example ($n=30$).

Solution: The decile class for $D_1$ is located by $\frac{m \times n}{10} = \frac{1 \times 30}{10} = 3$, i.e. the 3rd observation. As the 3rd observation lies in the first class (85.5–90.5), we have

\begin{eqnarray*}
D_m&=&l+\frac{h}{f}\left(\frac{m \times n}{10}-c\right)\\
D_1&=&85.5+\frac{5}{6}\left(\frac{1\times30}{10}-0\right)\\
&=&88
\end{eqnarray*}

The decile class for $D_7$ is 100.5–105.5, as $\frac{7 \times 30}{10}=21$, and the 21st observation lies in the fourth class.
\begin{eqnarray*}
D_m&=&l+\frac{h}{f}\left(\frac{m \times n}{10}-c\right)\\
D_7&=&100.5+\frac{5}{6}\left(\frac{7\times30}{10}-20\right)\\
&=&101.333
\end{eqnarray*}
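As a further check on the formula, the fifth decile can be worked out in the same way and compared with the median. Here $\frac{5 \times 30}{10}=15$, so the decile class is 95.5–100.5, with $f=10$ and $c=10$ (the same values used for $Q_2$ and $P_{50}$ earlier):

\begin{eqnarray*}
D_5&=&95.5+\frac{5}{10}\left(\frac{5\times30}{10}-10\right)\\
&=&95.5+2.5=98
\end{eqnarray*}

This agrees with $Q_2=P_{50}=98$, as expected since the fifth decile is the median.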

# Measure of Central Tendency or Measure of Location

A measure of central tendency is a statistic that summarizes an entire quantitative data set in a single value (a representative value of the data set) that tends to lie somewhere in the center of the data.
The tendency of the observations to cluster in the central part of the data is called central tendency, and the summary values are called measures of central tendency. They are also known as measures of location or position, or simply as averages.
Note that

• A measure of central tendency should lie somewhere within the range of the data set.
• It should remain unchanged when the observations are rearranged in a different order.

## Criteria of a Satisfactory Measure of Location or Averages

There are several types of averages available to measure the representative value of a set of data or distribution. So it is desirable that an average should satisfy or possess all or most of the following conditions.

• It should be well defined, i.e. rigorously defined, with no ambiguity in its definition. For example, "the sum of the values divided by their number" is a well-defined definition of the arithmetic mean.
• It should be based on all the observations made.
• It should be simple to understand and easy to interpret.
• It should be quick and easy to calculate.
• It should be amenable to mathematical treatment.
• It should be relatively stable in repeated sampling experiments.
• It should not be unduly influenced by abnormally large or small observations (i.e. extreme observations).

The mean, median and mode are all valid measures of central tendency, but under different conditions, some measures of central tendency become more appropriate to use than others. There are several different kinds of calculations for central tendency where the kind of calculation depends on the type of the data i.e. level of measurement on which data is measured.

The following measures of central tendency may be used for univariate or multivariate data.

• Arithmetic mean: The sum of all measurements divided by the number of observations in the data set.
• Median: The middle value of the sorted data. The median separates the higher half from the lower half of the data set, i.e. it partitions the data set into two parts.
• Mode: The most frequent or repeated value in the data set.
• Geometric mean: The $n$th root of the product of the data values.
• Harmonic mean: The reciprocal of the arithmetic mean of the reciprocals of the data values.
• Weighted mean: An arithmetic mean that assigns weights to certain elements of the data.
• Distance-weighted estimator: A measure that uses weighting coefficients for $x_i$ computed as the inverse mean distance between $x_i$ and the other data points.
• Truncated mean: The arithmetic mean of data values after a certain number or proportion of the highest and lowest data values have been discarded.
• Midrange: The arithmetic mean of the maximum and minimum values of a data set.
• Midhinge: The arithmetic mean of the first and third quartiles.
• Trimean: The weighted arithmetic mean of the median and two quartiles.
• Winsorized mean: An arithmetic mean in which extreme values are replaced by values closer to the median.
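Several of the measures listed above can be computed with Python's standard `statistics` module, as in the sketch below. The data set is hypothetical and chosen only so that the results are easy to check by hand. The helpers `midrange`, `truncated_mean` and `winsorized_mean` are illustrative names, not library functions, and they assume that `k` values are trimmed or replaced at each end.

```python
from statistics import mean, median, mode, geometric_mean, harmonic_mean


def midrange(values):
    """Arithmetic mean of the minimum and maximum values."""
    return (min(values) + max(values)) / 2


def truncated_mean(values, k):
    """Mean after discarding the k smallest and k largest values."""
    ordered = sorted(values)
    return mean(ordered[k:len(ordered) - k])


def winsorized_mean(values, k):
    """Mean after replacing the k smallest and k largest values by their nearest kept neighbours."""
    ordered = sorted(values)
    low, high = ordered[k], ordered[-k - 1]
    return mean([min(max(v, low), high) for v in ordered])


data = [1, 2, 4, 4, 8, 16]  # hypothetical positive data

print("arithmetic mean:", round(mean(data), 4))
print("median:", median(data))
print("mode:", mode(data))
print("geometric mean:", round(geometric_mean(data), 4))
print("harmonic mean:", round(harmonic_mean(data), 4))
print("midrange:", midrange(data))
print("truncated mean (k=1):", truncated_mean(data, 1))
print("winsorized mean (k=1):", round(winsorized_mean(data, 1), 4))
```

The single large value 16 pulls the arithmetic mean and midrange upwards, while the median, truncated mean and winsorized mean stay closer to the bulk of the data.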
