# Basic Statistics and Data Analysis

## Absolute Measure of Dispersion

An Absolute Measure of Dispersion gives an idea about the amount of dispersion/spread in a set of observations. These quantities measure the dispersion in the same units as the original data. Absolute measures cannot be used to compare the variation of two or more series/data sets, and a measure of absolute dispersion does not, in itself, tell whether the variation is large or small.

## Range

Range is the difference between the largest value and the smallest value in the data set. For ungrouped data, let $X_0$ be the smallest value and $X_n$ the largest value in a data set; then the range (R) is defined as
$R=X_n-X_0$.

For grouped data, the Range can be calculated in three different ways:
R = Mid point of highest class − Mid point of lowest class
R = Upper class limit of highest class − Lower class limit of lowest class
R = Upper class boundary of highest class − Lower class boundary of lowest class
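As a quick sketch, the range of ungrouped data can be computed in a few lines of Python (the data set here is a made-up example):

```python
# Range of an ungrouped data set: R = X_n - X_0 (largest minus smallest).
data = [13, 20, 26, 27, 31, 34, 35]  # hypothetical sample

def data_range(values):
    """Return the range R = max - min of the observations."""
    return max(values) - min(values)

print(data_range(data))  # 35 - 13 = 22
```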

## Quartile Deviation (Semi-Interquartile Range)

The interquartile range is defined as the difference between the third and first quartiles; half of this range is called the semi-interquartile range (SIQR) or simply the quartile deviation (QD). $QD=\frac{Q_3-Q_1}{2}$
The Quartile Deviation is superior to the range as it is not affected by extremely large or small observations; however, it does not give any information about the position of observations lying outside the two quartiles. It is not amenable to mathematical treatment and is greatly affected by sampling variability. Although the Quartile Deviation is not widely used as a measure of dispersion, it is used in situations in which extreme observations are thought to be unrepresentative or misleading. Since the Quartile Deviation is not based on all the observations, it discards the information carried by the extreme values.

Note: The range “Median ± QD” contains approximately 50% of the data.
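A minimal Python sketch of the quartile deviation, assuming the $p(n+1)$ positional convention (the same style as the $m(n+1)/10$ decile formula later in this document; other texts use slightly different quantile conventions, so results can differ at the margins):

```python
# Quartile Deviation QD = (Q3 - Q1)/2 for ungrouped data, using the
# p*(n+1) positional convention with linear interpolation.
def quantile(sorted_values, p):
    """Interpolated value at position p*(n+1) in a sorted list."""
    n = len(sorted_values)
    pos = p * (n + 1)
    k = int(pos)            # integer part (1-based position)
    frac = pos - k          # fractional part
    if k < 1:
        return sorted_values[0]
    if k >= n:
        return sorted_values[-1]
    return sorted_values[k - 1] + frac * (sorted_values[k] - sorted_values[k - 1])

def quartile_deviation(values):
    s = sorted(values)
    return (quantile(s, 0.75) - quantile(s, 0.25)) / 2

data = list(range(1, 12))          # hypothetical sample: 1, 2, ..., 11
print(quartile_deviation(data))    # Q1 = 3, Q3 = 9 -> QD = 3.0
```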

## Mean Deviation (Average Deviation)

The Mean Deviation is defined as the arithmetic mean of the deviations measured either from the mean or from the median. All these deviations are counted as positive to avoid the difficulty arising from the property that the sum of the deviations of observations from their mean is zero.
$MD=\frac{\sum|X-\overline{X}|}{n}\quad$ for ungrouped data for mean
$MD=\frac{\sum f|X-\overline{X}|}{\sum f}\quad$ for grouped data for mean
$MD=\frac{\sum|X-\tilde{X}|}{n}\quad$ for ungrouped data for median
$MD=\frac{\sum f|X-\tilde{X}|}{\sum f}\quad$ for grouped data for median
The Mean Deviation can be calculated about other averages as well, but it is least when the deviations are taken from the median.

The Mean Deviation gives more information than the range or the Quartile Deviation as it is based on all the observed values. The Mean Deviation does not give undue weight to occasional large deviations, so it is likely to be of use in situations where such deviations are apt to occur.
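The two ungrouped formulas above can be sketched directly in Python; the sample here is hypothetical:

```python
# Mean Deviation about the mean or about the median (ungrouped data).
def mean_deviation(values, about="mean"):
    """MD = (1/n) * sum of |X - center|, center = mean or median."""
    n = len(values)
    if about == "mean":
        center = sum(values) / n
    else:  # about the median
        s = sorted(values)
        mid = n // 2
        center = s[mid] if n % 2 else (s[mid - 1] + s[mid]) / 2
    return sum(abs(x - center) for x in values) / n

data = [4, 6, 8, 10, 12]               # hypothetical sample; mean = median = 8
print(mean_deviation(data, "mean"))    # (4+2+0+2+4)/5 = 2.4
print(mean_deviation(data, "median"))  # 2.4
```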

## Variance and Standard Deviation

This absolute measure of dispersion is defined as the mean of the squares of the deviations of all the observations from their mean. Traditionally the population variance is denoted by $\sigma^2$ (sigma squared) and the sample variance by $S^2$ or $s^2$.
Symbolically
$\sigma^2=\frac{\sum(X_i-\mu)^2}{N}\quad$ Population Variance for ungrouped data
$S^2=\frac{\sum(X_i-\overline{X})^2}{n}\quad$ Sample Variance for ungrouped data
$\sigma^2=\frac{\sum f(X_i-\mu)^2}{\sum f}\quad$ Population Variance for grouped data
$S^2=\frac{\sum f (X_i-\overline{X})^2}{\sum f}\quad$ Sample Variance for grouped data

The variance of a random variable X is denoted by Var(X). The term variance was introduced by R. A. Fisher (1890-1962) in 1918. The variance is in squared units, and so it is a large number compared to the observations themselves.
Note that there are alternative formulas to compute Variance or Standard Deviations.

The positive square root of the variance is called the Standard Deviation (SD); it expresses the deviation in the same units as the original observations themselves. It is a measure of the average spread about the mean and is symbolically defined as
$\sigma=\sqrt{\frac{\sum(X_i-\mu)^2}{N}}\quad$ Population Standard Deviation for ungrouped data
$S=\sqrt{\frac{\sum(X_i-\overline{X})^2}{n}}\quad$ Sample Standard Deviation for ungrouped data
$\sigma=\sqrt{\frac{\sum f(X_i-\mu)^2}{\sum f}}\quad$ Population Standard Deviation for grouped data
$S=\sqrt{\frac{\sum f (X_i-\overline{X})^2}{\sum f}}\quad$ Sample Standard Deviation for grouped data
The Standard Deviation is the most useful measure of dispersion; it was given the name Standard Deviation by Karl Pearson (1857-1936).
In some texts the sample variance is defined as $S^2=\frac{\sum (X_i-\overline{X})^2}{n-1}$ on the basis of the argument that knowledge of any $n-1$ deviations determines the remaining deviation, as the sum of the n deviations must be zero. In fact this is an unbiased estimator of the population variance $\sigma^2$. The Standard Deviation has a definite mathematical meaning, utilizes all the observed values and is amenable to mathematical treatment, but it is affected by extreme values.
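A short Python sketch of the ungrouped formulas, covering both divisors mentioned above (n, and n−1 for the unbiased sample variance); the data set is hypothetical:

```python
# Variance and Standard Deviation for ungrouped data, with both
# divisors: n (the definition above) and n-1 (unbiased sample variance).
import math

def variance(values, ddof=0):
    """ddof=0 divides by n; ddof=1 divides by n-1 (unbiased)."""
    n = len(values)
    mean = sum(values) / n
    ss = sum((x - mean) ** 2 for x in values)   # sum of squared deviations
    return ss / (n - ddof)

def std_dev(values, ddof=0):
    """Positive square root of the variance."""
    return math.sqrt(variance(values, ddof))

data = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical sample; mean = 5
print(variance(data))              # 32/8 = 4.0
print(std_dev(data))               # 2.0
print(variance(data, ddof=1))      # 32/7, the unbiased estimate
```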


# Deciles (Measures of Positions)

The deciles are the values (nine in number) of the variable that divide an ordered (sorted, arranged) data set into ten equal parts, so that each part represents 1/10 of the sample or population. Deciles are denoted by D1, D2, …, D9, where the first decile (D1) is the value of the order statistics that exceeds 1/10 of the observations and is less than the remaining 9/10, and D9 (the ninth decile) is the value in the order statistics that exceeds 9/10 of the observations and is less than the remaining 1/10. Note that the fifth decile is equal to the median. The deciles determine the values for 10%, 20%, … and 90% of the data.

## Calculating Deciles for Ungrouped Data

To calculate deciles for ungrouped data, first order all the observations according to the magnitudes of their values, then use the following formula for the mth decile.

$D_m= m \times \left( \frac{(n+1)}{10} \right) \mbox{th value; } \qquad \mbox{where } m=1,2,\cdots,9$

Example: Calculate the 2nd and 8th deciles of the following ordered data: 13, 13, 13, 20, 26, 27, 31, 34, 34, 34, 35, 35, 36, 37, 38, 41, 41, 41, 45, 47, 47, 47, 50, 51, 53, 54, 56, 62, 67, 82.
Solution:

\begin{eqnarray*}
D_m &=&m \times \{\frac{(n+1)}{10} \} \mbox{th value}\\
&=& 2 \times \frac{30+1}{10}=6.2\\
\end{eqnarray*}

We have to locate the sixth value in the ordered array and then move 0.2 of the distance between the sixth and seventh values, i.e. the value of the 2nd decile can be calculated as
$6 \mbox{th observation} + \{7 \mbox{th observation} - 6 \mbox{th observation} \}\times 0.2$
As the 6th observation is 27 and the 7th observation is 31,
the second decile would be $27+\{31-27\} \times 0.2 = 27.8$

Similarly, D8 can be calculated: D8 = 52.6.
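The steps above can be sketched in Python; the function below implements the $m(n+1)/10$ formula with the same interpolation as the worked example:

```python
# m-th decile for ungrouped data: D_m = value at position m(n+1)/10,
# interpolating between adjacent order statistics.
def decile(values, m):
    s = sorted(values)
    n = len(s)
    pos = m * (n + 1) / 10   # positional index (1-based)
    k = int(pos)             # integer part
    frac = pos - k           # fractional part
    if k >= n:
        return s[-1]
    return s[k - 1] + frac * (s[k] - s[k - 1])

data = [13, 13, 13, 20, 26, 27, 31, 34, 34, 34, 35, 35, 36, 37, 38,
        41, 41, 41, 45, 47, 47, 47, 50, 51, 53, 54, 56, 62, 67, 82]
print(decile(data, 2))  # position 6.2 -> 27 + 0.2*(31-27) = 27.8
print(decile(data, 8))  # position 24.8 -> 51 + 0.8*(53-51) = 52.6
```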

## Calculating Deciles for Grouped Data

The mth decile for grouped data (in ascending order) can be calculated from the following formula.

$D_m=l+\frac{h}{f}\left(\frac{m.n}{10}-c\right)$

where

l = the lower class boundary of the class containing the mth decile
h = the width of the class containing the mth decile
f = the frequency of the class containing the mth decile
n = the total number of frequencies
c = the cumulative frequency of the class preceding the class containing the mth decile

Example: Calculate the first and seventh deciles of the following grouped data

Solution: The decile class for D1 can be located from $\frac{m \cdot n}{10} = \frac{1 \times 30}{10} = 3$, i.e. the 3rd observation. As the 3rd observation lies in the first class (first group), so

\begin{eqnarray*}
D_m&=&l+\frac{h}{f}\left(\frac{m.n}{10}-c\right)\\
D_1&=&85.5+\frac{5}{6}\left(\frac{1\times30}{10}-0\right)\\
&=&88\\
\end{eqnarray*}

The decile class for D7 is 100.5–105.5, as $\frac{7 \times 30}{10}=21$, i.e. the 21st observation, which lies in the fourth class (group).
\begin{eqnarray*}
D_m&=&l+\frac{h}{f}\left(\frac{m.n}{10}-c\right)\\
D_7&=&100.5+\frac{5}{6}\left(\frac{7\times30}{10}-20\right)\\
&=&101.333\\
\end{eqnarray*}
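The grouped formula can be sketched in Python. The frequency table below is hypothetical: the first class (l = 85.5, h = 5, f = 6), the D7 class (l = 100.5, f = 6, c = 20) and n = 30 are taken from the worked solution above, while the middle-class frequencies are assumed values chosen only to be consistent with it:

```python
# m-th decile for grouped data: D_m = l + (h/f) * (m*n/10 - c).
def grouped_decile(m, classes):
    """classes: list of (lower_boundary, width, frequency), ascending order."""
    n = sum(f for _, _, f in classes)
    target = m * n / 10            # position of the m-th decile
    c = 0                          # cumulative frequency of preceding classes
    for l, h, f in classes:
        if c + f >= target:        # decile falls in this class
            return l + (h / f) * (target - c)
        c += f
    raise ValueError("target position beyond the last class")

# Hypothetical table consistent with the example (n = 30):
classes = [(85.5, 5, 6), (90.5, 5, 4), (95.5, 5, 10), (100.5, 5, 6), (105.5, 5, 4)]
print(grouped_decile(1, classes))  # 85.5 + (5/6)*(3 - 0)   = 88.0
print(grouped_decile(7, classes))  # 100.5 + (5/6)*(21 - 20) = 101.333...
```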

# Probability Related Terms

Sets: A set is a well-defined collection of distinct objects. The objects making up a set are called its elements. A set is usually denoted by capital letters, i.e. A, B, C, while its elements are denoted by small letters, i.e. a, b, c, etc.

Null Set: A set that contains no element is called the null set or simply the empty set. It is denoted by { } or Φ.

Subset: If every element of a set A is also an element of a set B, then A is said to be a subset of B, and it is denoted by A $\subseteq$ B.

Proper Subset: If A is a subset of B, and B contains at least one element which is not an element of A, then A is said to be a proper subset of B and is denoted by; A $\subset$ B.

Finite and Infinite Sets: A set is finite if it contains a specific number of elements, i.e. while counting the members of the set, the counting process comes to an end; otherwise the set is an infinite set.

Universal Set: A set consisting of all the elements of the sets under consideration is called the universal set. It is denoted by U.

Disjoint Sets: Two sets A and B are said to be disjoint sets if they have no elements in common, i.e. if A ∩ B = Φ, then A and B are said to be disjoint sets.

Overlapping Sets: Two sets A and B are said to be overlapping sets, if they have at least one element in common, i.e. if A ∩ B ≠Φ and none of them is the subset of the other set then A and B are overlapping sets.

Union of Sets: The union of two sets A and B is a set that contains the elements belonging either to A or to B or to both. It is denoted by A U B and read as A union B.

Intersection of Sets: The intersection of two sets A and B is a set that contains the elements belonging to both A and B. It is denoted by A ∩ B and read as A intersection B.

Difference of Sets: The difference of a set A and a set B is the set that contains the elements of the set A which are not contained in B. The difference of sets A and B is denoted by A−B.

Complement of a Set: Complement of a set A denoted by $\bar{A}$ or $A^c$ and is defined as $\bar{A}$=U−A.

Experiment: Any activity in which we observe or measure something, or an activity that results in or produces an event, is called an experiment.

Random Experiment: An experiment which, if repeated under identical conditions, may not give the same outcome; i.e. the outcome of a random experiment is uncertain, so that a given outcome is just one sample of many possible outcomes. For a random experiment, we know all the possible outcomes in advance. A random experiment has the following properties:

1. The experiment can be repeated any number of times.
2. A random trial consists of at least two outcomes.

Sample Space: The set of all possible outcomes of a random experiment is called the sample space. In the coin-toss experiment, the sample space is S={Head, Tail}; in the card-drawing experiment the sample space has 52 members. Similarly, the sample space for a die is S={1,2,3,4,5,6}.

Event: An event is simply a subset of the sample space. In a sample space there can be two or more events consisting of sample points. For a coin, the number of possible events is 4, found by $2^n$ where n is the number of sample points, i.e. i) A1={H}, ii) A2={T}, iii) A3={H,T} and iv) A4={ } are the possible events for the coin-toss experiment.

Simple Event: If an event consists of one sample point, then it is called simple event. For example, when two coins are tossed, the event {TT} is a simple event.

Compound Event: If an event consists of more than one sample point, it is called a compound event. For example, when two dice are rolled, the event B that the sum of the two faces is 4, i.e. B={(1,3), (2,2), (3,1)}, is a compound event.

Independent Events: Two events A and B are said to be independent, if the occurrence of one does not affect the occurrence of the other. For example, in tossing two coins, the occurrence of a head on one coin does not affect in any way the occurrence of a head or tail on the other coin.

Dependent Events: Two events A and B are said to be dependent, if the occurrence of one event affects the occurrence of the other event.

Mutually Exclusive Events: Two events A and B are said to be mutually exclusive if they cannot occur at the same time, i.e. A∩B=Φ. For example, when a coin is tossed, we get either a head or a tail, but not both. Since they have no point in common, these two events (head and tail) are mutually exclusive. Similarly, when a die is thrown, the possible outcomes 1, 2, 3, 4, 5, 6 are mutually exclusive.

Equally Likely or Non-Mutually Exclusive Events: Two events A and B are said to be equally likely when one event is as likely to occur as the other; i.e. if the experiment is repeated a large number of times, all the events have the chance of occurring an equal number of times. Two events are non-mutually exclusive if they can occur together; mathematically, A∩B≠Φ. For example, when a coin is tossed, a head is as likely to occur as a tail, and vice versa.

Exhaustive Events: When a sample space S is partitioned into some mutually exclusive events such that their union is the sample space itself, the events are called exhaustive events. OR
Events are said to be collectively exhaustive when the union of mutually exclusive events is the entire sample space S.
Let a die be rolled; the sample space is S={1,2,3,4,5,6}.
Let A={1,2}, B={3,4,5} and C={6}

A, B and C are mutually exclusive events and their union (AUBUC=S) is the sample space, so the events A, B and C are exhaustive.
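Python's built-in sets mirror the set operations used above, so the die example can be checked directly:

```python
# Set operations on the die example: union, intersection, complement,
# mutual exclusivity and exhaustiveness.
S = {1, 2, 3, 4, 5, 6}        # sample space for a die (universal set)
A = {1, 2}
B = {3, 4, 5}
C = {6}

print(A | B)                  # union A U B
print(A & B)                  # intersection; empty -> A, B mutually exclusive
print(S - A)                  # complement of A with respect to S
print((A | B | C) == S)       # True -> A, B, C are exhaustive
```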

# Sampling Error Definition, Example, Formula

Sampling error, also called estimation error, is the amount of inaccuracy in estimating some value that is caused by using only a portion of a population (i.e. a sample) rather than the whole population. In other words, the difference between the statistic (value from the sample, such as the sample mean) and the corresponding parameter (value of the population, such as the population mean) is called the sampling error. If $\bar{x}$ is the sample statistic and $\mu$ is the corresponding parameter, then the sampling error is $\bar{x} - \mu$.

Exact calculation/measurement of the sampling error is generally not feasible, as the true value of the population is usually unknown; however, it can often be estimated by probabilistic modeling of the sample.


## Causes of Sampling Error

• One cause of sampling error may be a biased sampling procedure. Every researcher should select a sample that is free from any bias and representative of the entire population of interest.
• Another cause of this error is chance. Randomization and probability sampling are used to minimize the sampling error, but it is still possible that the randomized subjects/objects are not representative of the population.

## Eliminating/Reducing the Sampling Error

Sampling error can be eliminated/reduced when a proper and unbiased probability sampling technique is used by the researcher and the sample size is large enough.

• Increasing the sample size
The sampling error can be reduced by increasing the sample size. If the sample size n is equal to the population size N, then the sampling error will be zero.
• Improving the sample design, e.g. by using stratification
The population is divided into different groups (strata) containing similar units.
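A minimal simulation sketch of the first point above: the sampling error $|\bar{x} - \mu|$ tends to shrink as the sample size grows, and is (essentially) zero when n equals the population size N. The population here is artificial, generated with a fixed seed:

```python
# Sketch: sampling error |x_bar - mu| versus sample size n.
import random

random.seed(1)  # fixed seed so the sketch is reproducible
population = [random.gauss(50, 10) for _ in range(10_000)]
mu = sum(population) / len(population)      # population mean (parameter)

for n in (10, 100, 1_000, len(population)):
    sample = random.sample(population, n)
    x_bar = sum(sample) / n                 # sample mean (statistic)
    print(n, abs(x_bar - mu))               # sampling error
```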


# Bias (Statistical Bias)

Bias is defined as the difference between the expected value of a statistic and the true value of the corresponding parameter. Therefore bias is a measure of the systematic error of an estimator; it indicates the distance of the estimator from the true value of the parameter. For example, if we average a large number of estimates from an unbiased estimator, we will recover the correct value.

Gauss, C.F. (1821), during his work on the least squares method, gave the concept of an unbiased estimator.

The bias of an estimator of a parameter should not be confused with its degree of precision, as degree of precision is a measure of the sampling error.

There are several types of bias, which should not be considered mutually exclusive:

• Selection Bias (arises due to systematic differences between the groups compared)
• Exclusion Bias (arises due to the systematic exclusion of certain individuals from the study)
• Analytical Bias (arises due to the way the results are evaluated)

## Mathematically, Bias can be Defined as

Let a statistic T be used to estimate a parameter θ. If E(T)=θ + b(θ), then b(θ) is called the bias of the statistic T, where E(T) represents the expected value of the statistic T. Note that if b(θ)=0, then E(T)=θ, so T is an unbiased estimator of θ.
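The definition can be illustrated by simulation, assuming a simple example: the n-divisor sample variance has $E[S^2] = \sigma^2 (n-1)/n$, so its bias is $b(\sigma^2) = -\sigma^2/n$, while the $(n-1)$-divisor version is unbiased. The parameter values below are made-up for the sketch:

```python
# Sketch: approximating E(T) by simulation to see the bias b(theta).
# True variance sigma^2 = 4 (sigma = 2); n = 5, so the n-divisor
# variance should average about 4 * (5-1)/5 = 3.2 (bias = -0.8).
import random

random.seed(0)
n, reps = 5, 100_000

biased_sum = unbiased_sum = 0.0
for _ in range(reps):
    x = [random.gauss(0, 2) for _ in range(n)]
    m = sum(x) / n
    ss = sum((xi - m) ** 2 for xi in x)   # sum of squared deviations
    biased_sum += ss / n                  # n-divisor estimator
    unbiased_sum += ss / (n - 1)          # (n-1)-divisor estimator

print(biased_sum / reps)    # approximately 3.2 -> biased
print(unbiased_sum / reps)  # approximately 4.0 -> unbiased
```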

Reference:
Gauss, C.F. (1821, 1823, 1826). Theoria Combinationis Observationum Erroribus Minimis Obnoxiae, Parts 1, 2 and suppl. Werke 4, 1-108.