# Basic Statistics and Data Analysis

## Standard Deviation

The standard deviation is a widely used measure in statistics that tells how much variation (spread or dispersion) there is in a data set. It is defined as the positive square root of the mean (average) of the squared deviations of the values from their mean.
To calculate the standard deviation, follow these steps:

1. First, find the mean of the data.
2. Take the difference of each data point from the mean computed in step 1. Note that the sum of these differences must equal zero, or be near zero due to rounding of numbers.
3. Square each of the differences obtained in step 2. Each squared difference is greater than or equal to zero, i.e. a non-negative quantity.
4. Add up all the squared quantities obtained in step 3. We call this the sum of squares of differences.
5. Divide this sum of squares of differences (obtained in step 4) by the total number of observations $n$ if you are computing the population standard deviation ($\sigma$). If you want to compute the sample standard deviation ($S$), divide the sum of squares of differences by the total number of observations minus one ($n-1$), i.e. the degrees of freedom. Note that $n$ is the number of observations in the data set.
6. Take the square root of the quantity obtained in step 5. The resulting quantity is the standard deviation of the given data set.
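The six steps above can be sketched in Python; the function name and the data values are my own, for illustration:

```python
import math

def stdev(data, sample=True):
    """Standard deviation following the six steps above."""
    n = len(data)
    mean = sum(data) / n                      # step 1: mean of the data
    deviations = [x - mean for x in data]     # step 2: deviations (sum near 0)
    squared = [d ** 2 for d in deviations]    # step 3: squared deviations
    ss = sum(squared)                         # step 4: sum of squares
    divisor = (n - 1) if sample else n        # step 5: n-1 for S, n for sigma
    return math.sqrt(ss / divisor)            # step 6: square root

values = [5, 6, 7, 10, 12]
print(stdev(values, sample=False))  # population sigma
print(stdev(values))                # sample S
```

Note the only difference between $\sigma$ and $S$ is the divisor chosen in step 5.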

For a set of $n$ observations $$X_1, X_2, \cdots, X_n$$ the population and sample standard deviations are
\begin{aligned}
\sigma &=\sqrt{\frac{\sum_{i=1}^n (X_i-\overline{X})^2}{n}}; \quad Population\, Standard\, Deviation\\
S&=\sqrt{\frac{\sum_{i=1}^n (X_i-\overline{X})^2}{n-1}}; \quad Sample\, Standard\, Deviation
\end{aligned}
The standard deviation can also be computed from the variance, as $S= \sqrt{Variance}$.

The practical meaning of the standard deviation is that, for approximately normal (bell-shaped) data, about 68% of the values lie within the range $\overline{X} \pm \sigma$, i.e. within one standard deviation of the mean. Similarly, about 95% of the values lie within $\overline{X} \pm 2 \sigma$ and about 99.7% within $\overline{X} \pm 3 \sigma$. This is known as the empirical rule.
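This rule can be checked empirically for normal data. A minimal sketch using only Python's standard library, with simulated (illustrative) data:

```python
import random
import statistics

random.seed(42)
data = [random.gauss(0, 1) for _ in range(100_000)]  # simulated normal data
mean = statistics.fmean(data)
sd = statistics.pstdev(data)

def within(k):
    """Fraction of values within k standard deviations of the mean."""
    return sum(abs(x - mean) <= k * sd for x in data) / len(data)

print(within(1), within(2), within(3))  # roughly 0.68, 0.95, 0.997
```

For markedly non-normal data these fractions can differ substantially, which is why the rule should be applied with care.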

### Examples of Standard Deviation and Variance

A large standard deviation indicates more spread in the data set, which can be interpreted as inconsistent behaviour of the collected data: the data points tend to lie far from the mean value. A smaller standard deviation means the data points tend to be close (very close) to the mean, indicating consistent behaviour of the data set.
Both the standard deviation and the variance are used to measure the risk of a particular investment in finance. A mean of 15% and a standard deviation of 2% indicate that the investment is expected to earn a 15% return, with a 68% chance that the return will actually fall between 13% and 17%. Similarly, there is a 95% chance that the investment will yield a return between 11% and 19%.

## Skewness: Measure of Asymmetry

“Skewed” and “askew” are widely used terms referring to something that is out of order or distorted on one side. Similarly, when referring to the shape of a frequency or probability distribution, skewness refers to the asymmetry of that distribution. A distribution with an asymmetric tail extending out to the right is referred to as “positively skewed” or “skewed to the right”, while a distribution with an asymmetric tail extending out to the left is referred to as “negatively skewed” or “skewed to the left”. Skewness ranges from minus infinity ($-\infty$) to positive infinity ($+\infty$). In simple words, skewness is a measure of the lack of symmetry.

Karl Pearson (1857-1936) first suggested measuring skewness by standardizing the difference between the mean and the mode: $skewness=\frac{\mu-mode}{\text{standard deviation}}$. Since population modes are not well estimated from sample modes, Stuart and Ord (1994) suggested estimating the difference between the mean and the mode as three times the difference between the mean and the median. The estimate of skewness then becomes $skewness=\frac{3(mean-median)}{\text{standard deviation}}$. Many statisticians use this measure after dropping the ‘3’, that is, $skewness=\frac{mean-median}{\text{standard deviation}}$. This statistic ranges from $-1$ to $+1$. According to Hildebrand (1986), absolute values of skewness above 0.2 indicate great skewness.

Skewness has also been defined with respect to the third moment about the mean, that is $\gamma_1=\frac{\sum(X-\mu)^3}{n\sigma^3}$, which is simply the expected value of the distribution of cubed $Z$ scores. Skewness measured in this way is also sometimes referred to as “Fisher’s skewness”. When the deviations from the mean are greater in one direction than in the other, this statistic deviates from zero in the direction of the larger deviations. From sample data, Fisher’s skewness is most often estimated by $g_1=\frac{n\sum z^3}{(n-1)(n-2)}$, where $z=(X-\overline{X})/S$. For large sample sizes ($n > 150$), $g_1$ is distributed approximately normally, with a standard error of approximately $\sqrt{\frac{6}{n}}$. While one could use this sampling distribution to construct confidence intervals for or tests of hypotheses about $\gamma_1$, there is rarely any value in doing so.

Arthur Lyon Bowley (1869-1957) also proposed a measure of skewness, based on the median and the two quartiles. In a symmetric distribution the two quartiles are equidistant from the median, but in an asymmetric distribution this will not be the case. Bowley’s coefficient of skewness is $skewness=\frac{Q_1+Q_3-2\,\text{median}}{Q_3-Q_1}$. Its value lies between $-1$ and $+1$.
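The three skewness measures discussed above can be computed with Python's standard library. A sketch, using a small right-skewed data set of my own choosing (quartiles here come from `statistics.quantiles`, whose default method may differ slightly from other conventions):

```python
import statistics

data = [2, 3, 3, 4, 4, 4, 5, 5, 6, 9]   # illustrative, right-skewed
n = len(data)
mean = statistics.fmean(data)
median = statistics.median(data)
s = statistics.stdev(data)               # sample standard deviation

# Pearson's skewness: the '3' variant and the plain variant
pearson = 3 * (mean - median) / s
pearson_plain = (mean - median) / s

# Fisher's skewness: g1 = n * sum(z^3) / ((n-1)(n-2)), z standardized with S
g1 = n * sum(((x - mean) / s) ** 3 for x in data) / ((n - 1) * (n - 2))

# Bowley's quartile skewness
q1, q2, q3 = statistics.quantiles(data, n=4)
bowley = (q1 + q3 - 2 * median) / (q3 - q1)

print(pearson, g1, bowley)  # all positive: skewed to the right
```

All three statistics agree on the direction of the asymmetry, though their magnitudes are not directly comparable.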

The most commonly used measures of skewness (those discussed here) may produce some surprising results, such as a negative value when the shape of the distribution appears skewed to the right.

It is important for researchers in the behavioural and business sciences to measure skewness when it appears in their data. A great amount of skewness may motivate the researcher to investigate the existence of outliers. When making decisions about which measure of location to report and which inferential statistic to employ, one should take into consideration the estimated skewness of the population. Normal distributions have zero skewness; of course, a distribution can be perfectly symmetric yet far from normal. Transformations of the variable under study are commonly employed to reduce (positive) skewness; these transformations include the square root, logarithm, and reciprocal of the variable.

## Standard Error of Estimate

Standard error (SE) is a statistical term used to measure the accuracy of a sample taken from a population of interest. The standard error of the mean measures the variation in the sampling distribution of the sample mean. It is usually denoted by $\sigma_\overline{x}$ and is calculated as

$\sigma_\overline{x}=\frac{\sigma}{\sqrt{n}}$

Drawing (obtaining) different samples from the same population of interest usually results in different values of the sample mean, indicating that there is a distribution of sample means having its own mean (average value) and variance. The standard error of the mean is the standard deviation of all possible sample means drawn from the same population.

The size of the standard error is affected by the standard deviation of the population and the number of observations in a sample, called the sample size. The larger the population standard deviation ($\sigma$), the larger the standard error, indicating more variability in the sample means. However, the larger the number of observations in a sample, the smaller the standard error, indicating less variability in the sample means; less variability means the sample is more representative of the population of interest.

If the sampled population is not very large, we need to make an adjustment in computing the SE of the sample mean. For a finite population with $N$ objects (observations) in total and $n$ objects (observations) in a sample, the adjustment factor is $\sqrt{\frac{N-n}{N-1}}$, called the finite population correction factor. The adjusted standard error is then

$\frac{\sigma}{\sqrt{n}} \sqrt{\frac{N-n}{N-1}}$
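A short sketch of the formula with and without the finite population correction (the function name and the numbers are illustrative):

```python
import math

def se_mean(sigma, n, N=None):
    """Standard error of the sample mean; applies the finite population
    correction factor sqrt((N - n) / (N - 1)) when N is given."""
    se = sigma / math.sqrt(n)
    if N is not None:
        se *= math.sqrt((N - n) / (N - 1))
    return se

print(se_mean(10, 25))           # sigma=10, n=25: SE = 10/5 = 2.0
print(se_mean(10, 25, N=100))    # same sample from a finite population of 100
```

As expected, the correction factor shrinks the SE, since sampling a sizeable fraction of a finite population leaves less room for the sample mean to vary.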

The SE is used to:

1. measure the spread of values of statistic about the expected value of that statistic
2. construct confidence intervals
3. test the null hypothesis about population parameter(s)

The standard error is computed from sample statistics. The formulas below give the SE for simple random samples, assuming that the size of the population ($N$) is at least 20 times larger than the sample size ($n$).
\begin{align*}
Sample\, mean, \overline{x} & \Rightarrow SE_{\overline{x}} = \frac{s}{\sqrt{n}}\\
Sample\, proportion, p &\Rightarrow SE_{p} = \sqrt{\frac{p(1-p)}{n}}\\
Difference\, b/w \, means, \overline{x}_1 - \overline{x}_2 &\Rightarrow SE_{\overline{x}_1-\overline{x}_2}=\sqrt{\frac{s_1^2}{n_1}+\frac{s_2^2}{n_2}}\\
Difference\, b/w\, proportions, p_1-p_2 &\Rightarrow SE_{p_1-p_2}=\sqrt{\frac{p_1(1-p_1)}{n_1}+\frac{p_2(1-p_2)}{n_2}}
\end{align*}
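These four formulas translate directly into code. A minimal sketch (function names and the example numbers are my own):

```python
import math

def se_mean(s, n):
    """SE of a sample mean, given sample standard deviation s."""
    return s / math.sqrt(n)

def se_proportion(p, n):
    """SE of a sample proportion."""
    return math.sqrt(p * (1 - p) / n)

def se_diff_means(s1, n1, s2, n2):
    """SE of the difference between two sample means."""
    return math.sqrt(s1 ** 2 / n1 + s2 ** 2 / n2)

def se_diff_proportions(p1, n1, p2, n2):
    """SE of the difference between two sample proportions."""
    return math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)

print(se_mean(2.0, 25))                      # 0.4
print(se_proportion(0.4, 100))
print(se_diff_means(2.0, 25, 3.0, 36))
print(se_diff_proportions(0.4, 100, 0.5, 120))
```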

The standard error is computed in the same way as a standard deviation, except that it describes the spread of a sample statistic and is estimated from sample statistics, whereas the standard deviation of the sampling distribution is defined in terms of population parameters.

## Sum of Squared Deviation from Mean

In statistics, the sum of squared deviations is a measure of the total variability (spread, variation) within a data set. In other words, the sum of squares is a measure of deviation or variation from the mean (average) value of the given data set. The sum of squares is calculated by first computing the difference between each data point (observation) and the mean of the data set, i.e. $x=X-\overline{X}$. The computed $x$ is known as the deviation score. Squaring each deviation score and then adding the squared deviation scores gives the sum of squared deviations (SS), which is represented mathematically as

$SS=\sum(x^2)=\sum(X-\overline{X})^2$

Note that the small letter $x$ usually represents the deviation of each observation from the mean value, while capital letter $X$ represents the variable of interest in statistics.

## Sum of Squares Example

Consider the following data set: {5, 6, 7, 10, 12}. To compute the sum of squares of this data set, follow these steps:

• Calculate the average of the given data by summing all the values in the data set and dividing this sum by the total number of observations in the data set. Mathematically, it is $\frac{\sum X_i}{n}=\frac{40}{5}=8$, where 40 is the sum $5+6+7+10+12$ and there are 5 observations.
• Calculate the difference of each observation in the data set from the average computed in step 1. The differences are
5 – 8 = –3; 6 – 8 = –2; 7 – 8 = –1; 10 – 8 = 2 and 12 – 8 = 4
Note that the sum of these differences should be zero: (–3 + –2 + –1 + 2 + 4 = 0)
• Now square each of the differences obtained in step 2. The squares of these differences are
9, 4, 1, 4 and 16
• Now add the squared numbers obtained in step 3. The sum of these squared quantities is 9 + 4 + 1 + 4 + 16 = 34, which is the sum of squares of the given data set.
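The worked example above can be reproduced with a few lines of Python (the function name is my own):

```python
def sum_of_squares(data):
    """Sum of squared deviations of the data from their mean."""
    mean = sum(data) / len(data)            # step 1: mean = 8 for this data
    return sum((x - mean) ** 2 for x in data)  # steps 2-4 combined

print(sum_of_squares([5, 6, 7, 10, 12]))  # 34.0
```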

In statistics, sum of squares occurs in different contexts such as

• Partitioning of Variance (Partition of Sums of Squares)
• Sum of Squared Deviations (Least Squares)
• Sum of Squared Differences (Mean Squared Error)
• Sum of Squared Error (Residual Sum of Squares)
• Sum of Squares due to Lack of Fit (Lack of Fit Sum of Squares)
• Sum of Squares for Model Predictions (Explained Sum of Squares)
• Sum of Squares for Observations (Total Sum of Squares)
• Sum of Squared Deviation (Squared Deviations)
• Modeling involving Sum of Squares (Analysis of Variance)
• Multivariate Generalization of Sum of Square (Multivariate Analysis of Variance)

As previously discussed, the sum of squares is a measure of the total variability of a set of scores around a specific number.

## Range: An Absolute Measure of Dispersion

A measure of central tendency provides a typical value for the data set, but it does not tell the whole story: the mean, median and mode describe the centre of the data but say nothing about its spread. For more information about the data we therefore need another kind of measure, namely a measure of dispersion (spread).

The spread of data can be measured by calculating the range; the range tells us over how many values the data extends. The range (an absolute measure of dispersion) is found by subtracting the smallest value in the data (called the lower bound) from the largest value (called the upper bound), i.e.

Range = Upper Bound – Lower Bound
OR
Range = Largest Value – Smallest Value

This absolute measure of dispersion has disadvantages: the range only describes the width of the data set (measured in the same units as the data) but does not give a real picture of how the data are distributed. If the data contain outliers, using the range to describe the spread can be very misleading, as the range is sensitive to outliers. We therefore need to be careful in using the range: it says nothing about what happens between the highest and lowest values, and because it is based only on the two extreme values it can give a misleading picture of the spread. It is therefore an unsatisfactory measure of dispersion.

However, the range is widely used in statistical process control applications such as control charts of manufactured products, daily temperatures, stock prices, etc., as it is very easy to calculate. It is an absolute measure of dispersion; its relative measure, known as the coefficient of dispersion (or coefficient of range), is defined by the relation

$Coefficient\,\, of\,\, Dispersion = \frac{x_m-x_0}{x_m+x_0}$

where $x_m$ is the largest value and $x_0$ the smallest. The coefficient of dispersion is a pure (dimensionless) number and is used for comparison purposes.
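Both quantities are trivial to compute. A sketch using the usual definition of the coefficient of range, $(x_m-x_0)/(x_m+x_0)$, with illustrative values:

```python
def data_range(data):
    """Range: largest value minus smallest value."""
    return max(data) - min(data)

def coefficient_of_dispersion(data):
    """Coefficient of range: (x_m - x_0) / (x_m + x_0), dimensionless."""
    x_m, x_0 = max(data), min(data)
    return (x_m - x_0) / (x_m + x_0)

values = [5, 6, 7, 10, 12]
print(data_range(values))                 # 7
print(coefficient_of_dispersion(values))  # 7/17
```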