# Basic Statistics and Data Analysis

#### Lecture notes, MCQS of Statistics

Introduction to statistics

## Sum of Squared Deviation from Mean

In statistics, the sum of squared deviations is a measure of the total variability (spread, variation) within a data set. In other words, the sum of squares measures deviation or variation from the mean (average) value of the given data set. A sum of squares is calculated by first computing the difference between each data point (observation) and the mean of the data set, i.e. $x=X-\overline{X}$. The computed $x$ is known as the deviation score for the given data set. Squaring each of these deviation scores and then adding the squared deviation scores gives the sum of squared deviations (SS), which is represented mathematically as

$SS=\sum(x^2)=\sum(X-\overline{X})^2$

Note that the small letter $x$ usually represents the deviation of each observation from the mean value, while the capital letter $X$ represents the variable of interest in statistics.

## Sum of Squares Example

Consider the following data set: {5, 6, 7, 10, 12}. To compute the sum of squares of this data set, follow these steps:

• Calculate the average of the given data by summing all the values in the data set and then dividing this sum by the total number of observations in the data set. Mathematically, it is $\frac{\sum X_i}{n}=\frac{40}{5}=8$, where 40 is the sum of all numbers $5+6+7+10+12$ and 5 is the number of observations.
• Calculate the difference between each observation in the data set and the average computed in step 1. For the given data, the differences are
5 – 8 = –3; 6 – 8 = –2; 7 – 8 = –1; 10 – 8 = 2 and 12 – 8 = 4
Note that the sum of these differences should be zero: (–3) + (–2) + (–1) + 2 + 4 = 0.
• Now square each of the differences obtained in step 2. The squares of these differences are
9, 4, 1, 4 and 16
• Now add the squared numbers obtained in step 3. The sum of these squared quantities is 9 + 4 + 1 + 4 + 16 = 34, which is the sum of squares of the given data set.
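
The four steps above can be sketched in a few lines of Python. This is only an illustration of the arithmetic for the data set {5, 6, 7, 10, 12}; the variable names are chosen here for readability and do not come from any particular library.

```python
# Sum of squared deviations for the example data set.
data = [5, 6, 7, 10, 12]

# Step 1: the mean of the data.
mean = sum(data) / len(data)

# Step 2: the deviation of each observation from the mean (x = X - mean).
deviations = [value - mean for value in data]

# Step 3: the square of each deviation.
squared_deviations = [d ** 2 for d in deviations]

# Step 4: the sum of the squared deviations (SS).
ss = sum(squared_deviations)

print("mean:", mean)
print("deviations:", deviations)
print("squared deviations:", squared_deviations)
print("sum of squares:", ss)
```

Printing the intermediate lists makes it easy to check each step against the hand calculation, including the fact that the deviations add up to zero.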

In statistics, the sum of squares occurs in different contexts, such as

• Partitioning of Variance (Partition of Sums of Squares)
• Sum of Squared Deviations (Least Squares)
• Sum of Squared Differences (Mean Squared Error)
• Sum of Squared Error (Residual Sum of Squares)
• Sum of Squares due to Lack of Fit (Lack of Fit Sum of Squares)
• Sum of Squares for Model Predictions (Explained Sum of Squares)
• Sum of Squares for Observations (Total Sum of Squares)
• Sum of Squared Deviation (Squared Deviations)
• Modeling involving Sum of Squares (Analysis of Variance)
• Multivariate Generalization of Sum of Squares (Multivariate Analysis of Variance)

As previously discussed, the sum of squares is a measure of the total variability of a set of scores around a specific number.

# Data Transformation (Variable Transformation)

A transformation is a rescaling of the data using a function or some mathematical operation on each observation. When data are very strongly skewed (negatively or positively), we sometimes transform the data so that they are easier to model. In other words, if a variable does not fit a normal distribution, one can try a data transformation so that the assumptions of a parametric statistical test are met.

The most common data transformation is the log (or natural log) transformation, which is often applied when most of the data values cluster around zero relative to the larger values in the data set and all of the observations are positive.

A transformation can also be applied to one or more variables in scatter plots, correlation and regression analysis to make the relationship between the variables more linear, and hence easier to model with a simple method. Transformations other than the log include the square root, the reciprocal, etc.

Reciprocal Transformation
The reciprocal transformation, $x$ to $\frac{1}{x}$ or $(-\frac{1}{x})$, is a very strong transformation with a drastic effect on the shape of the distribution. Note that this transformation cannot be applied to zero values. Although it can be applied to negative values, the reciprocal transformation is not useful unless all of the values are positive. The transformation $\frac{1}{x}$ reverses the order among values of the same sign, i.e. the largest becomes the smallest, etc.

Logarithmic Transformation
The logarithmic transformation, $x$ to $\log_{10} x$ (or natural log, or log base 2), is another strong transformation that affects the shape of the distribution. It is commonly used for reducing right skewness, but cannot be applied to negative or zero values.

Square Root Transformation
The square root transformation, $x$ to $x^{\frac{1}{2}}=\sqrt{x}$, has a moderate effect on the shape of the distribution and is weaker than the logarithm. It can be applied to zero values but not to negative values.
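
As a rough illustration of how these three transformations act on right-skewed data, the following sketch applies each of them to a small data set and compares a simple skewness measure before and after. The data values are hypothetical (made up for this example), and the `skewness` helper is an assumption of this sketch: it computes the moment coefficient of skewness, which is only one of several possible definitions.

```python
import math


def skewness(values):
    """Moment coefficient of skewness: m3 / m2**1.5 (population form)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Hypothetical, strictly positive, right-skewed data (not from the text).
data = [1, 2, 2, 3, 4, 6, 9, 15, 40, 120]

sqrt_data = [math.sqrt(v) for v in data]    # moderate effect
log_data = [math.log10(v) for v in data]    # stronger effect
recip_data = [1 / v for v in data]          # very strong, reverses the order

for name, values in [
    ("raw", data),
    ("square root", sqrt_data),
    ("log10", log_data),
    ("reciprocal", recip_data),
]:
    print(f"{name:>12}: skewness = {skewness(values):.3f}")
```

All values in this data set are positive, so none of the restrictions on zero or negative values comes into play. Since `data` is sorted in increasing order, `recip_data` comes out in decreasing order, which shows the order reversal described above.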

Goals of transformation
The goals of transformation may be:

• one might want to see the data structure differently
• one might want to reduce the skew, which assists in modeling
• one might want to straighten a nonlinear (curvilinear) relationship in a scatter plot. A transformation may also be used to obtain approximately equal dispersion, making data easier to handle and interpret

# Sampling theory, Introduction and Reasons to Sample

Often we are interested in drawing valid conclusions (inferences) about a large group of individuals or objects (called a population in statistics). Instead of examining the entire population, which may be difficult or even impossible, we may examine only a small part of it, called a sample. Our objective is to draw valid inferences about certain facts for the population from results found in the sample, a process known as statistical inference. The process of obtaining samples is called sampling, and the theory concerning sampling is called sampling theory.

Example: We may wish to draw conclusions about the percentage of defective bolts produced in a factory during a given 6-day week by examining 20 bolts each day, produced at various times during the day. Note that all bolts produced during the week comprise the population, while the 120 bolts selected over the 6 days constitute a sample.
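
The bolt example can be mimicked with a small simulation. Everything numeric below except the 6 days and the 20 bolts per day is a hypothetical assumption made for illustration: the daily production of 1,000 bolts and the 5% defect rate are invented, and in practice the true defect rate would be unknown.

```python
import random

random.seed(42)  # fixed seed so the sketch is reproducible

DAYS = 6
SAMPLE_PER_DAY = 20
BOLTS_PER_DAY = 1000      # hypothetical daily production
TRUE_DEFECT_RATE = 0.05   # hypothetical; unknown in a real study

# Population: every bolt produced during the week (True means defective).
population = [
    [random.random() < TRUE_DEFECT_RATE for _ in range(BOLTS_PER_DAY)]
    for _ in range(DAYS)
]

# Sample: 20 bolts drawn at random from each day's production.
sample = [
    bolt
    for day in population
    for bolt in random.sample(day, SAMPLE_PER_DAY)
]

population_pct = 100 * sum(sum(day) for day in population) / (DAYS * BOLTS_PER_DAY)
sample_pct = 100 * sum(sample) / len(sample)

print("sample size:", len(sample))
print(f"defective in population: {population_pct:.2f}%")
print(f"defective in sample:     {sample_pct:.2f}%")
```

The sample percentage is an estimate of the population percentage; the two will generally differ somewhat, and the difference is the sampling error.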

Sampling theory is widely used in research in business, medicine, and the social and psychological sciences for gathering information about a population. The sampling process comprises several stages:

• Defining the population of concern
• Specifying the sampling frame (set of items or events possible to measure)
• Specifying a sampling method for selecting the items or events from the sampling frame
• Determining the appropriate sample size
• Implementing the sampling plan
• Sampling and data collecting
• Data which can be selected

When studying the characteristics of a population, there are many reasons to study a sample (drawn from the population under study) instead of the entire population, such as:

1. Time: It is difficult and time-consuming to contact each and every individual in the whole population.
2. Cost: The cost or expense of studying all the items (objects or individuals) in a population may be prohibitive.
3. Physically Impossible: Some populations are infinite (or so large as to be effectively uncountable), so it is physically impossible to check all the items in the population, such as populations of fish, birds, snakes, or mosquitoes. Similarly, it is difficult to study populations that are constantly moving, being born, or dying.
4. Destructive Nature of Items: Some items are destroyed during testing (or checking). For example, a steel wire is stretched until it breaks, and the breaking point is recorded to determine its minimum tensile strength. Similarly, many electric and electronic components are destroyed when they are tested, which makes it impossible to study the entire population.
5. Qualified and Expert Staff: Enumeration requires highly qualified and expert staff, who are sometimes unavailable. National and international research organizations and agencies hire staff for enumeration purposes, which can be costly and time-consuming (as rehearsal of the activity is required), and it is not always easy to recruit highly qualified staff.
6. Reliability: With a scientific sampling technique, the sampling error can be minimized, and the non-sampling error committed in a sample survey is also kept small, because qualified investigators are employed.

Every sampling system is used to obtain estimates of certain properties of the population under study. The sampling system should be judged by how good the estimates obtained are. An individual estimate may, by chance, be very close to the true value (population parameter) or differ greatly from it, and so may give a poor measure of the merits of the system.

A sampling system is better judged by the frequency distribution of many estimates obtained by repeated sampling. A good system gives a frequency distribution with small variance and a mean estimate equal to the true value.
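
Repeated sampling is easy to imitate on a computer. The sketch below is a minimal illustration under stated assumptions: the population (the integers 1 to 1000), the sample sizes (10 and 50) and the number of repetitions (2,000) are all hypothetical choices, and simple random sampling without replacement stands in for "the sampling system".

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is reproducible

# Hypothetical population; its mean is the "true value" to be estimated.
population = list(range(1, 1001))
true_mean = statistics.mean(population)


def sampling_distribution(sample_size, repetitions=2000):
    """Sample means from repeated simple random samples of a given size."""
    return [
        statistics.mean(random.sample(population, sample_size))
        for _ in range(repetitions)
    ]


small = sampling_distribution(10)
large = sampling_distribution(50)

print("true mean:", true_mean)
for label, estimates in [("n = 10", small), ("n = 50", large)]:
    print(
        f"{label}: mean of estimates = {statistics.mean(estimates):.2f}, "
        f"variance of estimates = {statistics.variance(estimates):.2f}"
    )
```

In both cases the estimates centre on the true mean, but the larger sample size gives a frequency distribution with a noticeably smaller variance, which is the sense in which it is the better system.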

# The Level of Measurements

In statistics, data can be classified according to the level of measurement, which dictates the calculations that can be done to summarize and present the data (graphically). It also helps to determine which statistical tests should be performed. For example, suppose there are six colors of candies in a bag and you assign different numbers (codes) to them in such a way that brown candy has a value of 1, yellow 2, green 3, orange 4, blue 5, and red a value of 6. Adding all the assigned color values for the candies in this bag and then dividing by the number of candies yields an average value of 3.68. Does this mean that the average color is green or orange? Of course not. When computing a statistic, it is important to recognize the data type, which may be qualitative (nominal and ordinal) or quantitative (interval and ratio).

The levels of measurement have been developed in conjunction with the concepts of numbers and units of measurement. Statisticians classify measurements according to levels. There are four levels of measurement, namely nominal, ordinal, interval and ratio, described below.

Nominal Level of Measurement

In the nominal level of measurement, the observations of a qualitative variable can only be classified and counted. There is no particular order to the categories. The mode, frequency tables, pie charts and bar graphs are commonly used for this level of measurement.
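
To make the candy example concrete, the sketch below builds a frequency table and finds the mode for a bag of candies, and also computes the mean of the color codes to show that the number can be calculated but has no interpretation. The counts per color are hypothetical (the original bag's composition is not given), so the mean code here is not expected to reproduce the 3.68 quoted earlier.

```python
from collections import Counter

# Color codes as assigned in the candy example.
codes = {"brown": 1, "yellow": 2, "green": 3, "orange": 4, "blue": 5, "red": 6}

# Hypothetical bag of candies (the counts are made up for illustration).
candies = (
    ["brown"] * 3 + ["yellow"] * 5 + ["green"] * 2
    + ["orange"] * 4 + ["blue"] * 1 + ["red"] * 6
)

# Appropriate summaries for nominal data: frequency table and mode.
frequency = Counter(candies)
mode_color, mode_count = frequency.most_common(1)[0]

# The mean of the codes is computable but not meaningful for nominal data.
mean_code = sum(codes[color] for color in candies) / len(candies)

print("frequency table:", dict(frequency))
print("mode:", mode_color, "with", mode_count, "candies")
print("mean of color codes (not meaningful):", round(mean_code, 2))
```

Relabeling the colors with different codes would change `mean_code` but leave the frequency table and the mode untouched, which is why only the latter are suitable for nominal data.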

Ordinal Level of Measurement

In the ordinal level of measurement, data classifications are represented by sets of labels or names that have relative values (ranking or ordering of values). For example, suppose you survey 1,000 people and ask them to rate a restaurant on a scale ranging from 0 to 5, where 5 shows the highest score (highest liking level) and 0 shows the lowest (lowest liking level). Taking the average of these 1,000 people’s responses will have meaning. Usually graphs and charts are drawn for ordinal data.

Interval Level of Measurement

Numbers are also used to express quantities; temperature, dress size and plane ticket price are all quantities. The interval level of measurement allows for the degree of difference between items but not the ratio between them. There is a meaningful difference between values: for example, the difference between 10 degrees Fahrenheit and 15 degrees is 5, and the difference between 50 and 55 degrees is also 5 degrees. It is also important that zero is just a point on the scale. It does not represent the absence of heat; on the Celsius scale, for instance, zero is merely the freezing point of water.

Ratio Level of Measurement

Most quantitative data are recorded on the ratio level. It has all the characteristics of the interval level, but in addition, the zero point is meaningful and the ratio between two numbers is meaningful. Examples of the ratio level are wages, units of production, weight, changes in stock prices, distance between home and office, height, etc.
Many inferential test statistics depend on the ratio and interval levels of measurement. Many authors argue that interval and ratio measures should both be named scale.


# Degrees of Freedom

The degrees of freedom (df), or number of degrees of freedom, refers to the number of observations in a sample minus the number of (population) parameters being estimated from the sample data. This means that the degrees of freedom is a function of both the sample size and the number of independent variables. In other words, it is the number of independent observations out of a total of $n$ observations.

In statistics, the degrees of freedom can be considered as the number of values in a study that are free to vary. For example (a real-life example of degrees of freedom), if you have to take ten different courses to graduate, and only ten different courses are offered, then you have nine degrees of freedom. For nine semesters you will be able to choose which class to take; in the tenth semester, there will be only one class left to take. There is no choice if you want to graduate. This is the concept of the degrees of freedom (df) in statistics.

Let a random sample of size $n$ be taken from a population with an unknown mean $\mu$, which is estimated by the sample mean $\overline{X}$. The sum of the deviations of the observations from their sample mean is always equal to zero, i.e. $\sum_{i=1}^n (X_i-\overline{X})=0$. This imposes a constraint on the deviations $X_i-\overline{X}$ used when calculating the variance.

$S^2 =\frac{\sum_{i=1}^n (X_i-\overline{X})^2 }{n-1}$

This constraint (restriction) implies that $n-1$ deviations completely determine the $n$th deviation. The $n$ deviations (and also the sum of their squares and the sample variance $S^2$) therefore have $n-1$ degrees of freedom.
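
The constraint can be checked numerically. The sketch below reuses the data set {5, 6, 7, 10, 12} from the sum of squares example; it shows that the last deviation can be recovered from the first $n-1$ deviations, and that dividing the sum of squares by $n-1$ agrees with `statistics.variance` from the Python standard library, which computes the sample variance.

```python
import statistics

data = [5, 6, 7, 10, 12]
n = len(data)
mean = sum(data) / n

# Deviations from the sample mean; they always sum to zero.
deviations = [x - mean for x in data]

# Because of that constraint, the first n-1 deviations fix the last one.
last_from_others = -sum(deviations[:-1])

# Sample variance: sum of squared deviations divided by n-1.
ss = sum(d ** 2 for d in deviations)
s2 = ss / (n - 1)

print("sum of deviations:", sum(deviations))
print("last deviation:", deviations[-1], "recovered:", last_from_others)
print("S^2 by hand:", s2, "statistics.variance:", statistics.variance(data))
```

Only four of the five deviations carry independent information here, which is why the divisor is $n-1=4$ rather than $n=5$.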

A common way to think of degrees of freedom is as the number of independent pieces of information available to estimate another piece of information. More concretely, the number of degrees of freedom is the number of independent observations in a sample of data that are available to estimate a parameter of the population from which that sample is drawn. For example, if we have two observations, when calculating the mean we have two independent observations; however, when calculating the variance, we have only one independent observation, since the two observations are equally distant from the mean.

Single sample: For $n$ observations, one parameter (the mean) needs to be estimated, which leaves $n-1$ degrees of freedom for estimating variability (dispersion).

Two samples: There are a total of $n_1+n_2$ observations ($n_1$ for group 1 and $n_2$ for group 2) and two means need to be estimated, which leaves $n_1+n_2-2$ degrees of freedom for estimating variability.

Regression with $p$ predictors: There are $n$ observations and $p+1$ parameters need to be estimated (a regression coefficient for each predictor, plus the intercept). This leaves $n-p-1$ degrees of freedom for error, which is the error degrees of freedom in the ANOVA table.
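
The three counting rules above amount to "observations minus estimated parameters". They can be written as small helper functions; the function names and the sample sizes in the usage lines are hypothetical and chosen only for illustration.

```python
def df_single_sample(n):
    """n observations, one mean estimated."""
    return n - 1


def df_two_samples(n1, n2):
    """n1 + n2 observations, two group means estimated."""
    return n1 + n2 - 2


def df_regression_error(n, p):
    """n observations, p slopes plus one intercept estimated."""
    return n - (p + 1)


# Hypothetical sample sizes, for illustration only.
print(df_single_sample(25))          # 24
print(df_two_samples(12, 15))        # 25
print(df_regression_error(50, 3))    # 46
```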

Several commonly encountered statistical distributions (Student’s t, Chi-Squared, F) have parameters that are commonly referred to as degrees of freedom. This terminology simply reflects that in many applications where these distributions occur, the parameter corresponds to the degrees of freedom of an underlying random vector. If $X_i; i=1,2,\cdots, n$ are independent normal $(\mu, \sigma^2)$ random variables, the statistic $\frac{\sum_{i=1}^n (X_i-\overline{X})^2}{\sigma^2}$ follows a chi-squared distribution with $n-1$ degrees of freedom. Here, the degrees of freedom arise from the residual sum of squares in the numerator, and in turn from the $n-1$ degrees of freedom of the underlying residual vector $\{X_i-\overline{X}\}$.
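
A quick simulation can make the chi-squared claim plausible. It relies on the known fact that a chi-squared distribution with $k$ degrees of freedom has mean $k$ and variance $2k$. The values of $\mu$, $\sigma$, $n$ and the number of repetitions below are hypothetical choices for this sketch, and matching two moments is only a sanity check, not a proof of the distributional result.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

MU, SIGMA = 10.0, 2.0   # hypothetical population mean and standard deviation
N = 5                   # sample size, so n - 1 = 4 degrees of freedom
REPS = 20000            # number of simulated samples


def scaled_sum_of_squares():
    """Sum of squared deviations from the sample mean, divided by sigma^2."""
    xs = [random.gauss(MU, SIGMA) for _ in range(N)]
    xbar = sum(xs) / N
    return sum((x - xbar) ** 2 for x in xs) / SIGMA ** 2


draws = [scaled_sum_of_squares() for _ in range(REPS)]
mean_stat = sum(draws) / len(draws)
var_stat = statistics.variance(draws)

print(f"simulated mean:     {mean_stat:.3f} (chi-squared mean is n-1 = {N - 1})")
print(f"simulated variance: {var_stat:.3f} (chi-squared variance is 2(n-1) = {2 * (N - 1)})")
```

Note that the simulated mean settles near $n-1$, not $n$: one degree of freedom is used up by estimating the mean with $\overline{X}$.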