Basic Statistics and Data Analysis

Introduction to statistics

Sampling theory, Introduction and Reasons to Sample

Often we are interested in drawing valid conclusions (inferences) about a large group of individuals or objects (called a population in statistics). Instead of examining the entire population, which may be difficult or even impossible, we may examine only a small part of it, called a sample. Our objective is to draw valid inferences about certain facts for the population from results found in the sample, a process known as statistical inference. The process of obtaining samples is called sampling, and the theory concerning sampling is called sampling theory.

Example: We may wish to draw conclusions about the percentage of defective bolts produced in a factory during a given 6-day week by examining 20 bolts each day, produced at various times during the day. Here all bolts produced during the week comprise the population, while the 120 bolts selected over the 6 days constitute a sample.
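The bolt example can be sketched as a small simulation. This is only an illustration: the true defect rate of 5% and the fixed random seed are assumptions made for the sketch, since in practice the true rate is exactly what is unknown.

```python
import random

rng = random.Random(42)  # fixed seed so the sketch is repeatable

TRUE_DEFECT_RATE = 0.05  # hypothetical value; unknown in a real study
DAYS = 6
BOLTS_PER_DAY = 20

# Inspect 20 bolts on each of 6 days; True marks a defective bolt.
sample = [rng.random() < TRUE_DEFECT_RATE for _ in range(DAYS * BOLTS_PER_DAY)]

sample_size = len(sample)          # 120 bolts in total
defective_count = sum(sample)      # True counts as 1
estimated_rate = defective_count / sample_size

print(f"Inspected {sample_size} bolts, {defective_count} defective")
print(f"Estimated defective percentage: {100 * estimated_rate:.1f}%")
```

The estimate is computed from the 120 sampled bolts only, yet it is used as a statement about every bolt produced during the week.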

Sampling theory is widely used in business, medical, social and psychological research, among other fields, for gathering information about a population. The sampling process comprises several stages:

• Defining the population of concern
• Specifying the sampling frame (set of items or events possible to measure)
• Specifying a sampling method for selecting the items or events from the sampling frame
• Determining the appropriate sample size
• Implementing the sampling plan
• Sampling and data collecting
• Data which can be selected

When studying the characteristics of a population, there are many reasons to study a sample (drawn from the population under study) instead of the entire population, such as:

1. Time: It is difficult and slow to contact each and every individual of the whole population.
2. Cost: The cost or expense of studying all the items (objects or individuals) in a population may be prohibitive.
3. Physically impossible: Some populations are infinite or effectively uncountable, so it is physically impossible to check all the items in the population, such as populations of fish, birds, snakes or mosquitoes. Similarly, it is difficult to study populations that are constantly moving, being born, or dying.
4. Destructive nature of items: Some items are difficult to study because they are destroyed during testing (or checking). For example, a steel wire is stretched until it breaks and the breaking point is recorded to establish its minimum tensile strength. Similarly, various electric and electronic components are destroyed when they are tested, so the entire population cannot be studied.
5. Qualified and expert staff: Enumeration requires highly qualified and expert staff, who are sometimes unavailable. National and international research organizations and agencies hire staff for enumeration purposes, which is sometimes costly and time-consuming (as rehearsal of the activity is required), and it is sometimes not easy to recruit highly qualified staff.
6. Reliability: With a scientific sampling technique the sampling error can be minimized, and the non-sampling error committed in a sample survey is also smaller, because qualified investigators are involved.

Every sampling system is used to obtain estimates of certain properties of the population under study. The sampling system should be judged by how good the estimates obtained are. Individual estimates may, by chance, be very close to or differ greatly from the true value (the population parameter), so a single estimate gives a poor measure of the merits of the system.

A sampling system is better judged by the frequency distribution of many estimates obtained by repeated sampling; a good system gives a frequency distribution with small variance and a mean estimate equal to the true value.
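Judging a sampling system by repeated sampling can be illustrated with a short simulation. This is a sketch under stated assumptions: the population is a hypothetical list of 10,000 normally distributed values, the sampling method is simple random sampling without replacement, and the estimate is the sample mean.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed for repeatability

# Hypothetical finite population of 10,000 measurements.
population = [rng.gauss(50, 10) for _ in range(10_000)]
true_mean = statistics.fmean(population)


def repeated_estimates(sample_size, repetitions=2_000):
    """Return sample means from repeated simple random samples."""
    return [
        statistics.fmean(rng.sample(population, sample_size))
        for _ in range(repetitions)
    ]


small = repeated_estimates(10)
large = repeated_estimates(100)

print("true mean:", round(true_mean, 2))
for label, estimates in (("n=10", small), ("n=100", large)):
    print(
        label,
        "mean of estimates:", round(statistics.fmean(estimates), 2),
        "variance of estimates:", round(statistics.variance(estimates), 3),
    )
```

Both sample sizes give estimates that average out close to the true mean, but the larger sample gives a much narrower distribution of estimates, which is the sense in which it is the better system.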

The Level of Measurements

In statistics, data can be classified according to the level of measurement, which dictates the calculations that can be done to summarize and present the data (graphically) and helps to determine which statistical tests should be performed. For example, suppose there are six colors of candies in a bag and you assign different numbers (codes) to them in such a way that brown candy has a value of 1, yellow 2, green 3, orange 4, blue 5, and red a value of 6. Adding all the assigned color values for the candies in the bag and then dividing by the number of candies yields an average value of 3.68. Does this mean that the average color is green or orange? Of course not. When computing a statistic, it is important to recognize the data type, which may be qualitative (nominal and ordinal) or quantitative (interval and ratio).
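The candy example can be reproduced in a few lines. The bag contents below are hypothetical (so the average differs from the 3.68 quoted above); the point is only that the mean of the codes can be computed but not interpreted, while the mode can.

```python
from collections import Counter
from statistics import mean

COLOR_CODES = {1: "brown", 2: "yellow", 3: "green", 4: "orange", 5: "blue", 6: "red"}

# Hypothetical bag of 12 candies, recorded by color code.
candies = [1, 1, 2, 3, 3, 3, 4, 5, 6, 6, 3, 2]

# The arithmetic works, but the result is not a color.
code_average = mean(candies)

# Counting categories is the meaningful summary for nominal data.
counts = Counter(candies)
modal_code, modal_count = counts.most_common(1)[0]

print("average of codes:", code_average)
print("most common color:", COLOR_CODES[modal_code], "with", modal_count, "candies")
```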

The levels of measurement have been developed in conjunction with the concepts of numbers and units of measurement. Statisticians classify measurements according to levels. There are four levels of measurement, namely nominal, ordinal, interval and ratio, described below.

Nominal Level of Measurement

In the nominal level of measurement, the observations of a qualitative variable can only be classified and counted. There is no particular order to the categories. The mode, frequency tables, pie charts and bar graphs are the usual summaries for this level of measurement.

Ordinal Level of Measurement

In the ordinal level of measurement, data are classified by sets of labels or names that have relative values (a ranking or ordering of values). For example, suppose you survey 1,000 people and ask them to rate a restaurant on a scale ranging from 0 to 5, where 5 is the highest score (highest liking level) and 0 is the lowest (lowest liking level). Taking the average of these 1,000 people’s responses will have meaning. Usually graphs and charts are drawn for ordinal data.

Interval Level of Measurement

Numbers are also used to express quantities, such as temperature, dress size and plane ticket price. The interval level of measurement allows for the degree of difference between items but not the ratio between them. Differences between values are meaningful: for example, the difference between 10 degrees Fahrenheit and 15 degrees is 5 degrees, and the difference between 50 and 55 degrees is also 5 degrees. It is also important that zero is just a point on the scale; it does not represent the absence of heat. On the Celsius scale, for instance, zero is simply the freezing point of water.
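A short check shows why ratios are not meaningful on an interval scale. The two temperatures below are arbitrary illustrative values: the same pair gives a different ratio in Fahrenheit and in Celsius, while their difference converts consistently between the two scales.

```python
def fahrenheit_to_celsius(f):
    """Convert a temperature from degrees Fahrenheit to degrees Celsius."""
    return (f - 32) * 5 / 9


a_f, b_f = 50.0, 100.0  # illustrative temperatures in Fahrenheit
a_c, b_c = fahrenheit_to_celsius(a_f), fahrenheit_to_celsius(b_f)

ratio_f = b_f / a_f  # "twice as hot" in Fahrenheit
ratio_c = b_c / a_c  # a different ratio for the same two temperatures

# Differences survive the change of scale (up to the 9/5 unit factor).
diff_f = b_f - a_f
diff_c_rescaled = (b_c - a_c) * 9 / 5

print("ratio in Fahrenheit:", ratio_f)
print("ratio in Celsius:", round(ratio_c, 3))
print("difference agrees after rescaling:", abs(diff_f - diff_c_rescaled) < 1e-9)
```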

Ratio Level of Measurement

Most quantitative data are recorded on the ratio level. It has all the characteristics of the interval level, but in addition, the zero point is meaningful and the ratio between two numbers is meaningful. Examples of the ratio level are wages, units of production, weight, changes in stock prices, distance between home and office, height etc.
Many inferential test statistics depend on the ratio and interval levels of measurement. Many authors argue that interval and ratio measures should together be named scale.

Degrees of Freedom

The degrees of freedom (df), or number of degrees of freedom, refers to the number of observations in a sample minus the number of (population) parameters being estimated from the sample data. This means that the degrees of freedom are a function of both the sample size and the number of independent variables. In other words, it is the number of independent observations out of a total of $n$ observations.

In statistics, the degrees of freedom are considered to be the number of values in a study that are free to vary. As a real-life example, if you have to take ten different courses to graduate, and only ten different courses are offered, then you have nine degrees of freedom. For nine semesters you can choose which class to take; in the tenth semester only one class is left, so there is no choice if you want to graduate. This is the concept of degrees of freedom (df) in statistics.

Let a random sample of size $n$ be taken from a population with unknown mean $\mu$, and let $\overline{X}$ be the sample mean. The sum of the deviations of the observations from their mean is always equal to zero, i.e. $\sum_{i=1}^n (X_i-\overline{X})=0$. This places a constraint on the deviations $X_i-\overline{X}$ used when calculating the variance.

$S^2 =\frac{\sum_{i=1}^n (X_i-\overline{X})^2 }{n-1}$

This constraint (restriction) implies that $n-1$ deviations completely determine the $n$th deviation. The $n$ deviations (and also the sum of their squares and the sample variance $S^2$) therefore have $n-1$ degrees of freedom.
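The constraint can be checked numerically. The four data values below are made up for illustration; the sketch shows that the deviations sum to zero, that the last deviation can be recovered from the other $n-1$, and that dividing by $n-1$ reproduces the sample variance computed by Python's `statistics.variance`.

```python
from statistics import fmean, variance

data = [4.0, 7.0, 9.0, 12.0]  # illustrative sample
n = len(data)
x_bar = fmean(data)

deviations = [x - x_bar for x in data]

# Because the deviations sum to zero, the first n-1 of them fix the last one.
last_from_constraint = -sum(deviations[:-1])

# Sample variance with n-1 in the denominator.
s2 = sum(d ** 2 for d in deviations) / (n - 1)

print("deviations:", deviations, "sum:", sum(deviations))
print("last deviation recovered from the others:", last_from_constraint)
print("S^2:", s2, "statistics.variance:", variance(data))
```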

A common way to think of degrees of freedom is as the number of independent pieces of information available to estimate another piece of information. More concretely, the number of degrees of freedom is the number of independent observations in a sample of data that are available to estimate a parameter of the population from which that sample is drawn. For example, if we have two observations, when calculating the mean we have two independent observations; however, when calculating the variance, we have only one independent observation, since the two observations are equally distant from the mean.

Single sample: For $n$ observations, one parameter (the mean) needs to be estimated, which leaves $n-1$ degrees of freedom for estimating variability (dispersion).

Two samples: There are a total of $n_1+n_2$ observations ($n_1$ for group 1 and $n_2$ for group 2) and two means need to be estimated, which leaves $n_1+n_2-2$ degrees of freedom for estimating variability.

Regression with $p$ predictors: There are $n$ observations and $p+1$ parameters to be estimated (a regression coefficient for each predictor, plus the intercept). This leaves $n-p-1$ degrees of freedom for error, which is the error degrees of freedom in the ANOVA table.
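The three counting rules above amount to "observations minus estimated parameters". A minimal sketch, with hypothetical function names and illustrative sample sizes:

```python
def df_single_sample(n):
    """n observations, one mean estimated."""
    return n - 1


def df_two_samples(n1, n2):
    """n1 + n2 observations, two group means estimated."""
    return n1 + n2 - 2


def df_regression_error(n, p):
    """n observations, p slopes plus one intercept estimated."""
    return n - (p + 1)


print(df_single_sample(13))         # 12
print(df_two_samples(15, 12))       # 25
print(df_regression_error(50, 3))   # 46
```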

Several commonly encountered statistical distributions (Student’s t, Chi-Squared, F) have parameters that are commonly referred to as degrees of freedom. This terminology simply reflects that in many applications where these distributions occur, the parameter corresponds to the degrees of freedom of an underlying random vector. If $X_i; i=1,2,\cdots, n$ are independent normal $(\mu, \sigma^2)$ random variables, the statistic $\frac{\sum_{i=1}^n (X_i-\overline{X})^2}{\sigma^2}$ follows a chi-squared distribution with $n-1$ degrees of freedom. Here, the degrees of freedom arise from the residual sum of squares in the numerator, and in turn from the $n-1$ degrees of freedom of the underlying residual vector $\{X_i-\overline{X}\}$.
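One way to see the $n-1$ at work is by simulation, using the fact that a chi-squared variable has a mean equal to its degrees of freedom. The values of $\mu$, $\sigma$, $n$ and the number of repetitions below are arbitrary choices for this sketch.

```python
import random
from statistics import fmean

rng = random.Random(1)  # fixed seed for repeatability

MU, SIGMA, N = 10.0, 2.0, 5  # illustrative parameters
REPS = 20_000


def scaled_sum_of_squares():
    """Draw one sample of size N and return sum((X_i - X_bar)^2) / sigma^2."""
    xs = [rng.gauss(MU, SIGMA) for _ in range(N)]
    x_bar = fmean(xs)
    return sum((x - x_bar) ** 2 for x in xs) / SIGMA ** 2


values = [scaled_sum_of_squares() for _ in range(REPS)]
average = fmean(values)

print("average of the statistic:", round(average, 3))
print("n - 1 =", N - 1)
```

The simulated average settles near $n-1=4$ rather than $n=5$, consistent with one degree of freedom being used up by estimating the mean.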

Introduction to Odds Ratio

Medical students, students of clinical and psychological sciences, professionals allied to medicine who want to improve their understanding of the medical literature, and researchers from many other fields usually encounter the odds ratio (OR) throughout their careers.

The odds ratio is a relative measure of effect that allows the intervention group of a study to be compared with the comparison or placebo group. It is computed as a fraction:

• The numerator is the odds in the intervention arm
• The denominator is the odds in the control or placebo arm

If the odds of the outcome are the same in both groups, the ratio will be 1, implying that there is no difference between the two arms of the study. When the outcome is an undesirable event (such as death or disease), OR>1 means the control group is better than the intervention group, while OR<1 means the intervention group is better than the control group. For a desirable outcome the interpretation is reversed.

The ratio of the probability of success to the probability of failure is known as the odds. If the probability of an event is $p_1$, then the odds are:
$Odds=\frac{p_1}{1-p_1}$

The odds ratio is the ratio of two odds and can be used to quantify how strongly a factor is associated with the response in a given model. If the probabilities of occurrence of an event are $p_1$ (for the first group) and $p_2$ (for the second group), then the OR is:
$OR=\frac{\frac{p_1}{1-p_1}}{\frac{p_2}{1-p_2}}$
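A minimal sketch of the calculation, with hypothetical function names and illustrative probabilities (an event probability of 0.2 in the first group and 0.4 in the second):

```python
def odds(p):
    """Odds of an event with probability p, i.e. p / (1 - p)."""
    if not 0 <= p < 1:
        raise ValueError("p must satisfy 0 <= p < 1")
    return p / (1 - p)


def odds_ratio(p1, p2):
    """Ratio of the odds in group 1 to the odds in group 2."""
    odds2 = odds(p2)
    if odds2 == 0:
        raise ValueError("odds in the second group must be non-zero")
    return odds(p1) / odds2


p1, p2 = 0.2, 0.4  # illustrative event probabilities
result = odds_ratio(p1, p2)

print("odds in group 1:", round(odds(p1), 4))   # 0.25
print("odds in group 2:", round(odds(p2), 4))   # 0.6667
print("odds ratio:", round(result, 4))          # 0.375
```

Note that the odds ratio (0.375) is not the same as the ratio of the probabilities (0.5); the two are close only when both probabilities are small.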

If the predictors are binary, then the OR for the $i$th factor is defined as
$OR_i=e^{\beta_i}$

The regression coefficient $b_1$ from logistic regression (the estimate of $\beta_1$) is the estimated increase in the log odds of the dependent variable per unit increase in the value of the independent variable. In other words, the exponential function of the regression coefficient, $e^{b_1}$, is the OR associated with a one unit increase in the independent variable.
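This relationship can be verified numerically for a single-predictor logistic model. The coefficients $b_0=-1.0$ and $b_1=0.7$ below are made-up values, not fitted estimates; the sketch computes the odds at two values of $x$ one unit apart and compares their ratio with $e^{b_1}$.

```python
import math

b0, b1 = -1.0, 0.7  # hypothetical logistic regression coefficients


def probability(x):
    """Predicted probability from the logistic model at predictor value x."""
    return 1 / (1 + math.exp(-(b0 + b1 * x)))


def odds_at(x):
    """Odds of the outcome at predictor value x."""
    p = probability(x)
    return p / (1 - p)


# Odds ratio for a one unit increase in x, here from x = 2 to x = 3.
ratio = odds_at(3.0) / odds_at(2.0)

print("odds ratio from the model:", round(ratio, 6))
print("exp(b1):", round(math.exp(b1), 6))
```

The same ratio is obtained for any starting value of $x$, which is why a single number $e^{b_1}$ summarizes the effect of the predictor on the odds scale.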

Median Measure of Central Tendency

The median is the middle-most value in a data set when all of the values (observations) are arranged in either ascending or descending order of magnitude. The median is a measure of central tendency that divides the data set into two halves: 50% of the observations lie below the median value and 50% lie above it. If a data set has an odd number of observations (data points), the median is the single middle-most value after sorting the data set.

Example: Consider the following data set 5, 9, 8, 4, 3, 1, 0, 8, 5, 3, 5, 6, 3.
To find the median of the given data set, first sort it (either in ascending or descending order), that is
0, 1, 3, 3, 3, 4, 5, 5, 5, 6, 8, 8, 9. The middle most value of the above data after sorting is 5, which is median of the given data set.

When the number of observations in a data set is even then the median value is the average of two middle most values in the sorted data.

Example: Consider the following data set, 5, 9, 8, 4, 3, 1, 0, 8, 5, 3, 5, 6, 3, 2.
To find the median first sort it and then locate the middle most two values, that is,
0, 1, 2, 3, 3, 3, 4, 5, 5, 5, 6, 8, 8, 9. The middle most two values are 4 and 5. So median will be average of these two values, i.e. 4.5 in this case.

The median is less affected by extreme values in the data set than the mean, so it is the preferred measure of central tendency when the data set is skewed or not symmetrical.
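The effect of an extreme value can be seen by altering the first example data set. In the sketch below the largest value, 9, is replaced by a made-up outlier of 900; the mean changes substantially while the median does not move.

```python
from statistics import fmean, median

original = [0, 1, 3, 3, 3, 4, 5, 5, 5, 6, 8, 8, 9]
with_outlier = original[:-1] + [900]  # replace 9 by a hypothetical extreme value

print("mean:", round(fmean(original), 2), "->", round(fmean(with_outlier), 2))
print("median:", median(original), "->", median(with_outlier))
```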

For a large data set it is relatively difficult to locate the median value by inspecting the sorted data. It is helpful to find the position of the median using a formula. For an odd number of observations (here $n=13$) the formula is
$\begin{aligned} \text{Median} &=\left(\frac{n+1}{2}\right)\text{th value}\\ &=\left(\frac{13+1}{2}\right)\text{th value}\\ &=\left(\frac{14}{2}\right)\text{th value}=\text{7th value} \end{aligned}$

The 7th value in sorted data is the median of the given data.

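The median formula for an even number of observations (here $n=14$) is
$\begin{aligned} \text{Median}&=\frac{1}{2}\left[\left(\frac{n}{2}\right)\text{th value} + \left(\frac{n}{2}+1\right)\text{th value}\right]\\ &=\frac{1}{2}\left[\left(\frac{14}{2}\right)\text{th value} + \left(\frac{14}{2}+1\right)\text{th value}\right]\\ &=\frac{1}{2}(\text{7th value} + \text{8th value})\\ &=\frac{1}{2}(4 + 5)= 4.5 \end{aligned}$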
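The two position formulas can be turned into a short function. The name `median_by_position` is hypothetical; the sketch applies the formulas to the two example data sets and compares the results with Python's built-in `statistics.median`.

```python
from statistics import median


def median_by_position(values):
    """Median via the (n+1)/2 rule for odd n and the n/2, n/2+1 rule for even n."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty data set is undefined")
    if n % 2 == 1:
        position = (n + 1) // 2           # 1-based position of the middle value
        return ordered[position - 1]      # lists are 0-based
    lower = ordered[n // 2 - 1]           # (n/2)th value
    upper = ordered[n // 2]               # (n/2 + 1)th value
    return (lower + upper) / 2


odd_data = [5, 9, 8, 4, 3, 1, 0, 8, 5, 3, 5, 6, 3]
even_data = odd_data + [2]

print(median_by_position(odd_data), median(odd_data))     # 5 5
print(median_by_position(even_data), median(even_data))   # 4.5 4.5
```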

Note that the median cannot be found for nominal (unordered categorical) data, because the categories cannot be sorted.