Measure of Dispersion or Variability (2012)

Introduction to Measure of Dispersion

A measure of location (average, or central tendency) is not sufficient to describe the characteristics of a distribution, because two or more distributions may have exactly the same average and yet differ in other respects. A measure of central tendency only represents the typical value of the data set. To give a sensible description of the data, we also need a numerical quantity, called a measure of dispersion (variability or scatter), that describes the spread of the values in the data set. There are two types of measures of dispersion or variability:

  1. Absolute Measures
  2. Relative Measures

A measure of central tendency together with a measure of dispersion gives a more adequate description of data than a measure of location alone. Averages describe only the balancing point of the data set; they provide no information about the degree to which the data tend to spread or scatter about the average value. The measure of dispersion therefore indicates how well the measure of central tendency represents the data. The smaller the variability of a given data set, the better the average represents it.
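The point that equal averages can hide very different spreads can be illustrated with a small sketch. The two data sets below are hypothetical (they are not from the original text), and the population standard deviation is used only as one possible measure of spread.

```python
from statistics import mean, pstdev

# Two hypothetical data sets chosen to have the same mean.
a = [48, 49, 50, 51, 52]
b = [10, 30, 50, 70, 90]

mean_a, mean_b = mean(a), mean(b)

# Range: difference between the largest and smallest value.
range_a, range_b = max(a) - min(a), max(b) - min(b)

# Population standard deviation: typical distance from the mean.
sd_a, sd_b = pstdev(a), pstdev(b)

print("means:", mean_a, mean_b)
print("ranges:", range_a, range_b)
print("standard deviations:", round(sd_a, 3), round(sd_b, 3))
```

Both data sets have the same mean, but the second is far more spread out, so its mean is a much less representative summary.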

Absolute Measure of Dispersion

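Absolute measures are defined in such a way that they have the same units (meters, grams, etc.) as the original measurements. Because they carry units, absolute measures are not suitable for comparing the variation/spread of two or more data sets that are measured in different units or that have very different averages.
The most common absolute measures of variability are:

  • Range
  • Quartile Deviation
  • Mean Deviation
  • Variance and Standard Deviation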

Relative Measures of Dispersion

The relative measures have no units, as they are ratios, coefficients, or percentages. Relative measures are independent of the units of measurement and are useful for comparing data of different natures. The most common relative measures are:

  • Coefficient of Variation
  • Coefficient of Mean Deviation
  • Coefficient of Quartile Deviation
  • Coefficient of Standard Deviation

Several terms are used interchangeably for the measure of dispersion, such as variability, spread, scatter, measure of uncertainty, and deviation.



Testing of Hypothesis (2012)

Introduction

The objective of testing hypotheses (Testing of Statistical Hypothesis) is to determine if an assumption about some characteristic (parameter) of a population is supported by the information obtained from the sample.

Testing of Hypothesis

The terms hypothesis testing and testing of hypothesis are used interchangeably. A statistical hypothesis (not to be confused with a simple hypothesis, defined below) is a statement about a characteristic of one or more populations, such as the population mean. This statement may or may not be true. Its validity is checked using information obtained by sampling from the population.
Testing of Hypothesis refers to the formal procedure used by statisticians to accept or reject statistical hypotheses. The procedure includes the following steps:

i) Formulation of Null and Alternative Hypothesis

Null hypothesis

A hypothesis formulated for the sole purpose of rejecting or nullifying it is called the null hypothesis, usually denoted by $H_0$. There is usually a “not” or a “no” term in the null hypothesis, meaning that there is “no change”.

For example, the null hypothesis is that the mean age of M.Sc. students is 20 years. Statistically, it can be written as $H_0:\mu = 20$. Generally speaking, the null hypothesis is developed for the purpose of testing.
We should emphasize that if the null hypothesis is not rejected based on the sample data, we cannot say that the null hypothesis is true. In other words, failing to reject the null hypothesis does not prove that $H_0$ is true; it means only that we have failed to disprove $H_0$.

For the null hypothesis, we usually state that “there is no significant difference between A and B”. For example, “the mean tensile strength of copper wire is not significantly different from some standard”.

Alternative Hypothesis

Any hypothesis different from the null hypothesis is called an alternative hypothesis, denoted by $H_1$. It is the statement that is accepted if the sample data provide sufficient evidence that the null hypothesis is false. The alternative hypothesis is also referred to as the research hypothesis.

It is important to remember that no matter how the problem is stated, the null hypothesis will always contain the equal sign, and the equal sign will never appear in the alternative hypothesis. This is because the null hypothesis is the statement being tested, and we need a specific value to include in our calculations. The alternative hypothesis for the example given under the null hypothesis is $H_1:\mu \ne 20$.

Simple and Composite Hypothesis

If a statistical hypothesis completely specifies the form of the distribution as well as the values of all parameters, it is called a simple hypothesis. For example, suppose the age distribution of first-year college students follows $N(16, 25)$ and the null hypothesis is $H_0: \mu =16$; this null hypothesis is a simple hypothesis. If a statistical hypothesis does not completely specify the form of the distribution or the values of all parameters, it is called a composite hypothesis. For example, $H_1:\mu < 16$ or $H_1:\mu > 16$.

ii) Level of Significance

The level of significance (significance level) is denoted by the Greek letter alpha ($\alpha$). It is also called the level of risk (as there is the risk you take of rejecting the null hypothesis when it is true). The level of significance is defined as the probability of making a type-I error. It is the maximum probability with which we would be willing to risk a type-I error. It is usually specified before any sample is drawn so that the results obtained will not influence our choice.

In practice, the 10% (0.10), 5% (0.05), and 1% (0.01) levels of significance are commonly used in testing a given hypothesis. A 5% level of significance means that there are about 5 chances in 100 that we would reject the null hypothesis when it is true; that is, the testing procedure leads to the right decision about a true null hypothesis about 95% of the time. When a hypothesis is rejected at the 0.05 level of significance, the risk we accepted of wrongly rejecting a true null hypothesis was 0.05.

Selection of Level of Significance

In Testing of Hypothesis, the selection of the level of significance depends on the field of study. Traditionally 0.05 level is selected for business science-related problems, 0.01 for quality assurance, and 0.10 for political polling and social sciences.

Type-I and Type-II Errors

Whenever we accept or reject a statistical hypothesis based on sample data, there is always some chance of making an incorrect decision. Accepting a true null hypothesis or rejecting a false null hypothesis leads to a correct decision, whereas accepting a false null hypothesis or rejecting a true null hypothesis leads to an incorrect decision. These two types of incorrect decisions are called type-I and type-II errors.
type-I error: Rejecting the null hypothesis ($H_0$) when it is true.
type-II error: Accepting (failing to reject) the null hypothesis when $H_1$ is true.
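The link between the level of significance and the type-I error can be checked with a small simulation. The sketch below repeatedly draws samples from a population in which $H_0$ is true and counts how often a two-tailed Z-test rejects it. The population values (`mu0 = 20`, `sigma = 2`), the sample size, the number of trials, and the seed are all hypothetical choices made for illustration.

```python
import random
from math import sqrt
from statistics import NormalDist

def type_one_error_rate(alpha=0.05, mu0=20.0, sigma=2.0, n=25,
                        trials=20000, seed=1):
    """Estimate how often a true H0: mu = mu0 is rejected (two-tailed Z-test)."""
    rng = random.Random(seed)
    # Critical value of the standard normal for a two-tailed test.
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    rejections = 0
    for _ in range(trials):
        # Draw a sample from a population where H0 is actually true.
        sample = [rng.gauss(mu0, sigma) for _ in range(n)]
        sample_mean = sum(sample) / n
        z = (sample_mean - mu0) / (sigma / sqrt(n))
        if abs(z) > z_crit:
            rejections += 1  # a type-I error: H0 is true but was rejected
    return rejections / trials

rate = type_one_error_rate()
print("estimated type-I error rate:", rate)
```

Under these assumptions the estimated rate should come out close to the chosen $\alpha$ of 0.05, which is what "the probability of making a type-I error" means in practice.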

iii) Test Statistics

The third step of Testing of Hypothesis is to choose a procedure that enables us to decide whether to accept or reject the hypothesis, or to determine whether observed samples differ significantly from expected results. Such procedures are called tests of hypothesis, tests of significance, or rules of decision. A test statistic is a value calculated from sample information and used to determine whether to reject the null hypothesis.

The test statistic for the mean $\mu$ when $\sigma$ is known is $Z= \frac{\overline{X}-\mu}{\frac{\sigma}{\sqrt{n} } }$, where the Z-value is based on the sampling distribution of $\overline{X}$, which follows the normal distribution with mean $\mu_{\overline{X}}$ equal to $\mu$ and standard deviation $\sigma_{\overline{X}}$ equal to $\frac{\sigma}{\sqrt{n}}$. Thus we determine whether the difference between $\overline{X}$ and $\mu$ is statistically significant by finding how many standard deviations (of the sampling distribution) $\overline{X}$ lies from $\mu$, using the Z statistic. Other test statistics are also available, such as $t$, $F$, and $\chi^2$.
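The Z formula above translates directly into code. The following is a minimal sketch; the function name `z_statistic` and the numbers in the usage line (a sample mean of 21 from 25 students, with $\sigma = 2$) are hypothetical and only loosely based on the M.Sc. age example.

```python
from math import sqrt

def z_statistic(sample_mean, mu0, sigma, n):
    """Z = (x_bar - mu0) / (sigma / sqrt(n)), for a known population sigma."""
    standard_error = sigma / sqrt(n)  # standard deviation of the sample mean
    return (sample_mean - mu0) / standard_error

# Hypothetical sample: mean age 21 from n = 25 students, sigma assumed to be 2.
z = z_statistic(sample_mean=21.0, mu0=20.0, sigma=2.0, n=25)
print("Z =", round(z, 4))
```

Here the sample mean lies about 2.5 standard errors above the hypothesized mean of 20.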

iv) Critical Region (Formulating Decision Rule)

It must be decided, before the sample is drawn, under what conditions the null hypothesis will be rejected. A dividing line must be drawn between “probable” and “improbable” sample values, given that the null hypothesis is a true statement. In other words, a decision rule must be formulated that states the specific conditions under which the null hypothesis should or should not be rejected. The dividing line defines the region of rejection: those values that are so large or so small that the probability of their occurrence under a true null hypothesis is rather remote. The set of possible values of the sample statistic that leads to rejecting the null hypothesis is called the critical region.


One-tailed and two-tailed tests of significance

In testing of hypothesis, if the rejection region is on only the left or only the right tail of the curve, the test is called a one-tailed test. This happens when the null hypothesis is tested against an alternative hypothesis of the “greater than” or “less than” type.

If the rejection region is on both the left and the right tails of the curve, the test is called a two-tailed test. This happens when the null hypothesis is tested against an alternative hypothesis of the “not equal to” type.
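For a Z-test, the boundaries of the rejection region can be read from the standard normal distribution. The sketch below shows how they differ between one-tailed and two-tailed tests; the function name `critical_values` and the `tail` labels are hypothetical, and the standard normal is assumed as the sampling distribution.

```python
from statistics import NormalDist

def critical_values(alpha, tail):
    """Return the Z critical value(s) for a 'left', 'right', or 'two' tailed test."""
    std_normal = NormalDist()  # mean 0, standard deviation 1
    if tail == "two":
        # alpha is split equally between the two tails.
        upper = std_normal.inv_cdf(1 - alpha / 2)
        return (-upper, upper)
    if tail == "right":
        # H1 of the "greater than" type: all of alpha in the right tail.
        return (std_normal.inv_cdf(1 - alpha),)
    if tail == "left":
        # H1 of the "less than" type: all of alpha in the left tail.
        return (std_normal.inv_cdf(alpha),)
    raise ValueError("tail must be 'left', 'right', or 'two'")

two_sided = critical_values(0.05, "two")
right_sided = critical_values(0.05, "right")
left_sided = critical_values(0.05, "left")
print(two_sided, right_sided, left_sided)
```

At the 5% level this gives boundaries of roughly ±1.96 for the two-tailed test and roughly 1.645 (or -1.645) for a one-tailed test, which is why a one-tailed test rejects at a less extreme value in its chosen direction.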

v) Making a Decision

In this last step of testing hypotheses, the computed value of the test statistic is compared with the critical value. If the test statistic falls within the rejection region, the null hypothesis is rejected; otherwise, it is accepted. Note that only one of two decisions is possible in hypothesis testing: either accept or reject the null hypothesis. Instead of “accepting” the null hypothesis ($H_0$), some researchers prefer to phrase the decision as “Do not reject $H_0$”, “We fail to reject $H_0$”, or “The sample results do not allow us to reject $H_0$”.
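The five steps can be put together in a short sketch for the two-tailed case $H_0:\mu = 20$ against $H_1:\mu \ne 20$. The function name, the sample means, $\sigma = 2$, and $n = 25$ are hypothetical values used only to show the decision rule at work.

```python
from math import sqrt
from statistics import NormalDist

def two_tailed_z_test(sample_mean, mu0, sigma, n, alpha=0.05):
    """Compare the Z statistic with the two-tailed critical value."""
    # Step iii: test statistic.
    z = (sample_mean - mu0) / (sigma / sqrt(n))
    # Step iv: critical region is |Z| > z_crit.
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    # Step v: decision.
    decision = "reject H0" if abs(z) > z_crit else "fail to reject H0"
    return z, z_crit, decision

# Hypothetical samples of n = 25 students with sigma assumed to be 2.
z_a, crit_a, decision_a = two_tailed_z_test(20.5, mu0=20.0, sigma=2.0, n=25)
z_b, crit_b, decision_b = two_tailed_z_test(21.0, mu0=20.0, sigma=2.0, n=25)
print(round(z_a, 2), decision_a)
print(round(z_b, 2), decision_b)
```

The first sample mean is too close to 20 to fall in the rejection region, while the second falls beyond the critical value and leads to rejecting $H_0$.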


Hypothesis Testing Frequently Asked Questions

  • What is a statistical hypothesis?
  • What is a null hypothesis?
  • What is an alternative hypothesis?
  • How are null and alternative hypotheses represented mathematically?
  • What is the level of significance (level of risk)?
  • What are type-I errors and type-II errors?
  • What is the test statistic for one sample?
  • What is the test statistic for two samples?
  • What is the critical region?
  • How is a decision made in hypothesis testing?
  • What is a simple and composite hypothesis?
  • What is the calculated test value?

The Deciles: Measure of Position Made Easy (2012)

The deciles are the nine values of the variable that divide an ordered (sorted, arranged) data set into ten equal parts, so that each part represents $\frac{1}{10}$ of the sample or population. They are denoted by $D_1, D_2, \cdots, D_9$. The first decile ($D_1$) is the value of the order statistics that exceeds $\frac{1}{10}$ of the observations and is less than the remaining $\frac{9}{10}$. The ninth decile ($D_9$) is the value that exceeds $\frac{9}{10}$ of the observations and is less than the remaining $\frac{1}{10}$. Note that the fifth decile is equal to the median. The deciles determine the values below which 10%, 20%, …, 90% of the data fall.

Calculating Deciles for Ungrouped Data

To calculate the decile for the ungrouped data, first order all observations according to the magnitudes of the values, then use the following formula for $m$th decile.

\[D_m= m \times \left( \frac{(n+1)}{10} \right) \mbox{th value; } \qquad \mbox{where } m=1,2,\cdots,9\]

Example: Calculate the 2nd and 8th deciles of the following ordered data 13, 13,13, 20, 26, 27, 31, 34, 34, 34, 35, 35, 36, 37, 38, 41, 41, 41, 45, 47, 47, 47, 50, 51, 53, 54, 56, 62, 67, 82.
Solution:

\begin{eqnarray*}
D_m &=& m \times \left\{\frac{(n+1)}{10} \right\} \mbox{th value}\\
D_2 &=& 2 \times \frac{30+1}{10}=6.2 \mbox{th value}
\end{eqnarray*}

We have to locate the sixth value in the ordered array and then move 0.2 of the distance between the sixth and seventh values, i.e. the value of the 2nd decile can be calculated as
\[6 \mbox{th observation} + \{7 \mbox{th observation} - 6 \mbox{th observation} \}\times 0.2\]
As the 6th observation is 27 and the 7th observation is 31, the second decile is $27+\{31-27\} \times 0.2 = 27.8$.

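Similarly, $D_8$ is the $8 \times \frac{30+1}{10} = 24.8$th value. As the 24th observation is 51 and the 25th observation is 53, $D_8 = 51+\{53-51\} \times 0.8 = 52.6$.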
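The ungrouped-data procedure can be sketched in a few lines of code. The function below follows the $m \times \frac{n+1}{10}$ position rule and the linear interpolation used in the example; the function name `decile` and the handling of positions that fall outside the data (clamping to the first or last value) are assumptions, since the text does not cover those cases.

```python
def decile(sorted_values, m):
    """Return the m-th decile (m = 1..9) of data already sorted in ascending order."""
    n = len(sorted_values)
    position = m * (n + 1) / 10       # 1-based position, possibly fractional
    whole = int(position)             # e.g. 6 for position 6.2
    fraction = position - whole       # e.g. 0.2 for position 6.2
    if whole < 1:                     # assumption: clamp below the first value
        return sorted_values[0]
    if whole >= n:                    # assumption: clamp above the last value
        return sorted_values[-1]
    below = sorted_values[whole - 1]  # the whole-th observation
    above = sorted_values[whole]      # the next observation
    return below + (above - below) * fraction

data = [13, 13, 13, 20, 26, 27, 31, 34, 34, 34, 35, 35, 36, 37, 38,
        41, 41, 41, 45, 47, 47, 47, 50, 51, 53, 54, 56, 62, 67, 82]

d2 = decile(data, 2)
d5 = decile(data, 5)
d8 = decile(data, 8)
print(round(d2, 1), round(d5, 1), round(d8, 1))
```

For the example data this reproduces $D_2 = 27.8$ and $D_8 = 52.6$, and the fifth decile coincides with the median. Python's `statistics.quantiles(data, n=10)` with its default method appears to use the same position rule, so it should give matching values.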

Calculating Deciles for Grouped Data

The $m$th decile for grouped data (with classes in ascending order) can be calculated with the following formula.

\[D_m=l+\frac{h}{f}\left(\frac{m.n}{10}-c\right)\]

where

$l$ = the lower class boundary of the class containing the $m$th decile
$h$ = the width of the class containing the $m$th decile
$f$ = the frequency of the class containing the $m$th decile
$n$ = the total number of observations (the sum of the frequencies)
$c$ = the cumulative frequency of the class preceding the class containing the $m$th decile

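Example: Calculate the first and seventh deciles of the following grouped data.

[Frequency table not recovered from the source. The solution below uses $n = 30$ and class width $h = 5$; the first class has lower boundary 85.5 and frequency 6, and the fourth class (100.5 to 105.5) has frequency 6 with a preceding cumulative frequency of 20.]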

Solution: The decile class for $D_1$ is located from $\frac{m.n}{10} = \frac{1 \times 30}{10} = 3$, i.e. the 3rd observation. As the 3rd observation lies in the first class (first group),

\begin{eqnarray*}
D_m&=&l+\frac{h}{f}\left(\frac{m.n}{10}-c\right)\\
D_1&=&85.5+\frac{5}{6}\left(\frac{1\times30}{10}-0\right)\\
&=&88\\
\end{eqnarray*}

The decile class for $D_7$ is 100.5 to 105.5, as $\frac{7 \times 30}{10}=21$, i.e. the 21st observation, which lies in the fourth class (group).
\begin{eqnarray*}
D_m&=&l+\frac{h}{f}\left(\frac{m.n}{10}-c\right)\\
D_7&=&100.5+\frac{5}{6}\left(\frac{7\times30}{10}-20\right)\\
&=&101.333\\
\end{eqnarray*}
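The grouped-data formula can be checked numerically with a minimal sketch. The function name `grouped_decile` and its parameter names are hypothetical; the inputs are the values of $l$, $h$, $f$, $n$, and $c$ used in the worked solution above, and the decile class is assumed to have been identified already.

```python
def grouped_decile(m, n, lower_boundary, width, frequency, cum_freq_before):
    """D_m = l + (h / f) * (m * n / 10 - c) for the class containing the m-th decile."""
    return lower_boundary + (width / frequency) * (m * n / 10 - cum_freq_before)

# First decile: first class, l = 85.5, h = 5, f = 6, c = 0.
d1 = grouped_decile(m=1, n=30, lower_boundary=85.5, width=5,
                    frequency=6, cum_freq_before=0)

# Seventh decile: fourth class, l = 100.5, h = 5, f = 6, c = 20.
d7 = grouped_decile(m=7, n=30, lower_boundary=100.5, width=5,
                    frequency=6, cum_freq_before=20)

print(round(d1, 3), round(d7, 3))
```

The results agree with the hand calculation: $D_1 = 88$ and $D_7 \approx 101.333$.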
