# Basic Statistics and Data Analysis

## Introduction

The objective of testing a statistical hypothesis is to determine whether an assumption about some characteristic (parameter) of a population is supported by the information obtained from a sample.

The terms hypothesis testing and testing of hypotheses are used interchangeably. A statistical hypothesis is a statement about a characteristic of one or more populations, such as the population mean. This statement may or may not be true; its validity is checked on the basis of information obtained by sampling from the population.
Hypothesis testing refers to the formal procedures used by statisticians to accept or reject statistical hypotheses. These procedures include the following steps:

## i) Formulation of Null and Alternative Hypothesis

### Null hypothesis

A hypothesis formulated for the sole purpose of rejecting or nullifying it is called the null hypothesis, usually denoted by H0. There is usually a “not” or a “no” term in the null hypothesis, meaning that there is “no change”.

For example, the null hypothesis might be that the mean age of M.Sc. students is 20 years. Statistically, it can be written as H0: μ = 20. Generally speaking, the null hypothesis is developed for the purpose of testing.
It should be emphasized that if the null hypothesis is not rejected on the basis of the sample data, we cannot say that the null hypothesis is true. In other words, failing to reject the null hypothesis does not prove that H0 is true; it means only that we have failed to disprove H0.

For the null hypothesis, we usually state that “there is no significant difference between A and B” or “the mean tensile strength of copper wire is not significantly different from some standard”.

### Alternative Hypothesis

Any hypothesis different from the null hypothesis is called an alternative hypothesis, denoted by H1. It can also be described as the statement that is accepted if the sample data provide sufficient evidence that the null hypothesis is false. The alternative hypothesis is also referred to as the research hypothesis.

It is important to remember that, no matter how the problem is stated, the null hypothesis will always contain the equal sign, and the equal sign will never appear in the alternative hypothesis. This is because the null hypothesis is the statement being tested, and we need a specific value to include in our calculations. The alternative hypothesis for the example given above is H1: μ ≠ 20.

### Simple and Composite Hypothesis

If a statistical hypothesis completely specifies the form of the distribution as well as the values of all parameters, it is called a simple hypothesis. For example, suppose the age distribution of first-year college students follows N(16, 25); the null hypothesis H0: μ = 16 is then a simple hypothesis. If a statistical hypothesis does not completely specify the form of the distribution, it is called a composite hypothesis, for example H1: μ < 16 or H1: μ > 16.

## ii) Level of Significance

The level of significance (significance level) is denoted by the Greek letter alpha (α). It is also called the level of risk, as it is the risk you take of rejecting the null hypothesis when it is really true. The level of significance is defined as the probability of making a type-I error; it is the maximum probability with which we would be willing to risk a type-I error. It is usually specified before any sample is drawn so that the results obtained will not influence our choice.

In practice, the 10% (0.10), 5% (0.05), and 1% (0.01) levels of significance are used in testing a given hypothesis. A 5% level of significance means that there are about 5 chances in 100 that we would reject a true hypothesis, i.e., we are 95% confident that we have made the right decision. A hypothesis rejected at the 0.05 level of significance may have been rejected wrongly with probability 0.05.
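As a quick numeric illustration (a sketch in Python using scipy, which is not part of the original SPSS-based discussion), the two-tailed critical Z-values corresponding to these conventional levels can be computed directly:

```python
from scipy.stats import norm

# Two-tailed critical Z-values for the conventional significance levels:
# reject H0 when |Z| exceeds the critical value.
for alpha in (0.10, 0.05, 0.01):
    z_crit = norm.ppf(1 - alpha / 2)  # upper-tail critical value
    print(f"alpha = {alpha}: reject H0 if |Z| > {z_crit:.3f}")
```

This reproduces the familiar table values 1.645, 1.960, and 2.576 for the 0.10, 0.05, and 0.01 levels.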

### Selection of Level of Significance

The selection of the level of significance depends on the field of study. Traditionally, the 0.05 level is selected for business and science related problems, 0.01 for quality assurance, and 0.10 for political polling and the social sciences.

### Type-I and Type-II Errors

Whenever we accept or reject a statistical hypothesis on the basis of sample data, there is always some chance of making an incorrect decision. Accepting a true null hypothesis or rejecting a false null hypothesis leads to a correct decision; accepting a false null hypothesis or rejecting a true one leads to an incorrect decision. These two kinds of mistakes are called type-I and type-II errors.

• Type-I error: rejecting the null hypothesis (H0) when it is true.
• Type-II error: accepting the null hypothesis when H1 is true.
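The meaning of a type-I error can be made concrete with a small simulation (a Python sketch, not part of the original text): if we repeatedly draw samples from a population where H0 is actually true and test at α = 0.05, we should wrongly reject H0 in roughly 5% of the trials.

```python
import numpy as np

rng = np.random.default_rng(42)
alpha, n, trials = 0.05, 30, 10_000
z_crit = 1.96  # two-tailed critical value for alpha = 0.05

# H0: mu = 0 is true here by construction (sigma = 1 known),
# so every rejection below is a type-I error.
rejections = 0
for _ in range(trials):
    sample = rng.normal(loc=0.0, scale=1.0, size=n)
    z = sample.mean() / (1.0 / np.sqrt(n))
    if abs(z) > z_crit:
        rejections += 1

print(f"empirical type-I error rate: {rejections / trials:.3f}")
```

The empirical rejection rate comes out close to 0.05, matching the chosen level of significance.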

## iii) Test Statistics

Procedures which enable us to decide whether to accept or reject a hypothesis, or to determine whether observed samples differ significantly from expected results, are called tests of hypotheses, tests of significance, or rules of decision. We can also say that a test statistic is a value calculated from sample information, used to determine whether to reject the null hypothesis. The test statistic for the mean $\mu$ when $\sigma$ is known is $Z= \frac{\bar{X}-\mu}{\sigma/\sqrt{n}}$, where the Z-value is based on the sampling distribution of $\bar{X}$, which follows the normal distribution with mean $\mu_{\bar{X}}$ equal to $\mu$ and standard deviation $\sigma_{\bar{X}}$ equal to $\sigma/\sqrt{n}$. Thus we determine whether the difference between $\bar{X}$ and $\mu$ is statistically significant by finding the number of standard deviations $\bar{X}$ is from $\mu$ using the Z statistic. Other test statistics are also available, such as t, F, $\chi^2$, etc.
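The Z statistic is straightforward to compute by hand or in code. The sample figures below (n = 36, a sample mean of 21.2 years, σ = 3) are hypothetical, chosen only to illustrate the earlier H0: μ = 20 example:

```python
import math

# Hypothetical data for the H0: mu = 20 example: n = 36 students,
# sample mean 21.2 years, known population sigma = 3 years.
x_bar, mu0, sigma, n = 21.2, 20.0, 3.0, 36

z = (x_bar - mu0) / (sigma / math.sqrt(n))
print(f"Z = {z:.2f}")  # (21.2 - 20) / (3 / 6) = 2.40
```

The sample mean lies 2.40 standard errors above the hypothesized mean; whether that counts as significant depends on the critical region chosen in the next step.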

## iv) Critical Region (Formulating Decision Rule)

It must be decided, before the sample is drawn, under what conditions the null hypothesis will be rejected. A dividing line must be drawn defining “probable” and “improbable” sample values given that the null hypothesis is a true statement. In other words, a decision rule must be formulated, with specific conditions under which the null hypothesis should or should not be rejected. The dividing line defines the region of rejection: those values so large or so small that their probability of occurrence under the null hypothesis is rather remote. The set of possible values of the sample statistic that leads to rejecting the null hypothesis is called the critical region.

### One tailed and two tailed tests of significance

If the rejection region is on the left or the right tail of the curve, the test is called one-tailed. This happens when the null hypothesis is tested against an alternative hypothesis of the “greater than” or “less than” type.

If the rejection region is on both the left and the right tails of the curve, the test is called two-tailed. This happens when the null hypothesis is tested against an alternative hypothesis of the “not equal to” type.
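The distinction shows up directly in how the p-value is computed from the test statistic. A sketch in Python (the Z-value of 2.40 is a hypothetical input):

```python
from scipy.stats import norm

z = 2.40  # hypothetical computed Z statistic

p_two_tailed = 2 * norm.sf(abs(z))  # H1: mu != mu0, both tails
p_right = norm.sf(z)                # H1: mu > mu0, right tail only
p_left = norm.cdf(z)                # H1: mu < mu0, left tail only

print(f"two-tailed p = {p_two_tailed:.4f}")
print(f"right-tailed p = {p_right:.4f}")
```

For the same statistic, the two-tailed p-value is twice the one-tailed upper-tail p-value, which is why a result can be significant one-tailed but not two-tailed.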

## v) Making a Decision

In this step, the computed value of the test statistic is compared with the critical value. If the sample statistic falls within the rejection region, the null hypothesis is rejected; otherwise it is accepted. Note that only one of two decisions is possible in hypothesis testing: either accept or reject the null hypothesis. Instead of “accepting” the null hypothesis (H0), some researchers prefer to phrase the decision as “do not reject H0”, “we fail to reject H0”, or “the sample results do not allow us to reject H0“.

# Testing of Hypothesis

The researcher is similar to a prosecuting attorney in the sense that the researcher brings the null hypothesis “to trial” when she believes there is probably strong evidence against it.

• Just as the prosecutor usually believes that the person on trial is not innocent, the researcher usually believes that the null hypothesis is not true.
• In the court system the jury must assume (by law) that the person is innocent until the evidence clearly calls this assumption into question; analogously, in hypothesis testing the researcher must assume (in order to use hypothesis testing) that the null hypothesis is true until the evidence calls this assumption into question.

# Specifying the Null and Alternative Hypothesis

## 1) The t-test for independent samples, 2) One-way analysis of variance, 3) The t-test for a correlation coefficient, 4) The t-test for a regression coefficient


In each of these, the null hypothesis says there is no relationship and the alternative hypothesis says that there is a relationship.

1. In this case the null hypothesis says that the two population means (i.e., $\mu_1$ and  $\mu_2$) are equal; the alternative hypothesis says that they are not equal.
2. In this case the null hypothesis says that all of the population means are equal; the alternative hypothesis says that at least two of the means are not equal.
3. In this case the null hypothesis says that the population correlation (i.e., $\rho$) is zero; the alternative hypothesis says that it is not equal to zero.
4. In this case the null hypothesis says that the population regression coefficient ($\beta$) is zero, and the alternative says that it is not equal to zero.
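As an illustration of case 3 (the data below are hypothetical, and the sketch uses Python's scipy rather than anything from the original text), the two-tailed test of H0: $\rho = 0$ is reported directly by `pearsonr`:

```python
from scipy.stats import pearsonr

# Hypothetical paired measurements with a strong linear relationship.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 2.9, 3.8, 5.2, 5.9, 7.1]

r, p = pearsonr(x, y)  # p tests H0: rho = 0 (two-tailed)
print(f"r = {r:.3f}, p = {p:.4f}")
```

Here the sample correlation is near 1 and the p-value is small, so H0: $\rho = 0$ would be rejected at the usual levels.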

## Introduction

A t-test for independent groups is useful when the same variable has been measured in two independent groups and the researcher wants to know whether the difference between group means is statistically significant. “Independent groups” means that the groups have different people in them and that the people in the different groups have not been matched or paired in any way.

## Objectives

The independent t-test compares the means of two unrelated/independent groups measured on an interval or ratio scale. The SPSS t-test procedure allows testing of the hypothesis both when variances are assumed to be equal and when they are not, and provides the t-value under both assumptions. The procedure also provides the relevant descriptive statistics for both groups.

## Assumptions

• The variable can be classified into two groups independent of each other.
• The variable is measured on an interval or ratio scale.
• The measured variable is approximately normally distributed.
• Both groups have similar variances (variances are homogeneous).

## Data

Suppose a researcher wants to discover whether left- and right-handed telephone operators differ in the time it takes them to answer calls. The following reaction-time data were obtained (RTs measured in seconds):

| Subject no. | RTs (Left) | Subject no. | RTs (Right) |
|---|---|---|---|
| 1 | 500 | 11 | 392 |
| 2 | 513 | 12 | 445 |
| 3 | 300 | 13 | 271 |
| 4 | 561 | 14 | 523 |
| 5 | 483 | 15 | 421 |
| 6 | 502 | 16 | 489 |
| 7 | 539 | 17 | 501 |
| 8 | 467 | 18 | 388 |
| 9 | 420 | 19 | 411 |
| 10 | 480 | 20 | 467 |
| Mean | 476.5 | | 430.8 |
| Variance Ŝ² | 5341.167 | | 5298.84 |

The mean reaction times suggest that the left-handers were slower, but does a t-test confirm this?
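The same test can also be run outside SPSS. A sketch using Python's scipy, with `equal_var=True` corresponding to the pooled-variance (“equal variances assumed”) form of the test:

```python
from scipy import stats

left = [500, 513, 300, 561, 483, 502, 539, 467, 420, 480]
right = [392, 445, 271, 523, 421, 489, 501, 388, 411, 467]

# Pooled (equal-variances) independent-samples t-test.
t, p = stats.ttest_ind(left, right, equal_var=True)
df = len(left) + len(right) - 2
print(f"t = {t:.3f}, df = {df}, p = {p:.3f}")
```

This should reproduce the t-value, degrees of freedom, and two-tailed p-value that SPSS reports for these data in the steps below.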

## Independent Sample t Test using SPSS

Perform the following steps after running SPSS and entering the data set in the SPSS Data View:

1. Click Analyze > Compare Means > Independent-Samples T Test… on the top menu as shown below.

Menu option for independent sample t test

2. Select the continuous variable(s) that you want to test from the list.

Dialog box for independent sample t test

3. Click on the arrow to send the variable to the “Test Variable(s)” box. You can also double-click the variable to send it to the “Test Variable(s)” box.
4. Select the categorical/grouping variable so that a group comparison can be made, and send it to the “Grouping Variable” box.
5. Click on the “Define Groups” button. A small dialog box will appear asking for the codes used in the Variable View for the groups. We used 1 for left-handed and 2 for right-handed operators. Click the Continue button when you’re done, then click OK when you’re ready to get the output. See the pictures for a visual view.

Define Group for Independent sample t test

## Output

Independent sample t test output

The first table in the output gives descriptive statistics for your variables: the number of observations, mean, variance, and standard error are shown for both groups (left- and right-handed).

The second table in the output is the important one for testing the hypothesis. You will see that there are two t-tests, and you have to know which one to use. When comparing groups having approximately similar variances, use the first t-test; Levene’s test checks for this. If the significance for Levene’s test is 0.05 or below, then the “Equal variances not assumed” test (the second row) should be used; otherwise use the “Equal variances assumed” test (the first row). Here the significance is 0.287, so we will be using the “Equal variances assumed” first row of the second table.
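For reference, Levene's test can also be reproduced in Python with scipy (a sketch, not part of the SPSS procedure). SPSS bases the test on absolute deviations from the group means, which corresponds to `center='mean'` in scipy (whose default is `'median'`):

```python
from scipy import stats

left = [500, 513, 300, 561, 483, 502, 539, 467, 420, 480]
right = [392, 445, 271, 523, 421, 489, 501, 388, 411, 467]

# center='mean' matches the mean-based form of Levene's test used by SPSS.
stat, p = stats.levene(left, right, center='mean')
print(f"Levene W = {stat:.3f}, p = {p:.3f}")
```

With these data the p-value comes out well above 0.05, so the equal-variances assumption is not rejected, consistent with the SPSS output.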

In the output table, “t” is the calculated t-value from the test statistic; in this example the t-value is 1.401.

“df” stands for degrees of freedom; in this example we have 18 degrees of freedom.

“Sig. (2-tailed)” is the two-tailed significance value (p-value); in this example the sig. value is greater than 0.05 (the significance level).

## Decision

As the p-value 0.178 is greater than our 0.05 significance level, we fail to reject the null hypothesis (two-tailed case).

As the p-value 0.089 is greater than our 0.05 significance level, we fail to reject the null hypothesis (one-tailed case with 0.05 significance level).

As the p-value 0.089 is smaller than our 0.10 significance level, we reject the null hypothesis and accept the alternative hypothesis (one-tailed case with 0.10 significance level). In this case, it means that left-handers have a slower reaction time on average than right-handers.