## What is Standard Error of Sampling? (2012)

The standard error (SE) of a statistic is the standard deviation of the sampling distribution of that statistic. It reflects how much a statistic fluctuates from sample to sample. The inferential statistics involved in constructing confidence intervals and significance testing are based on standard errors. Increasing the sample size decreases the standard error.

In practical applications, the true value of the standard deviation of the error is unknown. As a result, the term standard error is often used to refer to an estimate of this unknown quantity.

The size of the SE is affected by two values.

1. The standard deviation of the population affects the standard error. The larger the population’s standard deviation ($\sigma$), the larger the SE, since $SE=\frac {\sigma}{\sqrt{n}}$. If the population is homogeneous (which results in a small population standard deviation), the SE will also be small.
2. The standard error is affected by the number of observations in a sample. A large sample results in a small SE of the estimate, which indicates less variability in the sample means.

#### Application of Standard Error of Sampling

The SEs are used in different statistical procedures, such as

• to measure the variability of the distribution of the sample means
• to build confidence intervals for means, proportions, differences between means, etc., for cases when the population standard deviation is known or unknown
• to determine the sample size
• in control charts, to set control limits for means
• in comparison tests such as the z-test, t-test, and Analysis of Variance
• in relationship tests such as Correlation and Regression Analysis (standard error of regression), etc.

#### (1) Standard Error Formula for Means

The SE of the mean, i.e., the standard deviation of the sampling distribution of the mean, measures the variation in the sampling distribution of the sample mean. It is denoted by $\sigma_{\bar{x}}$ and calculated as a function of the standard deviation of the population and the size of the sample, i.e.,

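$\sigma_{\bar{x}}=\frac{\sigma}{\sqrt{n}}$                      (used when the population is infinite, or when sampling is done with replacement)

If the population is finite with size $N$ and sampling is done without replacement, then ${\sigma_{\bar{x}}=\frac{\sigma}{\sqrt{n}} \times \sqrt{\frac{N-n}{N}}}$. The factor $\sqrt{\frac{N-n}{N}}$ is the finite population correction (often written in the exact form $\sqrt{\frac{N-n}{N-1}}$). It tends towards 1 as $N$ tends to infinity, which is why it is dropped for an infinite population.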

When the population’s standard deviation ($\sigma$) is unknown, we estimate it by the sample standard deviation $S$. In this case, the estimated SE is $\hat{\sigma}_{\bar{x}}=\frac{S}{\sqrt{n}}$.
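The formulas for the SE of the mean can be sketched in Python as follows. This is only an illustrative sketch: the function name `standard_error_mean`, its parameters, and the sample data are assumptions made for this example, and the finite population correction is applied in the $\sqrt{(N-n)/N}$ form used in this article.

```python
import math
import statistics


def standard_error_mean(sd, n, population_size=None):
    """Standard error of the sample mean: sd / sqrt(n).

    If population_size (N) is given, the finite population correction
    sqrt((N - n) / N) is applied (sampling without replacement).
    """
    se = sd / math.sqrt(n)
    if population_size is not None:
        se *= math.sqrt((population_size - n) / population_size)
    return se


# Hypothetical sample, used only for illustration.
sample = [12, 15, 11, 14, 13, 16, 12, 15]
s = statistics.stdev(sample)  # sample standard deviation S (divisor n - 1)

se_infinite = standard_error_mean(s, len(sample))
se_finite = standard_error_mean(s, len(sample), population_size=40)
print(se_infinite, se_finite)
```

The corrected value is always smaller than the uncorrected one, because the correction factor is below 1 whenever $n \ge 1$.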

#### (2) Standard Error Formula for Proportion

The SE for a proportion is calculated in the same manner as the standard error of the mean. It is denoted by $\sigma_p$.

In the case of a finite population, $\sigma_p=\frac{\sigma}{\sqrt{n}}\sqrt{\frac{N-n}{N}}$.
In the case of an infinite population, $\sigma_p=\frac{\sigma}{\sqrt{n}}$.

In both cases $\sigma=\sqrt{p(1-p)}=\sqrt{pq}$, where $p$ is the probability that an element possesses the studied trait and $q=1-p$ is the probability that it does not.
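A minimal Python sketch of the SE of a proportion, under the same assumptions as the formulas above. The function name and arguments are hypothetical and chosen only for illustration.

```python
import math


def standard_error_proportion(p, n, population_size=None):
    """SE of a sample proportion: sqrt(p * q / n), with q = 1 - p.

    If population_size (N) is given, the finite population correction
    sqrt((N - n) / N) is applied.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie between 0 and 1")
    se = math.sqrt(p * (1.0 - p) / n)
    if population_size is not None:
        se *= math.sqrt((population_size - n) / population_size)
    return se


# Illustrative values: p = 0.5 and n = 100.
print(standard_error_proportion(0.5, 100))
print(standard_error_proportion(0.5, 100, population_size=400))
```

Since $pq$ is largest at $p=0.5$, that value gives the largest SE for a given sample size.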

#### (3) Standard Error Formula for Difference Between Means

The SE for the difference between two independent quantities is the square root of the sum of the squared standard errors of both quantities, i.e., $\sigma_{\bar{x}_1-\bar{x}_2}=\sqrt{\frac{\sigma_1^2}{n_1}+\frac{\sigma_2^2}{n_2}}$, where $\sigma_1^2$ and $\sigma_2^2$ are the respective variances of the two independent populations to be compared, and $n_1$ and $n_2$ are the respective sizes of the two samples drawn from their respective populations.

Unknown Population Variances
Suppose the variances of the two populations are unknown. In that case, we estimate them from the two samples, i.e., $\hat{\sigma}_{\bar{x}_1-\bar{x}_2}=\sqrt{\frac{S_1^2}{n_1}+\frac{S_2^2}{n_2}}$, where $S_1^2$ and $S_2^2$ are the respective variances of the two samples drawn from their respective populations.

Equal Variances are assumed
When the variances of the two populations are assumed to be equal, we can estimate their common value with a pooled variance $S_p^2$, calculated as a function of $S_1^2$ and $S_2^2$, i.e.,

$S_p^2=\frac{(n_1-1)S_1^2+(n_2-1)S_2^2}{n_1+n_2-2}$
$\hat{\sigma}_{\bar{x}_1-\bar{x}_2}=S_p \sqrt{\frac{1}{n_1}+\frac{1}{n_2}}$
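The unpooled and pooled versions of the SE of a difference between means can be compared with a short Python sketch. The function names are hypothetical, and the inputs are variances (not standard deviations).

```python
import math


def se_diff_means(var1, n1, var2, n2):
    """SE of (mean1 - mean2) using separate variances."""
    return math.sqrt(var1 / n1 + var2 / n2)


def pooled_variance(var1, n1, var2, n2):
    """Pooled variance S_p^2, assuming equal population variances."""
    return ((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2)


def se_diff_means_pooled(var1, n1, var2, n2):
    """SE of (mean1 - mean2) using the pooled variance."""
    sp = math.sqrt(pooled_variance(var1, n1, var2, n2))
    return sp * math.sqrt(1.0 / n1 + 1.0 / n2)


# Illustrative sample variances and sizes.
print(se_diff_means(16.0, 10, 4.0, 10))
print(se_diff_means_pooled(16.0, 10, 4.0, 10))
```

When the two sample sizes are equal, both versions give the same value; they differ only when $n_1 \ne n_2$.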

#### (4) Standard Error for Difference between Proportions

The SE of the difference between two proportions is calculated in the same way as the SE of the difference between means is calculated i.e.
\begin{eqnarray*}
\sigma_{p_1-p_2}&=&\sqrt{\sigma_{p_1}^2+\sigma_{p_2}^2}\\
&=& \sqrt{\frac{p_1(1-p_1)}{n_1}+\frac{p_2(1-p_2)}{n_2}}
\end{eqnarray*}
where $p_1$ and $p_2$ are the proportions (for infinite populations) calculated from the two samples of sizes $n_1$ and $n_2$.
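As a quick numerical check of this formula, here is a small Python sketch. The function name and the example proportions are assumptions made only for illustration.

```python
import math


def se_diff_proportions(p1, n1, p2, n2):
    """SE of (p1 - p2) for two independent samples (infinite populations)."""
    return math.sqrt(p1 * (1.0 - p1) / n1 + p2 * (1.0 - p2) / n2)


# Illustrative values: two samples of 100 with p1 = 0.6 and p2 = 0.5.
print(se_diff_proportions(0.6, 100, 0.5, 100))
```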

1. Define the Standard Error of Mean.
2. Standard Error is affected by which two values?
3. Write the formula of the standard error of mean, proportion, and difference between means.
4. What is the application of standard error of mean in Sampling?
5. Discuss the importance of the standard error.


## Multivariate Analysis (2012)

The term multivariate analysis covers all statistical methods in which more than two variables are analyzed simultaneously.

Much of multivariate analysis is based upon an underlying probability model known as the Multivariate Normal Distribution (MND). Multivariate methods lend themselves most naturally to scientific investigations with the objectives described in the next section.

### Objectives of Multivariate Analysis

The following are some basic objectives of multivariate analysis.

• Data reduction or structural simplification
The phenomenon being studied is represented as simply as possible without sacrificing valuable information. It is hoped that this will make interpretation easier.
• Sorting and Grouping
Groups of similar objects or variables are created, based on measured characteristics. Alternatively, rules for classifying objects into well-defined groups may be required.
• Investigation of the dependence among variables
The nature of the relationships among variables is of interest. Are all the variables mutually independent, or are one or more variables dependent on the others?
• Prediction
Relationships between variables must be determined for predicting the values of one or more variables based on observation of the other variables.
• Hypothesis Construction and testing
Specific statistical hypotheses, formulated in terms of the parameter of the multivariate population, are tested. This may be done to validate assumptions or to reinforce prior convictions.

Applications: Multivariate analysis is used in various fields:

• Social sciences (understanding factors influencing voting behavior)
• Business (analyzing customer demographics and purchase patterns)
• Finance (evaluating risk factors in investment portfolios)
• Natural sciences (studying the relationships between different environmental variables)

### Multivariate Data Sets

We are concerned with analyzing measurements made on several variables or characteristics. These measurements (data) must frequently be arranged and displayed in various ways (graphs, tabular form, etc.). The preliminary concepts underlying these first steps of data organization are described below.

#### Array

Multivariate data arise whenever an investigator, seeking to understand a social or physical phenomenon, selects a number $p\ge 1$ of variables or characteristics to record. The values of these variables are all recorded for each distinct item, individual, or experimental unit.

The notation $x_{jk}$ is used to indicate the particular value of the $k$th variable that is observed on the $j$th item or trial, i.e., $x_{jk}$ is the measurement of the $k$th variable on the $j$th item. So, $n$ measurements on $p$ variables can be displayed as

$\begin{array}{ccccccc} & V_1 & V_2 & \dots & V_k & \dots & V_p \\ \text{Item 1} & x_{11} & x_{12} & \dots & x_{1k} & \dots & x_{1p} \\ \text{Item 2} & x_{21} & x_{22} & \dots & x_{2k} & \dots & x_{2p} \\ \vdots & \vdots & \vdots & & \vdots & & \vdots \\ \text{Item } j & x_{j1} & x_{j2} & \dots & x_{jk} & \dots & x_{jp} \\ \vdots & \vdots & \vdots & & \vdots & & \vdots \\ \text{Item } n & x_{n1} & x_{n2} & \dots & x_{nk} & \dots & x_{np} \\ \end{array}$

These data can be displayed as a rectangular array $X$ of $n$ rows and $p$ columns

$X=\begin{pmatrix} x_{11} & x_{12} & \dots & x_{1k} & \dots & x_{1p} \\ x_{21} & x_{22} & \dots & x_{2k} & \dots & x_{2p} \\ \vdots & \vdots & & \vdots & & \vdots \\ x_{j1} & x_{j2} & \dots & x_{jk} & \dots & x_{jp} \\ \vdots & \vdots & & \vdots & & \vdots \\ x_{n1} & x_{n2} & \dots & x_{nk} & \dots & x_{np} \end{pmatrix}$

This $X$ array contains the data consisting of all of the observations on all of the variables.

Example: suppose we have data for the number of books sold and the total amount of each sale.

Variable 1 (Sales in Dollars)
$\begin{array}{ccccc} \text{Data Values:} & 42 & 52 & 48 & 63 \\ \text{Notation:} & x_{11} & x_{21} & x_{31} & x_{41} \end{array}$

Variable 2 (Number of Books sold)
$\begin{array}{ccccc} \text{Data Values:} & 4 & 2 & 8 & 3 \\ \text{Notation:} & x_{12} & x_{22} & x_{32} & x_{42} \end{array}$

The information available in multivariate data sets can be assessed by calculating certain summary numbers, known as descriptive statistics. Examples are the arithmetic (sample) mean, which is a measure of location, and the average of the squares of the distances of all of the numbers from the mean, which is a measure of spread or variation.
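The book-sales example can be arranged as an $n \times p$ array and summarized with a few lines of Python (standard library only). This is a sketch: the variable names are illustrative, and the variance uses the divisor $n$ because the text describes it as the average of the squared distances from the mean.

```python
import statistics

# Data from the example: 4 items (sales), 2 variables.
sales_dollars = [42, 52, 48, 63]  # variable 1
books_sold = [4, 2, 8, 3]         # variable 2

# Data array X with n rows (items) and p columns (variables);
# X[j - 1][k - 1] corresponds to x_jk in the notation above.
X = [list(row) for row in zip(sales_dollars, books_sold)]
n, p = len(X), len(X[0])

# Column-wise summaries: one mean and one variance per variable.
columns = list(zip(*X))
means = [statistics.mean(col) for col in columns]
variances = [statistics.pvariance(col) for col in columns]

print(n, p)
print(means)
print(variances)
```

Here `X[2][1]` holds the value 8, i.e. $x_{32}$, the number of books sold in the third sale.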


## Introduction to Measure of Dispersion

A measure of location (an average or measure of central tendency) is not sufficient to describe the characteristics of a distribution, because two or more distributions may have exactly the same average even though they differ in other respects. A measure of central tendency only represents the typical value of the data set. To give a sensible description of data, we also need a numerical quantity, called a measure of dispersion (variability or scatter), that describes the spread of the values in a set of data. There are two types of measures of dispersion or variability:

1. Absolute Measures
2. Relative Measures

A measure of central tendency together with a measure of dispersion gives a more adequate description of data than a measure of location alone. An average only describes the balancing point of the data set; it does not provide any information about the degree to which the data tend to spread or scatter about the average value. So, a measure of dispersion indicates how well the measure of central tendency represents the data. The smaller the variability of a given data set, the better the average represents it.

### Absolute Measure of Dispersion

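Absolute measures are defined in such a way that they have the same units (meters, grams, etc.) as the original measurements. Because they depend on the units, absolute measures cannot be used directly to compare the variation/spread of two or more data sets that are measured in different units or have very different averages.
The most common absolute measures of variability are:

• Range
• Quartile Deviation
• Mean Deviation
• Variance (in squared units) and Standard Deviation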

### Relative Measures of Dispersion

The relative measures have no units as these are ratios, coefficients, or percentages. Relative measures are independent of units of measurement and are useful for comparing data of different natures.

• Coefficient of Variation
• Coefficient of Mean Deviation
• Coefficient of Quartile Deviation
• Coefficient of Standard Deviation

Different terms are used for dispersion, such as variability, spread, scatter, measure of uncertainty, and deviation.
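The difference between an absolute and a relative measure can be illustrated with a short Python sketch. The two data sets below are hypothetical, and the coefficient of variation is computed in its usual form, the standard deviation divided by the mean and expressed as a percentage.

```python
import statistics


def coefficient_of_variation(data):
    """Relative measure: (standard deviation / mean) * 100, unit-free."""
    return statistics.stdev(data) / statistics.mean(data) * 100.0


# Hypothetical measurements in different units.
heights_cm = [160, 165, 170, 175, 180]
weights_kg = [55, 60, 70, 80, 85]

# Absolute measures keep the units of the data (cm and kg),
# so these two numbers are not directly comparable.
print(statistics.stdev(heights_cm), statistics.stdev(weights_kg))

# Relative measures are unit-free percentages and can be compared.
print(coefficient_of_variation(heights_cm), coefficient_of_variation(weights_kg))
```

Changing the units (for example, centimeters to meters) changes the standard deviation but leaves the coefficient of variation unchanged.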


## Introduction

The objective of testing hypotheses (Testing of Statistical Hypothesis) is to determine if an assumption about some characteristic (parameter) of a population is supported by the information obtained from the sample.

### Testing of Hypothesis

The terms hypothesis testing or testing of the hypothesis are used interchangeably. A statistical hypothesis (different from a simple hypothesis) is a statement about a characteristic of one or more populations such as the population mean. This statement may or may not be true. The validity of the statement is checked based on information obtained by sampling from the population.
Testing of Hypothesis refers to the formal procedures used by statisticians to accept or reject statistical hypotheses. These procedures include the following steps:

### i) Formulation of Null and Alternative Hypothesis

#### Null hypothesis

A hypothesis formulated for the sole purpose of rejecting or nullifying it is called the null hypothesis, usually denoted by $H_0$. There is usually a “not” or a “no” term in the null hypothesis, meaning that there is “no change”.

For example, the null hypothesis is that the mean age of M.Sc. students is 20 years. Statistically, it can be written as $H_0:\mu = 20$. Generally speaking, the null hypothesis is developed for testing.
We should emphasize that if the null hypothesis is not rejected based on the sample data, we cannot say that the null hypothesis is true. In other words, failing to reject the null hypothesis does not prove that $H_0$ is true; it only means that we have failed to disprove $H_0$.

For the null hypothesis, we usually state that “there is no significant difference between A and B”. For example, “the mean tensile strength of copper wire is not significantly different from some standard”.

#### Alternative Hypothesis

Any hypothesis different from the null hypothesis is called an alternative hypothesis, denoted by $H_1$. It is the statement that is accepted if the sample data provide sufficient evidence that the null hypothesis is false. The alternative hypothesis is also referred to as the research hypothesis.

It is important to remember that no matter how the problem is stated, the null hypothesis will always contain the equal sign, and the equal sign will never appear in the alternative hypothesis. It is because the null hypothesis is the statement being tested and we need a specific value to include in our calculations. The alternative hypothesis for the example given in the null hypothesis is $H_1:\mu \ne 20$.

#### Simple and Composite Hypothesis

If a statistical hypothesis completely specifies the form of the distribution as well as the values of all parameters, it is called a simple hypothesis. For example, suppose the age distribution of first-year college students is normal with variance 25 and the null hypothesis is $H_0: \mu =16$, so that under $H_0$ the distribution is $N(16, 25)$; this null hypothesis is a simple hypothesis. If a statistical hypothesis does not completely specify the distribution, it is called a composite hypothesis. For example, $H_1:\mu < 16$ or $H_1:\mu > 16$.

### ii) Level of Significance

The level of significance (significance level) is denoted by the Greek letter alpha ($\alpha$). It is also called the level of risk (as it is the risk you take of rejecting the null hypothesis when it is true). The level of significance is defined as the probability of making a type-I error. It is the maximum probability with which we would be willing to risk a type-I error. It is usually specified before any sample is drawn so that the results obtained will not influence our choice.

In practice, the 10% (0.10), 5% (0.05), and 1% (0.01) levels of significance are used in testing a given hypothesis. A 5% level of significance means that there are about 5 chances out of 100 that we would reject a true null hypothesis; in other words, when the null hypothesis is true, the testing procedure leads to the right decision about 95% of the time. Rejecting a hypothesis at the 0.05 level of significance means that we could be wrong with a probability of 0.05.
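The meaning of “5 chances out of 100” can be checked with a small simulation. The sketch below is only an illustration under assumed settings (a normal population with known $\sigma$, a two-tailed Z test, and hypothetical values for the mean, standard deviation, and sample size): it repeatedly samples from a population for which $H_0$ is true and counts how often $H_0$ is wrongly rejected.

```python
import math
import random
from statistics import NormalDist


def type_one_error_rate(alpha, trials=2000, n=30, mu0=20.0, sigma=2.0, seed=1):
    """Estimate how often a true H0: mu = mu0 is rejected at level alpha."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1.0 - alpha / 2.0)  # two-tailed cutoff
    rejections = 0
    for _ in range(trials):
        # Draw a sample from a population where H0 really holds.
        sample = [rng.gauss(mu0, sigma) for _ in range(n)]
        sample_mean = sum(sample) / n
        z = (sample_mean - mu0) / (sigma / math.sqrt(n))
        if abs(z) > z_crit:
            rejections += 1
    return rejections / trials


rate = type_one_error_rate(alpha=0.05)
print(rate)
```

The estimated rejection rate should come out close to the chosen $\alpha$, up to simulation noise.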

#### Selection of Level of Significance

In Testing of Hypothesis, the selection of the level of significance depends on the field of study. Traditionally 0.05 level is selected for business science-related problems, 0.01 for quality assurance, and 0.10 for political polling and social sciences.

#### Type-I and Type-II Errors

Whenever we accept or reject a statistical hypothesis based on sample data, there is always some chance of making an incorrect decision. Accepting a true null hypothesis or rejecting a false null hypothesis is a correct decision, while rejecting a true null hypothesis or accepting a false null hypothesis is an incorrect decision. These two kinds of incorrect decisions are called type-I and type-II errors.
Type-I error: Rejecting the null hypothesis ($H_0$) when it is true.
Type-II error: Accepting the null hypothesis ($H_0$) when it is false, that is, when $H_1$ is true.

### iii) Test Statistics

The third step of testing a hypothesis is to choose a test statistic. Procedures that enable us to decide whether to accept or reject a hypothesis, or to determine whether observed samples differ significantly from expected results, are called tests of hypothesis, tests of significance, or rules of decision. A test statistic is a value calculated from sample information that is used to determine whether to reject the null hypothesis.

The test statistic for the mean $\mu$ when $\sigma$ is known is $Z= \frac{\bar{X}-\mu}{\sigma/\sqrt{n}}$, where the Z-value is based on the sampling distribution of $\bar{X}$, which follows the normal distribution with mean $\mu_{\bar{X}}$ equal to $\mu$ and standard deviation $\sigma_{\bar{X}}$ equal to $\sigma/\sqrt{n}$. Thus, we determine whether the difference between $\bar{X}$ and $\mu$ is statistically significant by finding how many standard deviations $\bar{X}$ lies from $\mu$ using the Z statistic. Other test statistics, such as $t$, $F$, and $\chi^2$, are also available.
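As an illustration, the Z statistic can be computed in Python as follows. The function name and the numbers are hypothetical: they continue the M.Sc. students example by assuming a sample of 16 students with a mean age of 21 years and a known $\sigma$ of 2 years.

```python
import math


def z_statistic(sample_mean, mu0, sigma, n):
    """Number of standard errors between the sample mean and mu0."""
    standard_error = sigma / math.sqrt(n)
    return (sample_mean - mu0) / standard_error


# Assumed values: H0: mu = 20, sample mean 21, sigma 2, n = 16.
z = z_statistic(sample_mean=21.0, mu0=20.0, sigma=2.0, n=16)
print(z)
```

With these assumed values the standard error is $2/\sqrt{16}=0.5$, so the sample mean lies two standard errors above the hypothesized mean.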

### iv) Critical Region (Formulating Decision Rule)

It must be decided before the sample is drawn under what conditions (circumstances) the null hypothesis will be rejected. A dividing line must be drawn between “probable” and “improbable” sample values, given that the null hypothesis is a true statement. Simply put, a decision rule must be formulated that states the specific conditions under which the null hypothesis should or should not be rejected. This dividing line defines the region of rejection: those values that are so large or so small that the probability of their occurrence under a true null hypothesis is rather remote. The set of possible values of the sample statistic that leads to rejecting the null hypothesis is called the critical region.

#### One-tailed and two-tailed tests of significance

In testing of hypothesis, if the rejection region is on only the left or only the right tail of the curve, the test is called a one-tailed test. It happens when the null hypothesis is tested against an alternative hypothesis of the “greater than” or “less than” type.

If the rejection region is on both the left and right tails of the curve, the test is called a two-tailed test. It happens when the null hypothesis is tested against an alternative hypothesis of the “not equal to” type.
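For a Z test, the boundaries of the rejection region can be read from the standard normal distribution. The sketch below uses `statistics.NormalDist` from the Python standard library; the function name `critical_region` and the `tail` labels are assumptions made for this example.

```python
from statistics import NormalDist


def critical_region(alpha, tail="two"):
    """Return (lower, upper) critical z-values; None marks an unused tail.

    tail = "two":   reject if z < lower or z > upper  (H1: not equal)
    tail = "right": reject if z > upper               (H1: greater than)
    tail = "left":  reject if z < lower               (H1: less than)
    """
    std_normal = NormalDist()  # mean 0, standard deviation 1
    if tail == "two":
        upper = std_normal.inv_cdf(1.0 - alpha / 2.0)
        return (-upper, upper)
    if tail == "right":
        return (None, std_normal.inv_cdf(1.0 - alpha))
    if tail == "left":
        return (std_normal.inv_cdf(alpha), None)
    raise ValueError("tail must be 'two', 'right', or 'left'")


print(critical_region(0.05, "two"))
print(critical_region(0.05, "right"))
print(critical_region(0.05, "left"))
```

In a two-tailed test, $\alpha$ is split equally between the two tails, so each cutoff lies farther from zero than the one-tailed cutoff at the same $\alpha$.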

### v) Making a Decision

In this last step of testing hypotheses, the computed value of the test statistic is compared with the critical value. If the test statistic falls within the rejection region, the null hypothesis is rejected; otherwise, it is accepted. Note that only one of two decisions is possible in hypothesis testing: either accept or reject the null hypothesis. Instead of “accepting” the null hypothesis ($H_0$), some researchers prefer to phrase the decision as “Do not reject $H_0$”, “We fail to reject $H_0$”, or “The sample results do not allow us to reject $H_0$”.
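Putting the steps together, a two-tailed Z test for a mean with known $\sigma$ might be sketched as below. The function name and the example numbers are hypothetical, and the wording of the returned decision follows the “fail to reject” phrasing mentioned above.

```python
import math
from statistics import NormalDist


def z_test_decision(sample_mean, mu0, sigma, n, alpha=0.05):
    """Two-tailed Z test of H0: mu = mu0 against H1: mu != mu0."""
    # Test statistic.
    z = (sample_mean - mu0) / (sigma / math.sqrt(n))
    # Critical value that bounds the rejection region.
    z_crit = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    # Decision: reject only if z falls in the rejection region.
    decision = "reject H0" if abs(z) > z_crit else "fail to reject H0"
    return z, z_crit, decision


# Assumed values: H0: mu = 20, sigma = 2, n = 16.
print(z_test_decision(21.0, 20.0, 2.0, 16))  # sample mean of 21
print(z_test_decision(20.5, 20.0, 2.0, 16))  # sample mean of 20.5
```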
