Basic Statistics and Data Analysis

Lecture notes, MCQS of Statistics

Data Collection Methods

There are many methods of collecting data, but they can be classified into four main methods (sources) of collecting data for use in statistical inference: (i) Survey Method, (ii) Simulation, (iii) Controlled Experiments, and (iv) Observational Study.

Survey Method

A very popular and widely used method is the survey, in which people with special training go out and record observations such as the number of vehicles traveling along a road, the acres of fields that farmers are using to grow a particular food crop, the number of households that own more than one motor vehicle, the number of passengers using Metro transport, and so on. Here the person making the study has no direct control over generating the data that can be recorded, although the recording methods need care and control.

Simulation

In simulation, a computer model of the operation of a system is set up. For example, a model of an industrial system may be set up in which an important measurement is the percentage purity of a chemical product. A very large number of realizations of the model can be run in order to look for any pattern in the results. Here the success of the approach depends on how well that measurement can be explained by the model, and this has to be tested by carrying out at least a small amount of work on the actual system in operation.

Controlled Experiments

An experiment is possible when the background conditions can be controlled, at least to some extent. For example, we may be interested in choosing the best type of grass seed to use on a sports field.

The first stage of work is to grow all the competing varieties of seed at the same place and make suitable records of their growth and development. The competing varieties should be grown in quite small units close together in the field, as in the figure below.

[Figure: Controlled Experiment]

 

This is a controlled experiment because it has certain constraints, such as:

i) There is a river on the right side.
ii) There is the shadow of trees on the left side.
iii) There are 3 different varieties (say, v1, v2, v3), which are distributed over 12 units.

This layout gives much more control over local environmental conditions than there would have been if one variety had been placed in a strip in the shelter of the trees, another close to the river, and the third in a more exposed position in the center of the field, as in the diagram below.

[Figure: Controlled experiment 2]

There are 3 experimental units. One is close to the stream, another is close to the trees, and the third is between them, a position that is more favorable than the other two. It is now our choice which variety to place in which position.

Observational Study

Like experiments, observational studies try to understand cause-and-effect relationships. However, unlike experiments, the researcher is not able to control (1) how subjects are assigned to groups and/or (2) which treatments each group receives.

 

Note that small units of land or plots are called experimental units or simply units.

There is no “right” size for a unit; it depends on the type of crop, the work that is to be done on it, and the measurements that are to be taken. The measurements upon which inferences are eventually going to be based must be taken as accurately as possible. The unit must therefore not be so large as to make recording very tedious, because that leads to errors and inaccuracy. On the other hand, if a unit is very small, there is the danger that relatively minor physical errors in recording can lead to large percentage errors.

Experimenters, and the statisticians who collaborate with them, need to gain a good knowledge of their experimental material or units as a research program proceeds.

 


 

Basic Principles of Experimental Design

The basic principles of experimental design are (i) Randomization, (ii) Replication and (iii) Local Control.

  1. Randomization

    Randomization is the cornerstone underlying the use of statistical methods in experimental designs. Randomization is the random process of assigning treatments to the experimental units. The random process implies that every possible allotment of treatments has the same probability. For example, if the number of treatments is t = 3 (say, A, B, and C) and the number of replications is r = 4, then the number of elements (experimental units) is n = t × r = 3 × 4 = 12. Replication means that each treatment appears 4 times, as r = 4. Let the design be

    ACBC
    CBAB
    ACBA

    Note that in this design (numbering the elements row by row) elements 1, 7, 9, and 12 are reserved for Treatment A, elements 3, 6, 8, and 11 for Treatment B, and elements 2, 4, 5, and 10 for Treatment C. Thus P(A) = 4/12, P(B) = 4/12, and P(C) = 4/12, meaning that Treatments A, B, and C have equal chances of being assigned to any given element.

  2. Replication

    By replication we mean the repetition of the basic experiment. For example, if we need to compare the grain yield of two varieties of wheat, then each variety is applied to more than one experimental unit. The number of times each is applied to experimental units is called its number of replications. Replication has two important properties:

    • It allows the experimenter to obtain an estimate of the experimental error.
    • More replication provides increased precision by reducing the standard error (SE) of the mean, $s_{\overline{y}}=\tfrac{s}{\sqrt{r}}$, where $s$ is the sample standard deviation and $r$ is the number of replications. Note that an increase in $r$ decreases $s_{\overline{y}}$ (the standard error of $\overline{y}$).
  3. Local Control

    It has been observed that randomization and replication do not remove all extraneous sources of variation, i.e., on their own they are unable to control them completely.
    Thus we need a refinement in the experimental technique. In other words, we need to choose a design in such a way that all extraneous sources of variation are brought under control. For this purpose we make use of local control, a term referring to the amount of (i) balancing, (ii) blocking, and (iii) grouping of experimental units.

Balancing: Balancing means that the treatments should be assigned to the experimental units in such a way that the result is a balanced arrangement of treatments.

Blocking: Blocking means that like experimental units should be collected together to form relatively homogeneous groups. A block is also a replicate.

The main purpose of local control is to increase the efficiency of the experimental design by decreasing the experimental error.
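
The randomization and blocking ideas above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `randomize_completely` and `randomize_within_blocks` are hypothetical, and the seed is fixed only so that the example is repeatable. The first function allots t = 3 treatments to n = 12 units completely at random with r = 4 replications each; the second randomizes the treatments separately within each block, so that every block is a complete replicate.

```python
import random
from collections import Counter


def randomize_completely(treatments, replications, seed=None):
    """Assign each treatment to `replications` units, in random order."""
    rng = random.Random(seed)
    units = [t for t in treatments for _ in range(replications)]
    rng.shuffle(units)  # every ordering of the units is equally likely
    return units


def randomize_within_blocks(treatments, n_blocks, seed=None):
    """Randomize the treatment order separately inside each block."""
    rng = random.Random(seed)
    layout = []
    for _ in range(n_blocks):
        block = list(treatments)
        rng.shuffle(block)
        layout.append(block)
    return layout


# Completely randomized layout: 3 treatments x 4 replications = 12 units.
layout = randomize_completely(["A", "B", "C"], replications=4, seed=1)
for start in range(0, len(layout), 4):
    print("".join(layout[start:start + 4]))
print(Counter(layout))  # each treatment appears exactly 4 times

# Blocked layout: each of the 4 blocks contains every treatment once.
blocks = randomize_within_blocks(["A", "B", "C"], n_blocks=4, seed=1)
for block in blocks:
    print("".join(block))
```

Whatever the seed, each treatment occupies exactly 4 of the 12 units, which is why P(A) = P(B) = P(C) = 4/12 for any given unit.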

 

Standard Error of Estimate

Standard error (SE) is a statistical term that measures how precisely a sample statistic estimates the corresponding population value. The standard error of the mean measures the variation in the sampling distribution of the sample mean; it is usually denoted by $\sigma_{\overline{x}}$ and is calculated as

\[\sigma_{\overline{x}}=\frac{\sigma}{\sqrt{n}}\]

Drawing (obtaining) different samples from the same population of interest usually results in different values of the sample mean, indicating that there is a distribution of sample means with its own mean (average value) and variance. The standard error of the mean is the standard deviation of the means of all possible samples of the same size drawn from the same population.

The size of the standard error is affected by the standard deviation of the population and by the number of observations in a sample, called the sample size. The larger the standard deviation of the population ($\sigma$), the larger the standard error will be, indicating that there is more variability in the sample means. However, the larger the number of observations in a sample, the smaller the standard error will be, indicating that there is less variability in the sample means. By less variability we mean that the sample is more representative of the population of interest.
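
A small simulation can make this behaviour concrete. The sketch below is an illustration only: it assumes a normal population with mean 50 and standard deviation 10 (arbitrary choices), and the function name `simulated_se_of_mean` is hypothetical. It draws many samples of size $n$, computes the standard deviation of their means, and compares the result with $\sigma/\sqrt{n}$.

```python
import math
import random
import statistics


def simulated_se_of_mean(sigma, n, n_samples=2000, mu=50.0, seed=0):
    """Standard deviation of the means of `n_samples` random samples of size n."""
    rng = random.Random(seed)
    means = []
    for _ in range(n_samples):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        means.append(statistics.fmean(sample))
    return statistics.stdev(means)


# Compare the theoretical value sigma / sqrt(n) with the simulated one.
for n in (25, 100):
    theoretical = 10.0 / math.sqrt(n)
    simulated = simulated_se_of_mean(10.0, n)
    print(n, round(theoretical, 3), round(simulated, 3))
```

The simulated values should lie close to the theoretical ones, and the value for n = 100 should be roughly half the value for n = 25.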

If the sampled population is not very large, we need to make an adjustment when computing the SE of the sample mean. For a finite population in which the total number of objects (observations) is $N$ and the number of objects (observations) in a sample is $n$, the adjustment is $\sqrt{\frac{N-n}{N-1}}$. This adjustment is called the finite population correction factor. The adjusted standard error is then

\[\frac{\sigma}{\sqrt{n}} \sqrt{\frac{N-n}{N-1}}\]
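
As a quick numerical check of the adjusted formula, here is a minimal Python sketch. The function name `se_mean_finite` and the example values (σ = 10, n = 25, N = 100) are assumptions made only for illustration.

```python
import math


def se_mean_finite(sigma, n, N=None):
    """Standard error of the mean, with the finite population correction if N is given."""
    se = sigma / math.sqrt(n)
    if N is not None:
        if N < 2 or not 1 <= n <= N:
            raise ValueError("need N >= 2 and 1 <= n <= N")
        se *= math.sqrt((N - n) / (N - 1))  # finite population correction factor
    return se


print(se_mean_finite(10.0, 25))         # no correction (very large population)
print(se_mean_finite(10.0, 25, N=100))  # correction shrinks the standard error
print(se_mean_finite(10.0, 100, N=100)) # sampling the whole population gives 0
```

The correction factor is always at most 1, so the adjusted standard error never exceeds $\sigma/\sqrt{n}$, and it approaches $\sigma/\sqrt{n}$ as $N$ grows relative to $n$.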

The SE is used to:

  1. measure the spread of the values of a statistic about the expected value of that statistic
  2. construct confidence intervals
  3. test the null hypothesis about population parameter(s)

The standard error is computed from sample statistics. The following formulas give the SE for simple random samples, assuming that the population size ($N$) is at least 20 times larger than the sample size ($n$).
\begin{align*}
\text{Sample mean, } \overline{x} & \Rightarrow SE_{\overline{x}} = \frac{s}{\sqrt{n}}\\
\text{Sample proportion, } p &\Rightarrow SE_{p} = \sqrt{\frac{p(1-p)}{n}}\\
\text{Difference between means, } \overline{x}_1 - \overline{x}_2 &\Rightarrow SE_{\overline{x}_1-\overline{x}_2}=\sqrt{\frac{s_1^2}{n_1}+\frac{s_2^2}{n_2}}\\
\text{Difference between proportions, } p_1-p_2 &\Rightarrow SE_{p_1-p_2}=\sqrt{\frac{p_1(1-p_1)}{n_1}+\frac{p_2(1-p_2)}{n_2}}
\end{align*}
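
The four formulas above translate directly into code. The following Python sketch is illustrative only; the function names and the example numbers are hypothetical and not taken from any data set.

```python
import math


def se_mean(s, n):
    """SE of a sample mean: s / sqrt(n)."""
    return s / math.sqrt(n)


def se_proportion(p, n):
    """SE of a sample proportion: sqrt(p(1 - p) / n)."""
    return math.sqrt(p * (1 - p) / n)


def se_diff_means(s1, n1, s2, n2):
    """SE of the difference between two independent sample means."""
    return math.sqrt(s1 ** 2 / n1 + s2 ** 2 / n2)


def se_diff_proportions(p1, n1, p2, n2):
    """SE of the difference between two independent sample proportions."""
    return math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)


print(se_mean(10.0, 25))
print(se_proportion(0.5, 100))
print(se_diff_means(3.0, 9, 4.0, 16))
print(se_diff_proportions(0.5, 100, 0.5, 100))
```

Each function assumes a simple random sample (and, for the two differences, independent samples), in line with the assumption stated above the formulas.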

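The equations for the standard error are identical to the equations for the standard deviation of the corresponding sampling distribution, except that the standard error equations use sample statistics where the standard deviation equations use population parameters.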

 


 

Latin Square Design (LSD)

In a Latin Square Design the treatments are grouped into replicates in two different ways, such that each row and each column is a complete block. The grouping for a balanced arrangement is achieved by imposing the restriction that each treatment must appear once and only once in each row and once and only once in each column. The experimental material should be arranged, and the experiment conducted, in such a way that the differences among the rows and among the columns represent major sources of variation.

Hence a Latin Square Design is an arrangement of k treatments in a k × k square, where the treatments are grouped in blocks in two directions. It should be noted that in a Latin Square Design the number of rows, the number of columns, and the number of treatments must be equal.

In other words, unlike the Randomized Complete Block Design (RCBD) and the Completely Randomized Design (CRD), a Latin Square Design is a design with two restrictions: it provides two blocking factors, which are used to control the effect of two variables that influence the response variable. It is called a Latin Square because each Latin letter represents a treatment that occurs once in each row and once in each column, so that with respect to one criterion (restriction) the rows are completely homogeneous blocks and with respect to the other criterion (second restriction) the columns are completely homogeneous blocks.

Latin Square Designs are mostly applied in animal science, agriculture, industrial research, etc. An everyday example is the Sudoku puzzle, whose solution grid is a special case of a Latin square. The main assumption is that there is no interaction between the treatment, row, and column effects.

The general model is defined as
\[Y_{ijk}=\mu+\alpha_i+\beta_j+\tau_k +\varepsilon_{ijk}\]

where $i=1,2,\cdots,t; j=1,2,\cdots,t$ and $k=1,2,\cdots,t$ with $t$ treatments, $t$ rows and $t$ columns,
$\mu$ is the overall mean (general mean) based on all of the observations,
$\alpha_i$ is the effect of the $i$th row,
$\beta_j$ is the effect of the $j$th column,
$\tau_k$ is the effect of the $k$th treatment,
$\varepsilon_{ijk}$ is the corresponding error term, which is assumed to be independent and normally distributed with mean zero and constant variance, i.e. $\varepsilon_{ijk}\sim N(0, \sigma^2)$.

Latin Square Design Experimental Layout

Suppose we have 4 treatments (namely: A, B, C and D), then it means that we have

Number of Treatments = Number of Rows = Number of Columns =4

The layout of the Latin Square Design can then be, for example,

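\[
\begin{array}{|c|c|c|c|}
\hline
A\;(Y_{111}) & B\;(Y_{122}) & C\;(Y_{133}) & D\;(Y_{144})\\ \hline
B\;(Y_{212}) & C\;(Y_{223}) & D\;(Y_{234}) & A\;(Y_{241})\\ \hline
C\;(Y_{313}) & D\;(Y_{324}) & A\;(Y_{331}) & B\;(Y_{342})\\ \hline
D\;(Y_{414}) & A\;(Y_{421}) & B\;(Y_{432}) & C\;(Y_{443})\\ \hline
\end{array}
\]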

The numbers in the subscript represent the row, column (block), and treatment number, respectively. For example, $Y_{421}$ is the observation on the first treatment (A) in the 4th row and 2nd column.
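
The layout above is a cyclic square: each row is the previous row shifted one place to the left. A minimal Python sketch of that construction, with a check of the "once in each row and once in each column" restriction, is given below. The function names `cyclic_latin_square` and `is_latin_square` are hypothetical. In practice the rows, columns, and treatment labels of such a standard square are usually randomized before use, which this sketch does not do.

```python
def cyclic_latin_square(treatments):
    """Build a t x t square in which row i is the treatment list shifted left by i."""
    t = len(treatments)
    return [[treatments[(i + j) % t] for j in range(t)] for i in range(t)]


def is_latin_square(square):
    """True if every symbol occurs exactly once in each row and each column."""
    t = len(square)
    symbols = set(square[0])
    if len(symbols) != t:
        return False
    rows_ok = all(len(row) == t and set(row) == symbols for row in square)
    cols_ok = all({square[i][j] for i in range(t)} == symbols for j in range(t))
    return rows_ok and cols_ok


treatments = ["A", "B", "C", "D"]
square = cyclic_latin_square(treatments)

# Print each cell with its Y_{ijk} label: row i, column j, treatment number k.
for i, row in enumerate(square, start=1):
    cells = []
    for j, letter in enumerate(row, start=1):
        k = treatments.index(letter) + 1
        cells.append(f"{letter}(Y_{i}{j}{k})")
    print("  ".join(cells))

print(is_latin_square(square))
```

The check function can also be applied to any proposed layout, for example after the rows and columns have been permuted at random.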

 

Creating Matrices in Mathematica

A matrix is an array of numbers arranged in rows and columns. In Mathematica matrices are expressed as a list of rows, each of which is a list itself. It means a matrix is a list of lists. If a matrix has n rows and m columns then we call it an n by m matrix. The value(s) in the ith row and jth column is called the i, j entry.

In Mathematica, matrices can be entered with the { } notation, constructed from a formula, or imported from a data file. There are also commands for creating diagonal matrices, constant matrices, and other special matrix types.

  • Create a matrix using { } notation
    mat={{1, 2, 3}, {4, 5, 6}, {7, 8, 9}}
    The output is not displayed in matrix form; to display it as a matrix, use
    mat//MatrixForm
  • Create a matrix using the Table command (here the entries are the symbol b with the row and column numbers as subscripts)
    mat1=Table[Subscript[b, row, column],
    {row, 1, 4, 1}, {column, 1, 2, 1}];
    MatrixForm[mat1]
  • Create a symbolic matrix such as
    mat2=Table[Subscript[x, i]+Subscript[x, j], {i, 1, 4}, {j, 1, 3}]
    mat2//MatrixForm
  • Create a diagonal matrix with nonzero entries on its diagonal
    DiagonalMatrix[{1, 2, 3, r}]//MatrixForm
  • Create a matrix with the same entry everywhere, i.e. a constant matrix
    ConstantArray[3, {2, 4}]//MatrixForm
  • Create an identity matrix of order n × n (here n = 4)
    IdentityMatrix[4]

Matrix Operations in Mathematica

In Mathematica, matrix operations can be performed on both numeric and symbolic matrices.

  • To find the determinant of a matrix
    Det[mat]
  • To find the transpose of a matrix
    Transpose[mat]
  • To find the inverse of a matrix, e.g. for solving a linear system (the matrix must be nonsingular; the example mat above has determinant 0, so it has no inverse)
    Inverse[mat]
  • To find the trace of a matrix, i.e. the sum of the diagonal elements of the matrix
    Tr[mat]
  • To find the eigenvalues of a matrix
    Eigenvalues[mat]
  • To find the eigenvectors of a matrix
    Eigenvectors[mat]
  • To find both the eigenvalues and the eigenvectors together
    Eigensystem[mat]

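Note that the +, *, and ^ operators all automatically work element-wise on matrices. For the matrix product use the dot operator (for example mat.mat), and for matrix powers use MatrixPower[mat, n].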

Displaying matrix and its elements

  • mat[[1]]         displays the first row of the matrix, where mat is the matrix created above
  • mat[[1, 2]]     displays the element from first row and second column, i.e. m12 element of the matrix
  • mat[[All, 2]]  displays the 2nd column of matrix


Copyright © 2011 ITFEATURE.COM