Data Collection Methods

There are many methods of collecting data. The data collection methods used in statistical inference can be classified into four main methods (sources).


The data collection methods are (i) Survey Method, (ii) Simulation, (iii) Controlled Experiments, and (iv) Observational Study. Let us discuss these data collection methods one by one in detail.

(i) Survey Method

A very popular and widely used method is the survey, where people with special training go out and record observations such as the number of vehicles traveling along a road, the acres of fields that farmers are using to grow a particular food crop, the number of households that own more than one motor vehicle, the number of passengers using Metro transport, and so on. Here the person making the study has no direct control over generating the data that can be recorded, although the recording methods need care and control.

(ii) Simulation

Simulation is also one of the most important data collection methods. In simulation, a computer model for the operation of an (industrial) system is set up, in which an important measurement is the percentage purity of a (chemical) product. A very large number of realizations of the model can be run to look for any pattern in the results. Here the success of the approach depends on how well the model can explain that measurement, and this has to be tested by carrying out at least a small amount of work on the actual system in operation.
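
As a rough illustration, the following R sketch simulates percentage purity readings from a hypothetical process model and then looks for a pattern across settings. The assumed relationship between purity and a temperature setting, and all numbers, are illustrative assumptions only.

```r
# Minimal simulation sketch (hypothetical model): percentage purity of a
# chemical product is assumed to depend on a temperature setting plus noise.
set.seed(123)

simulate_purity <- function(n_runs, temp) {
  # assumed relationship: purity peaks near temp = 80 (illustrative only)
  97 - 0.02 * (temp - 80)^2 + rnorm(n_runs, mean = 0, sd = 0.5)
}

temps <- c(70, 75, 80, 85, 90)
runs  <- lapply(temps, function(tp) simulate_purity(n_runs = 1000, temp = tp))
names(runs) <- paste0("temp_", temps)

# Look for a pattern in the simulated results: average purity per setting
sapply(runs, mean)
```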

(iii) Controlled Experiments

An experiment is possible when the background conditions can be controlled, at least to some extent. For example, we may be interested in choosing the best type of grass seed to use on a sports field.

The first stage of the work is to grow all the competing varieties of seed at the same place and make suitable records of their growth and development. The competing varieties should be grown in quite small units close together in the field, as in the figure below.

(Figure: field layout of the controlled experiment, with the three varieties distributed over 12 small units)

This is a controlled experiment, as it has certain constraints, such as:

i) A river on the right side of the field
ii) The shadow of trees on the left side
iii) Three different varieties (say, $v_1, v_2, v_3$) distributed over 12 units

This arrangement gives much more control over local environmental conditions than there would have been if one variety had been placed in the strip sheltered by the trees, another close to the river, and the third more exposed in the center of the field, as in the diagram below.

(Figure: an alternative layout with each variety occupying a single strip of the field)

In that layout there are only three experimental units: one close to the river, another next to the trees, and the third between them, which enjoys more favorable conditions than the others. It is then our choice where to place each variety among these strips.

(iv) Observational Study

Like experiments, observational studies try to understand cause-and-effect relationships. However, unlike experiments, the researcher is not able to control (1) how subjects are assigned to groups and/or (2) which treatments each group receives.

Note that small units of land or plots are called experimental units or simply units.

There is no “right” size for a unit; it depends on the type of crop, the work that is to be done on it, and the measurements that are to be taken. Similarly, the measurements on which inferences will eventually be based must be taken as accurately as possible. The unit therefore should not be so large as to make recording very tedious, because that leads to errors and inaccuracy. On the other hand, if a unit is very small, there is the danger that relatively minor physical errors in recording can lead to large percentage errors.

Experimenters, and the statisticians who collaborate with them, need to gain a good knowledge of their experimental material or units as a research program proceeds.


Basic Principles of DOE (Design of Experiments)

The basic principles of DOE (design of experiments, or experimental design) are (i) Randomization, (ii) Replication, and (iii) Local Control. Let us discuss these important principles of experimental design in detail below.


  1. Randomization

    Randomization is the cornerstone underlying the use of statistical methods in experimental design. Randomization is the random process of assigning treatments to the experimental units. The random process implies that every possible allotment of treatments has the same probability. For example, if the number of treatments is $t = 3$ (say, $A$, $B$, and $C$) and the number of replications is $r = 4$, then the number of experimental units is $n = t \times r = 3 \times 4 = 12$. Replication means that each treatment will appear 4 times, as $r = 4$ (an R sketch of such a random assignment is given below, after the discussion of local control). Let the design be

    A C B C
    C B A B
    A C B A
    Note from the design that elements 1, 7, 9, and 12 are reserved for Treatment $A$, elements 3, 6, 8, and 11 for Treatment $B$, and elements 2, 4, 5, and 10 for Treatment $C$. Here $P(A) = P(B) = P(C) = \frac{4}{12}$, meaning that Treatments $A$, $B$, and $C$ have equal chances of selection.
  2. Replication

    By replication, we mean the repetition of the basic experiment. For example, if we need to compare the grain yield of two varieties of wheat, then each variety is applied to more than one experimental unit. The number of times each treatment is applied to experimental units is called its number of replications. Replication has two important properties:

    • It allows the experimenter to obtain an estimate of the experimental error.
    • More replication provides increased precision by reducing the standard error (SE) of the mean, since $s_{\overline{y}}=\tfrac{s}{\sqrt{r}}$, where $s$ is the sample standard deviation and $r$ is the number of replications. Note that an increase in the value of $r$ decreases $s_{\overline{y}}$ (the standard error of $\overline{y}$).
  3. Local Control

    Local control is the last important principle among the principles of DOE. It has been observed that randomization and replication do not remove all extraneous sources of variation; that is, on their own they are unable to control the extraneous sources of variation.
    Thus we need to refine the experimental technique. In other words, we need to choose a design in such a way that all extraneous sources of variation are brought under control. For this purpose, we make use of local control, a term referring to the amount of (i) balancing, (ii) blocking, and (iii) grouping of the experimental units.


Balancing: Balancing means that the treatments should be assigned to the experimental units in such a way that the result is a balanced arrangement of the treatments.

Blocking: Blocking means that like experimental units should be collected together to form relatively homogeneous groups. A block is also called a replicate.

The main objective/purpose of local control is to increase the efficiency of the experimental design by decreasing the experimental error.
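
Below is a minimal R sketch of the randomization and replication ideas discussed above, assuming $t = 3$ treatments ($A$, $B$, $C$), $r = 4$ replications, and an arbitrary sample standard deviation. It is an illustration under these assumptions, not a complete design.

```r
# Randomly assign t = 3 treatments (A, B, C), each replicated r = 4 times,
# to n = t * r = 12 experimental units.
set.seed(42)

treatments <- rep(c("A", "B", "C"), times = 4)  # each treatment appears r = 4 times
design     <- sample(treatments)                # random allotment to units 1..12

# Display the design as a 3 x 4 field layout and check the replication counts
matrix(design, nrow = 3, byrow = TRUE)
table(design)                                   # each treatment occurs 4 times

# Replication and precision: the standard error of a treatment mean, s / sqrt(r),
# decreases as the number of replications r increases
s <- 2.5                                        # an assumed sample standard deviation
sapply(c(2, 4, 8, 16), function(r) s / sqrt(r))
```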

This is all about the basic principles of experimental design. To learn more about DOE, visit the link: Design of Experiments.


Real Life Example

Imagine a bakery trying to improve the quality of its bread. Factors that could affect bread quality include

  • Flour type,
  • Water temperature, and
  • Yeast quantity

By using DOE, the bakery can systematically test different combinations of these factors to determine the optimal recipe.

Randomization: Randomly assign different bread batches to different combinations of flour type, water temperature, and yeast quantity.

Replication: Bake multiple loaves of bread for each combination to assess variability.

Local Control: If the oven has different temperature zones, bake similar bread batches in the same zone to reduce temperature variation.
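
A small R sketch of how such a replicated, randomized set of factor combinations might be laid out; the factor levels (flour types, water temperatures, yeast quantities) and the number of replicates are purely illustrative assumptions.

```r
# Hypothetical bakery experiment: all combinations of three factors (2 x 2 x 2),
# each combination replicated 3 times, baked in a random order.
set.seed(7)

recipe <- expand.grid(
  flour      = c("wholemeal", "white"),  # illustrative levels
  water_temp = c(30, 40),                # degrees Celsius (assumed)
  yeast      = c(5, 10)                  # grams per loaf (assumed)
)

replicated <- recipe[rep(seq_len(nrow(recipe)), each = 3), ]  # replication
replicated$bake_order <- sample(nrow(replicated))             # randomization
head(replicated[order(replicated$bake_order), ])
```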

By following the Basic Principles of Design of Experiments, the bakery can efficiently identify the best recipe for its bread, improving product quality and reducing waste.


Read more about the Objective of Design of Experiments

Standard Error: A Quick Guide

Introduction to Standard Errors (SE)

Standard error (SE) is a statistical term used to measure how accurately a sample represents the population of interest. The standard error of the mean measures the variation in the sampling distribution of the sample mean. It is usually denoted by $\sigma_{\overline{x}}$ and is calculated as

\[\sigma_{\overline{x}}=\frac{\sigma}{\sqrt{n}}\]

Drawing (obtaining) different samples from the same population of interest usually results in different values of the sample mean, indicating that there is a distribution of sample means that has its own mean and variance. The standard error of the mean can be regarded as the standard deviation of the means of all possible samples drawn from the same population.
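
This can be illustrated by simulation. The following R sketch assumes a normal population with $\mu = 50$ and $\sigma = 10$ and draws many samples of size $n = 25$; the standard deviation of the resulting sample means should be close to $\sigma/\sqrt{n} = 2$.

```r
# Draw many samples of size n from the same (simulated) population and compare
# the standard deviation of the sample means with sigma / sqrt(n).
set.seed(1)
sigma <- 10; mu <- 50; n <- 25

sample_means <- replicate(10000, mean(rnorm(n, mean = mu, sd = sigma)))

sd(sample_means)    # empirical standard error of the mean
sigma / sqrt(n)     # theoretical value: 10 / 5 = 2
```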

Size of the Standard Error

The size of the standard error is affected by the standard deviation of the population and the number of observations in a sample, called the sample size. The larger the population’s standard deviation ($\sigma$), the larger the standard error will be, indicating more variability in the sample means. However, the larger the number of observations in a sample, the smaller the estimate’s SE, indicating less variability in the sample means; by less variability we mean that the sample is more representative of the population of interest.

Adjustments in Computing SE of Sample Means

If the sampled population is not very large, we need to make some adjustments in computing the SE of the sample means. For a finite population, in which the total number of objects (observations) is $N$ and the number of objects (observations) in a sample is $n$, then the adjustment will be $\sqrt{\frac{N-n}{N-1}}$. This adjustment is called the finite population correction factor. Then the adjusted standard error will be

\[\frac{\sigma}{\sqrt{n}} \sqrt{\frac{N-n}{N-1}}\]
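
For example, with illustrative values $\sigma = 10$, $n = 50$, and $N = 400$, the adjusted SE can be computed as in the following R sketch.

```r
# Standard error of the mean with the finite population correction (FPC)
sigma <- 10; n <- 50; N <- 400   # illustrative values

se_unadjusted <- sigma / sqrt(n)
fpc           <- sqrt((N - n) / (N - 1))
se_adjusted   <- se_unadjusted * fpc

c(unadjusted = se_unadjusted, fpc = fpc, adjusted = se_adjusted)
```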

Uses of Standard Error

  1. It measures the spread of the values of a statistic about the expected value of that statistic. It helps us understand how well a sample represents the entire population.
  2. It is used to construct confidence intervals, which provide a range of values likely to contain the true population parameter.
  3. It is used in testing null hypotheses about population parameter(s), for example in t-tests and z-tests. It helps determine the significance of differences between sample means or between a sample mean and a population mean.
  4. It helps in determining the required sample size for a study to achieve the desired level of precision.
  5. By comparing the standard errors of different samples or estimates, one can assess the relative variability and reliability of those estimates.

The SE is computed from sample statistics. The formulas below give the SE for simple random samples, assuming that the population size ($N$) is at least 20 times larger than the sample size ($n$).
\begin{align*}
\text{Sample mean}, \overline{x} & \Rightarrow SE_{\overline{x}} = \frac{s}{\sqrt{n}}\\
\text{Sample proportion}, p &\Rightarrow SE_{p} = \sqrt{\frac{p(1-p)}{n}}\\
\text{Difference between means}, \overline{x}_1 - \overline{x}_2 &\Rightarrow SE_{\overline{x}_1-\overline{x}_2}=\sqrt{\frac{s_1^2}{n_1}+\frac{s_2^2}{n_2}}\\
\text{Difference between proportions}, p_1-p_2 &\Rightarrow SE_{p_1-p_2}=\sqrt{\frac{p_1(1-p_1)}{n_1}+\frac{p_2(1-p_2)}{n_2}}
\end{align*}
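
These four formulas can be written as small R helper functions; all numeric inputs below are illustrative only.

```r
# Standard errors for simple random samples (illustrative values throughout)
se_mean       <- function(s, n) s / sqrt(n)
se_prop       <- function(p, n) sqrt(p * (1 - p) / n)
se_diff_means <- function(s1, n1, s2, n2) sqrt(s1^2 / n1 + s2^2 / n2)
se_diff_props <- function(p1, n1, p2, n2) sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)

se_mean(s = 8, n = 64)                           # 1
se_prop(p = 0.4, n = 100)                        # ~0.049
se_diff_means(s1 = 8, n1 = 64, s2 = 6, n2 = 36)  # ~1.414
se_diff_props(p1 = 0.4, n1 = 100, p2 = 0.5, n2 = 120)
```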

Summary

The SE provides valuable insight into the reliability and precision of sample-based estimates. By understanding the SE, a researcher can make more informed decisions and draw more accurate conclusions from the data under study. The SE is analogous to the standard deviation, except that it measures the variability of a sample statistic rather than the variability of individual observations.

FAQs about SE

  1. What is the SE, and how is it computed?
  2. What are the uses of the SE?
  3. What factors affect the size of the SE?
  4. When will the SE be large?
  5. When will the SE be small?
  6. What is the standard error of a proportion?

For more about the SE, follow the link: Standard Error of Estimate.


Latin Square Designs (LSD) Definition and Introduction

Introduction to Latin Square Designs

In Latin Square Designs, the treatments are grouped into replicates in two different ways, such that each row and each column is a complete block. The grouping for a balanced arrangement is achieved by requiring that each treatment appear once and only once in each row and once and only once in each column. The experimental material should be arranged, and the experiment conducted, in such a way that the differences among rows and columns represent major sources of variation.

Hence a Latin Square Design is an arrangement of $k$ treatments in a $k\times k$ square, where the treatments are grouped in blocks in two directions. It should be noted that in a Latin Square Design the number of rows, the number of columns, and the number of treatments must all be equal.

In other words, unlike the Randomized Complete Block Design (RCBD) and the Completely Randomized Design (CRD), a Latin Square Design is a two-restriction design; it provides two blocking factors that are used to control the effect of two variables that influence the response variable. The design is called a Latin Square because each Latin letter represents a treatment that occurs once in each row and once in each column, in such a way that with respect to one criterion (restriction) the rows are complete homogeneous blocks, and with respect to the other criterion (second restriction) the columns are complete homogeneous blocks.

Application of Latin Square Designs

The application of Latin Square Designs is mostly in animal science, agriculture, industrial research, etc. A daily-life example is the popular Sudoku puzzle, which is a special case of a Latin square. The main assumption is that there is no interaction between the treatment, row, and column effects.


The general model is defined as
\[Y_{ijk}=\mu+\alpha_i+\beta_j+\tau_k +\varepsilon_{ijk}\]

where $i=1,2,\cdots,t; j=1,2,\cdots,t$ and $k=1,2,\cdots,t$ with $t$ treatments, $t$ rows and $t$ columns,
$\mu$ is the overall mean (general mean) based on all of the observations,
$\alpha_i$ is the effect of the $i$th row,
$\beta_j$ is the effect of the $j$th column,
$\tau_k$ is the effect of the $k$th treatment, and
$\varepsilon_{ijk}$ is the corresponding error term, which is assumed to be independently and normally distributed with mean zero and constant variance, i.e., $\varepsilon_{ijk}\sim N(0, \sigma^2)$.
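
One common way to fit such a model is with R's `aov()` function. The sketch below builds an illustrative $4 \times 4$ cyclic Latin square with a simulated response; column names such as `Row`, `Column`, and `Treatment` are our own choices for this example, not required names.

```r
# Fit a Latin Square Design model: y = mu + row + column + treatment + error
# (illustrative data: t = 4 rows, columns, and treatments)
set.seed(2024)

lsd <- expand.grid(Column = factor(1:4), Row = factor(1:4))
# a standard cyclic 4 x 4 Latin square of treatments, written one row of the square at a time
lsd$Treatment <- factor(c("A","B","C","D",
                          "B","C","D","A",
                          "C","D","A","B",
                          "D","A","B","C"))
lsd$y <- rnorm(16, mean = 50, sd = 3)   # simulated response

fit <- aov(y ~ Row + Column + Treatment, data = lsd)
summary(fit)                            # ANOVA table for the LSD
```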

Latin Square Designs Experimental Layout

Suppose we have 4 treatments (namely: $A, B, C$, and $D$), then it means that we have

Number of Treatments = Number of Rows = Number of Columns =4

The Latin Square Design layout can be, for example:

        Column 1        Column 2        Column 3        Column 4
Row 1   A ($Y_{111}$)   B ($Y_{122}$)   C ($Y_{133}$)   D ($Y_{144}$)
Row 2   B ($Y_{212}$)   C ($Y_{223}$)   D ($Y_{234}$)   A ($Y_{241}$)
Row 3   C ($Y_{313}$)   D ($Y_{324}$)   A ($Y_{331}$)   B ($Y_{342}$)
Row 4   D ($Y_{414}$)   A ($Y_{421}$)   B ($Y_{432}$)   C ($Y_{443}$)

The numbers in the subscript represent the row, column (block), and treatment number, respectively. For example, $Y_{421}$ denotes the observation on the first treatment ($A$) in the 4th row and the 2nd column (block).


Benefits of using Latin Square Designs

  • Efficiency: It allows the examination of multiple factors (treatments) within a single experiment, reducing the time and resources needed.
  • Controlling Variability: By ensuring a balanced distribution of treatments across rows and columns, one can effectively control for two sources of variation that might otherwise influence the results.

Limitations

The following limitations need to be considered:

  • Number of Treatments: The number of rows and columns in the Latin square must be equal to the number of treatments. This means it works best with a small to moderate number of treatments.
  • Interaction Effects: Latin squares are good for analyzing the main effects of different factors, but they cannot account for interaction effects between those factors.
