Autocorrelation in Time Series Data (2015)

This post is about autocorrelation (also called serial correlation) in time series data. The autocorrelation function is a diagnostic tool that helps describe the evolution of a process through time. Inference based on the autocorrelation function is often called analysis in the time domain.

Autocorrelation of a random process is the measure of correlation (relationship) between observations at different distances apart. These coefficients (correlation or autocorrelation) often provide insight into the probability model which generated the data. One can say that autocorrelation is a mathematical tool for finding repeating patterns in the data series.

The detection of autocorrelation in time series data is usually used for the following two purposes:

  1. Help to detect non-randomness in data (usually only the first, i.e. lag 1, autocorrelation is examined)
  2. Help in identifying an appropriate time series model if the data are not random (autocorrelation is usually plotted for many lags)

For simple correlation, suppose there are $n$ pairs of observations on two variables $x$ and $y$. The usual correlation coefficient (Pearson’s coefficient of correlation) is

\[r=\frac{\sum(x_i-\overline{x})(y_i-\overline{y})}{\sqrt{\sum (x_i-\overline{x})^2 \sum (y_i-\overline{y})^2 }}\]

A similar idea can be used in time series to see whether successive observations are correlated. Given $n$ observations $x_1, x_2, \cdots, x_n$ on a discrete time series, we can form $n-1$ pairs of observations, namely $(x_1, x_2), (x_2, x_3), \cdots, (x_{n-1}, x_n)$. In each pair, the first observation is treated as one variable ($x_t$) and the second observation as the second variable ($x_{t+1}$). The correlation coefficient between $x_t$ and $x_{t+1}$ is then

\[r_1=\frac{ \sum_{t=1}^{n-1} (x_t- \overline{x}_{(1)} ) (x_{t+1}-\overline{x}_{(2)})  }    {\sqrt{ [\sum_{t=1}^{n-1} (x_t-\overline{x}_{(1)})^2] [ \sum_{t=1}^{n-1} (x_{t+1}-\overline{x}_{(2)})^2 ] } }\]

where

$\overline{x}_{(1)}=\sum_{t=1}^{n-1} \frac{x_t}{n-1}$ is the mean of first $n-1$ observations

$\overline{x}_{(2)}=\sum_{t=2}^{n} \frac{x_t}{n-1}$ is the mean of last $n-1$ observations

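Note that the observations are assumed to be equally spaced in time (equi-spaced).

The coefficient $r_1$ is called the autocorrelation or serial correlation coefficient at lag 1. For large $n$, both $\overline{x}_{(1)}$ and $\overline{x}_{(2)}$ are close to the overall mean $\overline{x}=\sum_{t=1}^{n} \frac{x_t}{n}$, so $r_1$ is approximately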

\[r_1=\frac{\frac{\sum_{t=1}^{n-1} (x_t-\overline{x})(x_{t+1}-\overline{x}) }{n-1}}{ \frac{\sum_{t=1}^n (x_t-\overline{x})^2}{n}}\]

or, since $\frac{n}{n-1}$ is close to 1 for large $n$, simply

\[r_1=\frac{\sum_{t=1}^{n-1} (x_t-\overline{x})(x_{t+1}-\overline{x}) } { \sum_{t=1}^n (x_t-\overline{x})^2}\]

For observations $k$ steps apart, i.e. at lag $k$, the autocorrelation coefficient is

\[r_k=\frac{\sum_{t=1}^{n-k} (x_t-\overline{x})(x_{t+k}-\overline{x}) } { \sum_{t=1}^n (x_t-\overline{x})^2}\]

An $r_k$ value outside $\pm \frac{2}{\sqrt{n} }$ is significantly different from zero (at approximately the 5% level) and indicates autocorrelation at lag $k$.
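As a minimal sketch of the $r_k$ formula and the $\pm 2/\sqrt{n}$ rule above, the following Python computes the autocorrelations of a short made-up series. The function name `autocorr` and the data are illustrative assumptions, not part of the original post.

```python
import math

def autocorr(x, k):
    """Lag-k sample autocorrelation r_k, using the overall mean of the series."""
    n = len(x)
    if not 0 <= k < n:
        raise ValueError("lag k must satisfy 0 <= k < n")
    mean = sum(x) / n
    # Numerator: products of deviations k steps apart (n - k terms)
    num = sum((x[t] - mean) * (x[t + k] - mean) for t in range(n - k))
    # Denominator: total sum of squared deviations (n terms)
    den = sum((v - mean) ** 2 for v in x)
    return num / den

# Made-up series that repeats every 4 observations (illustration only)
series = [1, 2, 3, 2] * 6
bound = 2 / math.sqrt(len(series))  # approximate significance bound

for k in range(1, 5):
    r = autocorr(series, k)
    flag = "significant" if abs(r) > bound else "not significant"
    print(f"lag {k}: r_k = {r:+.3f} ({flag})")
```

For this series the coefficients at lags 2 and 4 fall outside the bounds, reflecting the period-4 pattern, while those at lags 1 and 3 are zero.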

[Figure: Patterns of Autocorrelation and Non-Autocorrelation]

Applications of Autocorrelation in Time Series

There are several applications of autocorrelation in Time Series Data. Some of them are described below.

  • Autocorrelation analysis is widely used in fluorescence correlation spectroscopy.
  • Autocorrelation is used to measure optical spectra and the very short-duration light pulses produced by lasers.
  • Autocorrelation is used to analyze dynamic light scattering data to determine the particle size distribution of nanometer-sized particles in a fluid. A laser shining into the mixture produces a speckle pattern. The autocorrelation of the signal can be analyzed in terms of the diffusion of the particles. From this, knowing the fluid viscosity, the sizes of the particles can be calculated.
  • The small-angle X-ray scattering intensity of a nano-structured system is the Fourier transform of the spatial autocorrelation function of the electron density.
  • In optics, normalized autocorrelations and cross-correlations give the degree of coherence of an electromagnetic field.
  • In signal processing, autocorrelation can provide information about repeating events such as musical beats or pulsar frequencies, but it cannot tell the position in time of the beat. It can also be used to estimate the pitch of a musical tone.
  • In music recording, autocorrelation is used as a pitch detection algorithm before vocal processing, as a distortion effect or to eliminate undesired mistakes and inaccuracies.
  • In statistics, spatial autocorrelation between sample locations also helps one estimate mean value uncertainties when sampling a heterogeneous population.
  • In astrophysics, autocorrelation is used to study and characterize the spatial distribution of galaxies in the Universe and multi-wavelength observations of Low Mass X-ray Binaries.
  • In the analysis of Markov chain Monte Carlo data, autocorrelation must be taken into account for correct error determination.


Completely Randomized Design (CRD)

Introduction to Completely Randomized Design (CRD)

The completely randomized design (CRD) is the simplest and least restrictive experimental design. Treatments are assigned to the experimental units entirely at random, so every unit has an equal chance of receiving any treatment, any number of treatments can be accommodated in the plan, and the number of replications may be equal or unequal across treatments. Because the randomization is subject to no restriction, the CRD is known as an unrestricted design; it has one primary factor. Its analysis is a one-way analysis of variance.

Example of CRD

There are three treatments named $A, B$, and $C$ placed randomly in different experimental units.

C  A  C
B  A  A
B  B  C

We can see from the table above that:

  • A treatment may or may not be repeated in adjacent units
  • The only source of variation is the treatment
  • A specific treatment does not need to fall in a specific unit.
  • There are three treatments, each appearing three times, so P(A)=P(B)=P(C)=3/9.
  • Here each treatment appears an equal number of times (a balanced design); in general the replication may be unequal (unbalanced)
  • The total number of experimental units is 9.
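A layout like the one above can be generated by shuffling the list of treatment labels. The sketch below is one simple way to do this in Python; the function name `crd_layout`, its parameters, and the seed value are hypothetical choices made for illustration.

```python
import random

def crd_layout(treatments, reps, seed=None):
    """Randomly assign treatments to experimental units.

    treatments: list of treatment labels
    reps: number of replications per treatment (may be unequal)
    """
    rng = random.Random(seed)
    # One entry per experimental unit, then shuffle without any restriction
    plan = [t for t, r in zip(treatments, reps) for _ in range(r)]
    rng.shuffle(plan)
    return plan

# Balanced case: three treatments, three replications each, 9 units
layout = crd_layout(["A", "B", "C"], [3, 3, 3], seed=1)
for row in range(3):
    print("  ".join(layout[3 * row: 3 * row + 3]))
```

Passing unequal values in `reps`, for example `[2, 4]`, gives an unbalanced design with the same code.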

Some Advantages of Completely Randomized Design (CRD)

  1. The main advantage of this design is that the analysis of data remains simple even if some units fail to respond for any reason (missing observations).
  2. Another advantage of this design is that it provides the maximum degrees of freedom for error.
  3. This design is mostly used in laboratory experiments where all the other factors are under the control of the researcher. For example, in a tube experiment, CRD is best because all the factors are under control.

An assumption of the completely randomized design (CRD) is that the observations within each level of the factor are independent of one another.

Statistical Model of CRD

The general model with one factor can be defined as

\[Y_{ij}=\mu + \eta_i +e_{ij}\]

where $i=1,2,\cdots,t$ and $j=1,2,\cdots, r_i$, with $t$ treatments and $r_i$ replications of the $i$th treatment. $\mu$ is the overall mean, $\eta_i$ is the effect of the $i$th treatment, and $e_{ij}$ is the corresponding error term, which is assumed to be independent and normally distributed with mean zero and constant variance.
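Under this model, the one-way analysis of variance compares the variation between treatment means with the variation within treatments. The following is a minimal sketch of that F statistic in plain Python, allowing unequal $r_i$; the function name `one_way_anova` and the response values are made up for illustration.

```python
def one_way_anova(groups):
    """Return (F, df_treatment, df_error) for a list of treatment groups."""
    all_obs = [y for g in groups for y in g]
    n, t = len(all_obs), len(groups)
    grand_mean = sum(all_obs) / n
    # Between-treatment sum of squares: r_i * (treatment mean - grand mean)^2
    ss_treat = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    # Within-treatment (error) sum of squares
    ss_error = sum((y - sum(g) / len(g)) ** 2 for g in groups for y in g)
    df_treat, df_error = t - 1, n - t
    f_stat = (ss_treat / df_treat) / (ss_error / df_error)
    return f_stat, df_treat, df_error

# Hypothetical responses for treatments A, B, and C (3 replications each)
data = {"A": [20, 22, 24], "B": [25, 27, 29], "C": [30, 32, 34]}
f_stat, df1, df2 = one_way_anova(list(data.values()))
print(f"F = {f_stat:.2f} on {df1} and {df2} degrees of freedom")
```

With these made-up numbers the treatment mean square is 75 and the error mean square is 4, so F = 18.75 on 2 and 6 degrees of freedom.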

Importance of CRD

  • Simplicity: CRD is the easiest design to implement. Treatments are assigned entirely at random with no complex layout, which makes the design manageable for beginners.
  • Fairness: Randomization ensures each experimental unit has an equal chance of receiving any treatment. The randomization reduces the bias and strengthens the validity of the comparisons between treatments.
  • Flexibility: CRD can accommodate a wide range of experiments with different numbers of treatments and replicates. One can also adjust the design to fit the specific needs.
  • Data Analysis: CRD has the simplest statistical analysis of all the designs. This makes it easier to interpret the results and draw conclusions from the experiment.
  • Efficiency: CRD allows for utilizing the entire experimental material, maximizing the data collected.

When CRD is a Good Choice

  • Laboratory experiments: Due to the controlled environment, CRD works well for isolating the effects of a single factor in lab settings.
  • Limited treatments: If there are a small number of treatment groups, CRD is a manageable and efficient option.
  • Initial investigations: CRD can be a good starting point for initial explorations of a factor’s effect before moving on to more complex designs.

Summary

The advantages and importance of CRD make it a valuable starting point for many experiments, particularly in controlled laboratory settings. However, it is important to consider limitations like the assumption of homogeneous experimental units, which might not always be realistic in field experiments.


Sampling Theory, Introduction, and Reasons to Sample (2015)

Introduction to Sampling Theory

Often we are interested in drawing valid conclusions (inferences) about a large group of individuals or objects (called a population in statistics). Instead of examining the entire population, which may be difficult or even impossible, we may examine only a small part of it, called a sample. Our objective is to draw valid inferences about certain facts about the population from the results found in the sample, a process known as statistical inference. The process of obtaining samples is called sampling, and the theory concerning sampling is called sampling theory.

Example

We may wish to draw conclusions about the percentage of defective bolts produced in a factory during a given 6-day week by examining 20 bolts each day, produced at various times during the day. All bolts produced during the week comprise the population, while the 120 bolts selected over the 6 days constitute a sample.
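The bolt example can be turned into a small calculation. The sketch below estimates the population proportion of defective bolts from the sample; the daily defect counts are hypothetical, and the standard error formula assumes the 120 bolts behave like a simple random sample, which is a simplification of the day-by-day scheme described above.

```python
import math

# Hypothetical number of defective bolts among the 20 inspected each day
defects_per_day = [1, 0, 2, 1, 0, 2]
bolts_per_day = 20

n = bolts_per_day * len(defects_per_day)   # sample size: 6 days x 20 bolts
p_hat = sum(defects_per_day) / n           # sample proportion of defectives
# Standard error under the simple-random-sample assumption
se = math.sqrt(p_hat * (1 - p_hat) / n)

print(f"sample size = {n}")
print(f"estimated defective proportion = {p_hat:.3f} (standard error {se:.3f})")
```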

Sampling theory is widely used in business, medical, social, and psychological research for gathering information about a population. The sampling process comprises several stages:

  • Defining the population of concern
  • Specifying the sampling frame (set of items or events possible to measure)
  • Specifying a sampling method for selecting the items or events from the sampling frame
  • Determining the appropriate sample size
  • Implementing the sampling plan
  • Sampling and data collecting
  • Data that can be selected

Reasons to Study a Sample

When studying the characteristics of a population, there are many reasons to study a sample (drawn from the population under study) instead of the entire population such as:

  1. Time: Contacting every individual in the whole population takes too long.
  2. Cost: The cost of studying all the items (objects or individuals) in a population may be prohibitive.
  3. Physically Impossible: Some populations are infinite, or so large that it is physically impossible to check every item, such as populations of fish, birds, snakes, and mosquitoes. Similarly, it is difficult to study populations that are constantly moving, being born, or dying.
  4. Destructive Nature of Items: Some items are destroyed by the test itself. For example, a steel wire is stretched until it breaks and the breaking point is recorded to verify a minimum tensile strength. Similarly, many electric and electronic components are destroyed during testing. Testing the entire population would therefore destroy it.
  5. Qualified and Expert Staff: Enumeration requires highly qualified and expert staff, who are not always available. National and international research organizations and agencies hire staff for enumeration purposes, which is sometimes costly and time-consuming (as a rehearsal of the activity is required), and it is not always easy to recruit highly qualified staff.
  6. Reliability: With a scientific sampling technique the sampling error can be minimized, and the non-sampling error in a sample survey is also smaller because qualified investigators are employed.

Summary

Every sampling system is used to obtain estimates of certain properties of the population under study. The sampling system should be judged by how good these estimates are. An individual estimate may, by chance, be very close to the true value (the population parameter) or differ greatly from it, so a single estimate gives a poor measure of the merits of the system.

A sampling system is better judged by the frequency distribution of many estimates obtained by repeated sampling. A good system gives a frequency distribution with a small variance and a mean equal to the true value.
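The idea of judging a sampling system by repeated sampling can be illustrated with a small simulation. This is a sketch under stated assumptions: the population is artificial (normal with mean 50 and standard deviation 10), the estimator is the sample mean under simple random sampling, and the function name `sampling_distribution` and the seed are arbitrary choices.

```python
import random
import statistics

rng = random.Random(42)

# Artificial population, generated only for illustration
population = [rng.gauss(50, 10) for _ in range(10_000)]
true_mean = statistics.fmean(population)

def sampling_distribution(pop, sample_size, repeats, rng):
    """Sample means from repeated simple random samples without replacement."""
    return [statistics.fmean(rng.sample(pop, sample_size)) for _ in range(repeats)]

small = sampling_distribution(population, 10, 2000, rng)
large = sampling_distribution(population, 100, 2000, rng)

print(f"true mean: {true_mean:.2f}")
for label, est in (("n = 10", small), ("n = 100", large)):
    print(f"{label}: mean of estimates = {statistics.fmean(est):.2f}, "
          f"variance of estimates = {statistics.variance(est):.2f}")
```

Both sets of estimates should centre on the true mean, while the larger sample size should give the smaller variance, which is the sense in which it is the better system.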


Design of Experiments Overview (2015)

Objectives of Design of Experiments

An experiment is usually a test, a trial, or a series of tests. The objective of the experiment may be either

  1. Confirmation
  2. Exploration

Designing an experiment means providing a plan and an actual procedure for laying out the experiment. It is the design of any information-gathering exercise where variation is present, whether or not it is under the full control of the experimenter. The experimenter is often interested in the effect of some process or intervention (the treatment) on some objects (the experimental units) such as people, parts of people, groups of people, plants, or animals. Experimental design is thus an efficient procedure for planning experiments so that the data obtained can be analyzed to yield objective conclusions.

In an observational study, the researchers observe individuals and measure variables of interest but do not attempt to influence the response variable, while in an experimental study, the researchers deliberately impose some treatment on individuals and then observe the response variables. When the goal is to demonstrate cause and effect, the experiment is the only source of convincing data.


Statistical Design

By statistical experimental design we mean the process of planning the experiment so that appropriate data will be collected, which can then be analyzed by statistical methods to yield valid and objective conclusions. Thus there are two aspects to any experimental problem:

  1. The design of the experiments
  2. The statistical analysis of the data

Many experimental designs differ from each other primarily in the way the experimental units are classified before the treatments are applied.

Design of Experiments (DOE) helps in

  • Identifying the relationships between cause and effect
  • Providing some understanding of interactions among causative factors
  • Determining the level at which to set the controllable factors to optimize reliability
  • Minimizing the experimental error i.e., noise
  • Improving the robustness of the design or process to variation
