# Autocorrelation Time Series Data

Autocorrelation (also called serial correlation) is a diagnostic tool that helps to describe the evolution of a process through time. Inference based on the autocorrelation function is often called analysis in the time domain.

The autocorrelation of a random process is the measure of correlation (relationship) between observations at different distances apart. These coefficients (correlations or autocorrelations) often provide insight into the probability model that generated the data. One can say that autocorrelation is a mathematical tool for finding repeating patterns in a data series.

Autocorrelation is usually used for the following two purposes:

1. To help detect non-randomness in data (the first, i.e., lag 1, autocorrelation is computed)
2. To help identify an appropriate time series model if the data are not random (autocorrelations are usually plotted for many lags)

For simple correlation, suppose there are $n$ pairs of observations on two variables $x$ and $y$; then the usual correlation coefficient (Pearson's coefficient of correlation) is

$r=\frac{\sum(x_i-\overline{x})(y_i-\overline{y})}{\sqrt{\sum (x_i-\overline{x})^2 \sum (y_i-\overline{y})^2 }}$
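As a quick illustration, this formula can be computed in plain Python (the paired data below are invented for the example):

```python
def pearson_r(x, y):
    """Pearson's coefficient of correlation for n paired observations."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    den = (sum((xi - mx) ** 2 for xi in x) * sum((yi - my) ** 2 for yi in y)) ** 0.5
    return num / den

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 6.0, 8.0, 10.0]  # y is an exact linear function of x, so r = 1
```

Since $y$ here is a perfect linear function of $x$, `pearson_r(x, y)` equals 1; a perfectly decreasing relationship would give $-1$.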

A similar idea can be applied to a time series to see whether successive observations are correlated. Given $n$ observations $x_1, x_2, \cdots, x_n$ on a discrete time series, we can form $(n-1)$ pairs of observations: $(x_1, x_2), (x_2, x_3), \cdots, (x_{n-1}, x_n)$. In each pair, the first observation is treated as one variable ($x_t$) and the second as another variable ($x_{t+1}$). The correlation coefficient between $x_t$ and $x_{t+1}$ is then

$r_1=\frac{ \sum_{t=1}^{n-1} (x_t- \overline{x}_{(1)} ) (x_{t+1}-\overline{x}_{(2)}) } {\sqrt{ [\sum_{t=1}^{n-1} (x_t-\overline{x}_{(1)})^2] [ \sum_{t=1}^{n-1} (x_{t+1}-\overline{x}_{(2)})^2 ] } }$

where

$\overline{x}_{(1)}=\sum_{t=1}^{n-1} \frac{x_t}{n-1}$ is the mean of the first $n-1$ observations

$\overline{x}_{(2)}=\sum_{t=2}^{n} \frac{x_t}{n-1}$ is the mean of the last $n-1$ observations

Note that the observations are assumed to be equally spaced (equi-spaced) in time.

This quantity is called the autocorrelation, or serial correlation, coefficient. For large $n$, $r_1$ is approximately

$r_1=\frac{\frac{\sum_{t=1}^{n-1} (x_t-\overline{x})(x_{t+1}-\overline{x}) }{n-1}}{ \frac{\sum_{t=1}^n (x_t-\overline{x})^2}{n}}$

or

$r_1=\frac{\sum_{t=1}^{n-1} (x_t-\overline{x})(x_{t+1}-\overline{x}) } { \sum_{t=1}^n (x_t-\overline{x})^2}$

For observations $k$ time units apart, i.e., at lag $k$,

$r_k=\frac{\sum_{t=1}^{n-k} (x_t-\overline{x})(x_{t+k}-\overline{x}) } { \sum_{t=1}^n (x_t-\overline{x})^2}$

An $r_k$ value outside $\pm \frac{2}{\sqrt{n}}$ indicates a significant difference from zero, i.e., significant autocorrelation at lag $k$.
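The lag-$k$ autocorrelation and its significance bound can be sketched in plain Python directly from the formula above (the data series is made up for illustration):

```python
def acf(x, max_lag):
    """Sample autocorrelation r_k for k = 1..max_lag, using the large-n formula."""
    n = len(x)
    mean = sum(x) / n
    denom = sum((v - mean) ** 2 for v in x)  # sum of (x_t - xbar)^2 over all t
    return [
        sum((x[t] - mean) * (x[t + k] - mean) for t in range(n - k)) / denom
        for k in range(1, max_lag + 1)
    ]

data = [2.0, 4.0, 6.0, 4.0, 2.0, 4.0, 6.0, 4.0, 2.0, 4.0, 6.0, 4.0]  # period-4 pattern
r = acf(data, 4)
bound = 2 / len(data) ** 0.5  # +/- 2/sqrt(n) significance bound
```

Because the series repeats every 4 observations, $r_2$ comes out strongly negative and $r_4$ strongly positive, while $r_1$ and $r_3$ are zero; comparing each $r_k$ against `bound` flags the non-random pattern.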

## Application of Autocorrelation

• Autocorrelation analysis is widely used in fluorescence correlation spectroscopy.
• Autocorrelation is used to measure optical spectra and the very-short-duration light pulses produced by lasers.
• Autocorrelation is used to analyze dynamic light scattering data to determine the particle size distributions of nanometer-sized particles in a fluid. A laser shining into the mixture produces a speckle pattern; the autocorrelation of the signal can be analyzed in terms of the diffusion of the particles, and, knowing the fluid viscosity, the sizes of the particles can be calculated.
• The small-angle X-ray scattering intensity of a nano-structured system is the Fourier transform of the spatial autocorrelation function of the electron density.
• In optics, normalized autocorrelations and cross-correlations give the degree of coherence of an electromagnetic field.
• In signal processing, autocorrelation can provide information about repeating events such as musical beats or pulsar frequencies, but it cannot tell the position in time of the beat. It can also be used to estimate the pitch of a musical tone.
• In music recording, autocorrelation is used as a pitch detection algorithm prior to vocal processing, as a distortion effect or to eliminate undesired mistakes and inaccuracies.
• In statistics, spatial autocorrelation between sample locations also helps one estimate mean value uncertainties when sampling a heterogeneous population.
• In astrophysics, auto-correlation is used to study and characterize the spatial distribution of galaxies in the Universe and in multi-wavelength observations of Low Mass X-ray Binaries.
• In analysis of Markov chain Monte Carlo data, autocorrelation must be taken into account for correct error determination.


# Completely Randomized Design (CRD)

The simplest and least restrictive experimental design, in which each treatment has an equal chance of being assigned to any experimental unit and the number of replications per treatment may be equal or unequal, is known as a completely randomized design (CRD). In this regard, the design is called unrestricted (a design without any blocking condition) and has one primary factor. Its analysis is generally known as one-way analysis of variance.

Suppose we have three treatments, named A, B, and C, placed randomly in different experimental units:

 C A C B A A B B C

We can see from the layout above that:

• There may or may not be repetition of a treatment in adjacent units.
• The only source of variation is the treatment.
• It is not necessary that a specific treatment falls in a specific unit.
• There are three treatments, each appearing three times, so P(A)=P(B)=P(C)=3/9.
• Each treatment appears an equal number of times (replication may also be unequal, i.e., unbalanced).
• The total number of experimental units is 9.

### Some Advantages of Completely Randomized Design (CRD)

1. The main advantage of this design is that the analysis of the data remains simple even if some units fail to respond for any reason.
2. Another advantage of this design is that it provides the maximum degrees of freedom for error.
3. This design is mostly used in laboratory experiments where all the other factors are under the control of the researcher. For example, in a test-tube experiment CRD is best because all factors are under control.

An assumption regarding the completely randomized design (CRD) is that the observations within each level of a factor are independent of each other.

The general model with one factor can be defined as

$Y_{ij}=\mu + \eta_i +e_{ij}$

where $i=1,2,\cdots,t$ and $j=1,2,\cdots, r_i$, with $t$ treatments and $r_i$ replications of the $i$th treatment. $\mu$ is the overall mean based on all observations, $\eta_i$ is the effect of the $i$th treatment, and $e_{ij}$ is the corresponding error term, assumed to be independent and normally distributed with mean zero and constant variance.
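Under this model, the CRD analysis is a one-way ANOVA. A minimal sketch in plain Python (the treatment data below are hypothetical, invented only to show the computation; real analyses would typically use a statistics package):

```python
# One-way ANOVA for a CRD, following the model Y_ij = mu + eta_i + e_ij.
# Three hypothetical treatments; unequal replication would also be allowed.
groups = {
    "A": [20.0, 22.0, 19.0],
    "B": [25.0, 27.0, 26.0],
    "C": [18.0, 17.0, 19.0],
}

all_obs = [y for ys in groups.values() for y in ys]
n, t = len(all_obs), len(groups)
grand_mean = sum(all_obs) / n

# Between-treatment sum of squares, with df = t - 1
ss_trt = sum(len(ys) * (sum(ys) / len(ys) - grand_mean) ** 2 for ys in groups.values())
# Within-treatment (error) sum of squares, with df = n - t
ss_err = sum((y - sum(ys) / len(ys)) ** 2 for ys in groups.values() for y in ys)

f_stat = (ss_trt / (t - 1)) / (ss_err / (n - t))  # compare against F(t-1, n-t)
```

A large `f_stat` relative to the $F(t-1,\, n-t)$ distribution indicates that the treatment effects $\eta_i$ are not all zero.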

Read from Wikipedia: Completely Randomized Design (CRD)


# Sampling theory, Introduction and Reasons to Sample

Often we are interested in drawing valid conclusions (inferences) about a large group of individuals or objects (called a population in statistics). Instead of examining the entire group (the population, which may be difficult or even impossible to examine), we may examine only a small part (portion) of it. Our objective is to draw valid inferences about certain facts of the population from the results found in the sample, a process known as statistical inference. The process of obtaining samples is called sampling, and the theory concerning sampling is called sampling theory.

Example: We may wish to draw conclusions about the percentage of defective bolts produced in a factory during a given 6-day week by examining 20 bolts each day, produced at various times during the day. Note that all the bolts produced during the week comprise the population, while the 120 bolts selected over the 6 days constitute a sample.
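The bolt example can be sketched in Python; the population size and defect rate below are hypothetical, chosen only to illustrate how a sample proportion estimates the population proportion:

```python
import random

random.seed(1)  # fixed seed so the illustration is reproducible

# Hypothetical population: 6000 bolts produced during the 6-day week,
# of which 300 (5%) are defective (1 = defective, 0 = good).
population = [1] * 300 + [0] * 5700

# Sample 20 bolts per day for 6 days: 120 bolts in total.
sample = random.sample(population, 20 * 6)

estimate = sum(sample) / len(sample)  # sample proportion of defectives
```

The sample proportion estimates the true 5% defect rate; different samples give slightly different estimates, which is exactly the sampling variability that sampling theory studies.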

In business, medical, social, and psychological research, sampling theory is widely used for gathering information about a population. The sampling process comprises several stages:

• Defining the population of concern
• Specifying the sampling frame (set of items or events possible to measure)
• Specifying a sampling method for selecting the items or events from the sampling frame
• Determining the appropriate sample size
• Implementing the sampling plan
• Sampling and data collecting
• Reviewing the sampling process

When studying the characteristics of a population, there are many reasons to study a sample (drawn from the population under study) instead of the entire population, such as:

1. Time: it is difficult and time-consuming to contact each and every individual of the whole population.
2. Cost: the cost or expense of studying all the items (objects or individuals) in a population may be prohibitive.
3. Physically impossible: some populations are infinite, so it is physically impossible to check all items in the population, such as populations of fish, birds, snakes, or mosquitoes. Similarly, it is difficult to study populations that are constantly moving, being born, or dying.
4. Destructive nature of items: some items or objects are destroyed during testing (or checking); for example, a steel wire is stretched until it breaks, and the breaking point is recorded as its tensile strength. Similarly, many electric and electronic components are destroyed during testing, making it impossible to study the entire population, as time, cost, and the destructive nature of the items all prohibit it.
5. Qualified and expert staff: enumeration requires highly qualified and expert staff, which is sometimes impossible to obtain. Hiring staff through national and international research organizations and agencies is sometimes costly, needs more time (as rehearsal of the activity is required), and it is not always easy to recruit highly qualified staff.
6. Reliability: using a scientific sampling technique, the sampling error can be minimized, and the non-sampling error committed in a sample survey is also small, because qualified investigators are included.

Every sampling system is used to obtain some estimates having certain properties of the population under study. The sampling system should be judged by how good the estimates obtained are. Individual estimates, by chance, may be very close or may differ greatly from the true value (population parameter) and may give a poor measure of the merits of the system.

A sampling system is better judged by frequency distribution of many estimates obtained by repeated sampling, giving a frequency distribution having small variance and mean estimate equal to the true value.


# Design of Experiments Overview

An experiment is usually a test or trial or series of tests. The objective of the experiment may either be

1. Confirmation
2. Exploration

Designing an experiment means providing a plan and the actual procedure for laying out the experiment. It is the design of any information-gathering exercise where variation is present, whether or not it is under the control of the experimenter. In the design of experiments, the experimenter is often interested in the effect of some process or intervention (the treatment) on some objects (the experimental units), such as people, parts of people, groups of people, plants, or animals. So the design of experiments is an efficient procedure for planning experiments so that the data obtained can be analyzed to yield valid and objective conclusions.

In an observational study, the researchers observe individuals and measure variables of interest but do not attempt to influence the response variable, while in an experimental study, the researchers deliberately (purposely) impose some treatment on individuals and then observe the response variables. When the goal is to demonstrate cause and effect, an experiment is the only source of convincing data.

## Statistical Design

By the statistical design of experiments, we refer to the process of planning the experiment so that appropriate data will be collected, which may then be analyzed by statistical methods, resulting in valid and objective conclusions. Thus there are two aspects to any experimental problem:

1. The design of the experiments
2. The statistical analysis of the data

There are many experimental designs, which differ from each other primarily in the way in which the experimental units are classified before the application of the treatments.

## Design of experiment (DOE) helps in

• Identifying the relationships between cause and effect
• Providing some understanding of interactions among causative factors
• Determining the level at which to set the controllable factors in order to optimize reliability
• Minimizing the experimental error i.e., noise
• Improving the robustness of the design or process to variation


# The Level of Measurements

In statistics, data can be classified according to the level of measurement, which dictates the calculations that can be done to summarize and present the data (graphically) and helps to determine which statistical tests should be performed. For example, suppose there are six colors of candies in a bag and you assign a different number (code) to each color, such that a brown candy has the value 1, yellow 2, green 3, orange 4, blue 5, and red 6. Adding all the assigned color values for a bag of candies and then dividing by the number of candies might yield an average value of 3.68. Does this mean that the average color is green or orange? Of course not. When computing statistics, it is important to recognize the data type, which may be qualitative (nominal or ordinal) or quantitative (interval or ratio).
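As a sketch of why such an average is meaningless (the candy counts below are invented for illustration), compare the mean of the codes with their mode, which, unlike the mean, is a valid summary for nominal data:

```python
from collections import Counter

# Hypothetical bag of 25 candies, coded brown=1, yellow=2, green=3,
# orange=4, blue=5, red=6.
codes = [1, 6, 4, 2, 6, 3, 5, 1, 4, 6, 2, 5, 3, 6, 4, 6, 5, 1, 6, 2, 4, 6, 3, 5, 6]

mean_code = sum(codes) / len(codes)  # arithmetic mean of nominal codes: meaningless
mode_code = Counter(codes).most_common(1)[0][0]  # mode: the most frequent color code
```

The mean of the codes (4.08 here) does not correspond to any "average color", whereas the mode (6, i.e., red) meaningfully reports the most common category.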

The levels of measurement have been developed in conjunction with the concepts of numbers and units of measurement. Statisticians classify measurements according to levels. There are four levels of measurement, namely nominal, ordinal, interval, and ratio, described below.

## Nominal Level of Measurement

In the nominal level of measurement, the observations of a qualitative variable can only be classified and counted. There is no particular order to the categories. The mode, frequency table, pie chart, and bar graph are usually used for this level of measurement.

## Ordinal Level of Measurement

In the ordinal level of measurement, data classifications are represented by sets of labels or names that have relative values (a ranking or ordering of values). For example, suppose you survey 1,000 people and ask them to rate a restaurant on a scale from 0 to 5, where 5 is the highest liking level and 0 the lowest. Taking the average of these 1,000 responses will have meaning. Usually graphs and charts are drawn for ordinal data.

## Interval Level of Measurement

Numbers are also used to express quantities; temperature and dress size, for example, are quantities. The interval level of measurement allows for the degree of difference between items but not the ratio between them. There are meaningful differences between values: the difference between 10 degrees Fahrenheit and 15 degrees is 5 degrees, and the difference between 50 and 55 degrees is also 5 degrees. It is also important that zero is just a point on the scale; it does not represent the absence of heat.

## Ratio Level of Measurement

All quantitative data are recorded on the ratio level. It has all the characteristics of the interval level, but in addition the zero point is meaningful and the ratio between two numbers is meaningful. Examples of the ratio level are wages, units of production, weight, changes in stock prices, distance between home and office, and height.

Many inferential test statistics depend on the ratio and interval levels of measurement. Many authors argue that interval and ratio measures should be grouped together and named scale.