Covariance and Correlation (2015)

Introduction to Covariance and Correlation

Covariance and correlation are two closely related measures in statistics. Covariance measures the degree to which two variables co-vary (i.e., vary/change together). If the greater values of one variable (say, $X_i$) correspond with the greater values of the other variable (say, $X_j$), i.e., if the variables tend to show similar behavior, then the covariance between the two variables ($X_i$, $X_j$) will be positive.

Similarly, if the smaller values of one variable correspond with the smaller values of the other variable, the covariance between the two variables will also be positive. In contrast, if the greater values of one variable (say, $X_i$) mainly correspond to the smaller values of the other variable (say, $X_j$), i.e., the two variables tend to show opposite behavior, then the covariance will be negative.

In other words, positive covariance between two variables means they (both of the variables) vary/change together in the same direction relative to their expected values (averages). It means that if one variable moves above its average value, the other variable tends to be above its average value.

Conversely, if the covariance between the two variables is negative, then when one variable is above its expected value, the other tends to be below its expected value. If the covariance is zero, there is no linear dependency between the two variables.

Mathematical Representation of Covariance

Mathematically, the covariance between two random variables $X_i$ and $X_j$ can be represented as
\[COV(X_i, X_j)=E[(X_i-\mu_i)(X_j-\mu_j)]\]
where
$\mu_i=E(X_i)$ is the average of the first variable
$\mu_j=E(X_j)$ is the average of the second variable

\[\begin{aligned}
COV(X_i, X_j)&=E[(X_i-\mu_i)(X_j-\mu_j)]\\
&=E[X_i X_j - X_i E(X_j)-X_j E(X_i)+E(X_i)E(X_j)]\\
&=E(X_i X_j)-E(X_i)E(X_j) - E(X_j)E(X_i)+E(X_i)E(X_j)\\
&=E(X_i X_j)-E(X_i)E(X_j)
\end{aligned}\]


Note that the covariance of a random variable with itself is the variance of that random variable, i.e., $COV(X_i, X_i)=VAR(X_i)$. If $X_i$ and $X_j$ are independent, then $E(X_i X_j)=E(X_i)E(X_j)$ and $COV(X_i, X_j)=E(X_i X_j)-E(X_i) E(X_j)=0$.
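
To make this concrete, the following minimal R sketch (with made-up values for x and y, not data from the post) computes the sample covariance directly from the definition and checks it against R's built-in cov() function.

# Illustrative (hypothetical) data: two variables that tend to move together
x <- c(2, 4, 6, 8, 10)
y <- c(1, 3, 2, 7, 9)
n <- length(x)
# Sample covariance from the definition (dividing by n - 1)
cov_manual <- sum((x - mean(x)) * (y - mean(y))) / (n - 1)
# Built-in sample covariance
cov_builtin <- cov(x, y)
cov_manual
cov_builtin   # both give the same positive value (10 here): larger x values pair with larger y values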

Covariance and Correlation

Correlation and covariance are related, but not equivalent, statistical measures.

Equation of Correlation (Normalized Covariance)

The correlation between two variables (say, $X_i$ and $X_j$) is their normalized covariance, defined as
\[\rho_{i,j}=\frac{E[(X_i-\mu_i)(X_j-\mu_j)]}{\sigma_i \sigma_j}\]
where $\sigma_i$ is the standard deviation of $X_i$ and $\sigma_j$ is the standard deviation of $X_j$. For a sample of $n$ paired observations (writing $X$ for $X_i$ and $Y$ for $X_j$), the computational form of the correlation coefficient is
\[r=\frac{n \sum XY - \sum X \sum Y}{\sqrt{\left(n \sum X^2 -(\sum X)^2\right)\left(n \sum Y^2 - (\sum Y)^2\right)}}\]

Note that correlation is dimensionless, i.e., a number that is free of the measurement units, and its values lie between -1 and +1 inclusive. In contrast, covariance has a unit of measure: the product of the units of the two variables.
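
As a quick check of this normalization, the R sketch below (again with hypothetical values) verifies that the correlation equals the covariance divided by the product of the standard deviations, and that it does not change when the variables are rescaled.

# Hypothetical data
x <- c(2, 4, 6, 8, 10)
y <- c(1, 3, 2, 7, 9)
# Correlation as normalized covariance
r_manual  <- cov(x, y) / (sd(x) * sd(y))
r_builtin <- cor(x, y)
r_manual
r_builtin                # identical values, always between -1 and +1
# Correlation is unit-free: rescaling the variables does not change it
cor(x * 100, y / 10)     # same value as cor(x, y)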

Data Transformation (Variable Transformation)

Data transformation is a rescaling of the data using a function or some mathematical operation on each observation. When data are very strongly skewed (negatively or positively), we sometimes transform them so that they are easier to model. Put another way, if a variable does not fit a normal distribution, one should try a data transformation to satisfy the assumptions of a parametric statistical test.

The most common data transformation is log (or natural log) transformation, which is often applied when most of the data values cluster around zero relative to the larger values in the data set and all of the observations are positive.

Data Transformation Techniques

Variable transformation can also be applied to one or more variables in scatter plots, correlation, and regression analysis to make the relationship between the variables more linear and hence easier to model with a simple method. Common transformations other than the log include the square root, the reciprocal, etc.

Reciprocal Transformation

The reciprocal transformation, $x$ to $\frac{1}{x}$ (or $-\frac{1}{x}$), is a very strong transformation with a drastic effect on the shape of the distribution. It cannot be applied to zero values and, although it can be applied to negative values, it is not useful unless all of the values are positive. The reciprocal also reverses the order among values of the same sign, i.e., the largest becomes the smallest, and so on.

Logarithmic Transformation

The logarithmic transformation, $x$ to $\log_{10}(x)$ (or the natural log, or log base 2), is another strong transformation that affects the shape of the distribution. It is commonly used for reducing right skewness but cannot be applied to zero or negative values.

Square Root Transformation

The square root transformation, $x$ to $x^{\frac{1}{2}}=\sqrt{x}$, has a moderate effect on the distribution shape and is weaker than the logarithm. It can be applied to zero values but not to negative values.
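
The R sketch below (using made-up, right-skewed, strictly positive values) applies the three transformations described above and plots the resulting distributions; it is only an illustration of their relative strength, not a recommendation for any particular data set.

# Hypothetical right-skewed, strictly positive data
x <- c(0.5, 1, 2, 3, 5, 8, 20, 60, 150)
x_log   <- log(x)    # natural log; use log10(x) or log2(x) for other bases
x_sqrt  <- sqrt(x)   # moderate effect; allows zeros but not negatives
x_recip <- 1 / x     # drastic effect; undefined at zero, reverses the order
# Compare the shapes of the original and transformed distributions
par(mfrow = c(2, 2))
hist(x,       main = "Original")
hist(x_log,   main = "Log")
hist(x_sqrt,  main = "Square root")
hist(x_recip, main = "Reciprocal")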

Purpose of Data Transformation

The purpose of data transformation is:

  • Convert data from one format or structure to another (like changing a messy spreadsheet into a table).
  • Clean and prepare data for analysis (fixing errors, inconsistencies, and missing values).
  • Standardize data for easier integration and comparison (making sure all your data uses the same units and formats).

Goals of transformation

The goals of transformation may be

  • one might want to see the data structure differently
  • one might want to reduce the skew that assists in modeling
  • one might want to straighten a nonlinear (curvilinear) relationship in a scatter plot; a transformation may also be used to obtain approximately equal dispersion, making the data easier to handle and interpret

Techniques Used in Data Transformation

There are many techniques used in data transformation; these techniques include:

  • Cleaning and Filtering: Identifying and removing errors, missing values, and duplicates.
  • Data Normalization: Ensuring data consistency across different fields.
  • Aggregation: Summarizing data by combining similar values.

Benefits of Data Transformation

The benefits of data transformation and data cleaning are:

  • Improved data quality: Fewer errors and inconsistencies lead to more reliable results.
  • Easier analysis: Structured data is easier to work with for data analysts and scientists.
  • Better decision-making: Accurate insights from clean data lead to better choices.

Data transformation is a crucial step in the data pipeline, especially in tasks like data warehousing, data integration, and data wrangling.

FAQs about Data Transformation

  • What is data transformation?
  • When is data transformation done?
  • What is the most common data transformation?
  • What is the reciprocal data transformation?
  • When is the reciprocal transformation not useful?
  • What is a logarithmic transformation?
  • When can the logarithmic transformation not be applied to the data?
  • What is the square root transformation?
  • When can the square root transformation not be applied?
  • What is the main purpose of data transformation?
  • What are the goals of transformation?
  • What is data normalization?
  • What is data aggregation?
  • What is cleaning and filtering?
  • What are the benefits of data transformation?


Autocorrelation in Time Series Data (2015)

This post is about autocorrelation in time series data. The autocorrelation (serial correlation) function is a diagnostic tool that helps to describe the evolution of a process through time. Inference based on the autocorrelation function is often called analysis in the time domain.

Autocorrelation of a random process is the measure of correlation (relationship) between observations at different distances apart. These coefficients (correlation or autocorrelation) often provide insight into the probability model which generated the data. One can say that autocorrelation is a mathematical tool for finding repeating patterns in the data series.

The detection of autocorrelation in time series data is usually used for the following two purposes:

  1. Help to detect the non-randomness in data (the first i.e. lag 1 autocorrelation is performed)
  2. Help in identifying an appropriate time series model if the data are not random (autocorrelation is usually plotted for many lags)

For simple correlation, suppose there are $n$ pairs of observations on two variables $x$ and $y$; then the usual correlation coefficient (Pearson’s coefficient of correlation) is

\[r=\frac{\sum(x_i-\overline{x})(y_i-\overline{y})}{\sqrt{\sum (x_i-\overline{x})^2 \sum (y_i-\overline{y})^2 }}\]

A similar idea can be used in time series to see whether successive observations are correlated. Given $n$ observations $x_1, x_2, \cdots, x_n$ on a discrete time series, we can form $(n-1)$ pairs of observations: $(x_1, x_2), (x_2, x_3), \cdots, (x_{n-1}, x_n)$. In each pair, the first observation serves as one variable ($x_t$) and the second observation as the second variable ($x_{t+1}$). The correlation coefficient between $x_t$ and $x_{t+1}$ is then

\[r_1=\frac{\sum_{t=1}^{n-1} (x_t- \overline{x}_{(1)})(x_{t+1}-\overline{x}_{(2)})}{\sqrt{\left[\sum_{t=1}^{n-1} (x_t-\overline{x}_{(1)})^2\right] \left[\sum_{t=1}^{n-1} (x_{t+1}-\overline{x}_{(2)})^2\right]}}\]

where

$\overline{x}_{(1)}=\sum_{t=1}^{n-1} \frac{x_t}{n-1}$ is the mean of the first $n-1$ observations

$\overline{x}_{(2)}=\sum_{t=2}^{n} \frac{x_t}{n-1}$ is the mean of the last $n-1$ observations

Note: the assumption is that the observations in the series are equally spaced (equi-spaced) in time.

This coefficient is called the autocorrelation or serial correlation coefficient. For large $n$, $r_1$ is approximately

\[r_1=\frac{\frac{\sum_{t=1}^{n-1} (x_t-\overline{x})(x_{t+1}-\overline{x}) }{n-1}}{ \frac{\sum_{t=1}^n (x_t-\overline{x})^2}{n}}\]

or

\[r_1=\frac{\sum_{t=1}^{n-1} (x_t-\overline{x})(x_{t+1}-\overline{x}) } { \sum_{t=1}^n (x_t-\overline{x})^2}\]

For $k$ distance apart i.e., for $k$ lags

\[r_k=\frac{\sum_{t=1}^{n-k} (x_t-\overline{x})(x_{t+k}-\overline{x}) } { \sum_{t=1}^n (x_t-\overline{x})^2}\]

An $r_k$ value outside the limits $\pm \frac{2}{\sqrt{n}}$ indicates a significant difference from zero and signifies autocorrelation at lag $k$.
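
As an illustration (using a simulated series, since no data are given in the post), the R sketch below computes the lag-1 coefficient directly from the large-$n$ formula above, compares it with the value returned by acf(), and prints the approximate $\pm 2/\sqrt{n}$ bounds.

set.seed(1)
n <- 100
x <- arima.sim(model = list(ar = 0.6), n = n)  # simulated autocorrelated series
xbar <- mean(x)
# Lag-1 autocorrelation from the (large-n) formula
r1 <- sum((x[1:(n - 1)] - xbar) * (x[2:n] - xbar)) / sum((x - xbar)^2)
r1
acf(x, plot = FALSE)$acf[2]  # lag-1 value from acf(); the first element is lag 0
# Approximate significance bounds
c(-2, 2) / sqrt(n)
acf(x)                       # correlogram with confidence bounds drawn by R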

(Figure: patterns of autocorrelation and non-autocorrelation in time series data)

Applications of Autocorrelation in Time Series

There are several applications of autocorrelation in Time Series Data. Some of them are described below.

  • Autocorrelation analysis is widely used in fluorescence correlation spectroscopy.
  • Autocorrelation is used to measure the optical spectra and to measure the very short-duration light pulses produced by lasers.
  • Autocorrelation is used to analyze dynamic light scattering data for the determination of the particle size distributions of nanometer-sized particles in a fluid. A laser shining into the mixture produces a speckle pattern. The autocorrelation of the signal can be analyzed in terms of the diffusion of the particles. From this, knowing the fluid viscosity, the sizes of the particles can be calculated using Autocorrelation.
  • The small-angle X-ray scattering intensity of a nano-structured system is the Fourier transform of the spatial autocorrelation function of the electron density.
  • In optics, normalized autocorrelations and cross-correlations give the degree of coherence of an electromagnetic field.
  • In signal processing, autocorrelation can provide information about repeating events such as musical beats or pulsar frequencies, but it cannot tell the position in time of the beat. It can also be used to estimate the pitch of a musical tone.
  • In music recording, autocorrelation is used as a pitch detection algorithm before vocal processing, as a distortion effect or to eliminate undesired mistakes and inaccuracies.
  • In statistics, spatial autocorrelation between sample locations also helps one estimate mean value uncertainties when sampling a heterogeneous population.
  • In astrophysics, auto-correlation is used to study and characterize the spatial distribution of galaxies in the Universe and multi-wavelength observations of Low Mass X-ray Binaries.
  • In an analysis of Markov chain Monte Carlo data, autocorrelation must be taken into account for correct error determination.

Further Reading: Autocorrelation in time series

Completely Randomized Design (CRD)

Introduction to Completely Randomized Design (CRD)

The simplest and least restricted experimental design, in which each treatment has an equal chance of being assigned to any experimental unit, each treatment can be accommodated in the plan, and the replication of the treatments may be equal or unequal, is known as a completely randomized design (CRD). In this regard, the design is known as an unrestricted design (a design without any blocking or other condition) that has one primary factor. In its general form, its analysis is the one-way analysis of variance.

Example of CRD

There are three treatments named $A, B$, and $C$ placed randomly in different experimental units.

C  A  C
B  A  A
B  B  C

We can see that from the table above:

  • There may or may not be a repetition of the treatment
  • The only source of variation is the treatment
  • A specific treatment need not be assigned to a specific unit.
  • There are three treatments such that each treatment appears three times, having $P(A)=P(B)=P(C)=3/9$.
  • Each treatment appears an equal number of times here (in general, the replication may be unequal, i.e., unbalanced).
  • The total number of experimental units is 9.

Some Advantages of Completely Randomized Design (CRD)

  1. The main advantage of this design is that the analysis of the data remains the simplest, even if some units fail to respond for any reason.
  2. Another advantage of this design is that it provides the maximum degrees of freedom for error.
  3. This design is mostly used in laboratory experiments, where all the other factors are under the control of the researcher. For example, in a test-tube experiment, CRD is best because all the factors are under control.

An assumption regarding completely randomized design (CRD) is that the observation in each level of a factor will be independent of each other.

Statistical Model of CRD

The general model with one factor can be defined as

\[Y_{ij}=\mu + \eta_i +e_{ij}\]

where $i=1,2,\cdots,t$ and $j=1,2,\cdots, r_i$, with $t$ treatments and $r_i$ replications of the $i$th treatment. $\mu$ is the overall mean based on all observations, $\eta_i$ is the effect of the $i$th treatment, and $e_{ij}$ is the corresponding error term, which is assumed to be independently and normally distributed with mean zero and constant variance.
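
A minimal R sketch of this model (with a hypothetical overall mean, treatment effects, and error variance) randomizes the three treatments of the earlier example over nine units and fits the corresponding one-way ANOVA.

set.seed(123)
# Randomly assign treatments A, B, C (three replications each) to nine units
treatment <- factor(sample(rep(c("A", "B", "C"), each = 3)))
# Simulate responses: overall mean + treatment effect + random error
mu     <- 50
effect <- c(A = 0, B = 5, C = -3)   # hypothetical eta_i values
y      <- mu + effect[as.character(treatment)] + rnorm(9, mean = 0, sd = 2)
# One-way ANOVA corresponding to the CRD model
crd_fit <- aov(y ~ treatment)
summary(crd_fit)                    # tests equality of the treatment means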

Importance of CRD

  • Simplicity: CRD is the easiest design to implement; treatments are assigned to the experimental units completely at random, which avoids complex layouts and makes the design manageable for beginners.
  • Fairness: Randomization ensures each experimental unit has an equal chance of receiving any treatment. The randomization reduces the bias and strengthens the validity of the comparisons between treatments.
  • Flexibility: CRD can accommodate a wide range of experiments with different numbers of treatments and replicates. One can also adjust the design to fit the specific needs.
  • Data Analysis: CRD requires the simplest statistical analysis compared to other designs. This makes it easier to interpret the results and draw conclusions from the experiment.
  • Efficiency: CRD allows for utilizing the entire experimental material, maximizing the data collected.

When CRD is a Good Choice

  • Laboratory experiments: Due to the controlled environment, CRD works well for isolating the effects of a single factor in lab settings.
  • Limited treatments: If there are a small number of treatment groups, CRD is a manageable and efficient option.
  • Initial investigations: CRD can be a good starting point for initial explorations of a factor’s effect before moving on to more complex designs.

Summary

The advantages and importance of CRD make it a valuable starting point for many experiments, particularly in controlled laboratory settings. However, it is important to consider limitations like the assumption of homogeneous experimental units, which might not always be realistic in field experiments.

Read from Wikipedia: Completely Randomized Design (CRD)