Principal Component Regression (PCR)

Principal Component Regression (PCR) is a statistical technique that combines two powerful methods: Principal Component Analysis (PCA) and linear regression.

PCA transforms the original data set into a new set of uncorrelated variables called principal components. The transformation ranks the new variables according to their importance (that is, according to the size of their variance), so that those of least importance can be eliminated. A least squares regression is then performed on this reduced set of principal components; this is called principal component regression.

Principal Component Regression (PCR) is not scale invariant, so the data should be centered and scaled first. Consider a centered $p$-dimensional random vector $x=(x_1, x_2, \cdots, x_p)^t$ with covariance matrix $\Sigma$, and assume that $\Sigma$ is positive definite. Let $V=(v_1,v_2, \cdots, v_p)$ be a $(p \times p)$-matrix with orthonormal column vectors, that is, $v_i^t\, v_i=1$ for $i=1,2, \cdots, p$, $v_i^t\, v_j=0$ for $i \ne j$, and $V^t =V^{-1}$. Consider the linear transformation

\begin{aligned}
z&=V^t x\\
z_i&=v_i^t x
\end{aligned}

The variance of the random variable $z_i$ is
\begin{aligned}
Var(z_i)&=E[v_i^t\, x\, x^t\, v_i]\\
&=v_i^t \Sigma v_i
\end{aligned}

Maximizing the variance $Var(z_i)$ under the constraint $v_i^t v_i=1$ with a Lagrange multiplier $a_i$ gives
\[\phi_i=v_i^t \Sigma v_i -a_i(v_i^t v_i-1)\]

Setting the partial derivative to zero, we get
\[\frac{\partial \phi_i}{\partial v_i} = 2 \Sigma v_i - 2a_i v_i=0\]

which is
\[(\Sigma - a_i I)v_i=0\]

In matrix form
\[\Sigma V= VA\]
or
\[\Sigma = VAV^t\]

where $A=diag(a_1, a_2, \cdots, a_p)$. This is known as the eigenvalue problem: the $v_i$ are the eigenvectors of $\Sigma$ and the $a_i$ are the corresponding eigenvalues, ordered such that $a_1 \ge a_2 \ge \cdots \ge a_p$. Since $\Sigma$ is positive definite, all eigenvalues are real and positive.

$z_i$ is called the $i$th principal component of $x$ and we have
\[Cov(z)=V^t Cov(x) V=V^t \Sigma V=A\]

The variance of the $i$th principal component equals the eigenvalue $a_i$, and the variances are ranked in descending order. This means that the last principal component has the smallest variance. Each principal component is orthogonal to, and in fact uncorrelated with, all the other principal components, since $A$ is a diagonal matrix.
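
The eigenvalue problem above can be checked numerically. The following is a minimal sketch in Python with NumPy, not a definitive implementation: the data are synthetic and the variable names (`Xs`, `S`, `cov_Z`) are chosen here for illustration. It uses the sample covariance matrix in place of the population covariance matrix and confirms that the covariance of the principal components is the diagonal matrix of eigenvalues.

```python
import numpy as np

# Synthetic data (an assumption for illustration): 200 observations of 3 correlated variables
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
mixing = np.array([[1.0, 0.8, 0.2],
                   [0.0, 0.5, 1.0]])
X = latent @ mixing + 0.1 * rng.normal(size=(200, 3))

# Center and scale first, since the method is not scale invariant
Xs = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# Sample covariance matrix of the standardized data
S = np.cov(Xs, rowvar=False)

# eigh handles symmetric matrices and returns eigenvalues in ascending order
a, V = np.linalg.eigh(S)
order = np.argsort(a)[::-1]          # reorder so that a_1 >= a_2 >= ... >= a_p
a, V = a[order], V[:, order]

# Principal components z = V^t x, applied to every row of the data
Z = Xs @ V

# Cov(z) should equal A = diag(a_1, ..., a_p) up to rounding error
cov_Z = np.cov(Z, rowvar=False)
print(np.round(a, 4))
print(np.round(cov_Z, 4))
```

The off-diagonal entries of `cov_Z` are zero up to floating-point error, which is the numerical counterpart of the statement that the principal components are uncorrelated.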

For the regression, we use the first $q$ principal components, where $1\le q \le p$. The regression model for observed data $X$ and $y$ can then be expressed as

\begin{aligned}
y&=X\beta+\varepsilon\\
&=XVV^t\beta+\varepsilon\\
&= Z\theta+\varepsilon
\end{aligned}

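where $Z=XV$ is the matrix of the empirical principal components and $\theta=V^t \beta$ are the new regression coefficients. PCR retains only the first $q$ columns of $Z$, giving an $n\times q$ matrix, together with the corresponding $q$ coefficients. The solution of the least squares estimation is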

\begin{aligned}
\hat{\theta}_k=(z_k^t z_k)^{-1}z_k^ty
\end{aligned}

and $\hat{\theta}=(\hat{\theta}_1, \cdots, \hat{\theta}_q)^t$.

Since the $z_k$ are orthogonal, the regression is a sum of univariate regressions, that is
\[\hat{y}_{PCR}=\sum_{k=1}^q \hat{\theta}_k z_k\]

Since the $z_k$ are linear combinations of the original $x_j$, the solution in terms of coefficients of the $x_j$ can be expressed as follows, where $V$ is restricted to its first $q$ columns:
\[\hat{\beta}_{PCR} (q)=\sum_{k=1}^q \hat{\theta}_k v_k=V \hat{\theta}\]

Note that if $q=p$, we would get back the usual least squares estimates for the full model. For $q<p$, we get a “reduced” regression.
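
The PCR estimator above can be sketched in a few lines of Python with NumPy. This is an illustrative sketch under stated assumptions rather than a reference implementation: the function name `pcr_fit` and the synthetic data are hypothetical, the predictors are assumed to be centered and scaled, and the response is assumed to be centered. The example also checks the remark that $q=p$ reproduces the ordinary least squares estimates.

```python
import numpy as np

def pcr_fit(X, y, q):
    """PCR coefficients for centered/scaled X and centered y, using q components."""
    a, V = np.linalg.eigh(np.cov(X, rowvar=False))
    Vq = V[:, np.argsort(a)[::-1]][:, :q]     # first q eigenvectors, largest variance first
    Z = X @ Vq                                # n x q matrix of empirical principal components
    # The columns z_k are orthogonal, so each theta_k is a univariate regression
    theta = np.array([(Z[:, k] @ y) / (Z[:, k] @ Z[:, k]) for k in range(q)])
    return Vq @ theta                         # beta_PCR(q) = sum_k theta_k v_k

# Synthetic data (assumption): the fourth predictor is nearly a copy of the first
rng = np.random.default_rng(1)
n, p = 100, 4
X_raw = rng.normal(size=(n, p))
X_raw[:, 3] = X_raw[:, 0] + 0.05 * rng.normal(size=n)
y_raw = X_raw @ np.array([1.0, -2.0, 0.5, 1.0]) + rng.normal(size=n)

X = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0, ddof=1)
y = y_raw - y_raw.mean()

beta_full = pcr_fit(X, y, q=p)                      # all components
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)    # ordinary least squares
beta_reduced = pcr_fit(X, y, q=2)                   # "reduced" regression

rss_full = float(np.sum((y - X @ beta_full) ** 2))
rss_reduced = float(np.sum((y - X @ beta_reduced) ** 2))
print(np.allclose(beta_full, beta_ols))
print(rss_full, rss_reduced)
```

The reduced fit cannot have a smaller residual sum of squares than the full fit on the same data; what it offers instead is a more stable estimate when predictors are highly correlated.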

Why use Principal Component Regression?

  • Reduces Dimensionality: When dealing with a large number of predictors, PCR can help reduce the complexity of the model.
  • Handles multicollinearity: If there is a high correlation among predictors (multicollinearity), PCR can address this issue.
  • Improves interpretability: In some cases, the principal components can be easier to interpret than the original variables.

Important Points to Remember

  • The PCA step of PCR is an unsupervised technique for dimensionality reduction: the components are chosen without reference to the response variable.
  • The number of principal components used in the regression model is a crucial parameter.
  • PCR can be compared to Partial Least Squares Regression (PLS), another dimensionality reduction technique that considers the relationship between predictors and the response variable.
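
Since the number of components is a crucial parameter, it is usually chosen from the data. One common approach, which is an assumption here and not something prescribed by the text above, is $k$-fold cross-validation. The sketch below uses NumPy only; the helper names `pcr_coefficients` and `cv_error` and the synthetic data are hypothetical.

```python
import numpy as np

def pcr_coefficients(X, y, q):
    """PCR coefficients for centered/scaled X and centered y, using q components."""
    a, V = np.linalg.eigh(np.cov(X, rowvar=False))
    Vq = V[:, np.argsort(a)[::-1]][:, :q]
    theta, *_ = np.linalg.lstsq(X @ Vq, y, rcond=None)
    return Vq @ theta

def cv_error(X, y, q, n_folds=5, seed=0):
    """Mean squared prediction error of PCR with q components, by k-fold CV."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    squared_errors = []
    for test_idx in folds:
        train_idx = np.setdiff1d(np.arange(len(y)), test_idx)
        # Centering and scaling use the training fold only
        mu = X[train_idx].mean(axis=0)
        sd = X[train_idx].std(axis=0, ddof=1)
        y_mean = y[train_idx].mean()
        beta = pcr_coefficients((X[train_idx] - mu) / sd, y[train_idx] - y_mean, q)
        pred = y_mean + ((X[test_idx] - mu) / sd) @ beta
        squared_errors.append((y[test_idx] - pred) ** 2)
    return float(np.mean(np.concatenate(squared_errors)))

# Synthetic data (assumption for illustration)
rng = np.random.default_rng(2)
X = rng.normal(size=(120, 5))
y = X @ np.array([2.0, -1.0, 0.0, 0.0, 0.5]) + rng.normal(size=120)

errors = {q: cv_error(X, y, q) for q in range(1, 6)}
best_q = min(errors, key=errors.get)
print(errors)
print(best_q)
```

Because the components are selected without looking at the response, a small $q$ is not guaranteed to predict well, which is why the cross-validated error is inspected for every candidate $q$.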

Canonical Correlation Analysis (2016)

The bivariate correlation analysis measures the strength of the relationship between two variables. One may be required to find the strength of the relationship between two sets of variables. In this case, canonical correlation is an appropriate technique for measuring the strength of the relationship between two sets of variables. Canonical correlation is appropriate in the same situations where multiple regression would be, but where there are multiple inter-correlated outcome variables. Canonical correlation analysis determines a set of canonical variates, orthogonal linear combinations of the variables within each set that best explain the variability both within and between sets.

Examples: Canonical Correlation Analysis

  • In medicine, individuals’ lifestyles and eating habits may affect their different health measures determined by several health-related variables such as hypertension, weight, anxiety, and tension level.
  • In business, the marketing manager of a consumer goods firm may be interested in finding the relationship between the types of products purchased and consumers’ lifestyles and personalities.

In the above two examples, one set of variables is the predictor or independent set, while the other set of variables is the criterion or dependent set. The objective of canonical correlation is to determine if the predictor set of variables affects the criterion set of variables.

Note that it is unnecessary to designate the two sets of variables as dependent and independent. In this case, the objective of canonical correlation is to ascertain the relationship between the two sets of variables.

The objective of canonical correlation is similar to conducting a principal components analysis on each set of variables. In principal components analysis, the first new axis results in a new variable that accounts for the maximum variance in the data. In contrast, in canonical correlation analysis, a new axis is identified for each set of variables such that the correlation between the two resulting new variables is maximum.

Canonical correlation analysis can also be considered a data reduction technique as only a few canonical variables may be needed to adequately represent the association between the two sets of variables. Therefore, an additional objective of canonical correlation is to determine the minimum number of canonical correlations needed to adequately represent the association between two sets of variables.
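
The canonical correlations themselves can be computed with standard linear algebra. The sketch below is one possible approach (a QR decomposition of each centered set followed by a singular value decomposition), written with NumPy; the function name `canonical_correlations` and the synthetic data are assumptions for illustration, and both data matrices are assumed to have full column rank.

```python
import numpy as np

def canonical_correlations(X, Y):
    """Canonical correlations between the variables in X and those in Y, largest first."""
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    # Orthonormal bases for the column spaces of the two centered sets
    Qx, _ = np.linalg.qr(Xc)
    Qy, _ = np.linalg.qr(Yc)
    # The singular values of Qx^t Qy are the canonical correlations
    rho = np.linalg.svd(Qx.T @ Qy, compute_uv=False)
    return np.clip(rho, 0.0, 1.0)   # guard against rounding slightly above 1

# Synthetic data (assumption): both sets share one common underlying factor
rng = np.random.default_rng(3)
shared = rng.normal(size=(300, 1))
X = np.hstack([shared + 0.5 * rng.normal(size=(300, 1)), rng.normal(size=(300, 1))])
Y = np.hstack([shared + 0.5 * rng.normal(size=(300, 1)), rng.normal(size=(300, 2))])

rho = canonical_correlations(X, Y)
print(np.round(rho, 3))

# With one variable per set, the canonical correlation is the absolute Pearson correlation
r_single = canonical_correlations(X[:, :1], Y[:, :1])[0]
r_pearson = abs(np.corrcoef(X[:, 0], Y[:, 0])[0, 1])
```

With two variables in one set and three in the other there are at most two canonical correlations. In data built like this, the first is expected to be large and the second close to zero, which illustrates why a few canonical variates may be enough to represent the association.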

Data Collection Methods

There are many methods to collect data. The data collection methods used in statistical inference can be classified into four main methods (sources) of collecting data.

The Data Collection Methods are (i) Survey Method, (ii) Simulation, (iii) Controlled Experiments, and (iv) Observational Study. Let us discuss these Data Collection Methods one by one in detail.

(i) Survey Method

A very popular and widely used method is the survey, where people with special training go out and record observations of, for example, the number of vehicles traveling along a road, the acres of fields that farmers are using to grow a particular food crop, the number of households that own more than one motor vehicle, the number of passengers using Metro transport, and so on. Here the person making the study has no direct control over generating the data that can be recorded, although the recording methods need care and control.

(ii) Simulation

Simulation is also one of the most important data collection methods. In simulation, a computer model of the operation of a system is set up, for example, a model of an industrial system in which an important measurement is the percentage purity of a chemical product. A very large number of realizations of the model can be run to look for any pattern in the results. Here the success of the approach depends on how well the model can explain that measurement, and this has to be tested by carrying out at least a small amount of work on the actual system in operation.
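
As a rough illustration of running many realizations of a model, the following Python sketch simulates a purely hypothetical process. The model in `simulate_purity`, its inputs, and all numeric constants are invented for this example and do not describe any real system; only the standard library is used.

```python
import random
import statistics

def simulate_purity(rng):
    """One realization of a hypothetical process model; returns percentage purity."""
    temperature = rng.gauss(150.0, 5.0)      # assumed operating temperature
    feed_quality = rng.uniform(0.9, 1.0)     # assumed raw-material quality index
    purity = (90.0 + 8.0 * feed_quality
              - 0.05 * abs(temperature - 150.0)
              + rng.gauss(0.0, 0.3))         # unexplained run-to-run variation
    return min(100.0, max(0.0, purity))      # a percentage must stay within 0..100

rng = random.Random(42)                      # fixed seed so the runs are reproducible
runs = [simulate_purity(rng) for _ in range(10_000)]

mean_purity = statistics.mean(runs)
sd_purity = statistics.stdev(runs)
share_above_97 = sum(r >= 97.0 for r in runs) / len(runs)
print(mean_purity, sd_purity, share_above_97)
```

The summaries are only as trustworthy as the model itself, so in practice they would be compared against a small number of measurements from the actual system.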

(iii) Controlled Experiments

An experiment is possible when the background conditions can be controlled, at least to some extent. For example, we may be interested in choosing the best type of grass seed to use on a sports field.

The first stage of work is to grow all the competing varieties of seed at the same place and make suitable records of their growth and development. The competing varieties should be grown in quite small units close together in the field as in the figure below

Data Collection Methods: Controlled Experiments

This is a controlled experiment as it has certain constraints, such as:

i) River on the right side
ii) Shadow of trees on the left side
iii) There are 3 different varieties (say, $v_1, v_2, v_3$), distributed over 12 units.

This arrangement gives much more control of local environmental conditions than there would have been if, as in the diagram below, one variety had been placed in the strip in the shelter of the trees, another close by the river, and the third more exposed in the center of the field.

Data Collection Methods: Controlled Experiments 2

Here there are 3 experimental units. One is close to the stream, another is close to the trees, and the third lies between them, in a more favorable position than the other two. It is then our choice which variety to place on which unit.

(iv) Observational Study

Like experiments, observational studies try to understand cause-and-effect relationships. However, unlike experiments, the researcher is not able to control (1) how subjects are assigned to groups and/or (2) which treatments each group receives.

Note that small units of land or plots are called experimental units or simply units.

There is no “right” size for a unit; it depends on the type of crop, the work that is to be done on it, and the measurements that are to be taken. Similarly, the measurements upon which inferences are eventually going to be based are to be taken as accurately as possible. The unit must, therefore, not be so large as to make recording very tedious, because that leads to errors and inaccuracy. On the other hand, if a unit is very small, there is the danger that relatively minor physical errors in recording can lead to large percentage errors.

Experimenters, and the statisticians who collaborate with them, need to gain a good knowledge of their experimental material or units as a research program proceeds.

Basic Principles of DOE (Design of Experiments)

The basic principles of DOE (design of experiments or experimental design) are (i) Randomization, (ii) Replication, and (iii) Local Control. Let us discuss these important principles of experimental design in detail below.

  1. Randomization

    Randomization is the cornerstone underlying the use of statistical methods in experimental designs. Randomization is the random process of assigning treatments to the experimental units. The random process implies that every possible allotment of treatments has the same probability. For example, if the number of treatments is $t = 3$ (say, $A$, $B$, and $C$) and the number of replications is $r = 4$, then the number of elements is $n = t \times r = 3 \times 4 = 12$. Replication means that each treatment will appear 4 times, as $r = 4$. Let the design be

    A C B C
    C B A B
    A C B A
    Note from the design that elements 1, 7, 9, and 12 are reserved for Treatment $A$, elements 3, 6, 8, and 11 are reserved for Treatment $B$, and elements 2, 4, 5, and 10 are reserved for Treatment $C$. Thus $P(A)= \frac{4}{12}$, $P(B)= \frac{4}{12}$, and $P(C)=\frac{4}{12}$, meaning that Treatments $A$, $B$, and $C$ have equal chances of selection.
  2. Replication

    By replication, we mean the repetition of the basic experiment. For example, if we need to compare the grain yield of two varieties of wheat, then each variety is applied to more than one experimental unit. The number of times each variety is applied to experimental units is called its number of replications. Replication has two important properties:

    • It allows the experimenter to obtain an estimate of the experimental error.
    • More replication provides increased precision by reducing the standard error (SE) of the mean, $s_{\overline{y}}=\tfrac{s}{\sqrt{r}}$, where $s$ is the sample standard deviation and $r$ is the number of replications. Note that an increase in $r$ decreases $s_{\overline{y}}$ (the standard error of $\overline{y}$).
  3. Local Control

    Local control is the last of the basic principles of DOE. It has been observed that not all extraneous sources of variation are removed by randomization and replication, that is, these two principles alone cannot control every extraneous source of variation.
    Thus we need to refine the experimental technique. In other words, we need to choose a design in such a way that all extraneous sources of variation are brought under control. For this purpose, we make use of local control, a term referring to the amount of (i) balancing, (ii) blocking, and (iii) grouping of experimental units.

Balancing: Balancing means that the treatments should be assigned to the experimental units in such a way that the result is a balanced arrangement of treatments.

Blocking: Blocking means that similar experimental units should be collected together to form relatively homogeneous groups. A block is also called a replicate.

The main objective of local control is to increase the efficiency of the experimental design by decreasing the experimental error.
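
The three principles can be sketched in Python using only the standard library. This is an illustrative sketch, not a prescribed procedure: the layout names `crd` and `rcbd` and the helper `standard_error` are chosen here for the example. It randomizes 3 treatments with 4 replications over 12 units, builds a blocked layout in which every block contains each treatment once, and evaluates $s_{\overline{y}}=s/\sqrt{r}$.

```python
import math
import random
from collections import Counter

rng = random.Random(2024)        # fixed seed so the layout is reproducible
treatments = ["A", "B", "C"]
r = 4                            # number of replications

# Randomization and replication: every treatment appears r times and a
# random shuffle decides which of the t * r = 12 units it is assigned to
crd = treatments * r
rng.shuffle(crd)

# Local control by blocking: each block holds every treatment exactly once,
# and the order is randomized separately within each block
rcbd = []
for _ in range(r):
    block = treatments[:]
    rng.shuffle(block)
    rcbd.append(block)

def standard_error(s, r):
    """Standard error of a treatment mean from r replications."""
    return s / math.sqrt(r)

print(crd, Counter(crd))
print(rcbd)
print(standard_error(2.0, 4), standard_error(2.0, 16))
```

Quadrupling the number of replications from 4 to 16 halves the standard error, which is the sense in which more replication increases precision.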

Real Life Example

Imagine a bakery trying to improve the quality of its bread. Factors that could affect bread quality include

  • Flour type,
  • Water temperature, and
  • Yeast quantity

By using DOE, the bakery can systematically test different combinations of these factors to determine the optimal recipe.

Randomization: Randomly assign different bread batches to different combinations of flour type, water temperature, and yeast quantity.

Replication: Bake multiple loaves of bread for each combination to assess variability.

Local Control: If the oven has different temperature zones, bake similar bread batches in the same zone to reduce temperature variation.

By following the Basic Principles of Design of Experiments, the bakery can efficiently identify the best recipe for its bread, improving product quality and reducing waste.
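
A baking plan of this kind can be sketched in Python with the standard library. Everything specific in the sketch is an assumption made for illustration: the factor levels, the three oven zones, and the choice to treat each oven zone as a block that receives one full replicate of all combinations in random order.

```python
import itertools
import random

# Hypothetical factor levels (assumptions for illustration)
factors = {
    "flour_type": ["white", "whole_wheat"],
    "water_temp_c": [25, 35],
    "yeast_g": [5, 10],
}
oven_zones = ["front", "middle", "back"]     # assumed blocks for local control

# Every combination of factor levels: 2 x 2 x 2 = 8 recipes
combinations = list(itertools.product(*factors.values()))

rng = random.Random(7)                       # fixed seed so the plan is reproducible
plan = {}
for zone in oven_zones:
    # Replication: each zone bakes one full replicate of all recipes
    order = combinations[:]
    # Randomization: the baking order is shuffled separately within each zone
    rng.shuffle(order)
    plan[zone] = [dict(zip(factors, combo)) for combo in order]

total_loaves = sum(len(batches) for batches in plan.values())
for zone, batches in plan.items():
    print(zone, batches[0])
print(total_loaves)
```

Each recipe is baked three times, once per zone, so differences between oven zones can be separated from differences between recipes.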
