Principal Component Regression (PCR)

Principal Component Regression (PCR) is a statistical technique that combines two powerful methods: Principal Component Analysis (PCA) and linear regression.

PCA transforms the original data set into a new set of uncorrelated variables called principal components. This transformation ranks the new variables according to their importance (that is, variables are ranked by the size of their variances, so those of least importance can be eliminated). After the transformation, a least squares regression is performed on this reduced set of principal components; this is called principal component regression.


Principal Component Regression (PCR) is not scale invariant; therefore, one should center and scale the data first. Given a $p$-dimensional random vector $x=(x_1, x_2, \cdots, x_p)^t$ with covariance matrix $\Sigma$, assume that $\Sigma$ is positive definite. Let $V=(v_1, v_2, \cdots, v_p)$ be a $(p \times p)$ matrix with orthonormal column vectors, that is, $v_i^t\, v_i=1$ for $i=1,2,\cdots,p$, and $V^t = V^{-1}$. Consider the linear transformation

\begin{aligned}
z&=V^t x\\
z_i&=v_i^t x
\end{aligned}

Since the data are centered ($E[x]=0$), the variance of the random variable $z_i$ is
\begin{aligned}
Var(z_i)&=E[v_i^t\, x\, x^t\, v_i]\\
&=v_i^t \Sigma v_i
\end{aligned}

Maximizing the variance $Var(z_i)$ under the condition $v_i^t v_i=1$ with a Lagrange multiplier $a_i$ gives
\[\phi_i=v_i^t \Sigma v_i - a_i(v_i^t v_i-1)\]

Setting the partial derivative to zero, we get
\[\frac{\partial \phi_i}{\partial v_i} = 2 \Sigma v_i - 2a_i v_i=0\]

which is
\[(\Sigma - a_i I)v_i=0\]

In matrix form
\[\Sigma V= VA\]
or
\[\Sigma = VAV^t\]

where $A=\operatorname{diag}(a_1, a_2, \cdots, a_p)$. This is known as the eigenvalue problem: the $v_i$ are the eigenvectors of $\Sigma$ and the $a_i$ the corresponding eigenvalues, ordered such that $a_1 \ge a_2 \ge \cdots \ge a_p$. Since $\Sigma$ is positive definite, all eigenvalues are real and positive.

$z_i$ is named the $i$th principal component of $x$, and we have
\[Cov(z)=V^t\, Cov(x)\, V=V^t \Sigma V=A\]

The variance of the $i$th principal component equals the eigenvalue $a_i$, and the variances are ranked in descending order; this means that the last principal component has the smallest variance. Since $A$ is a diagonal matrix, the principal components are orthogonal to (and even uncorrelated with) all the other principal components.
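
The eigendecomposition above is easy to check numerically. The following minimal Python sketch (the correlated toy data are an illustrative assumption, not from the text) centers a data matrix, computes its covariance matrix with divisor $n$, extracts eigenvalues and eigenvectors, and verifies that $Cov(z)=A$:

import numpy as np

# Toy data: n = 100 observations of p = 3 correlated variables
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3)) @ np.array([[1.0, 0.5, 0.2],
                                          [0.0, 1.0, 0.3],
                                          [0.0, 0.0, 1.0]])

Xc = X - X.mean(axis=0)        # center the data
S = Xc.T @ Xc / Xc.shape[0]    # covariance matrix (divisor n)

a, V = np.linalg.eigh(S)       # eigenvalues/eigenvectors of the symmetric S
order = np.argsort(a)[::-1]    # reorder so that a_1 >= a_2 >= ... >= a_p
a, V = a[order], V[:, order]

Z = Xc @ V                     # principal components z = V^t x, row-wise
print(np.allclose(np.cov(Z.T, bias=True), np.diag(a)))  # Cov(z) = A -> True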

In the following, we will use the first $q$ ($1\le q \le p$) principal components for regression. The regression model for observed data $X$ and $y$ can then be expressed as

\begin{aligned}
y&=X\beta+\varepsilon\\
&=XVV^t\beta+\varepsilon\\
&= Z\theta+\varepsilon
\end{aligned}

with the $n\times q$ matrix of empirical principal components $Z=XV$ (retaining only the first $q$ columns; the second step above uses $VV^t=I_p$, which holds because $V$ is orthogonal) and the new regression coefficients $\theta=V^t \beta$. The solution of the least squares estimation is

\begin{aligned}
\hat{\theta}_k=(z_k^t z_k)^{-1}z_k^ty
\end{aligned}

and $\hat{\theta}=(\hat{\theta}_1, \cdots, \hat{\theta}_q)^t$.

Since the $z_k$ are orthogonal, the regression is a sum of univariate regressions, that is
\[\hat{y}_{PCR}=\sum_{k=1}^q \hat{\theta}_k z_k\]

Since $z_k$ are linear combinations of the original $x_j$, the solution in terms of coefficients of the $x_j$ can be expressed as
\[\hat{\beta}_{PCR} (q)=\sum_{k=1}^q \hat{\theta}_k v_k=V \hat{\theta}\]


Note that if $q=p$, we would get back the usual least squares estimates for the full model. For $q<p$, we get a “reduced” regression.
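
In practice, PCR can be assembled from standard building blocks. Below is a minimal Python sketch using scikit-learn; the simulated data, the pipeline, and the choice of $q=2$ components are illustrative assumptions, not part of the derivation above:

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

# Toy data: n = 100 observations, p = 5 predictors with multicollinearity
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))
X[:, 3] = X[:, 0] + 0.01 * rng.normal(size=100)  # near-duplicate predictor
y = X @ np.array([1.0, 0.5, 0.0, -1.0, 2.0]) + rng.normal(size=100)

# PCR: center/scale, keep q = 2 components, then least squares on them
pcr = make_pipeline(StandardScaler(), PCA(n_components=2), LinearRegression())
pcr.fit(X, y)
print("R^2 with q = 2 components:", pcr.score(X, y))

# q = p recovers the usual least squares fit on the (scaled) full model
ols = make_pipeline(StandardScaler(), PCA(n_components=5), LinearRegression())
ols.fit(X, y)
print("R^2 with q = p components:", ols.score(X, y))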

Why use Principal Component Regression?

  • Reduces dimensionality: When dealing with a large number of predictors, PCR can help reduce the complexity of the model.
  • Handles multicollinearity: If there is a high correlation among predictors (multicollinearity), PCR can address this issue.
  • Improves interpretability: In some cases, the principal components can be easier to interpret than the original variables.

Important Points to Remember

  • The dimensionality-reduction step of PCR is unsupervised: the principal components are chosen without reference to the response variable.
  • The number of principal components used in the regression model is a crucial parameter.
  • PCR can be compared to Partial Least Squares Regression (PLS), another dimensionality reduction technique that considers the relationship between predictors and the response variable.


Canonical Correlation Analysis

Bivariate correlation analysis measures the strength of the relationship between two variables. One may, however, need to measure the strength of the relationship between two sets of variables; in this case, canonical correlation analysis is the appropriate technique.

Introduction to Canonical Correlation Analysis

Canonical Correlation Analysis (CCA) is a multivariate technique that is appropriate in the same situations where multiple regression would be used, but where there are multiple inter-correlated outcome variables. Canonical correlation analysis determines a set of canonical variates: orthogonal linear combinations of the variables within each set that best explain the variability both within and between sets.

Examples: Canonical Correlation Analysis

  • In medicine, individuals’ lifestyles and eating habits may affect their health, as measured by several health-related variables such as hypertension, weight, anxiety, and tension level.
  • In business, the marketing manager of a consumer goods firm may be interested in finding the relationship between the types of products purchased and consumers’ lifestyles and personalities.

In the above two examples, one set of variables is the predictor or independent set, while the other is the criterion or dependent set. The objective of canonical correlation is to determine if the predictor set of variables affects the criterion set of variables.

Note that it is unnecessary to designate the two sets of variables as dependent and independent. In this case, the objective of canonical correlation is to ascertain the relationship between the two sets of variables.


Objectives of Canonical Correlation

The objective of canonical correlation is similar to conducting a principal components analysis on each set of variables. In principal components analysis, the first new axis results in a new variable that accounts for the maximum variance in the data. In contrast, in canonical correlation analysis, a new axis is identified for each set of variables such that the correlation between the two resulting new variables is maximum.

Canonical correlation analysis can also be considered a data reduction technique as only a few canonical variables may be needed to adequately represent the association between the two sets of variables. Therefore, an additional objective of canonical correlation is to determine the minimum number of canonical correlations needed to adequately represent the association between two sets of variables.

Why Use Canonical Correlation Analysis Instead of Simple Correlation?

  • Useful when both $X$ and $Y$ are multivariate (for example, in genomics, where genes and traits are both multi-dimensional).
  • Handles multiple variables simultaneously (unlike Pearson correlation, which is pairwise).
  • Reduces dimensionality by finding the most correlated combinations.

Real-Life Examples of Canonical Correlation Analysis

  • Psychology & Behavioral Sciences: Studying the relationship between: Set X: Psychological traits (for example, stress, anxiety, and personality scores). Set Y: Behavioral outcomes (for example, academic performance, and job satisfaction). Here the canonical correlation will help the researchers to understand how mental health factors influence real-world behavior.
  • Medicine & Healthcare: Analyzing the link between: Set X: Biomarkers (such as blood pressure and cholesterol levels). Set Y: Disease symptoms (for example, severity of diabetes, and heart disease risk). The canonical correlation may be used to identify which biological factors most strongly correlate with health outcomes.
  • Economics & Finance: Examining the relationship between: Set X: Macroeconomic indicators (e.g., GDP growth, inflation). Set Y: Stock market performance (e.g., S&P 500 returns, sector trends). The canonical correlation analysis may help policymakers and investors to understand how economic conditions affect markets.
  • Marketing & Consumer Research: Linking: Set X: Customer demographics (for example, age, income, and education). Set Y: Purchasing behavior (for example, brand preference, spending habits). The canonical correlation analysis may guide targeted advertising by identifying which customer traits influence buying decisions.
  • Neuroscience & Brain Imaging (fMRI Studies): Exploring connections between: Set X: Brain activity patterns (from fMRI scans). Set Y: Cognitive tasks (memory and decision-making). This may help neuroscientists understand how brain regions work together during tasks.
  • Environmental Science: Studying the impact of: Set X: Climate variables (temperature and rainfall). Set Y: Crop yields (wheat and rice production). This may assist in predicting agricultural outcomes based on weather patterns.
  • Computer Vision & Machine Learning: Multimodal data fusion: Set X: Image features (from facial recognition). Set Y: Text data (user captions or tags). The canonical correlation analysis may be used to improve AI systems by finding relationships between images and their descriptions.
  • Social Media & Recommendation Systems: Analyzing: Set X: User engagement metrics (likes and shares). Set Y: Content features (post length and topic). Canonical correlation may help platforms optimize content recommendations.

Limitations of Canonical Correlation Analysis

  • Assumes linear relationships (nonlinear extensions like Kernel CCA exist).
  • Sensitive to multicollinearity (high correlations within X or Y sets).
  • Requires large sample sizes for reliable results.

Canonical Correlation Analysis Example in Python

# Import required libraries
import numpy as np
import pandas as pd
from sklearn.cross_decomposition import CCA

# ----------------------
# Example data

data = {
    'Exercise': [3, 5, 2, 4, 6, 1],
    'Diet': [8, 6, 4, 7, 5, 3],
    'Blood_Pressure': [120, 115, 140, 130, 125, 145],
    'Cholesterol': [180, 170, 210, 190, 175, 220]
}
df = pd.DataFrame(data)

X = df[['Exercise', 'Diet']].values  # Set X
Y = df[['Blood_Pressure', 'Cholesterol']].values  # Set Y

# -------------------------
cca = CCA(n_components=1)  # We want 1 canonical component
cca.fit(X, Y)

# Transform data into canonical variables
X_c, Y_c = cca.transform(X, Y)

print("Canonical Correlations:", np.corrcoef(X_c.T, Y_c.T)[0, 1])
print("X Weights (Exercise, Diet):", cca.x_weights_)
print("Y Weights (BP, Cholesterol):", cca.y_weights_)

Results:

  • Canonical Correlation: 0.92 (strong relationship)
  • X Weights: [0.78, -0.62] → More exercise & better diet reduce health risks.
  • Y Weights: [-0.85, 0.53] → Lower BP and cholesterol are linked.

Canonical Correlation Analysis Example in R Language

# Load required libraries
install.packages("CCA")  # If not already installed
library(CCA)
# -------------

# Example data
df <- data.frame(
  Exercise = c(3, 5, 2, 4, 6, 1),
  Diet = c(8, 6, 4, 7, 5, 3),
  Blood_Pressure = c(120, 115, 140, 130, 125, 145),
  Cholesterol = c(180, 170, 210, 190, 175, 220)
)

X <- df[, c("Exercise", "Diet")]  # Set X
Y <- df[, c("Blood_Pressure", "Cholesterol")]  # Set Y

# -------------
# Fit the canonical correlation analysis
result <- cc(X, Y)

print(paste("Canonical Correlation:", result$cor[1]))  # First canonical correlation
print("X Weights (Exercise, Diet):")
print(result$xcoef)
print("Y Weights (BP, Cholesterol):")
print(result$ycoef)

Results

  • Canonical Correlation: 0.92
  • X Weights: Exercise (0.78), Diet (-0.62)
  • Y Weights: BP (-0.85), Cholesterol (0.53)

Decision from Example

  1. High canonical correlation (0.92) → Strong link between lifestyle (X) and health (Y).
  2. Exercise reduces Blood Pressure (negative weight in Y).
  3. Poor diet (low score) correlates with high cholesterol.


FAQs about Canonical Correlation Analysis

  • Explain canonical correlation analysis.
  • Give some examples of canonical correlation.
  • What are the objectives of Canonical Correlation Analysis (CCA)?
  • What are the limitations of Canonical Correlation Analysis (CCA)?
  • Give some real-life applications where canonical correlation analysis may help in finding the relationship between two sets of variables.

Multivariate Data Sets: Descriptive Statistics (2010)

Much of the information contained in a multivariate data set can be assessed by calculating certain summary numbers, known as multivariate descriptive statistics, such as the arithmetic mean (a measure of location) and the average of the squared distances of all the numbers from the mean (a measure of spread or variation). Here we will discuss descriptive statistics for multivariate data sets.

Multivariate data sets are used in various fields, such as:

  • Social Sciences: Analyzing factors influencing social phenomena like voting behavior, educational attainment, or health outcomes.
  • Business: Understanding customer demographics and purchase patterns, market research, risk assessment, and financial modeling.
  • Natural Sciences: Studying relationships between environmental variables, analyzing climate data, or exploring genetic factors influencing diseases.

Multivariate Data Sets: Descriptive Analysis

We shall rely most heavily on descriptive statistics that measure location, variation, and linear association. For a multivariate data set, let us start with a measure of location, a measure of spread, the sample covariance, and the sample correlation coefficient.

Measure of Location

The arithmetic average of $n$ measurements $(x_{11}, x_{21}, x_{31}, \cdots, x_{n1})$ on the first variable (defined in Multivariate Analysis: An Introduction) is

Sample Mean = $\bar{x}_{1}=\frac{1}{n} \sum_{j=1}^{n}x_{j1}$

The sample mean for $n$ measurements on each of the $p$ variables (there will be $p$ sample means) is

$\bar{x}_{k} =\frac{1}{n} \sum_{j=1}^{n}x_{jk}, \mbox{ where } k = 1, 2, \cdots, p$

Measure of Spread

The measure of spread (variance) for $n$ measurements on the first variable of a multivariate data set can be found as
$s_{1}^{2} =\frac{1}{n} \sum_{j=1}^{n}(x_{j1} -\bar{x}_{1})^{2}$, where $\bar{x}_{1}$ is the sample mean of the $x_{j1}$’s.

The measure of spread (variance) for $n$ measurements on the $k$th variable can be found as

$s_{k}^{2} =\frac{1}{n} \sum_{j=1}^{n}(x_{jk} -\bar{x}_{k})^{2}, \mbox{ where } k=1,2,\cdots,p$

The sample variance is also written as $s_{k}^{2} = s_{kk} =\frac{1}{n} \sum_{j=1}^{n}(x_{jk} -\bar{x}_{k})^{2}$, where $k=1,2,\cdots,p$; the square root of the sample variance, $\sqrt{s_{kk}}$, is the sample standard deviation.
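
As a quick numerical illustration, the following Python sketch (the small toy data matrix is an assumption for illustration only) computes the $p$ sample means, variances, and standard deviations with the divisor $n$ used above:

import numpy as np

# Toy multivariate data: n = 4 measurements on each of p = 3 variables
X = np.array([[4.0, 2.0, 0.60],
              [4.2, 2.1, 0.59],
              [3.9, 2.0, 0.58],
              [4.3, 2.1, 0.62]])
n = X.shape[0]

x_bar = X.mean(axis=0)                   # p sample means, one per variable
s2 = ((X - x_bar) ** 2).sum(axis=0) / n  # sample variances with divisor n
s = np.sqrt(s2)                          # sample standard deviations

print("Sample means:", x_bar)
print("Sample variances:", s2)
print("Sample standard deviations:", s)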


Sample Covariance

Consider $n$ pairs of measurements on each of Variable 1 and Variable 2
\[\left[\begin{array}{c} {x_{11} } \\ {x_{12} } \end{array}\right],\left[\begin{array}{c} {x_{21} } \\ {x_{22} } \end{array}\right],\cdots ,\left[\begin{array}{c} {x_{n1} } \\ {x_{n2} } \end{array}\right]\]
That is, $x_{j1}$ and $x_{j2}$ are observed on the $j$th experimental item $(j=1,2,\cdots,n)$. A measure of linear association between the measurements of $V_1$ and $V_2$ for multivariate data sets is provided by the sample covariance
\[s_{12} =\frac{1}{n} \sum _{j=1}^{n}(x_{j1} -\bar{x}_{1} )(x_{j2} -\bar{x}_{2}  )\]
(the average product of the deviation from their respective means) therefore

$s_{ik} =\frac{1}{n} \sum _{j=1}^{n}(x_{ji} -\bar{x}_{i} )(x_{jk} -\bar{x}_{k}  )$;  $i=1,2,\cdots, p$ and $k=1,2,\cdots, p$.

It measures the linear association between the $i$th and $k$th variables.

Variance is the most commonly used measure of dispersion (variation) in the data and it is directly proportional to the amount of variation or information available in the data.

Sample Correlation Coefficient

For Multivariate Data Sets, the sample correlation coefficient for the ith and kth variables is

\[r_{ik} =\frac{s_{ik}}{\sqrt{s_{ii}}\sqrt{s_{kk}}} =\frac{\sum_{j=1}^{n}(x_{ji} -\bar{x}_{i})(x_{jk} -\bar{x}_{k})}{\sqrt{\sum_{j=1}^{n}(x_{ji} -\bar{x}_{i})^{2}} \sqrt{\sum_{j=1}^{n}(x_{jk} -\bar{x}_{k})^{2}}}\]
where $i=1,2,\cdots,p$ and $k=1,2,\cdots,p$.

Note that $r_{ik} = r_{ki}$ for all $i$ and $k$, and that $r$ lies between $-1$ and $+1$. $r$ measures the strength of the linear association; if $r=0$, there is no linear association between the components. The sign of $r$ indicates the direction of the association.
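
The covariance and correlation formulas above translate directly into matrix operations. A minimal Python sketch, reusing the same assumed toy data as in the earlier sketch:

import numpy as np

# Same assumed toy data as before: n = 4 measurements on p = 3 variables
X = np.array([[4.0, 2.0, 0.60],
              [4.2, 2.1, 0.59],
              [3.9, 2.0, 0.58],
              [4.3, 2.1, 0.62]])
n = X.shape[0]
Xc = X - X.mean(axis=0)                 # deviations from the sample means

S = Xc.T @ Xc / n                       # sample covariance matrix, entries s_ik
D = np.diag(1.0 / np.sqrt(np.diag(S)))  # diagonal matrix of 1 / sqrt(s_ii)
R = D @ S @ D                           # sample correlation matrix, entries r_ik

print("Covariance matrix:\n", S)
print("Correlation matrix:\n", R)       # symmetric, r_ik = r_ki, entries in [-1, 1]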

Other Multivariate Analyses

Multiple Regression: It is used to model the relationship between a dependent variable (DV) and multiple independent variables (IV).

Principal Component Analysis (PCA): It reduces the dimensionality of data by identifying a smaller set of uncorrelated variables that capture most of the data’s variance.

Cluster Analysis: It groups the data points into clusters based on their similarities, helping identify subgroups within the data.

Discriminant Analysis: It classifies data points into predefined groups based on their characteristics.
