Principal Component Regression (PCR) is a statistical technique that combines two powerful methods: Principal Component Analysis (PCA) and linear regression.
PCA transforms the original data set into a new set of uncorrelated variables called principal components. The transformation ranks the new variables by importance (that is, by the size of their variance), which allows those of least importance to be dropped. A least squares regression is then performed on this reduced set of principal components, which is called principal component regression.
Principal Component Regression (PCR)
Principal Component Regression (PCR) is not scale invariant, so the data should be centered and scaled first. Consider a p-dimensional random vector $x=(x_1, x_2, \ldots, x_p)^t$ with covariance matrix $\Sigma$, and assume that $\Sigma$ is positive definite. Let $V=(v_1, v_2, \cdots, v_p)$ be a $(p \times p)$-matrix with orthonormal column vectors, that is, $v_i^t\, v_i=1$ for $i=1,2,\cdots,p$, and $V^t = V^{-1}$. Consider the linear transformation
\begin{aligned}
z&=V^t x\\
z_i&=v_i^t x
\end{aligned}
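As a quick illustration of these definitions, here is a minimal sketch (NumPy, with made-up data; all variable names are my own):

```python
import numpy as np

# Minimal sketch: build an orthonormal matrix V and apply z = V^t x.
rng = np.random.default_rng(0)
V, _ = np.linalg.qr(rng.normal(size=(3, 3)))  # QR yields orthonormal columns

# The stated properties: v_i^t v_i = 1 and V^t = V^{-1}
assert np.allclose(V.T @ V, np.eye(3))

x = rng.normal(size=3)   # one p-dimensional observation (p = 3)
z = V.T @ x              # the transformed variables z_i = v_i^t x
```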
Since the data are centered, $E[x]=0$ and $\Sigma=E[x\,x^t]$. The variance of the random variable $z_i$ is
\begin{aligned}
Var(z_i)&=E[v_i^t\, x\, x^t\, v_i]\\
&=v_i^t\, \Sigma\, v_i
\end{aligned}
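As a numerical sanity check of this identity (a sketch with simulated, centered data; the names are illustrative):

```python
import numpy as np

# Check that Var(z_1) = v_1^t Sigma v_1 for centered data.
rng = np.random.default_rng(2)
X = rng.normal(size=(5000, 3)) @ rng.normal(size=(3, 3))  # correlated columns
X = X - X.mean(axis=0)                                    # center so E[x] = 0

Sigma = np.cov(X, rowvar=False)
a, V = np.linalg.eigh(Sigma)
v1 = V[:, -1]                    # eigenvector with the largest eigenvalue

z1 = X @ v1                      # scores z_1 = v_1^t x
print(np.var(z1, ddof=1), v1 @ Sigma @ v1)   # the two values agree
```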
Maximizing the variance $Var(z_i)$ subject to the constraint $v_i^t v_i=1$ with a Lagrange multiplier $a_i$ gives the Lagrangian
\[\phi_i=v_i^t\, \Sigma\, v_i - a_i(v_i^t v_i-1)\]
Setting the partial derivative to zero, we get
\[\frac{\partial \phi_i}{\partial v_i} = 2\,\Sigma\, v_i - 2a_i v_i=0\]
which is
\[(\Sigma - a_i I)v_i=0\]
In matrix form,
\[\Sigma V = VA\]
or
\[\Sigma = VAV^t\]
where $A=\mathrm{diag}(a_1, a_2, \cdots, a_p)$. This is known as the eigenvalue problem: the $v_i$ are the eigenvectors of $\Sigma$ and the $a_i$ the corresponding eigenvalues, ordered so that $a_1 \ge a_2 \ge \cdots \ge a_p$. Since $\Sigma$ is positive definite, all eigenvalues are real and positive.
$z_i$ is called the $i$-th principal component of $x$, and we have
\[Cov(z)=V^t\, Cov(x)\, V=V^t\, \Sigma\, V=A\]
The variance of the $i$-th principal component equals the eigenvalue $a_i$, and since the eigenvalues are ranked in descending order, the last principal component has the smallest variance. Because $A$ is a diagonal matrix, the principal components are mutually orthogonal and, in fact, uncorrelated.
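The following sketch (simulated data; illustrative names) solves the eigenvalue problem for a sample covariance matrix and verifies both $\Sigma = VAV^t$ and $Cov(z)=A$:

```python
import numpy as np

# Eigendecomposition of a sample covariance matrix.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))
X = (X - X.mean(axis=0)) / X.std(axis=0)    # center and scale first

Sigma = np.cov(X, rowvar=False)
a, V = np.linalg.eigh(Sigma)                # symmetric eigenvalue problem
order = np.argsort(a)[::-1]                 # descending: a_1 >= a_2 >= ... >= a_p
a, V = a[order], V[:, order]

assert np.allclose(Sigma, V @ np.diag(a) @ V.T)  # Sigma = V A V^t
assert np.all(a > 0)                             # positive definite

Z = X @ V                                        # principal component scores
assert np.allclose(np.cov(Z, rowvar=False), np.diag(a))  # Cov(z) = A
```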
In the following, we use the first $q$ ($1\le q \le p$) principal components for the regression. The regression model for observed data $X$ and $y$ can then be expressed as
\begin{aligned}
y&=X\beta+\varepsilon\\
&=XVV^t\beta+\varepsilon\\
&= Z\theta+\varepsilon
\end{aligned}
with the $n\times q$ matrix of empirical principal components $Z=XV$ and the new regression coefficients $\theta=V^t \beta$, where for $q<p$ the matrix $V$ is restricted to its first $q$ columns (so the reduced model approximates the full one). The solution of the least squares estimation is
\begin{aligned}
\hat{\theta}_k=(z_k^t z_k)^{-1}z_k^ty
\end{aligned}
and $\hat{\theta}=(\hat{\theta}_1, \cdots, \hat{\theta}_q)^t$.
Since the $z_k$ are orthogonal, the regression is a sum of univariate regressions, that is
\[\hat{y}_{PCR}=\sum_{k=1}^q \hat{\theta}_k z_k\]
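In code, the componentwise fit might look like this sketch (simulated data; $q=2$ and all names are assumptions of mine):

```python
import numpy as np

# PCR fit with the first q components: since the z_k are orthogonal,
# each theta_hat_k is a separate univariate least squares coefficient.
rng = np.random.default_rng(3)
n, p, q = 150, 5, 2
X = rng.normal(size=(n, p))
X = (X - X.mean(axis=0)) / X.std(axis=0)      # center and scale
y = X @ rng.normal(size=p) + 0.1 * rng.normal(size=n)
y = y - y.mean()                              # center the response too

a, V = np.linalg.eigh(np.cov(X, rowvar=False))
V = V[:, np.argsort(a)[::-1]]                 # eigenvectors by decreasing variance
Z = X @ V[:, :q]                              # n x q empirical principal components

# theta_hat_k = (z_k^t z_k)^{-1} z_k^t y, one component at a time
theta_hat = np.array([(Z[:, k] @ y) / (Z[:, k] @ Z[:, k]) for k in range(q)])
y_hat = Z @ theta_hat                         # y_hat_PCR = sum_k theta_hat_k z_k
```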
Since the $z_k$ are linear combinations of the original $x_j$, the solution can be expressed in terms of coefficients of the $x_j$ as
\[\hat{\beta}_{PCR} (q)=\sum_{k=1}^q \hat{\theta}_k v_k=V \hat{\theta}\]
Note that if $q=p$, we would get back the usual least squares estimates for the full model. For $q<p$, we get a “reduced” regression.
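A sketch of this back-transformation, including a check that $q=p$ reproduces the ordinary least squares fit (simulated data; all names are illustrative):

```python
import numpy as np

# beta_hat_PCR(q) = V theta_hat; with q = p this equals the OLS estimate.
rng = np.random.default_rng(4)
n, p = 100, 4
X = rng.normal(size=(n, p))
X = X - X.mean(axis=0)
y = X @ np.array([1.0, -2.0, 0.5, 0.0]) + rng.normal(size=n)
y = y - y.mean()

a, V = np.linalg.eigh(np.cov(X, rowvar=False))
V = V[:, np.argsort(a)[::-1]]

Z = X @ V                                        # all p components (q = p)
theta_hat = np.array([(Z[:, k] @ y) / (Z[:, k] @ Z[:, k]) for k in range(p)])
beta_pcr = V @ theta_hat                         # back to the scale of the x_j

beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None) # full least squares fit
assert np.allclose(beta_pcr, beta_ols)           # q = p recovers OLS
```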
Why use Principal Component Regression?
- Reduces dimensionality: When dealing with a large number of predictors, PCR can reduce the complexity of the model.
- Handles multicollinearity: If there is high correlation among the predictors (multicollinearity), PCR can address this issue, as the sketch after this list illustrates.
- Improves interpretability: In some cases, the principal components can be easier to interpret than the original variables.
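In practice, one common way to set up PCR is to chain standardization, PCA, and linear regression. Here is a sketch using scikit-learn on simulated, nearly collinear data (the dataset and the choice of two components are assumptions for illustration):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

# Two latent factors drive four nearly collinear predictors.
rng = np.random.default_rng(5)
n = 200
t = rng.normal(size=(n, 2))
X = np.column_stack([t[:, 0], t[:, 0] + 0.01 * rng.normal(size=n),
                     t[:, 1], t[:, 1] + 0.01 * rng.normal(size=n)])
y = t @ np.array([2.0, -1.0]) + 0.1 * rng.normal(size=n)

# PCR as a pipeline: standardize, project onto q = 2 components, regress.
pcr = make_pipeline(StandardScaler(), PCA(n_components=2), LinearRegression())
pcr.fit(X, y)
print(pcr.score(X, y))   # R^2 of the reduced-rank fit
```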
Important Points to Remember
- The dimensionality-reduction step in PCR (the PCA) is unsupervised: the components are chosen without looking at the response variable.
- The number of principal components used in the regression model is a crucial tuning parameter (see the cross-validation sketch after this list).
- PCR can be compared to Partial Least Squares Regression (PLS), another dimensionality reduction technique that considers the relationship between predictors and the response variable.
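Because the number of components is the crucial tuning parameter, a common approach is to choose it by cross-validation. A sketch with scikit-learn (simulated data; the grid and names are illustrative):

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV

# Pick the number of components q by 5-fold cross-validation.
rng = np.random.default_rng(6)
X = rng.normal(size=(150, 8))
y = X[:, :3] @ np.array([1.0, 0.5, -1.5]) + 0.2 * rng.normal(size=150)

pcr = Pipeline([("scale", StandardScaler()),
                ("pca", PCA()),
                ("ols", LinearRegression())])
search = GridSearchCV(pcr, {"pca__n_components": list(range(1, 9))}, cv=5)
search.fit(X, y)
print(search.best_params_)   # cross-validated choice of q
```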