Discover the fundamentals of Ridge Regression, a powerful biased regression technique for handling multicollinearity and overfitting. Learn its canonical form, key differences from Lasso Regression (L1 vs L2 regularization), and why it’s essential for robust predictive modeling. Perfect for ML beginners and data scientists!
Introduction
In cases of near multicollinearity, the Ordinary Least Squares (OLS) estimator may perform worse than non-linear or biased estimators. Under near multicollinearity, the variance of the OLS regression coefficients ($\hat{\beta}=(X'X)^{-1}X'Y$), given by $\sigma^2(X'X)^{-1}$, can be very large. In terms of the Mean Squared Error (MSE) criterion, a biased estimator with smaller dispersion may therefore be more efficient.
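To see why a biased estimator can win under this criterion, recall the standard bias-variance decomposition of MSE: for any estimator $\tilde{\beta}$ of $\beta$,
$$\operatorname{MSE}(\tilde{\beta}) = E\|\tilde{\beta}-\beta\|^2 = \operatorname{tr}\operatorname{Var}(\tilde{\beta}) + \|\operatorname{Bias}(\tilde{\beta})\|^2.$$
OLS is unbiased, so its MSE equals $\sigma^2\operatorname{tr}\big((X'X)^{-1}\big) = \sigma^2\sum_i 1/\lambda_i$, which becomes very large when any eigenvalue $\lambda_i$ of $X'X$ is close to zero; a biased estimator can trade a little bias for a large reduction in this variance term.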
Understanding Ridge Regression
Ridge regression (RR) is a popular biased regression technique used to address multicollinearity and overfitting in linear regression models. Unlike ordinary least squares (OLS), RR introduces a regularization term (L2 penalty) to shrink coefficients, improving model stability and generalization.
Adding the matrix $KI_p$ (where $K$ is a scalar) to $X'X$ yields the more stable matrix $X'X+KI_p$. The ridge estimator of $\beta$, $\hat{\beta}_R=(X'X+KI_p)^{-1}X'Y$, should have a smaller dispersion than the OLS estimator.
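As an illustration, here is a minimal sketch in Python with NumPy (the simulated design matrix and the value of $K$ are made up for demonstration, not taken from the article) that computes the ridge estimator directly from this closed form:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate two nearly collinear predictors (illustrative data only).
n = 100
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)   # x2 is almost a copy of x1
X = np.column_stack([x1, x2])
y = X @ np.array([1.0, 2.0]) + rng.normal(size=n)

p = X.shape[1]
K = 1.0  # ridge constant, chosen arbitrarily for the sketch

# OLS estimator: (X'X)^{-1} X'y -- unstable when X'X is near-singular.
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)

# Ridge estimator: (X'X + K I_p)^{-1} X'y -- the added K I_p stabilises the inverse.
beta_ridge = np.linalg.solve(X.T @ X + K * np.eye(p), X.T @ y)

print("OLS coefficients:  ", beta_ols)
print("Ridge coefficients:", beta_ridge)
```

With this kind of near-collinear data the OLS coefficients are typically erratic, while the ridge coefficients are noticeably more stable.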
Why Use Ridge Regression
OLS regression can produce high variance when predictors are highly correlated (multicollinearity). Ridge regression helps by:
- Reducing overfitting by penalizing large coefficients
- Improving model stability in the presence of multicollinearity
- Providing better predictions when data has many predictors
Canonical Form
Let $P$ denote the orthogonal matrix whose columns are the eigenvectors of $X'X$ and let $\Lambda$ be the diagonal matrix containing the corresponding eigenvalues. Consider the spectral decomposition and the associated transformations:
\begin{align*}
X'X &= P\Lambda P'\\
\alpha &= P'\beta\\
X^* &= XP\\
C &= X^{*\prime}Y
\end{align*}
The model $Y=X\beta + \varepsilon$ can be written as
$$Y = X^*\alpha + \varepsilon$$
The OLS estimator of $\alpha$ is
\begin{align*}
\hat{\alpha} &= (X^{*\prime}X^*)^{-1}X^{*\prime} Y\\
&= (P'X'XP)^{-1}C = \Lambda^{-1}C
\end{align*}
In scalar notation $$\hat{\alpha}_i=\frac{C_i}{\lambda_i},\quad i=1,2,\ldots,p\tag{A}$$
From $\hat{\beta}_R = (X'X+KI_p)^{-1}X'Y$, it follows that the principle of RR is to add a constant $K$ to the denominator of (A), to obtain:
$$\hat{\alpha}_i^R = \frac{C_i}{\lambda_i + K}$$
Groß criticized this approach on the grounds that it adds the same constant to all eigenvalues of $X'X$, as if they were all equal; for the purpose of stabilization it would be more reasonable to add rather large values to the small eigenvalues but only small values to the large eigenvalues. This leads to the general ridge (GR) estimator:
$$\hat{\alpha}_i^{GR} = \frac{C_i}{\lambda_i+K_i}$$
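The canonical form can be checked numerically. The sketch below (Python/NumPy; the simulated data, the constant $K$, and the per-eigenvalue choice of $K_i$ are all made up for illustration) verifies that the ridge estimator computed in canonical coordinates agrees with the direct closed form:

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative data: three correlated predictors.
n, p = 200, 3
Z = rng.normal(size=(n, p))
X = Z @ np.array([[1.0, 0.9, 0.8],
                  [0.0, 0.5, 0.4],
                  [0.0, 0.0, 0.3]])
y = X @ np.array([1.0, -1.0, 0.5]) + rng.normal(size=n)

K = 2.0

# Spectral decomposition X'X = P Lambda P'.
lam, P = np.linalg.eigh(X.T @ X)
X_star = X @ P            # X* = X P
C = X_star.T @ y          # C = X*' y

# Ridge estimator in canonical coordinates: alpha_i = C_i / (lambda_i + K).
alpha_ridge = C / (lam + K)

# Direct ridge estimator: (X'X + K I_p)^{-1} X'y.
beta_ridge = np.linalg.solve(X.T @ X + K * np.eye(p), X.T @ y)

# The two agree: beta_R = P alpha_R.
print(np.allclose(P @ alpha_ridge, beta_ridge))   # True

# General ridge (GR): a separate K_i per eigenvalue, larger for small lambda_i.
K_i = 1.0 / lam           # one ad-hoc choice, purely for illustration
alpha_gr = C / (lam + K_i)
beta_gr = P @ alpha_gr
print(beta_gr)
```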
Ridge Regression vs Lasso Regression
Both are regularized regression techniques, but:
Feature | Ridge Regression (L2) | Lasso Regression (L1) |
---|---|---|
Shrinkage | Shrinks all coefficients toward zero, but never exactly to zero | Can shrink some coefficients exactly to zero |
Use Case | Multicollinearity, many correlated predictors | Feature selection, sparse models |
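The difference in shrinkage behaviour is easy to see in a small experiment. The sketch below uses scikit-learn's Ridge and Lasso; the simulated data and the penalty strengths are arbitrary choices for illustration:

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(2)

# Simulated data: 10 predictors, only the first 3 carry signal.
n, p = 200, 10
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:3] = [3.0, -2.0, 1.5]
y = X @ beta + rng.normal(size=n)

ridge = Ridge(alpha=1.0).fit(X, y)   # L2 penalty: shrinks, never exactly zero
lasso = Lasso(alpha=0.5).fit(X, y)   # L1 penalty: can set coefficients to zero

print("Ridge coefficients:", np.round(ridge.coef_, 3))
print("Lasso coefficients:", np.round(lasso.coef_, 3))
print("Number of exact zeros in Lasso fit:", int(np.sum(lasso.coef_ == 0)))
```

The ridge fit keeps all coefficients non-zero but shrunken, while the lasso fit typically drives the coefficients of the irrelevant predictors exactly to zero.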
Ridge regression is a powerful biased regression method that improves prediction accuracy by adding L2 regularization. It’s especially useful when dealing with multicollinearity and high-dimensional data.