Discover the fundamentals of Ridge Regression, a powerful biased regression technique for handling multicollinearity and overfitting. Learn its canonical form, key differences from Lasso Regression (L1 vs L2 regularization), and why it’s essential for robust predictive modeling. Perfect for ML beginners and data scientists!
Introduction
In cases of near multicollinearity, the Ordinary Least Squares (OLS) estimator may perform worse than non-linear or biased estimators. Under near multicollinearity, the variance of the OLS regression coefficients ($\hat{\beta}=(X'X)^{-1}X'Y$), given by $\sigma^2(X'X)^{-1}$, can be very large. In terms of the Mean Squared Error (MSE) criterion, a biased estimator with smaller dispersion may therefore be more efficient.
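To see why a biased estimator can win under this criterion, recall the standard bias-variance decomposition of MSE: for any estimator $\tilde{\beta}$ of $\beta$,
$$\operatorname{MSE}(\tilde{\beta}) = E\|\tilde{\beta}-\beta\|^2 = \operatorname{tr}\operatorname{Var}(\tilde{\beta}) + \|\operatorname{Bias}(\tilde{\beta})\|^2.$$
OLS is unbiased, so its MSE equals $\sigma^2\operatorname{tr}\big((X'X)^{-1}\big) = \sigma^2\sum_i 1/\lambda_i$, which becomes very large when any eigenvalue $\lambda_i$ of $X'X$ is close to zero; a biased estimator can trade a little bias for a large reduction in this variance term.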
Understanding Ridge Regression
Ridge regression (RR) is a popular biased regression technique used to address multicollinearity and overfitting in linear regression models. Unlike ordinary least squares (OLS), RR introduces a regularization term (L2 penalty) to shrink coefficients, improving model stability and generalization.
Adding the matrix $KI_p$ (where $K$ is a scalar) to $X'X$ yields the more stable matrix $X'X+KI_p$. The ridge estimator of $\beta$, $\hat{\beta}_R=(X'X+KI_p)^{-1}X'Y$, should have a smaller dispersion than the OLS estimator.
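As an illustration, here is a minimal sketch in Python with NumPy (the simulated design matrix and the value of $K$ are made up for demonstration, not taken from the article) that computes the ridge estimator directly from this closed form:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate two nearly collinear predictors (illustrative data only).
n = 100
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)   # x2 is almost a copy of x1
X = np.column_stack([x1, x2])
y = X @ np.array([1.0, 2.0]) + rng.normal(size=n)

p = X.shape[1]
K = 1.0  # ridge constant, chosen arbitrarily for the sketch

# OLS estimator: (X'X)^{-1} X'y -- unstable when X'X is near-singular.
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)

# Ridge estimator: (X'X + K I_p)^{-1} X'y -- the added K I_p stabilises the inverse.
beta_ridge = np.linalg.solve(X.T @ X + K * np.eye(p), X.T @ y)

print("OLS coefficients:  ", beta_ols)
print("Ridge coefficients:", beta_ridge)
```

With this kind of near-collinear data the OLS coefficients are typically erratic, while the ridge coefficients are noticeably more stable.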
Why Use Ridge Regression
OLS regression can produce high variance when predictors are highly correlated (multicollinearity). Ridge regression helps by:
- Reducing overfitting by penalizing large coefficients
- Improving model stability in the presence of multicollinearity
- Providing better predictions when data has many predictors
Canonical Form
Let $P$ denote the orthogonal matrix whose columns are the eigenvectors of $X'X$ and let $\Lambda$ be the diagonal matrix containing the corresponding eigenvalues. Consider the spectral decomposition and the associated transformations:
\begin{align*}
X'X &= P\Lambda P'\\
\alpha &= P'\beta\\
X^* &= XP\\
C &= X^{*\prime}Y
\end{align*}
The model $Y=X\beta + \varepsilon$ can be written as
$$Y = X^*\alpha + \varepsilon$$
The OLS estimator of $\alpha$ is
\begin{align*}
\hat{\alpha} &= (X^{*\prime}X^*)^{-1}X^{*\prime} Y\\
&= (P'X'XP)^{-1}C = \Lambda^{-1}C
\end{align*}
In scalar notation $$\hat{\alpha}_i=\frac{C_i}{\lambda_i},\quad i=1,2,\ldots,p\tag{A}$$
From $\hat{\beta}_R = (X'X+KI_p)^{-1}X'Y$, it follows that the principle of RR is to add a constant $K$ to the denominator of (A), to obtain:
$$\hat{\alpha}_i^R = \frac{C_i}{\lambda_i + K}$$
Groß criticized this approach on the grounds that it adds the same constant to all eigenvalues of $X'X$, as if they were all equal; for the purpose of stabilization it would be more reasonable to add rather large values to the small eigenvalues but only small values to the large eigenvalues. This leads to the general ridge (GR) estimator:
$$\hat{\alpha}_i^{GR} = \frac{C_i}{\lambda_i+K_i}$$
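The canonical form can be checked numerically. The sketch below (Python/NumPy; the simulated data, the constant $K$, and the per-eigenvalue choice of $K_i$ are all made up for illustration) verifies that the ridge estimator computed in canonical coordinates agrees with the direct closed form:

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative data: three correlated predictors.
n, p = 200, 3
Z = rng.normal(size=(n, p))
X = Z @ np.array([[1.0, 0.9, 0.8],
                  [0.0, 0.5, 0.4],
                  [0.0, 0.0, 0.3]])
y = X @ np.array([1.0, -1.0, 0.5]) + rng.normal(size=n)

K = 2.0

# Spectral decomposition X'X = P Lambda P'.
lam, P = np.linalg.eigh(X.T @ X)
X_star = X @ P            # X* = X P
C = X_star.T @ y          # C = X*' y

# Ridge estimator in canonical coordinates: alpha_i = C_i / (lambda_i + K).
alpha_ridge = C / (lam + K)

# Direct ridge estimator: (X'X + K I_p)^{-1} X'y.
beta_ridge = np.linalg.solve(X.T @ X + K * np.eye(p), X.T @ y)

# The two agree: beta_R = P alpha_R.
print(np.allclose(P @ alpha_ridge, beta_ridge))   # True

# General ridge (GR): a separate K_i per eigenvalue, larger for small lambda_i.
K_i = 1.0 / lam           # one ad-hoc choice, purely for illustration
alpha_gr = C / (lam + K_i)
beta_gr = P @ alpha_gr
print(beta_gr)
```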
Ridge Regression vs Lasso Regression
Both are regularized regression techniques, but:
Feature | Ridge Regression (L2) | Lasso Regression (L1) |
---|---|---|
Shrinkage | Shrinks all coefficients toward zero, but never exactly to zero | Can shrink some coefficients exactly to zero |
Use Case | Multicollinearity, many correlated predictors | Feature selection, sparse models |
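The difference in shrinkage behaviour is easy to see in a small experiment. The sketch below uses scikit-learn's Ridge and Lasso; the simulated data and the penalty strengths are arbitrary choices for illustration:

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(2)

# Simulated data: 10 predictors, only the first 3 carry signal.
n, p = 200, 10
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:3] = [3.0, -2.0, 1.5]
y = X @ beta + rng.normal(size=n)

ridge = Ridge(alpha=1.0).fit(X, y)   # L2 penalty: shrinks, never exactly zero
lasso = Lasso(alpha=0.5).fit(X, y)   # L1 penalty: can set coefficients to zero

print("Ridge coefficients:", np.round(ridge.coef_, 3))
print("Lasso coefficients:", np.round(lasso.coef_, 3))
print("Number of exact zeros in Lasso fit:", int(np.sum(lasso.coef_ == 0)))
```

The ridge fit keeps all coefficients non-zero but shrunken, while the lasso fit typically drives the coefficients of the irrelevant predictors exactly to zero.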
Ridge regression is a powerful biased regression method that improves prediction accuracy by adding L2 regularization. It’s especially useful when dealing with multicollinearity and high-dimensional data.