This post is about the importance and role of the Hat Matrix in Regression Analysis.
The hat matrix is an $n\times n$ symmetric and idempotent matrix with many special properties. It plays an important role in regression diagnostics because it transforms the vector of observed responses $Y$ into the vector of fitted responses $\hat{Y}$.
For the model $Y=X\beta+\varepsilon$, the least squares solution is $b=(X'X)^{-1}X'Y$, provided that $X'X$ is non-singular. The fitted values are $\hat{Y}=Xb=X(X'X)^{-1} X'Y=HY$, where $H=X(X'X)^{-1}X'$ is the hat matrix.
Like the fitted values $\hat{Y}$, the residuals can be expressed as linear combinations of the observed responses $Y_i$.
\begin{align*}
e&=Y-\hat{Y}\\
&=Y-HY\\
&=(I-H)Y
\end{align*}
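As a concrete illustration, here is a minimal numerical sketch in Python/NumPy (the data are simulated and purely hypothetical, not from any real study) showing that $HY$ reproduces the fitted values and $(I-H)Y$ the residuals:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 20, 3                                   # n cases, p parameters (incl. intercept)
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([2.0, 1.5, -0.5]) + rng.normal(size=n)

b = np.linalg.solve(X.T @ X, X.T @ y)          # b = (X'X)^{-1} X'y
H = X @ np.linalg.solve(X.T @ X, X.T)          # hat matrix H = X (X'X)^{-1} X'

y_hat = H @ y                                  # fitted values  HY
e = (np.eye(n) - H) @ y                        # residuals      (I - H)Y

print(np.allclose(y_hat, X @ b))               # True: HY = Xb
print(np.allclose(e, y - y_hat))               # True: e = Y - Y_hat
```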
The roles of the hat matrix in Regression Analysis and Regression Diagnostics are:
- The hat matrix involves only the observations on the predictor variables $X$, since $H=X(X'X)^{-1}X'$. This is what makes it so useful in diagnostics for regression analysis.
- The hat matrix plays an important role in determining the magnitude of a studentized deleted residual and identifying outlying Y observations.
- The hat matrix is also helpful in directly identifying outlying $X$ observations.
- In particular, the diagonal elements of the hat matrix indicate, in a multi-variable setting, whether or not a case is outlying with respect to its $X$ values.
- The diagonal elements of the hat matrix always lie between 0 and 1, and their sum equals $p$, i.e. $0 \le h_{ii}\le 1$ and $\sum _{i=1}^{n}h_{ii} =p$, where $p$ is the number of regression parameters, including the intercept term (see the numerical check after this list).
- $h_{ii}$ is a measure of the distance between the $X$ values for the $i$th case and the means of the $X$ values for all $n$ cases.
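A quick numerical check of these leverage properties, reusing the same kind of simulated data as in the earlier sketch (again a hypothetical example, not a prescribed implementation):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 20, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
H = X @ np.linalg.solve(X.T @ X, X.T)          # hat matrix

h = np.diag(H)                                 # leverages h_ii
print(h.min() >= 0, h.max() <= 1)              # True True: 0 <= h_ii <= 1
print(np.isclose(h.sum(), p))                  # True: sum of h_ii = trace(H) = p
```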
Mathematical Properties of the Hat Matrix
- $HX=X$
- $(I-H)X=0$
- $HH=H^{2}=H$, and in general $H^{k}=H$ for any positive integer $k$ (idempotency)
- $H(I-H)=0$
- $Cov(\hat{Y},e)=Cov\left\{HY,(I-H)Y\right\}=\sigma ^{2} H(I-H)=0$
- $I-H$ is also symmetric and idempotent.
- $H\mathbf{1}=\mathbf{1}$ when the model contains an intercept term, i.e. every row of $H$ adds up to 1. Similarly, $\mathbf{1}'=\mathbf{1}'H'=\mathbf{1}'H$ and $\mathbf{1}'H\mathbf{1}=n$ (these identities are checked numerically in the sketch after this list).
- The elements of $H$ are denoted by $h_{ij}$, i.e.
\[H=\begin{pmatrix}{h_{11} } & {h_{12} } & {\cdots } & {h_{1n} } \\ {h_{21} } & {h_{22} } & {\cdots } & {h_{2n} } \\ {\vdots } & {\vdots } & {\ddots } & {\vdots } \\ {h_{n1} } & {h_{n2} } & {\cdots } & {h_{nn} }\end{pmatrix}\]
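The properties above are easy to verify numerically. The following sketch again uses simulated data and assumes a design matrix with an intercept column; it is only an illustrative check, not part of the derivation itself:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 20, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
H = X @ np.linalg.solve(X.T @ X, X.T)
I = np.eye(n)

print(np.allclose(H @ X, X))                   # HX = X
print(np.allclose((I - H) @ X, 0))             # (I - H)X = 0
print(np.allclose(H @ H, H))                   # H^2 = H (idempotent)
print(np.allclose(H, H.T))                     # H is symmetric
print(np.allclose(H @ (I - H), 0))             # H(I - H) = 0
print(np.allclose(H @ np.ones(n), np.ones(n))) # H1 = 1 (model has an intercept)
```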
A large value of $h_{ii}$ indicates that the $i$th case is distant from the center of all $n$ cases. In this context, the diagonal element $h_{ii}$ is called the leverage of the $i$th case. $h_{ii}$ is a function of the $X$ values only, so $h_{ii}$ measures the role of the $X$ values in determining how much influence $Y_i$ has on the fitted value $\hat{Y}_{i}$.
The larger the $h_{ii}$, the smaller the variance of the residual $e_i$; for $h_{ii}=1$, $\sigma^2(e_i)=0$.
Variance and Covariance of $e$
\begin{align*}
e-E(e)&=(I-H)Y-(I-H)X\beta =(I-H)\varepsilon \\
E(\varepsilon \varepsilon ')&=V(\varepsilon )=I\sigma ^{2} \,\,\text{and} \,\, E(\varepsilon )=0\\
(I-H)'&=(I'-H')=(I-H)\\
V(e) & = E\left\{\left[e-E(e)\right]\left[e-E(e)\right]'\right\} \\
& = (I-H)E(\varepsilon \varepsilon ')(I-H)' \\
& = (I-H)I\sigma ^{2} (I-H)' \\
& =(I-H)(I-H)\sigma ^{2} =(I-H)\sigma ^{2}
\end{align*}
$V(e_i)$ is given by the $i$th diagonal element $(1-h_{ii})\sigma^2$ of the matrix $(I-H)\sigma^2$, and $Cov(e_i, e_j)$ is given by its $(i,j)$th element $-h_{ij}\sigma^2$. The correlation between two residuals is therefore
\begin{align*}
\rho _{ij} &=\frac{Cov(e_{i} ,e_{j} )}{\sqrt{V(e_{i} )V(e_{j} )} } \\
&=\frac{-h_{ij} }{\sqrt{(1-h_{ii} )(1-h_{jj} )} }
\end{align*}
The regression sum of squares can also be written in terms of the hat matrix:
\begin{align*}
SS(b) & = SS(\text{all parameters})=b'X'Y \\
& = \hat{Y}'Y=Y'H'Y=Y'HY=Y'H^{2} Y=\hat{Y}'\hat{Y}
\end{align*}
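A small numerical sketch of these variance and correlation results, with an assumed (hypothetical) error variance $\sigma^2=1$ and the same simulated design as before:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 20, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
H = X @ np.linalg.solve(X.T @ X, X.T)
h = np.diag(H)
sigma2 = 1.0                                   # assumed error variance (hypothetical)

V_e = (np.eye(n) - H) * sigma2                 # Var(e) = (I - H) sigma^2
print(np.allclose(np.diag(V_e), (1 - h) * sigma2))  # V(e_i) = (1 - h_ii) sigma^2

i, j = 0, 1                                    # correlation between residuals e_0 and e_1
rho_ij = -H[i, j] / np.sqrt((1 - h[i]) * (1 - h[j]))
print(np.isclose(rho_ij, V_e[i, j] / np.sqrt(V_e[i, i] * V_e[j, j])))  # True
```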
The average of $V(\hat{Y}_{i} )$ over all data points is
\begin{align*}
\sum _{i=1}^{n}\frac{V(\hat{Y}_{i} )}{n} &=\frac{trace(H\sigma ^{2} )}{n}=\frac{p\sigma ^{2} }{n}
\end{align*}
Each fitted value is a weighted combination of all the observed responses, with the weights given by the corresponding row of the hat matrix:
\begin{align*}
\hat{Y}_{i} &=h_{ii} Y_{i} +\sum _{j\ne i}h_{ij} Y_{j}
\end{align*}
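Both facts can be checked in the same way as before (simulated data, assumed $\sigma^2=1$):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 20, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([2.0, 1.5, -0.5]) + rng.normal(size=n)
H = X @ np.linalg.solve(X.T @ X, X.T)
sigma2 = 1.0                                   # assumed error variance (hypothetical)

avg_var_fit = np.trace(H * sigma2) / n         # average of V(Y_hat_i) = h_ii sigma^2
print(np.isclose(avg_var_fit, p * sigma2 / n)) # True: equals p sigma^2 / n

i = 0                                          # Y_hat_i as a weighted sum of all Y_j
yhat_i = H[i, i] * y[i] + sum(H[i, j] * y[j] for j in range(n) if j != i)
print(np.isclose(yhat_i, (H @ y)[i]))          # True
```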
Role of the Hat Matrix in Regression Diagnostics
Internally Studentized Residuals
$V(e_i)=(1-h_{ii})\sigma^2$, where $\sigma^2$ is estimated by $s^2$,
i.e. $s^{2} =\frac{e'e}{n-p} =\frac{\sum e_{i}^{2} }{n-p}$ (the residual mean square).
We can studentize the residuals as $s_{i} =\frac{e_{i} }{s\sqrt{(1-h_{ii} )} }$.
These studentized residuals are said to be internally studentized because $s$ has within it $e_i$ itself.
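A minimal sketch of computing internally studentized residuals from the hat matrix, again on simulated (hypothetical) data:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 20, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([2.0, 1.5, -0.5]) + rng.normal(size=n)
H = X @ np.linalg.solve(X.T @ X, X.T)
h = np.diag(H)

e = y - H @ y                                  # residuals
s2 = (e @ e) / (n - p)                         # residual mean square s^2
s_int = e / np.sqrt(s2 * (1 - h))              # internally studentized residuals
print(np.round(s_int[:5], 3))
```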
Extra Sum of Squares attributable to $e_i$
\begin{align*}
e&=(I-H)Y\\
e_{i} &=-h_{i1} Y_{1} -h_{i2} Y_{2} -\cdots +(1-h_{ii} )Y_{i} -\cdots -h_{in} Y_{n} =c'Y\\
c'&=(-h_{i1} ,-h_{i2} ,\cdots ,(1-h_{ii} ),\cdots ,-h_{in} )\\
c'c&=\sum _{j=1}^{n}h_{ij}^{2} +(1-2h_{ii} )=h_{ii} +(1-2h_{ii} )=(1-h_{ii} )\quad \text{since } \sum _{j}h_{ij}^{2} =h_{ii}\\
SS(e_{i})&=\frac{e_{i}^{2} }{(1-h_{ii} )}\\
s_{(i)}^{2}&=\frac{(n-p)s^{2} -e_{i}^{2} /(1-h_{ii} )}{n-p-1}
\end{align*}
The quantity $s_{(i)}^{2}$ provides an estimate of $\sigma^2$ after deletion of the contribution of $e_i$.
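The formula for $s_{(i)}^{2}$ can be cross-checked by actually refitting the model without case $i$; the sketch below does this for a single (hypothetical) case $i=0$ on simulated data:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 20, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([2.0, 1.5, -0.5]) + rng.normal(size=n)
H = X @ np.linalg.solve(X.T @ X, X.T)
h = np.diag(H)
e = y - H @ y
s2 = (e @ e) / (n - p)

i = 0
s2_del = ((n - p) * s2 - e[i]**2 / (1 - h[i])) / (n - p - 1)   # s_(i)^2 from the formula

# cross-check: residual mean square after actually deleting case i and refitting
Xd, yd = np.delete(X, i, axis=0), np.delete(y, i)
bd = np.linalg.solve(Xd.T @ Xd, Xd.T @ yd)
ed = yd - Xd @ bd
print(np.isclose(s2_del, (ed @ ed) / (n - 1 - p)))             # True
```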
Externally Studentized Residuals
$t_{i} =\frac{e_{i} }{s_{(i)}\sqrt{(1-h_{ii} )} }$ are the externally studentized residuals. Here, if $e_i$ is large, it is emphasized even more by the fact that $s_{(i)}$ has excluded it. Each $t_i$ follows a $t_{n-p-1}$ distribution under the usual normality-of-errors assumption.
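A sketch of computing externally studentized residuals and the corresponding two-sided $t_{n-p-1}$ p-values (SciPy is assumed to be available for the $t$ distribution; the data are simulated and hypothetical):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, p = 20, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([2.0, 1.5, -0.5]) + rng.normal(size=n)
H = X @ np.linalg.solve(X.T @ X, X.T)
h = np.diag(H)
e = y - H @ y
s2 = (e @ e) / (n - p)

s2_del = ((n - p) * s2 - e**2 / (1 - h)) / (n - p - 1)   # s_(i)^2 for every case
t = e / np.sqrt(s2_del * (1 - h))                        # externally studentized residuals
p_values = 2 * stats.t.sf(np.abs(t), df=n - p - 1)       # two-sided p-values under normality
print(np.round(t[:5], 3))
```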
Read more about the Role of the Hat Matrix in Regression Analysis https://en.wikipedia.org/wiki/Hat_matrix
Read about Regression Diagnostics