# Regression Diagnostics

Regression Analysis, Hat Matrix, Residual Analysis, Regression Diagnostics

## Checking Normality of Error Term

### Normality of Error Term

In multiple linear regression models, the sum of squared residuals (SSR) is divided by $n-p$ (degrees of freedom, where $n$ is the total number of observations, and $p$ is the number of the parameter in the model) is a good estimate of the error variance. In the multiple linear regression model, the residual vector is

\begin{align*}
e &=(I-H)y\\
&=(I-H)(X\beta+e)\\
&=(I-H)\varepsilon
\end{align*}

where $H$ is the hat matrix for the regression model.

Each component $e_i=\varepsilon – \sum\limits_{i=1}^n h_{ij} \varepsilon_i$. Therefore, In multiple linear regression models, the normality of the residual is not simply the normality of the error term.

Note that:

$Cov(\mathbf{e})=(I-H)\sigma^2 (I-H)’ = (I-H)\sigma^2$

We can write $Var(e_i)=(1-h_{ii})\sigma^2$.

If the sample size ($n$) is much larger than the number of the parameters ($p$) in the model (i.e. $n > > p$), in other words, if sample size ($n$) is large enough, $h_{ii}$ will be small as compared to 1, and $Var(e_i) \approx \sigma^2$.

In multiple regression models, a residual behaves like an error if the sample size is large. However, this is not true for a small sample size.

It is unreliable to check the normality of error term assumption using residuals from multiple linear regression models when the sample size is small.

## Role of Hat Matrix in Regression Analysis

The post is about the importance and role of the Hat Matrix in Regression Analysis.

Hat matrix is a $n\times n$ symmetric and idempotent matrix with many special properties that play an important role in the diagnostics of regression analysis by transforming the vector of observed responses $Y$ into the vector of fitted responses $\hat{Y}$.

The model $Y=X\beta+\varepsilon$ with solution $b=(X’X)^{-1}X’Y$ provided that $(X’X)^{-1}$ is non-singular. The fitted values are ${\hat{Y}=Xb=X(X’X)^{-1} X’Y=HY}$.

Like fitted values ($\hat{Y}$), the residual can be expressed as linear combinations of the response variable $Y_i$.

\begin{align*}
e&=Y-\hat{Y}\\
&=Y-HY\\&=(I-H)Y
\end{align*}

The role of hat matrix in Regression Analysis and Regression Diagnostics is:

• The hat matrix only involves the observation in the predictor variable X  as $H=X(X’X)^{-1}X’$. It plays an important role in diagnostics for regression analysis.
• The hat matrix plays an important role in determining the magnitude of a studentized deleted residual and therefore in identifying outlying Y observations.
• The hat matrix is also helpful in directly identifying outlying $X$ observations.
• In particular, the diagonal elements of the hat matrix are indicators in a multi-variable setting of whether or not a case is outlying concerning $X$ values.
• The elements of the “Hat matrix” have their values between 0 and 1 always and their sum is p i.e. $0 \le h_{ii}\le 1$  and  $\sum _{i=1}^{n}h_{ii} =p$
where p is the number of regression parameters with intercept term.
• $h_{ii}$ is a measure of the distance between the $X$ values for the ith case and the means of the $X$ values for all $n$ cases.

### Mathematical Properties of Hat Matrix

• $HX=X$
• $(I-H)X=0$
• $HH=H^2 = H H^p$
• $H(I-H)=0$
• $Cov(\hat{e},\hat{Y})=Cov\left\{HY,(I-H)Y\right\}=\sigma ^{2} H(I-H)=0$
• $1-H$ is also symmetric and idempotent.
• $H1=1$ with intercept term. i.e. every row of $H$ adds up to $1. 1’=1H’=1’H$  & $1’H1=n$
• The elements of $H$ are denoted by $h_{ii}$ i.e.
$H=\begin{pmatrix}{h_{11} } & {h_{12} } & {\cdots } & {h_{1n} } \\ {h_{21} } & {h_{22} } & {\cdots } & {h_{2n} } \\ {\vdots } & {\vdots } & {\ddots } & {\vdots } \\ {h_{n1} } & {h_{n2} } & {\vdots } & {h_{nn} }\end{pmatrix}$
The large value of $h_{ii}$ indicates that the ith case is distant from the center for all $n$ cases. The diagonal element $h_{ii}$ in this context is called leverage of the ith case. $h_{ii}$ is a function of only the $X$ values, so $h_{ii}$ measures the role of the $X$ values in determining how important $Y_i$ is affecting the fitted $\hat{Y}_{i}$ values.
The larger the $h_{ii}$ the smaller the variance of the residuals $e_i$ for $h_{ii}=1$, $\sigma^2(ei)=0$.
• Variance, Covariance of $e$
\begin{align*}
e-E(e)&=(I-H)Y(Y-X\beta )=(I-H)\varepsilon \\
E(\varepsilon \varepsilon ‘)&=V(\varepsilon )=I\sigma ^{2} \,\,\text{and} \,\, E(\varepsilon )=0\\
(I-H)’&=(I-H’)=(I-H)\\
V(e) & =  E\left[e-E(e_{i} )\right]\left[e-E(e_{i} )\right]^{{‘} } \\
& = (I-H)E(\varepsilon \varepsilon ‘)(I-H)’ \\
& = (I-H)I\sigma ^{2} (I-H)’ \\
& =(I-H)(I-H)I\sigma ^{2} =(I-H)\sigma ^{2}
\end{align*}
$V(e_i)$ is given by the ith diagonal element $1-h_{ii}$ and $Cov(e_i, e_j)$ is given by the $(i, j)$th  element of $-h_{ij}$ of the matrix $(I-H)\sigma^2$.
\begin{align*}
\rho _{ij} &=\frac{Cov(e_{i} ,e_{j} )}{\sqrt{V(e_{i} )V(e_{j} )} } \\
&=\frac{-h_{ij} }{\sqrt{(1-h_{ii} )(1-h_{jj} )} }\\
SS(b) & = SS({\rm all\; parameter)=}b’X’Y \\
& = \hat{Y}’Y=Y’H’Y=Y’HY=Y’H^{2} Y=\hat{Y}’\hat{Y}
\end{align*}
The average $V(\hat{Y}_{i} )$ to all data points is
\begin{align*}
\sum _{i=1}^{n}\frac{V(\hat{Y}_{i} )}{n} &=\frac{trace(H\sigma ^{2} )}{n}=\frac{p\sigma ^{2} }{n} \\
\hat{Y}_{i} &=h_{ii} Y_{i} +\sum _{j\ne 1}h_{ij} Y_{j}
\end{align*}

### Role of Hat Matrix in Regression Diagnostic

#### Internally Studentized Residuals

$V(e_i)=(1-h_{ii})\sigma^2$ where $\sigma^2$ is estimated by $s^2$

i.e. $s^{2} =\frac{e’e}{n-p} =\frac{\Sigma e_{i}^{2} }{n-p}$  (RMS)

we can studentized the residual as $s_{i} =\frac{e_{i} }{s\sqrt{(1-h_{ii} )} }$

These studentized residuals are said to be internally studentized because $s$ has within it $e_i$ itself.

#### Extra Sum of Squares attributable to $e_i$

\begin{align*}
e&=(1-H)Y\\
e_{i} &=-h_{i1} Y_{1} -h_{i2} Y_{2} -\cdots +(1-h_{ii} )Y_{i} -h_{in} Y_{n} =c’Y\\
c’&=(-h_{i1} ,-h_{i2} ,\cdots ,(1-h_{ii} )\cdots -h_{in} )\\
c’c&=\sum _{i=1}^{n}h_{i1}^{2}  +(1-2h_{ii} )=(1-h_{ii} )\\
SS(e_{i})&=\frac{e_{i}^{2} }{(1-h_{ii} )}\\
S_{(i)}^{2}&=\frac{(n-p)s^{2} -\frac{e_{i}^{2}}{e_{i}^{2}  (1-h_{ii} )}}{n-p-1}
\end{align*}
provides an estimate of $\sigma^2$ after deletion of the contribution of $e_i$.

#### Externally Studentized Residuals

$t_{i} =\frac{e_{i} }{s(i)\sqrt{(1-h_{ii} )} }$ are externally studentized residuals. Here if $e_i$ is large, it is thrown into emphasis even more by the fact that $s_i$ has excluded it. The $t_i$ follows a $t_{n-p-1}$ distribution under the usual normality of error assumptions.

Read more about the Role of the Hat Matrix in Regression Analysis https://en.wikipedia.org/wiki/Hat_matrix