Role of Hat Matrix in Regression Analysis

The post is about the importance and role of the Hat Matrix in Regression Analysis.

The hat matrix is an $n\times n$ symmetric and idempotent matrix with many special properties that play an important role in regression diagnostics. It transforms the vector of observed responses $Y$ into the vector of fitted responses $\hat{Y}$.

Consider the model $Y=X\beta+\varepsilon$ with least squares solution $b=(X'X)^{-1}X'Y$, provided that $X'X$ is non-singular. The fitted values are ${\hat{Y}=Xb=X(X'X)^{-1} X'Y=HY}$.

Like the fitted values ($\hat{Y}$), the residuals can be expressed as linear combinations of the observed responses $Y_i$:

\begin{align*}
e&=Y-\hat{Y}\\
&=Y-HY\\
&=(I-H)Y
\end{align*}
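These relations can be checked numerically. Below is a minimal numpy sketch with a small simulated dataset (the seed, sample size, and coefficients are illustrative assumptions, not from the original post):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 20
x = rng.uniform(0, 10, n)
X = np.column_stack([np.ones(n), x])   # design matrix with intercept
Y = 3 + 2 * x + rng.normal(0, 1, n)    # simulated response

H = X @ np.linalg.inv(X.T @ X) @ X.T   # hat matrix H = X (X'X)^{-1} X'
Y_hat = H @ Y                          # fitted values: Y_hat = H Y
e = (np.eye(n) - H) @ Y                # residuals: e = (I - H) Y

print(np.allclose(H, H.T), np.allclose(H @ H, H))  # symmetric, idempotent
```

Note that $Y = \hat{Y} + e$ recovers the observed responses exactly, since $H + (I-H) = I$.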

The role of hat matrix in Regression Analysis and Regression Diagnostics is:

  • The hat matrix involves only the observations on the predictor variables $X$, as $H=X(X'X)^{-1}X'$. It plays an important role in diagnostics for regression analysis.
  • The hat matrix plays an important role in determining the magnitude of a studentized deleted residual and identifying outlying Y observations.
  • The hat matrix is also helpful in directly identifying outlying $X$ observations.
  • In particular, the diagonal elements of the hat matrix indicate, in a multivariable setting, whether or not a case is outlying with respect to its $X$ values.
  • The elements of the hat matrix always have values between 0 and 1, and they sum to $p$, i.e., $0 \le h_{ii}\le 1$ and $\sum _{i=1}^{n}h_{ii} =p$,
    where $p$ is the number of regression parameters, including the intercept term.
  • $h_{ii}$ is a measure of the distance between the $X$ values for the ith case and the means of the $X$ values for all $n$ cases.
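The bounds and sum property of the diagonal elements can be verified directly; here is a small numpy sketch with a simulated design matrix (the dimensions and seed are arbitrary assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 30, 3                           # n cases, p parameters (including intercept)
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])

H = X @ np.linalg.inv(X.T @ X) @ X.T
h = np.diag(H)                         # leverages h_ii

# 0 <= h_ii <= 1 and the leverages sum to p = trace(H)
print(h.min(), h.max(), h.sum())
```

With an intercept in the model, each $h_{ii}$ is in fact at least $1/n$, so the minimum printed here is strictly positive.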

Mathematical Properties of Hat Matrix

  • $HX=X$
  • $(I-H)X=0$
  • $HH=H^{2}=H$; more generally, $H^{k}=H$ for any integer $k\ge 1$ (idempotency)
  • $H(I-H)=0$
  • $Cov(\hat{e},\hat{Y})=Cov\left\{HY,(I-H)Y\right\}=\sigma ^{2} H(I-H)=0$
  • $I-H$ is also symmetric and idempotent.
  • $H\mathbf{1}=\mathbf{1}$ when the model contains an intercept term, i.e., every row of $H$ adds up to 1. Likewise $\mathbf{1}'H=\mathbf{1}'$ and $\mathbf{1}'H\mathbf{1}=n$.
  • The elements of $H$ are denoted by $h_{ij}$, i.e.
    \[H=\begin{pmatrix}{h_{11} } & {h_{12} } & {\cdots } & {h_{1n} } \\ {h_{21} } & {h_{22} } & {\cdots } & {h_{2n} } \\ {\vdots } & {\vdots } & {\ddots } & {\vdots } \\ {h_{n1} } & {h_{n2} } & {\cdots } & {h_{nn} }\end{pmatrix}\]
    A large value of $h_{ii}$ indicates that the $i$th case is distant from the center of all $n$ cases. In this context, the diagonal element $h_{ii}$ is called the leverage of the $i$th case. Since $h_{ii}$ is a function of the $X$ values only, it measures the role of the $X$ values in determining how strongly $Y_i$ affects its own fitted value $\hat{Y}_{i}$.
    The larger $h_{ii}$, the smaller the variance of the residual $e_i$; for $h_{ii}=1$, $\sigma^2(e_i)=0$.
  • Variance, Covariance of $e$
    \begin{align*}
    e-E(e)&=(I-H)(Y-X\beta )=(I-H)\varepsilon \\
    E(\varepsilon \varepsilon ')&=V(\varepsilon )=I\sigma ^{2} \,\,\text{and} \,\, E(\varepsilon )=0\\
    (I-H)'&=(I'-H')=(I-H)\\
    V(e) & =  E\left[e-E(e)\right]\left[e-E(e)\right]' \\
    & = (I-H)E(\varepsilon \varepsilon ')(I-H)' \\
    & = (I-H)I\sigma ^{2} (I-H)' \\
    & =(I-H)(I-H)\sigma ^{2} =(I-H)\sigma ^{2}
    \end{align*}
    $V(e_i)$ is given by the $i$th diagonal element of the matrix $(I-H)\sigma^2$, namely $(1-h_{ii})\sigma^2$, and $Cov(e_i, e_j)$ by its $(i, j)$th element, $-h_{ij}\sigma^2$.
    \begin{align*}
    \rho _{ij} &=\frac{Cov(e_{i} ,e_{j} )}{\sqrt{V(e_{i} )V(e_{j} )} } \\
    &=\frac{-h_{ij} }{\sqrt{(1-h_{ii} )(1-h_{jj} )} }\\
    SS(b) & = SS(\text{all parameters})=b'X'Y \\
    & = \hat{Y}'Y=Y'H'Y=Y'HY=Y'H^{2} Y=\hat{Y}'\hat{Y}
    \end{align*}
    The average of $V(\hat{Y}_{i})$ over all $n$ data points is
    \begin{align*}
    \sum _{i=1}^{n}\frac{V(\hat{Y}_{i} )}{n} &=\frac{trace(H\sigma ^{2} )}{n}=\frac{p\sigma ^{2} }{n} \\
    \hat{Y}_{i} &=h_{ii} Y_{i} +\sum _{j\ne i}h_{ij} Y_{j}
    \end{align*}
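All of the algebraic properties listed above are easy to confirm numerically. The following numpy sketch checks them on a simulated simple-regression design (dimensions and seed are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 25, 2
X = np.column_stack([np.ones(n), rng.normal(size=n)])  # intercept + one predictor
H = X @ np.linalg.inv(X.T @ X) @ X.T
I = np.eye(n)
one = np.ones(n)

checks = {
    "HX = X":         np.allclose(H @ X, X),
    "(I-H)X = 0":     np.allclose((I - H) @ X, 0),
    "H(I-H) = 0":     np.allclose(H @ (I - H), 0),
    "I-H idempotent": np.allclose((I - H) @ (I - H), I - H),
    "H1 = 1":         np.allclose(H @ one, one),  # holds because the model has an intercept
    "1'H1 = n":       np.isclose(one @ H @ one, n),
    "trace(H) = p":   np.isclose(np.trace(H), p),
}
print(all(checks.values()))
```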

Role of Hat Matrix in Regression Diagnostic

Internally Studentized Residuals

$V(e_i)=(1-h_{ii})\sigma^2$, where $\sigma^2$ is estimated by $s^2$,

i.e. $s^{2} =\frac{e'e}{n-p} =\frac{\Sigma e_{i}^{2} }{n-p}$ (the residual mean square, RMS).

We can studentize the residuals as $s_{i} =\frac{e_{i} }{s\sqrt{1-h_{ii}}}$.

These studentized residuals are said to be internally studentized because $s$ has within it $e_i$ itself.
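A minimal numpy sketch of internally studentized residuals, using simulated data (the seed, coefficients, and variable names are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(7)
n, p = 40, 2
x = rng.uniform(0, 5, n)
X = np.column_stack([np.ones(n), x])
Y = 1 + 0.5 * x + rng.normal(0, 2, n)

H = X @ np.linalg.inv(X.T @ X) @ X.T
h = np.diag(H)
e = Y - H @ Y                      # residuals e = (I - H) Y

s2 = e @ e / (n - p)               # residual mean square s^2 = e'e / (n - p)
s_int = e / np.sqrt(s2 * (1 - h))  # internally studentized residuals
print(s_int.round(2))
```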

Extra Sum of Squares attributable to $e_i$

\begin{align*}
e&=(I-H)Y\\
e_{i} &=-h_{i1} Y_{1} -h_{i2} Y_{2} -\cdots +(1-h_{ii} )Y_{i} -\cdots -h_{in} Y_{n} =c'Y\\
c'&=(-h_{i1} ,-h_{i2} ,\cdots ,(1-h_{ii} ),\cdots ,-h_{in} )\\
c'c&=\sum _{j=1}^{n}h_{ij}^{2}  +(1-2h_{ii} )=h_{ii} +1-2h_{ii} =1-h_{ii}\\
SS(e_{i})&=\frac{e_{i}^{2} }{1-h_{ii} }\\
s_{(i)}^{2}&=\frac{(n-p)s^{2} -\frac{e_{i}^{2}}{1-h_{ii}}}{n-p-1}
\end{align*}
provides an estimate of $\sigma^2$ after deletion of the contribution of $e_i$.

Externally Studentized Residuals

$t_{i} =\frac{e_{i} }{s_{(i)}\sqrt{1-h_{ii}}}$ are the externally studentized residuals. Here, if $e_i$ is large, it is emphasized even more by the fact that $s_{(i)}$ has excluded it. Under the usual normality assumptions on the errors, $t_i$ follows a $t_{n-p-1}$ distribution.
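The deletion formula for $s_{(i)}^2$ can be cross-checked against a literal leave-one-out refit. A numpy sketch on simulated data (all names and the seed are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 40, 2
x = rng.uniform(0, 5, n)
X = np.column_stack([np.ones(n), x])
Y = 1 + 0.5 * x + rng.normal(0, 2, n)

H = X @ np.linalg.inv(X.T @ X) @ X.T
h = np.diag(H)
e = Y - H @ Y
s2 = e @ e / (n - p)

# Deleted variance estimate s_(i)^2 and externally studentized residuals t_i
s2_del = ((n - p) * s2 - e**2 / (1 - h)) / (n - p - 1)
t = e / np.sqrt(s2_del * (1 - h))

# Cross-check: s_(0)^2 should equal the residual mean square of a refit without case 0
mask = np.arange(n) != 0
b_del = np.linalg.lstsq(X[mask], Y[mask], rcond=None)[0]
e_del = Y[mask] - X[mask] @ b_del
s2_refit = e_del @ e_del / (n - 1 - p)
print(np.isclose(s2_del[0], s2_refit))
```

The refit without case 0 has $n-1$ observations and $p$ parameters, hence $n-p-1$ degrees of freedom, matching the divisor in the deletion formula.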


Read more about the Role of the Hat Matrix in Regression Analysis https://en.wikipedia.org/wiki/Hat_matrix

Read about Regression Diagnostics: https://rfaqs.com

Outliers and Influential Observations

Here we will focus on the difference between the outliers and influential observations.

Outliers

The cases (observations or data points) that do not follow the same model as the rest of the data are called outliers. In regression, cases with large residuals are candidates for outliers. An outlier is thus a data point that diverges from the overall pattern of the sample. Therefore, an outlier can certainly influence the relationship between the variables and may also exert an influence on the slope of the regression line.

An outlier can be created by a shift in the location (mean) or in the scale (variability) of the process. An outlier may be due to recording errors (which may be correctable), or due to the sample not being entirely from the same population. It may also be that the values come from the same population but from a heavy-tailed (non-normal) distribution; that is, outliers may result from incorrect specifications based on wrong distributional assumptions.


Influential Observations

An influential observation is often an outlier in the $x$-direction. Influential observations may arise from:

  1. observations that are unusually large or otherwise deviate in unusually extreme ways from the center of a reference distribution;
  2. observations associated with a unit that has a low selection probability and thus a high probability weight;
  3. observations with a very large weight (relative to the weights of other units in the specified sub-population) due to problems with stratum jumping; sampling of birth units or highly seasonal units; large nonresponse adjustment factors arising from unusually low response rates within a given adjustment cell; unusual calibration-weighting effects; or other factors.

Importance of Outliers and Influential Observations

Outliers and Influential observations are important because:

  • Both outliers and influential observations can potentially mislead the interpretation of the regression model.
  • Outliers might indicate errors in the data or a non-linear relationship that the model cannot capture.
  • Influential observations can make the model seem more accurate than it is, masking underlying issues.

How to Identify Outliers and Influential Observations

Both outliers and influential observations can be identified by using:

  • Visual inspection: Scatterplots can reveal outliers as distant points.
  • Residual plots: Plotting residuals against predicted values or independent variables can show patterns indicative of influential observations.
  • Statistical diagnostics: Measures like Cook’s distance or leverage can quantify the influence of each data point.

By being aware of outliers and influential observations, one can ensure that the regression analysis provides a more reliable picture of the relationship between variables.
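As a sketch of these diagnostics, the numpy code below computes leverages and Cook's distance for a simulated dataset containing one deliberately extreme $X$ value. The data, the seed, and the common $h_{ii} > 2p/n$ rule-of-thumb cutoff are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 30, 2
x = rng.uniform(0, 10, n)
x[0] = 25.0                            # one deliberately extreme X value
X = np.column_stack([np.ones(n), x])
Y = 2 + x + rng.normal(0, 1, n)

H = X @ np.linalg.inv(X.T @ X) @ X.T
h = np.diag(H)                         # leverages
e = Y - H @ Y
s2 = e @ e / (n - p)

# Cook's distance: D_i = e_i^2 h_ii / (p s^2 (1 - h_ii)^2)
D = e**2 * h / (p * s2 * (1 - h) ** 2)

high_leverage = np.where(h > 2 * p / n)[0]   # common rule-of-thumb cutoff
print(high_leverage)
```

Case 0, the extreme $X$ value, stands out on leverage; whether it also has a large Cook's distance depends on its residual as well, which is exactly the distinction between outlying $X$ values and influential observations.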


Error and Residual in Regression


In Statistics and Optimization, Statistical Errors and Residuals are two closely related and easily confused measures of “Deviation of a sample from the mean”.

The term “error” is something of a misnomer: an error is simply the amount by which an observation differs from its expected value. The errors $e$ are unobservable random variables, assumed to have zero mean and uncorrelated elements, each with common variance $\sigma^2$.

A residual, on the other hand, is an observable estimate of the unobservable error. The residuals $\hat{e}$ are computed quantities with mean ${E(\hat{e})=0}$ and variance ${V(\hat{e})=\sigma^2 (I-H)}$.

Like the errors, each of the residuals has zero mean, but each residual may have a different variance. Unlike the errors, the residuals are correlated. The residuals are linear combinations of the errors; if the errors are normally distributed, so are the residuals.


Note that the sum of the residuals is necessarily zero, and thus the residuals are necessarily not independent. The sum of the errors need not be zero; the errors are independent random variables if the individuals are chosen from the population independently.
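This contrast can be demonstrated numerically. A numpy sketch with simulated data (the true coefficients and seed are illustrative assumptions; the errors are known here only because the data are simulated):

```python
import numpy as np

rng = np.random.default_rng(9)
n = 50
x = rng.uniform(0, 1, n)
X = np.column_stack([np.ones(n), x])    # design matrix with intercept
Y = 4 - 3 * x + rng.normal(0, 0.5, n)   # true model: beta0 = 4, beta1 = -3

b = np.linalg.lstsq(X, Y, rcond=None)[0]  # least squares fit
residuals = Y - X @ b                     # observable residuals
errors = Y - (4 - 3 * x)                  # unobservable errors (known only in simulation)

# With an intercept, the residuals sum exactly to zero; the errors need not.
print(residuals.sum(), errors.sum())
```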

The differences between errors and residuals in Regression are:

Sr. No. | Errors | Residuals
--------|--------|----------
1) | Error represents the unobservable difference between an actual value $y$ of the dependent variable and its true population mean. | Residuals represent the observable difference between an actual value $y$ of the dependent variable and its predicted value according to the regression model.
2) | Error is a theoretical concept because the true population mean is usually unknown. | One can calculate residuals because we have the data and the fitted model.
3) | Errors are assumed to be random and independent, with a mean of zero. | Residuals are considered estimates of the errors for each data point.

Residuals are used in various ways to evaluate the regression model, including:

  • Residual plots: The residual plots are used to visualize the residuals versus the independent variable or predicted values.
  • Mean Squared Error (MSE): The MSE measures the average of the squared residuals, i.e., the average squared deviation of the observations from their fitted values.

In essence, understanding errors and residuals helps the researcher gauge how well the regression model captures the underlying relationship between variables, despite the inherent randomness or “noise” in real-world data.

FAQS about Errors and Residuals

  1. What is an Error?
  2. What are residuals in regression?
  3. What is the purpose of residual plots?
  4. What is a mean squared error (MSE)?
  5. Differentiate between error and residual.
  6. Discuss the sum of residuals and the sum of errors.

Inverse Regression Analysis or Calibration (2012)

In most regression problems, we determine the value of $Y$ corresponding to a given value of $X$. The inverse of this problem, estimating the value of $X$ that corresponds to an observed $Y$, is called inverse regression analysis or calibration.

Inverse Regression Analysis

For inverse regression analysis, let the known values be represented by the matrix $X$ and their corresponding responses by the vector $Y$, which together form a simple linear regression model. Suppose there is an unknown value of $X$, say $X_0$, which cannot be measured, but we observe the corresponding value of $Y$, say $Y_0$. Then $X_0$ can be estimated and a confidence interval for $X_0$ can be obtained.

In regression analysis, we want to investigate the relationship between variables. Regression has many applications, which occur in many fields: engineering, economics, the physical and chemical sciences, management, biological sciences, and social sciences. We only consider the simple linear regression model, which is a model with one regressor $X$ that has a linear relationship with a response $Y$. It is not always easy to measure the regressor $X$ or the response $Y$.

Let us consider a typical example of this problem. If $X$ is the concentration of glucose in a certain substance, a spectrophotometric method is used to measure the absorbance, which depends on the concentration $X$. The response $Y$ (the absorbance) is easy to measure with the spectrophotometric method, but the concentration, on the other hand, is not easy to measure. If we have $n$ known concentrations, the corresponding absorbances can be measured.

If there is a linear relation between $Y$ and $X$, then a simple linear regression model can be made with these data. Suppose we have an unknown concentration, that is difficult to measure, but we can measure the absorbance of this concentration. Is it possible to estimate this concentration with the measured absorbance? This is called the calibration problem or inverse regression Analysis.

Suppose, we have a linear model $Y=\beta_0+\beta_1X+e$ and we have an observed value of the response $Y$, but we do not have the corresponding value of $X$. How can we estimate this value of $X$? The two most important methods to estimate $X$ are the classical method and the inverse method.

The classical method of inverse regression analysis is based on the simple linear regression model

$Y=\beta_0+\beta_1X+\varepsilon,$   where $\varepsilon \sim N(0, \, \sigma^2)$

where the parameters $\beta_0$ and $\beta_1$ are estimated by least squares as $\hat{\beta}_0$ and $\hat{\beta}_1$. At least two of the $n$ values of $X$ must be distinct; otherwise, we cannot fit a reliable regression line. For a given value of $X$, say $X_0$ (unknown), a $Y$ value, say $Y_0$ (or a random sample of $k$ values of $Y$), is observed at $X_0$. For inverse regression analysis, the problem is to estimate $X_0$. The classical method uses the $Y_0$ value (or the mean of the $k$ values of $Y_0$) to estimate $X_0$ as $\hat{X}_0=\frac{Y_0-\hat{\beta}_0}{\hat{\beta}_1}$.


The inverse estimator is the simple linear regression of $X$ on $Y$. In this case, we have to fit the model

\[X=a_0+a_1Y+e, \quad \text{where }\, e \sim N(0, \sigma^2)\]

to obtain the estimator. The inverse estimator of $X_0$ is then

\[\hat{X}_0=\hat{a}_0+\hat{a}_1Y_0\]
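Both estimators can be sketched in a few lines of numpy on simulated calibration data. The concentrations, noise level, and the (noiseless, for simplicity) observed $Y_0$ below are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(11)
n = 30
x = rng.uniform(1, 10, n)                  # known X values (e.g. concentrations), simulated
y = 0.5 + 2.0 * x + rng.normal(0, 0.3, n)  # observed responses (e.g. absorbances)

# Classical method: fit Y on X, then invert the fitted line
slope, intercept = np.polyfit(x, y, 1)     # polyfit returns [slope, intercept] for degree 1
y0 = 0.5 + 2.0 * 6.0                       # response at the unknown X0 = 6 (noiseless here)
x0_classical = (y0 - intercept) / slope

# Inverse method: regress X on Y and predict directly
a1, a0 = np.polyfit(y, x, 1)
x0_inverse = a0 + a1 * y0

print(x0_classical, x0_inverse)
```

When the fit is tight (high $R^2$), the two estimates nearly coincide; the inverse estimator shrinks toward $\bar{X}$, which matters more as the noise grows.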

Important Considerations when performing Inverse Regression

  • Inverse regression can be statistically challenging, especially when the errors are mainly in the independent variables (which become the dependent variables in the inverse model).
  • It is not a perfect replacement for traditional regression, and the assumptions underlying the analysis may differ.
  • In some cases, reverse regression, which treats both variables as having errors, might be a more suitable approach.

In summary, inverse regression is a statistical technique that flips the roles of the independent and dependent variables in a regression model.
