# Basic Statistics and Data Analysis

## Application of Regression Analysis in Medical Sciences: an Example

Considering the application of regression analysis in medical sciences, Chan et al. (2006) used multiple linear regression to estimate standard liver weight for assessing the adequacy of graft size in live-donor liver transplantation and of the remnant liver in major hepatectomy for cancer. Standard liver weight (SLW) in grams, body weight (BW) in kilograms, gender (male=1, female=0), and other anthropometric data of 159 Chinese liver donors who underwent donor right hepatectomy were analyzed. The formula (fitted model)

$SLW = 218 + 12.3 \times BW + 51 \times gender$

was developed with coefficient of determination $R^2=0.48$.

These results mean that in Chinese people, on average, for each 1-kg increase of BW, SLW increases about 12.3 g, and, on average, men have a 51-g higher SLW than women. Unfortunately, SEs and CIs for the estimated regression coefficients were not reported. By means of formula 6 in their article, the SLW for Chinese liver donors can be estimated if BW and gender are known. About 48% of the variance of SLW ($R^2=0.48$) is explained by BW and gender.
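As a quick illustration (with hypothetical donor values, not data from the paper), the fitted formula can be applied directly:

```python
# Worked example of the fitted formula SLW = 218 + 12.3*BW + 51*gender
# (gender: male = 1, female = 0), as reported by Chan et al. (2006).
# The donor values below are made up for illustration.

def estimate_slw(body_weight_kg, male):
    """Estimate standard liver weight (grams) from body weight and gender."""
    return 218 + 12.3 * body_weight_kg + 51 * male

# A 60-kg male donor:   218 + 12.3*60 + 51 = 1007 g
print(estimate_slw(60, male=1))   # 1007.0
# A 60-kg female donor: 218 + 12.3*60      =  956 g
print(estimate_slw(60, male=0))   # 956.0
```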

#### Reference

• Chan SC, Liu CL, Lo CM, et al. (2006). Estimating liver weight of adults by body weight and gender. World J Gastroenterol 12, 2217–2222.

## Assumptions about Linear Regression Models or Error Term

The linear regression model (LRM) is based on certain statistical assumptions, some of which relate to the distribution of the random error term $\mu_i$, some to the relationship between the error term $\mu_i$ and the explanatory (independent) variables ($X$'s), and some to the independent variables themselves. We can divide the assumptions into two categories:

1. Stochastic Assumption
2. Non-Stochastic Assumptions

These assumptions about linear regression models (or the ordinary least squares (OLS) method) are extremely critical to the interpretation of the regression coefficients.

• The error term ($\mu_i$) is a random real variable, i.e. $\mu_i$ may assume any positive, negative, or zero value by chance. Each value occurs with a certain probability; therefore, the error term is a random variable.
• The mean value of $\mu$ is zero, i.e. $E(\mu_i)=0$: the mean value of $\mu_i$, conditional upon the given $X_i$, is zero. For each value of the variable $X_i$, $\mu$ may take various values, some greater than zero and some smaller than zero. Considering all possible values of $\mu$ for any particular value of $X$, the disturbance term $\mu_i$ has a zero mean value.
• The variance of $\mu_i$ is constant (homoscedasticity), i.e. for each given value of $X$, the variance of $\mu_i$ is the same for all observations: $E(\mu_i^2)=\sigma^2$. The disturbance terms at all values of $X$ show the same dispersion about their mean.
• The variable $\mu_i$ has a normal distribution, i.e. $\mu_i\sim N(0,\sigma_{\mu}^2)$. The values of $\mu$ (for each $X_i$) have a bell-shaped symmetrical distribution.
• The random terms of different observations ($\mu_i,\mu_j$) are independent, i.e. $E(\mu_i \mu_j)=0$ for $i \neq j$: there is no autocorrelation between the disturbances. The random term assumed in one period does not depend on its values in any other period.
• $\mu_i$ and $X_i$ have zero covariance between them, i.e. $\mu$ is independent of the explanatory variable: $E(\mu_i X_i)=0$, i.e. $Cov(\mu_i, X_i)=0$. The disturbance term $\mu$ and explanatory variable $X$ are uncorrelated; the $\mu$'s and $X$'s do not tend to vary together, as their covariance is zero. This assumption is automatically fulfilled if the $X$ variable is non-random (non-stochastic), since the mean of the random term is zero.
• All the explanatory variables are measured without error: the regressors are assumed to be error-free, while $y$ (the dependent variable) may or may not include errors of measurement.
• The number of observations $n$ must be greater than the number of parameters to be estimated; alternatively, the number of observations must be greater than the number of explanatory (independent) variables.
• There should be variability in the $X$ values: the $X$ values in a given sample must not all be the same. Statistically, $Var(X)$ must be a finite positive number.
• The regression model must be correctly specified, meaning that there is no specification bias or error in the model used in empirical analysis.
• There is no perfect (or near-perfect) multicollinearity among two or more explanatory (independent) variables.
• The values taken by the regressor $X$ are considered fixed in repeated sampling, i.e. $X$ is assumed to be non-stochastic. Regression analysis is conditional on the given values of the regressor(s) $X$.
• The linear regression model is linear in the parameters, e.g. $y_i=\beta_1+\beta_2x_i +\mu_i$.
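A minimal sketch of how two of these assumptions show up in practice: for an OLS fit that includes an intercept, the sample residuals have mean zero and zero covariance with the regressor by construction (the data below are simulated, not a real dataset):

```python
import numpy as np

# Simulate data satisfying the classical assumptions, then fit by OLS
# and check the sample analogues of E(mu) = 0 and Cov(mu, X) = 0.
rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=100)
y = 2.0 + 1.5 * x + rng.normal(0, 1, size=100)   # y = b0 + b1*x + error

X = np.column_stack([np.ones_like(x), x])        # design matrix with intercept
b, *_ = np.linalg.lstsq(X, y, rcond=None)        # OLS estimates (b0, b1)
resid = y - X @ b                                # sample residuals

print(np.isclose(resid.mean(), 0.0))             # True: mean residual ~ 0
print(np.isclose(np.cov(resid, x)[0, 1], 0.0))   # True: Cov(resid, x) ~ 0
```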

## Simple Linear Regression Model (SLRM)

A simple linear regression model (SLRM) is based on a single independent (explanatory) variable and fits a straight line such that the sum of squared residuals of the regression model (the vertical distances between the fitted line and the data points) is as small as possible. This model (usually known as a statistical or probabilistic model) can be written as

\begin{align*}
y_i &= \alpha + \beta x_i +\varepsilon_i\\
\text{OR} \quad y_i&=b_0 + b_1 x_i + \varepsilon_i\\
\text{OR} \quad y_i&=\beta_0 + \beta x_i + \varepsilon_i
\end{align*}
where $y$ is the dependent variable and $x$ is the independent variable. In the regression context, $y$ is called the regressand and $x$ the regressor. The epsilon ($\varepsilon$) is unobservable, denoting the random error (or disturbance term) of the regression model. The random error $\varepsilon$ has some specific importance for its inclusion in the regression model:

1. Random error ($\varepsilon$) captures the effect on the dependent variable of all variables not included in the model under study; such omitted variables may or may not be observable.
2. Random error ($\varepsilon$) captures any specification error related to the assumed linear functional form.
3. Random error ($\varepsilon$) captures the effect of unpredictable random component present in the dependent variable.

We can say that $\varepsilon$ represents the variation in the variable $y$ left unexplained by the independent variable $x$ included in the model.

In the above model, $\beta_0$ and $\beta_1$ are the parameters, and our main objective is to obtain estimates of their numerical values, i.e. $\hat{\beta}_0$ and $\hat{\beta}_1$. Here $\beta_0$ is the intercept (regression constant) and $\beta_1$ is the slope (regression coefficient) of the model. The fitted line passes through ($\overline{x}, \overline{y}$), the center of mass of the data points, and the slope equals the correlation between $x$ and $y$ multiplied by the ratio of the standard deviations of these variables ($b_1 = r\, s_y / s_x$). The subscript $i$ denotes the $i$th value of the variable in the model.
$y=\beta_0 + \beta_1 x$
This model is called a mathematical (deterministic) model, as all the variation in $y$ is due solely to changes in $x$ and there are no other factors affecting the dependent variable. If this were true, all the pairs $(x, y)$ would fall on a straight line when plotted on a two-dimensional plane. However, for observed values the plot may or may not be a straight line. A two-dimensional diagram with the points plotted in pairs is called a scatter diagram.
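The facts about the fitted line stated above can be checked numerically. This is a toy sketch with made-up data: the least-squares slope equals $r\, s_y/s_x$, and the fitted line passes through $(\overline{x}, \overline{y})$:

```python
import numpy as np

# Estimate beta0 and beta1 by least squares and verify two facts from
# the text: the slope equals r * (s_y / s_x), and the fitted line
# passes through the point of means (x-bar, y-bar). Data are invented.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

r = np.corrcoef(x, y)[0, 1]
b1 = r * y.std(ddof=1) / x.std(ddof=1)           # slope = r * s_y / s_x
b0 = y.mean() - b1 * x.mean()                    # intercept

# Same answer from the least-squares polynomial fit
B = np.polyfit(x, y, deg=1)                      # [slope, intercept]
print(np.allclose([b1, b0], B))                  # True

# The fitted line passes through (x-bar, y-bar)
print(np.isclose(b0 + b1 * x.mean(), y.mean()))  # True
```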

## Role of Hat matrix in diagnostics of Regression Analysis

The hat matrix is an $n \times n$ symmetric and idempotent matrix with many special properties. It plays an important role in the diagnostics of regression analysis by transforming the vector of observed responses $Y$ into the vector of fitted responses $\hat{Y}$.

The model is $Y=X\beta +\varepsilon$ with solution $b=(X'X)^{-1} X'Y$, provided $X'X$ is non-singular. The fitted values are $\hat{Y}=Xb=X(X'X)^{-1} X'Y=HY$.

Like the fitted values ($\hat{Y}$), the residuals can be expressed as linear combinations of the response variable $Y_i$:

\begin{align*}
e&=Y-\hat{Y}\\
&=Y-HY\\&=(I-H)Y
\end{align*}

• The hat matrix involves only the observations on the predictor variable $X$, as $H=X(X'X)^{-1} X'$. It plays an important role in diagnostics for regression analysis.
• The hat matrix plays an important role in determining the magnitude of a studentized deleted residual and therefore in identifying outlying Y observations.
• The hat matrix is also helpful in directly identifying outlying $X$ observations.
• In particular, the diagonal elements of the hat matrix indicate, in a multi-variable setting, whether or not a case is outlying with respect to its $X$ values.
• The elements of the hat matrix always have values between 0 and 1, and their sum is $p$, i.e.
$0\leq h_{ii} \leq 1$  and  $\sum _{i=1}^{n}h_{ii} =p$
where $p$ is the number of regression parameters, including the intercept term.
• hii is a measure of the distance between the X values for the ith case and the means of the X values for all n cases.

## Mathematical Properties

• $HX=X$
• $(I-H)X=0$
• $HH=H^{2}=H=H^{p}$
• $H(I-H)=0$
• $Cov(\hat{e},\hat{Y})=Cov\left\{(I-H)Y, HY\right\}=\sigma ^{2} (I-H)H=0$
• $I-H$ is also symmetric and idempotent.
• $H\mathbf{1}=\mathbf{1}$ when the model contains an intercept term, i.e. every row of $H$ adds up to 1; also $\mathbf{1}'H=\mathbf{1}'$ and $\mathbf{1}'H\mathbf{1}=n$.
• The elements of $H$ are denoted by $h_{ij}$, i.e.
$H=\begin{pmatrix}{h_{11} } & {h_{12} } & {\cdots } & {h_{1n} } \\ {h_{21} } & {h_{22} } & {\cdots } & {h_{2n} } \\ {\vdots } & {\vdots } & {\ddots } & {\vdots } \\ {h_{n1} } & {h_{n2} } & {\cdots } & {h_{nn} }\end{pmatrix}$
A large value of $h_{ii}$ indicates that the $i$th case is distant from the center of all $n$ cases. The diagonal element $h_{ii}$ in this context is called the leverage of the $i$th case. Since $h_{ii}$ is a function of the $X$ values only, $h_{ii}$ measures the role of the $X$ values in determining how strongly $Y_i$ affects the fitted value $\hat{Y}_{i}$.
The larger $h_{ii}$ is, the smaller the variance of the residual $e_i$; for $h_{ii}=1$, $\sigma^2(e_i)=0$.
• Variance and covariance of $e$:
\begin{align*}
e-E(e)&=(I-H)(Y-X\beta )=(I-H)\varepsilon \\
E(\varepsilon \varepsilon ')&=V(\varepsilon )=I\sigma ^{2} \,\,\text{and} \,\, E(\varepsilon )=0\\
(I-H)'&=(I'-H')=(I-H)\\
V(e) & =  E\left[e-E(e)\right]\left[e-E(e)\right]' \\
& = (I-H)E(\varepsilon \varepsilon ')(I-H)' \\
& = (I-H)I\sigma ^{2} (I-H)' \\
& =(I-H)(I-H)\sigma ^{2} =(I-H)\sigma ^{2}
\end{align*}
$V(e_i)$ is given by the $i$th diagonal element $(1-h_{ii})\sigma^2$, and $Cov(e_{i}, e_{j})$ is given by the $(i,j)$th element, $-h_{ij}\sigma^2$, of the matrix $(I-H)\sigma^2$.
\begin{align*}
\rho _{ij} &=\frac{Cov(e_{i} ,e_{j} )}{\sqrt{V(e_{i} )V(e_{j} )} } \\
&=\frac{-h_{ij} }{\sqrt{(1-h_{ii} )(1-h_{jj} )} }\\
SS(b) & = SS(\text{all parameters})=b'X'Y \\
& = \hat{Y}'Y=Y'H'Y=Y'HY=Y'H^{2} Y=\hat{Y}'\hat{Y}
\end{align*}
The average of $V(\hat{Y}_{i})$ over all data points is
\begin{align*}
\sum _{i=1}^{n}\frac{V(\hat{Y}_{i} )}{n} &=\frac{trace(H\sigma ^{2} )}{n}=\frac{p\sigma ^{2} }{n} \\
\hat{Y}_{i} &=h_{ii} Y_{i} +\sum _{j\ne i}h_{ij} Y_{j}
\end{align*}
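The listed properties are easy to verify numerically. A sketch with a small simulated design matrix (the data are illustrative only):

```python
import numpy as np

# Build the hat matrix H = X (X'X)^{-1} X' for a small design matrix
# and check the properties described in the text.
rng = np.random.default_rng(0)
n, p = 8, 2                                     # n cases, p parameters
X = np.column_stack([np.ones(n), rng.uniform(0, 10, n)])

H = X @ np.linalg.inv(X.T @ X) @ X.T

print(np.allclose(H, H.T))                      # symmetric
print(np.allclose(H @ H, H))                    # idempotent: H^2 = H
print(np.isclose(np.trace(H), p))               # sum of h_ii equals p
h = np.diag(H)
print(bool(np.all((h >= 0) & (h <= 1))))        # 0 <= h_ii <= 1
print(np.allclose(H @ np.ones(n), np.ones(n)))  # H1 = 1 (intercept model)
```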

## Internally Studentized Residuals

$V(e_i)=(1-h_{ii})\sigma^2$, where $\sigma^2$ is estimated by $s^2$,

i.e. $s^{2} =\frac{e'e}{n-p} =\frac{\Sigma e_{i}^{2} }{n-p}$  (the residual mean square, RMS).

We can studentize the residuals as $s_{i} =\frac{e_{i} }{s\sqrt{(1-h_{ii} )} }$.

These studentized residuals are said to be internally studentized because $s$ has $e_i$ itself within it.
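A short sketch of the computation on simulated data, following the formulas above:

```python
import numpy as np

# Internally studentized residuals s_i = e_i / (s * sqrt(1 - h_ii)),
# computed from the hat matrix of a simple regression (simulated data).
rng = np.random.default_rng(2)
n, p = 20, 2
x = rng.uniform(0, 10, n)
y = 1.0 + 0.5 * x + rng.normal(0, 1, n)

X = np.column_stack([np.ones(n), x])
H = X @ np.linalg.inv(X.T @ X) @ X.T
e = (np.eye(n) - H) @ y                         # residuals e = (I - H)Y
s2 = e @ e / (n - p)                            # residual mean square s^2
s_int = e / np.sqrt(s2 * (1 - np.diag(H)))      # internally studentized

print(np.round(s_int[:3], 3))
```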

## Extra Sum of Squares attributable to $e_i$

\begin{align*}
e&=(I-H)Y\\
e_{i} &=-h_{i1} Y_{1} -h_{i2} Y_{2} -\cdots +(1-h_{ii} )Y_{i} -\cdots -h_{in} Y_{n} =c'Y\\
c'&=(-h_{i1} ,-h_{i2} ,\cdots ,(1-h_{ii} ),\cdots ,-h_{in} )\\
c'c&=\sum _{j=1}^{n}h_{ij}^{2}  +(1-2h_{ii} )=(1-h_{ii} )\\
SS(e_{i})&=\frac{e_{i}^{2} }{(1-h_{ii} )}\\
s_{(i)}^{2}&=\frac{(n-p)s^{2} -\frac{e_{i}^{2}}{(1-h_{ii} )}}{n-p-1}
\end{align*}
Here $\sum _{j=1}^{n}h_{ij}^{2} =h_{ii}$ because $H$ is symmetric and idempotent, and $s_{(i)}^{2}$ provides an estimate of $\sigma^2$ after deletion of the contribution of $e_i$.

## Externally Studentized Residuals

$t_{i} =\frac{e_{i} }{s_{(i)} \sqrt{(1-h_{ii} )} }$ are the externally studentized residuals. Here, if $e_i$ is large, it is thrown into emphasis even more by the fact that $s_{(i)}$ has excluded it. The $t_i$ follow a $t_{n-p-1}$ distribution under the usual normality-of-errors assumption.
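A sketch of the deletion formula on simulated data, cross-checked by actually refitting the model without case $i$ (all data are made up):

```python
import numpy as np

# Externally studentized residuals via the deletion formula
# s_(i)^2 = ((n-p)s^2 - e_i^2/(1-h_ii)) / (n-p-1), verified against
# an explicit refit with case i removed.
rng = np.random.default_rng(3)
n, p = 15, 2
x = rng.uniform(0, 10, n)
y = 1.0 + 0.5 * x + rng.normal(0, 1, n)

X = np.column_stack([np.ones(n), x])
H = X @ np.linalg.inv(X.T @ X) @ X.T
h = np.diag(H)
e = y - H @ y
s2 = e @ e / (n - p)

s2_del = ((n - p) * s2 - e**2 / (1 - h)) / (n - p - 1)
t = e / np.sqrt(s2_del * (1 - h))               # externally studentized

# Cross-check s_(i)^2 for case 0 by refitting without it
mask = np.arange(n) != 0
Xd, yd = X[mask], y[mask]
ed = yd - Xd @ np.linalg.lstsq(Xd, yd, rcond=None)[0]
print(np.isclose(s2_del[0], ed @ ed / (n - 1 - p)))  # True
```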


## Why do Correlation Coefficient values lie between +1 and −1?

We know that the ratio of the explained variation to the total variation is called the coefficient of determination. This ratio is non-negative and is therefore denoted by $r^2$; thus

\begin{align*}
r^2&=\frac{\text{Explained Variation}}{\text{Total Variation}}\\
&=\frac{\sum (\hat{Y}-\overline{Y})^2}{\sum (Y-\overline{Y})^2}
\end{align*}

It can be seen that if the total variation is all explained, the ratio $r^2$ (coefficient of determination) is one, and if the total variation is all unexplained, the explained variation is zero and so is the ratio $r^2$.

The square root of the coefficient of determination is called the correlation coefficient, given by

\begin{align*}
r&=\sqrt{ \frac{\text{Explained Variation}}{\text{Total Variation}} }\\
&=\pm \sqrt{\frac{\sum (\hat{Y}-\overline{Y})^2}{\sum (Y-\overline{Y})^2}}
\end{align*}

and, for a least-squares fit,

$\sum (\hat{Y}-\overline{Y})^2=\sum(Y-\overline{Y})^2-\sum (Y-\hat{Y})^2$

therefore

\begin{align*}
r&=\sqrt{ \frac{\sum(Y-\overline{Y})^2-\sum (Y-\hat{Y})^2} {\sum(Y-\overline{Y})^2} }\\
&=\sqrt{1-\frac{\sum (Y-\hat{Y})^2}{\sum(Y-\overline{Y})^2}}\\
&=\sqrt{1-\frac{\text{Unexplained Variation}}{\text{Total Variation}}}=\sqrt{1-\frac{s_{y.x}^2}{s_y^2}}
\end{align*}

where $s_{y.x}^2=\frac{1}{n} \sum (Y-\hat{Y})^2$ and $s_y^2=\frac{1}{n} \sum (Y-\overline{Y})^2$

\begin{align*}
\Rightarrow r^2&=1-\frac{s_{y.x}^2}{s_y^2}\\
\Rightarrow s_{y.x}^2&=s_y^2(1-r^2)
\end{align*}

Since variances are non-negative

$\frac{s_{y.x}^2}{s_y^2}=1-r^2 \geq 0$

Solving the inequality, we have

\begin{align*}
1-r^2 & \geq 0\\
\Rightarrow r^2 \leq 1\, \text{or}\, |r| &\leq 1\\
\Rightarrow & -1 \leq r\leq 1
\end{align*}
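A numerical check of the ratio-of-variations argument on simulated data: the explained-to-total ratio coincides with the squared correlation coefficient, and $|r| \leq 1$:

```python
import numpy as np

# Fit a simple regression and confirm that
# (explained variation) / (total variation) = r^2, with |r| <= 1.
rng = np.random.default_rng(4)
x = rng.uniform(0, 10, 50)
y = 3.0 - 0.8 * x + rng.normal(0, 2, 50)        # simulated data

b1, b0 = np.polyfit(x, y, 1)                    # least-squares fit
yhat = b0 + b1 * x

explained = np.sum((yhat - y.mean())**2)
total = np.sum((y - y.mean())**2)
r = np.corrcoef(x, y)[0, 1]

print(np.isclose(explained / total, r**2))      # True
print(abs(r) <= 1)                              # True
```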

## Alternative Proof

Since $\rho(X,Y)=\rho(X^*,Y^*)$, where $X^*=\frac{X-\mu_X}{\sigma_X}$ and $Y^*=\frac{Y-\mu_Y}{\sigma_Y}$,

and as covariance is bilinear and $X^*$, $Y^*$ have zero mean and variance 1, therefore

\begin{align*}
\rho(X^*,Y^*)&=Cov(X^*,Y^*)=Cov\{\frac{X-\mu_X}{\sigma_X},\frac{Y-\mu_Y}{\sigma_Y}\}\\
&=\frac{Cov(X-\mu_X,Y-\mu_Y)}{\sigma_X\sigma_Y}\\
&=\frac{Cov(X,Y)}{\sigma_X \sigma_Y}=\rho(X,Y)
\end{align*}

We also know that the variance of any random variable is $\geq 0$; it is zero (i.e. $Var(X)=0$) if and only if $X$ is a constant (almost surely). Therefore

$V(X^* \pm Y^*)=V(X^*)+V(Y^*)\pm2Cov(X^*,Y^*)$

As $Var(X^*)=1$ and $Var(Y^*)=1$, the above equation gives $V(X^* \pm Y^*)=2\pm 2Cov(X^*,Y^*)$, which would be negative if $Cov(X^*,Y^*)$ were either greater than 1 or less than $-1$. Since a variance cannot be negative, $-1\leq Cov(X^*,Y^*)\leq 1$, and hence $-1 \leq \rho(X,Y)=\rho(X^*,Y^*)\leq 1$.

If $\rho(X,Y )=Cov(X^*,Y^*)=1$ then $Var(X^*- Y ^*)=0$ making X* =Y* almost surely. Similarly, if $\rho(X,Y )=Cov(X^*,Y^*)=-1$ then X*=−Y* almost surely. In either case, Y would be a linear function of X almost surely.