# Basic Statistics and Data Analysis

## Assumptions about Linear Regression Models or Error Term

The linear regression model (LRM) is based on certain statistical assumptions. Some concern the distribution of the random variable (error term) $\mu_i$, some concern the relationship between the error term $\mu_i$ and the explanatory variables (independent variables, X’s), and some concern the independent variables themselves. We can divide the assumptions into two categories:

1. Stochastic Assumptions
2. Non-Stochastic Assumptions

These assumptions about linear regression models (or the ordinary least squares method, OLS) are critical to the interpretation of the regression coefficients.

• The error term ($\mu_i$) is a random real number, i.e. $\mu_i$ may assume any positive, negative or zero value by chance. Each value has a certain probability, therefore the error term is a random variable.
• The mean value of $\mu_i$ is zero, i.e. $E(\mu_i \mid X_i)=0$: the mean of $\mu_i$, conditional on the given $X_i$, is zero. For each value of $X_i$, $\mu_i$ may take various values, some greater than zero and some smaller than zero. Considering all possible values of $\mu_i$ for any particular value of $X$, the disturbance term has zero mean.
• The variance of $\mu_i$ is constant (homoscedasticity), i.e. for the given value of X, the variance of $\mu_i$ is the same for all observations: $E(\mu_i^2)=\sigma^2$. At all values of X, the disturbances show the same dispersion about their mean.
• The variable $\mu_i$ has a normal distribution, i.e. $\mu_i\sim N(0,\sigma_{\mu}^2)$. The values of $\mu$ (for each $X_i$) have a bell-shaped, symmetrical distribution.
• The random terms of different observations ($\mu_i,\mu_j$) are independent, i.e. $E(\mu_i\mu_j)=0$ for $i\ne j$: there is no autocorrelation between the disturbances. It means that the random term in one period does not depend on its values in any other period.
• $\mu_i$ and $X_i$ have zero covariance between them, i.e. $Cov(\mu_i, X_i)=0$, which is equivalent to $E(\mu_i X_i)=0$ when $E(\mu_i)=0$. The disturbance term $\mu$ and the explanatory variable X are uncorrelated: the $\mu$’s and $X$’s do not tend to vary together. This assumption is automatically fulfilled if the X variable is non-random (non-stochastic) and the mean of the random term is zero.
• All the explanatory variables are measured without error. We assume that the regressors are error free, while y (the dependent variable) may or may not include errors of measurement.
• The number of observations n must be greater than the number of parameters to be estimated, or alternatively the number of observations must be greater than the number of explanatory (independent) variables.
• There should be variability in the X values. That is, the X values in a given sample must not all be the same. Statistically, $Var(X)$ must be a finite positive number.
• The regression model must be correctly specified, meaning that there is no specification bias or error in the model used in empirical analysis.
• There is no perfect or near-perfect multicollinearity (collinearity) among two or more explanatory (independent) variables.
• Values taken by the regressors X are considered to be fixed in repeated sampling, i.e. X is assumed to be non-stochastic. Regression analysis is conditional on the given values of the regressor(s) X.
• Linear regression model is linear in the parameters, e.g. $y_i=\beta_1+\beta_2x_i +\mu_i$
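
Some of the assumptions above can be illustrated with a small simulation. The following is only a sketch in plain Python: the values of `SIGMA` and `N`, the random seed and the pattern of X values are illustrative assumptions. The disturbances are generated so that the assumptions hold by construction, and the sample quantities are then compared with their theoretical counterparts.

```python
import random

# Hypothetical settings chosen only for illustration.
SIGMA = 2.0   # assumed standard deviation of the disturbance
N = 10_000    # number of simulated observations

rng = random.Random(0)                          # fixed seed for repeatability
x = [float(i % 10) for i in range(N)]           # fixed (non-stochastic) X values
mu = [rng.gauss(0.0, SIGMA) for _ in range(N)]  # disturbances drawn independently of x


def mean(values):
    return sum(values) / len(values)


def variance(values):
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)


def covariance(a, b):
    ma, mb = mean(a), mean(b)
    return sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b)) / (len(a) - 1)


mean_mu = mean(mu)                       # zero mean: close to 0
cov_mu_x = covariance(mu, x)             # zero covariance with X: close to 0
lag1_cov = covariance(mu[:-1], mu[1:])   # no autocorrelation: close to 0
# Constant variance: the spread is about SIGMA**2 for low and high X alike.
var_low = variance([m for m, xi in zip(mu, x) if xi < 5])
var_high = variance([m for m, xi in zip(mu, x) if xi >= 5])
var_x = variance(x)                      # variability in X: positive and finite

print(f"mean of mu        = {mean_mu:.4f}")
print(f"cov(mu, x)        = {cov_mu_x:.4f}")
print(f"lag-1 covariance  = {lag1_cov:.4f}")
print(f"var(mu | low x)   = {var_low:.4f}")
print(f"var(mu | high x)  = {var_high:.4f}")
print(f"var(x)            = {var_x:.4f}")
```

In real data the disturbances are unobservable, so checks of this kind are usually made on the residuals of a fitted model instead.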

## Simple Linear Regression Model (SLRM)

A simple linear regression model (SLRM) is based on a single independent (explanatory) variable. It fits a straight line such that the sum of squared residuals of the regression model (the squared vertical distances between the fitted line and the data points) is as small as possible. This model (usually known as a statistical or probabilistic model) can be written as

\begin{align*}
y_i &= \alpha + \beta x_i +\varepsilon_i\\
\text{OR} \quad y_i&=b_0 + b_1 x_i + \varepsilon_i\\
\text{OR} \quad y_i&=\beta_0 + \beta_1 x_i + \varepsilon_i
\end{align*}
where y is the dependent variable and x is the independent variable. In the regression context, y is called the regressand and x is called the regressor. The epsilon ($\varepsilon$) is unobservable and denotes the random error or disturbance term of the regression model. The random error $\varepsilon$ is included in the regression model for several reasons:

1. Random error ($\varepsilon$) captures the effect on the dependent variable of all variables that are not included in the model under study; such omitted variables may or may not be observable.
2. Random error ($\varepsilon$) captures any specification error related to assumed linear-functional form.
3. Random error ($\varepsilon$) captures the effect of unpredictable random component present in the dependent variable.

We can say that $\varepsilon$ is the variation in variable y not explained (unexplained) by the independent variable x included in the model.

In the above model, $\beta_0$ and $\beta_1$ are the parameters, and our main objective is to obtain estimates of their numerical values, i.e. $\hat{\beta}_0$ and $\hat{\beta}_1$. Here $\beta_0$ is the intercept (regression constant) and $\beta_1$ is the slope or regression coefficient of the model. The fitted line passes through ($\overline{x}, \overline{y}$), i.e. the center of mass of the data points, and the estimated slope is the correlation between x and y scaled by the ratio of the standard deviations of these variables, $\hat{\beta}_1 = r\, s_y/s_x$. The subscript i denotes the ith value of the variable in the model.
Without the random error term, the model reduces to
$y=\beta_0 + \beta_1 x$.
This model is called a mathematical (deterministic) model, as all the variation in y is due solely to changes in x and there are no other factors affecting the dependent variable. If this is true, then all the pairs (x, y) will fall on a straight line when plotted on a two-dimensional plane. However, for observed values the plot may or may not be a straight line. A two-dimensional diagram with points plotted in pairs is called a scatter diagram.
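
The least squares fit described in this section can be sketched in a few lines of plain Python. The data values below are made up purely for illustration. The sketch computes the estimated slope and intercept from the sums of squares, and then checks numerically that the slope equals the correlation times the ratio of standard deviations and that the fitted line passes through ($\overline{x}, \overline{y}$).

```python
import math

# Made-up data, used only to illustrate the formulas.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n

# Sums of squares and cross-products about the means.
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_yy = sum((yi - y_bar) ** 2 for yi in y)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

# Least squares estimates of the slope and the intercept.
beta1_hat = s_xy / s_xx
beta0_hat = y_bar - beta1_hat * x_bar

# Residuals: the part of y not explained by the fitted line.
residuals = [yi - (beta0_hat + beta1_hat * xi) for xi, yi in zip(x, y)]
sse = sum(e ** 2 for e in residuals)

# Slope written as the correlation times the ratio of standard deviations.
r = s_xy / math.sqrt(s_xx * s_yy)
s_x = math.sqrt(s_xx / (n - 1))
s_y = math.sqrt(s_yy / (n - 1))
slope_from_r = r * s_y / s_x

# Value of the fitted line at x_bar; it should equal y_bar.
fitted_at_mean = beta0_hat + beta1_hat * x_bar

print(f"intercept = {beta0_hat:.4f}, slope = {beta1_hat:.4f}")
print(f"sum of squared residuals = {sse:.4f}")
print(f"r * s_y / s_x = {slope_from_r:.4f}")
print(f"fitted value at x_bar = {fitted_at_mean:.4f}, y_bar = {y_bar:.4f}")
```

If all the points were exactly on a straight line, as in the mathematical model, every residual and the sum of squared residuals would be zero.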

# Inverse Regression Analysis

In most regression problems we have to determine the value of Y corresponding to a given value of X. We will consider the inverse problem, which is called inverse regression or calibration.

Assume we have known values of X and their corresponding Y values, which together follow a simple linear regression model. We also have an unknown value of X, say $X_0$, which cannot be measured, but we can observe the corresponding value of Y, say $Y_0$. Then $X_0$ can be estimated and a confidence interval for $X_0$ can be obtained.

In regression analysis we want to investigate the relationship between variables. Regression has many applications, which occur in many fields: engineering, economics, the physical and chemical sciences, management, biological sciences and social sciences. We only consider the simple linear regression model, which is a model with one regressor X that has a linear relationship with a response Y. It is not always easy to measure the regressor X or the response Y.

We now consider a typical example for this problem. If X is the concentration of glucose in certain substances, then a spectrophotometric method is used to measure the absorbance. This absorbance depends on the concentration X. The response Y is easy to measure with the spectrophotometric method, but the concentration, on the other hand, is not easy to measure. If we have n known concentrations, then the absorbance can be measured for each of them. If there is a linear relation between Y and X, then a simple linear regression model can be fitted to these data. Suppose we have an unknown concentration, which is difficult to measure, but we can measure the absorbance of this concentration. Is it possible to estimate this concentration from the measured absorbance? This is called the calibration problem.

Suppose we have a linear model $Y = \beta_0 + \beta_1 X + \varepsilon$ and we have an observed value of the response Y, but we do not have the corresponding value of X. How can we estimate this value of X? The two most important methods to estimate X are the classical method and the inverse method.

The classical method is based on the simple linear regression model

$Y = \beta_0 + \beta_1 X + \varepsilon$, with $\varepsilon \sim N(0,\sigma^2)$,

where the parameters $\beta_0$ and $\beta_1$ are estimated by least squares as $\hat{\beta}_0$ and $\hat{\beta}_1$. At least two of the n values of X have to be distinct, otherwise we cannot fit a regression line. For a given value of X, say $X_0$ (unknown), a Y value, say $Y_0$ (or a random sample of k values of Y), is observed at $X_0$. The problem is to estimate $X_0$. The classical method uses $Y_0$ (or the mean of the k observed values of Y) to estimate $X_0$ by $\hat{X}_0=\frac{Y_0-\hat{\beta}_0}{\hat{\beta}_1}$.
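
A minimal sketch of the classical method in plain Python follows. The calibration data and the new absorbance readings are made up for illustration, and the helper names `fit_line` and `classical_estimate` are hypothetical. The sketch regresses Y on X and then solves the fitted line for X at the observed response (or at the mean of k readings).

```python
# Made-up calibration data: known concentrations (x) and measured absorbances (y).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [0.11, 0.19, 0.32, 0.41, 0.49]


def fit_line(u, v):
    """Least squares fit of v on u; returns (intercept, slope)."""
    u_bar = sum(u) / len(u)
    v_bar = sum(v) / len(v)
    s_uu = sum((ui - u_bar) ** 2 for ui in u)
    if s_uu == 0:
        raise ValueError("at least two values of the regressor must be distinct")
    s_uv = sum((ui - u_bar) * (vi - v_bar) for ui, vi in zip(u, v))
    slope = s_uv / s_uu
    return v_bar - slope * u_bar, slope


def classical_estimate(x, y, y0_readings):
    """Classical calibration estimate of the unknown X from readings of Y."""
    beta0_hat, beta1_hat = fit_line(x, y)           # regress Y on X
    y0_mean = sum(y0_readings) / len(y0_readings)   # mean of the k readings
    return (y0_mean - beta0_hat) / beta1_hat        # solve the fitted line for X


x0_single = classical_estimate(x, y, [0.35])                # one reading
x0_repeated = classical_estimate(x, y, [0.34, 0.35, 0.36])  # k = 3 readings
print(f"estimate from one reading    = {x0_single:.4f}")
print(f"estimate from three readings = {x0_repeated:.4f}")
```

The division by the estimated slope means the estimate becomes unstable when the slope is close to zero, which is one reason a confidence interval is usually reported alongside the point estimate.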

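The inverse estimator is based on the simple linear regression of X on Y. In this case, we have to fit the model

$X=\alpha_0+ \alpha_1 Y+e$, with $e \sim N(0,\sigma^2)$,

to obtain the estimator. Then the inverse estimator of $X_0$ is

$\hat{X}_0=\hat{\alpha}_0+ \hat{\alpha}_1 Y_0$

where $\hat{\alpha}_0$ and $\hat{\alpha}_1$ are the least squares estimates of $\alpha_0$ and $\alpha_1$.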
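
The inverse method can be sketched in the same way. The data and the new reading `y0` below are made up for illustration, and `fit_line` is a hypothetical helper. The sketch regresses X on Y and plugs the observed response into the fitted line, then computes the classical estimate from the same data for comparison.

```python
# Made-up calibration data: known concentrations (x) and measured absorbances (y).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [0.11, 0.19, 0.32, 0.41, 0.49]
y0 = 0.35  # hypothetical new absorbance reading with unknown concentration


def fit_line(u, v):
    """Least squares fit of v on u; returns (intercept, slope)."""
    u_bar = sum(u) / len(u)
    v_bar = sum(v) / len(v)
    s_uu = sum((ui - u_bar) ** 2 for ui in u)
    s_uv = sum((ui - u_bar) * (vi - v_bar) for ui, vi in zip(u, v))
    slope = s_uv / s_uu
    return v_bar - slope * u_bar, slope


# Inverse method: regress X on Y, then evaluate the fitted line at y0.
alpha0_hat, alpha1_hat = fit_line(y, x)
x0_inverse = alpha0_hat + alpha1_hat * y0

# Classical method, for comparison: regress Y on X, then solve for X.
beta0_hat, beta1_hat = fit_line(x, y)
x0_classical = (y0 - beta0_hat) / beta1_hat

x_bar = sum(x) / len(x)
print(f"inverse estimate   = {x0_inverse:.4f}")
print(f"classical estimate = {x0_classical:.4f}")
print(f"mean of known x    = {x_bar:.4f}")
```

With least squares fits, $\hat{\alpha}_1\hat{\beta}_1 = r^2 \le 1$, so the inverse estimate lies no farther from $\overline{x}$ than the classical estimate. The two methods agree closely when the correlation between X and Y is strong.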

## How is the regression coefficient interpreted in simple regression?

The basic or unstandardized regression coefficient is interpreted as the predicted change in Y (i.e., the DV) given a one-unit change in X (i.e., the IV). It is expressed in units of the DV per unit of the IV.

• Note that there is another form of the regression coefficient that is important: the standardized regression coefficient. In simple regression, the standardized coefficient varies from -1.00 to +1.00, just like a simple correlation coefficient.
• If the regression coefficient is in standardized units, then in simple regression the regression coefficient is the same thing as the correlation coefficient.
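
The relationship between the two forms of the coefficient can be checked numerically. The following sketch uses made-up data and hypothetical helper names (`standardize`, `slope`). It fits the slope once on the raw variables and once on their z-scores, and compares the second slope with the correlation coefficient.

```python
import math

# Made-up data, used only for illustration.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]


def standardize(values):
    """Convert values to z-scores (mean 0, sample standard deviation 1)."""
    m = sum(values) / len(values)
    s = math.sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))
    return [(v - m) / s for v in values]


def slope(u, v):
    """Least squares slope of v regressed on u."""
    u_bar = sum(u) / len(u)
    v_bar = sum(v) / len(v)
    s_uu = sum((ui - u_bar) ** 2 for ui in u)
    s_uv = sum((ui - u_bar) * (vi - v_bar) for ui, vi in zip(u, v))
    return s_uv / s_uu


def correlation(u, v):
    """Pearson correlation coefficient between u and v."""
    u_bar = sum(u) / len(u)
    v_bar = sum(v) / len(v)
    s_uu = sum((ui - u_bar) ** 2 for ui in u)
    s_vv = sum((vi - v_bar) ** 2 for vi in v)
    s_uv = sum((ui - u_bar) * (vi - v_bar) for ui, vi in zip(u, v))
    return s_uv / math.sqrt(s_uu * s_vv)


b_unstandardized = slope(x, y)                          # change in y per unit of x
b_standardized = slope(standardize(x), standardize(y))  # change in SDs of y per SD of x
r = correlation(x, y)

print(f"unstandardized slope = {b_unstandardized:.4f}")
print(f"standardized slope   = {b_standardized:.4f}")
print(f"correlation          = {r:.4f}")
```

The equality between the standardized slope and the correlation holds for simple regression with one regressor; with several regressors the standardized coefficients are in general not correlations.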