# Basic Statistics and Data Analysis

## Simple Linear Regression Model (SLRM)

A simple linear regression model (SLRM) is based on a single independent (explanatory) variable. It fits a straight line such that the sum of squared residuals of the regression model (the squared vertical distances between the fitted line and the points of the data set) is as small as possible. This model (usually known as a statistical or probabilistic model) can be written as

\begin{align*}
y_i &= \alpha + \beta x_i +\varepsilon_i\\
\text{OR} \quad y_i&=b_0 + b_1 x_i + \varepsilon_i\\
\text{OR} \quad y_i&=\beta_0 + \beta_1 x_i + \varepsilon_i
\end{align*}
where y is the dependent variable and x is the independent variable. In the regression context, y is called the regressand and x is called the regressor. The epsilon ($\varepsilon$) is unobservable and denotes the random error or disturbance term of the regression model. The random error $\varepsilon$ is included in the regression model for several reasons:

1. Random error ($\varepsilon$) captures the effect on the dependent variable of all variables that are not included in the model under study; such omitted variables may or may not be observable.
2. Random error ($\varepsilon$) captures any specification error related to the assumed linear functional form.
3. Random error ($\varepsilon$) captures the effect of the unpredictable random component present in the dependent variable.

We can say that $\varepsilon$ is the variation in the variable y that is not explained by the independent variable x included in the model.

In the above model, $\beta_0$ and $\beta_1$ are the parameters of the model, and our main objective is to obtain estimates of their numerical values, i.e. $\hat{\beta_0}$ and $\hat{\beta_1}$. Here $\beta_0$ is the intercept (regression constant) and $\beta_1$ is the slope or regression coefficient of the model. The fitted least-squares line passes through ($\overline{x}, \overline{y}$), i.e. the center of mass of the data points, and the estimated slope is the correlation between x and y multiplied by the ratio of their standard deviations, $\hat{\beta_1} = r \, s_y / s_x$. The subscript i denotes the ith value of the variable in the model.
In contrast, the model
$y=\beta_0 + \beta_1 x$
is called a mathematical (deterministic) model, because all the variation in y is due solely to changes in x and no other factors affect the dependent variable. If this is true, then all the pairs (x, y) fall on a straight line when plotted on a two-dimensional plane. For observed values, however, the plot may or may not be a straight line. A two-dimensional diagram with the points plotted in pairs is called a scatter diagram.
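The least-squares fit described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper name `fit_slr` and the data values are made up for the example, and NumPy is assumed to be available. The sketch computes the estimates from the usual closed-form expressions and then checks two properties mentioned in the text, namely that the slope equals the correlation times the ratio of standard deviations and that the fitted line passes through the point of means.

```python
import numpy as np

def fit_slr(x, y):
    """Least-squares estimates (b0, b1) for y = b0 + b1 * x + error.

    Hypothetical helper written for this illustration.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_bar, y_bar = x.mean(), y.mean()
    sxy = np.sum((x - x_bar) * (y - y_bar))  # sum of cross-products
    sxx = np.sum((x - x_bar) ** 2)           # sum of squares of x
    b1 = sxy / sxx                           # slope estimate
    b0 = y_bar - b1 * x_bar                  # intercept estimate
    return b0, b1

# Made-up data, for illustration only
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

b0, b1 = fit_slr(x, y)
residuals = y - (b0 + b1 * x)  # estimated random errors

# Slope expressed as correlation times ratio of standard deviations
r = np.corrcoef(x, y)[0, 1]
slope_from_r = r * y.std(ddof=1) / x.std(ddof=1)

print("intercept:", b0, "slope:", b1)
print("slope from r * s_y / s_x:", slope_from_r)
print("fitted value at x_bar:", b0 + b1 * x.mean(), "y_bar:", y.mean())
```

With an intercept in the model, the residuals also sum to zero (up to rounding), which is another quick sanity check on a fit.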

# Inverse Regression Analysis

In most regression problems we have to determine the value of Y corresponding to a given value of X. We will consider the inverse problem, which is called inverse regression or calibration.

Assume we have known values of X and their corresponding Y values, which together follow a simple linear regression model. We also have an unknown value of X, say X0, which cannot be measured directly, but we can observe the corresponding value of Y, say Y0. Then X0 can be estimated and a confidence interval for X0 can be obtained.

In regression analysis we want to investigate the relationship between variables. Regression has many applications, which occur in many fields: engineering, economics, the physical and chemical sciences, management, biological sciences and social sciences. We only consider the simple linear regression model, which is a model with one regressor X that has a linear relationship with a response Y. It is not always easy to measure the regressor X or the response Y.

We now consider a typical example of this problem. Let X be the concentration of glucose in certain substances, and suppose a spectrophotometric method is used to measure the absorbance Y, which depends on the concentration X. The response Y is easy to measure with the spectrophotometric method, but the concentration is not. If we have n known concentrations, then the absorbance can be measured for each of them. If there is a linear relation between X and Y, then a simple linear regression model can be fitted to these data. Suppose we have an unknown concentration, which is difficult to measure, but we can measure the absorbance at this concentration. Is it possible to estimate this concentration from the measured absorbance? This is called the calibration problem.

Suppose we have a linear model Y = β0 + β1 X + ε and we have an observed value of the response Y, but we do not have the corresponding value of X. How can we estimate this value of X? The two most important methods to estimate X are the classical method and the inverse method.

The classical method is based on the simple linear regression model

$Y = \beta_0 + \beta_1 X + \varepsilon \quad \text{where } \varepsilon \sim N(0, \sigma^2)$

where the parameters $\beta_0$ and $\beta_1$ are estimated by least squares as $\hat{\beta_0}$ and $\hat{\beta_1}$. At least two of the n values of X have to be distinct, otherwise the regression line cannot be fitted. For a given unknown value of X, say X0, a Y value, say Y0 (or a random sample of k values of Y), is observed at X0. The problem is to estimate X0. The classical method uses the observed Y0 (or the mean of the k observed values) to estimate X0 by $\hat{X_0}=\frac{Y_0-\hat{\beta_0}} {\hat{\beta_1}}$.

The inverse estimator is based on the simple linear regression of X on Y. In this case, we have to fit the model

$X=\alpha_0+ \alpha_1 Y+e \quad \text{where } e \sim N(0, \sigma^2)$

to obtain the estimator. Then the inverse estimator of X0 is

$\hat{X_0}=\hat{\alpha_0}+ \hat{\alpha_1} Y_0$
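The two estimators can be compared side by side with a short Python sketch. Everything specific here is an assumption made for illustration: the function names `ols`, `classical_estimate` and `inverse_estimate` are hypothetical, the calibration data are made up, and only point estimates are computed (no confidence interval).

```python
import numpy as np

def ols(u, v):
    """Least-squares (intercept, slope) for regressing v on u."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    slope = np.sum((u - u.mean()) * (v - v.mean())) / np.sum((u - u.mean()) ** 2)
    intercept = v.mean() - slope * u.mean()
    return intercept, slope

def classical_estimate(x, y, y0):
    """Classical method: fit Y on X, then solve the fitted line for X0."""
    b0, b1 = ols(x, y)
    return (np.mean(y0) - b0) / b1  # np.mean handles one or k observed values

def inverse_estimate(x, y, y0):
    """Inverse method: fit X on Y, then predict X0 directly from Y0."""
    a0, a1 = ols(y, x)
    return a0 + a1 * np.mean(y0)

# Made-up calibration data: known concentrations and measured absorbances
conc = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
absorb = np.array([0.11, 0.19, 0.32, 0.39, 0.51])

y0 = 0.30  # absorbance observed at the unknown concentration
x0_classical = classical_estimate(conc, absorb, y0)
x0_inverse = inverse_estimate(conc, absorb, y0)

print("classical estimate of X0:", x0_classical)
print("inverse estimate of X0:  ", x0_inverse)
```

When the calibration points lie exactly on a line, the two estimates coincide. Otherwise the inverse estimate lies closer to the mean of the known X values: its deviation from that mean is the classical deviation multiplied by the squared correlation between X and Y.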

## How is the regression coefficient interpreted in multiple regression?

In multiple regression, the unstandardized regression coefficient is interpreted as the predicted change in Y (i.e., the DV) for a one-unit change in X (i.e., the IV), while controlling for (holding constant) the other independent variables included in the equation.

• The regression coefficient in multiple regression is called the partial regression coefficient because the effects of the other independent variables have been statistically removed or taken out (“partialled out”) of the relationship.
• If the standardized partial regression coefficients are being used, they can be compared as an indicator of the relative importance of the independent variables (i.e., the coefficient with the largest absolute value indicates the most important variable, the second largest indicates the second most important, and so on).
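The difference between unstandardized and standardized partial coefficients can be illustrated with a small simulation. This is a sketch under assumptions: the helper names `partial_coefficients` and `standardized_coefficients` are hypothetical, the data are simulated with arbitrarily chosen true coefficients, and the standardized coefficient is computed as the unstandardized slope multiplied by the ratio of the standard deviation of that predictor to the standard deviation of Y.

```python
import numpy as np

def partial_coefficients(X, y):
    """Least-squares (intercept, slopes) for y regressed on the columns of X."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    design = np.column_stack([np.ones(len(y)), X])  # add intercept column
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef[0], coef[1:]

def standardized_coefficients(X, y):
    """Standardized slopes: b_j * s_xj / s_y for each predictor j."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    _, slopes = partial_coefficients(X, y)
    return slopes * X.std(axis=0, ddof=1) / y.std(ddof=1)

# Simulated data (assumption): x2 is measured on a much wider scale than x1
rng = np.random.default_rng(0)
x1 = rng.normal(0.0, 1.0, 200)
x2 = rng.normal(0.0, 10.0, 200)
y = 1.0 + 2.0 * x1 + 0.5 * x2 + rng.normal(0.0, 1.0, 200)
X = np.column_stack([x1, x2])

intercept, slopes = partial_coefficients(X, y)
betas = standardized_coefficients(X, y)

print("unstandardized slopes:", slopes)  # change in y per one unit of each x
print("standardized slopes:  ", betas)   # change in y (in SDs) per one SD of each x
```

In this simulated setup the unstandardized slope of x1 is the larger of the two, yet the standardized coefficient of x2 is larger in absolute value, because x2 varies over a much wider range. This is why the ranking by importance uses the standardized coefficients rather than the raw ones.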