Frequently, we measure two or more variables on each individual and wish to describe the nature of the relationship between them (for example, through a simple linear regression model or correlation analysis). Using the regression technique, we estimate the relationship of one variable with another by expressing one as a linear (or more complex) function of the other. We can also predict the values of one variable from the values of the other. The variables involved in regression and correlation analysis are continuous. In this post, we will learn about the simple linear regression model.
We are interested in establishing a significant functional relationship between two (or more) variables. For example, the function $Y=f(X)=a+bX$ (read as “$Y$ is a function of $X$”) establishes a relationship used to predict the values of the variable $Y$ for given values of the variable $X$. In statistics (biostatistics), this function is called a simple linear regression model, or simply the regression equation.
The variable $Y$ is called the dependent (response) variable, and $X$ is called the independent (regressor or explanatory) variable.
In biology, many relationships are appropriate over only a limited range of values of $X$; for example, negative values are meaningless for variables such as age, height, weight, and body temperature.
The method of linear regression is used to estimate the best-fitting straight line describing the relationship between the variables. Linear regression gives the equation of the straight line that best describes how the outcome $Y$ increases or decreases with an increase or decrease in the explanatory variable $X$. The equation of the regression line is
$$Y=\beta_0 + \beta_1 X,$$
where $\beta_0$ is the intercept (value of $Y$ when $X=0$) and $\beta_1$ is the slope of the line. Both $\beta_0$ and $\beta_1$ are the parameters (or regression coefficients) of the linear equation.
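As a minimal illustration (the coefficient values below are made up and not taken from any data set), the regression line can be written as a small Python function; increasing $X$ by one unit changes the predicted $Y$ by exactly the slope $\beta_1$.

```python
# A minimal sketch of the straight-line model Y = b0 + b1*X.
# The coefficient values are arbitrary, for illustration only.
def predict(x, b0=2.0, b1=0.5):
    """Predicted value of Y on the line with intercept b0 and slope b1."""
    return b0 + b1 * x

print(predict(10))                 # 2.0 + 0.5*10 = 7.0
print(predict(11) - predict(10))   # difference equals the slope: 0.5
```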
Estimation of Regression Coefficients in Simple Linear Regression Model
The best-fitting line is derived using the method of \textit{least squares}: finding the values of the parameters $\beta_0$ and $\beta_1$ that minimize the sum of the squared vertical distances of the points from the regression line.
The best-fitting (least-squares) line always passes through the point $(\overline{X}, \overline{Y})$.
The regression line $Y=\beta_0+\beta_1X$ is fitted by the least-squares method. The regression coefficients $\beta_0$ and $\beta_1$ are both chosen to minimize the sum of squares of the vertical deviations of the points about the regression line. Each deviation is the difference between the observed value of $Y$ and the corresponding point on the regression line (the estimated value of $Y$).
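In symbols, the least-squares estimates are the values of $\beta_0$ and $\beta_1$ that minimize the criterion
$$\sum_{i=1}^{n} \left( y_i - \beta_0 - \beta_1 x_i \right)^2,$$
where $(x_i, y_i)$, $i=1,\dots,n$, are the observed pairs.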
The following table shows the \textit{body weight} and \textit{plasma volume} of eight healthy men.
Subject | Body Weight (kg) | Plasma Volume (liters) |
---|---|---|
1 | 58.0 | 2.75 |
2 | 70.0 | 2.86 |
3 | 74.0 | 3.37 |
4 | 63.5 | 2.76 |
5 | 62.0 | 2.62 |
6 | 70.5 | 3.49 |
7 | 71.0 | 3.05 |
8 | 66.0 | 3.12 |
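To make the least-squares criterion concrete, the short Python sketch below (illustrative only, not part of the original analysis) evaluates the sum of squared vertical deviations for the data above under two candidate lines; the coefficients of the best-fitting line, reported later in this post, give the smaller value.

```python
# Sum of squared vertical deviations of the observed points from a
# candidate line Y = b0 + b1*X (the quantity least squares minimizes).
def sum_of_squared_deviations(x, y, b0, b1):
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))

x = [58.0, 70.0, 74.0, 63.5, 62.0, 70.5, 71.0, 66.0]  # body weight (kg)
y = [2.75, 2.86, 3.37, 2.76, 2.62, 3.49, 3.05, 3.12]  # plasma volume (liters)

# Least-squares coefficients (computed below) versus an arbitrary alternative line.
print(sum_of_squared_deviations(x, y, 0.0857, 0.0436))  # approx 0.29
print(sum_of_squared_deviations(x, y, 0.10, 0.05))      # approx 1.85, a worse fit
```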
The parameters $\beta_0$ and $\beta_1$ of the simple linear regression model are estimated using the following formulas:
\begin{align}
\beta_1 &= \frac{n\sum\limits_{i=1}^{n} x_iy_i - \sum\limits_{i=1}^{n} x_i \sum\limits_{i=1}^{n} y_i}{n \sum\limits_{i=1}^{n} x_i^2 - \left(\sum\limits_{i=1}^{n} x_i \right)^2}\\
\beta_0 &= \overline{Y} - \beta_1 \overline{X}
\end{align}
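As a quick check of these formulas, here is a short Python sketch (an illustration, assuming the eight observations from the table above) that computes the estimates directly from the sums.

```python
# Least-squares estimates of b0 and b1 from the formulas above,
# applied to the body-weight and plasma-volume data in the table.
x = [58.0, 70.0, 74.0, 63.5, 62.0, 70.5, 71.0, 66.0]  # body weight (kg)
y = [2.75, 2.86, 3.37, 2.76, 2.62, 3.49, 3.05, 3.12]  # plasma volume (liters)

n = len(x)
sum_x, sum_y = sum(x), sum(y)
sum_xy = sum(xi * yi for xi, yi in zip(x, y))
sum_x2 = sum(xi ** 2 for xi in x)

b1 = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
b0 = sum_y / n - b1 * sum_x / n

print(round(b0, 4), round(b1, 4))  # 0.0857 0.0436
```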
Regression coefficients are sometimes known as “beta-coefficients”. When the slope $\beta_1=0$, there is no linear relationship between the variables $X$ and $Y$. For the data above, the best-fitting straight line describing the relationship of plasma volume with body weight is
$$\text{Plasma Volume} = 0.0857 + 0.0436 \times \text{Weight}$$
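As an illustration of using this fitted equation, the predicted plasma volume for a man weighing 70 kg (a hypothetical value, chosen only for illustration) is
$$\text{Plasma Volume} = 0.0857 + 0.0436 \times 70 \approx 3.14 \text{ liters}.$$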
Note that the calculated values of $\beta_0$ and $\beta_1$ are estimates of the population values and are therefore subject to sampling variation.