Akaike Information Criteria: A Comprehensive Guide

The Akaike Information Criterion (AIC) is a method used in statistics and machine learning to compare the relative quality of different models for a given dataset. It helps in selecting the best model from a set of candidates by penalizing models that are overly complex; in other words, it is a tool for model selection.

  • A too-simple model leads to a large approximation error.
  • A too-complex model leads to a large estimation error.

AIC is a measure of the goodness of fit of a statistical model. It was developed by Hirotugu Akaike under the name “an information criterion” (AIC) and first published by him in 1974. It is grounded in the concept of information entropy and offers a trade-off between bias and variance in model construction, that is, between the accuracy and the complexity of the model.

The Formula of Akaike Information Criteria

Given a data set, several candidate models can be ranked according to their AIC values. From AIC values one may infer that the top two models are roughly in a tie and the rest far worse.

$$AIC = 2k - 2\ln(L)$$

where $k$ is the number of parameters in the model, and $L$ is the maximized value of the likelihood function for the estimated model.
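As a minimal sketch of how the formula is used, the snippet below computes the AIC from the standard definition $2k - 2\ln(L)$ for two candidate models and picks the one with the smaller value. The model names and log-likelihood values are hypothetical and chosen only for illustration; in practice $\ln(L)$ comes from fitting each model to the same data.

```python
def aic(num_params: int, log_likelihood: float) -> float:
    """Return AIC = 2k - 2 ln(L), where ln(L) is the maximized log-likelihood."""
    return 2 * num_params - 2 * log_likelihood


# Hypothetical candidates fitted to the same dataset (illustrative numbers only).
candidates = {
    "model_A": {"k": 3, "loglik": -120.5},
    "model_B": {"k": 6, "loglik": -118.9},
}

# Compute the AIC of every candidate and select the smallest one.
scores = {name: aic(m["k"], m["loglik"]) for name, m in candidates.items()}
best = min(scores, key=scores.get)

for name, score in scores.items():
    print(f"{name}: AIC = {score:.1f}")
print("Preferred model:", best)
```

Here model_B fits slightly better (higher log-likelihood), but its three extra parameters cost more than the gain in fit, so the simpler model_A ends up with the lower AIC.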


For a set of candidate models for the data, the preferred model is the one that has the minimum AIC value. AIC estimates the relative support for a model, which means that AIC scores by themselves are not very meaningful.

Key properties of the Akaike Information Criterion:

  • Balances fit and complexity: A model that perfectly fits the data might not be the best because it might be memorizing the data instead of capturing the underlying trend. AIC considers both how well a model fits the data (goodness of fit) and how complex it is (number of parameters).
  • A lower score is better: Models having lower AIC scores are preferred as they achieve a good balance between fitting the data and avoiding overfitting.
  • Comparison tool: AIC scores are most meaningful when comparing models for the same dataset. The model with the lowest AIC score is considered the best relative to the other models being evaluated.

Summary

The AIC score is a single number that is used as a model selection criterion. One cannot interpret the AIC score in isolation. However, one can compare the AIC scores of different models fitted to the same data. The model with the lowest AIC is generally considered the best choice.

The AIC is most useful when there are multiple candidate models to choose from. It works well for larger datasets. However, for smaller datasets, the corrected AIC (AICc) should be preferred. AIC is not perfect, and there can be situations where it fails to choose the optimal model.
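The small-sample correction can be sketched as follows. This assumes the commonly used form $AICc = AIC + \frac{2k(k+1)}{n-k-1}$, where $n$ is the number of observations; the formula is not given in the text above, and the numbers in the usage lines are hypothetical.

```python
def aicc(num_params: int, log_likelihood: float, n_obs: int) -> float:
    """Return the small-sample corrected AIC (AICc)."""
    k = num_params
    if n_obs - k - 1 <= 0:
        # The correction term is undefined when n <= k + 1.
        raise ValueError("AICc requires n_obs > num_params + 1")
    aic = 2 * k - 2 * log_likelihood
    return aic + (2 * k * (k + 1)) / (n_obs - k - 1)


# Hypothetical fit: 3 parameters, maximized log-likelihood of -20.
small_sample = aicc(3, -20.0, n_obs=15)     # noticeable correction
large_sample = aicc(3, -20.0, n_obs=10000)  # correction almost vanishes
print(small_sample, large_sample)
```

The correction term shrinks toward zero as $n$ grows, so AICc and AIC agree for large datasets and differ mainly when $n$ is small relative to $k$.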

There are many other model selection criteria. For more detail read the article: Model Selection Criteria


https://rfaqs.com

https://gmstat.com

https://itfeature.com

Multiple Regression Analysis

Introduction to Multiple Regression Analysis

Francis Galton (a biometrician) examined the relationship between fathers’ and sons’ heights. He also analyzed the similarities between the parent and child generations of 700 sweet peas. Galton found that the offspring of tall parents tended to be shorter than their parents, and the offspring of short parents tended to be taller than their parents. The height of the children ($Y$) depends upon the height of the parents ($X$). When there is more than one independent variable (IV), we need multiple regression analysis (MRA), also called multiple linear regression (MLR).

Multiple Linear Regression Model

The linear regression model (equation) for two independent variables (regressors) is

$$Y_{i} = \alpha + \beta_1 X_{1i} + \beta_2 X_{2i} + \varepsilon_{i}$$

The general linear regression model (equation) for $k$ independent variables is

$$Y_{i} = \alpha + \beta_1 X_{1i} + \beta_2 X_{2i} + \beta_3X_{3i} + \cdots + \beta_k X_{ki} + \varepsilon_{i}$$

The $\beta$s are all regression coefficients (partial slopes) and the $\alpha$ is the intercept.

The sample linear regression model is

$$y_i = \hat{\alpha} + \hat{\beta}_1 x_{1i} + \hat{\beta}_2x_{2i} + \hat{\varepsilon}_{i}$$

Multiple Regression Coefficients Formula

To fit the MLR equation for two independent variables, one needs to compute the values of $\hat{\beta}_1$, $\hat{\beta}_2$, and $\hat{\alpha}$.

$$\hat{\beta}_1 = \frac{(S_{X_1 Y})(S_{X_2^2}) - (S_{X_2Y})(S_{X_1 X_2})}{(S_{X_1^2})(S_{X_2^2}) - (S_{X_1X_2})^2}$$

The first term of the numerator is the sum of products of the 1st independent and dependent variables, multiplied by the sum of squares of the 2nd independent variable.

The second term of the numerator is the sum of products of the 2nd independent and dependent variables, multiplied by the sum of products of the two independent variables.

The first term of the denominator is the sum of squares of the 1st independent variable, multiplied by the sum of squares of the 2nd independent variable.

The second term of the denominator is the square of the sum of products of the two independent variables.

The formula for 2nd regression coefficient is

$$\hat{\beta}_2 = \frac{(S_{X_2 Y})(S_{X_1^2}) - (S_{X_1Y})(S_{X_1 X_2})}{(S_{X_1^2})(S_{X_2^2}) - (S_{X_1X_2})^2}$$

Note that $S$ denotes a corrected sum of squares or sum of products; for example, $S_{X_1 Y} = \sum X_1 Y - \frac{\sum X_1 \sum Y}{n}$.

Multiple Linear Regression Example

Consider the following data about two regressors ($X_1, X_2$) and one regressand variable ($Y$).

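| $Y$ | $X_1$ | $X_2$ | $X_1 y$ | $X_2 y$ | $X_1 X_2$ | $X_1^2$ | $X_2^2$ |
|---|---|---|---|---|---|---|---|
| 30 | 10 | 15 | 300 | 450 | 150 | 100 | 225 |
| 22 | 5 | 8 | 110 | 176 | 40 | 25 | 64 |
| 16 | 10 | 12 | 160 | 192 | 120 | 100 | 144 |
| 7 | 3 | 7 | 21 | 49 | 21 | 9 | 49 |
| 14 | 2 | 10 | 28 | 140 | 20 | 4 | 100 |
| 89 | 30 | 52 | 619 | 1007 | 351 | 238 | 582 |

The last row contains the column totals.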

\begin{align*}
S_{X_1Y} &= \sum X_1 y - \frac{\sum X_1 \sum Y}{n} = 619 - \frac{30\times 89}{5} = 85\\
S_{X_1X_2} &= \sum X_1 X_2 - \frac{\sum X_1 \sum X_2}{n} = 351 - \frac{30 \times 52}{5} = 39\\
S_{X_1^2} &= \sum X_1^2 - \frac{(\sum X_1)^2}{n} = 238 -\frac{30^2}{5} = 58\\
S_{X_2^2} &= \sum X_2^2 - \frac{(\sum X_2)^2}{n} = 582 - \frac{52^2}{5} = 41.2\\
S_{X_2 Y} &= \sum X_2 y - \frac{\sum X_2 \sum Y}{n} =1007 - \frac{52 \times 89}{5} = 81.4
\end{align*}

\begin{align*}
\hat{\beta}_1 &= \frac{(S_{X_1 Y})(S_{X_2^2}) - (S_{X_2Y})(S_{X_1 X_2}) }{(S_{X_1^2})(S_{X_2^2}) - (S_{X_1X_2})^2} = \frac{(85)(41.2) - (81.4)(39)}{(58)(41.2) - (39)^2} = \frac{327.4}{868.6} \approx 0.3769\\
\hat{\beta}_2 &= \frac{(S_{X_2 Y})(S_{X_1^2}) - (S_{X_1Y})(S_{X_1 X_2}) }{(S_{X_1^2})(S_{X_2^2}) - (S_{X_1X_2})^2} = \frac{(81.4)(58) - (85)(39)}{(58)(41.2) - (39)^2} = \frac{1406.2}{868.6} \approx 1.6189\\
\hat{\alpha} &= \overline{Y} - \hat{\beta}_1 \overline{X}_1 - \hat{\beta}_2 \overline{X}_2 = 17.8 - (0.3769)(6) - (1.6189)(10.4) \approx -1.298\\
\hat{Y} &= -1.298 + 0.3769X_1 + 1.6189X_2
\end{align*}
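The hand computation can be reproduced with a short script. This is a minimal sketch in plain Python that applies the sum-of-squares and sum-of-products formulas to the five observations of the example; the helper name `s` and the variable names are illustrative choices, not part of the original text. As a sanity check, it also verifies a property of any least-squares fit: the residuals sum to zero and are uncorrelated with each regressor.

```python
# Data from the worked example: regressand Y and regressors X1, X2.
y = [30, 22, 16, 7, 14]
x1 = [10, 5, 10, 3, 2]
x2 = [15, 8, 12, 7, 10]
n = len(y)


def s(u, v):
    """Corrected sum of products: sum(u*v) - sum(u)*sum(v)/n."""
    return sum(a * b for a, b in zip(u, v)) - sum(u) * sum(v) / n


s_x1y, s_x2y, s_x1x2 = s(x1, y), s(x2, y), s(x1, x2)
s_x1x1, s_x2x2 = s(x1, x1), s(x2, x2)

# Partial regression coefficients and intercept for two regressors.
denom = s_x1x1 * s_x2x2 - s_x1x2 ** 2
b1 = (s_x1y * s_x2x2 - s_x2y * s_x1x2) / denom
b2 = (s_x2y * s_x1x1 - s_x1y * s_x1x2) / denom
a = sum(y) / n - b1 * sum(x1) / n - b2 * sum(x2) / n

# Residuals of the fitted equation y_hat = a + b1*x1 + b2*x2.
residuals = [yi - (a + b1 * u + b2 * v) for yi, u, v in zip(y, x1, x2)]

print(f"b1 = {b1:.4f}, b2 = {b2:.4f}, a = {a:.4f}")
print("sum of residuals:", round(sum(residuals), 10))
```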

Important Key Points of Multiple Regression

  • Independent variables (predictors, regressors): These are the variables that one believes to influence the dependent variable. One can have two or more independent variables in a multiple-regression model.
  • Dependent variable (outcome, response): This is the variable one is trying to predict or explain using the independent variables.
  • Linear relationship: The core assumption is that the relationship between the independent variables and dependent variable is linear. This means the dependent variable changes at a constant rate for a unit change in the independent variable, holding all other variables constant.

The main goal of multiple regression analysis is to find a linear equation that best fits the data. The multiple regression analysis also allows one to:

  • Predict the value of the dependent variable based on the values of the independent variables.
  • Understand how changes in the independent variables affect the dependent variable while considering the influence of other independent variables.


Simple Linear Regression Model

The simple linear regression model is one of the most fundamental techniques in machine learning and statistics. Whether you are a data science newbie or just brushing up on the basics, understanding linear regression is essential.

Introduction

Frequently, we measure two or more variables on each individual and try to express the nature of the relationship between these variables (for example, in the simple linear regression model and correlation analysis). Using the regression technique, we estimate the relationship of one variable with another by expressing the one in terms of a linear (or more complex) function of another. We also predict the values of one variable in terms of the other. The variables involved in regression and correlation analysis are continuous. In this post, we will learn about the Simple Linear Regression Model.

Functional Relationship Between Variables

We are interested in establishing significant functional relationships between two (or more) variables. For example, the function $Y=f(X)=a+bX$ (read as $Y$ is a function of $X$) establishes a relationship to predict the values of variable $Y$ for the given values of variable $X$. In statistics (biostatistics), this function is called a simple linear regression model or the regression equation.

The variable $Y$ is called the dependent (response) variable, and $X$ is called the independent (regressor or explanatory) variable.

In biology, many relationships can be appropriate over only a limited range of values of $X$. Negative values are meaningless in many cases, such as age, height, weight, and body temperature.

The method of linear regression is used to estimate the best-fitting straight line to describe the relationship between variables. The linear regression gives the equation of the straight line that best describes how the outcome of $Y$ increases/decreases with an increase/decrease in the explanatory variable $X$. The equation of the regression line is
$$Y=\beta_0 + \beta_1 X,$$
where $\beta_0$ is the intercept (value of $Y$ when $X=0$) and $\beta_1$ is the slope of the line. Both $\beta_0$ and $\beta_1$ are the parameters (or regression coefficients) of the linear equation.

Estimation of Regression Coefficients in Simple Linear Regression Model

The best-fitting line is derived using the method of least squares, that is, by finding the values of the parameters $\beta_0$ and $\beta_1$ that minimize the sum of the squared vertical distances of the points from the regression line.

The best-fit line passes through the point ($\overline{X}, \overline{Y}$).

The regression line $Y=\beta_0+\beta_1X$ is fit by the least-squares method. The regression coefficients $\beta_0$ and $\beta_1$ are both calculated to minimize the sum of squares of the vertical deviations of the points about the regression line. Each deviation equals the difference between the observed value of $Y$ and the estimated value of $Y$ (the corresponding point on the regression line).

The following table shows the body weight and plasma volume of eight healthy men.

| Subject | Body Weight (kg) | Plasma Volume (liters) |
|---|---|---|
| 1 | 58.0 | 2.75 |
| 2 | 70.0 | 2.86 |
| 3 | 74.0 | 3.37 |
| 4 | 63.5 | 2.76 |
| 5 | 62.0 | 2.62 |
| 6 | 70.5 | 3.49 |
| 7 | 71.0 | 3.05 |
| 8 | 66.0 | 3.12 |

Simple Linear Regression Models: Scatter plot with regression line

Estimation of Parameters

The parameters $\beta_0$ and $\beta_1$ are estimated using the following formula (for simple linear regression model):

\begin{align}
\beta_1 &= \frac{n\sum\limits_{i=1}^{n} x_iy_i -\sum\limits_{i=1}^{n} x_i \sum\limits_{i=1}^{n} y_i} {n \sum\limits_{i=1}^{n} x_i^2 - \left(\sum\limits_{i=1}^{n} x_i \right)^2}\\
\beta_0 &= \overline{Y} - \beta_1 \overline{X}
\end{align}

Regression coefficients are sometimes known as “beta-coefficients”. When the slope is zero ($\beta_1=0$), there is no linear relationship between the $X$ and $Y$ variables. For the data above, the best-fitting straight line describing the relationship between plasma volume and body weight is
$$\text{Plasma Volume} = 0.0857 +0.0436\times \text{Weight}$$
Note that the calculated values for $\beta_0$ and $\beta_1$ are estimates of the population values and are, therefore, subject to sampling variation.
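The estimates for the plasma volume data can be checked with a few lines of plain Python. This is a minimal sketch that applies the slope and intercept formulas to the eight observations in the table; the variable names and the `predict` helper are illustrative, and the 68 kg input is a hypothetical value within the observed weight range.

```python
# Body weight (kg) and plasma volume (liters) of the eight subjects.
weight = [58.0, 70.0, 74.0, 63.5, 62.0, 70.5, 71.0, 66.0]
plasma = [2.75, 2.86, 3.37, 2.76, 2.62, 3.49, 3.05, 3.12]
n = len(weight)

# Sums required by the least-squares formulas.
sum_x = sum(weight)
sum_y = sum(plasma)
sum_xy = sum(x * y for x, y in zip(weight, plasma))
sum_x2 = sum(x * x for x in weight)

# Slope and intercept of the fitted line.
b1 = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
b0 = sum_y / n - b1 * sum_x / n


def predict(w):
    """Predicted plasma volume (liters) for a body weight w (kg)."""
    return b0 + b1 * w


print(f"intercept = {b0:.4f}, slope = {b1:.4f}")
print(f"predicted plasma volume at 68 kg: {predict(68.0):.2f} liters")
```

Predictions are only meaningful within the range of body weights observed in the sample (58 to 74 kg here).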

Real-Life Examples: Simple Linear Regression Models

  • Real Estate: Predicting House Prices (Estimate home prices based on size to guide buyers and sellers.)
    Independent Variable ($X$): Size of the house (sq ft)
    Dependent Variable ($Y$): Price of the house
  • Education: Predicting Student Scores (Teachers or students can predict likely outcomes based on study habits.)
    $X$: Hours studied
    $Y$: Exam scores
  • Healthcare: Predicting Blood Pressure (Understand how blood pressure tends to rise with age, aiding diagnosis.)
    $X$: Age of patient
    $Y$: Systolic blood pressure
  • Energy: Predicting Electricity Usage (Power companies use this to forecast demand and manage resources.)
    $X$: Temperature (°C or °F)
    $Y$: Electricity consumption (kWh)
  • Manufacturing: Predicting Machine Failures (Predict maintenance schedules and avoid production delays.)
    $X$: Hours a machine has been in use
    $Y$: Number of breakdowns or wear percentage
  • Business: Predicting Sales Based on Advertising Spend (Helps businesses decide how much to invest in advertising.)
    $X$: Advertising expenditure (in $\$$)
    $Y$: Product sales (in units)
  • Agriculture: Predicting Crop Yield (Estimate yield based on expected rainfall to plan for food production.)
    $X$: Amount of rainfall (mm)
    $Y$: Crop yield (kg per acre)
  • Finance: Predicting Stock Prices (A basic way to forecast trends over time; note that simple linear regression has limits in volatile markets.)
    $X$: Time (days or months)
    $Y$: Stock closing price
  • Transportation: Estimating Fuel Consumption (Predict fuel needs and optimize transportation costs.)
    $X$: Distance traveled (km)
    $Y$: Fuel used (liters)
  • E-commerce: Predicting Customer Spending (Analyze user behavior and optimize website experience for better conversion.)
    $X$: Time spent on the website
    $Y$: Amount spent on a purchase