Akaike Information Criteria: A Comprehensive Guide

The Akaike Information Criterion (AIC) is a method used in statistics and machine learning to compare the relative quality of different models for a given dataset. It helps in selecting the best model from a set of candidates by penalizing models that are overly complex. In other words, the AIC provides a means of comparing models, i.e. a tool for model selection.

  • A too-simple model leads to a large approximation error.
  • A too-complex model leads to a large estimation error.

AIC is a measure of the goodness of fit of a statistical model, developed by Hirotugu Akaike under the name "an information criterion (AIC)" and first published by him in 1974. It is grounded in the concept of information entropy and formalizes the trade-off between bias and variance in model construction, or equivalently between the accuracy and the complexity of the model.

The Formula of Akaike Information Criteria

Given a data set, several candidate models can be ranked according to their AIC values. From the AIC values, one may infer, for example, that the top two models are roughly tied and the rest are far worse.

$$AIC = 2k - 2\ln(L)$$

where $k$ is the number of parameters in the model, and $L$ is the maximized value of the likelihood function for the estimated model.
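
As a quick check of the formula, here is a minimal Python sketch; the parameter count and log-likelihood values are hypothetical, chosen only for illustration.

```python
def aic(k, log_likelihood):
    """Akaike Information Criterion: AIC = 2k - 2*ln(L)."""
    return 2 * k - 2 * log_likelihood


# Hypothetical model: 3 estimated parameters, maximized log-likelihood of -120.5
print(aic(k=3, log_likelihood=-120.5))  # 2*3 - 2*(-120.5) = 247.0
```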


For a set of candidate models for the data, the preferred model is the one with the minimum AIC value. AIC estimates the relative support for a model, which means that AIC scores by themselves are not very meaningful.

The Akaike Information Criterion focuses on the following:

  • Balances fit and complexity: A model that perfectly fits the data might not be the best because it might be memorizing the data instead of capturing the underlying trend. AIC considers both how well a model fits the data (goodness of fit) and how complex it is (number of variables).
  • A lower score is better: Models having lower AIC scores are preferred as they achieve a good balance between fitting the data and avoiding overfitting.
  • Comparison tool: AIC scores are most meaningful when comparing models fitted to the same dataset. The model with the lowest AIC score is considered the best relative to the other models being evaluated (see the sketch below).
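
To illustrate AIC as a comparison tool, the following sketch compares polynomial models of different degrees fitted to a synthetic quadratic dataset. It assumes NumPy, uses the Gaussian log-likelihood of a least-squares fit, and counts the estimated error variance as one extra parameter; it is an illustration, not the only way to compute AIC.

```python
import numpy as np


def gaussian_aic(y, y_hat, n_params):
    """AIC for a least-squares fit with Gaussian errors.

    Uses log L = -n/2 * (ln(2*pi) + ln(RSS/n) + 1) and counts the
    estimated error variance as one additional parameter.
    """
    n = len(y)
    rss = np.sum((y - y_hat) ** 2)
    log_l = -n / 2 * (np.log(2 * np.pi) + np.log(rss / n) + 1)
    return 2 * (n_params + 1) - 2 * log_l


# Synthetic data: quadratic trend plus noise
rng = np.random.default_rng(42)
x = np.linspace(0, 10, 50)
y = 2 + 0.5 * x + 0.3 * x**2 + rng.normal(scale=2.0, size=x.size)

for degree in (1, 2, 5):
    coef = np.polyfit(x, y, degree)      # least-squares polynomial fit
    y_hat = np.polyval(coef, x)
    print(degree, round(gaussian_aic(y, y_hat, degree + 1), 1))
```

The degree-2 model typically attains the lowest AIC here: the degree-5 model fits the sample slightly better, but its extra parameters are penalized.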

Summary

The AIC score is a single number used as a model selection criterion. One cannot interpret an AIC score in isolation. However, one can compare the AIC scores of different models fitted to the same data. The model with the lowest AIC is generally considered the best choice.

The AIC is the most useful model selection criterion when there are multiple candidate models to choose from. It works well for larger datasets. However, for smaller datasets, the corrected AIC (AICc) should be preferred. AIC is not perfect, and there can be situations where it fails to choose the optimal model.

There are many other model selection criteria. For more detail, read the article: Model Selection Criteria.


Estimation of Population Parameters

Introduction to Estimation of Population Parameters

In statistics, estimating population parameters is important because it allows the researcher to draw conclusions about a population (the whole group) by analyzing a small part of that population. Estimation of population parameters is typically done when the population under study is too large to examine in full. For example, instead of performing a census, a random sample can be drawn from the population, and the required sample statistic(s) can be calculated to draw conclusions about the population.

Important Terminologies

The following are some important terminologies to understand the concept of estimating the population parameters.

  • Population: The entire collection of individuals or items one is interested in studying. For instance, all the people living in a particular country.
  • Sample: A subgroup (or small portion) chosen from the population that represents the larger group.
  • Parameter: A characteristic that describes the entire population, such as the population mean, median, or standard deviation.
  • Statistic: A value calculated from the sample data and used to estimate a population parameter. For example, the sample mean is an estimate of the population mean. A statistic is a characteristic of the sample under study.

Various statistical methods are used to estimate population parameters with different levels of accuracy. The accuracy of the estimate depends on the size of the sample and how well the sample represents the population.

We use statistics calculated from the sample data as estimates for the population parameters. The most common sample statistics and the parameters they estimate are listed below (a short computational sketch follows the list).

  • Sample mean: is used to estimate the population mean. It is calculated by averaging the values of all observations in the sample, that is the sum of all data values divided by the total number of observations in the data.
  • Sample proportion: is used to estimate the population proportion (percentage). It represents the number of successes (events of interest) divided by the total sample size.
  • Sample standard deviation: is used to estimate the population standard deviation. It reflects how spread out the data points are in the sample.
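
The small Python sketch below (the height values are hypothetical, used only for illustration) computes all three sample statistics from a single sample.

```python
import math

# Hypothetical sample of 10 heights (inches)
sample = [64, 66, 63, 65, 67, 64, 68, 65, 66, 62]
n = len(sample)

sample_mean = sum(sample) / n                               # estimates the population mean
sample_var = sum((x - sample_mean) ** 2 for x in sample) / (n - 1)
sample_sd = math.sqrt(sample_var)                           # estimates the population standard deviation

# Sample proportion: share of observations taller than 65 inches
sample_proportion = sum(1 for x in sample if x > 65) / n    # estimates the population proportion

print(sample_mean, round(sample_sd, 2), sample_proportion)  # 65.0 1.83 0.4
```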

Types of Estimates

There are two types of estimates:

  • Point Estimate: A single value used to estimate the population parameter. Examples of point estimates are:
    • The mean/average height of boys in colleges is 65 inches.
    • 65% of Lahore residents support a ban on cell phone use while driving.
  • Interval Estimate: A set of values (an interval) that is expected to contain the population parameter (see the sketch after this list for a computed interval). Examples of interval estimates are:
    • The mean height of boys in colleges lies between 63.5 and 66.5 inches.
    • 65% ($\pm 3\%$) of Lahore residents support a ban on cell phone use while driving.
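
As an illustration of an interval estimate, the sketch below uses a hypothetical survey of 1,000 residents and the usual normal-approximation formula for a proportion; it is not the actual study behind the 65% figure above.

```python
import math

# Hypothetical survey: 650 of 1,000 sampled residents support the ban
successes, n = 650, 1000
p_hat = successes / n                               # point estimate of the population proportion

# 95% interval estimate via the normal approximation (z = 1.96)
margin = 1.96 * math.sqrt(p_hat * (1 - p_hat) / n)
print(f"{p_hat:.2f} ± {margin:.3f}")                # about 0.65 ± 0.030
```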

Some Examples

Estimation of population parameters is widely used in many fields. For example,

  • a company might estimate customer satisfaction through a sample survey,
  • a biologist might estimate the average wingspan of a specific bird species by capturing and measuring a small group.


Empirical Probability Examples

Introduction to Empirical Probability

An empirical probability (also called experimental probability) is calculated by collecting data from past trials of an experiment. The empirical probability obtained is then used to predict the future likelihood of the event occurring.

Formula and Examples of Empirical Probability

To calculate an empirical/experimental probability, one can use the formula below (a small computational sketch follows the examples):

$$P(A)=\frac{\text{Number of trials in which } A \text{ occurs}}{\text{Total number of trials}}$$

  • Coin Flip: Let us flip a coin 200 times and get heads 105 times. The empirical probability of getting heads is $\frac{105}{200} = 0.525$, or 52.5%.
  • Weather Prediction: Let you track the weather for a month and see that it rained 12 out of 30 days. The empirical probability of rain on a given day that month is $\frac{12}{30} = 0.4$ or 40%.
  • Plant Growth: Let you plant 50 seeds and 35 sprout into seedlings. The experimental probability of a seed sprouting is $\frac{35}{50} = 0.70$ or 70%.
  • Board Game: Suppose you play a new board game 10 times and win 6 times. The empirical probability of winning the game is $\frac{6}{10} = 0.6$ or 60%.
  • Customer Preferences: In a survey of 100 customers, 80 prefer chocolate chip cookies over oatmeal raisins. The empirical probability of a customer preferring chocolate chip cookies is $\frac{80}{100} = 0.80$ or 80%.
  • Basketball Game: A basketball player practices free throws and makes 18 out of 25 attempts. The experimental probability of the player making their next free throw is $\frac{18}{25} = 0.72$ or 72%.
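
All of the examples above have the form "count of favourable trials divided by total trials". A minimal Python sketch, using the standard library's random module to simulate the coin-flip example, is:

```python
import random
from collections import Counter

# Simulate 200 coin flips and estimate P(heads) empirically
random.seed(1)
flips = [random.choice(["H", "T"]) for _ in range(200)]
counts = Counter(flips)

# Number of trials in which heads occurs / total number of trials
p_heads = counts["H"] / len(flips)
print(p_heads)
```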

Empirical Probability From Frequency Tables

From a frequency table, one can calculate the probability that a certain data value falls into any data group/ class. For example, consider the frequency table of examination scores in a certain class.

Class | Frequency ($f$) | Relative Frequency ($rf$)
40 – 49 | 1 | $\frac{1}{20}=0.05$
50 – 59 | 2 | $\frac{2}{20}=0.10$
60 – 69 | 3 | $\frac{3}{20}=0.15$
70 – 79 | 4 | $\frac{4}{20}=0.20$
80 – 89 | 6 | $\frac{6}{20}=0.30$
90 – 99 | 4 | $\frac{4}{20}=0.20$

Let $A$ be the event that a student scores between 90 and 99 on the exam; then

$$P(A) = \frac{\text{Number of students scoring 90-99}}{\text{Total number of students}} = \frac{4}{20} = 0.20$$

Notice that $P(A)$ is the relative frequency of the class 90-99.
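
The same calculation can be done for every class at once. A short sketch, using the frequency table above, is:

```python
# Frequency table of examination scores: class -> frequency
freq = {"40-49": 1, "50-59": 2, "60-69": 3, "70-79": 4, "80-89": 6, "90-99": 4}

total = sum(freq.values())                          # 20 students
rel_freq = {cls: f / total for cls, f in freq.items()}

# P(A): probability that a randomly chosen student scored between 90 and 99
print(rel_freq["90-99"])                            # 0.2
```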


Key Points about Empirical Probability

  • It is based on actual data, not theoretical models.
  • It is a good approach when the data is from similar events in the past.
  • The more data you have, the more accurate the estimate will be.
  • It is not always perfect, as past results do not guarantee future outcomes.

Limitations of Empirical Probability

  • It can be time-consuming and expensive to collect enough data.
  • It may not be representative of the future, especially if the underlying conditions change.


Median of Ungrouped Data

Introduction to Median of Ungrouped Data

This post is about calculating the median of ungrouped data. The median is the middlemost (most central) value of a data set, with the condition that the data or set of observations is arranged in ascending or descending order. The median divides the data into two equal parts; that is the main objective of the median.

It is important to note that the criteria for finding the median for grouped and ungrouped data are different.

Primary and secondary data can be defined as:

  1. Primary data, also called raw or ungrouped data, has not undergone any statistical procedure/method and is not in the form of a frequency distribution.
  2. Secondary data may also be called grouped data if it is in the form of a frequency distribution.

Let us discuss how to find the median for ungrouped data.

There are two cases for ungrouped data. These cases are based on the number of observations, $n$:

when the number of observations $n$ is odd, and when the number of observations $n$ is even.

Median Calculations

The data below contains an odd number of observations.

Observation No. (Ascending Order) | 1st | 2nd | 3rd | 4th | 5th | 6th | 7th | 8th | 9th | 10th | 11th
Data Values | 81 | 89 | 90 | 96 | 100 | 102 | 103 | 104 | 108 | 109 | 118
Observation No. (Descending Order) | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1

Since the number of observations is odd ($n = 11$), the central value after arranging the data in ascending order is the 6th value, which is 102. That is, the median of the above data is 102.

The position of the median can be located mathematically, as follows:

\begin{align*}
\tilde{x} &= \left( \frac{n+1}{2} \right)th\,\, \text{value}\\
&=\frac{11+1}{2} = 6th\,\, \text{value}
\end{align*}

The value at the 6th position (in the sorted data) is 102. The symbol $\tilde{x}$ can be read as "x-tilde" and is the notation for the median.

Median for Even Numbers of Observations

Consider the following data that contains an even number of observations.

Observation No. | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10
Data Values | 81 | 100 | 96 | 108 | 90 | 102 | 104 | 103 | 109 | 89

Data after sorting (either in ascending or descending order) is

Observation No. | 1st | 2nd | 3rd | 4th | 5th | 6th | 7th | 8th | 9th | 10th
$x$ | 81 | 89 | 90 | 96 | 100 | 102 | 103 | 104 | 108 | 109

Since $n=10$ is even, the central position (that is, the median) lies between the 5th and 6th values of the sorted data. The median is the average of these two central observations. Here the two central values are 100 and 102, so the median is their average:

$$Median = \frac{100+102}{2} = 101$$
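
Both hand calculations can be verified with a few lines of Python using the data values given above; statistics.median sorts the data internally and handles the odd and even cases automatically.

```python
from statistics import median

odd_data = [81, 89, 90, 96, 100, 102, 103, 104, 108, 109, 118]   # 11 observations
even_data = [81, 100, 96, 108, 90, 102, 104, 103, 109, 89]       # 10 observations

# For odd n the middle sorted value is returned; for even n the
# average of the two middle sorted values is returned.
print(median(odd_data))    # 102
print(median(even_data))   # 101
```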

Median Formula for Large Data Sets

The median formula for large or small data sets can be represented mathematically.

  • For large data sets, one can locate the median mathematically. The formula differs for an odd and an even number of observations.

The point to remember when computing the median is that

  • For an odd number of observations, the median is the centermost value after sorting the data
  • For an even number of observations, the median is the average of two central values after sorting the data

\begin{align*}
\tilde{x} &= \frac{1}{2} \left[ \left(\frac{n}{2}\right)th \,\, \text{value} + \left(\frac{n}{2}+1 \right)th \,\, \text{value} \right]\quad \quad \text{(when $n$ is even)}\\
&= \left(\frac{n+1}{2}\right)th \,\, \text{value} \quad \quad \text{(when $n$ is odd)}
\end{align*}


Consider a data set containing 157 observations. To compute the median, first sort the data in either ascending or descending order. The formula for this data will be

$$\tilde{x} = \left(\frac{n+1}{2}\right)th = \left(\frac{157+1}{2}\right)th = 79th \,\, \text{value}$$

The 79th observation in the sorted data will be the median of the data.

In case there is an even number of observations (say $n=396$), the median will be

\begin{align*}
\tilde{x} &= \frac{1}{2}\left[\left(\frac{n}{2}\right)th \,\, \text{value} + \left(\frac{n}{2}+1\right)th \,\, \text{value} \right]\\
&=\frac{1}{2} \left[\left(\frac{396}{2}\right)th + \left(\frac{396}{2}+1\right)th \right]\\
&= \frac{1}{2} \left[198th \,\, \text{value} + 199th \,\, \text{value}\right]
\end{align*}

The average of the 198th and 199th values from the sorted data will be the median of the data.
