Logistic regression Introduction

Logistic regression was introduced in the 1930s by Ronald Fisher and Frank Yates and was first proposed in the 1970s as an alternative technique to overcome the limitations of ordinary least squares regression in handling dichotomous outcomes. It is a type of probabilistic statistical classification model, which is a non-linear regression model, and can be converted into a linear model by using a simple transformation. It is used to predict a binary response categorical dependent variable, based on one or more predictor variables. That is, it is used in estimating empirical values of the parameters in a model. Here response variable assumes a value of zero or one, i.e., a dichotomous variable.

Logistic Regression Model

The logistic regression model is written as

  \[\pi=\frac{1}{1+e^{-[\alpha +\sum_{i=1}^k \beta_i X_{ij}]}}\]

where $\alpha$ and $\beta_i$ are the intercept and slope respectively.

Logistic Regression

So, in simple words, logistic regression is used to find the probability of the occurrence of the outcome of interest.  For example, if we want to find the significance of the different predictors (gender, sleeping hours, taking part in extracurricular activities, etc.), on a binary response (pass or fail in exams coded as 0 and 1), for this kind of problem, we used logistic regression.

By using a transformation, this nonlinear regression model can be easily converted into a linear model. As $\pi$ is the probability of the events in which we are interested, if we take the ratio of the probability of success and failure, then the model becomes a linear model.

\[ln(y)=ln(\frac{\pi}{1-\pi})\]

The natural log of odds can convert the logistic regression model into a linear form.

Real-Life Examples of Logistic Regression

Some real-life examples are:

  1. Medical Diagnostics: It is used to predict whether a patient has a disease (for example, diabetes, cancer) based on symptoms, lab tests, and medical history.
  2. Spam Email Detection: Emails can be classified as spam or not spam using word frequency, sender details, etc.
  3. Marketing (Customer Churn Prediction): It is used to predict if a customer will stop using a service (for example, cancel a subscription) based on usage patterns and demographics.
  4. Credit Scoring (Loan Approval): Banks use logistic regression to decide whether to approve a loan based on income, credit score, employment status, etc.
  5. Ad Click Prediction: It can be used to predict whether a user will click on an online ad based on browsing history and demographics.
  6. Employee Attrition: Can be used to predict if an employee will leave a company based on job satisfaction, salary, and tenure.
  7. College Admissions: It is used to predict whether a student will be admitted to a university based on GPA, test scores, and extracurricular activities.
  8. Fraud Detection in Banks: It can be used to detect fraudulent credit card transactions based on transaction amount, location, and spending habits.
  9. Political Election Forecasting: Predicting if a candidate will win an election based on polling data, demographics, and campaign spending.
  10. Sports Analytics: Win prediction of a team based on their past performance, player stats, and opponent strength can be made using logistic regression.

Binary Logistic Regression in Minitab

References:

Discovering Odds Ratio

An odds ratio is a relative measure of effect, allowing the comparison of the intervention group of a study relative to the comparison or placebo group. The odds ratio helps quantify the strength and direction of the relationship between two groups or conditions.

Introduction Odds Ratio

The odds ratio (OR) is a measure of association used in statistics to compare the odds of an event occurring in one group to the odds of it occurring in another group. It is commonly used in case-control studies and logistic regression.

  • an OR of 1 indicates no difference between groups,
  • an OR greater than 1 suggests higher odds in the first group, and
  • an OR less than 1 suggests lower odds in the first group.

Medical students, students from clinical and psychological sciences, professionals allied to medicine enhancing their understanding and learning of medical literature, and researchers from different fields of life usually encounter Odds Ratio (OR) throughout their careers.

When computing the OR, one would do:

  • The numerator is the odds in the intervention arm
  • The denominator is the odds in the control or placebo arm

Calculating Odds Ratio

The ratio of the probability of success and failure is known as the odds. If the probability of an event is $P_1$, then the odds are:
\[OR=\frac{p_1}{1-p_1}\]

If the outcome is the same in both groups, the ratio will be 1, implying that there is no difference between the two arms of the study. However, if the $OR>1$, the control group is better than the intervention group, while if the $OR<1$, the intervention group is better than the control group.

The Odds Ratio is the ratio of two odds that can be used to quantify how much a factor is associated with the response factor in a given model. If the probabilities of occurrences of an event are $P_1$ (for the first group) and $P_2$ (for the second group), then the OR is:
\[OR=\frac{\frac{p_1}{1-p_1}}{\frac{p_2}{1-p_2}}\]

If predictors are binary, then the OR for the $i$th factor is defined as
\[OR_i=e^{\beta}_i\]

Odds Ratio

Real-Life Examples of Odds Ratio

  1. Medical Researches
    • Consider we are interested in comparing the odds of developing a disease (e.g., lung cancer) in smokers versus non-smokers. Suppose the OR is 2.5; it means smokers have 2.5 times higher odds of developing lung cancer compared to non-smokers.
  2. Public Health
    • Suppose we are interested in assessing the effectiveness of a vaccine. For example, comparing the odds of contracting a disease (e.g., COVID-19) in vaccinated versus unvaccinated individuals. An OR less than 1 would indicate the vaccine reduces the odds of infection.
  3. Social Sciences
    • Consider that we are interested in studying the odds of students passing an exam based on attendance. For instance, if students who attend extra tutoring have an OR of 3.0 for passing, they have 3 times higher odds of passing compared to those who don’t attend.
  4. Marketing
    • Suppose we need to analyze the odds of customers purchasing a product after seeing an advertisement versus not seeing it. An OR greater than 1 suggests the ad increases the likelihood of purchase.
  5. Environmental Studies
    • Evaluating the odds of developing asthma in people living in high-pollution areas compared to those in low-pollution areas. An OR greater than 1 would indicate higher odds of asthma in high-pollution areas.

The regression coefficient $b_1$ from logistic regression is the estimated increase in the log odds of the dependent variable per unit increase in the value of the independent variable. In other words, the exponential function of the regression coefficients $(e^{b_1})$ in the OR is associated with a one-unit increase in the independent variable.

Online MCQs about Economics with Answers

R Programming Language Lectures

Formula of Median and Definition

The post is about the Formula of Median, its definition, and examples for the computation of the median for an even or odd number of observations in a data set.

Introduction of Median

Median (a measure of central tendency) is the middle-most value in the data set when all of the values (observations) in a data set are arranged either in ascending or descending order of their magnitude. The median is also considered a measure of central tendency that divides the data set into two halves, where the first half contains 50% observations below the median value and 50% above the median value. If there is an odd number of observations (data points) in a data set, the median value is the single-most middle value after sorting the data set.

After understanding the median definition, let’s consider a few examples of how to calculate the median for a data set.

Median Example – 1

Question: For the following data set: 5, 9, 8, 4, 3, 1, 0, 8, 5, 3, 5, 6, 3, calculate the median.

Answer: To find the median of the given data set, first sort the data (either in ascending or descending order), that is
0, 1, 3, 3, 3, 4, 5, 5, 5, 6, 8, 8, 9. After sorting, the middle-most value of the above data is 5, which is the median of the given data set.

When the number of observations in a data set is even, then the median value is the average of the two middle-most values in the sorted data.

Median Example – 2

Question: Consider the following data set: 5, 9, 8, 4, 3, 1, 0, 8, 5, 3, 5, 6, 3, 2. Compute the median.

Answer: To find the median, first sort it and then locate the middle-most two values, that is,
0, 1, 2, 3, 3, 3, 4, 5, 5, 5, 6, 8, 8, 9. The middle-most two values are 4 and 5. So the median will be the average of these two values, i.e., 4.5 in this case.

The median is less affected by extreme values in the data set, so the median is the preferred measure of central tendency when the data set is skewed or not symmetrical.

Formula of Median for Odd Number of Observations

For large data sets, it is relatively very difficult to locate median values in sorted data. It will be helpful to compute the median value using the formula. The median formula for an odd number of observations is
$\begin{aligned}
Median &=\frac{n+1}{2}th\\
Median &=\frac{n+1}{2}\\
&=\frac{13+1}{2}\\
&=\frac{14}{2}=7th
\end{aligned}$

The 7th value in the sorted data is the median of the given data.

Formula of Median for Even Number of Observations

The formula of the median for an even number of observations is
$\begin{aligned}
Median&=\frac{1}{2}(\frac{n}{2}th + (\frac{n}{2}+1)th)\\
&=\frac{1}{2}(\frac{14}{2}th + (\frac{14}{2}+1)th)\\
&=\frac{1}{2}(7th + 8th )\\
&=\frac{1}{2}(4 + 5)= 4.5
\end{aligned}$

Median definition formula of median and example

The computation/ calculation of the median is a crucial step in exploratory data analysis (EDA). It helps identify potential outliers, assess skewness in the data distribution, and choose appropriate statistical methods for further analysis.

Applications of Median in Different Scenarios

1. Resisting Outliers: The median’s primary strength lies in its resistance to outliers. Unlike the mean (which can be swayed by extreme values), the median remains unaffected and stable by a few very high or very low data points (extreme observations).

2. Analyzing Skewed Distributions: When dealing with data that is not symmetrical (has skewed distributions), the median provides a more accurate representation of the “center” of the data compared to the mean/average. The median reflects the value that divides the data into halves, whereas the mean gets pulled towards the tail of the skewed distribution.

3. Ease of Interpretation: The median is a simple concept – the middle (centermost) value when the data is arranged in order (either ascending or descending).

Note that the median measure of central tendency cannot be found for categorical data.

FAQs about Median

  1. What is the median?
  2. What is the advantage of the median over other measures of central tendency?
  3. On what kind/type of data, the median computed?
  4. What is the benefit of using the median?
  5. What is the formula of the median when the number of observations is even and when the number of observations is odd?
  6. How is the median interpreted?
  7. In how many groups median classify the data/sample/population?
https://itfeature.com

Online MCQs Test website

R Programming Language