Statistical Inference: An Introduction

Introduction to Statistical Inference

Inference means conclusion. Statistical inference is the branch of Statistics that deals with methods for drawing conclusions (inferences) about a population (called the reference population or target population) based on sample information. Statistical inference is also known as inferential statistics. As we know, there are two branches of Statistics: descriptive and inferential.

Statistical inference is a cornerstone of many fields. It allows researchers to make informed decisions based on data, even when they cannot study the entire population of interest. Statistical inference has two fields of study: estimation and testing of hypotheses.


Estimation

Estimation is the procedure by which we obtain an estimate of the true but unknown value of a population parameter by using sample information taken from that population. For example, we can estimate the mean of a population by computing the mean of a sample drawn from that population.

Estimator

An estimator is a statistic (a rule or formula) whose calculated value is used to estimate (make an informed guess, from the sample information, about) a population parameter $\theta$.

Estimate

An estimate is a particular realization of an estimator, denoted $\hat{\theta}$; it is the numerical value that the sample statistic takes for a given sample.

Types of Estimators

An estimator, and the estimate it produces, can be classified as either a point estimate or an interval estimate.

Point Estimate

A point estimate is a single number that can be regarded as the most plausible value of the population parameter $\theta$.

Interval Estimate

An interval estimate is a range of values, constructed with a stated level of confidence, that is expected to contain the true value of the population parameter $\theta$.
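As a concrete illustration, the minimal Python sketch below computes a point estimate (the sample mean) and a 95% interval estimate (a t-based confidence interval) from a small sample. The sample values are made up for illustration, and the sketch assumes SciPy is available for the t-distribution quantile.

```python
import math
import statistics
from scipy import stats

# Hypothetical sample drawn from the population of interest (illustrative values)
sample = [12.1, 11.4, 13.0, 12.6, 11.9, 12.3, 12.8, 11.7]
n = len(sample)

# Point estimate: a single number regarded as the most plausible value of the mean
mean_hat = statistics.mean(sample)

# Interval estimate: a 95% confidence interval based on the t-distribution
se = statistics.stdev(sample) / math.sqrt(n)   # standard error of the mean
t_crit = stats.t.ppf(0.975, df=n - 1)          # two-sided 95% critical value
lower, upper = mean_hat - t_crit * se, mean_hat + t_crit * se

print(f"point estimate: {mean_hat:.2f}")
print(f"95% interval estimate: ({lower:.2f}, {upper:.2f})")
```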

Testing of Hypothesis

Testing of Hypothesis is a procedure that enables us to decide, based on information obtained through sampling, whether to accept or reject a specific statement or hypothesis regarding the value of a parameter in a statistical problem.

Note that since we rely on samples, there is always some chance our inferences are not perfect. Statistical inference acknowledges this by incorporating concepts like probability and confidence intervals. These help us quantify the uncertainty in our estimates and test results.

Important Considerations about Testing of Hypothesis

  • Hypothesis testing does not prove anything; it provides evidence for or against a claim.
  • There is always a chance of making errors (Type I or Type II).
  • The results are specific to the chosen sample and significance level.
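A minimal sketch of this procedure in Python, assuming SciPy is installed and using made-up sample values: a two-sided one-sample t-test of a hypothesized population mean at the 5% significance level.

```python
from scipy import stats

# Hypothetical sample and hypothesized population mean (H0: mu = 12)
sample = [12.1, 11.4, 13.0, 12.6, 11.9, 12.3, 12.8, 11.7]
mu_0 = 12.0
alpha = 0.05  # significance level = probability of a Type I error

# Two-sided one-sample t-test of H0: mu = mu_0 against H1: mu != mu_0
t_stat, p_value = stats.ttest_1samp(sample, popmean=mu_0)

if p_value < alpha:
    print(f"t = {t_stat:.2f}, p = {p_value:.3f}: reject H0 at the 5% level")
else:
    print(f"t = {t_stat:.2f}, p = {p_value:.3f}: fail to reject H0 at the 5% level")
```

Note that "fail to reject" is used rather than "accept", in line with the point above that hypothesis testing provides evidence rather than proof.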

Statistical Inference in Real-Life

Some real-life examples of inferential statistics:

  1. Medical Trials: When a new drug is developed, it is tested on a sample of patients to infer its effectiveness and safety for the general population. Statistical inference helps determine whether the observed effects are due to the drug or random chance.
  2. Market Research: Companies use inferential statistics to understand consumer preferences and behaviours. By surveying a sample of consumers, they can infer the preferences of the broader market and make informed decisions about product development and marketing strategies.
  3. Public Health: Epidemiologists use statistical inference to track the spread of diseases and the effectiveness of interventions. By analyzing sample data, one can infer the overall impact of a disease and the effectiveness of measures like vaccinations.
  4. Quality Control: Manufacturers use statistical inference to monitor product quality. By sampling a few items from a production batch, they can infer the quality of the entire batch and make decisions about whether to continue production or make adjustments.
  5. Election Polling: Pollsters use samples of voter opinions to infer the likely outcome of an election. Statistical inference helps estimate the proportion of the population that supports each candidate and the margin of error in these estimates.
  6. Education: Educators and policymakers use statistical inference to evaluate the effectiveness of teaching methods and educational programs. By analyzing test scores and other performance metrics from a sample of students, they can infer the impact of these methods on the broader student population.
  7. Environmental Studies: Researchers use statistical inference to assess environmental impacts. For example, by sampling air or water quality in specific locations, they can infer the overall environmental conditions and the effectiveness of pollution control measures.
  8. Sports Analytics: Teams and coaches use statistical inference to evaluate player performance and strategy effectiveness. By analyzing data from a sample of games, they can infer the overall performance trends and make decisions about training and game strategy.
  9. Finance: Investors and financial analysts use statistical inference to make decisions about investments. By analyzing sampled historical data of stocks or other financial instruments, one can infer future performance and make informed investment decisions.
  10. Customer Satisfaction: Businesses use statistical inference to gauge customer satisfaction and loyalty. By surveying a sample of customers, one can infer the overall satisfaction levels and identify areas for improvement.

FAQs about Statistical Inference

  1. Define the term estimation.
  2. Define the term estimate.
  3. Define the term estimator.
  4. Write a short note on statistical inference.
  5. What is statistical hypothesis testing?
  6. What is the estimation in statistics?
  7. What are the types of estimations?
  8. Write about point estimation and interval estimation.

https://rfaqs.com, https://gmstat.com

Multiple Regression Analysis

Introduction to Multiple Regression Analysis

Francis Galton (a biometrician) examined the relationship between the heights of fathers and sons. He also analyzed the similarities between the parent and offspring generations of 700 sweet peas. Galton found that the offspring of tall parents tended to be shorter than their parents, while the offspring of short parents tended to be taller than theirs. The height of the children ($Y$) depends upon the height of the parents ($X$). When there is more than one independent variable (IV), we need multiple regression analysis (MRA), also called multiple linear regression (MLR).

Multiple Linear Regression Model

The linear regression model (equation) for two independent variables (regressors) is

$$Y_i = \alpha + \beta_1 X_{1i} + \beta_2 X_{2i} + \varepsilon_i$$

The general linear regression model (equation) for $k$ independent variables is

$$Y_i = \alpha + \beta_1 X_{1i} + \beta_2 X_{2i} + \beta_3 X_{3i} + \cdots + \beta_k X_{ki} + \varepsilon_i$$

The $\beta$s are all regression coefficients (partial slopes) and the $\alpha$ is the intercept.

The sample linear regression model is

$$y_i = \hat{\alpha} + \hat{\beta}_1 x_{1i} + \hat{\beta}_2 x_{2i} + e_i$$

where $e_i$ is the $i$th residual, the sample counterpart of $\varepsilon_i$.

Multiple Regression Coefficients Formula

To fit the MLR equation for two regressors, one needs to compute the values of $\hat{\beta}_1$, $\hat{\beta}_2$, and $\hat{\alpha}$. The formula for the first partial regression coefficient is

$$\hat{\beta}_1 = \frac{(S_{X_1 Y})(S_{X_2^2}) - (S_{X_2 Y})(S_{X_1 X_2})}{(S_{X_1^2})(S_{X_2^2}) - (S_{X_1 X_2})^2}$$

The first term of the numerator is the ("sum of the products of the 1st independent variable and the dependent variable") multiplied by the ("sum of the squares of the 2nd independent variable").

The second term of the numerator is the ("sum of the products of the 2nd independent variable and the dependent variable") multiplied by the ("sum of the products of the two independent variables").

The first term of the denominator is the ("sum of the squares of the 1st independent variable") multiplied by the ("sum of the squares of the 2nd independent variable").

The second term of the denominator is the ("square of the sum of the products of the two independent variables").

The formula for the 2nd regression coefficient is

$$\hat{\beta}_2 = \frac{(S_{X_2 Y})(S_{X_1^2}) - (S_{X_1 Y})(S_{X_1 X_2})}{(S_{X_1^2})(S_{X_2^2}) - (S_{X_1 X_2})^2}$$

In short, note that $S$ stands for the corrected sums of squares and sums of products (computed from deviations about the means), as illustrated in the example below.
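These formulas translate directly into code. Below is a minimal Python sketch (the function name and argument names are illustrative, not from the original) that computes the corrected sums of squares and products and then the two partial coefficients and the intercept.

```python
def fit_two_regressor_mlr(y, x1, x2):
    """Fit y = alpha + b1*x1 + b2*x2 using corrected sums of squares and products."""
    n = len(y)

    # Corrected (mean-centred) sum of products: S_uv = sum(u*v) - sum(u)*sum(v)/n
    def S(u, v):
        return sum(a * b for a, b in zip(u, v)) - sum(u) * sum(v) / n

    s_x1y, s_x2y = S(x1, y), S(x2, y)
    s_x1x1, s_x2x2, s_x1x2 = S(x1, x1), S(x2, x2), S(x1, x2)

    denom = s_x1x1 * s_x2x2 - s_x1x2 ** 2
    b1 = (s_x1y * s_x2x2 - s_x2y * s_x1x2) / denom
    b2 = (s_x2y * s_x1x1 - s_x1y * s_x1x2) / denom
    alpha = sum(y) / n - b1 * sum(x1) / n - b2 * sum(x2) / n
    return alpha, b1, b2
```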

Multiple Linear Regression Example

Consider the following data about two regressors ($X_1, X_2$) and one regressand variable ($Y$).

| $Y$ | $X_1$ | $X_2$ | $X_1 Y$ | $X_2 Y$ | $X_1 X_2$ | $X_1^2$ | $X_2^2$ |
|-----|-------|-------|---------|---------|-----------|---------|---------|
| 30  | 10    | 15    | 300     | 450     | 150       | 100     | 225     |
| 22  | 5     | 8     | 110     | 176     | 40        | 25      | 64      |
| 16  | 10    | 12    | 160     | 192     | 120       | 100     | 144     |
| 7   | 3     | 7     | 21      | 49      | 21        | 9       | 49      |
| 14  | 2     | 10    | 28      | 140     | 20        | 4       | 100     |
| 89  | 30    | 52    | 619     | 1007    | 351       | 238     | 582     |

\begin{align*}
S_{X_1 Y} &= \sum X_1 Y - \frac{\sum X_1 \sum Y}{n} = 619 - \frac{30\times 89}{5} = 85\\
S_{X_1 X_2} &= \sum X_1 X_2 - \frac{\sum X_1 \sum X_2}{n} = 351 - \frac{30 \times 52}{5} = 39\\
S_{X_1^2} &= \sum X_1^2 - \frac{(\sum X_1)^2}{n} = 238 - \frac{30^2}{5} = 58\\
S_{X_2^2} &= \sum X_2^2 - \frac{(\sum X_2)^2}{n} = 582 - \frac{52^2}{5} = 41.2\\
S_{X_2 Y} &= \sum X_2 Y - \frac{\sum X_2 \sum Y}{n} = 1007 - \frac{52 \times 89}{5} = 81.4
\end{align*}

\begin{align*}
\hat{\beta}_1 &= \frac{(S_{X_1 Y})(S_{X_2^2}) - (S_{X_2 Y})(S_{X_1 X_2}) }{(S_{X_1^2})(S_{X_2^2}) - (S_{X_1 X_2})^2} = \frac{(85)(41.2) - (81.4)(39)}{(58)(41.2) - (39)^2} = \frac{327.4}{868.6} = 0.377\\
\hat{\beta}_2 &= \frac{(S_{X_2 Y})(S_{X_1^2}) - (S_{X_1 Y})(S_{X_1 X_2}) }{(S_{X_1^2})(S_{X_2^2}) - (S_{X_1 X_2})^2} = \frac{(81.4)(58) - (85)(39)}{(58)(41.2) - (39)^2} = \frac{1406.2}{868.6} = 1.619\\
\hat{\alpha} &= \overline{Y} - \hat{\beta}_1 \overline{X}_1 - \hat{\beta}_2 \overline{X}_2 = 17.8 - (0.377)(6) - (1.619)(10.4) = -1.30
\end{align*}

The fitted sample regression equation is therefore $\hat{y} = -1.30 + 0.377 X_1 + 1.619 X_2$.
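As a cross-check, the same estimates can be obtained with an ordinary least-squares solver. The short NumPy sketch below fits the example data and should reproduce approximately $\hat{\alpha} \approx -1.30$, $\hat{\beta}_1 \approx 0.377$, and $\hat{\beta}_2 \approx 1.619$ (small differences are only due to rounding in the hand calculation).

```python
import numpy as np

# Data from the worked example above
y  = np.array([30, 22, 16, 7, 14], dtype=float)
x1 = np.array([10,  5, 10, 3,  2], dtype=float)
x2 = np.array([15,  8, 12, 7, 10], dtype=float)

# Design matrix with a column of ones for the intercept
X = np.column_stack([np.ones_like(y), x1, x2])

# Ordinary least-squares fit
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
alpha_hat, beta1_hat, beta2_hat = coef
print(f"alpha = {alpha_hat:.3f}, beta1 = {beta1_hat:.3f}, beta2 = {beta2_hat:.3f}")
```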

Important Key Points of Multiple Regression

  • Independent variables (predictors, regressors): These are the variables that one believes to influence the dependent variable. One can have two or more independent variables in a multiple-regression model.
  • Dependent variable (outcome, response): This is the variable one is trying to predict or explain using the independent variables.
  • Linear relationship: The core assumption is that the relationship between the independent variables and the dependent variable is linear. This means the dependent variable changes at a constant rate for a unit change in an independent variable, holding all other variables constant.

The main goal of multiple regression analysis is to find a linear equation that best fits the data. Multiple regression analysis also allows one to:

  • Predict the value of the dependent variable based on the values of the independent variables.
  • Understand how changes in the independent variables affect the dependent variable while considering the influence of other independent variables.

Interpreting the Multiple Regression Coefficient

Each estimated coefficient is a partial slope: it gives the expected change in the dependent variable for a one-unit increase in that independent variable, holding the other independent variables constant. In the example above, $\hat{\beta}_1 \approx 0.377$ means that $Y$ is expected to increase by about 0.377 units for a one-unit increase in $X_1$ with $X_2$ held fixed; $\hat{\beta}_2 \approx 1.619$ is interpreted in the same way for $X_2$ with $X_1$ held fixed.

https://rfaqs.com

https://gmstat.com

Geometric Mean Formula

Introduction to the Geometric Mean

The geometric mean (GM) is a way of calculating an average, but instead of adding values like the regular (arithmetic) mean, it multiplies them and then takes a root. The geometric mean (a useful measure of central tendency) is defined as the $n$th root of the product of $n$ positive values.

If we have two observations, let’s say 9 and 4, then the geometric mean is the square root of the product of these values, which is 6 ($\sqrt{9\times 4}=6$). If there are three values, say 1, 3, and 9, then the geometric average will be $\sqrt[3]{1\times 3 \times 9} = 3$. In a similar pattern, for $n$ observations ($x_1, x_2, \cdots, x_n$), the Geometric Average Formula will be

$$GM = (x_1 \times x_2 \times x_3 \times \cdots \times x_n)^{\frac{1}{n} }$$


Geometric Mean Example

Suppose we have the following set of values: $x=32, 36, 36, 37, 39, 41, 45, 46, 48$. The Computation of the Geometric Mean will be

\begin{align*}
GM &= (32\times 36 \times 36 \times 37 \times 39 \times 41 \times 45 \times 46 \times 48)^{\frac{1}{9}}\\
&=(243790484520960)^{\frac{1}{9}} = 39.7
\end{align*}
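In Python, the same value can be obtained either from the definition (the $n$th root of the product) or with the built-in `statistics.geometric_mean()` (Python 3.8+); both give about 39.7 for these data.

```python
import math
import statistics

values = [32, 36, 36, 37, 39, 41, 45, 46, 48]

# nth root of the product of the n values
gm_from_definition = math.prod(values) ** (1 / len(values))

# Built-in helper (Python 3.8+) computing the same quantity
gm_builtin = statistics.geometric_mean(values)

print(round(gm_from_definition, 1), round(gm_builtin, 1))  # both about 39.7
```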

For a large number of observations, one can compute the GM by taking the log of all observations using the following formula:

$$GM = antilog \left[\frac{\sum\limits_{i=1}^n log\, x}{n} \right]$$

| $x$ | $log\, x$ |
|-----|-----------|
| 32  | log 32 = 1.5051 |
| 36  | log 36 = 1.5563 |
| 36  | log 36 = 1.5563 |
| 37  | log 37 = 1.5682 |
| 39  | log 39 = 1.5911 |
| 41  | log 41 = 1.6128 |
| 45  | log 45 = 1.6532 |
| 46  | log 46 = 1.6628 |
| 48  | log 48 = 1.6812 |
| Total | 14.3870 |

\begin{align*}
GM &= antilog \left[ \frac{\sum\limits_{i=1}^n log\, x}{n} \right]\\
&= antilog \left[\frac{14.3870}{9}\right] = antilog [1.5986]\\
&= 39.7
\end{align*}
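The log-and-antilog route is easy to mirror in code. The short sketch below uses base-10 logarithms (any base works, as long as the antilog uses the same base):

```python
import math

values = [32, 36, 36, 37, 39, 41, 45, 46, 48]

# Mean of the base-10 logs ...
mean_log = sum(math.log10(x) for x in values) / len(values)
# ... and its antilog, i.e. 10 raised to that mean
gm = 10 ** mean_log

print(round(mean_log, 4), round(gm, 1))  # about 1.5986 and 39.7
```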

One important point that should be remembered is that if any value in the data set is zero or negative, then the GM cannot be computed.

Geometric Mean for Grouped Data

The GM for grouped data can also be computed using the following formula:

$$GM = antilog \left[ \frac{\Sigma f\times log\, x}{\Sigma f} \right]$$

Suppose we have the following frequency distribution:

| Classes | Frequency |
|---------|-----------|
| 65 to 84   | 9  |
| 85 to 104  | 10 |
| 105 to 124 | 17 |
| 125 to 144 | 10 |
| 145 to 164 | 5  |
| 165 to 184 | 4  |
| 185 to 204 | 5  |
| Total      | 60 |

The GM of the above frequency distribution can be computed as follows:

| Classes | $f$ | $X$ (midpoint) | $log\, X$ | $f \times log\, X$ |
|---------|-----|----------------|-----------|--------------------|
| 65-84   | 9   | 74.5  | log 74.5 = 1.8722  | 16.8494 |
| 85-104  | 10  | 94.5  | log 94.5 = 1.9754  | 19.7543 |
| 105-124 | 17  | 114.5 | log 114.5 = 2.0588 | 34.9997 |
| 125-144 | 10  | 134.5 | log 134.5 = 2.1287 | 21.2872 |
| 145-164 | 5   | 154.5 | log 154.5 = 2.1889 | 10.9446 |
| 165-184 | 4   | 174.5 | log 174.5 = 2.2418 | 8.9672  |
| 185-204 | 5   | 194.5 | log 194.5 = 2.2889 | 11.4446 |
| Total   | 60  |       |                    | 124.2471 |

\begin{align*}
GM &= antilog \left[ \frac{124.2471}{60} \right]\\
&= antilog (2.0708) = 117.7
\end{align*}
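A minimal Python sketch of the grouped-data formula, using the class midpoints and frequencies from the table above (the variable names are illustrative):

```python
import math

# Class midpoints (X) and frequencies (f) from the table above
midpoints   = [74.5, 94.5, 114.5, 134.5, 154.5, 174.5, 194.5]
frequencies = [9, 10, 17, 10, 5, 4, 5]

# Weighted mean of log10(X), then the antilog
log_sum = sum(f * math.log10(x) for f, x in zip(frequencies, midpoints))
gm = 10 ** (log_sum / sum(frequencies))

print(round(gm, 1))  # about 117.7
```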

The GM is particularly useful when dealing with rates of change or ratios, such as growth rates in investments. That is because the geometric mean considers how things are multiplied over time rather than simply added.

Use and Application of the Geometric Mean

The GM is useful in situations like:

  • Investment returns: When one looks at average investment growth, one wants to consider how much one’s money is multiplied over time, not just the change each year. That is why the GM is better suited for this scenario.
  • Rates of change: Similar to investment returns, if something is increasing or decreasing by a percentage each time, the GM is a more accurate measure of the overall change.
  • Growth Rates: When dealing with percentages or ratios that change over time (like investment returns or population growth), the geometric mean provides a more accurate picture of the overall change compared to the arithmetic mean.
  • Proportional Changes: This is helpful for situations where changes are multiplied, not added. For example, if a quantity is scaled by several different factors one after another, the geometric mean of those factors gives the single equivalent scaling factor.

Real-Life Examples

  1. Finance (Average Investment Returns): To calculate the average rate of return on investments over time, one can use the Geometric Mean. Because the returns compound, one cannot simply use the arithmetic mean. For example, if the Year 1 return = +10%, the Year 2 return = -20%, and the Year 3 return = +30%, the GM return will give the true average annual return over the 3 years (see the sketch after this list).
  2. Economics (Growth Rates): The GM should be used to compute average GDP growth, inflation, or population growth over multiple years. It is because growth over time is multiplicative. For example, GDP grows at 3%, 4%, and 5% over 3 years. The geometric mean provides the average annual growth rate.
  3. Business (Average Rate of Change in Prices or Sales): To find the average percentage change in prices or sales across several periods, the GM can be used. For example, A product price increased by 10%, then decreased by 5%, then increased by 8%. The GM will give the true average percentage change.
  4. Environmental Science (Air or Water Quality Data): The GM should be used to calculate the average concentration of pollutants, as environmental data often contains highly skewed values. For example, Pollution levels: 2, 4, 8, 50 → The arithmetic mean is skewed by 50, therefore, the Geometric Mean will give a better central tendency for such data.
  5. Demographics (Fertility or Mortality Rates): In demographic research, the Geometric Mean should be used to average birth or death rates across different countries or regions, because these rates are often ratios and vary widely between groups.
  6. Health & Medicine (Drug Dosage and Bacterial Growth): The GM is used to measure average bacterial growth rates, enzyme activities, or dosage effectiveness, because these processes grow or decline exponentially.
  7. Marketing (Average Performance Metrics): The Geometric Mean should be used to calculate the average conversion rate or engagement rate over multiple platforms or campaigns, because these metrics are often multiplicative percentages, and the geometric mean gives a more accurate reflection.
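For item 1, the sketch below (with the example returns of +10%, -20%, and +30%) shows the usual convention of averaging the growth factors $1 + r$ rather than the raw percentages; the resulting average annual return is about 4.6%.

```python
import math

# Yearly returns from the finance example above: +10%, -20%, +30%
returns = [0.10, -0.20, 0.30]

# Average the growth factors (1 + r), not the raw percentages
growth_factors = [1 + r for r in returns]
gm_factor = math.prod(growth_factors) ** (1 / len(growth_factors))
average_annual_return = gm_factor - 1

print(f"{average_annual_return:.2%}")  # about 4.59%
```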


Summary

The geometric mean is incredibly useful in real-life situations where values are multiplied together, grow exponentially, or vary in ratios or percentages — rather than being added. The GM is very useful, especially in business, finance, science, and data analysis.

FAQs about Geometric Mean

  • What is meant by the Geometric Mean?
  • In what situation should the GM be used?
  • For what observations can the GM not be computed?
  • Write down the formula of GM for grouped and ungrouped data.
  • Give some real-life examples that make use of the Geometric Mean.

https://rfaqs.com

https://gmstat.com