Muhammad Imdad Ullah - Statistics for Data Science & Analytics

Combining Events Using OR

Aug 5, 2025 by Muhammad Imdad Ullah

In probability theory and logic, combining events using OR (denoted as $\cup$) means considering situations where either one event occurs, or the other occurs, or both occur. This is known as the “inclusive OR.”

Given two events $A$ and $B$, one can define the event $A$ or $B$ to be the event that at least one of the events $A$ or $B$ occurs. The probability of the event $A$ or $B$ can be computed using the Addition Rule of probability.

Addition Rule of Probability (for Non-Mutually Exclusive Events)

If $A$ and $B$ are two events for an experiment, then
$$P(A\,\, or \,\,B) = P(A\cup B) = P(A) + P(B) - P(A\,\,and \,\, B)$$
This accounts for the overlap between the events to avoid double-counting.
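The addition rule can be checked on a small sample space. The following is a minimal Python sketch, assuming a single roll of a fair six-sided die; the events `A` (even number) and `B` (number greater than 3) and the helper names `prob` and `prob_union` are illustrative choices, not part of the original discussion.

```python
from fractions import Fraction

def prob_union(p_a, p_b, p_a_and_b=0):
    """Addition rule: P(A or B) = P(A) + P(B) - P(A and B)."""
    return p_a + p_b - p_a_and_b

# Hypothetical experiment: one roll of a fair six-sided die.
sample_space = {1, 2, 3, 4, 5, 6}
A = {2, 4, 6}   # the roll is even
B = {4, 5, 6}   # the roll is greater than 3

def prob(event):
    """Probability of an event when all outcomes are equally likely."""
    return Fraction(len(event), len(sample_space))

by_rule = prob_union(prob(A), prob(B), prob(A & B))  # 1/2 + 1/2 - 1/3
by_counting = prob(A | B)                            # count outcomes in A or B directly

print(by_rule, by_counting)  # 2/3 2/3
```

Both routes give the same answer because the outcomes 4 and 6, which belong to both events, are subtracted once. Leaving the third argument at its default of 0 corresponds to events that cannot occur together.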

Addition Rule of Probability (for Mutually Exclusive Events)

Two events are called mutually exclusive if they cannot occur at the same time (cannot occur together). For mutually exclusive events, $P(A\,\,\cap\,\,B)=0$, so the addition rule simplifies to:
$$P(A\,\,or\,\,B) = P(A) + P(B)$$
Because the events do not overlap, there is nothing to subtract and no risk of double-counting.

Real Life Examples of Combining Events using OR

The following are a few real-life examples of Combining Events Using OR.

Weather Forecast Example

Suppose Event $A$ represents that it will rain tomorrow and Event $B$ that it will snow tomorrow. One can compute the probability that it will rain OR snow tomorrow. This means that at least one of them happens (it could be rain, snow, or both).
Suppose that the chance of rain tomorrow = $P(A)$ = 30% = 0.3. Suppose that the probability of snow tomorrow = $P(B)$ = 20% = 0.2. Suppose the chance of both rain and snow is $P(A \cap B)$ = 5% = 0.05.
Therefore,
\begin{align*}
P(A \cup B) &= P(A) + P(B) - P(A \cap B) \\
& = 0.3 + 0.2 - 0.05 = 0.45
\end{align*}
There is a 45% chance that it will rain or snow tomorrow.

Job Requirements

Suppose Event $A$ represents that an applicant has a Bachelor’s degree, and Event $B$ represents that an applicant has at least 3 years of experience. One can compute the probability (or count) that an applicant has a Bachelor’s degree OR 3 years of experience, which is the requirement to apply. An applicant qualifies by meeting either one or both requirements.
Suppose there are 100 applicants for a certain job. Of these, 40 applicants have a Bachelor’s degree (Event $A$), 30 applicants have at least 3 years of experience (Event $B$), and 10 applicants have both. The number of qualifying applicants will be

\begin{align*}
n(A \cup B) &= n(A) + n(B) - n(A \cap B) \\
& = 40 + 30 - 10 = 60
\end{align*}
Therefore, 60 applicants meet at least one requirement (degree OR experience).
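The same count can be reproduced with Python sets. This is a minimal sketch: the applicant IDs are hypothetical and are arranged only so that the set sizes match the counts in the example (40 with a degree, 30 with experience, 10 with both).

```python
# Hypothetical applicant IDs, chosen so the counts match the example.
degree = set(range(0, 40))        # 40 applicants with a Bachelor's degree
experience = set(range(30, 60))   # 30 applicants with the required experience

both = degree & experience        # applicants meeting both requirements
either = degree | experience      # applicants meeting at least one requirement

# Inclusion-exclusion on counts: n(A or B) = n(A) + n(B) - n(A and B)
by_rule = len(degree) + len(experience) - len(both)

print(len(both), len(either), by_rule)  # 10 60 60
```

The set union counts each applicant once, which is exactly what subtracting the overlap achieves in the formula.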

Restaurant Menu Choices

Let Event $A$ represent that the meal comes with fries, and Event $B$ that the meal comes with a salad. A customer can pick one or, if allowed, both. For illustrative purposes, suppose a fast-food chain tracks 1000 orders: 400 customers choose fries (Event $A$), 300 customers choose a salad (Event $B$), and 100 customers choose both fries and salad. The number of customers who choose fries or salad (or both) will be

\begin{align*}
n(A \cup B) &= n(A) + n(B) - n(A\cap B)\\
&= 400 + 300 - 100 = 600
\end{align*}
600 customers ordered fries OR salad (or both).

Discount Offers

Let Event $A$ represent the use of a promo code for 10% off and Event $B$ the use of a Student ID for 15% off. A customer gets a discount by using a promo code or a Student ID. Suppose a store offers these two discount options to 200 customers: 65% of the customers used the promo code (Event $A$), 13% showed their Student ID (Event $B$), and 7% used both. The probability that at least one discount is used will be

\begin{align*}
P(A \cup B) &= P(A) + P(B) - P(A \cap B)\\
& = 0.65 + 0.13 - 0.07 = 0.71
\end{align*}
71% of the customers have used at least one discount.

Security System Access

Suppose a building logs 500 entry attempts. Let Event $A$ represent an entry that used a keycard and Event $B$ an entry that used a PIN code. Out of the 500 entries, 300 used a keycard, 200 used a PIN code, and 50 used both methods. What is the probability that an entry used a keycard OR a PIN code?
\begin{align*}
P(A\cup B) &= P(A) + P(B) - P(A \cap B)\\
& = \frac{300}{500} + \frac{200}{500} - \frac{50}{500} = 0.6 + 0.4 - 0.1 = 0.9
\end{align*}
Therefore, 90% of the entries ($500\times 0.9=450$) used a keycard OR a PIN.

FAQs about Combining Events

What is meant by Combining Events?
What symbol is used to combine two or more events?
What rule of probability is used to combine events?
Give some real-life examples of Combining Events using OR.
What are mutually exclusive and non-mutually exclusive events?

Properties of Measure of Central Tendency

Aug 1, 2025 by Muhammad Imdad Ullah

Understanding the Properties of Measure of Central Tendency helps in selecting the appropriate measure for accurate data interpretation. This blog post explores the key properties of measures of central tendency: mean, median, and mode, along with their advantages and limitations.

Introduction: Properties of Measure of Central Tendency

In statistics, measures of central tendency are crucial for summarizing and interpreting data. They provide a single value that represents the center or typical value of a dataset. The three most common measures of central tendency are the mean, median, and mode. Each measure has unique properties that make it suitable for different types of data and analytical purposes.

Mean (Arithmetic Average)

The mean (the most widely used measure of central tendency) is the sum of all values in a dataset divided by the number of values $\left(\frac{\sum\limits_{i=1}^n X_i}{n}\right)$.

Properties of Mean

Sensitive to All Data Points
The mean considers every value in the dataset, making it highly responsive to changes. A single extreme value (outlier) can significantly affect the mean.
Algebraic Manipulability
The mean is used in further mathematical operations (for example, calculating measures of dispersion such as the variance and standard deviation). The sum of deviations from the mean ($X_i-\overline{X}$) is always zero:
$$\sum\limits_{i=1}^n (X_i - \overline{X}) =0$$
Applicable to Interval and Ratio Data
The mean is suitable for continuous numerical data (for example, height, weight, and income). It is not appropriate for nominal or ordinal data.
Affected by Skewness
In skewed distributions, the mean is pulled toward the tail, making it less representative of central tendency.
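Two of the properties above, the zero sum of deviations and the sensitivity to outliers, can be seen numerically. The following is a minimal Python sketch using the standard `statistics` module; the data values are hypothetical.

```python
from statistics import mean

data = [12, 15, 11, 14, 13]             # hypothetical sample
x_bar = mean(data)                       # 65 / 5 = 13

# Deviations from the mean always sum to zero.
deviations = [x - x_bar for x in data]   # [-1, 2, -2, 1, 0]
print(sum(deviations))                   # 0

# A single extreme value pulls the mean toward it.
with_outlier = data + [95]
print(x_bar, round(mean(with_outlier), 2))  # 13 26.67
```

Adding one outlier roughly doubles the mean here, even though five of the six values are unchanged.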

Advantages of the Mean

Mean uses all data points, providing a comprehensive measure.
It is useful in statistical inferences and parametric tests.

Limitations of the Mean

Distorted by outliers.
Mean should not be used for highly skewed data.

Median (Middle Value)

The median is the middle value (the most central data value) in an ordered dataset/array. If the dataset has an even number of observations, the median is the average of the two central values.

Properties of Median

Resistant to Outliers
Unlike the mean, the median is not affected by extreme values (outliers), because it depends only on the middle value(s) of the ordered dataset.
Applicable to Ordinal, Interval, and Ratio Data
The median works well for ranked (ordinal) and continuous numerical data. However, it is not suitable for nominal data (categories without order).
Unaffected by Skewness
The median remains stable in skewed distributions, making it a better measure than the mean in such cases.
Not Algebraically Manipulable
Unlike the mean, the median cannot be used in further mathematical computations (for example, standard deviation).
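The resistance of the median to outliers can be illustrated with a short Python sketch using the standard `statistics` module. The income figures are hypothetical and serve only to contrast the two measures.

```python
from statistics import mean, median

incomes = [30, 32, 35, 38, 40]            # hypothetical incomes (in thousands)
incomes_outlier = [30, 32, 35, 38, 400]   # largest value replaced by an extreme one

print(median(incomes), median(incomes_outlier))  # 35 35
print(mean(incomes), mean(incomes_outlier))      # 35 107

# With an even number of observations, the median is the
# average of the two central values.
print(median([30, 32, 35, 38]))                  # 33.5
```

The median stays at 35 because the middle position of the ordered data is unchanged, while the mean moves from 35 to 107.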

Advantages of the Median

Median is robust against outliers.
Median better represents the central tendency in skewed distributions.

Limitations of the Median

Median does not consider all data points.
It is less efficient than the mean for normally distributed data.

Mode (Most Frequent Value)

The mode is the value that appears most frequently in a dataset. A dataset can have one mode (unimodal), two modes (bimodal), or multiple modes (multimodal). It is the only measure of central tendency that can have more than one value.

Properties of Mode

Applicable to All Data Types
The mode works with nominal, ordinal, interval, and ratio data. It is the only measure of central tendency suitable for categorical data (e.g., colors, brands).
Unaffected by Outliers
Since the mode depends on frequency, extreme values do not impact the mode.
Not Necessarily Unique
Some datasets have no mode (if no value repeats in the dataset), while others have multiple modes.
Not Useful for Small Datasets
In small samples, the mode may not accurately represent central tendency.
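The cases described above (one mode, several modes, or no mode) can be sketched in Python. The helper `modes` below is a hypothetical function written for illustration; it follows the convention used in this post that a dataset in which no value repeats has no mode. Note that Python's built-in `statistics.multimode` uses a different convention and returns every value in that case.

```python
from collections import Counter

def modes(values):
    """Return the most frequent value(s); an empty list if no value repeats."""
    counts = Counter(values)
    if not counts:
        return []
    top = max(counts.values())
    if top == 1:          # every value occurs once: no mode under this convention
        return []
    return [value for value, count in counts.items() if count == top]

print(modes([3, 7, 7, 9]))                               # [7]
print(modes(["red", "blue", "red", "green", "blue"]))    # ['red', 'blue']
print(modes([1, 2, 3, 4]))                               # []
```

The second call shows that the mode works directly on categorical data, where the mean and median are not defined.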

Advantages of the Mode

Mode is useful for categorical data.
Mode helps identify peaks in frequency distributions.

Limitations of the Mode

May not exist in some datasets.
Less informative for continuous numerical data with no repeated values.

Comparison of Mean, Median, and Mode

Property	Mean	Median	Mode
Sensitive to Outliers	Yes	No	No
Works with Skewed Data	No	Yes	Sometimes
Applicable to Nominal Data	No	No	Yes
Mathematical Usability	High	Low	Low
Best for Symmetric Data	Yes	Yes	Sometimes

Choosing the Right Measures of Central Tendency

The choice between mean, median, and mode depends on:

Data Type
- Use the mean for normally distributed numerical data, that is, data points are homogeneous.
- Use the median for ordinal or skewed numerical data, that is, data points are heterogeneous.
- Use mode for categorical data, or when data points repeat.
Presence of Outliers
- If outliers are present, the median is preferred.
- If data is clean and normally distributed, the mean is ideal.
Purpose of Analysis
- For statistical computations (e.g., regression), the mean is necessary.
- For descriptive summaries (e.g., income distribution), the median is better.

Summary: Properties of Measures of Central Tendency

The measures of central tendency (mean, median, and mode) each have unique properties that determine their suitability for different datasets. The mean is precise but affected by outliers, the median is robust against outliers and skewness, and the mode is versatile for categorical data. Understanding these properties ensures accurate data interpretation and informed decision-making in statistical analysis.

By selecting the appropriate measure based on data characteristics, analysts can derive meaningful insights and avoid misleading conclusions. Whether summarizing exam scores, income levels, or survey responses, the right measure of central tendency provides clarity in a world of data.

Sample Size Determination

Jul 30, 2025 by Muhammad Imdad Ullah

Sample size determination is one of the most critical steps in designing any research study or experiment. Whether the researcher is conducting clinical trials, market research, or social science studies, the selection of an appropriate sample size ensures that the results are statistically valid while optimizing resources. This guide will walk you through the key concepts and methods for sample size determination.

In planning a study, sample size determination is an important issue: the sample must be large enough to meet certain conditions. For example, for a study dealing with blood cholesterol levels, these conditions are typically expressed in terms such as “How large a sample do I need to be able to reject the null hypothesis that two population means are equal if the difference between them is $d=10$ mg/dl?”

Why Sample Size Matters

Statistical Power: Adequate sample sizes increase the ability to detect true effects
Precision: Larger samples typically yield more precise estimates
Resource Efficiency: Avoid wasting time/money on unnecessarily large samples
Ethical Considerations: Especially important in clinical research to neither under- nor over-recruit participants

Special Considerations for Estimating Sample Size

Small Populations: May require finite population corrections
Stratified Sampling: Need to calculate for each stratum
Cluster Sampling: Must account for design effect
Longitudinal Studies: Consider repeated measures and attrition

Sample Size Determination Formula

In general, there exists a formula for computing a sample size for the specific test statistic (appropriate to test a specified hypothesis). These formulae require that the user specify the $\alpha$-level and Power = ($1-\beta$) desired, as well as the difference to be detected and the variability of the measure.

Common Approaches to Sample Size Calculation

For Estimating Proportions (Prevalence Studies)

A common approach to calculating the sample size for estimating a proportion uses the formula:

$$n=\frac{Z^2 p (1-p)}{E^2}$$

where

Z = Z-value (1.96 for a 95% confidence level)
p = estimated proportion
E = margin of error

For a survey with an expected proportion of 50%, a 95% confidence level, and 5% margin of error, the sample size will be

$$n=\frac{1.96^2 \times 0.5 \times 0.5}{0.05^2} \approx 385$$

Note that it is not wise to calculate a single number for the sample size. It is better to calculate a range of values by varying the assumptions so that one can get a sense of their impact on the resulting projected sample size. From this range of sample sizes, a suitable sample may be picked for the research work.
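The advice to compute a range of sample sizes can be sketched in Python. This is a minimal illustration, assuming the proportion formula above; the function name `sample_size_proportion` and the grid of assumed proportions and margins of error are hypothetical choices. The Z-value is taken from the standard normal distribution via `statistics.NormalDist` rather than from a rounded table.

```python
from math import ceil
from statistics import NormalDist

def sample_size_proportion(p, margin, confidence=0.95):
    """n = Z^2 * p * (1 - p) / E^2, rounded up to a whole observation."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # about 1.96 for 95%
    return ceil(z ** 2 * p * (1 - p) / margin ** 2)

# Reproduces the worked example: p = 0.5, E = 0.05, 95% confidence.
print(sample_size_proportion(0.5, 0.05))  # 385

# Vary the assumptions to see their impact on the required sample size.
for p in (0.3, 0.5):
    for margin in (0.03, 0.05):
        print(p, margin, sample_size_proportion(p, margin))
```

The required sample size is largest at $p = 0.5$ and grows quickly as the margin of error shrinks, since $E$ enters the formula squared.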

Common Situations for Sample Size Determination

We consider the process of estimating sample size for three common circumstances:

One-Sample t-test and paired t-test
Two-Sample t-test
Comparison of $P_1$ vs $P_2$ with a Z-test

One-Sample t-test and Paired t-test

For testing the hypothesis:

$H_o:\mu=\mu_o\quad$ vs $\quad H_1:\mu \ne \mu_o$

For a two-tailed test, the sample size formula for the one-sample t-test is

$$n = \left[\frac{(Z_{1-\alpha/2} + Z_{1-\beta})\sigma}{d} \right]^2$$

Example: Suppose we are interested in estimating the sample size needed for a study of blood cholesterol levels. The typical standard deviation of the population is, say, 25 mg/dl. Consider $\alpha = 0.05$, $\sigma = 25$, $d = 5.0$, and power $= 0.80$, so that $Z_{1-\alpha/2}=1.96$ and $Z_{1-\beta}=0.842$.

\begin{align*}
n & = \left[ \frac{(Z_{1-\alpha/2} + Z_{1-\beta})\sigma}{d} \right]^2\\
&= \left[\frac{(1.96 + 0.842)\times 25}{5}\right]^2 = 196.28 \approx 197
\end{align*}
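The one-sample calculation can be reproduced with a short Python sketch. The function name `n_one_sample` is a hypothetical choice, and the Z-values are computed with `statistics.NormalDist` instead of being read from a table, so the unrounded result (about 196.2) differs slightly in the decimals from the hand calculation while leading to the same rounded-up sample size.

```python
from math import ceil
from statistics import NormalDist

def n_one_sample(sigma, d, alpha=0.05, power=0.80):
    """n = [(Z_{1-alpha/2} + Z_{1-beta}) * sigma / d]^2 for a two-tailed test."""
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    z_beta = std_normal.inv_cdf(power)           # about 0.842 for power = 0.80
    return ((z_alpha + z_beta) * sigma / d) ** 2

n = n_one_sample(sigma=25, d=5.0)
print(round(n, 1), ceil(n))  # 196.2 197
```

The result is always rounded up, since a fractional observation cannot be collected and rounding down would leave the study slightly underpowered.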

Two Sample t-test

How large a sample would be needed for comparing two approaches to cholesterol lowering using $\alpha=0.05$, to detect a difference of $d=20$ mg/dl or more with power = $1-\beta=0.90$? For the following hypothesis

$H_o:\mu_1 =\mu_2\quad$ vs $\quad H_1:\mu_1 \ne \mu_2$. For a two-tailed t-test, the formula is

$$N=n_1+n_2 = \frac{4\sigma^2(Z_{1-\alpha/2} + Z_{1-\beta})^2 } {d^2}, \quad d = \mu_1 - \mu_2$$

For $\sigma = 30$mg/dl, $\beta=0.10, \alpha = 0.05$, $Z_{1-\alpha/2}=1.96$, Power = $1-\beta$, $Z_{1-\beta}=1.282$, d = 20 mg/dl.

\begin{align*}
N &= n_1 + n_2 = \frac{4(30)^2 (1.96 + 1.282)^2}{20^2}\\
&= \frac{4\times 900 \times (3.242)^2}{400} = 94.6
\end{align*}

Rounding up, $N \approx 95$ in total, or about 48 in each group. In practice, a sample of about 50 in each group is used.
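A minimal Python sketch of the two-sample formula follows. The function name `n_two_sample` is hypothetical, and the Z-values come from `statistics.NormalDist`, so the total may differ from the hand calculation in the first decimal place.

```python
from math import ceil
from statistics import NormalDist

def n_two_sample(sigma, d, alpha=0.05, power=0.90):
    """Total N = n1 + n2 = 4 * sigma^2 * (Z_{1-alpha/2} + Z_{1-beta})^2 / d^2."""
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)  # about 1.96
    z_beta = std_normal.inv_cdf(power)           # about 1.282
    return 4 * sigma ** 2 * (z_alpha + z_beta) ** 2 / d ** 2

total = n_two_sample(sigma=30, d=20)
per_group = ceil(total / 2)
print(ceil(total), per_group)  # 95 48
```

The formula gives the minimum under its assumptions; choosing a somewhat larger round number per group leaves a margin for dropouts or a larger-than-expected standard deviation.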

Two Sample Proportion Test

For testing the two-sample proportions hypothesis,

$H_o:P_1=P_2 \quad$ vs $\quad H_1:P_1\ne P_2$

The formula for the two-sample proportion test is

$$N=n_1+n_2 = \frac{4(Z_{1-\alpha/2} + Z_{1-\beta})^2\left[\left(\frac{P_1+P_2}{2}\right) \left(1-\frac{P_1+P_2}{2}\right) \right] }{d^2}, \quad d=P_1-P_2$$

Consider $\beta=0.10$, $\alpha = 0.05$, $Z_{1-\alpha/2} = 1.96$, Power = $1-\beta$, $Z_{1-\beta} = 1.282$, $P_1 = 0.7$, and $P_2=0.5$, so that $d=P_1 - P_2 = 0.7-0.5 = 0.2$ and $\frac{P_1+P_2}{2}=0.6$. The sample size will be

\begin{align*}
N &= n_1+n_2 = \frac{4(1.96+1.282)^2 [0.6(1-0.6)]}{0.2^2}\\
&= \frac{4(3.242^2)[0.6\times 0.4]}{0.2^2} = 252.25
\end{align*}

Consider using $N=260$, that is, 130 in each group.
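The two-proportion calculation can be sketched in Python in the same way. The function name `n_two_proportions` is a hypothetical choice; it uses the two-tailed value $Z_{1-\alpha/2}$ (about 1.96), which is the value used in the worked example, and computes it with `statistics.NormalDist`.

```python
from math import ceil
from statistics import NormalDist

def n_two_proportions(p1, p2, alpha=0.05, power=0.90):
    """Total N = 4 * (Z_{1-alpha/2} + Z_{1-beta})^2 * p_bar * (1 - p_bar) / d^2."""
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)  # about 1.96
    z_beta = std_normal.inv_cdf(power)           # about 1.282
    p_bar = (p1 + p2) / 2                        # average of the two proportions
    d = p1 - p2                                  # difference to be detected
    return 4 * (z_alpha + z_beta) ** 2 * p_bar * (1 - p_bar) / d ** 2

total = n_two_proportions(0.7, 0.5)
print(ceil(total), ceil(total / 2))  # 253 127
```

The formula gives a minimum of about 253 in total; the rounded figure of 260 (130 per group) adds a small safety margin.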

Summary

Proper sample size determination is both an art and a science that balances statistical requirements with practical constraints. While formulas provide a starting point, thoughtful consideration of your specific research context is essential. When in doubt, consult with a statistician to ensure your study is appropriately powered to answer your research questions.

Sample Size Determination FAQs

What is meant by sample size?
What is the importance of determining the sample size?
What are the important considerations in determining the sample size?
What are the common situations for sample size determination?
What is the formula of a one-sample t-test?
What is the formula of a two-sample test?
What is the formula of a two-sample proportion test?
What is the importance of sample size determination?
