Understanding P-value in Statistics

Understanding p-values is important because they are among the most widely used, and most misunderstood, concepts in statistics. Whether you are a novice, a data analyst, or an experienced data scientist, p-values are central to hypothesis testing, A/B testing, and scientific research. This post covers what a p-value is, how to interpret it correctly, its limitations, a worked numerical example, how to compute it in Python and R, and best practices.

What is a p-value?

A p-value (probability value) measures the strength of evidence against a null hypothesis in a statistical test. The formal definition is:

The probability of observing a test statistic as extreme as, or more extreme than, the one calculated, assuming the null hypothesis is true.

Key Interpretation: A low p-value (typically ≤ 0.05) suggests the observed data is unlikely under the null hypothesis, leading to its rejection. For example, suppose you run an A/B test:

Null Hypothesis ($H_o$): No difference between versions A and B.

Observed p-value = 0.03 → If $H_o$ were true, there would be a 3% chance of seeing a result at least this extreme.

Conclusion: Reject $H_o$ at the 5% significance level.

The P-value of a test statistic is the probability of drawing a random sample whose standardized test statistic is at least as contrary to the claim of the Null Hypothesis as that observed in the sample group.
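The definition can be made concrete with a small simulation. The sketch below is illustrative only: the coin-flip scenario (60 heads in 100 flips, with $H_o$ stating that the coin is fair) and the function name `simulated_p_value` are assumptions, not part of the example above. It estimates the p-value as the share of results, simulated under $H_o$, that are at least as far from the expected count as the observed one.

```python
import random

def simulated_p_value(observed_heads, n_flips, n_sims=20_000, seed=0):
    """Estimate a two-sided p-value for H_o: the coin is fair."""
    rng = random.Random(seed)
    expected = n_flips / 2
    observed_distance = abs(observed_heads - expected)
    at_least_as_extreme = 0
    for _ in range(n_sims):
        # Generate one data set under the null hypothesis.
        heads = sum(rng.random() < 0.5 for _ in range(n_flips))
        # Count it if it is as extreme as, or more extreme than, the observation.
        if abs(heads - expected) >= observed_distance:
            at_least_as_extreme += 1
    return at_least_as_extreme / n_sims

# Hypothetical observation: 60 heads in 100 flips.
p_value = simulated_p_value(observed_heads=60, n_flips=100)
print(f"Simulated p-value: {p_value:.3f}")
```

With these assumed numbers, the estimate should land close to the exact two-sided binomial value of roughly 0.057.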

How to Interpret P-Values Correctly?

To interpret p-values correctly, we compare them with a significance threshold. For example,

  • $p \le 0.05$: Often considered “statistically significant” (but context matters!).
  • $p > 0.05$: Insufficient evidence to reject $H_o$ (but not proof that $H_o$ is true).

The following are some common Misinterpretations:

  • A p-value is the probability that the null hypothesis is true. → No! It is the probability of the data given $H_o$, not the other way around.
  • A smaller p-value means a stronger effect. → No! It only indicates stronger evidence against $H_o$, not the effect size.
  • $p > 0.05$ means ‘no effect.’ → No! It means no statistically significant evidence, not proof of absence.

Limitations and Criticisms of P-Values

The following are some limitations and criticisms of P-values:

  • P-hacking: Cherry-picking data to get $p\le 0.05$ inflates false positives.
  • Dependence on Sample Size: Large samples can produce tiny p-values for trivial effects.
  • Alternatives: Consider confidence intervals, Bayesian methods, or effect sizes.
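The sample-size point in the list above can be checked with a short calculation. The sketch assumes a hypothetical mean difference of 0.5 units, a standard deviation of 15, and a two-sided z-test; every number here is made up for illustration. The effect stays the same in each run, and only the sample size changes.

```python
from math import sqrt
from statistics import NormalDist

def two_sided_z_p_value(mean_diff, sd, n):
    """Two-sided p-value for a z-test of a mean difference against zero."""
    z = mean_diff / (sd / sqrt(n))
    return 2 * NormalDist().cdf(-abs(z))

# Hypothetical, deliberately small effect relative to the spread of the data.
effect, sd = 0.5, 15.0
sample_sizes = (100, 2_500, 10_000)

p_values = [two_sided_z_p_value(effect, sd, n) for n in sample_sizes]
for n, p in zip(sample_sizes, p_values):
    print(f"n = {n:>6}: p-value = {p:.4f}")
```

The p-value shrinks as the sample grows even though the effect is unchanged, which is why an effect size should be reported alongside it.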

Cherry-Picking Data: selectively choosing data points that support a desired outcome or hypothesis while ignoring data that contradicts it. For example, showing an upward sales trend over the first few months of a year, while omitting the data that showed sales declined for the rest of the year.
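A rough simulation shows how selective reporting inflates false positives. This sketch assumes a hypothetical analyst who runs 20 independent z-tests on pure noise (so $H_o$ is true every time) and reports only the smallest p-value; the sample size of 30 and the helper name `null_p_value` are illustrative choices.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean

def null_p_value(rng, n=30):
    """Two-sided z-test p-value for one sample drawn with H_o true (mean 0, sd 1)."""
    sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
    z = fmean(sample) / (1.0 / sqrt(n))
    return 2 * NormalDist().cdf(-abs(z))

rng = random.Random(7)
n_studies, tests_per_study = 2_000, 20

# A "study" counts as a false positive if its best (smallest) p-value is <= 0.05.
studies_with_hit = sum(
    min(null_p_value(rng) for _ in range(tests_per_study)) <= 0.05
    for _ in range(n_studies)
)
false_positive_rate = studies_with_hit / n_studies
print(f"Share of studies with at least one p <= 0.05: {false_positive_rate:.2f}")
```

For independent tests the share should be close to $1 - 0.95^{20} \approx 0.64$, far above the nominal 5%.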


Computing P-value: A Numerical Example

A university claims that the average SAT score for its incoming students is 1080. A sample of 56 freshmen at the university is drawn, and the average SAT score is found to be $\overline{x} = 1044$ with a sample standard deviation of $s=94.7$ points. Find the p-value.

Suppose our hypothesis in this case is

$H_o: \mu = 1080$

$H_1: \mu \ne 1080$

The standardized test statistic is:

\begin{align*}
Z &= \frac{\overline{x} - \mu_o }{\frac{s}{\sqrt{n}}} \\
&= \frac{1044-1080}{\frac{94.7}{\sqrt{56}}} = -2.85
\end{align*}

Because the alternative hypothesis is two-sided, the test is two-tailed; therefore, the p-value is given by

\begin{align*}
P(z \le -2.85 \text{ or } z \ge 2.85) &= 2 \times P(z\le -2.85)\\
&=2\times 0.0022 = 0.0044
\end{align*}
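The calculation above can be reproduced in a few lines. This is a minimal sketch that uses only the Python standard library and, like the worked example, treats the standardized statistic as standard normal; the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

# Values taken from the SAT example.
x_bar, mu_0, s, n = 1044, 1080, 94.7, 56

# Standardized test statistic.
z = (x_bar - mu_0) / (s / sqrt(n))

# Two-tailed p-value: probability in both tails beyond |z|.
p_value = 2 * NormalDist().cdf(-abs(z))

print(f"z = {z:.2f}, p-value = {p_value:.4f}")
```

The unrounded statistic is about -2.845, so the last digit may differ slightly from a hand calculation that rounds before looking up a table value.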

Deciding to Reject the Null Hypothesis

A very small p-value leads us to reject the null hypothesis, while a large p-value does not. The p-value of a test is the probability, computed under the assumption that $H_o$ is true, of randomly drawing a sample at least as contrary to $H_o$ as the observed sample, so a small p-value means the observed data would be unusual if $H_o$ held. Note that the p-value is not the probability that we are wrong when we reject $H_o$, and it is not the probability of making a Type I Error.

Recall that the maximum acceptable probability of making a Type-I Error is the significance level ($\alpha$), and it is usually determined at the outset of the hypothesis test. The rule that is used to decide whether to reject $H_o$ is:

  • Reject $H_o$ if $p \le \alpha$
  • Do not reject $H_o$ if $p > \alpha$
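The decision rule can be written as a one-line function, and a simulation illustrates why $\alpha$ is described as the Type I Error probability. The sketch below assumes a two-sided z-test with known standard deviation and draws every sample with $H_o$ true; the sample size, seed, and function names are hypothetical.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean

def reject_null(p_value, alpha=0.05):
    """Decision rule: reject H_o when the p-value is at most alpha."""
    return p_value <= alpha

def z_test_p_value(sample, mu_0, sigma):
    """Two-sided p-value for H_o: mu = mu_0, with known sigma."""
    z = (fmean(sample) - mu_0) / (sigma / sqrt(len(sample)))
    return 2 * NormalDist().cdf(-abs(z))

rng = random.Random(42)
n_sims, n = 5_000, 30

# Every sample is drawn with H_o true, so each rejection is a Type I Error.
rejections = sum(
    reject_null(z_test_p_value([rng.gauss(0.0, 1.0) for _ in range(n)], 0.0, 1.0))
    for _ in range(n_sims)
)
rejection_rate = rejections / n_sims
print(f"Rejection rate when H_o is true: {rejection_rate:.3f}")
```

The rejection rate should sit near the chosen $\alpha = 0.05$, whatever the individual p-values happen to be.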

Practical Example: Calculating P-Values in Python & R

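Python:

from scipy import stats

# group_A and group_B are assumed to be sequences of numeric observations.
# Two-sample t-test (SciPy assumes equal variances by default; pass equal_var=False for Welch's test)
t_stat, p_value = stats.ttest_ind(group_A, group_B)

print(f"P-value: {p_value:.4f}")

R:

# Two-sample t-test (R uses Welch's test by default; pass var.equal = TRUE for the pooled version)
result <- t.test(group_A, group_B)

print(paste("P-value:", result$p.value))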

Best Practices for Using P-Values

  • Pre-specify significance levels (e.g., $\alpha=0.05$) before testing.
  • Report effect sizes and confidence intervals alongside p-values.
  • Avoid dichotomizing results (“significant” vs “not significant”).
  • Consider Bayesian alternatives when appropriate.
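As one way to follow the second practice, the sketch below reports an effect size and a confidence interval next to the p-value. It runs on simulated data (two hypothetical groups of 200 observations each) and uses a large-sample normal approximation for the interval and the p-value; with small samples, a t-based interval would be more appropriate.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean, stdev

# Hypothetical data: two groups whose true means differ by 0.5.
rng = random.Random(1)
group_a = [rng.gauss(10.0, 2.0) for _ in range(200)]
group_b = [rng.gauss(10.5, 2.0) for _ in range(200)]

n_a, n_b = len(group_a), len(group_b)
s_a, s_b = stdev(group_a), stdev(group_b)
diff = fmean(group_b) - fmean(group_a)

# Effect size: Cohen's d using the pooled standard deviation.
pooled_sd = sqrt(((n_a - 1) * s_a**2 + (n_b - 1) * s_b**2) / (n_a + n_b - 2))
cohens_d = diff / pooled_sd

# Large-sample 95% confidence interval and two-sided p-value for the difference.
se = sqrt(s_a**2 / n_a + s_b**2 / n_b)
z_crit = NormalDist().inv_cdf(0.975)
ci_low, ci_high = diff - z_crit * se, diff + z_crit * se
p_value = 2 * NormalDist().cdf(-abs(diff / se))

print(f"Difference in means: {diff:.2f} (95% CI {ci_low:.2f} to {ci_high:.2f})")
print(f"Cohen's d: {cohens_d:.2f}, p-value: {p_value:.4f}")
```

Reporting all three numbers lets a reader judge both whether an effect is distinguishable from zero and whether it is large enough to matter.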

Conclusion

P-values are powerful but often misused. By understanding their definition, interpretation, and limitations, you can make better data-driven decisions.


Hypothesis Testing MCQs Test 12

This post presents a Hypothesis Testing MCQs Test with Answers. The quiz contains 20 questions about hypothesis testing and p-values. It covers the formulation of the null and alternative hypotheses, the level of significance, test statistics, the region of rejection, the decision, effect size, and the acceptance and rejection of hypotheses. Let us start the Hypothesis Testing MCQs Test now.

Online Hypothesis Testing MCQs Test with Answers

1. A study compared five different methods for teaching descriptive statistics. The five methods were (i) traditional lecture and discussion, (ii) programmed textbook instruction, (iii) programmed text with lectures, (iv) computer instruction, and (v) computer instruction with lectures. 45 students were randomly assigned, 9 to each method. After completing the course, students took a 1-hour exam. We are interested in finding out if the average test scores are different for the different teaching methods.

If the original significance level for the ANOVA was 0.05, what should be the adjusted significance level for the pairwise tests to compare all pairs of means to each other?

 
 
 
 

2. One-sided alternative hypotheses are phrased in terms of:

 
 
 
 

3. We want to estimate the average coffee intake of Coursera students, measured in cups of coffee. A survey of 1,000 students yields an average of 0.55 cups per day, with a standard deviation of 1 cup per day. Which of the following is not necessarily true?

 
 
 
 

4. If a p-value for a hypothesis test of the mean was 0.0330 and the level of significance was 5%, what conclusion would you draw?

 
 
 
 

5. You set up a two-sided hypothesis test for a population mean with a null hypothesis of $H_0:\mu=100$. You chose a significance level $\alpha=0.05$. The p-value calculated from the data is 0.12, and hence you failed to reject the null hypothesis. Suppose that after your analysis was completed and published, an expert informed you that the true value of  $\mu$ is 104. How would you describe the result of your analysis?

 
 
 

6. Which hypothesis is tested for possible rejection under the assumption that it is true?

 
 
 
 

7. Which of the following are tests about population proportions and frequencies?

 
 
 
 

8. A man accused of committing a crime is taking a polygraph (lie detector) test. The polygraph is essentially testing the hypotheses
$H_0$: The man is telling the truth vs. $H_a$: The man is not telling the truth.
Suppose we use a 5% level of significance. Based on the man’s responses to the questions asked, the polygraph determines a P-value of 0.08. We conclude that:

 
 
 
 

9. Which of the following is false?

 
 
 
 

10. The value $(1 - \alpha)$ is called ————–.

 
 
 
 

11. If you were running a two-tail t-test with a sample size of $n=24$, what would the critical t-value be if $\alpha$ was chosen as 5%?

 
 
 
 

12. Which of the following would best be analyzed using a chi-square test of independence?

 
 
 
 

13. Scientists claim that a diet will increase the mean weight of eggs by at least 0.3 ounces. A sample of 25 eggs has a mean increase of 0.4 ounces with a standard deviation of 0.20 ounces. What will be the null hypothesis for testing this claim about the diet?

 
 
 
 

14. The power of a statistical test is the probability of rejecting the null hypothesis when it is —————–. When you increase alpha, the power of the test will —————.

 
 
 
 

15. A statement or assumption made about the value of a population parameter is

 
 
 
 

16. The feed of a certain type of hormone increases the mean weight of chicks by 0.3 ounces. A sample of 25 eggs has a mean increase of 0.4 ounces with a standard deviation of 0.20 ounces. What is the value of the t-statistic?

 
 
 
 

17. Which of the following is false regarding paired data?

 
 
 
 

18. For given values of the sample mean and the sample standard deviation when $n = 25$, you conduct a hypothesis test and obtain a p-value of 0.0667, which leads to non-rejection of the null hypothesis. What will happen to the p-value if the sample size increases (and all else stays the same)?

 
 
 
 

19. A Type 2 error occurs when the null hypothesis is

 
 
 
 

20. Which of the following is false?

 
 
 
 



Testing of Hypothesis Quiz 11

This quiz is a Testing of Hypothesis Quiz with Answers. It contains 20 questions about hypothesis testing and p-values. It covers the formulation of the null and alternative hypotheses, the level of significance, test statistics, the region of rejection, the decision, effect size, p-values, confidence intervals, and the acceptance and rejection of hypotheses. Let us start the MCQs Testing of Hypothesis Quiz now.


Testing of Hypothesis Quiz with Answers

  • The main goal of a direct replication is to ————-; replications are important according to Popper because —————.  
  • What is an important reason to make sure the data and analysis scripts related to your research are well-organized?
  • In Frequentist statistics, a p-value lower than the alpha level can mean —————. This differs from Bayesian statistics, which focuses on ——————.
  • You performed 6 studies, only 4 of them had a significant result. The likelihood ratio of this happening assuming $H_0$ versus assuming $H_1$ tells you ————-. If you assume you had around 80% power, this likelihood ratio will probably show that ————-.
  • We compare model A (the effect is 0) to model B (the effect is 1) and find a Bayes Factor of 10 which means ————–; the effect size is estimated with a certain 95% credible interval, this interval ———————.
  • When $H_0$ is true, the probability that at least 1 out of $X$ completely independent findings is a Type 1 error is equal to —————-; this probability ————— when you look at your data and collect more data if a test is not significant.
  • You did a pilot study that found an effect size of 0.4, and $p < 0.05$. You decide to repeat the study with a power of 80% and an alpha of 5%. In the second study, assuming $H_0$ is true, the probability of a type 1 error is ————–. Assuming $H_0$ is false, the probability of a type 2 error is —————–.
  • A researcher reports two significant findings testing the same hypothesis, using an alpha of 5%. The researcher predicted one finding before doing the study, but the other finding was observed during exploratory analyses where many tests were performed. Which statement is correct?
  • An example of a standardized effect size is ————–; these are useful for ————–.
  • If the difference between means is 2, and the standard deviation is 3, Cohen’s d is —————- which is ————— according to the rule of thumb.
  • In an ANOVA with multiple predictors, a partial eta-squared gives ————–?
  • You analyze your data in two ways. With Frequentist statistics you find a mean effect size of 3, with a 95% confidence interval of 1 to 5. With Bayesian methods, you find a mean of 2.75, with a 95% credible interval of 1.5 to 4. Which conclusions can you make?
  • What are the benefits of performing a study with a larger sample size, compared to doing the same study with a smaller sample size (all else being equal)?
  • You performed a p-curve analysis and found a skewed distribution of p-values with many more small p-values (around 0.01) than high p-values (around 0.04). What does this mean?
  • You predict that your intervention will significantly increase participants’ performance on a test, this is an example of —————-. You find a significant result and conclude your theory is true, this is an example of ——————-.
  • For confirmatory analyses it is problematic to —————; for exploratory analyses, it is NOT problematic to ——————.
  • The main goal of direct replication is —————; the main reason(s) why successful replication rates are low is ——————-.
  • How do we know there is publication bias in favor of significant results? Why is it unreasonable to expect articles with 4 experiments that aim for 80% power to exclusively show significant results?
  • The Dutch Government wants 100% of scientific articles to be Open Access in 2024. What is the main advantage of open access that led the government to aim for 100% Open Access in 2024?
  • If a test of hypothesis has a Type I error probability of 0.01, what does this mean?
