Hypothesis Testing
Hypothesis testing is the formal process for using data to make decisions under uncertainty. It is the framework behind most clinical trials, A/B tests, and scientific experiments that claim a finding. Understanding it properly is one of the most practically valuable skills in statistics.
1 The Hypothesis Testing Framework
Hypothesis testing is formalized skepticism. You start by assuming the boring, default explanation is true (the null hypothesis). Then you collect data and ask: is my data surprising enough to reject that default? If yes, you have evidence for something more interesting. If no, you do not.
The null hypothesis (H₀) is the default assumption: no effect, no difference, no relationship, nothing interesting happening. The alternative hypothesis (H₁ or Hₐ) is what you are trying to find evidence for. The important thing is that you never "prove" the alternative: you either find enough evidence to reject the null, or you fail to reject it.
Hypothesis testing works like a trial. The null hypothesis is "innocent." You need evidence beyond a reasonable doubt to reject it (convict). Failing to convict doesn't prove innocence; it just means the evidence wasn't strong enough. Similarly, failing to reject H₀ doesn't prove H₀ is true.
2 The Five Steps
These five steps always go in this order: state your hypotheses → choose α → calculate the test statistic → find the p-value → make a decision. Never skip steps, never work backwards from the data to the hypothesis, and never adjust the threshold after seeing the results.
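As a minimal sketch of the five steps in order, here is a two-tailed one-sample z-test using only the Python standard library. The function name `one_sample_z_test` and all the numbers (μ₀ = 50, σ = 10, n = 64, sample mean 52.5) are hypothetical values chosen for illustration, not taken from the lesson.

```python
from math import sqrt
from statistics import NormalDist


def one_sample_z_test(sample_mean, mu0, sigma, n, alpha=0.05):
    """Two-tailed one-sample z-test, written to follow the five steps in order."""
    # Step 1: state the hypotheses. H0: mu == mu0, H1: mu != mu0.
    # Step 2: alpha is passed in, so it is fixed before the data are examined.
    # Step 3: calculate the test statistic.
    z = (sample_mean - mu0) / (sigma / sqrt(n))
    # Step 4: find the two-tailed p-value from the standard normal distribution.
    p_value = 2 * NormalDist().cdf(-abs(z))
    # Step 5: make the decision.
    reject_h0 = p_value < alpha
    return z, p_value, reject_h0


# Hypothetical inputs, for illustration only.
z, p_value, reject_h0 = one_sample_z_test(sample_mean=52.5, mu0=50.0, sigma=10.0, n=64)
print(f"z = {z:.2f}, p = {p_value:.4f}, reject H0: {reject_h0}")
```

Taking `alpha` as an argument rather than computing it from the result is one way to make the "never adjust the threshold afterwards" rule visible in the code.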
3 Type I and Type II Errors
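Two types of mistakes are possible in hypothesis testing:
Type I error (false positive): rejecting H₀ when it is actually true. Its probability is the significance level α.
Type II error (false negative): failing to reject H₀ when it is actually false. Its probability is denoted β, and 1 − β is the power of the test.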
There is an inherent tradeoff: for a fixed sample size, reducing α (requiring stronger evidence to reject H₀) reduces Type I errors but increases Type II errors. The choice of α = 0.05 is a convention that balances this tradeoff for general use.
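The tradeoff can be made concrete with a small sketch. It computes the Type II error rate of a two-tailed one-sample z-test for several values of α while holding everything else fixed. The helper `type_ii_error_rate` and the scenario (true effect of 2 units, σ = 10, n = 50) are assumptions made up for this illustration.

```python
from math import sqrt
from statistics import NormalDist


def type_ii_error_rate(alpha, effect, sigma, n):
    """Probability of failing to reject H0 in a two-tailed one-sample z-test
    when the true mean is mu0 + effect."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    # Under the alternative, the z statistic is Normal(shift, 1).
    shift = effect * sqrt(n) / sigma
    # Power = P(z > z_crit) + P(z < -z_crit) under that shifted distribution.
    power = std_normal.cdf(shift - z_crit) + std_normal.cdf(-shift - z_crit)
    return 1 - power


# Hypothetical scenario: same effect, spread, and sample size; only alpha changes.
for alpha in (0.10, 0.05, 0.01):
    beta = type_ii_error_rate(alpha, effect=2.0, sigma=10.0, n=50)
    print(f"alpha = {alpha:.2f} -> Type II error rate = {beta:.3f}")
```

With the other inputs held constant, each stricter α produces a larger Type II error rate. Increasing `n` is the usual way to reduce both at once.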
p = 0.001 means strong evidence against H₀, but says nothing about the size or practical importance of the effect. A drug that lowers blood pressure by 0.01 mmHg might produce p = 0.0001 in a huge trial: statistically overwhelming, clinically meaningless. Always report effect sizes.
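A rough numeric sketch of the blood-pressure example shows how this can happen. The 0.01 mmHg effect comes from the text. The standard deviation of 10 mmHg, the trial size of 16 million, and the simplification that the observed mean change equals the true effect exactly are all assumptions made for illustration.

```python
from math import sqrt
from statistics import NormalDist

effect_mmhg = 0.01   # from the text: average drop in blood pressure
sigma_mmhg = 10.0    # assumed standard deviation of individual changes
n = 16_000_000       # assumed (deliberately enormous) number of patients

# z statistic for H0: mean change == 0, treating the observed mean as 0.01 mmHg.
z = effect_mmhg / (sigma_mmhg / sqrt(n))
p_value = 2 * NormalDist().cdf(-abs(z))

# Cohen's d expresses the effect in standard-deviation units.
cohens_d = effect_mmhg / sigma_mmhg

print(f"z = {z:.2f}, p = {p_value:.1e}, Cohen's d = {cohens_d:.4f}")
```

Under these assumptions the p-value falls below 0.0001 even though the effect is one thousandth of a standard deviation, which is why the effect size has to be reported alongside it.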
4 Common Test Types
One-sample z-test: compare a sample mean to a known population mean when σ is known. Used in the example above.
One-sample t-test: the same comparison as the z-test, but for when σ is unknown (which is almost always the case) and must be estimated from the sample. Uses the t-distribution with n−1 degrees of freedom.
Two-sample t-test: compare means from two independent groups. Did the treatment group improve more than the control group?
Chi-square test: test relationships between categorical variables. Are gender and voting preference independent? Are observed counts different from expected counts?
One-tailed vs two-tailed: two-tailed tests (μ ≠ μ₀) test for any difference. One-tailed tests (μ > μ₀ or μ < μ₀) test for a difference in a specific direction. Use a two-tailed test unless you had a strong theoretical reason for a one-tailed test before seeing the data.
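The test types above can be sketched with the standard library alone. All function names and data here are hypothetical. Two choices are assumptions of this sketch rather than something the lesson specifies: the two-sample test uses Welch's unequal-variance form, and the chi-square test handles only a 2×2 table with no continuity correction. The standard library has no t-distribution, so the t-tests return only the statistic and degrees of freedom. A p-value would then come from a t-distribution routine such as `scipy.stats.t.sf`, assuming SciPy is available.

```python
from math import erfc, sqrt
from statistics import NormalDist, mean, stdev


def one_sample_t(sample, mu0):
    """t statistic and degrees of freedom for H0: mu == mu0 (sigma unknown)."""
    n = len(sample)
    t = (mean(sample) - mu0) / (stdev(sample) / sqrt(n))
    return t, n - 1


def welch_t(group_a, group_b):
    """Welch's two-sample t statistic and approximate degrees of freedom."""
    va = stdev(group_a) ** 2 / len(group_a)
    vb = stdev(group_b) ** 2 / len(group_b)
    t = (mean(group_a) - mean(group_b)) / sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(group_a) - 1) + vb ** 2 / (len(group_b) - 1))
    return t, df


def chi_square_2x2(table):
    """Pearson chi-square statistic and p-value for a 2x2 table of counts."""
    (a, b), (c, d) = table
    total = a + b + c + d
    row_totals, col_totals = (a + b, c + d), (a + c, b + d)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (observed - expected) ** 2 / expected
    # With 1 degree of freedom, P(chi-square > x) = erfc(sqrt(x / 2)).
    return chi2, erfc(sqrt(chi2 / 2))


def z_p_values(z):
    """p-values for the same z statistic under each choice of alternative."""
    phi = NormalDist().cdf
    return {"two_tailed": 2 * phi(-abs(z)), "greater": 1 - phi(z), "less": phi(z)}


# Hypothetical data, for illustration only.
print(one_sample_t([1, 2, 3, 4, 5], mu0=2))
print(welch_t([1, 2, 3, 4, 5], [2, 3, 4, 5, 6]))
print(chi_square_2x2([[10, 20], [20, 10]]))
print(z_p_values(1.8))
```

For a positive z, the "greater" p-value is half the two-tailed one, which is why the direction has to be fixed before looking at the data rather than picked to get under the threshold.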
5 Interpreting Results Correctly
Statistically significant means: if the null hypothesis were true, you would see results at least this extreme less than a fraction α of the time by chance alone (less than 5% of the time when α = 0.05). It means the data clears a threshold of unusualness. It does NOT mean the effect is large, important, or certain.
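A short simulation can illustrate this definition. It repeatedly draws samples from a population where H₀ is true by construction and counts how often a two-tailed z-test still declares significance. The function name, sample size, trial count, and seed are arbitrary choices for this sketch.

```python
import random
from math import sqrt
from statistics import NormalDist


def false_positive_rate(alpha=0.05, n=30, trials=5000, seed=0):
    """Fraction of simulated experiments that reject H0 even though H0 is true."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    rejections = 0
    for _ in range(trials):
        # H0 is true by construction: the population mean is 0 and sigma is 1.
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        z = (sum(sample) / n) / (1.0 / sqrt(n))
        if abs(z) > z_crit:
            rejections += 1
    return rejections / trials


rate = false_positive_rate()
print(f"Share of 'significant' results with no real effect: {rate:.3f}")
```

The share should land close to α, so a "significant" label by itself only says the result would be unusual if nothing were going on.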
"Fail to reject H₀" does NOT mean H₀ is true. It means the data didn't provide enough evidence to reject it. The effect might still exist but the sample was too small to detect it (low power).
Good practice: always report the test statistic, p-value, effect size (e.g., Cohen's d), and confidence interval together. A single p-value without context is insufficient for scientific conclusions.
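One possible shape for such a report, sketched for a one-sample z-test with known σ. The function `z_test_report` and its inputs are hypothetical, and Cohen's d is computed here as the mean difference divided by the known σ.

```python
from math import sqrt
from statistics import NormalDist


def z_test_report(sample_mean, mu0, sigma, n, confidence=0.95):
    """Collect test statistic, p-value, effect size, and confidence interval."""
    std_normal = NormalDist()
    standard_error = sigma / sqrt(n)
    z = (sample_mean - mu0) / standard_error
    p_value = 2 * std_normal.cdf(-abs(z))
    cohens_d = (sample_mean - mu0) / sigma
    # Confidence interval for the population mean, centred on the sample mean.
    margin = std_normal.inv_cdf(0.5 + confidence / 2) * standard_error
    return {
        "z": z,
        "p_value": p_value,
        "cohens_d": cohens_d,
        "ci": (sample_mean - margin, sample_mean + margin),
    }


# Hypothetical inputs, for illustration only.
report = z_test_report(sample_mean=52.5, mu0=50.0, sigma=10.0, n=64)
low, high = report["ci"]
print(f"z = {report['z']:.2f}, p = {report['p_value']:.4f}, "
      f"d = {report['cohens_d']:.2f}, 95% CI = ({low:.2f}, {high:.2f})")
```

Returning all four quantities from one function makes it harder to quote the p-value without the context that gives it meaning.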