
Hypothesis Testing

Hypothesis testing is the formal process for using data to make decisions under uncertainty. It is the framework behind every clinical trial, every A/B test, every scientific experiment that claims a finding. Understanding it properly is one of the most practically valuable skills in statistics.

1 The Hypothesis Testing Framework

Hypothesis testing is formalized skepticism. You start by assuming the boring, default explanation is true (the null hypothesis). Then you collect data and ask: is my data surprising enough to reject that default? If yes, you have evidence for something more interesting. If no, you do not.

The null hypothesis (H₀) is the default assumption: no effect, no difference, no relationship. The alternative hypothesis (H₁ or Hₐ) is what you are trying to find evidence for. You never "prove" the alternative; you either find enough evidence to reject the null, or you fail to reject it.

The criminal trial analogy

Hypothesis testing works like a trial. The null hypothesis is 'innocent.' You need evidence beyond a reasonable doubt to reject it (convict). Failing to convict doesn't prove innocence; it just means the evidence wasn't strong enough. Similarly, failing to reject H₀ doesn't prove H₀ is true.

2 The Five Steps

Hypothesis Test: Full Example
A manufacturer claims their batteries last 200 hours. You test 36 batteries and find a mean of 195 hours with standard deviation 18 hours (treated here as the known population σ, so a z-test applies). Is the claim false at α = 0.05?
Step 1: State hypotheses. H₀: μ = 200 (claim is true). H₁: μ ≠ 200 (two-tailed: either too high or too low)
Step 2: Choose significance level. α = 0.05 (a 5% chance of rejecting H₀ when it is actually true)
Step 3: Calculate test statistic. z = (x̄ − μ₀)/(σ/√n) = (195 − 200)/(18/√36) = −5/3 ≈ −1.67
Step 4: Find p-value. P(|Z| > 1.67) for a two-tailed test ≈ 0.095
Step 5: Decision. p = 0.095 > α = 0.05, so we fail to reject H₀
Answer: Insufficient evidence to reject the manufacturer's claim at the 5% significance level
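
The arithmetic in this example can be checked with a short script. This is a minimal sketch using only Python's standard library; the function name `one_sample_z_test` is made up for illustration, and it assumes the 18-hour standard deviation is treated as the known population σ.

```python
from math import sqrt
from statistics import NormalDist

def one_sample_z_test(sample_mean, mu0, sd, n, alpha=0.05):
    """Two-tailed one-sample z-test; returns (z, p_value, reject_h0)."""
    standard_error = sd / sqrt(n)
    z = (sample_mean - mu0) / standard_error
    # Two-tailed p-value: probability of a |z| at least this large if H0 is true.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value, p_value < alpha

# Battery example: mean 195 hours, claimed 200, sd 18, n = 36.
z, p_value, reject = one_sample_z_test(195, 200, 18, 36)
print(f"z = {z:.2f}, p = {p_value:.3f}, reject H0: {reject}")
```

Because the script keeps z unrounded at −5/3, its p-value comes out near 0.096 rather than the 0.095 obtained from the rounded z = −1.67. The decision is the same either way.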

These five steps always go in this order: state your hypotheses → choose α → calculate the test statistic → find the p-value → make a decision. Never skip steps, never work backwards from the data to the hypothesis, and never adjust the threshold after seeing results.

3 Type I and Type II Errors

Two types of mistakes are possible in hypothesis testing:

The two error types
Type I error (false positive): Rejecting H₀ when it's actually true. Probability = α (the significance level you chose). Analogy: convicting an innocent person.
Type II error (false negative): Failing to reject H₀ when it's actually false. Probability = β. Analogy: acquitting a guilty person.
Power = 1 − β: the probability of correctly rejecting a false H₀. A larger sample size increases power.
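
One way to see that the Type I error probability equals α is to simulate many experiments in which H₀ really is true and count how often the test rejects. The sketch below reuses the battery numbers (μ₀ = 200, σ = 18, n = 36) and assumes normally distributed lifetimes; the function name, the trial count, and the seed are arbitrary illustrative choices.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean

def type_i_error_rate(mu0=200, sigma=18, n=36, alpha=0.05, trials=10_000, seed=0):
    """Fraction of simulated experiments that reject H0 even though H0 is true."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # two-tailed critical value
    rejections = 0
    for _ in range(trials):
        # Draw a sample from a population where the null hypothesis holds.
        sample = [rng.gauss(mu0, sigma) for _ in range(n)]
        z = (fmean(sample) - mu0) / (sigma / sqrt(n))
        if abs(z) > z_crit:
            rejections += 1  # a false positive
    return rejections / trials

rate = type_i_error_rate()
print(f"Observed Type I error rate: {rate:.3f}")
```

The observed rate should land close to 0.05, with some simulation noise.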

There is an inherent tradeoff: for a fixed sample size, reducing α (requiring stronger evidence to reject H₀) reduces Type I errors but increases Type II errors. The choice of α = 0.05 is a convention, not a law; it is a widely used compromise between the two error types.
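
The tradeoff can be made concrete by computing power directly for a two-tailed z-test. The sketch below is illustrative only: it assumes a hypothetical true difference of 5 hours from the claimed mean and σ = 18 (borrowed from the battery example), and the function name `z_test_power` is invented here.

```python
from math import sqrt
from statistics import NormalDist

def z_test_power(true_diff, sd, n, alpha=0.05):
    """Power of a two-tailed one-sample z-test when the true mean is mu0 + true_diff."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    shift = true_diff / (sd / sqrt(n))  # true difference in standard-error units
    # Probability that the test statistic lands in either rejection region.
    return (1 - std_normal.cdf(z_crit - shift)) + std_normal.cdf(-z_crit - shift)

# ASSUMPTION: a hypothetical true difference of 5 hours, sd = 18.
for alpha in (0.05, 0.01):
    for n in (36, 144):
        power = z_test_power(true_diff=5, sd=18, n=n, alpha=alpha)
        print(f"alpha={alpha}, n={n}: power={power:.2f}, beta={1 - power:.2f}")
```

At a fixed n, lowering α from 0.05 to 0.01 lowers power (raises β). At a fixed α, increasing n raises power. When the true difference is zero, the function returns α itself, which is the Type I error rate.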

Lower p-value ≠ more important finding

p = 0.001 means strong evidence against H₀, but says nothing about the size or practical importance of the effect. A drug that lowers blood pressure by 0.01 mmHg might produce p = 0.0001 in a huge trial: statistically overwhelming, clinically meaningless. Always report effect sizes.
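
A rough calculation shows how sample size alone can drive the p-value down while the effect stays negligible. The 0.01 mmHg reduction comes from the example above. The 10 mmHg standard deviation, the sample sizes, and the simplification that the observed mean reduction exactly equals 0.01 mmHg in a one-sample z-test are all assumptions made for illustration.

```python
from math import sqrt
from statistics import NormalDist

effect = 0.01  # mean reduction in mmHg (from the example above)
sd = 10.0      # ASSUMPTION: hypothetical standard deviation in mmHg

# Standardized effect size (Cohen's d) does not depend on sample size.
cohens_d = effect / sd

def two_tailed_p(effect, sd, n):
    """Two-tailed z-test p-value if the observed mean reduction equals `effect`."""
    z = effect / (sd / sqrt(n))
    return 2 * (1 - NormalDist().cdf(abs(z)))

for n in (1_000, 1_000_000, 20_000_000):
    print(f"n={n:>10,}: p={two_tailed_p(effect, sd, n):.2g}, Cohen's d={cohens_d}")
```

Under these assumptions the p-value only drops below 0.0001 at sample sizes in the tens of millions, while Cohen's d stays at 0.001 throughout. The p-value changes with n; the effect size does not.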

4 Common Test Types

One-sample z-test: compare a sample mean to a known population mean when σ is known. Used in the example above.

One-sample t-test: same as the z-test, but for when σ is unknown (which is almost always) and must be estimated by the sample standard deviation s. Uses the t-distribution with n−1 degrees of freedom.

Two-sample t-test: compare means from two independent groups. Did the treatment group improve more than the control group?

Chi-square test: test relationships between categorical variables. Are gender and voting preference independent? Are observed counts different from expected counts?

One-tailed vs two-tailed: two-tailed tests (μ ≠ μ₀) test for any difference. One-tailed tests (μ > μ₀ or μ < μ₀) test for a difference in a specific direction. Use two-tailed unless you have a strong theoretical reason for one-tailed before seeing the data.
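
The difference between one-tailed and two-tailed p-values can be sketched with the battery example's test statistic. This uses only the standard library; the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

# Test statistic from the battery example: mean 195, claimed 200, sd 18, n = 36.
z = (195 - 200) / (18 / sqrt(36))

std_normal = NormalDist()
p_two_tailed = 2 * (1 - std_normal.cdf(abs(z)))  # H1: mu != 200
p_lower_tail = std_normal.cdf(z)                 # H1: mu < 200
p_upper_tail = 1 - std_normal.cdf(z)             # H1: mu > 200

print(f"two-tailed: {p_two_tailed:.3f}")
print(f"lower-tail: {p_lower_tail:.3f}")
print(f"upper-tail: {p_upper_tail:.3f}")
```

The lower-tail p-value is half the two-tailed one, about 0.048, which falls just under 0.05 even though the two-tailed test fails to reject. This is why the direction has to be chosen before seeing the data: picking the tail afterwards effectively doubles the Type I error rate.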

5 Interpreting Results Correctly

Statistically significant means: if the null hypothesis were true, results at least this extreme would occur with probability less than α (for example, less than 5% of the time when α = 0.05). It means the data clears a threshold of unusualness. It does NOT mean the effect is large, important, or certain.

"Fail to reject H₀" does NOT mean H₀ is true. It means the data didn't provide enough evidence to reject it. The effect might still exist but the sample was too small to detect it (low power).

Good practice: always report the test statistic, p-value, effect size (e.g., Cohen's d), and confidence interval together. A single p-value without context is insufficient for scientific conclusions.
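
As a sketch of what reporting these together might look like, the snippet below computes all four quantities for the battery example. It assumes σ = 18 is known (so the interval uses a z critical value) and defines Cohen's d as the mean difference divided by the standard deviation.

```python
from math import sqrt
from statistics import NormalDist

sample_mean, mu0, sd, n, alpha = 195, 200, 18, 36, 0.05

std_normal = NormalDist()
se = sd / sqrt(n)                                # standard error of the mean
z = (sample_mean - mu0) / se                     # test statistic
p_value = 2 * (1 - std_normal.cdf(abs(z)))       # two-tailed p-value
cohens_d = (sample_mean - mu0) / sd              # standardized effect size
z_crit = std_normal.inv_cdf(1 - alpha / 2)       # critical value for the CI
ci_low, ci_high = sample_mean - z_crit * se, sample_mean + z_crit * se

print(f"z = {z:.2f}, p = {p_value:.3f}")
print(f"Cohen's d = {cohens_d:.2f}")
print(f"{100 * (1 - alpha):.0f}% CI for the mean: ({ci_low:.2f}, {ci_high:.2f})")
```

The 95% interval runs from about 189.1 to 200.9 hours and contains the claimed 200, which matches the decision not to reject H₀. Cohen's d of about −0.28 adds what the p-value alone does not: the observed shortfall is small relative to battery-to-battery variation.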

Practice Problems

A researcher tests whether a new drug reduces recovery time. H₀: no difference. They find p = 0.03 with α = 0.05. What is the conclusion?
p = 0.03 < α = 0.05, so we reject H₀. There is statistically significant evidence the drug reduces recovery time.
What type of error occurs when you reject H₀ but it was actually true?
Type I error (false positive). Its probability equals α, the significance level.
A study has very low power (β = 0.80, so power = 0.20). What does this mean?
The test only has a 20% chance of detecting a real effect if one exists. There's an 80% chance of a Type II error, that is, of missing a true effect. The sample size is likely too small.