
Correlation vs Causation

Ice cream sales and drowning rates are strongly correlated. Eating ice cream does not cause drowning. Understanding why, and how to think carefully about correlation and causation, is one of the most practically valuable skills in statistics.

In this lesson

What correlation means
What causation means
Why they are different
Confounding variables
How causation is actually established

1 What Correlation Means

Correlation measures how closely two variables move together. A correlation coefficient (r) ranges from −1 to +1. At +1, the variables move in perfect lockstep. At −1, one goes down exactly as the other goes up. At 0, there is no linear relationship, although the variables may still be related in a non-linear way.

Correlation simply describes a pattern in data. Two variables are correlated if knowing one tells you something about the other. Shoe size and reading ability are correlated in children, not because shoes cause reading, but because both increase with age.

Correlation Coefficient (r)

r close to +1: strong positive correlation (both increase together)
r close to −1: strong negative correlation (one increases as the other decreases)
r close to 0: weak or no linear correlation

As a rule of thumb, values above 0.7 or below −0.7 are often considered strong, though the threshold varies by field.
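To make the coefficient concrete, here is a minimal sketch that computes Pearson's r from scratch. The function name `pearson_r` and the small data sets are made up for illustration; they are not from any particular study.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations, and sums of squared deviations.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("r is undefined when one variable is constant")
    return sxy / math.sqrt(sxx * syy)

# Illustrative (made-up) data.
x = [1, 2, 3, 4, 5]
r_up = pearson_r(x, [2, 4, 6, 8, 10])    # y rises in lockstep with x
r_down = pearson_r(x, [10, 8, 6, 4, 2])  # y falls in lockstep with x
# y = x**2: a perfect relationship, but not a linear one.
r_curve = pearson_r([-2, -1, 0, 1, 2], [4, 1, 0, 1, 4])

print(r_up, r_down, r_curve)  # 1.0 -1.0 0.0
```

The last case is worth noticing: y depends entirely on x, yet r is 0, because r only measures linear relationships.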

2 What Causation Means

Causation means one thing directly produces another. Smoking causes lung cancer: the mechanism is understood, the relationship is consistent, and reducing the cause reduces the effect. That chain of evidence is what makes something causal rather than just correlated.

To establish causation you need more than a correlation in data. You need a plausible mechanism, evidence that the cause precedes the effect in time, and ideally experimental evidence where you manipulate the cause and observe the effect while controlling everything else.

3 Why They Are Different

The ice cream and drowning example is the classic illustration. Both go up in summer, but neither causes the other. Hot weather leads people to buy more ice cream, and it also leads more people to go swimming, and more swimming means more drowning. Warm weather is the hidden third variable driving both.
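A small simulation can show how a hidden third variable produces this pattern. This is only a sketch: the numbers (temperature range, slopes, noise levels) are invented for illustration and are not real sales or drowning data.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical daily temperatures in degrees Celsius.
temps = [rng.uniform(0, 35) for _ in range(1000)]

# Each variable depends on temperature plus noise.
# Neither formula mentions the other variable.
ice_cream = [10 * t + rng.gauss(0, 20) for t in temps]
drownings = [0.2 * t + rng.gauss(0, 1) for t in temps]

r_ice_drown = pearson_r(ice_cream, drownings)
print(round(r_ice_drown, 2))
```

The printed correlation is strongly positive even though, by construction, ice cream sales have no effect on drownings and drownings have no effect on ice cream sales.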

Classic Spurious Correlations

The number of Nicolas Cage films released per year correlates with swimming pool drownings

1. Strong positive correlation found in real US data across multiple years

2. Mechanism? There is none. This is pure coincidence in a small sample

3. Both happen to move together over time without any causal link

4. This is called a spurious correlation: statistically real, causally meaningless

Answer: Correlation without causation. No action should be taken based on this relationship

Tyler Vigen's Spurious Correlations website documents hundreds of these: the divorce rate in Maine correlates with per capita margarine consumption, and US spending on science correlates with suicides by hanging. These examples show that if you compare enough pairs of variables, especially variables that trend over time, some will correlate strongly by chance alone.
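The role of trends can be checked with a rough simulation. The sketch below compares pairs of independent random walks (series that drift over time) with pairs of independent noise series that have no trend. The helper names, the series length of 100, and the 0.5 cutoff are all arbitrary choices for illustration.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def random_walk(rng, n):
    """A drifting series: each value is the previous value plus random noise."""
    level, path = 0.0, []
    for _ in range(n):
        level += rng.gauss(0, 1)
        path.append(level)
    return path

def white_noise(rng, n):
    """A series with no trend: every value is drawn independently."""
    return [rng.gauss(0, 1) for _ in range(n)]

def share_strong(make_series, rng, trials=300, n=100, threshold=0.5):
    """Fraction of independent pairs whose |r| exceeds the threshold."""
    hits = 0
    for _ in range(trials):
        a, b = make_series(rng, n), make_series(rng, n)
        if abs(pearson_r(a, b)) > threshold:
            hits += 1
    return hits / trials

rng = random.Random(1)  # fixed seed so the run is repeatable
share_walks = share_strong(random_walk, rng)
share_noise = share_strong(white_noise, rng)
print(share_walks, share_noise)
```

Every pair here is unrelated by construction, yet a sizeable share of the drifting pairs cross the cutoff while the trend-free pairs essentially never do.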

4 Confounding Variables

A confounding variable is a third variable that influences both of the variables you're studying, creating a spurious correlation between them. Warm weather confounds the ice cream/drowning correlation. Age confounds the shoe size/reading ability correlation.

Coffee and heart disease looked correlated for years. Coffee drinkers had higher rates of heart disease in early studies. The problem was that heavy coffee drinkers also smoked more, and smoking was the actual driver. When researchers controlled for smoking, the coffee association disappeared. The confounder was hiding in plain sight.
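One simple way to control for a confounder is to split the data by it and look at the correlation inside each group. The sketch below does this with simulated data. The smoking rate, the effect sizes, and the "intake" and "risk" scores are invented for illustration and do not come from the studies mentioned above.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

rng = random.Random(2)  # fixed seed so the run is repeatable

people = []
for _ in range(5000):
    smoker = rng.random() < 0.4
    # Hypothetical scores: smoking raises both, and coffee has no effect on risk.
    coffee = 2 + 3 * smoker + rng.gauss(0, 1)
    risk = 1 + 4 * smoker + rng.gauss(0, 1)
    people.append((smoker, coffee, risk))

def r_coffee_risk(rows):
    """Correlation between the coffee score and the risk score in a group."""
    return pearson_r([c for _, c, _ in rows], [k for _, _, k in rows])

r_all = r_coffee_risk(people)
r_smokers = r_coffee_risk([p for p in people if p[0]])
r_nonsmokers = r_coffee_risk([p for p in people if not p[0]])
print(round(r_all, 2), round(r_smokers, 2), round(r_nonsmokers, 2))
```

Across everyone, coffee and risk look clearly correlated. Within smokers alone, and within non-smokers alone, the correlation is close to zero, which is what "controlling for smoking" reveals in this toy setup.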

How to Spot Confounders

Ask: is there a third variable that could cause both of the things I'm observing? Demographic variables (age, income, education) are common confounders because they affect so many outcomes simultaneously.

5 How Causation Is Actually Established

The gold standard for establishing causation is a randomized controlled trial (RCT). You randomly assign people to a treatment group or a control group, give the treatment to one group only, and measure what happens. Random assignment is the key step: it makes the two groups similar, on average, in every way except the treatment, including in ways nobody thought to measure. That is how an RCT deals with confounders by design.
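The effect of random assignment can be seen in a simulation where the true treatment effect is known. In this sketch the same simulated people are studied twice: once when healthier people tend to choose the treatment themselves, and once when a coin flip decides. The true effect of 2.0 and all other numbers are assumptions made for illustration.

```python
import random

TRUE_EFFECT = 2.0  # assumed for this simulation

def mean(values):
    return sum(values) / len(values)

def outcome(treated, health, rng):
    """Outcome depends on the treatment AND on baseline health (the confounder)."""
    return TRUE_EFFECT * treated + 3 * health + rng.gauss(0, 1)

def difference_in_means(rows):
    """Average outcome of the treated minus average outcome of the untreated."""
    treated = [y for t, y in rows if t]
    control = [y for t, y in rows if not t]
    return mean(treated) - mean(control)

rng = random.Random(3)  # fixed seed so the run is repeatable
observational, randomized = [], []
for _ in range(10000):
    health = rng.gauss(0, 1)
    # Observational setting: healthier people are more likely to opt in.
    chose = health + rng.gauss(0, 1) > 0
    observational.append((chose, outcome(chose, health, rng)))
    # Randomized setting: a coin flip decides, so health cannot affect assignment.
    assigned = rng.random() < 0.5
    randomized.append((assigned, outcome(assigned, health, rng)))

obs_estimate = difference_in_means(observational)
rct_estimate = difference_in_means(randomized)
print(round(obs_estimate, 2), round(rct_estimate, 2))
```

The observational comparison overstates the effect because the treated group was healthier to begin with. The randomized comparison lands close to the true value of 2.0.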

When RCTs are unethical or impractical (you can't randomly assign people to smoke for 30 years), researchers use natural experiments, instrumental variables, regression discontinuity, and difference-in-differences. These are all approaches designed to approximate the conditions of a controlled experiment using observational data.
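As one example of these techniques, difference-in-differences compares the change over time in a group that was exposed to something with the change in a group that was not. The sketch below shows only the basic arithmetic, using made-up group averages. It assumes the two groups would have followed the same trend without the exposure, which is the central assumption of the method.

```python
def difference_in_differences(treated_before, treated_after,
                              control_before, control_after):
    """Change in the treated group minus change in the control group."""
    treated_change = treated_after - treated_before
    # The control group's change stands in for what would have happened anyway.
    control_change = control_after - control_before
    return treated_change - control_change

# Hypothetical average outcomes before and after a policy change.
effect = difference_in_differences(
    treated_before=50.0, treated_after=62.0,
    control_before=48.0, control_after=53.0,
)
print(effect)  # 7.0
```

The treated group rose by 12 and the control group by 5, so the estimate attributes the remaining 7 to the policy.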

The Bradford Hill criteria provide a framework for inferring causation from observational data: strength of association, consistency across studies, specificity, temporality (cause precedes effect), biological gradient (dose-response), plausibility of mechanism, coherence with existing knowledge, experimental evidence where available, and analogy with known causal relationships. No single criterion is sufficient; the totality of evidence builds the case.

Practice Problems

Countries with more televisions per capita have higher life expectancy. Does TV cause longer life?

No. This is a confounded relationship. Wealthier countries have both more TVs and better healthcare, nutrition, and living conditions. Wealth is the confounder driving both variables.

A study finds students who eat breakfast get better grades. Does breakfast cause better grades?

Possibly, but this data alone cannot show it. Students who eat breakfast may also come from households with more resources, stability, and parental involvement, all of which affect grades. To establish causation you'd need a controlled study.

What is the key difference between a correlational study and a randomized controlled trial?

In a correlational study you observe existing data. You can adjust for confounders you have measured, but never for ones you haven't. In an RCT, random assignment makes both groups similar on average in all ways except the treatment, which balances confounders, including unmeasured ones, and enables causal conclusions.

Sources & Further Reading

The explanations on this page draw on the following established sources. We link to primary and secondary sources so you can verify claims and go deeper on any topic.

Khan Academy, Correlation and Causality: video explanation
Tyler Vigen, Spurious Correlations: funny examples of the problem
Wikipedia, Correlation does not imply causation: full explanation
Crash Course, Correlation Studies: YouTube video