You look at the numbers and they scream: Option A is better than Option B.
Then someone “just averages everything,” and suddenly the conclusion flips: B is better than A.
It feels like a glitch in reality. It isn’t. It’s a very real statistical trap called Simpson’s Paradox, and it shows up in marketing, medicine, hiring, education, and basically anywhere people love a single “overall” metric.
Simpson’s Paradox in one sentence
Simpson’s Paradox happens when:
- A beats B inside every subgroup,
- but B beats A after you combine the subgroups.
The villain isn’t “bad math.” The villain is unequal weighting: the groups don’t contribute equally to the overall average for A and B.
In other words: you didn’t compare A vs B.
You compared two different mixtures.
A shareable example you can compute in 60 seconds
Let’s make this concrete with a classic setup: two audience segments.
- Segment 1: “Easy” audience (warm leads, brand fans, or high-intent traffic)
- Segment 2: “Hard” audience (cold traffic, low intent, or tough cases)
We’re comparing conversion rates for two landing pages: A and B.
The data
Segment 1 (easy):
- A: 9 conversions out of 10 visitors → 90%
- B: 800 conversions out of 1000 visitors → 80%
So in Segment 1: A > B.
Segment 2 (hard):
- A: 20 conversions out of 1000 visitors → 2%
- B: 1 conversion out of 100 visitors → 1%
So in Segment 2: A > B again.
A is better in both segments. Case closed… right?
Now combine the data “overall.”
Overall results (the part that shocks people)
Overall conversion rate is:

$$ \text{overall rate} = \frac{\text{total conversions}}{\text{total visitors}}. $$
For A:
- Total conversions: 9 + 20 = 29
- Total visitors: 10 + 1000 = 1010
- Overall: 29 / 1010 ≈ 2.9%
For B:
- Total conversions: 800 + 1 = 801
- Total visitors: 1000 + 100 = 1100
- Overall: 801 / 1100 ≈ 72.8%
So overall: B > A by a landslide.
Let’s say it out loud:
A wins in every segment, but loses overall.
This is Simpson’s Paradox in its purest form.
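If you want to check the arithmetic yourself, here’s a minimal Python sketch that reproduces the per-segment and pooled rates from the tables above (the dictionary layout is just one convenient way to hold the counts):

```python
# Reproduce the example: per-segment rates and the pooled "overall" rate.
# Counts are taken directly from the tables above.
data = {
    "A": {"easy": (9, 10), "hard": (20, 1000)},     # (conversions, visitors)
    "B": {"easy": (800, 1000), "hard": (1, 100)},
}

for page, segments in data.items():
    for seg, (conv, vis) in segments.items():
        print(f"{page} / {seg}: {conv / vis:.1%}")
    total_conv = sum(c for c, _ in segments.values())
    total_vis = sum(v for _, v in segments.values())
    print(f"{page} / overall: {total_conv / total_vis:.1%}")

# A wins inside both segments (90% vs 80%, 2% vs 1%),
# yet loses overall (about 2.9% vs about 72.8%).
```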
Why the flip happens: “overall” is a weighted average
The overall rate isn’t a magical truth serum. It’s a weighted average of the group rates.
For two groups, the overall rate for A is:

$$ \text{overall}_A = w_{A,1}\, r_{A,1} + w_{A,2}\, r_{A,2}, $$

and similarly for B:

$$ \text{overall}_B = w_{B,1}\, r_{B,1} + w_{B,2}\, r_{B,2}. $$

Where:
- $r_{A,1}, r_{A,2}$ = A’s rate in group 1 and group 2 (and likewise $r_{B,1}, r_{B,2}$ for B)
- $w_{A,1}, w_{A,2}$ = how much of A’s traffic each group represents
- $w_{B,1}, w_{B,2}$ = how much of B’s traffic each group represents
Here’s the key: A and B don’t share the same weights.
In our example:
- A is mostly tested on the hard group (1000 hard vs 10 easy).
- B is mostly tested on the easy group (1000 easy vs 100 hard).
So the “overall” numbers are answering a different question than you think.
They aren’t answering:
“Which page is better for a given person?”
They’re answering:
“Which page did better under the specific audience mix it happened to receive?”
Those are not the same question.
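To make that concrete, here’s a small sketch that plugs the example’s rates and weights into the weighted-average formula and recovers exactly the “overall” numbers from before (the variable names are just shorthand for the r’s and w’s above):

```python
# Plugging the example's rates and weights into the weighted-average formula.
r_A1, r_A2 = 9 / 10, 20 / 1000        # A's rates in segments 1 and 2
r_B1, r_B2 = 800 / 1000, 1 / 100      # B's rates in segments 1 and 2

w_A1, w_A2 = 10 / 1010, 1000 / 1010   # A's visitor mix: almost all "hard"
w_B1, w_B2 = 1000 / 1100, 100 / 1100  # B's visitor mix: almost all "easy"

overall_A = w_A1 * r_A1 + w_A2 * r_A2  # = 29 / 1010,  about 2.9%
overall_B = w_B1 * r_B1 + w_B2 * r_B2  # = 801 / 1100, about 72.8%
print(f"A overall: {overall_A:.1%}, B overall: {overall_B:.1%}")
```

Same rates, different weights, opposite “overall” winner.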
Is Simpson’s Paradox a lie… or a clue?
This is where people get heated, because both sides can be “right” depending on your goal.
If your question is “Which is better within each segment?”
Then you should compare within segments.
- In our example, A is better for easy users.
- A is also better for hard users. So A wins for both segments.
That’s useful if you’re deciding what to show once you know the segment.
If your question is “Which version produces more total conversions as deployed?”
Then the overall number may matter, but only with a giant warning label:
- The overall advantage might come from who got shown what, not from the variant itself.
In real life, this “who got shown what” can happen due to:
- targeting differences (campaign A hits colder regions/devices)
- algorithmic allocation (platform sends B better traffic)
- time effects (A ran during a slow week, B during a busy week)
- selection bias (people self-select into groups)
Simpson’s Paradox is often a symptom of a confounding variable: a hidden factor that influences both who gets exposed to each variant and the outcome.
How not to get fooled: a practical checklist
If you run A/B tests, read reports, or even compare two strategies, this checklist saves you.
1. Always look for natural subgroups. Common ones:
- country/region
- device (mobile vs desktop)
- new vs returning users
- traffic source (ads vs organic)
- intent level (brand vs non-brand search)
2. Check the mix. Ask: “Did A and B get the same audience composition?” (A quick way to surface this is sketched right after this checklist.)
- If not, the overall comparison is at risk.
3. Don’t worship a single “overall” metric. Overall conversion can be a useful KPI, but it can also be a blender that hides the story.
4. Prefer randomization when possible. Random assignment (true A/B testing) is a powerful antidote to hidden bias. If traffic wasn’t randomized, be extra suspicious of “overall wins.”
5. Report both views. A good report shows:
- results by key segments
- the overall result
- the segment proportions for A and B (the weights)
When someone sees the weights, the paradox stops feeling spooky and starts feeling obvious.
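To illustrate the “check the mix” and “report the weights” steps, here’s a hypothetical little helper that turns raw visitor counts into segment proportions for each variant (the function name and data layout are mine, not from any library):

```python
# A rough "check the mix" helper: given visitor counts per segment for each
# variant, print each variant's segment proportions (the weights).
def segment_mix(visitors_by_segment):
    total = sum(visitors_by_segment.values())
    return {seg: n / total for seg, n in visitors_by_segment.items()}

visitors = {
    "A": {"easy": 10, "hard": 1000},
    "B": {"easy": 1000, "hard": 100},
}

for page, counts in visitors.items():
    shares = segment_mix(counts)
    print(page, {seg: f"{share:.0%}" for seg, share in shares.items()})

# A {'easy': '1%', 'hard': '99%'}   B {'easy': '91%', 'hard': '9%'}
# Very different mixes, so the overall comparison is at risk.
```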
Bonus: how to “fix” the comparison with standardization
Suppose you want a fair head-to-head comparison:
“What if A and B had the same audience mix?”
That’s called standardization (or “reweighting”).
Here’s a simple version: give each segment equal weight, 50/50.
From our example:
- Segment 1 rates: A = 90%, B = 80%
- Segment 2 rates: A = 2%, B = 1%
Equal-weight overall (just average the two segment rates):
- A: (90% + 2%) / 2 = 46%
- B: (80% + 1%) / 2 = 40.5%
Under equal weights, A > B, matching the within-segment result.
So what was going on?
- “Overall” was dominated by unequal exposure.
- “Fair comparison” removes that dominance by aligning the weights.
In real work, you might weight segments by:
- your target customer mix,
- long-run expected traffic mix,
- or a standard benchmark population.
The principle is the same: compare like with like.
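Here’s a minimal sketch of that reweighting idea, using the example’s segment rates and an assumed 50/50 target mix (swap in whatever mix matches your target population):

```python
# Standardization sketch: compare A and B under the SAME target audience mix
# instead of the mixes they happened to receive. The 50/50 target mix is an
# assumption; in practice, use your expected long-run traffic mix.
rates = {
    "A": {"easy": 0.90, "hard": 0.02},
    "B": {"easy": 0.80, "hard": 0.01},
}
target_mix = {"easy": 0.5, "hard": 0.5}  # assumed standard population

def standardized_rate(segment_rates, weights):
    # Weighted average of segment rates under a common set of weights.
    return sum(weights[seg] * rate for seg, rate in segment_rates.items())

for page, segment_rates in rates.items():
    print(f"{page}: {standardized_rate(segment_rates, target_mix):.1%}")

# A: 46.0%, B: 40.5% -> A wins once both face the same audience mix.
```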
Key takeaways
- Simpson’s Paradox is when A beats B in every subgroup, but B beats A after combining the data.
- The flip happens because overall rates are weighted averages, and A/B often have different weights across groups.
- The “overall winner” may reflect audience mix (or confounding), not the intrinsic quality of A vs B.
- To avoid being fooled, inspect key segments, check group proportions, and prefer randomized allocation.
- If needed, use standardization/reweighting to compare A and B under the same audience mix.
