April 14, 2026 By Vincent Declercq

Simpson's Paradox: The Statistical Trap That Ruins A/B Tests

Your A/B test shows Version B wins overall. Higher conversion rate, clear statistical significance. Time to ship. Then you segment the data. Version B loses on mobile. It loses on desktop too. It loses in every single segment. But it wins overall?

This isn't a bug in your analytics. It's Simpson's Paradox, a statistical phenomenon that can turn winning A/B tests into losing decisions. It sits alongside type 1 errors (false positives) and type 2 errors (false negatives) as one of the most damaging statistical traps in conversion rate optimization.

Here's how it works, how to detect it, and how to protect your A/B testing programme from it.

What Is Simpson's Paradox?

Simpson's Paradox occurs when a trend that appears in aggregated data reverses or disappears when the data is split into subgroups. Put simply: the whole tells a different story than its parts.

In A/B testing and CRO, this means your overall conversion rate can declare a winner that is actually the loser in every individual segment, mobile, desktop, new visitors, returning visitors, every one.

A Simple Example: Two Doctors

Imagine two doctors treating patients. Here are their results:

	Doctor A	Doctor B
Mild cases treated	100 patients	10 patients
Mild cases recovered	90 (90%)	8 (80%)
Severe cases treated	10 patients	100 patients
Severe cases recovered	3 (30%)	40 (40%)
Overall recovery rate	93/110 = 85%	48/110 = 44%

At first glance, Doctor A has an 85% overall success rate vs. Doctor B's 44%. Doctor A seems far better. But look at the individual segments: Doctor B outperforms Doctor A on severe cases (40% vs. 30%). Doctor B is genuinely better at treating severe cases. Yet the overall number hides this because Doctor B treats far more severe patients, whose lower baseline recovery rate drags down the aggregate.

This is Simpson's Paradox. The aggregate result contradicts the subgroup results because the patient mix is different between the two doctors.

How Simpson's Paradox Shows Up in A/B Testing

In A/B testing, Simpson's Paradox appears when there is a confounding variable that affects conversion rate AND is unevenly distributed between your test variants. This is the core mechanism behind many misleading split test results.

Example 1: Consistent Traffic Mix (No Paradox)

You run an A/B test on a landing page. The traffic split by device is roughly equal between variants:

Segment	Control (Version A)	Variation (Version B)
Mobile visitors	3% conversion (8,000)	4% conversion (2,000)
Desktop visitors	6% conversion (2,000)	7% conversion (8,000)
Overall	3.6% (10,000)	6.4% (10,000)

Version B wins overall and wins in both segments. The result is consistent and trustworthy. No paradox here.

Example 2: Unequal Traffic Mix (Paradox Emerges)

Now run the same test, but traffic distribution shifts. Control gets more desktop visitors (who convert at higher rates). Variation gets more mobile visitors (who convert at lower rates):

Segment	Control (Version A)	Variation (Version B)
Mobile visitors	3% conversion (2,000)	4% conversion (8,000)
Desktop visitors	6% conversion (8,000)	7% conversion (2,000)
Overall	5.4% (10,000)	4.6% (10,000)

Control now wins overall (5.4% vs. 4.6%). But Variation still wins in both individual segments, mobile (4% vs. 3%) and desktop (7% vs. 6%). If you ship Control based on the aggregate, you are shipping the version that performs worse for every single user.

Common Confounding Variables in A/B Testing

Simpson's Paradox is caused by confounders, variables that influence both group assignment and the outcome. The most common ones in CRO and split testing are:

Unequal device split: one variant receives more mobile traffic, which has a lower baseline conversion rate
Traffic source shifts: a new paid campaign launches mid-test and only affects one variant's audience
Time-based effects: running over a weekend shifts one variant toward leisure intent traffic
New vs. returning imbalance: returning visitors convert at higher rates; unequal distribution skews aggregate results
Geographic shifts: regional campaigns or events that affect one variant's audience composition
Bucketing problems: flawed randomisation that correlates with user characteristics

A Detailed A/B Testing Walkthrough

Here is a realistic scenario showing how Simpson's Paradox builds across a two-week test on a new checkout flow.

Week 1 Results

Traffic is evenly split. Variation wins in every segment and overall.

Segment	Control	Variation
New visitors	2.0% (1,000)	2.5% (1,000)
Returning visitors	5.0% (1,000)	5.5% (1,000)
Overall	3.5% (2,000)	4.0% (2,000)

Week 2 Results

Mid-test, traffic composition shifts. Control receives more returning visitors (higher baseline converters). Variation receives more new visitors (lower baseline converters).

Segment	Control	Variation
New visitors	2.0% (1,000)	2.5% (4,000)
Returning visitors	5.0% (4,000)	5.5% (1,000)
Overall	4.4% (5,000)	2.9% (5,000)

Combined Results

Aggregating both weeks produces a misleading overall picture:

Segment	Control	Variation
New visitors	2.0% (2,000)	2.5% (5,000)
Returning visitors	5.0% (5,000)	5.5% (2,000)
Overall	4.14% (7,000)	3.36% (7,000)

Control wins in aggregate: 4.14% vs. 3.36%. But Variation wins in every segment. If you ship Control based on the combined aggregate result, you are shipping the version that performs worse for both new and returning visitors. This is Simpson's Paradox in action.

How to Detect Simpson's Paradox in Your Tests

Step 1: Always Segment Your Results

Never base decisions solely on aggregate conversion rates. Check results across the following segments as standard:

Segment	Why It Matters	Red Flag
Device type	Mobile converts differently than desktop	Unequal mobile/desktop split between variants
Traffic source	Paid vs. organic intent differs significantly	New campaign launched mid-test
New vs. returning	Returning visitors convert at higher rates	One variant skews heavily returning
Geography	Conversion rates vary by market	Geographic campaign changes mid-test
Time period	Weekend vs. weekday behaviour differs	Test results shift dramatically week-on-week

If the overall winner and the segment-level winner disagree, you have a paradox, or at minimum a confounding variable worth investigating before shipping.

Step 2: Check for Traffic Mix Differences Between Variants

Compare the audience composition of each variant side by side:

What percentage of Control was mobile vs. desktop?
What percentage of Variation was mobile vs. desktop?
Did these proportions change between week 1 and week 2?
Did new vs. returning visitor ratios differ significantly?

Significant differences in segment distribution between variants are the primary red flag for Simpson's Paradox.

Step 3: Plot Results Over Time

Visualise conversion rate for each variant day by day. A dramatic shift mid-test, particularly if it coincides with a campaign launch, a seasonal event, or a day-of-week pattern, signals that traffic composition changed. This is often where the paradox originates.

Step 4: Understand Your Error Types

Simpson's Paradox is distinct from the two classical A/B testing errors, but all three can compound:

Error Type	What It Means	In A/B Testing
Type 1 error (false positive)	Concluding an effect exists when it does not	Calling a winner when the result was random chance
Type 2 error (false negative)	Missing an effect that does exist	Dismissing a real improvement as insignificant
Simpson's Paradox	Aggregate trend reverses at segment level	Shipping a winner that loses for every user group

How to Prevent Simpson's Paradox

Prevention is built into how you design and run tests, not just how you analyse them:

Action	When	Why It Helps
Use stratified sampling	Before test launch	Ensures balanced traffic mix across variants
Set minimum sample size	Before test launch	Avoids underpowered tests and early false positives
Pre-register analysis plan	Before test launch	Prevents cherry-picking segments after the fact
Avoid mid-test traffic changes	During test	Prevents confounding variables from shifting the mix
Run for 2-4 weeks minimum	During test	Normalises day-of-week and traffic composition effects
Segment results on completion	After test	Detects paradox before shipping a misleading winner

Stratified Randomisation: Instead of pure random assignment, stratify by known confounders before the test begins. Ensure each variant receives proportionally equal amounts of mobile traffic, new visitors, traffic from each major source, and so on. Most enterprise-grade A/B testing platforms support stratified sampling. If yours does not, this is a significant limitation.

Set Your Sample Size Before You Start: Use a statistical significance calculator to determine the minimum number of visitors needed per variant before you begin. This disciplines you against ending tests early when early results look promising, a major driver of type 1 errors and susceptibility to Simpson's Paradox.

Pre-Register Your Analysis Plan: Decide before the test launches exactly which segments you will analyse and how you will resolve conflicts between aggregate and segment results. Pre-registration prevents post-hoc rationalisation and segment cherry-picking, which is itself a form of CRO statistics error.

What to Do When You Find Simpson's Paradox

Do not trust the aggregate result. The aggregate number is misleading. Do not ship based on it. Making decisions from a paradoxical aggregate result could mean rolling out a change that hurts performance for every user segment simultaneously.

Investigate the root cause. Work through the following questions:

Was there a bucketing or randomisation problem?
Did external traffic shifts affect one variant more than the other?
Are there strong time-based effects, day of week, week of month?
Was a new paid campaign or channel launched mid-test?

Consider segment-specific experiences. If the paradox is real and rooted in genuine segment differences rather than a data error, personalisation may be the correct answer. Variation performs better for mobile users; Control performs better for desktop users. Rather than choosing one, serve each segment the version that works for them. This is precisely the use case that personalisation and dynamic page optimisation tools are built for.

Re-run with proper stratification. If traffic mix issues caused the paradox and can be controlled, re-run the test with stratified randomisation and a pre-determined minimum sample size. Ensure no major traffic changes occur during the test window.

Simpson's Paradox sits within a broader family of statistical illusions that mislead CRO practitioners and split testers.

Type 1 and Type 2 Errors: Type 1 errors (false positives) declare winners prematurely due to insufficient sample size or early stopping. Type 2 errors (false negatives) dismiss genuine improvements as noise due to underpowered tests. Both are reduced by correct sample size calculation before tests begin and adherence to pre-registered test durations.

Berkson's Paradox: When a sample is selected based on two variables, those variables can appear negatively correlated within the sample even when they are unrelated in the general population. Relevant when analysing converted users only rather than all visitors.

Survivorship Bias: Analysing only data that survived a selection process misses the full picture. In CRO, this might mean optimising for users who reached checkout while ignoring the larger population who never got there.

Regression to the Mean: Extreme results tend to move toward average over time. A page that performed unusually well in week 1 may normalise in week 2 regardless of which variant it was. This can be mistaken for a real treatment effect in short tests.

Preventing Misleading Results in Your CRO Programme

Simpson's Paradox highlights why professional A/B testing requires more than comparing two conversion rate numbers. A rigorous CRO testing process requires:

Careful experiment design before a single visitor is assigned
Stratified or balanced randomisation to control known confounders
Segmented analysis as standard, not as an afterthought
Awareness of type 1 errors, type 2 errors, and confounding variables
Sufficient sample sizes and minimum test durations, calculated upfront
No mid-test changes to traffic sources or major site elements

Most marketing and ecommerce teams do not have time for this level of statistical rigour on every test. And they should not need to.

Dalton AI handles all of this automatically. It continuously optimises across segments, allocates traffic intelligently using multi-armed bandit algorithms, and identifies confounding patterns before they corrupt results, so you never have to worry about whether your aggregate results are hiding a paradox. No manual segmentation. No statistical interpretation. No risk of shipping a winner that actually loses for every customer segment.

FAQ: Simpson's Paradox in A/B Testing

What causes Simpson's Paradox?

A confounding variable that influences both group membership and the outcome. In A/B testing and split testing, this is usually uneven distribution of a factor, device type, traffic source, new vs. returning status, that affects baseline conversion rates.

How common is Simpson's Paradox in A/B testing?

More common than most teams realise, and frequently undetected. Any test where traffic composition differs between variants is susceptible. Research suggests over 20% of A/B tests may be affected by some form of confounding. Teams that never segment their results would not know.

Is Simpson's Paradox the same as a type 1 error or false positive?

No. A type 1 error (false positive) means concluding an effect exists when it does not, typically due to insufficient sample size or early stopping. Simpson's Paradox involves real effects that appear to reverse when aggregated. Both produce wrong decisions, but for different reasons and require different remedies.

Can A/B testing software detect Simpson's Paradox automatically?

Some advanced testing and CRO platforms flag when segment-level results contradict aggregate results. Most do not. The safest practice is to build segment analysis into your standard post-test process for every significant test. Platforms using multi-armed bandit optimisation, which continuously reallocates traffic based on performance, are inherently less susceptible because they adapt to shifting traffic compositions in real time.

What sample size do I need to avoid these issues?

Use a sample size calculator before launching any test. As a baseline, aim for at least 1,000 visitors per variant per segment you plan to analyse, not just overall. For low-traffic sites, this makes traditional A/B testing impractical, which is why Bayesian methods and multi-armed bandit approaches exist as alternatives.

How do I explain Simpson's Paradox to stakeholders?

Use concrete numbers and tables. Show how a version can win in every segment but lose overall due to traffic mix differences. The doctor example works well: Doctor B is better at treating every type of patient, but treats more severe cases, so the overall rate looks worse. Connect it directly to the test in question with the actual data.

Should I always segment my A/B test results?

Yes, for any test you plan to act on. At minimum, check device type, traffic source, and new vs. returning visitors. If segment winners match the aggregate winner, proceed confidently. If they differ, investigate before shipping anything.

Simpson's Paradox: The Statistical Trap That Ruins A/B Tests