In this article
- The false-positive problem nobody budgets for
- The “average visitor” is hiding the real story
- What testing is genuinely good at (and what it isn’t)
- Diagnose first, then test
- Frequently asked questions
- Is A/B testing dead?
- Why do my A/B tests keep coming back inconclusive?
- Can an A/B test tell me why visitors don’t convert?
A/B testing is the gold standard of conversion optimisation — rigorous, data-driven, objective. Or so the story goes. In practice, a lot of “winning” variants are statistical noise that got shipped with a confidence badge, and even the real winners leave the most important question unanswered.
This isn’t an argument against testing. It’s an argument for being honest about what it can and can’t do — and for doing the cheaper thing first.
A green “statistically significant” banner is not the same as the truth. It’s a probability, and you’re rolling the dice more often than you think.
The false-positive problem nobody budgets for
Run tests at 95% confidence and, by definition, about 1 in 20 changes that do nothing at all will still come back as a “win.” Real programs do far worse than that 5% baseline, because of how they’re run:
- Peeking. Checking results daily and stopping as soon as significance appears inflates false positives dramatically: studies put the real error rate in continuously monitored tests as high as 20–30%, not the nominal 5%.
- Too many variants. Google’s famous “41 shades of blue” test illustrates the multiple-comparisons trap: test enough variants and something will look significant by chance. With 40 independent comparisons at 95% confidence and no real effect anywhere, the probability of at least one false positive is about 87% (1 − 0.95^40).
- Underpowered tests. Most pages don’t have the traffic to detect small effects, so teams call winners on samples far too small to mean anything.
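The first two failure modes are easy to check for yourself. The sketch below is illustrative only: it computes the multiple-comparisons figure directly, then simulates A/A tests (both arms identical, so every “win” is a fluke) and compares checking every day against looking once at the end. The 10% conversion rate, 200 visitors per arm per day, 14-day run and the function names are all assumptions made for the example, not figures from any real program.

```python
import math
import random


def familywise_error(alpha: float, comparisons: int) -> float:
    """Chance of at least one false positive across independent null comparisons."""
    return 1 - (1 - alpha) ** comparisons


def z_statistic(conv_a: int, conv_b: int, n_per_arm: int) -> float:
    """Pooled two-proportion z statistic for two equally sized arms."""
    pooled = (conv_a + conv_b) / (2 * n_per_arm)
    se = math.sqrt(pooled * (1 - pooled) * 2 / n_per_arm)
    if se == 0:
        return 0.0
    return (conv_b - conv_a) / n_per_arm / se


def run_aa_test(rng, rate=0.10, daily_visitors=200, days=14, z_crit=1.96):
    """Simulate one A/A test with hypothetical traffic numbers.

    Returns (peeking_win, final_win): whether any daily look crossed the
    95% threshold, and whether the single look on the last day did.
    """
    conv_a = conv_b = n = 0
    peeking_win = False
    z = 0.0
    for _ in range(days):
        conv_a += sum(rng.random() < rate for _ in range(daily_visitors))
        conv_b += sum(rng.random() < rate for _ in range(daily_visitors))
        n += daily_visitors
        z = z_statistic(conv_a, conv_b, n)
        peeking_win = peeking_win or abs(z) > z_crit
    return peeking_win, abs(z) > z_crit


def false_positive_rates(trials=500, seed=1):
    """Return (rate with daily peeking, rate with one look at the end)."""
    rng = random.Random(seed)
    results = [run_aa_test(rng) for _ in range(trials)]
    peeking = sum(p for p, _ in results) / trials
    final = sum(f for _, f in results) / trials
    return peeking, final


if __name__ == "__main__":
    print(f"40 comparisons at alpha=0.05: {familywise_error(0.05, 40):.1%}")
    peeking, final = false_positive_rates()
    print(f"false positives, daily peeking: {peeking:.1%}")
    print(f"false positives, single look:   {final:.1%}")
```

The single-look rate should land near the nominal 5%. The peeking rate can only be equal or higher, because the final look is just one of the daily looks, and the gap grows with the number of looks.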
The result: a meaningful share of the “lifts” in your optimisation log never existed. You changed a button, the number wobbled, you declared victory, and you moved on — having learned nothing.
The “average visitor” is hiding the real story
There’s a deeper problem. A/B tests report an average effect across everyone who saw the page. But your page is read by several different buyer types, and a change can help one while hurting another.
This is the averaging trap, a close relative of Simpson’s Paradox, applied to your funnel: a variant that shows a flat +0.5% overall might be +15% for your end users and −10% for your economic buyers, netting out to “no clear winner.” You ship it (or don’t) and never learn that you just made the page worse for the buyers who write the cheques.
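The arithmetic behind that kind of wash is worth seeing once. The sketch below uses made-up numbers chosen to reproduce the figures in the example: a 42/58 traffic split between the two buyer types and a 10% baseline conversion rate for both. The `Segment` class and both helper functions are hypothetical names for illustration.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    name: str
    traffic_share: float   # fraction of all visitors in this segment (assumed)
    control_rate: float    # conversion rate on the original page (assumed)
    variant_rate: float    # conversion rate on the variant (assumed)


def relative_lift(control_rate: float, variant_rate: float) -> float:
    """Relative change in conversion rate, e.g. 0.15 for +15%."""
    return variant_rate / control_rate - 1


def blended_rates(segments):
    """Traffic-weighted conversion rates: what the overall test report sees."""
    control = sum(s.traffic_share * s.control_rate for s in segments)
    variant = sum(s.traffic_share * s.variant_rate for s in segments)
    return control, variant


segments = [
    Segment("end users", 0.42, 0.100, 0.115),
    Segment("economic buyers", 0.58, 0.100, 0.090),
]

for s in segments:
    print(f"{s.name}: {relative_lift(s.control_rate, s.variant_rate):+.1%}")
control, variant = blended_rates(segments)
print(f"overall: {relative_lift(control, variant):+.1%}")
# end users: +15.0%
# economic buyers: -10.0%
# overall: +0.5%
```

Shift the traffic split a few points either way and the overall number flips sign, while the per-segment story stays exactly the same.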
The “average visitor” your test optimises for doesn’t exist. Real pages are read by specific people with specific objections — and averages bury exactly the signal you need.
What testing is genuinely good at (and what it isn’t)
To be fair: A/B testing is excellent for refining a page you fundamentally understand, at scale, once you have the traffic. It’s the right tool for “which of these two proven directions wins.”
It’s a poor tool for most of the other questions teams ask of it:
| Question | A/B testing | What you actually need |
|---|---|---|
| “Which refinement wins?” | ✅ Great (with enough traffic) | A/B test |
| “Why are people leaving?” | ❌ Silent | Qualitative / friction diagnosis |
| “Is my value prop clear?” | ❌ Too slow, no “why” | Buyer-perspective read |
| “What should I test first?” | ❌ Can’t tell you | Friction diagnosis → hypotheses |
| “Will this page work pre-launch?” | ❌ Needs traffic you don’t have | Pre-launch testing |
The trap is using a refinement tool to do diagnosis work — burning weeks of traffic to discover, vaguely, that “version B did a bit better,” when what you needed was to know why version A was losing people in the first place.
Diagnose first, then test
The sequence that actually compounds:
1. Diagnose the friction. Find where specific buyers stall and why. This generates real hypotheses instead of “let’s try a green button.”
2. Fix the obvious leaks. The unclear value prop, the unanswered objection, the missing proof: these often dwarf any A/B-able tweak.
3. A/B test the genuine trade-offs. Once you have traffic and two defensible directions, let the data referee.
Most teams skip step 1 entirely and start at step 3, which is why their test logs are full of inconclusive results and “winners” that don’t move the business.
The diagnosis step used to require weeks of user research. Buyer Clone collapses it: buyer-persona agents read your page, report where each buyer type hesitates and why, and hand you a ranked list of friction to fix — per persona, so you don’t get fooled by the average. You walk into your A/B test with real hypotheses instead of a coin to flip.
Test less. Diagnose more. Then test the things actually worth testing.
Frequently asked questions
Is A/B testing dead?
No — but it’s overused for the wrong job. A/B testing is excellent for refining a page you already understand, given enough traffic. It’s poor at diagnosis (why people leave) and impossible before launch. Use it to settle genuine trade-offs, not to find out what’s wrong.
Why do my A/B tests keep coming back inconclusive?
Usually too little traffic (underpowered tests), testing trivial changes that can’t move the number, or stopping early when a result looks significant (“peeking”). Inconclusive results often mean you’re testing refinements when you should be diagnosing bigger friction first.
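You can check the “too little traffic” cause before a test ever starts. The sketch below uses the standard normal-approximation sample-size formula for comparing two conversion rates. The 3% baseline, the 95% confidence and 80% power defaults, and the function name are assumptions for illustration, so treat the output as a rough planning number rather than a guarantee.

```python
import math
from statistics import NormalDist


def visitors_per_arm(baseline_rate, relative_lift, alpha=0.05, power=0.80):
    """Approximate visitors needed per arm for a two-sided two-proportion z-test."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    z_power = NormalDist().inv_cdf(power)          # about 0.84 for 80% power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_power) ** 2 * variance / (p2 - p1) ** 2)


if __name__ == "__main__":
    # Hypothetical page converting at 3%; how much traffic does each lift need?
    for lift in (0.05, 0.10, 0.20):
        n = visitors_per_arm(0.03, lift)
        print(f"detect a {lift:.0%} relative lift: {n:,} visitors per arm")
```

Under these assumptions, detecting a 10% relative lift on a 3% baseline takes a little over 50,000 visitors per arm, and halving the lift you want to detect roughly quadruples the traffic required.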
Can an A/B test tell me why visitors don’t convert?
No. A test reports which variant performed better on average — never the reason. To understand why (unclear value, missing proof, an unanswered objection), you need qualitative or buyer-perspective methods. Pair the two: diagnose for the why, test for the which.