Imagine you're at a casino with a row of slot machines, each promising a payout but with unknown odds. You have a pocket full of coins and a mission to win as much as possible. How do you play? Do you stick to one machine, or try them all to find the luckiest one?
This classic gambler’s dilemma is exactly what the multi-armed bandit problem describes: a choice between exploration (trying different options to discover their payoff) and exploitation (sticking with the option that seems best so far).
Now, replace slot machines with different versions of your website — headlines, CTAs, layouts — and each visitor becomes a coin you spend. The “payout” isn’t a jackpot but a conversion.
This is the essence of multi-armed bandit testing: an optimisation framework that continuously learns, reallocates traffic toward winners, and squeezes more value out of every visitor.
The Problem with A/B Testing
For decades, A/B testing has been the go-to method for website optimisation. The process is simple: split traffic evenly, measure performance, and declare a winner once statistical significance is reached.
At large scale, this works. But for most companies, A/B testing has painful limitations:
- Slow results — Small to medium websites may need months to finish a single test.
- Traffic waste — In a standard two-variant test, half of your visitors see the losing version until the test ends.
- Stop–start cycles — Optimisation happens in fits and starts, not continuously.
- One-size-fits-all mindset — A/B assumes there’s one best variant for everyone. In reality, different visitors often respond differently.
- Operational overhead — Tests mean copywriting, design, development tickets, QA, analysis. Many never get off the backlog.
This is why growth teams are adopting multi-armed bandits as a faster, leaner alternative.
Faster Learning, Less Waste: Why Growth Teams Embrace Bandits
Multi-armed bandits directly address the weaknesses of A/B testing by reallocating traffic as results emerge. Instead of splitting evenly until the end, they tilt traffic toward winners while continuing to explore.
Here’s what makes them attractive:
- Higher ROI on experiments
  A/B testing wastes half of your test traffic on losers. Bandits reduce this waste by sending more visitors to winners earlier.
- Faster time to results
  You don’t have to wait for full significance before acting. Bandits adapt continuously, showing improvements within days rather than months.
- Continuous discovery of wins
  No stop–start cycles. Bandits keep learning, reallocating, and optimising 24/7.
- Scales with more variations
  Testing 3–5 versions with A/B dilutes traffic badly. Bandits trim poor performers quickly and keep experimenting with promising ones.
- Smaller sample sizes needed
  Bandits deliver value even with limited traffic, because they learn and exploit at the same time.
For growth teams, the result is clear: faster insights, less waste, and more conversions per visitor.
The Math of Bandits
Bandits are designed to minimise regret — the conversions lost by showing suboptimal variants.
Formally:
Regret(T) = Expected reward of optimal policy – Expected reward of chosen policy
Where T = number of trials (visitors).
- In A/B tests, regret is large: losers keep getting traffic until the end.
- In bandits, regret is smaller: allocation shifts toward winners quickly.
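A quick back-of-the-envelope example (hypothetical numbers): suppose the best variant converts at 5% and the loser at 3%. A 50/50 A/B split averages 4%, so over T = 10,000 visitors, Regret(10,000) ≈ 10,000 × (0.05 − 0.04) = 100 lost conversions. A bandit that quickly shifts 90% of traffic to the winner averages about 4.8%, cutting the regret to roughly 20.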
Core Bandit Algorithms
1. Epsilon-Greedy
- With probability 1 − ε, show the best performer so far.
- With probability ε, explore randomly.
Pros: simple, easy to implement.
Cons: exploration isn’t adaptive; fixed ε may be inefficient.
2. Upper Confidence Bound (UCB)
Each variant is scored as:
score_i = mean_i + c * sqrt( (ln N) / n_i )
Where:
- mean_i = observed conversion rate of variant i
- n_i = number of visitors shown variant i
- N = total visitors
- c = exploration parameter
Intuition: variants with less data get a bigger “uncertainty bonus,” ensuring they’re not ignored prematurely.
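A minimal sketch of that scoring rule, assuming the same hypothetical per-variant counters as the epsilon-greedy example:

```python
import math

# UCB sketch implementing the score above: mean + c * sqrt(ln(N) / n_i).
C = 1.0  # exploration parameter

def ucb_score(variant_stats, total_shows):
    n_i = variant_stats["shows"]
    if n_i == 0:
        return float("inf")  # unseen variants are shown first
    mean_i = variant_stats["conversions"] / n_i
    return mean_i + C * math.sqrt(math.log(total_shows) / n_i)

def choose_variant(stats):
    total = sum(s["shows"] for s in stats.values())
    if total == 0:
        return next(iter(stats))  # no data yet: show any variant
    return max(stats, key=lambda v: ucb_score(stats[v], total))
```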
3. Thompson Sampling (Bayesian Bandit)
The most popular approach today. Each variant’s conversion rate is modelled as a Beta distribution:
- Start with Beta(1,1).
- Each success = increment α, each failure = increment β.
- For each visitor: sample from each distribution, show the variant with the highest draw.
Example:
- Variant A: 8 conversions out of 100 visitors → Beta(1 + 8, 1 + 92) = Beta(9, 93)
- Variant B: 5 conversions out of 50 visitors → Beta(1 + 5, 1 + 45) = Beta(6, 46)
B’s observed rate (10%) is higher than A’s (8%), but it rests on half as many visitors, so its posterior is wider. Thompson Sampling reflects that uncertainty: on any given visitor either variant can win the draw, so both keep getting fair exploration until the evidence firms up.
Over time, the better variant wins most draws — but exploration never disappears completely.
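A minimal sketch of this per-visitor sampling, seeded with the hypothetical counts from the example above:

```python
import random

# Thompson Sampling sketch. Each variant keeps success/failure counts, and a
# Beta(1 + successes, 1 + failures) posterior is sampled for every visitor.
counts = {
    "A": {"successes": 8, "failures": 92},
    "B": {"successes": 5, "failures": 45},
}

def choose_variant():
    # Draw one sample per variant and show the variant with the highest draw.
    draws = {
        v: random.betavariate(1 + c["successes"], 1 + c["failures"])
        for v, c in counts.items()
    }
    return max(draws, key=draws.get)

def record(variant, converted):
    key = "successes" if converted else "failures"
    counts[variant][key] += 1
```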
This balance of exploration and exploitation is why companies like Microsoft, Google, and Expedia use Thompson Sampling in production.
Industry Examples
- Netflix: Uses contextual bandits to choose thumbnails for each user. The same title might be shown with different artwork depending on your viewing history. This is contextual personalisation in action.
- Expedia: Built AdaptEx, a contextual bandit platform that reallocates traffic in real time and tailors offers to visitor context.
- Microsoft: Applied Thompson Sampling to ad serving and news recommendations. Result: higher CTR and faster convergence than A/B.
- Amazon: Runs competing recommendation models in parallel, with bandits dynamically sending more traffic to whichever model is performing best.
Contextual Bandits: The Bridge to Personalisation
Standard bandits look for a single best option overall. But visitors are not identical:
- A Google ad click behaves differently from a Facebook retargeting visitor.
- A new user needs different nudges than a loyal customer.
Contextual bandits extend the framework by considering features about each visitor — device, source, history — when making decisions.
Instead of asking “Which variant is best?”, they ask “Which variant is best for this visitor in this context?”
Examples:
- Headline A for search traffic.
- Headline B for repeat visitors.
- Headline C for mobile.
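A deliberately simplified sketch of the idea: run one Thompson Sampling bandit per visitor segment. Production contextual bandits typically model visitor features jointly (e.g. LinUCB), but per-segment counts convey the intuition; the segment and headline names below are hypothetical:

```python
import random
from collections import defaultdict

# One Thompson Sampling posterior per (segment, variant) pair.
VARIANTS = ["headline_a", "headline_b", "headline_c"]
counts = defaultdict(lambda: {"successes": 0, "failures": 0})

def choose_variant(segment):
    draws = {}
    for v in VARIANTS:
        c = counts[(segment, v)]
        draws[v] = random.betavariate(1 + c["successes"], 1 + c["failures"])
    return max(draws, key=draws.get)

def record(segment, variant, converted):
    key = "successes" if converted else "failures"
    counts[(segment, variant)][key] += 1

# e.g. choose_variant("mobile_search") picks the headline most likely to
# convert for that segment while still exploring under uncertainty.
```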
This is the mathematical foundation of personalisation. Netflix and Expedia use it already. Dalton brings it to websites: each visitor gets the experience most likely to convert for them, automatically.
Why This Matters Now
Three trends make bandits especially relevant today:
- Efficiency pressures
  Few companies can afford to waste half their visitors on losing tests.
- AI-driven experimentation
  Generating and implementing variants no longer requires weeks of design and development. AI removes the bottleneck.
- Personalisation expectations
  Customers expect tailored experiences. One-size-fits-all testing feels outdated.
These forces explain why websites are moving toward self-improving optimisation systems.
Conclusion
Multi-armed bandits address the core weaknesses of A/B testing: they learn faster, waste less traffic, and continuously optimise. Contextual bandits extend this to personalisation, ensuring every visitor gets the experience most likely to convert.
The future of optimisation is continuous, adaptive, and personalised.
Dalton brings this to any website with one line of code.