
Multi-Armed Bandit

A multi-armed bandit is an algorithm that dynamically allocates traffic to the best-performing option in real time, balancing data collection with immediate performance.

A multi-armed bandit is an algorithm for making a sequence of decisions under uncertainty, where each decision both earns a reward and teaches you something about which decisions are best. It continuously shifts more of its choices toward the options that are performing well, while still trying the others often enough to keep learning. In conversion rate optimization, it is the engine that decides which version of a page to show each visitor, sending more traffic to the variants that convert and less to the ones that don't, automatically and in real time.

Where the name comes from

A slot machine is sometimes called a "one-armed bandit," because it has a single lever and a reliable habit of taking your money. Now imagine a row of these machines, each paying out at a different and unknown rate. You have a limited number of coins. Which machines do you play, and for how long, to walk away with the most money?

That is the multi-armed bandit problem. Play every machine equally and you learn a lot but waste coins on the bad ones. Commit early to whichever machine looked best in the first few pulls and you risk locking yourself onto a mediocre option before you ever find the great one.

Exploration versus exploitation

This tension has a name: the exploration-exploitation tradeoff, and it is the heart of the problem. Exploration means trying options to gather information. Exploitation means using what you already know to maximize reward right now. Lean too hard on exploration and you keep spending on options you've already learned are weak. Lean too hard on exploitation and you bet everything on incomplete evidence.

A multi-armed bandit algorithm is, at its core, a principled way to balance these two automatically, instead of guessing at the right moment to stop testing and start committing.

How it applies to websites

Swap the slot machines for variants of a web page and the coins for incoming visitors, and the casino becomes CRO. Each variant of your headline, hero image, or call to action is an "arm." Each visitor is a pull of the lever. A conversion is the reward.

The bandit's job is to route the most visitors to the best-converting variant, as quickly as it can be confident which one that is, without prematurely abandoning a variant that simply had a slow start. Unlike a fixed test, it acts on what it learns while the experiment is still running.

Multi-armed bandit vs. traditional A/B testing

A classic A/B test is static. You set the traffic split up front, almost always 50/50, and hold it there until the test has collected enough data to reach statistical significance. The allocation never changes mid-test, and you are not supposed to act on the results until the end. This is statistically clean, but it has a costly side effect: the worse a variant is, the more it hurts you, because half your traffic keeps hitting it for the entire duration of the test.

A bandit is dynamic. It updates traffic allocation continuously as conversions arrive. A variant that pulls ahead earns more visitors; a variant that lags gets throttled. No one has to watch a dashboard and decide when to call it, because the algorithm is already acting on the evidence as it accumulates.

The difference is most visible as money. Suppose one variant converts at 5 percent and the control at 3 percent. A fixed A/B test keeps sending half of all visitors to the 3 percent control for the full run, and for every hundred of those visitors you forgo about two conversions you could have captured. A bandit notices the gap early and steers traffic toward the stronger variant, so you earn more during the test rather than only after a winner is finally declared. The experiment pays for itself while it runs.
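To put rough numbers on the 5 percent versus 3 percent example, here is a minimal arithmetic sketch. The 10,000-visitor total and the 80 percent share are illustrative assumptions, not figures from any real test, and `expected_conversions` is a hypothetical helper. A real bandit's share changes over time; a single average share is used here only to keep the sum simple.

```python
def expected_conversions(visitors, share_to_variant, variant_rate, control_rate):
    """Expected conversions when a fixed share of visitors sees the variant."""
    variant_visitors = visitors * share_to_variant
    control_visitors = visitors - variant_visitors
    return variant_visitors * variant_rate + control_visitors * control_rate


# Assumed: 10,000 visitors over the run, variant at 5%, control at 3%.
fixed_split = expected_conversions(10_000, 0.50, 0.05, 0.03)

# Assumed: adaptive allocation that averages 80% of traffic to the variant.
bandit_like = expected_conversions(10_000, 0.80, 0.05, 0.03)

print(f"fixed 50/50 split:   {fixed_split:.0f} expected conversions")
print(f"80/20 average split: {bandit_like:.0f} expected conversions")
```

Under these assumptions the fixed split yields about 400 expected conversions and the adaptive one about 460. The gap grows with the difference between the two rates and with how quickly traffic shifts.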

The tradeoff is that a bandit is built to earn, not to explain. If you need a precise, defensible effect size with a 95 percent confidence interval, a fixed A/B test is the better instrument, because deliberately shifting traffic toward the winner muddies the clean comparison. If your goal is more conversions rather than a publishable result, the bandit wins.

Thompson Sampling: the math, briefly

The most widely used bandit method, and the one behind most serious systems, is Thompson Sampling. It was first described by William R. Thompson in 1933, in a paper on how to allocate patients between two medical treatments without condemning half of them to the worse one. It sat largely unused for decades before becoming a cornerstone of modern online decision-making.

The intuition: instead of tracking a single conversion rate per variant, the algorithm holds a full probability distribution, its complete belief about what that variant's true rate might be, including how uncertain it is. A new variant with few visitors has a wide, fuzzy distribution. A variant with thousands of visitors has a narrow, confident one.
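One common way to represent that belief, assumed here for illustration since nothing above fixes a particular model, is a Beta distribution over each variant's conversion rate, starting from a uniform Beta(1, 1) prior. The sketch below uses a hypothetical helper, `beta_posterior`, and made-up visitor counts to show how the spread of the belief shrinks as data accumulates.

```python
import math


def beta_posterior(conversions, visitors, prior_alpha=1.0, prior_beta=1.0):
    """Mean and standard deviation of a Beta posterior over a conversion rate.

    Assumes each visit is an independent convert / don't-convert outcome and
    a Beta(prior_alpha, prior_beta) prior (uniform by default).
    """
    alpha = prior_alpha + conversions
    beta = prior_beta + (visitors - conversions)
    total = alpha + beta
    mean = alpha / total
    variance = alpha * beta / (total ** 2 * (total + 1))
    return mean, math.sqrt(variance)


# Illustrative counts: a brand-new variant and a well-established one.
new_mean, new_sd = beta_posterior(conversions=1, visitors=20)
old_mean, old_sd = beta_posterior(conversions=250, visitors=5000)

print(f"new variant: mean {new_mean:.3f}, spread {new_sd:.3f}")
print(f"old variant: mean {old_mean:.3f}, spread {old_sd:.3f}")
```

The new variant's spread is many times wider than the established one's, which is the "wide, fuzzy" versus "narrow, confident" contrast in numeric form.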

To choose where the next visitor goes, the algorithm draws one random sample from each variant's distribution and picks whichever sample came out highest. Then it records the outcome, updates that variant's distribution, and repeats. This one move handles exploration and exploitation together: strong variants get picked often because their distributions sit high, but uncertain new variants still get picked sometimes, precisely because their width means a sample occasionally lands high. As data accumulates, the distributions tighten and traffic concentrates on the real winner, no sooner and no later than the evidence warrants.
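The sample-pick-update loop described above can be sketched in a few lines. This is a minimal simulation under stated assumptions, not a production allocator: it assumes Beta(1, 1) priors, two variants with made-up true rates of 3 and 5 percent, and simulated visitors. The names `choose_variant` and `simulate` are hypothetical.

```python
import random


def choose_variant(stats, rng):
    """Draw one sample per variant from its Beta posterior; pick the highest.

    stats maps variant name -> [conversions, non_conversions].
    """
    samples = {
        name: rng.betavariate(1 + wins, 1 + losses)
        for name, (wins, losses) in stats.items()
    }
    return max(samples, key=samples.get)


def simulate(true_rates, visitors, seed=0):
    """Route simulated visitors one at a time, updating beliefs after each."""
    rng = random.Random(seed)
    stats = {name: [0, 0] for name in true_rates}
    for _ in range(visitors):
        name = choose_variant(stats, rng)
        converted = rng.random() < true_rates[name]  # simulated outcome
        stats[name][0 if converted else 1] += 1      # update that variant only
    return stats


# Assumed true rates; in a real test these are unknown.
stats = simulate({"control": 0.03, "variant": 0.05}, visitors=20_000)
traffic = {name: wins + losses for name, (wins, losses) in stats.items()}
print(traffic)
```

In a run like this, most of the traffic should end up on the 5 percent variant, while the control still receives occasional visitors early on when its distribution is wide. The exact split depends on the random seed.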

For a deeper treatment, the standard modern reference is "A Tutorial on Thompson Sampling" (Russo, Van Roy, Kazerouni, Osband, and Wen, 2018). For evidence that it actually outperforms the better-known alternatives on real data, see Chapelle and Li (2011). For the original idea, see Thompson (1933).

Who uses multi-armed bandits

The largest data-driven companies run on them. Netflix uses bandit algorithms to choose the artwork shown for each title, learning which image earns the most plays and increasingly which image earns the most plays for a specific viewer; their engineering team documented a measurable lift in core engagement metrics after replacing a slower train-then-test pipeline with a contextual bandit (Netflix Technology Blog). Amazon applies the same philosophy at the level of the whole site, running a constant stream of experiments and letting winners accumulate rather than betting on occasional large redesigns. Bandits are also standard in display advertising, news recommendation, and clinical trial design, anywhere the cost of showing the wrong thing for too long is high.

Our take

This is where Dalton comes in, because the gap between "the best companies in the world run on bandits" and "your webshop has never run one" is the gap we built the company to close.

Historically, this technology was locked behind two walls. The first is traffic: traditional A/B testing tools often need around 100,000 sessions a month before a test reaches significance in any reasonable time, which is why most stores were quietly told CRO "wasn't for them yet." The second is engineering: building a bandit that allocates traffic correctly, in real time, without slowing your site, is the kind of system Netflix staffs with research teams. A mid-market brand was never going to build it in-house, and the enterprise tools that offered it came with enterprise prices and setup times.

Dalton removes both walls. Underneath, it is a contextual multi-armed bandit using Thompson Sampling on a Bayesian model, the same lineage that runs from Thompson's 1933 idea through the research that proved it wins in practice. We engineered it to make allocation decisions at the edge in under 50 milliseconds, so it never slows your store.

What that means in practice is the part we care about most. You can run many experiments at once across your webshop, approve on-brand variants we generate for you, and then let the system funnel traffic toward winners and throttle underperformers on its own. No babysitting dashboards, no manual stop-the-test decisions. The store tests, learns, and reallocates continuously.

That is what we mean by a self-improving webshop: not a site you redesign every few years and hope for the best, but a site that gets a little better every week on its own, the way Amazon's does, using the same class of algorithm Netflix uses, finally available to brands that were told they were too small to play.