Confidence Level
The probability that your test result is real, not a fluke. A 95% confidence level means you accept a 5% chance you're chasing noise.
A confidence level is the bar you set before you're willing to declare a winner. It's how sure you want to be that a result is real and not just luck. The industry default is 95 percent, borrowed largely from clinical trials, not because there's anything magical about the number itself.
What 95 percent actually means
A 95 percent confidence level describes how often your method gets it right over the long run, not how sure you are about any single test. Set it at 95 percent and you're accepting a 5 percent false-positive rate: if you ran a test on a variant that is genuinely identical to the control, about 1 in 20 such tests would still throw up a "significant" result purely by chance. That 5 percent is the risk you've agreed to live with in exchange for being able to act.
Raising or lowering the bar
The level is a dial, not a law, and it trades two opposite mistakes against each other.
Raise it to 99 percent and you get fewer false positives, but you'll need much more traffic to call any winner, and you'll miss smaller real wins along the way because the bar is too high to clear in reasonable time.
Lower it to 90 percent and you can act faster on less data, but you'll ship more changes that turn out not to help, because you've let more luck through the gate.
Why blindly defaulting to 95 percent can be wrong
The 95 percent convention comes from drug trials, where a false positive could put an ineffective or harmful treatment in front of patients. The cost of being wrong there is enormous, so the bar is set high.
A wrong headline test costs you a handful of sales, not lives. The "right" threshold depends on the cost of being wrong in both directions: the cost of shipping a dud versus the cost of missing a real improvement while you wait for more certainty. For most webshops, rigidly demanding 95 percent on every test means leaving easy wins on the table for the sake of a standard designed for a completely different stakes.
Our take
A confidence level forces a binary question: has the bar been crossed yet, yes or no? You wait, you watch one number climb toward a threshold, and only then do you act. It's a single, all-or-nothing decision made at the end.
Dalton's bandit doesn't work that way. Instead of waiting for one threshold to be crossed, it treats confidence as a continuous spectrum and acts on it the whole time. Variants the system has become sure about earn more traffic; variants it's still unsure about keep getting explored so it can learn more about them. There's no 95-percent-or-bust moment, because allocation is tilting toward what's working with every visitor, in proportion to how strong the evidence is so far.
That sidesteps the threshold problem entirely. You're never stuck choosing between waiting too long for 95 percent certainty or acting too early on a hunch. The system is always doing the proportionate thing: betting more where the evidence is strong, holding back where it's thin, and updating both with every new visitor.