A/B Testing Rigorously (without losing your job)

How you might Lose Your Job if you do the math right
Statistics Setup
One procedure for analyzing results
Pretty Pictures
Conclusion
Acknowledgements

Back to A/B Testing With Multiple Looks.

How you might lose your job if you do the math right

If you're running an A/B test using standard statistical tests and look at your results early, you are not getting the statistical guarantees that you think you are. In How Not To Run An A/B Test, Evan Miller explains the issue, and recommends the standard technique of deciding on a sample size in advance, then only looking at your statistics once. Never before, and never again. Because on that one look you've used up all of your willingness to be wrong.

This suggestion is statistically valid. However it is hazardous to your prospects for continued employment. No boss is going to want to hear that there will be no peeking at your statistics. But suppose that you have managed to get your boss to do that. Then this happens:

"So how is that test going," your boss asks.
"Well," you say, "we reached 20,000 visitors this morning, so I stopped the test and have the results in front of me."
"Great!" your boss exclaims, Which version won?"
You grimace, "Neither. The green button did a bit better, but we only had a p-value of 0.18, so we can't draw a real conclusion at the 95-percent confidence level."
Your boss looks confused, "Well, why did you stop it? I need an answer. Throw another 20,000 visitors at it."
"Well," you say slowly, "we can't. We've used up our allowed five percent of mistakes. We can't just collect more data, we have to accept that the experiment is over and we didn't get an answer. Hopefully we're more lucky with our next A/B test."

They call this a "career limiting conversation." Even if you somehow convince your boss that the math is right, people will wonder why nobody else seems to have this pain point The online testing tools don't discuss it. Blog articles don't mention it. Eventually you'll get replaced by someone who has never heard of Evan Miller. The stats will get done incorrectly. The business will get the answers they want to hear. And nobody will care that you were right.

The good news is that you can be right, without putting your job at risk. This article will explain one valid approach which lets you look at the statistics as often as you like, run tests as long as you need to, and still gives you good guarantees that you will make very few chance errors. This is not the only reasonable approach to use. Future articles will explore others, and discuss the trade-offs that could lead you to prefer one over the others.

Statistics Setup

The upside of doing math right is that we can believe our results. The downside is that we have to do math. We'll limit that math to adding, subtracting, multiplying, dividing, square roots, basic algebra, running computer programs and looking at pretty pictures. But first we need to understand how things are set up so that the rest of it makes sense.

To start with we have a website. Visitors come, and are assigned one of two experiences, A or B. Once you've been assigned an experience, we will make sure that you consistently get that. Some of those users will go through some conversion event, such as signup. For each visitor we will track the first time they go through that conversion event (if ever) and record that fact then. Note that we do not track people who are in the test - just the actual sequence of conversions. This data collection procedure likely is different than what you've seen before, but it will prove sufficient.

So we record a sequence of conversions. Let us suppose that, on average, $r_A$ is the fraction of the people who will convert if they are put into version A. Similarly the average conversion rate for version B is $r_B$. We do not know exactly what those conversion rates are, nor are we going to try to figure that out. All we want to know is which one is bigger, so that we know which one we want everyone to get.

After a very large number of people convert, what portion of them should be A? Well if we have $v$ visitors, then on average $\frac{v}{2}$ wound up in each version. Then:

$$\begin{eqnarray} \text{Portion that are A} & = & \text{# of A's} / \text{# of conversions} \\ & = & \text{# of A's} / (\text{# of A's} + \text{# of B's}) \\ & = & (\frac{v}{2} r_A) / (\frac{v}{2} r_A + \frac{v}{2} r_B) \\ & = & r_A / (r_A + r_B) \\ & = & \frac{r_A}{r_A + r_B} \frac{1/r_A}{1/r_A} \\ & = & \frac{1}{1 + \frac{r_B}{r_A}} \end{eqnarray}$$

If then less than half of the conversions will be in A. But if then more than half of the conversions will be in A. Therefore the ratio $\frac{r_B}{r_A}$ tells us which version is better.

What else do we know about this sequence? Well, if conversion is assumed to be instantaneous, we can make the incredibly strong claim that the sequence of A's and B's that we get behaves statistically like a series of independent (but likely biased) coin flips. At first this seems like an amazing claim - after all every time we we get an A, that is evidence that we put a visitor into A who could have gone into B instead - isn't there correlation of some sort there? But the following thought exercise shows why it is true. First let's randomly label all possible visitors, whether or not we ever see them during the test, A or B. Then as they arrive, instead of labeling them, we just observe the label they already have. It is now clear that what happens with the A's is entirely independent of what happens with the B's.

But there is no statistical difference between what happens if we label everyone randomly in advance, then discover those labels as they arrive, versus randomly labeling visitors as they arrive. Therefore in our original setup what we observe looks like independent flips of a biased coin.

The mathematically inclined reader may wish to insert verbiage about Poisson processes, and demonstrate that this conclusion continues to hold if the actual conversion rates vary over time and conversion is not instantaneous - just so long as the ratio of rates remains constant, and the time to convert for conversions has the same distribution of times. (You can actually relax assumptions even more than that, but that generality is more than we need.)

One procedure for analyzing results

In the last section we found that the sequence of conversions statistically looks like a sequence of independent coin flips with A coming up with a fixed probability. Therefore we're faced with a series of decisions. At each number of conversions $n$, we have $m$ more A conversions than B conversions ($m$ can, of course, be negative). Given $n$ and $m$, we have to decide whether to declare the test done, or say that we do not know the answer and should continue the test.

There are a variety of ways to analyze this problem. We will follow Evan's lead and use a frequentist approach. Which means that we will construct a procedure which, if there is no difference, be unlikely to incorrectly conclude that there is bias. Of course if there is bias, we will be even less likely to make the mistake of calling the bias backwards, so the hypothesis that the versions convert equally is our worst case.

We wind up with something similar to Evan's approach if we wait until we reach a pre-agreed number of conversions, and then try to make up our minds based on the data we now have. This leads to the following problems that our hypothetical boss was rightly unhappy about before:

Even if preliminary evidence says that one version is terrible, we will keep losing conversions until we hit an arbitrary threshold.
If we hit that threshold without having reached statistical proof, we cannot continue the experiment.
Naive attempts to fix the former problems by using the same statistical test multiple times leads to our making far more mistakes than we are willing to accept.

These are the problems we want our new procedure to fix.

Our trick is to always be making decisions, but at a carefully limited rate so that the cumulative likelihood of wrongly deciding that there is a bias when there isn't remains bounded. To that end let us define our maximum cumulative decision limit to be the function $\frac{p n}{n + 10000}$ where $1 - p$ is the desired confidence level that we want to make decisions of. There is nothing special about this function other than the fact that it starts at $0$, gets close to $p$ as $n$ gets large, and happens to be fairly easy to write down. In fact it is just a complicated analog of setting the target sample size at which you plan to compute statistics with Evan's procedure.

Given this function, here is how we can generate our procedure. We can write a computer program that estimates the distribution of $m$ (the difference between how many were A and B) for each number of conversions $n$. For any $n$ where we find that we can decide to stop the test for the most extreme value of $m$ without having our cumulative probability of making a decision exceed the limit function above, we do. This generates a sequence of potential stopping points where we could stop the test. Then when we're running an A/B test, we can at any point look at the number of conversions, and the difference between how many are A and B, then compare with the output of that program and decide whether or not to stop.

This computer program is not hypothetical, and was not that hard to write. Here is the start of an actual sample run.

 bin/conversion-simulation --header --p-limit 0.05 | head
 conversions    difference  finished    standard deviations p-value
 15 15  6.103515625e-05 3.87298334620742    0.000107512
 20 18  8.96453857421875e-05    4.02492235949962    5.6994e-05
 24 20  0.000111103057861328    4.08248290463863    4.4558e-05
 28 22  0.000124886631965637    4.1576092031015 3.216e-05
 31 23  0.000141501426696777    4.13092194661582    3.6132e-05
 34 24  0.000158817390911281    4.11596604342021    3.8556e-05
 37 25  0.000175689201569185    4.10997468263393    3.957e-05
 40 26  0.000191462566363043    4.11096095821889    3.9402e-05
 43 27  0.000205799091418157    4.11746139898033    3.8306e-05

This is outputting a tab delimited file describing a $95\%$ confidence procedure. The columns of interest are conversions and difference. So, for instance, when we get to 15 conversions, if all 15 are of the same type, we can stop the test. If we get to 20 conversions and all but one are of the same type, we can stop the test. And so on.

Making a decision on every single conversion is the possible worst case for making repeated significance errors. Unless you automate your tests, Odds are that you will look less often and make fewer errors. But, as Evan notes, it is during your first few looks that you will experience the majority of your repeated significance errors. Therefore it is reasonable to use the thresholds for this procedure, even if you do look less often.

Pretty Pictures

That program creates a lot of output, that is easier to understand in a graph. I picked a variety of target confidence levels from $80\%$ (more errors than most want to accept) to $99.9\%$ (for organizations who run lots of tests and want them all to be right. I ran each to a million conversions, and then plotted them on the following graph.

In this form the graph is hard to use because the exact cutoff numbers are hard to make out, particularly for small sample sizes. But if you've got statistics experience, the first thing you'll naturally notice is that the shape looks similar to the square root of the number of conversions. In statistics, data often spreads out like a square root, so it is worth trying to divide by the square root to see if we get a more useable graph.

(This paragraph is for people who have taken probability theory or statistics. Please skip if you don't remember that material.) You may remember that the sum of a series of random variables looks like a normal distribution with mean the sum of the means and variance the sum of the variances. That distribution spreads out according to the standard deviation which is the square root of the variance. In our case our random variables are coin flips with value +-1. Under the null hypothesis the mean is 0, the variance is 1, so the distribution after $n$ conversions is a normal with mean 0 and standard deviation $\sqrt{n}$. Therefore we should expect that everything should scale according to the standard deviation, and dividing by it will normalize the graph in a more useful way.

This graph is somewhat complicated, but it lets us calculate A/B test results by hand. I will illustrate the method with numbers that came up in a test that ZipRecruiter was running while I was writing this section. In that test there were $316$ A conversions, and $417$ B conversions.

Calculate the number of standard deviations you are out. That number is the difference between the number of A observations and B observations, divided by the square root of the total number of observations. Sign does not matter. In this case we get $(316 - 417)/\sqrt{316 + 417} = -101/\sqrt{733} = -3.7305..$.
Look on that graph for which lines you're above the cutoff for. In this case we look at the line for $1000$, then count back three to see where $733$ conversions approximately is. We go up that line to $3.73$ standard deviations and see that we're between the $95\%$ and $98\%$ lines. So at a $95\%$ confidence we can stop the test, at a $98\%$ confidence we would not.

The graph makes the exact cutoff numbers hard to tell (you'd have to go to the file for that), but is good enough to be useful in practice.

But this graph tells us more. It can give us a decent idea how much more data we need to use this procedure than we may have been used to using. For instance let's look at the $95\%$ confidence curve. In the graph that curve is mostly between $3$ and $4$ standard deviations. But if you look on a standard table of normal distributions, a $95\%$ confidence threshold is about $2$ standard deviations out. (Actually $1.96$.) Therefore we need to get to $1.5$ to $2$ times as many standard deviations under this procedure. If you have a persistent bias, after $x^2$ times as much data has been collected, a standard deviation is $x$ times as big, and your bias has added up to $x$ times as many standard deviations. Therefore, as a back of the envelope, you're going to need $2.25$ to $4$ times as much data to reach $95\%$ significance under this test than you would if you were being less careful. Of course the upside is that you get valid statistics without having to guess in advance how much data you will need.

For our next trick, let's convert that graph to show the kinds of p-values that we're used to seeing from more traditional statistical tests. Remember that small p-values are many standard deviations out, so the order of the curves flips. Also I will use a log-log scale so that the data is readable.

This graph tells us two things. First of all, if you've got p-values that are being reported using any of a variety of frequentist statistical tests, this can be used to derive rules on whether to stop the A/B test. Secondly if you have been doing multiple looks in the past, this gives you an idea of what level of confidence you've actually been deciding things at. (As Evan claims, it is probably much worse than you thought it was!)

OK, so that is how you use these graphs, but why does it have that shape? In one sense it doesn't really matter, since the details have to do with the arbitrary cutoff rule that we used. But understanding the shape is a good sanity check that we have not gone far astray. Feel free to skip this if you're not interested.

At the first cutoff we used up all of our willingness to make decisiosn for the first few $n$, then later cutoffs come closer together and so we have less willingness to decide for each cutoff. That explains the initial high value in each curve.
The cutoff thresholds only happen at integers, so there is a considerable discretization jumps for small $n$. That explains the jags that smooth out as $n$ gets larger.
The Central Limit Theorem describes the fraction of random sequences at any point that are any number of standard deviations out. But for small $n$, most of the extreme sequences shoot out randomly for a small stretch before fading. Things move more slowly for large $n$, so those sequences shoot out randomly for a longer stretch. But we're cutting off sequences the first time they hit an extreme value, so the number of standard deviations out that you need to go to find a fixed fraction that just hit an extreme value for the first time is always falling. This explains the initial rise in the curve.
In order to limit the total number of errors, our willingness to end decisions at $n$ has to become very small for large $n$. For large $n$ this effect takes over, and that is why the curve eventually starts falling again. In fact we make half of our decisions on the null hypothesis by $10000$ observations, and the placement of the peak near that number is probably not accidental.
If the confidence level is low, then a significant fraction of possible random walks are stopped, which reduces how many random sequences there are in the tails. This effect is negligible for the high confidence curves, which is why the low confidence curves fall back more slowly than the high confidence curves. (The effect is small - compare where they are at $100$ and $1,000,000$ to see it.)

Conclusion

This article has walked through a very different way of of solving the issue of repeated significance testing errors that Evan Miller raised in How Not To Run An A/B Test. The big advantage of this procedure over the standard statistical approach that Evan Miller recommends is that you will find yourself able to stop bad tests fast, and keep running slow tests until they eventually reach significance. This avoids painful conversations that are likely to leave various co-workers (including your boss) frustrated with you.

This article described a universal procedure that hits a very strong statistical standard. See this picture for how you can evaluate it by hand. Following this standard requires several times more data than you would use if you were less careful (or if you avoided multiple looks). In practice this procedure is better than is actually needed, but that will be a topic for another article.

Acknowledgements

My thanks in chronological order to Benjamin Yu, Kaitlyn Parkhurst, Curtis Poe, Eric Hammond, Joe Edmonds, Amy Price, and one more who asked to be anonymous for feedback on various drafts. Without their help, this article would have been much worse. Any remaining mistakes should be emailed to Ben Tilly.

Further discussion of this article may be found on Hacker News.