Running Multiple A/B Tests at the Same Time
Worried that the A/B test on your product page will skew another experiment in the checkout? How much data pollution do you get when running multiple A/B tests at the same time? How harmful is the interference?
These questions might cross your mind when planning test runtimes. But as Lukas Vermeer (Director of Experimentation at Vista, ex-Booking.com) puts it:
‘Overlapping experiments are the least of several evils.’
How come? For starters, interaction effects are rare. And if there is an interaction effect between your tests, you can still detect it. Better yet, interaction effects are informative: they bring new information to the table, the cross-effects of multiple tests.
On the other hand, if you don’t run multiple experiments at the same time, the following ‘evils’ can happen:
The ‘Evils’ of Not Running Multiple Experiments Simultaneously
First, you can end up with a Sequential Isolation issue. When you experiment with fewer changes at the same time, you drastically reduce your experimentation velocity. This matters because several factors make up your experimentation program's success:
- The number of tests you run per year
- The percentage (%) of tests you learn from (whether they win or lose)
- The average impact per successful experiment
If you limit the number of tests you run for the sake of avoiding data pollution, you also cut your testing velocity, and with it the number of potential winners.
Second, you can end up with a Traffic (or Parallel) Isolation issue. If you isolate traffic, you split your visitors across tests and drastically reduce your statistical power. To get results, you now need to run longer tests.
You also have a challenge on your hands—now that you’re shipping two isolated tests, you can’t know the interaction effect. If both isolated tests win, you still have to make a third test that combines them so you can measure the interaction effect.
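To put a number on that power loss, here is a rough sketch using the standard normal-approximation sample-size formula for a two-proportion test. The 5% baseline rate, the 0.5-point lift, and the daily traffic figure are all hypothetical:

```python
from statistics import NormalDist

def visitors_per_arm(p_base, p_variant, alpha=0.05, power=0.80):
    """Approximate visitors needed per arm for a two-proportion z-test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p_base * (1 - p_base) + p_variant * (1 - p_variant)
    return (z_alpha + z_beta) ** 2 * variance / (p_base - p_variant) ** 2

n = visitors_per_arm(0.05, 0.055)   # detect a 5.0% -> 5.5% lift
daily_visitors = 10_000             # hypothetical site traffic

# Overlapping tests: every test sees all traffic, split A/B,
# so each test needs 2n visitors in total.
days_overlapping = 2 * n / daily_visitors

# Parallel isolation: two tests each get half the traffic.
days_isolated = 2 * n / (daily_visitors / 2)

print(round(n), days_overlapping, days_isolated)
```

Halving the traffic per test doubles the runtime of each test, which is exactly the 'longer tests' penalty described above.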
The last ‘evil’ is making changes to your website or app without testing. In other words, going blind.
How to Measure Interaction Effects
Georgi Georgiev, the owner of Web Focus and Analytics-toolkit.com, tried to measure the impact of running multiple A/B tests side by side. His conclusion:
‘In short, there is no certain way to establish the likelihood of harmful interference between concurrent A/B tests, nor the impact of such events.’
What he found was that randomization of visitors doesn’t cancel out interaction effects. He created a simple simulation using only winning tests with big uplifts to learn about possible interaction effects.
If you are worried about interaction effects, you can check the chance of possible interactions on your conversions (or traffic) with the XY calculator from Lukas Vermeer.
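As a back-of-the-envelope version of the same idea, you can estimate what share of visitors is exposed to both tests at once. This sketch assumes the two tests are randomized independently, and the traffic shares are made up:

```python
# Share of visitors exposed to each test (hypothetical numbers).
share_test_a = 0.5   # product-page test runs on 50% of traffic
share_test_b = 0.3   # checkout test runs on 30% of traffic

# With independent randomization, the share of visitors exposed to
# both tests is simply the product of the two shares.
share_both = share_test_a * share_test_b
print(f"{share_both:.0%} of visitors see both tests")  # 15%
```

Only visitors in that overlap can possibly produce a cross-test interaction, which is one reason interaction effects are rare in practice.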
Another option is to segment. You can segment test data the same way you would segment by traffic source or device category, and check the outcomes cross-sectionally.
‘You perform significance testing and in case of interaction effects, you would like to include an adjustment for multiple testing such as Šidák correction. If the number of tested interactions is in the hundreds or more, I would also consider using the Benjamini-Hochberg-Yekutieli False Discovery Rate adjustment,’ Georgi Georgiev explains as an alternative option.
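A minimal sketch of the Šidák adjustment Georgiev mentions: instead of testing every interaction at your overall alpha, each individual comparison gets a stricter threshold. The number of tested interactions here is hypothetical:

```python
def sidak_alpha(alpha, m):
    """Per-comparison significance level that keeps the overall
    (family-wise) error rate at `alpha` across m comparisons."""
    return 1 - (1 - alpha) ** (1 / m)

# Testing 10 possible cross-test interactions at an overall alpha of 0.05:
adjusted = sidak_alpha(0.05, 10)
print(round(adjusted, 4))  # each interaction must clear ~0.0051, not 0.05
```

The more interactions you test, the stricter each individual threshold becomes, which is why Georgiev suggests switching to a false-discovery-rate adjustment once the count runs into the hundreds.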
Do you now immediately need to learn about these corrections and adjustments? Not necessarily.
"Generally, these are rigorous ways to address the interaction effects. But also just knowing these effects do hurt detectability level (and if you and your team are ok with this tradeoff), then you take the hit on confidence and move forward," says Ben Labay, Managing Director of Speero.
How to Prevent Interaction Effects
If you still want to prevent any interaction effect between two tests, you have a couple of options. These come in handy for important tests, like those on pricing:
- Sequential Isolation: plan and run one test after another.
- Isolate traffic by creating mutually exclusive groups. Tools like VWO, Sitespect (non-overlay), and certain Optimizely plans offer functionality that ensures people don’t take part in more than one test (Parallel Isolation). But you need adequate traffic to get a big enough sample size for a good test.
- Add an extra variant to your test. If your test runs on the same pages, and you have the same hypothesis or same goal, it’s possible to create a different variant in your test. So instead of A & B, you will have an A, B, and C—where C is Variant B plus the extra test you want to run.
How to Limit Interaction Effects
If you’re set on limiting interaction effects between tests, you have the following options:
- Partial time overlap: instead of running tests with a complete overlap, let them overlap only partially or briefly.
- Wait for the test to finish. If a test is really important and carries a lot of risk, you can always make an exception and wait for other tests to finish.
- Retest: retesting is always good for catching false positives and checking whether your test was influenced by other tests running at the same time. When you run a test again, you can see whether your conclusion still holds in the current (likely changed) circumstances.
You may think that running multiple tests at the same time ends in data pollution and harmful interference, and that you would be better off going blind. But you would be wrong. Running multiple tests simultaneously is the least of all evils.
Interaction effects are rare. Even when they happen, you can still detect them. And that’s a good thing, because interaction effects bring more than they take.