There are more web-based calculators, vendor calculators, and downloadable spreadsheet calculators than you can count. But not all are created equal.
A/B test statistics aren’t limited to sample size calculations and statistical significance. Some A/B test calculators look the same but differ in their underlying statistics; others are obviously different, but the value of their distinguishing features is unclear.
This post highlights both aspects you should consider—statistics and features—when choosing a test calculator and interpreting results.
Problems with existing A/B test calculators
First, let’s poke the bear a bit by discussing two common problems—on opposite ends of the spectrum—with current A/B test calculators:
- They’re not flexible.
- They’re too complicated for most users.
1. They’re not flexible
A very real and frustrating problem with current test calculators is that few offer the flexibility needed to run proper statistics. Sure, they all have a method to calculate statistical significance between two variations, but there’s much more to testing than that (more later).
A big reason flexibility isn’t available—especially from testing tool calculators—is that the statistics have been simplified to enable the average user to punch in a few data points, like the number of visitors and conversions per variation, and start a test.
The output is a binary response (winner or loser), which isn’t statistics.
Oversimplifying does the analyst little good in the end. They’re left with a lot of questions and few options for proper planning and test analysis.
If you’ve been running experiments for a while, you know that every test comes at a cost to your business. If you rely on an inflexible calculator, a near-term convenience can morph into a long-term liability.
2. They’re too complicated for most users
Flexibility cuts both ways. Sometimes, what kills a good product is that it’s not widely understood or too complicated for the intended user.
One of my favorite scenes from Silicon Valley is when Richard, the CEO of a fictional startup, Pied Piper, tries to explain his product to a focus group. His product could transform how people use the Internet—but the user group can’t wrap their heads around it.
That failure nullifies Richard’s genius. Superior, even revolutionary, products fail in the face of basic misunderstanding.
It happens to A/B test calculators, too. Even good calculators that don’t ignore the value of more complex calculations often assume that users know how to work with these statistical concepts.
But if, like many analysts, you’ve relied on basic calculators, you might have a hard time explaining:
- The relationship between statistical power and MDE;
- How to control for the increased risk of false positives when running an A/B/n test with multiple variations; or
- How to calculate the time that your test will need to run.
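To make the first relationship concrete, here’s a minimal sketch of the standard two-proportion sample-size formula, which ties together the baseline rate, the minimum detectable effect (MDE), significance, and power. Many calculators use some variant of this formula with their own adjustments, so treat the exact numbers as approximate; the function name is illustrative, not any vendor’s API.

```python
import math
from statistics import NormalDist

def sample_size_per_variant(baseline, mde_rel, alpha=0.05, power=0.8):
    """Approximate sample size per variant for a two-sided
    two-proportion z-test. A textbook formula, not any specific
    calculator's implementation."""
    p1 = baseline
    p2 = baseline * (1 + mde_rel)          # expected rate under the MDE
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    p_bar = (p1 + p2) / 2
    numerator = (z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return math.ceil(numerator / (p1 - p2) ** 2)

# A 5% baseline conversion rate with a 20% relative MDE needs roughly
# 8,000+ visitors per variant at 80% power.
n = sample_size_per_variant(0.05, 0.20)
```

Notice how the inputs trade off: a smaller MDE or higher power both push the required sample size up, which is exactly the relationship basic calculators hide.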
A well-developed statistical calculator meets basic needs but also strikes a balance between offering flexibility and ensuring users get clear value from more complex calculations. To meet the latter goal, calculators may need to remove some advanced features; alternatively, they could succeed by adding supporting documentation to ensure those features are used correctly.
In short, a great A/B test calculator should help the average user get a little better at A/B testing. So what does that look like? And which features matter most?
4 things that matter when choosing an A/B test calculator
Choosing which calculator(s) to use for your experimentation program is a big decision. Like picking a pair of running shoes, you want to make sure they were created for running (not basketball), and that they’re light, comfortable, and so on.
Here are four things worth considering.
1. Centralized calculators
A major usability gap for A/B test calculators is that there isn’t a single, centralized source for all calculators—sample size, duration, statistical significance, etc. Most of the time, they’re designed to make just one type of calculation.
For example, Evan Miller’s sample size calculator calculates the sample size needed for tests but not how long your test will need to run. VWO’s test duration calculator calculates how long your test will need to run but not the sample size.
A major risk with decentralized calculators from various providers is that if you’re not familiar with the statistical methods of each one, you’ll likely end up with conflicting or misleading results.
If you run a sample size calculation with Optimizely’s sample size calculator, for example, then switch to the VWO test duration calculator to estimate the time needed to run your test, results will conflict.
For example, using the sample statistics below, Optimizely will estimate a total sample size of 280,000 for a standard A/B test. If you receive 10,000 visitors per day, it will take roughly 28 days for your test to bake.
If you switch to VWO’s test duration calculator to calculate the time needed for your test to bake, however, you’ll notice that using the same inputs yields a different result—24 days.
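The duration arithmetic itself is trivial; the discrepancy comes entirely from how each provider derives the required sample size. Using the Optimizely numbers above:

```python
import math

# Converting a required total sample into a duration estimate is simple.
# Providers that disagree on duration are really disagreeing on sample size.
total_sample = 280_000        # from the sample-size calculator
daily_visitors = 10_000
days = math.ceil(total_sample / daily_visitors)   # -> 28
```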
Differences in the calculators’ underlying statistics cause this discrepancy, but if you’re not aware of it—or why it’s happening—you’re left only with frustration and uncertainty. And this is just one example. Similar issues surface if you use one provider to calculate sample size and another for post-test analysis.
Finding consistent, convenient features in a set of centralized calculators is essential not just for running an efficient testing program but also to avoid nuanced (but meaningful) statistical differences between providers.
2. Test duration awareness
All A/B test calculators have a statistical significance calculation—just enter the necessary data for each test variation. However, most don’t have a way to determine the number of days needed to reach significance based on your data and the time your test has been running.
There’s a cost to running every experiment, so if a test—no matter the result—won’t be worth the wait to reach significance, you might want to pivot to another test idea.
Consider the below example. Given the current test data, this particular experiment would (likely) need to run for 12 days to reach significance.
That estimate is important for planning. Let’s assume that your test has been live for six days and that your user and conversion samples are on track. The calculator will tell you that six days remain, which is consistent with the pre-test duration calculation.
But if traffic dips and your samples fall behind, the test will need to run longer; if traffic rises, it will finish sooner.
A test calculator with built-in test duration calculations (for total days and days remaining) helps analysts weigh risk versus reward—a balance at the core of experimentation—even while a test is mid-flight.
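The days-remaining logic above can be sketched in a few lines, assuming traffic stays roughly constant for the rest of the test (the function name and inputs are illustrative, not any vendor’s API):

```python
import math

def days_remaining(required_per_variant, collected_per_variant,
                   daily_visitors_per_variant):
    """Estimate days left for a test to reach its required sample size,
    assuming roughly constant daily traffic per variant."""
    shortfall = max(required_per_variant - collected_per_variant, 0)
    return math.ceil(shortfall / daily_visitors_per_variant)

# Six days in at 5,000 visitors/day/variant, with 140,000 required:
# 70,000 collected, 70,000 to go -> 14 more days at this pace.
remaining = days_remaining(140_000, 70_000, 5_000)
```

Re-running this mid-flight is what lets you catch a test that has fallen behind schedule before the calendar does.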
3. Sharing with clients
Having the ability to share pre-test and test analysis data with stakeholders may seem trivial. But it’s incredibly handy. It was a popular request from beta users of our calculator and is heavily used across our client base.
Why? For one, a calculator that has a “Share” option helps eliminate the need to download data and further manipulate it in a spreadsheet before sending.
It also comes in handy when you might want to share a pre-test analysis for multiple tests. Or, you might want to link out to all your test analyses from a single document—without recreating it somewhere else. There are no limits to the number of links you can create and use.
4. Multiple variations
There’s a common statistical trap called “The Multiple Comparison Problem.” The more test variants you run, the higher the probability of false positives.
False positives can happen for a number of reasons. A sample size that’s too small and testing multiple variations are common causes. In business terms, a false positive could mean spending resources implementing something that had no effect.
If you’re using vendor software to run A/B/n tests, then you’ve probably noticed an option to adjust for multiple comparisons. For example, if you use the Adobe Target Sample Size Calculator, there’s a check-box option to correct for multiple offers using the Bonferroni Correction method.
As you’ll see in the screenshot below, when multiple comparisons are accounted for using the Bonferroni Correction, the confidence level for the five offers increases from 95% to 98.75%. When adjusted, the sample size needed for each offer increases as well.
The downside, of course, is that your test will require much more traffic to maintain the same level of statistical power.
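The Bonferroni arithmetic itself is simple: divide your significance level by the number of comparisons. With five offers (one control plus four treatments, i.e., four comparisons), 0.05 / 4 = 0.0125, which is the 98.75% confidence level shown in the screenshot:

```python
def bonferroni_alpha(alpha, num_comparisons):
    """Bonferroni-adjusted per-comparison significance level:
    each variant-vs-control comparison is tested at alpha / k."""
    return alpha / num_comparisons

adjusted = bonferroni_alpha(0.05, 4)      # 0.0125
confidence = 1 - adjusted                 # 0.9875, i.e., 98.75%
```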
There are other correction methods. As Georgi Georgiev details, Dunnett’s adjustment has a couple of advantages over others, including Bonferroni:
- Dunnett’s is a more efficient type of correction, reducing the sample size needed while preserving the power of a test. Since it uses the inherent correlation between the tests, it is able to make the overall test a bit more powerful.
- While the Bonferroni correction is common in the CRO community, it’s much more conservative than Dunnett’s, increasing testing time while being less precise—so it offers no advantage.
No calculator is perfect (but we can still try)
At Speero, we’ve tried to incorporate as many of the above features as possible into our testing calculator. But no calculator is perfect. Here are some things that we’re still working on.
Non-binomial metrics
While binomial metrics have only two values (true or false), non-binomial metrics do not. Non-binomial metrics such as average order value (AOV), revenue per visitor (RPV), and average basket size are usually continuous metrics with unknown variance, so they need to be calculated differently than conversion rate metrics.
Because calculations must control for the variance (i.e. calculate standard error) before performing significance calculations, we’re working to add this as a feature in the future.
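For reference, here’s a rough sketch of how a comparison for a continuous metric can work: estimate each group’s standard error from its sample variance, then run a large-sample z-test on the difference in means. This is a simplification, and it is not how our calculator will necessarily implement it; real implementations may use t-tests or additional variance controls.

```python
import math
from statistics import NormalDist, mean, stdev

def z_test_continuous(control, variant):
    """Two-sample z-test sketch for a continuous metric like AOV.
    Estimates the standard error from each group's sample variance;
    assumes samples are large enough for the normal approximation."""
    se = math.sqrt(stdev(control) ** 2 / len(control)
                   + stdev(variant) ** 2 / len(variant))
    z = (mean(variant) - mean(control)) / se
    p = 2 * (1 - NormalDist().cdf(abs(z)))   # two-sided p-value
    return z, p
```

In practice you’d feed this thousands of per-visitor values, not the handful a conversion-rate calculator asks for—which is exactly why continuous metrics need different handling.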
Sample ratio mismatch
Sample Ratio Mismatch (SRM) is a term that doesn’t get enough attention in the CRO community, and most testing platforms don’t alert users when it occurs.
So what is SRM? In A/B testing, it is simply when the proportion of visitors sampled in a particular treatment group does not match the expected proportion of the total number of visitors sampled in a given test. In other words, the observed split differs significantly from the split you configured.
How do you know when proportions are extreme enough to be concerned? And why should you care? SRM indicates that your test was somehow performing incorrectly (e.g., incorrect bucketing, ramping of variants, pausing variants, interaction effects, a variant not working in certain browsers or devices, or an advertisement or time-based change on the website blocking a variant), which caused a disproportionate allocation of traffic between variations.
Our own A/B Test calculator gives you a warning when there is a possibility of a sample ratio mismatch.
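A common way to check for SRM yourself is a chi-square goodness-of-fit test on the observed visitor counts. Below is a minimal sketch for two variants; the 0.001 threshold is a widely used convention for SRM checks (strict, because the check runs on many tests), not a universal standard, and this is not necessarily the exact method our calculator uses.

```python
import math
from statistics import NormalDist

def srm_pvalue(n_a, n_b, expected_split=0.5):
    """Chi-square (1 degree of freedom) goodness-of-fit p-value for a
    two-variant split. For 1 dof, P(chi2 > x) = P(|Z| > sqrt(x))."""
    total = n_a + n_b
    e_a = total * expected_split
    e_b = total * (1 - expected_split)
    chi2 = (n_a - e_a) ** 2 / e_a + (n_b - e_b) ** 2 / e_b
    return 2 * (1 - NormalDist().cdf(math.sqrt(chi2)))

# Even a 50.6/49.4 split is suspicious at ~200k visitors:
p = srm_pvalue(101_200, 98_800)
if p < 0.001:
    print("possible sample ratio mismatch")
```

The intuition: small imbalances on large samples are far less likely to be random noise, which is why SRM can hide in splits that look harmless at a glance.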
Not all test calculators are created equal. Calculators that are too simple can lead to inaccurate results, and those with greater complexity too often don’t help users understand the value of their sophistication.
Here are four things to be aware of when using your existing calculators or picking new ones:
- Centralization. Do the underlying statistical methodologies align?
- Test duration awareness. Can you get real-time visibility into duration estimates?
- Sharing results. Is it easy to share results with everyone who needs them?
- Managing multiple variations. For tests with multiple variations, does it help control for false positives?
You can check out our own A/B Test calculator here.