Experimentation Metrics That Matter 


Alongside individual experiment goals and metrics, you’ll need some way to measure the overall success of your experimentation program. But with so much you can track, it’s easy to fall into one of three traps:

  1. Reporting every single thing. Just because you can measure something doesn’t mean you should. Tracking every possible metric means you’ll quickly become overwhelmed by data, spending time and effort on things that have little to no impact on your overall goals. Being overwhelmed by data also makes it harder to draw conclusions or understand what action you should take.

  2. Tracking vanity metrics. These are metrics that sound important but are red herrings, such as the number of variants in your tests, or test velocity measured without also measuring uplift. These aren’t accurate indicators of success and can lead you to focus on moving the wrong metrics.

  3. Thinking it’s all about conversion rates. We’ve written about The Retention Economy and why it’s not enough for businesses to focus purely on conversion rates or revenue without measuring the long-term impact on customer lifetime value or retention.


For one, when conversion rates go up it doesn’t always result in a financial payoff, which is usually the initial driver for experimentation programs. For example, you can increase the purchase conversion rate but also reduce average order value, resulting in less overall profit. The fixation on conversion rates can also come from businesses vastly underestimating what can be achieved through experimentation (the term ‘CRO’ has a lot to answer for). Experimentation is an accessible research tool that provides statistically valid findings, and it can be applied to much more than simple UX changes. You can use testing to make decisions on product innovations, business pivots, pricing, and propositions, to name a few. So you’ll need more than conversion rates to understand whether these types of experiments are successful.
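To make the conversion rate trap concrete, here is a minimal sketch in Python. It uses revenue per visitor (conversion rate multiplied by average order value) as a simple stand-in for the financial payoff. The function name and all of the figures are made up for illustration, not taken from a real test.

```python
def revenue_per_visitor(conversion_rate: float, average_order_value: float) -> float:
    """Revenue earned per visitor: conversion rate times average order value."""
    return conversion_rate * average_order_value


# Hypothetical figures, for illustration only.
control = revenue_per_visitor(conversion_rate=0.020, average_order_value=100.0)
variant = revenue_per_visitor(conversion_rate=0.022, average_order_value=85.0)

print(f"control: {control:.2f} per visitor")
print(f"variant: {variant:.2f} per visitor")
print("variant converts better but earns less:", variant < control)
```

In this made-up case the variant lifts the conversion rate by 10% but still earns less per visitor, because the drop in average order value outweighs the gain.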

To avoid these three pitfalls, here’s what is crucial to measure.

Overall Success Metrics

Your overall success metrics should be aligned to the overall business goal, commonly referred to as the “North Star Metric.” This is most likely focused on revenue or another growth lever. As explained above, however, a focus on revenue alone leads many businesses to pursue short-term strategies that aren’t as profitable as securing long-term value through retention or customer lifetime value. So while you might want to measure revenue, I’d advise you to also measure these customer retention metrics to get the full picture. I’ll cover how you can measure both.

CX Metrics: How to measure Customer Lifetime Value 

Don’t be fooled into thinking that metrics like retention and CLTV are just for subscription businesses. They are just as important for non-subscription business models. But they can be harder to measure, particularly if you have long sales cycles, because with discretionary purchases it’s hard to tell whether a customer has in fact ‘churned’ (known as latent attrition) or is just in between purchases. Identifying unique customers is another sticking point for multichannel retailers, e.g. when you have guest checkouts or offline experiences that aren’t tracked.

Don’t let these shortcomings stop you, however. It’s better to get some form of measurement to begin working with.

Here’s the simplest way to calculate CLTV:

Customer lifetime value = average purchase value × average number of purchases per year × average customer lifespan in years.

If you don’t have a single customer view, you can work out the above as follows:

  • Average purchase value - Annual revenue divided by the total number of purchases.
  • Average number of purchases - Total number of sales per year divided by the total number of individual customers who bought from you that year.
  • Customer lifespan - Average number of years customers continue to purchase from you.
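The calculation above can be sketched in a few lines of Python. This is a rough illustration working from aggregated figures only. The function name, parameter names, and example numbers are my own assumptions.

```python
def customer_lifetime_value(
    annual_revenue: float,
    total_purchases: int,
    unique_customers: int,
    lifespan_years: float,
) -> float:
    """Simple CLTV from aggregated yearly figures (no single customer view)."""
    # Average purchase value: annual revenue / total number of purchases.
    average_purchase_value = annual_revenue / total_purchases
    # Average number of purchases per year: purchases / unique customers.
    purchases_per_customer = total_purchases / unique_customers
    return average_purchase_value * purchases_per_customer * lifespan_years


# Hypothetical figures, for illustration only.
cltv = customer_lifetime_value(
    annual_revenue=500_000.0,
    total_purchases=10_000,
    unique_customers=4_000,
    lifespan_years=3.0,
)
print(f"CLTV: {cltv:.2f}")
```

With these made-up inputs, the average purchase value is 50, customers buy 2.5 times a year, and the lifespan is 3 years, giving a CLTV of 375.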

Some businesses subtract the cost of sales and marketing from the CLTV calculation. However, I’d advise against doing that unless you can generate a CLTV figure for customers per marketing acquisition channel, because acquisition channels are likely to vary in cost, and the CLTV of the customers they bring in may vary too. So if you plan to use aggregated data to calculate the above, keep your average customer acquisition cost (CAC) separate.

In an ideal world, you’ll have a single view of the customer, using a data warehouse to collect all of the data you have about individual customers, and be able to measure the impact specific experiments had on CLTV. But most businesses aren’t there yet. So instead, use cohort analysis to help you attribute the impact any implemented changes had on your customer lifetime value (and retention).
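As a rough sketch of what a cohort analysis can look like, the Python below groups customers by the month of their first purchase and reports what share of each cohort bought again in later months. The data shape (a list of customer ID and month index pairs) and the function name are assumptions for illustration. In practice you would compare cohorts that joined before and after a change was implemented.

```python
from collections import defaultdict


def cohort_retention(purchases):
    """purchases: list of (customer_id, month_index) pairs.

    Returns {cohort_month: {months_since_first_purchase: share_of_cohort_active}}.
    """
    # A customer's cohort is the month of their first purchase.
    first_month = {}
    for customer, month in purchases:
        if customer not in first_month or month < first_month[customer]:
            first_month[customer] = month

    cohort_sizes = defaultdict(int)
    for month in first_month.values():
        cohort_sizes[month] += 1

    # Collect the distinct customers active at each offset from their cohort month.
    active = defaultdict(set)
    for customer, month in purchases:
        cohort = first_month[customer]
        active[(cohort, month - cohort)].add(customer)

    retention = defaultdict(dict)
    for (cohort, offset), customers in sorted(active.items()):
        retention[cohort][offset] = len(customers) / cohort_sizes[cohort]
    return dict(retention)


# Hypothetical purchase log, for illustration only.
purchases = [("a", 0), ("b", 0), ("a", 1), ("c", 1), ("c", 2), ("b", 2)]
retention = cohort_retention(purchases)
for cohort, row in retention.items():
    print(cohort, row)
```

If a later cohort retains noticeably better than an earlier one at the same offset, that is a hint (not proof) that the changes implemented in between had an effect.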

Revenue Metrics

How do you work out the revenue uplift from all of your different experiments when they sit at different stages of the funnel and might not have been measured by revenue uplift at the time of testing?

There are three methods you can use:

Holdback Method

The holdback method is where you keep around 10% of your traffic outside of any test scenarios and measure the difference in revenue generated from the holdback group vs. the test group. 

There’s a downside to this method. It’s counterintuitive to have a holdback group that isn’t also benefiting from the enhancements, as you’ll be losing out as a business, just so you can measure the effectiveness of your program.
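Because the holdback group is much smaller than the rest of your traffic, total revenue for the two groups can’t be compared directly. A minimal sketch, assuming you normalize by visitors and compare revenue per visitor (the function name and figures are hypothetical):

```python
def holdback_uplift(
    holdback_revenue: float,
    holdback_visitors: int,
    exposed_revenue: float,
    exposed_visitors: int,
) -> float:
    """Relative uplift in revenue per visitor of exposed traffic vs. the holdback group."""
    holdback_rpv = holdback_revenue / holdback_visitors
    exposed_rpv = exposed_revenue / exposed_visitors
    return (exposed_rpv - holdback_rpv) / holdback_rpv


# Hypothetical figures: 10% of traffic held back, 90% exposed to tests and winners.
uplift = holdback_uplift(
    holdback_revenue=20_000.0,
    holdback_visitors=10_000,
    exposed_revenue=198_000.0,
    exposed_visitors=90_000,
)
print(f"Program uplift vs. holdback: {uplift:.1%}")
```

Here the holdback group earns 2.00 per visitor and the exposed group 2.20, so the program is credited with roughly a 10% uplift. A real analysis would also check whether a difference of that size is statistically significant.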

Time Period Comparison Method

This is where you look at the period before you started your experimentation program and compare it to the same period after your winning tests were implemented. 

The biggest problem here is that it’s unlikely you’re comparing apples with apples. Between the two time periods, a number of factors can affect your revenue metrics, such as advertising spend, campaign messages, competitor activity, and even external factors like weather or major world events. This is therefore not a good method to use if your program has been running for some time: the longer the gap, the more variables will have changed.

Uplift Projection Method

You can work out the uplift per test as you go, annualize it, and then add all of the test results together to get your overall revenue uplift. This will be a projection rather than an exact figure. To add a layer of reliability to your calculations, you can monitor each test’s uplift to see whether it stays consistent for 3 months before attributing it to your uplift value. You can also apply a 20% per month reduction to the annualized figure to simulate external factors eroding the increased revenue over time.
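Here is one way the projection could be sketched in Python. It rests on one possible reading of the 20% reduction: each month’s uplift is 20% smaller than the previous month’s, so the erosion compounds. That interpretation, the function names, and the example figures are all assumptions, so adapt them to however your team defines the decay.

```python
def annualized_uplift(monthly_uplift: float, monthly_decay: float = 0.0, months: int = 12) -> float:
    """Sum the monthly uplift over `months`, shrinking it by `monthly_decay` each month."""
    return sum(monthly_uplift * (1 - monthly_decay) ** month for month in range(months))


def program_uplift(tests, monthly_decay: float = 0.20) -> float:
    """Add up annualized uplift, counting only tests whose uplift held for 3 months.

    tests: list of (monthly_revenue_uplift, held_for_three_months) pairs.
    """
    return sum(
        annualized_uplift(uplift, monthly_decay)
        for uplift, held_for_three_months in tests
        if held_for_three_months
    )


# Hypothetical winning tests, for illustration only.
tests = [(10_000.0, True), (4_000.0, True), (6_000.0, False)]
undecayed = program_uplift(tests, monthly_decay=0.0)
decayed = program_uplift(tests)
print(f"Without decay: {undecayed:,.0f}")
print(f"With 20% monthly decay: {decayed:,.0f}")
```

Under this reading the decay is severe: a compounding 20% monthly reduction leaves well under half of the straight annualized figure, which is worth knowing before you present either number to stakeholders.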

Which method you choose will depend on how accurate you want your measurements to be, but once you have chosen a method it’s best to stick with it so your results stay consistent.

Secondary Metrics

Secondary metrics are those which impact the overall success metric. They are the growth levers of your business. Lean on these and you’ll drive change in your overall goals.

Let’s take Amazon as an example. Their growth levers are: 

  • Adding more categories to their business 
  • Adding more products within each category
  • Increasing traffic to product pages
  • Increasing the conversion rate of purchases 
  • Increasing average order value (AOV)
  • Increasing the number of purchases per person

Each one of these ‘levers,’ when improved, will grow Amazon. While it might feel like you can’t impact some of these levers through experimentation, think again. For example, you can smoke test new product categories to measure interest, or generate insights about what messaging resonates with customers to feed back into your acquisition strategy and increase traffic.

If your overall success metric is around CLTV or retention, your secondary metrics might include customer satisfaction measurements such as CSAT/PSAT, NPS, Customer Effort Score (CES), or SUPR-Q.

A subset of secondary metrics is ‘correlative metrics,’ which are also known by a number of other terms: predictive metrics, correlations, or aha moments.

There’s a popular anecdote that Facebook discovered that if a user made seven friends in their first 10 days, they were more likely to continue using Facebook. While adding the seven friends did not itself cause anything, it did help predict the likelihood of people taking a desired future action. To find your correlative metrics, you can create propensity models. Here’s a good step by step guide on how to do this.

However, it’s important to remember statisticians’ most catchy (only?) phrase when it comes to correlative metrics: "correlation does not imply causation," just as the graph below illustrates.


Guardrail Metrics


Tracking the number of conversions is great, but if you raised this metric by lowering the price, you’re not making as much money. Worse still, you might have increased other things, like returns, doing even more damage to the bottom line.

This is why guardrail metrics can be very useful as a sense check that you are in fact improving things. Consider other metrics that directly impact your overall success metrics. For example, if you are concerned with increasing revenue then measuring the ROI (deducting the costs of the experimentation program itself), AOV, and repeat purchases will be important to monitor as guardrails alongside revenue. 
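A guardrail check can be as simple as flagging any guardrail metric that fell too far in the variant compared with control. The sketch below assumes every guardrail is a "higher is better" metric and uses a hypothetical 2% tolerance. The function name, metric names, and figures are illustrative only.

```python
def guardrail_breaches(control: dict, variant: dict, max_relative_drop: float = 0.02) -> dict:
    """Return guardrail metrics that fell by more than max_relative_drop vs. control.

    Assumes higher values are better for every metric passed in.
    """
    breaches = {}
    for name, baseline in control.items():
        change = (variant[name] - baseline) / baseline
        if change < -max_relative_drop:
            breaches[name] = change
    return breaches


# Hypothetical results: the variant raised conversions via a price cut.
control = {"aov": 80.0, "repeat_purchase_rate": 0.30}
variant = {"aov": 72.0, "repeat_purchase_rate": 0.30}

breaches = guardrail_breaches(control, variant)
for name, change in breaches.items():
    print(f"Guardrail breached: {name} changed by {change:.1%}")
```

For a metric where lower is better, such as return rate, you would flip the comparison or pass in its inverse.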

Experimentation Program Metrics 

If we want to improve the overall success metrics we need to improve the experimentation machine itself. The following metrics help you to measure important factors that impact the success of your experimentation program and therefore the results you can generate. 

Test Velocity

On its own, this metric could be considered a vanity metric. After all, what’s the point in running a hundred tests a month if none of them win? This metric needs to be used alongside win rate and percentage uplift to ensure you’re not just testing for testing’s sake. But if all the other metrics are improving this is an important measurement you can influence to increase overall performance.

This metric tends to increase relative to two factors:

  1. The volume of traffic you have
  2. The level of experimentation maturity. As businesses become more mature, they enable other areas of the business to experiment, and testing becomes part of the decision-making process. You can use our free experimentation maturity audit to help benchmark where you currently are.

Test Efficiency

Having a quality assurance process will prevent tests from going live with issues that might force you to re-run them, wasting time, traffic, and effort. So measuring the percentage of experiments with production issues (or the number of days delayed) can help you analyze the quality of your team’s work as well as highlight any issues with your internal processes.

Test Quality 

This refers to the number of impactful learnings you gain from your experiments. As explained above, tests can do more than just impact revenue and conversion rates. Using this measurement will encourage your team to test bigger ideas that are based on data, as these tend to generate the most insights. For example, if you are testing button colors (unlikely to impact revenue or CR in any meaningful way), you’re not going to learn anything impactful. Compare that with running an experiment around what’s included in different services, which shows you which elements your customers perceive as most valuable and where the tipping point is for them to purchase. This kind of insight can change your whole business: what you offer and how you market yourselves.

Program Agility

This is measured as the number of days it takes to go from hypothesis, to live test, to implementation (when it’s a winning test). This is important because if your hypothesis sits waiting to be tested for 6 months, the data or insights you based it on might no longer be valid. So this metric can encourage your team to work in experimentation sprints, gathering just enough data and research to test the hypothesis and then repeating the cycle.

The ‘winning result to implementation’ part of this measurement is crucial because it relates to revenue or other uplifts/improvements to your experience. Obviously, you’ll want to capitalize on your winning ideas as soon as possible. Measuring this will allow you to understand if your program needs more resources or a change in process to speed things up. 

Win Rate & % Uplift 

You won’t always win, and the industry has varying takes on average win rates, from 10% to 60%, but either way you’ll want to increase your chances. Improvements to this metric can be found in three areas:

  • Ensuring ideas are based on insights from multiple data sources, e.g. user research, analytics, and heatmaps. Basing a hypothesis on data rather than a hunch is always going to increase your odds. If multiple sources of data all point to the same issue or idea, then you have even better odds of winning.
  • Using a prioritization method like Speero’s PXL framework to systematically rate your hypotheses, so you focus on those with the highest chances of success and biggest potential uplift.
  • The execution of your test idea. Using psychology principles, UX, design, creative input, and copywriting expertise to create a treatment that’s your best go at solving the problem. 


While win rate is important (and tends to correlate with experimentation maturity), if you are only ever getting tiny uplifts you might not be generating a great ROI from your efforts. So ideally you want to test bigger, bolder ideas. Think about answering your hypothesis in the most extreme ways, because the bigger and bolder the test, the bigger and quicker the results will be. Measuring how many of your tests generate a certain % uplift (the threshold will be relative to your overall business) can help guide your team toward testing bigger ideas that lead to a bigger impact.
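Both numbers are easy to compute from a log of concluded tests. A minimal sketch, assuming each test is recorded as a (won, relative uplift) pair and using a hypothetical 5% threshold for a "big" win:

```python
def win_rate(results) -> float:
    """Share of concluded tests that produced a winning variant."""
    return sum(1 for won, _ in results if won) / len(results)


def share_above_uplift(results, min_uplift: float) -> float:
    """Share of all concluded tests that won with at least min_uplift relative uplift."""
    big_wins = sum(1 for won, uplift in results if won and uplift >= min_uplift)
    return big_wins / len(results)


# Hypothetical test log: (won, relative uplift on the primary metric).
results = [(True, 0.06), (False, 0.0), (True, 0.02), (False, -0.01), (False, 0.0)]

print(f"Win rate: {win_rate(results):.0%}")
print(f"Tests with at least 5% uplift: {share_above_uplift(results, 0.05):.0%}")
```

Tracking the two side by side shows whether a healthy win rate is hiding a run of small, low-impact wins.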

Visualizing Metrics 

One way to help you visualize these metrics and how they relate to one another is by using the OKR framework in a goal tree format.

Example of OKRs in a goal tree


Place your overall metric as the objective and add the secondary, correlative, and guardrail metrics as the ‘key results.’

For example:

Experimentation Program objective: Increase company revenue by X% by end of the year (overall success metric)

Key results: 

  • 35% win rate with X% of tests generating 5% uplift or $ revenue per test. (Secondary metrics)
  • Test velocity increased by 10% by the end of Q4. (experimentation program metric)
  • Test production errors lower than 5%. (experimentation program metric)
  • 6 customer learnings that can be used in marketing, per quarter (correlative metric).
  • Hypothesis > implementation less than 30 days (guardrail metric).


Single Test Objective: Reduce churn of customers (secondary metric)

Key results

  • Identify what elements of the subscription plan motivate users to upgrade. (experimentation program metric)
  • Increasing onboard video views (correlative metric).

Once you have identified which metrics are important to track, you’ll want to be able to monitor them easily without having to put in a request to your analysts each time. Our two top suggestions are to use dashboards that show trends and to break your key metrics down by audience segment. Where and how you create your dashboard will depend on your setup and tools, but however you achieve this, the best thing to do is share it with your wider team so everyone can feel involved in the experimentation process.

Conclusion

Measuring the output of your work can be hard to set up and get working, but once the hard work is done you will be in a better position to get additional buy-in and resources for your work. You’ll be able to show the impact you have on the business and can help drive data-driven work practices throughout any organization. 
