Omniata Blog, November 19, 2015

Two questions you should ask before declaring an A/B test successful

An A/B test has implications far beyond any single metric. Avoid mistakes. Always address these two questions.

If you are a product manager, marketer, or data analyst for a web or mobile product, it’s highly likely that you are using controlled experiments, such as A/B (split) tests or multivariate tests. On a daily basis, you’re trying to optimize your product and drive business KPIs.

When it comes to interpreting results, the focus is usually on ensuring the statistical validity of experiments. However, a controlled experiment can have implications beyond the metric for which an experiment is evaluated.

As most experiments are set up to focus on a single KPI, results showing one of the tested versions to be a clear winner can be deceiving: the very same version can negatively impact other metrics (or the user experience) downstream.

Additionally, as long as they are statistically valid, results are often taken at face value, with little investigation into the reasons behind a version’s success. This becomes a big issue when changes are rolled out to all users based on incomplete information or, even worse, incorrect assumptions.

If you want to avoid erroneous conclusions and derive full value from experiments, in addition to ensuring statistical validity of results, you also need to answer the following questions:

  • Did the test impact any other KPI negatively?
  • What is the reason behind the success of winning versions?

Let’s explore these in detail.

Ensuring that your test didn’t negatively impact other KPIs

The success of experiments is generally measured by a single metric such as “clicks on a button” or “conversion rate”. This can be useful for simple web pages, but for products with more complex user experiences like many mobile applications, an A/B or multivariate test can have downstream implications beyond a single metric.

Example 1:

Let’s take a scenario where we’re testing for “conversion rate”, version A vs. version B. In this common scenario, version A wins with reliable statistical significance. Since a higher conversion rate is better than a lower one, we logically decide to roll out the changes from version A to all users. However, a closer look might reveal that while version A had a higher conversion rate, it lost on average transaction value. In a worse scenario, it might even have the unintended effect of driving fewer people to the screen, ultimately decreasing overall revenue.
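The arithmetic behind Example 1 can be sketched in a few lines of Python. All numbers below are made up for illustration, and `VariantStats` is a hypothetical helper, not part of any Omniata API. The point is that revenue per visitor equals conversion rate multiplied by average transaction value, so a win on one factor can be cancelled out by a loss on the other.

```python
from dataclasses import dataclass


@dataclass
class VariantStats:
    """Aggregate numbers for one version of the test (hypothetical values)."""
    visitors: int    # users who reached the screen under test
    purchases: int   # completed transactions
    revenue: float   # total revenue from those transactions

    @property
    def conversion_rate(self) -> float:
        return self.purchases / self.visitors

    @property
    def avg_transaction_value(self) -> float:
        return self.revenue / self.purchases

    @property
    def revenue_per_visitor(self) -> float:
        # Same as conversion_rate * avg_transaction_value.
        return self.revenue / self.visitors


# Illustrative data: A converts better, B sells larger purchases.
version_a = VariantStats(visitors=10_000, purchases=500, revenue=2_500.0)
version_b = VariantStats(visitors=10_000, purchases=400, revenue=3_200.0)

for name, v in (("A", version_a), ("B", version_b)):
    print(
        f"{name}: conversion={v.conversion_rate:.1%}  "
        f"avg transaction=${v.avg_transaction_value:.2f}  "
        f"revenue/visitor=${v.revenue_per_visitor:.2f}"
    )
```

With these assumed numbers, version A wins on conversion rate while version B earns more per visitor, which is exactly the situation a single-metric readout would hide.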

Example 2:

Assume that you are a mobile game developer who wants to test the following hypothesis:

Increasing the difficulty level of a game will increase the conversion rate, because users will buy premium in-app purchases to bypass the difficulty level.

To validate your hypothesis, you test different versions of “difficulty levels” within the game but, surprisingly, none of the versions is a clear winner on conversion rate, which was the primary success criterion. You might conclude that difficulty level has no effect on conversion rates, and hence is irrelevant to increasing monetization. If you look beyond that single metric, however, you may notice that even though the changes in conversion rate were not statistically significant, the total revenue generated from users of version A increased significantly! By limiting your focus to conversion rate, you might have missed an opportunity to increase monetization in your game. We will explore this further in the next section.

When analyzing the experiment results, it’s critical to account for impact across all major KPIs - not just a conversion event. Changes to a product should only be made after evaluating the test results holistically and taking into account the change across all relevant KPIs.
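One way to make this holistic check routine is to run the same significance test over every KPI and flag any metric where the "winner" is significantly worse. The sketch below is a minimal illustration, not Omniata's method: it uses a pooled two-proportion z-test, assumes every KPI is a rate where higher is better, and all counts and KPI names are hypothetical.

```python
import math


def two_proportion_z_test(successes_a, n_a, successes_b, n_b):
    """Two-sided z-test for a difference between two proportions (pooled)."""
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / std_err
    # Two-sided p-value under the normal approximation.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


def compare_kpi(successes_a, n_a, successes_b, n_b, alpha=0.05):
    """Classify version A against version B for one rate-type KPI."""
    z, p_value = two_proportion_z_test(successes_a, n_a, successes_b, n_b)
    if p_value >= alpha:
        return "no significant difference"
    return "A better" if z > 0 else "A worse"


# Hypothetical counts: (successes_a, users_a, successes_b, users_b).
kpi_counts = {
    "conversion": (500, 10_000, 400, 10_000),
    "d7_retention": (1_800, 10_000, 2_000, 10_000),
    "tutorial_completion": (7_000, 10_000, 7_020, 10_000),
}

verdicts = {name: compare_kpi(*counts) for name, counts in kpi_counts.items()}
for name, verdict in verdicts.items():
    print(f"{name}: {verdict}")
```

In this made-up data, version A wins on conversion but is significantly worse on D7 retention, so rolling it out on the strength of the conversion result alone would be premature.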

The sample dashboard below shows the impact of a controlled experiment on some relevant metrics, including statistical significance, for a mobile game. You can also filter the data by user attributes to get more granular insights, such as limiting the results to users from Canada, or users in the 18-30 age range.

By doing so, you can ensure that the winning version did not negatively impact other KPIs.
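The segment filtering described above can be sketched as a small function that recomputes a KPI per version for any subset of users. The records, field names, and `conversion_by_version` helper are all hypothetical, and the sample is far too small to be statistically meaningful; it only illustrates the mechanics.

```python
from collections import defaultdict

# Hypothetical per-user records from an experiment.
users = [
    {"version": "A", "country": "Canada", "age": 24, "converted": True},
    {"version": "A", "country": "Canada", "age": 41, "converted": False},
    {"version": "A", "country": "US", "age": 29, "converted": False},
    {"version": "A", "country": "Canada", "age": 19, "converted": False},
    {"version": "B", "country": "Canada", "age": 27, "converted": True},
    {"version": "B", "country": "Canada", "age": 30, "converted": True},
    {"version": "B", "country": "US", "age": 22, "converted": False},
    {"version": "B", "country": "Canada", "age": 35, "converted": False},
]


def conversion_by_version(records, predicate=lambda user: True):
    """Conversion rate per version, restricted to users matching predicate."""
    totals = defaultdict(lambda: [0, 0])  # version -> [converted, users]
    for user in records:
        if predicate(user):
            totals[user["version"]][0] += int(user["converted"])
            totals[user["version"]][1] += 1
    return {version: conv / n for version, (conv, n) in totals.items()}


everyone = conversion_by_version(users)
canada = conversion_by_version(users, lambda u: u["country"] == "Canada")
canada_18_30 = conversion_by_version(
    users, lambda u: u["country"] == "Canada" and 18 <= u["age"] <= 30
)

print("all users:    ", everyone)
print("Canada:       ", canada)
print("Canada, 18-30:", canada_18_30)
```

Passing the filter as a predicate keeps the KPI calculation identical across segments, so any difference between the three outputs comes from the segment itself and not from a change in how the metric is computed.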

Identifying reasons behind the success or failure of versions

Evaluating test results across all metrics can only tell you IF the changes made in the better-performing version should be rolled out to all users. It does not offer any insight into WHY a particular version was a success, a failure, or inconclusive. If known, these insights can be valuable in designing future versions of the product.

Investigating such questions requires drilling down into the data: slicing, dicing, and filtering all the relevant dashboards and charts by experiment and version. The general principle is to start from the differences in the top-level KPIs across versions, and then drill into progressively more granular metrics until you reach the level of raw data.

Example 2, continued:

In Example 2 discussed earlier, you noticed a statistically significant increase in overall revenue from users of Version A over the duration of the experiment. You decide to roll out the difficulty level of the winning version to all users and make more money. But if you can also learn WHY increasing the difficulty resulted in increased revenue, you can apply this learning within the same game or when designing your other games.

A sample analysis of the above experiment would be:

  • You check results of all top level KPIs by experiment versions and notice:
    • Daily ARPU, conversion rate and average transaction value do not differ significantly.
    • Total Revenue for version A is significantly higher.
  • You start investigating why revenue increased for Version A.
  • You look at users and retention rates by experiment and notice:
    • The D5, D6, and D7 retention rates for users belonging to Version A are much higher.
    • The increased revenue came from first-time purchases made by returning users who would otherwise have left the game. Because the number of active users and the revenue grew together, per-user ratios such as Daily ARPU and Conversion Rate showed no statistically significant difference across versions.

Isn’t this counterintuitive? The most difficult version in fact increased retention. On further investigation, you might find the real reason: users did not feel that the game was challenging enough to keep them engaged for longer, and hence dropped off. Increasing the difficulty made the game more interesting. Hence the revenue increase was not due to existing users making more purchases; it came from users who otherwise might have stopped playing the game.
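A small worked example shows how total revenue can rise while Daily ARPU stays flat. The numbers are invented for illustration (a cohort of 1,000 installs per version, tracked for seven days), and the `summarize` helper is hypothetical. Daily ARPU is revenue divided by active users for that day, so if better retention adds users who spend at the same rate, the ratio does not move even though the total does.

```python
# Hypothetical daily data for days D1..D7 of one install cohort per version.
daily = {
    "A": {  # harder version
        "cohort": 1_000,
        "active": [400, 300, 250, 220, 200, 190, 180],
        "revenue": [40.0, 30.0, 25.0, 22.0, 20.0, 19.0, 18.0],
    },
    "B": {  # original difficulty
        "cohort": 1_000,
        "active": [400, 300, 250, 200, 150, 120, 100],
        "revenue": [40.0, 30.0, 25.0, 20.0, 15.0, 12.0, 10.0],
    },
}


def summarize(version):
    """Per-day ARPU and retention, plus total revenue, for one version."""
    return {
        "daily_arpu": [r / a for r, a in zip(version["revenue"], version["active"])],
        "retention": [a / version["cohort"] for a in version["active"]],
        "total_revenue": sum(version["revenue"]),
    }


summary = {name: summarize(data) for name, data in daily.items()}

for name, s in summary.items():
    arpu = ", ".join(f"{x:.2f}" for x in s["daily_arpu"])
    d5_to_d7 = ", ".join(f"{x:.0%}" for x in s["retention"][4:])
    print(f"Version {name}: daily ARPU = [{arpu}]")
    print(f"  D5-D7 retention = {d5_to_d7}; total revenue = ${s['total_revenue']:.2f}")
```

In this sketch both versions show the same Daily ARPU on every day, yet Version A collects more total revenue because more users are still active on D5 through D7. Retention is therefore the metric to inspect when totals move and ratios do not.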

Once you know the reason, you may want to run more experiments to identify the sweet spots of difficulty levels, or use these insights while designing your new game.

Please feel free to contact us for more information on how Omniata can help you with A/B and multivariate testing.