A/B Testing with Multiple Simultaneous Experiments

April 21, 2015

At Adform, A/B testing is becoming a highly desired feature in our research and data science teams. One of our main reasons for A/B testing is to compare statistical models. For example, let’s say we have built two models to determine which advertisement to display for a particular user, and now want to find out which resulted in the higher user click-through rate. The fairest way to figure that out is to run an A/B test: we split the users into two groups and assign a different model to each group. Later, we can calculate the click-through rate for each group and test whether one model performed statistically significantly better than the other.
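As a rough illustration of that final comparison, here is a minimal sketch of a two-proportion z-test on click-through rates. The post does not say which statistical test is used at Adform, so the choice of test, the function name and the counts in the usage example are all assumptions made for illustration.

```python
import math


def two_proportion_z_test(clicks_a, users_a, clicks_b, users_b):
    """Return (z, two-sided p-value) for the difference in click-through rate.

    Assumption: a pooled two-proportion z-test, which is one common choice
    for comparing two rates; the post does not name a specific test.
    """
    rate_a = clicks_a / users_a
    rate_b = clicks_b / users_b
    # Pooled click-through rate under the null hypothesis of no difference.
    pooled = (clicks_a + clicks_b) / (users_a + users_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / users_a + 1 / users_b))
    z = (rate_b - rate_a) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


if __name__ == "__main__":
    # Made-up counts, purely for illustration.
    z, p = two_proportion_z_test(clicks_a=100, users_a=10_000,
                                 clicks_b=150, users_b=10_000)
    print(f"z = {z:.2f}, p = {p:.4f}")
```

A small p-value suggests the difference between the groups is unlikely to be due to chance alone, provided the users were allocated to the groups fairly.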

This isn’t only applicable in our data science teams – in fact, any product that involves user interaction can be A/B tested to ensure that changes or new features are having a positive effect on the relevant key performance indicators (KPIs). Without the ability to run meaningful A/B tests, it is impossible to be certain that changes and new features will result in the improvements that we anticipate. For this reason, we decided to build a generic A/B testing service that can be used throughout the technology and development departments in Adform.

In this post, I will discuss the complexities around one of the biggest problems we faced in implementing a generic A/B testing service: the problem of supporting multiple simultaneous experiments. Our solution is based on Google’s approach to A/B testing, which we have simplified and adapted for use at Adform.


Allowing a User to Be in More Than One Experiment

One of the first issues we encountered was understanding when it is acceptable for a user to be in two experiments at once. Of course, if we have several experiments testing the same issue, then any given user can be in only one of those experiments. On the other hand, if we have two unrelated experiments that are testing completely independent features, then it’s fine for a user to be part of both experiments at the same time.

We needed a way to define and represent the experiments in a way that would enable us to capture the different semantics between similar and independent experiments. For this, we borrow Google’s terminology:

  • Layer – A layer represents a specific type of A/B test. For example, we may have a layer for ‘representing the A/B testing of statistical models used for [some purpose]’.
  • Experiment – An experiment lives within a parent layer, and each experiment represents a specific test. For example, we may have an experiment for ‘testing model #5’.

These terms are depicted more clearly in the diagram below, which shows an A/B testing configuration for two independent features.

Ab Testing 01
It now becomes clear when a user can participate in multiple experiments: within each layer, a user can be in only one experiment; but across layers, this constraint does not apply. Next, we just need a way to assign users to an experiment within each layer.
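The rule above can be expressed as a small check. This is only a sketch: the configuration format, the layer and experiment names, and the function name are hypothetical and are not taken from the Adform service.

```python
from collections import Counter

# Hypothetical configuration: layer name -> names of the experiments in it.
LAYERS = {
    "feature_a": ["feature_a_on", "feature_a_off"],
    "feature_b": ["feature_b_on", "feature_b_off"],
}


def is_valid_assignment(layers, experiments):
    """Return True if `experiments` contains at most one experiment per layer.

    Raises KeyError if an experiment name is not present in any layer.
    """
    # Reverse lookup: experiment name -> the layer that owns it.
    layer_of = {exp: layer for layer, exps in layers.items() for exp in exps}
    per_layer = Counter(layer_of[exp] for exp in experiments)
    return all(count <= 1 for count in per_layer.values())


if __name__ == "__main__":
    # One experiment from each layer: allowed.
    print(is_valid_assignment(LAYERS, ["feature_a_on", "feature_b_off"]))
    # Two experiments from the same layer: not allowed.
    print(is_valid_assignment(LAYERS, ["feature_a_on", "feature_a_off"]))
```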


Ensuring the Traffic Is Always Allocated Fairly

When users are being assigned to multiple experiments, it is easy to allocate the traffic (i.e. map users to the experiments) in a way that is unfair and that leads to biased and meaningless results. In our scenario of two independent features, suppose each user is given a single position that applies to both layers, so that the users are distributed horizontally across this next diagram.

Ab Testing 02

Here, we can see that if a user is assigned to the experiment Feature A: On, they will also be assigned to Feature B: On. So, if we later analyse the KPIs for the purple users versus those for the red users, we will have no idea whether any change in the KPIs has come from Feature A or Feature B. That’s why we need to allocate the traffic in a fair way, as shown in the diagram below.

Ab Testing 03

Above, the users assigned Feature A: On are equally split between Feature B: On and Feature B: Off. Now, if we look at the KPIs for the purple users versus those for the red users, we know that any change we see is purely down to Feature B. We have achieved this by partitioning experiments within each layer equally across experiments within the other layer.

In this simple example, where we have only two layers and two experiments within each layer, with a 50/50 traffic split, the partitioning solution is simple. However, our generic service needed to be able to handle scenarios where there are many layers, and many experiments within each layer, with arbitrary traffic splits. To solve this, we took another approach. Instead of ensuring that the experiments are partitioned fairly and assigning a user to a fixed position across all layers, we randomly assigned users to a position within each layer. To build the logic behind this, we introduced a new concept:

  • Bucket – A layer is split into many buckets, and a bucket maps to a single experiment. This is used in traffic allocation: a user is assigned a bucket and that bucket determines which experiment they land in.

As long as the users are uniformly distributed across the buckets in each layer, and there is no dependence between the assignments in one layer and the assignments in another, then we will end up with a fair allocation, but with much simpler logic than the experiment-partitioning approach would require. We also must ensure that any given user is always assigned to the same bucket within each layer – we don’t want users to end up in different experiments in subsequent requests. Effectively, we need a random but deterministic mapping from users to buckets, which is independent between layers.

To achieve this, we concatenate the user ID with each layer ID, and then use a hash function to map the user to a bucket within each layer:

Ab Test Code
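A minimal Python sketch of such a mapping follows. The choice of SHA-256, the `:` separator, the bucket count of 1024 and all names (`bucket_for`, `experiment_for`, the allocation format) are assumptions made for illustration, not details of the Adform implementation.

```python
import hashlib

NUM_BUCKETS = 1024  # assumed value; a power of two


def bucket_for(user_id: str, layer_id: str, num_buckets: int = NUM_BUCKETS) -> int:
    """Map a user to a bucket within one layer, deterministically."""
    # Concatenate user ID and layer ID; the separator (an assumption) avoids
    # collisions such as ("ab", "c") versus ("a", "bc").
    key = f"{user_id}:{layer_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    # Use the first 64 bits of the digest as an integer, then reduce it.
    # With a power-of-two bucket count this keeps only the low-order bits,
    # so every bucket is equally likely if the hash output is uniform.
    return int.from_bytes(digest[:8], "big") % num_buckets


def experiment_for(bucket: int, allocation: list[tuple[str, int]]) -> str:
    """Map a bucket to an experiment.

    `allocation` is an ordered list of (experiment name, number of buckets);
    the bucket counts are expected to sum to the layer's bucket count.
    """
    upper = 0
    for name, size in allocation:
        upper += size
        if bucket < upper:
            return name
    raise ValueError("bucket is outside the allocated range")


if __name__ == "__main__":
    # Hypothetical 50/50 split of a layer between two experiments.
    allocation = [("feature_a_on", 512), ("feature_a_off", 512)]
    bucket = bucket_for("user-42", "feature_a")
    print(bucket, experiment_for(bucket, allocation))
```

Because the layer ID is part of the hashed key, the same user gets an unrelated bucket in each layer, while repeated requests from that user always land in the same bucket within a layer.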

As long as we use a good hash function and choose a power of two for our number of buckets (an explanatory post on Stack Overflow covers the reasoning), this will give the random but deterministic mapping that we need. For four users, we may end up with a mapping that looks something like the following diagram.

Ab Testing 05 New
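The fairness property described earlier (uniform within each layer, independent across layers) can be checked empirically by cross-tabulating the assignments of many users in two layers. The sketch below uses synthetic user IDs and an assumed SHA-256-based mapping; the layer names and function names are hypothetical.

```python
import hashlib
from collections import Counter


def bucket_for(user_id: str, layer_id: str, num_buckets: int) -> int:
    """Assumed hash-based mapping from (user, layer) to a bucket."""
    key = f"{user_id}:{layer_id}".encode("utf-8")
    return int.from_bytes(hashlib.sha256(key).digest()[:8], "big") % num_buckets


def cross_tabulate(num_users: int = 20_000, num_buckets: int = 2) -> Counter:
    """Count users per (bucket in layer A, bucket in layer B) combination."""
    counts = Counter()
    for i in range(num_users):
        user_id = f"user-{i}"  # synthetic IDs, for illustration only
        a = bucket_for(user_id, "layer-a", num_buckets)
        b = bucket_for(user_id, "layer-b", num_buckets)
        counts[(a, b)] += 1
    return counts


if __name__ == "__main__":
    # With two buckets per layer, a fair allocation should put roughly a
    # quarter of the users into each of the four combinations.
    for cell, n in sorted(cross_tabulate().items()):
        print(cell, n)
```

If the assignments in one layer depended on those in the other, some combinations would be clearly over- or under-represented, as in the unfair allocation shown earlier.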


Configuring Experiments with Eligibility Criteria

Another requirement for our A/B testing framework was to enable experiments to have some kind of eligibility criteria that allow the experiment to be active only under certain conditions. This means associating a set of attributes with the experiment. For example, the attributes {‘country’: ‘UK’, ‘gender’: ‘male’} indicate that a user is eligible for an experiment only if they are male and live in the UK. If a user lands in an experiment that they are not eligible for, we give them the ‘default experience’ and, to avoid introducing bias, we discard that user’s data rather than including it when calculating KPIs.
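A minimal sketch of such an eligibility check, assuming that criteria are exact-match attribute values and that a user missing an attribute is treated as ineligible. The function names and the returned tuple are hypothetical.

```python
def is_eligible(user_attributes: dict, criteria: dict) -> bool:
    """Return True if the user matches every attribute in the criteria.

    Assumption: a user without a required attribute is not eligible.
    """
    return all(user_attributes.get(key) == value
               for key, value in criteria.items())


def resolve_experience(user_attributes: dict, experiment: str, criteria: dict):
    """Return (experience, include_in_kpis) for the experiment a user landed in."""
    if is_eligible(user_attributes, criteria):
        return experiment, True
    # Ineligible users get the default experience and are excluded from KPIs.
    return "default", False


if __name__ == "__main__":
    criteria = {"country": "UK", "gender": "male"}
    print(resolve_experience({"country": "UK", "gender": "male"},
                             "feature_a_on", criteria))
    print(resolve_experience({"country": "DK", "gender": "male"},
                             "feature_a_on", criteria))
```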

When using the layer model with eligibility criteria, there are situations where we can introduce new layers for the purpose of minimising the amount of data discarded. For example, let’s say that we want to test Feature A in the UK and Denmark only. We could define a layer with four experiments, as shown below.

Ab Testing 07

This works, and adheres to our original definition of a layer; however, this configuration results in some traffic being discarded unnecessarily. For example, if a UK user lands in a Danish experiment, they will be given the default experience, but will not participate in an experiment. This is clearly not optimal – only 50% of UK and Danish users will participate in an A/B test. To avoid this wastage, we can simply use two layers instead of one:

Ab Testing 04

Defining multiple layers in this way is acceptable whenever the sets of users eligible for each layer’s experiments are disjoint. If this were not the case in the above example, then we could end up with a conflict: a user could be assigned On in one layer and Off in another. If we wanted to add another A/B test – say, testing Feature A on all male users – then we would have to define all experiments in a single layer, as there is some overlap in the eligible users.
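The disjointness condition can be checked mechanically when criteria are simple attribute/value pairs. The sketch below assumes exact-match criteria and that each user has exactly one value per attribute; the function names and country codes are illustrative.

```python
from itertools import combinations


def criteria_are_disjoint(a: dict, b: dict) -> bool:
    """Return True if no user can satisfy both sets of criteria.

    Under the exact-match assumption, two criteria sets are disjoint only
    if they require different values for at least one shared attribute.
    """
    return any(key in b and b[key] != value for key, value in a.items())


def can_use_separate_layers(criteria_list: list[dict]) -> bool:
    """Return True if every pair of criteria sets is disjoint."""
    return all(criteria_are_disjoint(a, b)
               for a, b in combinations(criteria_list, 2))


if __name__ == "__main__":
    uk = {"country": "UK"}
    dk = {"country": "DK"}
    male = {"gender": "male"}
    # UK and Danish users never overlap, so two layers are safe.
    print(can_use_separate_layers([uk, dk]))
    # A male UK user matches both `uk` and `male`, so a single layer is needed.
    print(can_use_separate_layers([uk, dk, male]))
```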

In conclusion, A/B testing can become fairly complex when running multiple simultaneous experiments, and there are some subtle issues that must be addressed in order to ensure that experiments are fair and meaningful. At Adform, we have built a general framework that can be used with any of our products, and helps us avoid handling the complexities of A/B testing in multiple projects.