In marketing terms, a multi-armed bandit solution is a ‘smarter’ or more complex version of A/B testing that uses machine learning algorithms to dynamically allocate traffic to variations that are performing well, while allocating less traffic to variations that are underperforming.
The term "multi-armed bandit" comes from a hypothetical experiment where a person must choose between multiple actions (i.e., slot machines, the "one-armed bandits"), each with an unknown payout. The goal is to determine the best or most profitable outcome through a series of choices. At the beginning of the experiment, when odds and payouts are unknown, the gambler must determine which machine to pull, in which order and how many times. This is the “multi-armed bandit problem.”
One real-world example of a multi-armed bandit problem is when a news website has to make a decision about which articles to display to a visitor. With no information about the visitor, all click outcomes are unknown. The first question is, which articles will get the most clicks? And in which order should they appear? The website’s goal is to maximize engagement, but they have many pieces of content from which to choose, and they lack data that would help them to pursue a specific strategy.
The news website has a similar problem in choosing which ads to display to its visitors. In this case, they want to maximize advertising revenue, but they may be lacking enough information about the visitor to pursue a specific advertising strategy. Similar to the issue with news articles, they typically have a large number of ads from which to choose. Which ads will drive maximum revenue for their news site?
The website needs to make a series of decisions, each with unknown outcome and ‘payout.’
There are many different solutions that computer scientists have developed to tackle the multi-armed bandit problem. Below is a list of some of the most commonly used multi-armed bandit solutions:
Epsilon-greedy
This is an algorithm for continuously balancing exploration with exploitation. (In ‘greedy’ experiments, the lever with highest known payout is always pulled except when a random action is taken). A randomly chosen arm is pulled a fraction ε of the time. The other 1-ε of the time, the arm with highest known payout is pulled.
Upper confidence bound
This strategy is based on the Optimism in the Face of Uncertainty principle, and assumes that the unknown mean payoffs of each arm will be as high as possible, based on observable data.
Thompson Sampling (Bayesian)
With this randomized probability matching strategy, the number of pulls for a given lever should match its actual probability of being the optimal lever.
In a real-world scenario, we sometimes have data that can help inform decision-making when choosing between the various actions in a multi-armed bandit situation. This information is the ‘contextual bandit,’ the context and environment in which the experiment occurs.
In website optimization, contextual bandits rely on incoming user context data as it can be used to help make better algorithmic decisions in real time.
For example, you can use a contextual bandit to select a piece of content or ad to display on your website to optimize for click-through rate. The context is any historical or current information you have about the user, such as previously visited pages, past purchase information, device information or geolocation.
If you are a news portal, and you know that a person coming to your site has read entertainment articles in the past, you can select your top entertainment article to display at the top of the page for them to see. If you can see that your user is in Miami, you can display the local weather or other relevant information.
In deciding whether to use multi-armed bandits instead of A/B testing, you must weigh the tradeoff of exploitation vs. exploration (sometimes known as ‘earn or learn’.)
With A/B testing, you have a limited period of pure exploration where you allocate traffic in equal numbers to Version A and Version B. Once you declare a winner, you move into a long period of exploitation, where 100% of users go to the winning variation. One issue with this approach is that you waste resources on the losing variation while trying to gather data and learn which is the winner.
With multi-armed bandit testing, the tests are adaptive, and include periods of exploration and exploitation at the same time. They move traffic gradually towards winning variations, instead of forcing you to wait to declare a winner at the end of an experiment. This process is faster and more efficient because less time is spent on sending traffic to obviously inferior variations.
One of the main disadvantages to multi-armed bandit testing is its computational complexity. Simply put, it is just more difficult and resource-intensive to run multi-armed bandit tests.
There are some known situations where multi-armed bandit testing typically works best:
Headlines and short-term campaigns
The opportunity cost for waiting for the results of A/B tests make bandit algorithms a better choice for short-lived pieces of content like headline testing for new articles or holiday promotions.
Long-term dynamic changes
When the item being tested changes significantly enough to invalidate the results of an A/B test over time, multi-armed bandits provide an alternative to repeatedly retesting by continuously exploring.
Targeting
Targeting is another example of a long-term use of bandit algorithms. If certain types of users are more common than others, the multi-armed bandit can apply learned targeting rules sooner for more common users, while continuing to experiment on less common users.
Automation for scale
If you have multiple components to continuously optimize, the multi-armed bandit approach gives you a framework to partially automate the optimization process for low-risk problems, which can be too costly to analyze individually.
Optimizely uses a few multi-armed bandit algorithms to intelligently change the traffic allocation across variations to achieve a goal. Depending on your goal, you choose between the objectives: