This month I exercise my Second Amendment rights and deal with multi-arm bandits. I don’t like A/B test. They take a long time and I don’t really trust the results. There is always some chance it is a fluke.

**One-armed Bandit**

I want to be robbed of my money, so I’m going play the slot machines. This slot machine has an expected probability of paying out. Let’s say 30% of the time there is a payoff, which is represented by the dotted line below. After I pay the slot machine 100 times, I can compute what the payoff percentage is and see how close it is to what was set on the machine. I can also use math I learned before to get bar for a 95% confidence interval. The more I play the machine (trials), the tighter my confidence bounds become.

The confidence interval, , where = 1.96 for 95% confidence, is the number of trials and $p$ is the probability of a payout.

**A/B Test**

For an A/B, you are comparing two things and seeing which one is better. For web-related stuff your payoff is a conversion, whether or not the user takes a specific action. Below is a simulation of two possibilities, one with a true 4% conversion rate and one with a true 5% conversion rate. In the beginning, the confidence interval overlaps until with enough data, you can tell that one is better than the other.

The problem with this is that you need to wait until the specified time period of your trial is over. You’re doing an experiment and you can’t stop it midway through, because it would be unscientific. You are also missing out on a lot of conversions, because you need to split the users for your trial, forcing them to use the worst arm.

**Multi-armed Bandit**

The multi-armed bandit problem is framed by having many slot machines each with an unknown payout. Your goal is to maximize payout and you get to choose which arm you pull each time. How do you maximize your payout? You can think of each arm as a version of your website that you want to test.

**$latex\epsilon$-Greedy**

Our first attempt at solving this problem is being greedy and picking the best known arm to pull. You need to make a choice between exploration to find the best arm and exploitation to get the most payout. $\epsilon$ is chosen such that the best arm is chosen $1 – \epsilon$ of the time and the other arms are chosen $\epsilon$ of the time.

Pros: Easy to implement.

Cons: Can also get stuck on bad arms in the beginning. Will continue to explore bad arms after finding good arm.

**UCB**

The UCB1 algorithm is based on an upper confidence bound. Instead of determining the best arm from the historical payout percentage, we chose arm based on optimistic assumption that it could pay out as much as its upper confidence bound. As I plotted on one of the figures above, the confidence interval gets smaller as you do more trials. Arms with less trials have the potential to be better.

We define our expected reward as

successful_pulls / arm_pulls + sqrt(2 * log(total_pulls) / arm_pulls)

where successful_pulls / arm_pulls is the expected payout percentage in epsilon-Greedy. The more times you pull the arm, the UCB goes down.

Pros: Spreads pulls around while favoring better arms. Good when there is an obviously better arm.

Cons: Not good when probabilities differ by a little bit.

**Softmax**

If you want to be more mathematically inclined, you can try to attempt to sample arms with probability according to their expected payout.

Like epsilon, tau is a parameter you need to choose. In practice it is varied over time, called simulated annealing. You can think of tau as a temperature. When tau is high, the differences between the arms don’t matter as much. When tau is low, the differences get amplified. Simulated annealing turns the temperature down over time, giving the system time to explore before it switches to exploitation.

Pros: Spreads pulls around by expected payout.

Cons: Have to fiddle with tau. If difference between arms is small, will sample arms with equal probability. Does not use information about how many times arms have been pulled.

**Thompson Sampling**

Thompson sampling is what you do if you know what you’re doing. You choose arms based on the probability of it being the best, not the expected payout like softmax. You can think of each arm as a Bernoulli random variable, which is 1 for a probability and 0 for a probability p$. The more instances of seeing nothing, will make more likely to be low. The more instances of seeing 1, the more likely $p$ is close to 1. This beta distribution tells you precisely how likely the arm will have a certain conversion rate. You randomly sample the beta distribution of each arm and choose the arm with highest conversion rate as plucked from the sample.

Pros: Don’t waste time on bad arms. Will focus on arms that are most likely to be the best.

Cons: Expensive

**In Practice**

Random sampling from a beta distribution with every new user is not cheap. To make things easier, you will probably be updating your models at regular intervals. You probably should have an explicit explore phase since you are updating your models more slowly. You should be continuously running your multi-arm bandit test, because there is no reason not to. You should always be testing and measuring.

**Resources**