Problem of the Month: Multi-arm Bandit

This month I exercise my Second Amendment rights and deal with multi-arm bandits. I don’t like A/B tests. They take a long time and I don’t really trust the results. There is always some chance the result is a fluke.

One-armed Bandit

I want to be robbed of my money, so I’m going to play the slot machines. This slot machine has an expected probability of paying out. Let’s say 30% of the time there is a payoff, which is represented by the dotted line below. After I play the slot machine 100 times, I can compute what the payoff percentage is and see how close it is to what was set on the machine. I can also use math I learned before to get error bars for a 95% confidence interval. The more I play the machine (trials), the tighter my confidence bounds become.


The half-width of the confidence interval is $E = z_c\sqrt{\frac{p(1-p)}{N}}$, where $z_c = 1.96$ for 95% confidence, $N$ is the number of trials and $p$ is the probability of a payout.
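To make the shrinking error bars concrete, here is a minimal sketch of the calculation with $p(1-p)$ under the square root. The function name `margin_of_error` is my own, and it uses the normal approximation, which is only reasonable once you have a decent number of trials.

```python
import math

def margin_of_error(p_hat, n, z=1.96):
    """Half-width of the normal-approximation confidence interval
    for an observed payout rate p_hat after n trials."""
    return z * math.sqrt(p_hat * (1.0 - p_hat) / n)

# The interval tightens as the number of trials grows.
for n in (100, 1000, 10000):
    e = margin_of_error(0.30, n)
    print(f"N={n:>5}: 0.30 +/- {e:.3f}")
```

At 100 trials the bound is roughly 9 percentage points either way, which is why a short session at the slot machine tells you very little.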

A/B Test

For an A/B test, you are comparing two things and seeing which one is better. For web-related stuff your payoff is a conversion, whether or not the user takes a specific action. Below is a simulation of two possibilities, one with a true 4% conversion rate and one with a true 5% conversion rate. In the beginning, the confidence intervals overlap. With enough data, they separate and you can tell that one is better than the other.
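A simulation along these lines can be sketched as follows. This is not the code behind my figure, just an illustration: the names `interval` and `simulate_ab`, the checkpoint spacing and the seed are all assumptions. It reports the first checkpoint at which the two 95% intervals no longer overlap.

```python
import math
import random

def interval(successes, n, z=1.96):
    """Normal-approximation confidence interval for a conversion rate."""
    p = successes / n
    e = z * math.sqrt(p * (1 - p) / n)
    return p - e, p + e

def simulate_ab(rate_a=0.04, rate_b=0.05, step=1000, max_n=50000, seed=0):
    """Send one user to each variant per round and return the first
    checkpoint where the intervals separate, or None if they never do."""
    rng = random.Random(seed)
    conv_a = conv_b = 0
    for n in range(1, max_n + 1):
        conv_a += rng.random() < rate_a
        conv_b += rng.random() < rate_b
        if n % step == 0:
            lo_a, hi_a = interval(conv_a, n)
            lo_b, hi_b = interval(conv_b, n)
            if hi_a < lo_b or hi_b < lo_a:
                return n
    return None

print(simulate_ab())
```

Checking at every checkpoint like this is exactly the kind of peeking that makes stopping an A/B test early unscientific, so treat it as a way to build intuition rather than a stopping rule.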


The problem with this is that you need to wait until the specified time period of your trial is over. You’re doing an experiment and you can’t stop it midway through, because it would be unscientific. You are also missing out on a lot of conversions, because you need to split the users for your trial, forcing some of them onto the worse arm.

Multi-armed Bandit

The multi-armed bandit problem is framed by having many slot machines each with an unknown payout. Your goal is to maximize payout and you get to choose which arm you pull each time. How do you maximize your payout? You can think of each arm as a version of your website that you want to test.


Our first attempt at solving this problem is epsilon-Greedy: be greedy and pick the best known arm to pull. You need to make a trade-off between exploration to find the best arm and exploitation to get the most payout. $\epsilon$ is chosen such that the best known arm is pulled $1 - \epsilon$ of the time and the other arms are explored $\epsilon$ of the time.

Pros: Easy to implement.

Cons: Can get stuck on bad arms in the beginning. Will continue to explore bad arms after finding a good arm.
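Here is a minimal sketch of epsilon-greedy against simulated arms. The payout rates, the function name `epsilon_greedy` and the choice to explore uniformly over all arms (including the current best) are my assumptions for illustration.

```python
import random

def epsilon_greedy(true_rates, epsilon=0.1, pulls=10000, seed=0):
    """Simulate epsilon-greedy; returns (pull counts, win counts) per arm."""
    rng = random.Random(seed)
    n_arms = len(true_rates)
    counts = [0] * n_arms
    wins = [0] * n_arms
    for _ in range(pulls):
        if rng.random() < epsilon:
            # Explore: pick any arm uniformly at random.
            arm = rng.randrange(n_arms)
        else:
            # Exploit: pick the arm with the best observed payout rate.
            rates = [w / c if c else 0.0 for w, c in zip(wins, counts)]
            arm = rates.index(max(rates))
        counts[arm] += 1
        wins[arm] += rng.random() < true_rates[arm]
    return counts, wins

counts, wins = epsilon_greedy([0.2, 0.3, 0.5])
print(counts)
```

Note that even after the best arm is obvious, about an epsilon fraction of the pulls keeps going to random arms.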


The UCB1 algorithm is based on an upper confidence bound. Instead of determining the best arm from the historical payout percentage, we choose the arm based on the optimistic assumption that it could pay out as much as its upper confidence bound. As I plotted on one of the figures above, the confidence interval gets smaller as you do more trials. Arms with fewer trials have the potential to be better.

We define our expected reward as

successful_pulls / arm_pulls + sqrt(2 * log(total_pulls) / arm_pulls)

where successful_pulls / arm_pulls is the same expected payout percentage used in epsilon-Greedy. The more times you pull an arm, the smaller its bonus term gets and the closer its UCB falls toward the observed payout.

Pros: Spreads pulls around while favoring better arms. Good when there is an obviously better arm.

Cons: Not good when probabilities differ by a little bit.
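The UCB1 score above drops straight into a simulation. This is a rough sketch: pulling every arm once before scoring is the usual way to avoid dividing by zero, and the names `ucb_score` and `ucb1` and the payout rates are assumptions of mine.

```python
import math
import random

def ucb_score(successful_pulls, arm_pulls, total_pulls):
    """Observed payout rate plus the optimistic exploration bonus."""
    return successful_pulls / arm_pulls + math.sqrt(2 * math.log(total_pulls) / arm_pulls)

def ucb1(true_rates, pulls=10000, seed=0):
    """Simulate UCB1; returns (pull counts, win counts) per arm."""
    rng = random.Random(seed)
    n_arms = len(true_rates)
    counts = [0] * n_arms
    wins = [0] * n_arms
    for t in range(1, pulls + 1):
        if t <= n_arms:
            arm = t - 1  # pull each arm once so every count is non-zero
        else:
            scores = [ucb_score(wins[i], counts[i], t - 1) for i in range(n_arms)]
            arm = scores.index(max(scores))
        counts[arm] += 1
        wins[arm] += rng.random() < true_rates[arm]
    return counts, wins

counts, wins = ucb1([0.2, 0.3, 0.5])
print(counts)
```

There is no parameter to tune here, which is a nice contrast with epsilon.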


If you are more mathematically inclined, you can use softmax and sample arms with probability according to their expected payout $r_j$.

\frac{\exp(r_j/\tau)}{ \sum_i \exp(r_i/\tau)}

Like epsilon, tau is a parameter you need to choose. In practice it is varied over time, which is called simulated annealing. You can think of tau as a temperature. When tau is high, the differences between the arms don’t matter as much. When tau is low, the differences get amplified. Simulated annealing turns the temperature down over time, giving the system time to explore before it switches to exploitation.

Pros: Spreads pulls around by expected payout.

Cons: Have to fiddle with tau. If difference between arms is small, will sample arms with equal probability. Does not use information about how many times arms have been pulled.
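To see how tau changes the picture, here is a small sketch of the softmax probabilities and of sampling an arm from them. The function names are mine, and subtracting the maximum payout before exponentiating is just a standard trick to avoid overflow at low temperatures; it does not change the probabilities.

```python
import math
import random

def softmax_probs(rates, tau):
    """Probability of choosing each arm given its observed payout rate."""
    m = max(rates)
    weights = [math.exp((r - m) / tau) for r in rates]
    total = sum(weights)
    return [w / total for w in weights]

def choose_arm(rates, tau, rng=random):
    """Sample one arm index according to the softmax probabilities."""
    return rng.choices(range(len(rates)), weights=softmax_probs(rates, tau))[0]

# Two arms that differ by one percentage point, at three temperatures.
for tau in (1.0, 0.1, 0.01):
    print(tau, [round(p, 3) for p in softmax_probs([0.04, 0.05], tau)])
```

With payouts of 4% and 5%, a temperature of 1.0 gives an almost even coin flip between the arms, which is the "small differences" weakness listed in the cons.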

Thompson Sampling

Thompson sampling is what you do if you know what you’re doing. You choose arms based on the probability of each being the best, not the expected payout like softmax. You can think of each arm as a Bernoulli random variable, which is 1 with probability $p$ and 0 with probability $1 - p$. The beta distribution tells you, after seeing evidence, how likely it is that a Bernoulli random variable has a certain $p$. The more instances of seeing 0, the more likely $p$ is low. The more instances of seeing 1, the more likely $p$ is close to 1. This beta distribution tells you precisely how likely the arm is to have a certain conversion rate. You randomly sample the beta distribution of each arm and choose the arm with the highest sampled conversion rate.

Pros: Don’t waste time on bad arms. Will focus on arms that are most likely to be the best.

Cons: Expensive
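Thompson sampling is short to write down with the standard library. This sketch assumes a uniform Beta(1, 1) prior for every arm, which is my choice and not something fixed by the method; the payout rates and the name `thompson` are also just for illustration.

```python
import random

def thompson(true_rates, pulls=10000, seed=0):
    """Simulate Thompson sampling; returns (wins, losses) per arm."""
    rng = random.Random(seed)
    n_arms = len(true_rates)
    wins = [0] * n_arms
    losses = [0] * n_arms
    for _ in range(pulls):
        # Draw one plausible conversion rate per arm from its posterior,
        # Beta(wins + 1, losses + 1), then play the arm with the best draw.
        samples = [rng.betavariate(wins[i] + 1, losses[i] + 1) for i in range(n_arms)]
        arm = samples.index(max(samples))
        if rng.random() < true_rates[arm]:
            wins[arm] += 1
        else:
            losses[arm] += 1
    return wins, losses

wins, losses = thompson([0.2, 0.3, 0.5])
print([w + l for w, l in zip(wins, losses)])
```

The cost shows up in the inner loop: one beta sample per arm for every single pull.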

In Practice

Random sampling from a beta distribution with every new user is not cheap. To make things easier, you will probably be updating your models at regular intervals. You probably should have an explicit explore phase since you are updating your models more slowly. You should be continuously running your multi-arm bandit test, because there is no reason not to. You should always be testing and measuring.


Book of the Week: Bandit Algorithms for Website Optimization


This week I read Bandit Algorithms for Website Optimization, because I wanted more reference material for the Problem of the Month: Multi-Arm Bandit post. This book is a starting point if you find things on the internet too mathematical. It has code examples in Python. The figures in the book are plotted using R.

Red Light District

I’ve been thinking a lot about red light recently. Amsterdam is famous for De Wallen, its largest red-light district. Prostitution is legal and regulated in the Netherlands. Why are they called red-light districts? According to Wikipedia, it is because of red lights used in the brothel signs. But why red? After some thought, it makes perfect sense. The same logic for the red-light district is used by Apple (AAPL) in Night Shift, a feature added in iOS 9.3.

Color Temperature

A black-body will radiate light according to how hot it is. Things start out red and get bluer as they get hotter. The color temperature is the equivalent temperature of an object required to produce the spectrum of light. Light isn’t just one color. It is made up of light of different frequencies. The temperature is a way to describe the distribution of those frequencies. As the sun sets, the color gets redder, resulting in a lower color temperature. Blue light has been found to be bad for your sleep. At night, you want to get rid of the blue, which leaves you with red.

Sex and Sleep

Before Apple came out with Night Shift, I used f.lux to help get to sleep. It changes the color of your monitor to emit less blue light, getting your body ready to sleep. When I’m camping, I use the red light mode on my headlamp to protect my night vision when walking around at night. Fishermen also use red lights at night to sneak up on fish. Brothels are frequented at night, so the red lights help the customers be well rested for work the next day. Makes sense.

Book of the Week: Weapons of Math Destruction


This week I read Weapons of Math Destruction by Cathy O’Neil, mathbabe, because she accuses me of building WMDs. O’Neil is a data scientist with a PhD in mathematics from Harvard, who was previously a professor and a quant for D.E. Shaw during the Financial Crisis of 2008. I have a few issues with the book other than it saying that I have fallen to the dark side of big data. The book goes over how models affect a person’s life throughout their lifetime, starting with school and moving on to college, courts, work and voting. O’Neil’s main point is that correlation does not imply causation and unfortunate individuals are lumped into groups by association, which damages their prospects for life. For me, there are probably bigger underlying issues that need to be addressed. The models are merely a reflection of reality. Businesses treat people like cogs, because they are cogs.

Weapons of Math Destruction

The privileged, we’ll see time and again, are processed more by people, the masses by machines.

Let’s say you have a sample of people and you’re trying to predict whether or not a person is a rapist. In your dataset is Brock Turner. Additionally, your language models have “Stanford rapist” as a frequently occurring bigram since there are tons of news articles. The correlation between Stanford and rapist is high. Now you ask your trained algorithm whether or not a new person, Alice, is a rapist. Your algorithm has never seen Alice, but based on her information, it tells you that Alice is a rapist. Alice is shocked. Now being a wealthy person going to Stanford, Alice gets her father’s lawyers to subpoena and do discovery on the algorithm. After a few months the data scientist comes back and says she was labelled as a rapist, because she went to Stanford and Brock Turner went to Stanford. Because Alice was privileged, she was able to correct this oversight. The masses aren’t so lucky.

There are three elements to a Weapon of Math Destruction (WMD): opacity, scale and damage. WMDs are opaque, because you don’t know what the inputs are and how those inputs are used in computing a score. You don’t know what you did to deserve such a bad score and you have no way of rectifying the situation even though it may be a case of bad data. WMDs are usually adopted at scale as a cost saving measure to deal with many people who are difficult to sort individually. It becomes a competitive advantage. If things don’t go your way according to the model, you can suffer damages, like higher interest rates, rejection from jobs, increased incarceration and denied education.

The people who construct models create flawed models, because of their own biases. Models that allocate police use violent crime and nuisance crime, but if you look at the cost of crime, I’d say there is probably no bigger criminal than bankers. How much of our police force is dedicated to that? How many bankers went to jail for cheating people out of money? People have lost more money to bankers than they have to petty theft. Yet we don’t arrest bankers, because they are the ones who pay the bills. Where is the justice? That said, I am thankful to the police who safeguard my assets.

O’Neil says people get screwed, because of how big data puts you in a group of people like you and that group may be trash. I think this is antiquated. The algorithms I work on deal with the individual person. Things are personalized to a specific person, not a group. Sometimes when I try to explain it to people, they don’t believe it.

US News and Education

The college rankings from US News are pretty pervasive, but they don’t mean a thing. They were something thought up to drive magazine sales. There is no scientific rigor involved and colleges routinely try to game the rankings. In 2014, the Saudis made King Abdulaziz University’s math department rank just behind Harvard by paying highly-cited professors $72,000 to work for 3 weeks and change their affiliation on Thomson Reuters. The department had only been in existence for 2 years. Instead of focusing on providing a quality affordable education, colleges focus on increasing their US News rank. Even donating $1 as an alumnus helps, because the percentage of alumni donating is used in the ranking. You would think the cost of tuition would be an important factor, but it is left out.

Another sad thing is that teachers were graded on how much their students improved, which was called value-added. If you have a class filled with special education and smart students, they don’t really improve. The special education students will continue to score low and the smart students will continue to score high. What you want are some students in the middle, whom you can improve. You can also be screwed by their teacher from the year before. Let’s say their previous teacher “corrected” their test answers, so it looks like the students improved. Now when you get them and they test, they score lower than before, because you neglected to “correct” their answers. You get fired and the previous teacher gets a bonus. Doesn’t this sound like Wall Street banking? Even if there wasn’t any cheating, class sizes are so small that the results are not statistically meaningful due to sampling error.

Predatory Behavior

They rake in lead generation fees by providing a superfluous service to people, many of whom are soon targeted for services they can ill afford.

There is a dirty industry for targeting specific people with misleading ads to generate leads. For-profit colleges will pay up to $150 for good leads. You can sell good car insurance leads for $20. Each sucker has a price on their head. With Facebook (FB), you can target the exact demographics that are likely to be vulnerable to predatory payday loans. When you use Facebook, you shouldn’t forget that you are the product that they are selling.

Allstate insurance had a model for how likely you were to shop for lower prices. If you weren’t likely to shop around, they charged you more. It was part of their price optimization strategy.

Low Wage Workers

Reading about how low wage workers need to deal with last minute changes to scheduling due to optimization algorithms makes me think that Uber is great for the world. When there is uncertainty, it is difficult for you to plan for the future. With Uber, you can work when you want to. Well, sort of. Unless there is a surge going. Then you’ll be out there driving.

Simpson’s Paradox

A Nation at Risk had statistical errors that affected public policy by saying America’s schools were failing. It took researchers at Sandia National Laboratories to identify that the conclusion was false due to Simpson’s Paradox. Each of the individual groups was improving, but there were more poor students and minorities taking the SATs, which pulled the overall average down.
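Simpson’s Paradox is easier to believe with numbers in front of you. The figures below are made up purely for illustration and are not the Sandia data: two hypothetical groups each gain 10 points, yet the overall average falls because the mix of test takers shifts toward the lower-scoring group.

```python
def overall_mean(groups):
    """Weighted mean score across groups given {name: (mean_score, n_students)}."""
    total = sum(mean * n for mean, n in groups.values())
    count = sum(n for _, n in groups.values())
    return total / count

# Hypothetical numbers: (mean score, number of test takers).
before = {"group_a": (500, 900), "group_b": (400, 100)}
after = {"group_a": (510, 600), "group_b": (410, 400)}

print(overall_mean(before))  # each group scores lower here...
print(overall_mean(after))   # ...yet the overall mean is higher than this one
```

Both groups improved, but the overall mean drops from 490 to 470 because group_b went from 10% to 40% of the test takers.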


Systems will only become more opaque as AI becomes more prevalent. But are humans any better? In 2015, 43% of Republicans thought Obama was Muslim. If I had a choice, I might welcome our robot overlords.

Problem of the Month: Steering

I have a 1/12-scale monster truck that I need to steer around the real world without hitting things and causing mayhem. It has a camera, so it can see the world and react to what it sees by changing the steering angle of the wheels through a servo. It does drive by itself, but not very well yet.

PID Loop

The most common way of providing feedback to maintain some value is a PID loop.

\alpha = -\tau_P CTE - \tau_D \frac{d}{dt} CTE - \tau_I \sum CTE

The steering angle, \alpha, is determined by some constants times the proportional, derivative and integral terms of the cross track error (CTE). The derivative term should take care of overshooting and the integral term should take care of systematic biases. This usually works well in practice, but you need to determine good values for \tau through experimentation. I had some problems with experimentation as I will explain below.
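The PID formula translates almost line for line into code. This is a minimal sketch rather than what runs on my truck: the class name, the gain values in the usage lines and the choice to treat the first derivative as zero are all assumptions. The integral term is a plain running sum of the CTE, matching the formula.

```python
class PID:
    """Steering controller: alpha = -tau_p*CTE - tau_d*dCTE/dt - tau_i*sum(CTE)."""

    def __init__(self, tau_p, tau_d, tau_i):
        self.tau_p = tau_p
        self.tau_d = tau_d
        self.tau_i = tau_i
        self.prev_cte = None  # no previous error before the first update
        self.sum_cte = 0.0

    def steer(self, cte, dt=1.0):
        """Return the steering angle for the current cross track error."""
        d_cte = 0.0 if self.prev_cte is None else (cte - self.prev_cte) / dt
        self.sum_cte += cte
        self.prev_cte = cte
        return -self.tau_p * cte - self.tau_d * d_cte - self.tau_i * self.sum_cte

# Hypothetical gains; real values have to come from experimentation.
pid = PID(tau_p=0.2, tau_d=3.0, tau_i=0.004)
print(pid.steer(1.0))
print(pid.steer(0.5))
```

On the second call the error is shrinking quickly, so the derivative term pushes the steering the other way to avoid overshooting.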

Ackermann Steering

Image from wikipedia

If you have a no-slip condition on the wheels, you wind up with Ackermann steering, where the turning wheels are tangent to circles around a common turning center that lies on the line through the rear axle. One thing that I didn’t appreciate before is that the two front wheels turn at different angles.

In the diagram below, the box represents the car with the upper right corner of the box being the right front wheel. If you were to simulate movement with a fixed steering angle, you will get circles of different radii and different centers that lie on the rear axle axis. This makes the motion of the car difficult to predict for high steering wheel angles if your feedback is too slow.
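The geometry can be written down in a few lines. This sketch uses the standard no-slip relations, with the turning radius measured from the turning center to the middle of the rear axle. The wheelbase and track width are guesses for a 1/12-scale truck, not measurements of mine.

```python
import math

def turning_radius(wheelbase, steering_angle):
    """Radius of the circle traced by the middle of the rear axle
    for a single equivalent front steering angle (radians)."""
    return wheelbase / math.tan(steering_angle)

def ackermann_angles(wheelbase, track, radius):
    """Steering angles (radians) of the inner and outer front wheels
    for a turn of the given radius. Requires radius > track / 2."""
    inner = math.atan(wheelbase / (radius - track / 2))
    outer = math.atan(wheelbase / (radius + track / 2))
    return inner, outer

# Assumed dimensions in meters.
inner, outer = ackermann_angles(wheelbase=0.30, track=0.25, radius=1.0)
print(math.degrees(inner), math.degrees(outer))
```

The inner wheel always has to turn more sharply than the outer one, and the gap widens as the turn gets tighter.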


It helps a lot to know what is ahead if you are trying to optimize steering with a low frequency of updates. In practice, you have to think of it as an A* graph traversal, where the graph nodes are different chosen steering angles. The path of the car is a series of connected arcs. It is very difficult to negotiate a blind corner based on PID loop feedback between the current view from the car and the steering angle alone.
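The "series of connected arcs" idea can be sketched as a function that moves the car along one arc, plus a helper that lists the poses reachable from a node. This is only the successor step a graph search would need, not a full A* implementation, and the wheelbase, candidate angles and step length are assumptions.

```python
import math

def arc_step(x, y, heading, steering_angle, distance, wheelbase=0.30):
    """Move the rear axle midpoint along a circular arc (or a straight
    line when the steering angle is zero) and return the new pose."""
    if abs(steering_angle) < 1e-9:
        return (x + distance * math.cos(heading),
                y + distance * math.sin(heading),
                heading)
    radius = wheelbase / math.tan(steering_angle)
    new_heading = heading + distance / radius
    new_x = x + radius * (math.sin(new_heading) - math.sin(heading))
    new_y = y - radius * (math.cos(new_heading) - math.cos(heading))
    return new_x, new_y, new_heading

def successors(pose, angles_deg=(-20, 0, 20), distance=0.5):
    """Poses reachable from `pose` by holding each candidate steering angle."""
    return [(a, arc_step(*pose, math.radians(a), distance)) for a in angles_deg]

for angle, pose in successors((0.0, 0.0, 0.0)):
    print(angle, pose)
```

Chaining these steps gives the tree of candidate paths that a planner could score against whatever the camera sees ahead.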

We don’t drive based on reacting. We have some model of the world and behave according to expectations.


It is a lot cheaper and faster to create a virtual world to fine tune your algorithms before testing them in real life. The red target is set to go around in an ellipse while the blue car is told to track the target.


Book of the Week: Sprint


This week I read Sprint: How to Solve Big Problems and Test New Ideas in Just Five Days by Jake Knapp, because I’m impatient and want to solve big problems. I think of the book as a mix of design thinking, agile sprints, usability testing and lean startup. Knapp has refined the sprint process through experimentation and this book serves as a step by step guide. The stories in the book were fresh to me since they were more recent examples and companies in Google Ventures’ portfolio.


The purpose of a sprint is to generate and test ideas to solve a problem before fully committing resources.

  • A Decider is needed to make decisions.
  • A Facilitator is needed to make sure things stay on track.
  • The sprint lasts 5 days.
  • The ideal size for a sprint team is 7 or fewer people.
  • Start at 10 AM, end at 5 PM.
  • Lunch at 1 PM to break the day in half.

I was going to summarize each day of the sprint, but you’re better off reading the book. The checklist in the back of the book does a good job of detailing the specifics. If I was going to do a sprint, I would have a copy of the book on hand. Knapp has done enough sprints to know what works and what doesn’t.

There is one thing I would like to highlight. Group brainstorming doesn’t work. You need to work individually and then show, critique and combine ideas. People try to brainstorm, because it sounds cool and inclusive, but in practice it falls flat.

Also improving your drawing ability by reading Drawing on the Right Side of the Brain will make it easier to sketch your ideas during the sprint.

Retiring at 33

Justin, Mr. Root of Good, retired at age 33. Bloomberg had an article on early retirement. The reason you’ve probably been hearing about it more is that stocks have gained a lot in the past 7 years. If you kept putting money in the market since 2009, you would have amassed a substantial nest egg. There are a bunch of people who write about early retirement, but they also try to sell you on that lifestyle. Their job is to write about early retirement, because that is the product they are selling. I think they are still working.

Financial Independence / Retire Early

I got the book recommendations for The Life-Changing Magic of Tidying Up and A Guide to the Good Life from the financialindependence subreddit, which advocates the financial independence / retire early (FIRE) life. Some argue against it, because the sacrifices you make put you on the deferred life plan. Experiences make people happy. You don’t want to retire earlier at the expense of missing out on experiences.

Think about what you really want in life and create a plan to achieve it. Is there a point to continue working for money when you don’t have to anymore? Be very clear about the reasons why you are working and what you are spending your money on. Mountains of cash buy freedom from worry and put you on the level of FU.

There are 3 things you can do to retire earlier.

  • Make more money.
  • Spend less.
  • Die earlier.


Historically, I spent about $34,000 a year for the past 3 years. As my rent went up over time, I spent less money elsewhere to compensate, because I didn’t need to continue buying furniture and kitchen equipment. Retiring early means no employer to pay for health insurance, so expenditures will go up, but let’s just take $34,000 as a starting point.

Safe Withdrawal Rate

Based on my expenses, I need about $850,000 in investments if I follow the 4% safe withdrawal rate from the Trinity Study. This assumes a 30 year retirement period, so I need more money if I think I’m going to live longer than 30 years and less money if I’m going to die soon. When you’re going to die earlier than expected, you usually have some medical condition. The leading cause of bankruptcy is medical expenses. If you don’t seek treatment, you can have a glorious sunset instead. The amount of money you need depends on your expected expenses and expected retirement duration. Once you hit your target number, you never have to work again.
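The target number is just annual expenses divided by the withdrawal rate. Here is the arithmetic as a small sketch. The 4% figure is the one from the Trinity Study mentioned above, while the lower rates are hypothetical what-ifs for a longer retirement and not recommendations.

```python
def nest_egg(annual_expenses, withdrawal_rate=0.04):
    """Investments needed so that withdrawing `withdrawal_rate` of the
    starting balance each year covers `annual_expenses`."""
    return annual_expenses / withdrawal_rate

for rate in (0.03, 0.035, 0.04):
    print(f"{rate:.1%} withdrawal rate -> ${nest_egg(34_000, rate):,.0f}")
```

Being more conservative gets expensive fast: every half point shaved off the withdrawal rate adds well over a hundred thousand dollars to the target.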


What to do in retirement? The best thing to do is to take an extended period of time off of work and see what retirement is like. This could be a sabbatical or lumping all your vacation time in a single chunk. The point is to take a test drive before you take the plunge.

Some of the most fun experiences in my life occurred between jobs. Your mind is in a different state when you don’t have a job. If people say they are bored, they aren’t doing the right things. My hobby is reading. There are more books than I can read in my lifetime. I want to devour each one before I die.