Bandits and Corporate Schizophrenia

machine_testThe story has it that Richard Feynman and his biographer Ralph Leighton had a hard time at restaurants deciding whether to order the best dish they had tried so far or something new, which –who knows– might be even better. They decided to formalize this as a mathematical problem. I bet you have found yourself in a similar situation and you have made your choice without the aid of a mathematical model, simply following your intuition. What about companies? They face this kind of problem every other day. How do they choose their menu?

In 1991, James G. March introduced the important notion of exploration versus exploitation trade-off in organizations. Exploration and exploitation are essential for organizations, but they compete for scarce resources. As a result, organizations make implicit and explicit choices between the two. Compared to returns from exploitation, returns from exploration are always systematically less certain, more remote in time and organizationally more distant from the locus of action. On the other hand, learning processes characteristically improve exploitation more rapidly than exploration.

Since performance is a joint function of potential return from an activity and present competence of an organization at it, organizations exhibit increasing returns to experience. Positive local feedback produces strong path dependence and can lead to suboptimal equilibria. It is quite possible for competence in an inferior activity to become great enough to exclude superior activities with which an organization has little experience. (James G. March, “Exploration and Exploitation in Organizational Learning”)

In his original paper, March used a couple of simple theoretical models for mutual learning and competitive advantage to  derive some interesting insights into the relation between accumulation and utilization of knowledge in organizations. Subsequently, he reformulated his discussion of learning in terms of the bandit model. The so-called multi-armed bandit model is the canonical representation of the exploration–exploitation problem across literatures as diverse as economics, statistics, or computer science.


The name “multi-armed bandits” comes from a whimsical scenario in which a gambler faces several slot machines, (a.k.a. “one-armed bandits”) and has to decide which machines to play, how many times to play each machine and in which order. The machines look identical but, when played, each provides a random reward from a probability distribution specific to that machine. The objective of the gambler is to maximize the sum of rewards earned through a sequence of lever pulls.

The crucial issue here is the trade-off between acquiring new information by playing a new machine (exploration) and capitalizing on the information available so far as a result of the machines the gambler has already played (exploitation).

Bandit problems embody in essential form a conflict evident in all human action. This is the conflict between taking those actions which yield immediate reward and those (such as acquiring information or skill, or preparing the ground) whose benefit will come only later. — Peter Whittle

Many real-world problems, from the scheduling of clinical trials to the optimization of advertising placement, website morphing or traffic routing, can be modelled as multi-armed bandit problems. They have also been used to model the problem of managing research projects in a large organization, like a science foundation, a pharmaceutical company or a technology company. Given a fixed budget, the problem is to allocate resources among competing projects, whose properties are only partially known at the time of allocation, but which may become better understood as time passes.

Choices must be made between gaining new information about alternatives and thus improving future returns (which suggests allocating part of the investment to searching among uncertain alternatives), and using the information currently available to improve present returns (which suggests concentrating the investment on the apparently best alternative). The problem is complicated by the possibilities that new investment alternatives may appear, that probability distributions may not be table, or that they may depend on the choices made by others. (James G. March, Op. Cit.)

Real problems come in many different flavours and disguised under the hood of endless variation, and mathematical models can only provide a stylized approximation. However, would you ignore the insights which can be gained by a sensible but systematic use of mathematical simulation? In case you are wondering, the solution to the Feynman’s restaurant problem is that the number of dishes you must try before switching to ordering the best of them is:

\text{Number of dishes to try } = \sqrt{2(\text{Meals to be eaten at the restaurant}+1)} -1

You can afford to forget about the maths and go with your gut in a restaurant, but in the corporate arena the smart guys will certainly choose to do their homework. And while people do not normally exhibit an exaggerated greediness in a restaurant, in the market they will come hard if they like your dish more.


Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

This site uses Akismet to reduce spam. Learn how your comment data is processed.