Experimentation Adrift

2018-03-30

You can learn numerous lessons from failure, almost all of them bogus. This one I have thought about enough over the years that I think it is worth writing down.

I started working on Google+ a few weeks before it publicly launched, and was there about a year. The first two months were frantic, but after that I managed to devote a few hours a week to a project that really interested me: bringing experiments to G+.

Google is very fond of experimentation, and I had spent time in other parts of the company seeing it in action. It really works. You can take a product and make it better through careful application of the scientific method.

It did not really work for G+ when I worked there. (It has been five years, so I could not say how they use experiments today.) Nothing went terribly wrong, but I did not manage to build what I wanted.

Experiments were run, metrics collected and studied. But the culture of scientific feedback I had seen in other parts of the company did not appear. Features were designed, implemented, and final decisions made using processes that had nothing to do with what was learned from prior experiments.

Effective scientists hypothesize, design and run an experiment to test their hypothesis, and then use what they learn for their next hypothesis.

There is a lot of hand-waving in this process. We conveniently minimize discussion of how exactly we generate hypotheses. Vague words like intuition or heuristic get used to describe it. One thing that is clear is that there is a cycle. You have to learn from your experiments. Your next hypothesis (or, in software, your next feature) needs to change when the experiment does not go the way you expected.

I think this is a common problem with A/B testing across the software industry. Two possibilities are tried. The winner by the metric of your choice (engagement, revenue, clicks, likes, etc.) is rolled out. For the next feature, another A/B test. A and B are two choices the designer liked, or management liked, or maybe even something a programmer liked. Either way, they are not two choices made after integrating what was learned from prior A/B experiments. Nor is the experiment designed to maximize what you can learn about your software features. When this technique is applied for long enough, the result is at best a stuttering hill climb, in steps of unknown length, up a hill you could not name. At worst, it is a kind of long-term p-hacking, where most of your A/B results are a wash, but sometimes it looks like the experiments really got you something.
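To make the long-term p-hacking point concrete, here is a minimal sketch in Python (my own illustration, nothing to do with any particular company's tooling). It simulates a series of A/B tests in which A and B are identical, yet a process that ships any statistically "significant" winner still declares a win roughly 5% of the time.

    import math
    import random

    random.seed(0)  # deterministic, purely for the sake of the example

    def ab_test(n, base_rate):
        """One A/B test where arms A and B share the same true conversion rate."""
        a = sum(random.random() < base_rate for _ in range(n))
        b = sum(random.random() < base_rate for _ in range(n))
        pooled = (a + b) / (2 * n)
        se = math.sqrt(2 * pooled * (1 - pooled) / n)
        if se == 0:
            return 1.0
        z = (b / n - a / n) / se
        # Two-sided p-value from the normal approximation.
        return math.erfc(abs(z) / math.sqrt(2))

    # 100 feature launches, 10,000 users per arm, identical 5% rate in both arms.
    launches = 100
    wins = sum(ab_test(10_000, 0.05) < 0.05 for _ in range(launches))
    print(f"{wins} of {launches} no-effect experiments look like significant wins")

Stack enough of those launches end to end and the dashboard shows a steady trickle of "wins" even though nothing actually improved.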

In retrospect, the parts of the company that make very effective use of experiments got lucky. A handful of excellent engineers had, in a past life, done long science PhDs in labs with enough other well-trained scientists that the necessary skills were ingrained: the brutal introspection and honesty required to accept the hard fact that an experiment has shown a truth you cherished to be wrong.

These effective experimenters had another thing going for them: regular releases and a long-term pace. When a team is in "start-up" mode, rushing to release features required by a large paying customer or to match a competitor's product, you can't take a week to refine a hypothesis and three more to design an experiment to test it. There are things to be done.

Finding both of these, training and time, is extremely difficult in a technology company. But there is no way around it. Scientific training takes years. Doing science takes years. So for now I am left to conclude that scientific culture is largely incompatible with Silicon Valley culture. There are a few places where it thrives and has enormous impact. In general, however, when I hear of the scientific process being applied to software I am extremely skeptical. What a pity.
