Blameless post-mortems are one of the most notable (and perhaps most misunderstood) features of Etsy’s engineering culture. John Allspaw wrote about them on Code as Craft back in May 2012. In that post, he talks about human error and the “Bad Apple theory,” which holds that the best way to eliminate error is to eliminate the “bad apples” who introduce it.
Most of the time, when we talk about blameless post-mortems, it’s in the context of outages. I think, though, that once you accept the reasoning behind building a culture of learning around outages (as opposed to a culture of blame), it also changes, or at least should change, how you think about management in general.
Etsy’s practices around post-mortems are drawn largely from the field of accident investigation. One of the key concepts taken from that field is that of local rationality. You can read about it in this rather dry paper, Perspectives on Human Error: Hindsight Biases and Local Rationality, by David Woods and Richard Cook. To oversimplify: in the moment, people take actions that seem sensible to them in that context. Even when people take what seem to be negligent shortcuts, they do so confident that what they’re doing is going to work. They just happen to be wrong.
The challenge is in building resilient systems that enable the humans interacting with them to exercise local rationality safely. Disasters occur when the expected outcomes of actions differ from the actual outcomes. Maybe I push a code change that is supposed to make error messages more readable, but instead prevents the application from connecting to the database. The systems thinker asks what gave me the confidence to make that change, given the actual results. Did differences between the development and production environments make it impossible to test? Did a long string of successful changes give me the confidence to push the change without testing? Did I successfully test the change, only to find out that the results differed in production? A poor investigation would conclude that I am a bad apple who didn’t test his code properly and stop before asking any of those questions. That’s unlikely to lead to building a safer system in the long run. Only in an organization where I feel safe from reprisal will I answer questions like the ones above honestly enough to create the opportunity to learn.
I mention all of this to provide the background for the real point I want to make, which is that once you start looking at accidents this way, it necessarily changes the way you think of managing other people in general. When it comes to the bad apple theory in accident investigation, the case is closed: it’s a failure. Internalizing this insight has led me to also reject the bad apple theory when it comes to managing people in general.
Poor individual performance is almost always the result of a systems failure that is causing local rationality to break down. All too often the employee who is ostensibly performing poorly doesn’t even know that they’re not meeting the expectations of their manager. In the meantime, they may be working on projects that don’t have clear goals, or that they don’t see as important. They may be confronted with obstacles that are difficult to surmount, often as a result of conflicting incentives.
There are a million things that can lead to poor outcomes, only a few of which are due to the personal failings of any given person working on the project. If you accept that local rationality exists, then you accept that people are doing what they believe is expected of them. If they knew better, they would do better.
All this is not to say that there are never cases where an employment relationship should end. Sometimes people are on the wrong team, or at the wrong company. What I would say though is that the humane manager works to construct a system in which people can thrive, rather than getting rid of people who aren’t succeeding within a system that could quite possibly be unfit for humans. Even in the case where a person simply lacks the skills to succeed at the task at hand, someone else almost certainly assigned them the task or agreed to let them work on it. Their being in the position to fail reflects as poorly on the system as it does on the individual.
These principles are easier to apply within the limited context of investigating an incident than in the general context of managing an organization, or in the highly personal relationship between a manager and the person who reports to them. Focusing on the system and how to optimize it for the people who are part of it is the bedrock of building a just culture. As managers, it’s up to us to create a safe place for employees to explain the choices they make, and then use what we learn from those explanations to shore up the system overall. Simply tossing out the bad apples is a commitment to building a team that is unable to look back honestly and improve.
Playing with Vagrant
Vagrant is one of those things I hear people talking about but that I’ve never gotten around to playing with, or hadn’t gotten around to playing with until today. Vagrant is a solution to the problem of setting up a local development environment for Web development. Depending on the platform you use, this can be rather difficult. I don’t even want to think about Windows development, but even in a Unix-like environment (OS X or Linux), you can still run into problems.
Basically, your system probably has some version of the language runtime you’re using that doesn’t match the server’s, and reconciling the difference is painful. I’ve always hated solutions like virtualenv and dvm. Vagrant works by running a full-blown virtual machine somewhat transparently. Helpfully, the virtual machine mounts one of your local directories so that you can edit files in your tool of choice but run your development server on the virtual machine, which should match your server pretty closely. For example, to experiment with Google App Engine development, I created a Vagrant instance using ubuntu/trusty64 (the latest Ubuntu LTS release), then provisioned it with a short provisioning file.
When the instance is provisioned, it automatically downloads the Google App Engine SDK and extracts it. Then I can dig in.
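As a rough illustration of that setup, here is a minimal sketch of what a Vagrantfile along these lines might look like. It is not the file I actually used. The SDK_URL value is a placeholder rather than a real download link, the forwarded port is a guess at where a development server would listen, and the script assumes the archive unpacks into a directory called google_appengine.

```ruby
# Vagrantfile: an illustrative sketch, not the original provisioning file.
# SDK_URL is a placeholder; substitute the real link to the SDK archive.
SDK_URL = "https://example.com/google_appengine_sdk.zip"

Vagrant.configure("2") do |config|
  # Base box named in the post.
  config.vm.box = "ubuntu/trusty64"

  # Share the project directory with the guest (this mirrors Vagrant's default),
  # so files can be edited on the host and run inside the VM.
  config.vm.synced_folder ".", "/vagrant"

  # Assumed port for the development server; adjust to whatever yours uses.
  config.vm.network "forwarded_port", guest: 8080, host: 8080

  # Shell provisioner: runs as root the first time the box is brought up.
  # It downloads the SDK archive and extracts it into the vagrant user's home.
  config.vm.provision "shell", inline: <<-SHELL
    set -e
    apt-get update
    apt-get install -y unzip
    # Assumes the archive unpacks into a directory named google_appengine.
    if [ ! -d /home/vagrant/google_appengine ]; then
      wget -q -O /tmp/sdk.zip "#{SDK_URL}"
      unzip -q /tmp/sdk.zip -d /home/vagrant
      chown -R vagrant:vagrant /home/vagrant/google_appengine
    fi
  SHELL
end
```

With a file like this in the project directory, `vagrant up` creates and provisions the machine and `vagrant ssh` logs into it. The directory check keeps the download from repeating if you run `vagrant provision` again.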
This is a super-simple application. Next I want to try setting up a Vagrant instance that’s provisioned using the Chef server at work with our production Hadoop configuration, so that I can easily launch Hadoop jobs from my laptop rather than logging into a remote machine to do it.