rc3.org

Strong opinions, weakly held

Tag: continuous deployment

Getting bug fixes into the pipeline

I don’t work at a company that does continuous deployment like Flickr, or Etsy, or Facebook. We are not close to doing continuous deployment. I’m not sure anybody at the company besides me is interested in continuous deployment. Generally speaking, when we create new features, the product team creates a spec, then developers write a short tech spec, then quality assurance comes up with a test plan, then we write the feature, test it, launch it, and then test it again.

There are many reasons why we do things this way, some of them good, some of them bad. For one thing, our product is aimed at businesses, and so our features are driven by feedback from customers and from people who work with our customers every day. As a developer, I don’t really know what sorts of features our customers would pay for. The second is that our product is a key piece of infrastructure for our customers. When we make changes that affect how it works, there’s pretty substantial risk of making our existing customers unhappy, so we have to be very careful about the changes we make to existing functionality in our product. Finally, our product and processes have been around since before continuous deployment or even automated tests were industry practices, and it’s tough to fight against history.

Anyway, one persistent problem we have is putting bug fixes for minor issues or general maintenance of the code base onto the schedule. The product people decide which features get implemented, and nobody wants to add testing time by approving additional development work that isn’t essential to whatever business requirement is the top priority at the time.

The trick is to find a way to hijack our process so that we can get more developer priorities onto the release schedule. What I realized recently is that there are gaps in our process — times when most of the team is focused on preparing for an upcoming release or we’re designing features and waiting between steps in the review process. We should be filling that time with work on other bug fixes and back end improvements.

So after discussing it with the team, we decided to create a new process. Generally we check all of the features and fixes for the release to come into trunk. Now we’ve created a new branch for each developer. They’ll claim bugs that they want to see fixed and check the fixes along with automated tests in them into their own branch. Then, as we review the bugs and fixes and our testers find time to verify those fixes, they will be merged into the trunk for the subsequent release.

If we were using Git instead of Subversion, I’d probably recommend creating a new branch for every bug fix or feature, but one branch per developer is the least painful approach with Subversion. It’ll be interesting to see how this new approach works. I get an email every night with a list of all the errors that are logged on our production servers on a daily basis, and I’m looking forward to making those email messages much shorter.

How Etsy does deployment

Erik Kastner at Etsy explains how their team handles deployment. They have moved to a continuous deployment system based on a tool they built themselves. Here’s the bottom line:

Using Deployinator we’ve brought a typical web push from 3 developers, 1 operations engineer, everyone else on standby and over an hour (when things went smoothly) down to 1 person and under 2 minutes.

I particularly liked this question:

It’s a metric I’ve come to call the “quantum of deployment”: what’s the smallest number of steps, with the smallest number of people and the smallest amount of ceremony required to get new code running on your servers?

I wish everyone would write up how their deployment process works, just because I find it so darn interesting.

The continuous integration tool chain

Since last year, I’ve been somewhat obsessed with continuous deployment. One of my work projects for this year is to start moving toward a continuous deployment model. Making progress is going to require a lot of engineering work, and also a few cultural changes.

My first big goal is to start deploying bug fixes as soon as they have been tested. Eventually it would be nice to break features out of the deployment process, but I think that if you have identified a bug and checked in a fix, the fix shouldn’t wait around for the next scheduled release to go live.

In September, I wrote about what you need to do to prepare for continuous deployment. Now I’m going to write a little bit about the tools I’m planning on using.

For continuous integration, I’m using Hudson, it’s surprisingly simple to set up as long as you have a build script that will compile all your code and run your tests. The first step is to get it to check out our code, make a build, and run unit tests against the build. I’ve accomplished that on my local machine, the next step will be to set it up on a server and have it start running the build automatically when people check in changes.

To improve test coverage, I’m using the Eclipse plugin EclEmma. It builds your code and shows exactly which lines of code are not covered by your unit tests. I find that test coverage tools are indispensable, especially when you’re working with a code base that was not written in a TDD regime. (A fair amount of code in this code base was written before TDD was under discussion.)

For testing Web interfaces, I plan on using Selenium. Ben Rometsch at Carsonified explains how to run Selenium tests through Hudson.

We also have a Web services component of our application that I plan on testing end to end. For those purposes, I’m writing a tool in Ruby that accepts tests in YML format, submits a request to the Web service, validates the response against the expected data, and then goes and checks the database to make sure the proper data was stored there as well. At some point I’ll be hooking that into Hudson as well. At that point, the process will look something like this:

  1. Check out the code
  2. Make a build
  3. Run the unit tests
  4. Set up a test version of the database
  5. Deploy the application in a test environment
  6. Start the application
  7. Run the Web service tests

The Web interface is a separate project and will follow its own process.

Getting continuous integration up and running is but one of four steps toward getting a continuous deployment regime set up. The next big step I’m taking is nailing down the build process and then updating the build scripts to remove a few manual steps we’re performing right now.

You shouldn’t hate releasing code

You shouldn’t hate releasing software. Especially if your software is a Web site, not something that comes in an installer or disk image. And yet it’s really easy to get into a state where releases suck. I think that the thought of making releases suck less is the one of the main attractions of continuous deployment.

Here’s a partial list of reasons why releases can suck:

  1. Releases require downtime. When downtime is required, releases have to be scheduled far in advance, and the release schedule starts to affect other things. It becomes harder to work productively near releases (especially if your source code management is not up to par), and usually downtime releases must occur in the middle of the night or on weekends.
  2. The prerequisites for each release are not well documented. I’ve been in situations where the release time rolls around, everybody is online and ready, and the proper build wasn’t tagged. Or the production build doesn’t have the production log settings (a mistake I personally have made a number of times). Or the list of database changes that need to be made wasn’t prepared. Or the new build is deployed and the database changes weren’t applied and so you have to start over.
  3. Releases are too large. At some point, releases become large enough that nobody really knows everything that a release impacts. At the smallest level, a release includes a single bug fix. It’s easy to understand the possible risks involved with such a release. A release that includes hundreds or even dozens of changes can become unmanageable.
  4. Releases are insufficiently tested. Nothing’s worse than everyone waiting in a holding pattern while developers furiously bang away to fix a bug that slipped into the release and was only found when the code went out.
  5. People don’t have enough to do. In many cases, lots of people have to be available for a release, but they spend a lot of that time waiting for things to happen. Frustrating.
  6. Releases take too long. Sometimes the release process is just too cumbersome. Getting releases right is important but ultimately the actual release process is just a form of housekeeping. The more people that are involved and the longer it takes, the more resources are used that could be better applied to something else. Plus most of the work associated with a release is really boring.

My theory is that releases don’t have to suck. The releases I’m involved with exhibit few of these problems, but my goal for 2010 is to eliminate the remaining problems to excise the remaining suckiness.

If you liked this, you may also like my post When deployments go wrong from a couple of years ago.

Preparing for continuous deployment

Since I watched the Tim Fitz video that I linked earlier, I’ve been thinking about continuous deployment. Whether you are really going to deploy every day or not, the idea that your code base is always in a position where it could be deployed immediately is enticing. As Tim points out, getting your code into that shape means that you’re getting a lot of things correct.

So I thought I’d break it down a bit and look at the steps required to get an application in shape for continuous deployment. From the presentation, it seems like the major steps are as follows:

  • Get your house in order with regard to deployment. If you’re going to be deploying frequently, you need a fully automated deployment system that works quickly and effectively across your whole cluster. The complexity goes up if you’re pushing database schema changes that would ordinarily require downtime. You also need the ability to easy roll back your deployments in case things go wrong.
  • Set up a robust suite of automated tests. To set up continuous deployment with confidence, you need to feel pretty comfortable that bugs won’t make it past your test suite. If you don’t have tests, this is a big hurdle to overcome.
  • Set up a continuous integration system. Once your tests are set up, you need to automate them so that they’re run every time code is committed. If necessary, you have to set up this system in parallel so that tests run quickly enough not to gum things up. (In the presentation, Fitz mentions that their full test suite takes 6 CPU hours to run and is split among 40 machines).

  • Build a monitoring system that is effective in catching post-deployment problems. When you’re deploying all the time and relying almost solely on automated testing to verify your builds, you want to monitor your application to make absolutely sure that everything is in order while it runs. This is your last and best line of defense against errors that are causing problems in production but that you haven’t noticed yourself.

While all of those things are necessary to implement a continuous deployment system effectively, it seems like these are all features you’d want for any application. I’m going to explore getting the application I work on into a state where it could move to a continuous deployment and write up what it takes to get all of the prerequisites set up.

The big question is whether it’s possible to invest the effort to put the work into setting things up so that continuous deployment could work if you aren’t actually going to do it. The truth is that while all of these features are really nice, there are a lot of applications out there that get by without them. Even so, I’d love to give it a shot.

Tim Fitz on continuous deployment

I’ve discussed continuous deployment before. The idea is simple — ship more often — ideally, with every single commit. The last post was about blog post explaining continuous deployment by Eric Ries, CTO of IMVU. For more on the concept see this presentation by his colleague, Tim Fitz, on the same subject. I am very much intrigued by this concept, if only because it frustrates me when release cycles slow down so much as applications mature.

Thoughts on Continuous Deployment

Eric Ries posts about continuous deployment at O’Reilly Radar. This is the second post on this approach I’ve seen lately, and I find it to be something interesting to think about, if not to be adopted.

Point five in his five step plan for migrating to continuous deployment is root cause analysis, or “the five whys”. He has a whole post just on that topic that’s definitely worth reading. The case for using root cause analysis is self-evident, the rest of the continuous deployment process is more interesting to discuss.

The basic idea is simple: deploy changes to the application as rapidly as possible, where the definition of “as possible” means “without breaking things”. This is a technique that clearly works for big sites. What I’d really like to do is spend a day or a week watching this process unfold, because it leads me to all sorts of questions.

My first question is, how does quality assurance play into this process? The process Eric describes involves lots of automated testing, but most shops have testers as well. Do they test on the production system as changes go live?

The second question is, what kind of version control strategy is being used to back up this process? I’d expect that each developer has their own branch, and that they merge from the production branch to their branch constantly, and only merge back to the production branch when they have a change they are ready to publish. That way developers can commit code frequently without worrying about it being deployed by mistake.

My third question is whether this process is more suited to some kinds of applications than others? Let’s take World of Warcraft for example — they have a big, heavy release process that involves lots of internal testing and lots of customer testing before important releases go out the door. Even with exhaustive testing, after every release there’s a round of tweaking to deal with all of the unintended consequences of their changes. A change to one constant in World of Warcraft (say, the amount of attack power a warrior gets per point of strength) has changes that ripple through the entire game. It also seems like they’d have to group changes to see how those changes interact with one another without inflicting them on players first.

They could try continuous deployment to the Public Test Realm, but everything that happens there is reported to and dissected by the player community. It’s risky.

I’d love to read a whole lot more about this kind of process and how it scales. It’s pretty much how everybody works on small applications that they maintain by themselves, but I’m very curious to hear more about how it works for big teams.

© 2025 rc3.org

Theme by Anders NorenUp ↑