EveryBlock’s post-mortem on the Amazon Web Services outage that took down their site sets a standard of accountability and transparency that every engineering team should aspire to. Rather than blaming AWS for the downtime, Paul Smith explains that had they followed the architectural guidelines AWS provides, they would have been fine. Nearly all potential outage scenarios can be mitigated given sufficient resources, but it rarely makes sense to build the infrastructure to avoid every known outage possibility. It’s just too expensive and time-consuming. What I respect is Paul Smith’s acknowledgement that it was EveryBlock’s own architectural choices, rather than the problems at the AWS data center, that resulted in the site’s downtime.
Update: This Webmonkey post on the outage is worth a read as well.