Fortunately, Etsy is anything but the only company in the world with this problem. When I was preparing for 2014, I scheduled meetings with my counterparts at a number of other companies to learn what they were doing in terms of data and analytics. When I came back, I had a pretty good idea of what we needed to do next year to harmonize all of our dispirate systems – set up infrastructure to allow them to communicate with one another using streams of events, otherwise known as logs.
Then, just about the time I was finished writing the 2014 plan, Jay Kreps published The Log: What every software engineer should know about real-time data’s unifying abstraction on the LinkedIn Engineering Blog. In it, he documents the philosophical underpinnings of pretty much everything I talked about in the plan.
The point of his post is that a log containing all of the changes to a data set from the beginning is a valid representation of that data set. This representation offers some benefits that other representations of the same data set do not, especially when it comes to replicating and transforming that dataset.
As it turns out, even though most engineers use logging, they are probably underexploiting logs in the form that Kreps describes in a pretty serious way. How do we scale up on the Web? We move more operations from synchronous to asynchronous execution, and we diversify our data stores and the ways in which we use them. (Even at Etsy, where we are heavily committed to MySQL, we use it in a variety of ways, some of which resemble key/value stores.) Representing data as a log makes it easier to handle asynchronous operations and diverse data stores, providing a clear path to scaling upward.
Kreps’ thoughts on logs are a must-read for any engineer working on the Web. I expect I’m not the only person who found that post to be perfectly timed.