For the past 18 months or so, I’ve been working on data engineering at Etsy. That involves a pretty wide array of technologies – instrumenting the site using JavaScript, batch processing logs in Hadoop, warehousing data in an analytics database, and building Web-based tools for analytics, among other things. One of the persistently tough problems is moving data between systems in an accurate and timely fashion, sometimes transforming it into a more useful form in the process.
Fortunately, Etsy is anything but the only company in the world with this problem. When I was preparing for 2014, I scheduled meetings with my counterparts at a number of other companies to learn what they were doing in terms of data and analytics. When I came back, I had a pretty good idea of what we needed to do next year to harmonize all of our dispirate systems – set up infrastructure to allow them to communicate with one another using streams of events, otherwise known as logs.
Then, just about the time I was finished writing the 2014 plan, Jay Kreps published The Log: What every software engineer should know about real-time data’s unifying abstraction on the LinkedIn Engineering Blog. In it, he documents the philosophical underpinnings of pretty much everything I talked about in the plan.
The point of his post is that a log containing all of the changes to a data set from the beginning is a valid representation of that data set. This representation offers some benefits that other representations of the same data set do not, especially when it comes to replicating and transforming that dataset.
As it turns out, even though most engineers use logging, they are probably underexploiting logs in the form that Kreps describes in a pretty serious way. How do we scale up on the Web? We move more operations from synchronous to asynchronous execution, and we diversify our data stores and the ways in which we use them. (Even at Etsy, where we are heavily committed to MySQL, we use it in a variety of ways, some of which resemble key/value stores.) Representing data as a log makes it easier to handle asynchronous operations and diverse data stores, providing a clear path to scaling upward.
Kreps’ thoughts on logs are a must-read for any engineer working on the Web. I expect I’m not the only person who found that post to be perfectly timed.
Felix Salmon on the evolution of Netflix
Felix Salmon: Netflix’s dumbed-down algorithms
Good piece on the evolution of the Netflix user experience, necessitated by the transition from the “rent movies by mail” model to the streaming content model, and the licensing fees for streaming content online. I find that Netflix is OK if you just want to watch something, and mostly terrible if you want to watch a specific thing. The Netflix experience has been changed to discourage you from attempting the latter.