rc3.org

Strong opinions, weakly held

Tag: data engineering

Great reads of 2013: Jay Kreps on logs

For the past 18 months or so, I’ve been working on data engineering at Etsy. That involves a pretty wide array of technologies – instrumenting the site using JavaScript, batch processing logs in Hadoop, warehousing data in an analytics database, and building Web-based tools for analytics, among other things. One of the persistently tough problems is moving data between systems in an accurate and timely fashion, sometimes transforming it into a more useful form in the process.

Fortunately, Etsy is anything but the only company in the world with this problem. When I was preparing for 2014, I scheduled meetings with my counterparts at a number of other companies to learn what they were doing in terms of data and analytics. When I came back, I had a pretty good idea of what we needed to do next year to harmonize all of our dispirate systems – set up infrastructure to allow them to communicate with one another using streams of events, otherwise known as logs.

Then, just about the time I was finished writing the 2014 plan, Jay Kreps published The Log: What every software engineer should know about real-time data’s unifying abstraction on the LinkedIn Engineering Blog. In it, he documents the philosophical underpinnings of pretty much everything I talked about in the plan.

The point of his post is that a log containing all of the changes to a data set from the beginning is a valid representation of that data set. This representation offers some benefits that other representations of the same data set do not, especially when it comes to replicating and transforming that dataset.

As it turns out, even though most engineers use logging, they are probably underexploiting logs in the form that Kreps describes in a pretty serious way. How do we scale up on the Web? We move more operations from synchronous to asynchronous execution, and we diversify our data stores and the ways in which we use them. (Even at Etsy, where we are heavily committed to MySQL, we use it in a variety of ways, some of which resemble key/value stores.) Representing data as a log makes it easier to handle asynchronous operations and diverse data stores, providing a clear path to scaling upward.

Kreps’ thoughts on logs are a must-read for any engineer working on the Web. I expect I’m not the only person who found that post to be perfectly timed.

Defining data engineering

Last year I started working in the world of Big Data, and at the time, I didn’t know that “data science” and “data engineering” were separate things. At some point, I looked at what my team is working on and realized that the distinction between the two is important, and that the team is firmly entrenched in the data engineering camp.

Data scientists get all the glory and attention, but without data engineering, there’s no way for data scientists to practice real science. I’ll talk more about this in another post.

In this first post, I want to talk about the four basic layers of the data engineering stack. These apply whether you’re working to enable people to collect analytic data for a Web-based business, or building the infrastructure for scientists to analyze rainfall patterns. The layers are:

  1. Instrumentation
  2. Data crunching
  3. Data warehousing
  4. End-user tools

Let’s look at an example from Web analytics, because that’s what I understand the best. A tool like Google Analytics spans all four layers but end users only have a measure of control over two of them. When you add the Google Analytics JavaScript to your Web site, you’re setting up the instrumentation. Google crunches the data they collect, and they warehouse it for you. You can then view reports using the Web interface. Google Analytics is a great general purpose tool, but the lack of control and visibility is what limits its potential.

At Etsy, we have our own custom instrumentation, our own Hadoop jobs to crunch the logs the instruments write to, our own data warehouse, and, for the most part, end-user tools for exploring that data that we wrote ourselves.

All of the data engineering team’s projects involve at least one layer of the stack. For example, we worked with our mobile team to add instrumentation to our native iOS and Android apps, and then we made changes to our Hadoop jobs to make sure that the new incoming data was handled correctly. The new mobile data also has implications for our end-user tools.

Along with building up the data infrastructure, managing data quality is the other priority of data engineering. It’s possible to lose data at every layer of the stack. If your instrumentation is built using JavaScript, you lose data from browsers that don’t have it enabled. Your instruments usually log through calls to some kind endpoint and if that endpoint is down or the connection is unreliable, you lose data. If people close the browser window before the instruments load, you lose data. If your data crunching layer can’t properly process some of the data from the instruments (often due to corruption that’s beyond your control), it’s lost. Data can be lost between the data crunching layer and the data warehouse layer, and of course bugs in your end-user tools can give the appearance of data loss as well.

In terms of skills and daily work, data engineering is not much different than other areas of software development. There are cases where having a background in math or quantitative analysis can be hugely helpful, but many of the problems are straightforward programming or operations problems. The big problems tend to be scaling each of the layers of the stack to accommodate the volume of data being collected, and doing the hardcore debugging and analysis required to manage data loss effectively.

That’s a quick description of what life in data engineering is like. I am planning on writing a lot more about this topic. If you have questions, please comment.

Why Web developers should care about analytics

I’m pretty sure the universe is trying to teach me something. For as long as I can remember, I’ve been dismissive of Web analytics. I’ve always felt that they’re for marketing people and that, at least in the realm of personal publishing, paying attention to analytics makes you some kind of sellout. Analytics is a discipline rife with unfounded claims and terrible, terrible products, as well as people engaging in cargo cultism that they pretend is analysis. Even the terminology is annoying. When people start talking about “key performance indicators” and “bounce rate” my flight instinct kicks in immediately.

In a strange turn of events, I’ve spent most of this year working in the field of Web analytics. I am a huge believer in making decisions based on quantitative analysis but I never connected that to Web analytics. As I’ve learned, Web analytics is just quantitative analysis of user behavior on Web sites. The problem is that it’s often misunderstood and usually practiced rather poorly.

The point behind this post is to make the argument that if you’re like me, a developer who has passively or actively rejected Web analytics, you might want to change your point of view. Most importantly, an understanding of analytics gives the team building for the Web a data-based framework within which they can discuss their goals, how to achieve those goals, and how to measure progress toward achieving those goals.

It’s really important as a developer to be able to participate in discussions on these terms. If you want to spend a couple of weeks making performance improvements to your database access layer, it helps to be able to explain the value in terms of increased conversion rate that results from lower page load time. Understanding what makes your project successful and how that success is measured enables you to make an argument for your priorities and, just as importantly, to be able to understand the arguments that other people are making for their priorities as well. Will a project contribute to achieving the overall goals? Can its effect be measured? Developers should be asking these questions if nobody else is.

It’s also important to be able to contribute to the evaluation of metrics themselves. If someone tells you that increasing the number of pages seen per visit to the site will increase the overall conversion rate on the site, it’s important to be able to evaluate whether they’re right or wrong. This is what half of the arguments in sports statistics are about. Does batting average or on base percentage better predict whether a hitter helps his team win? What better predicts the success of a quarterback in football, yards per attempt or yards per completion? Choosing the right metrics is no less important than monitoring the metrics that have been selected.

Finally, it often falls on the developer to instrument the application to collect the metrics needed for analytics, or at least to figure out whether the instrumentation that’s provided by a third party is actually working. Again, understanding analytics makes this part of the job much easier. It’s not uncommon for non-developers to ask for metrics based on data that is extremely difficult or costly to collect. Understanding analytics can help developers recommend alternatives that are just as useful and less burdensome.

The most important thing I’ve learned this year is that the analytics discussion is one that developers can’t really afford to sit out. As it turns out, analytics is also an extremely interesting problem as well, but I’ll talk more about that in another post. I’m also going to revisit the analytics for this site, which I ordinarily never look at, and write about that as well.

© 2025 rc3.org

Theme by Anders NorenUp ↑