Last year I started working in the world of Big Data, and at the time, I didn’t know that “data science” and “data engineering” were separate things. At some point, I looked at what my team is working on and realized that the distinction between the two is important, and that the team is firmly entrenched in the data engineering camp.
Data scientists get all the glory and attention, but without data engineering, there’s no way for data scientists to practice real science. I’ll talk more about this in another post.
In this first post, I want to talk about the four basic layers of the data engineering stack. These apply whether you’re working to enable people to collect analytic data for a Web-based business, or building the infrastructure for scientists to analyze rainfall patterns. The layers are:
- Instrumentation
- Data crunching
- Data warehousing
- End-user tools
Let’s look at an example from Web analytics, because that’s what I understand the best. A tool like Google Analytics spans all four layers but end users only have a measure of control over two of them. When you add the Google Analytics JavaScript to your Web site, you’re setting up the instrumentation. Google crunches the data they collect, and they warehouse it for you. You can then view reports using the Web interface. Google Analytics is a great general purpose tool, but the lack of control and visibility is what limits its potential.
At Etsy, we have our own custom instrumentation, our own Hadoop jobs to crunch the logs the instruments write to, our own data warehouse, and, for the most part, end-user tools for exploring that data that we wrote ourselves.
All of the data engineering team’s projects involve at least one layer of the stack. For example, we worked with our mobile team to add instrumentation to our native iOS and Android apps, and then we made changes to our Hadoop jobs to make sure that the new incoming data was handled correctly. The new mobile data also has implications for our end-user tools.
Along with building up the data infrastructure, managing data quality is the other priority of data engineering. It’s possible to lose data at every layer of the stack. If your instrumentation is built using JavaScript, you lose data from browsers that don’t have it enabled. Your instruments usually log through calls to some kind endpoint and if that endpoint is down or the connection is unreliable, you lose data. If people close the browser window before the instruments load, you lose data. If your data crunching layer can’t properly process some of the data from the instruments (often due to corruption that’s beyond your control), it’s lost. Data can be lost between the data crunching layer and the data warehouse layer, and of course bugs in your end-user tools can give the appearance of data loss as well.
In terms of skills and daily work, data engineering is not much different than other areas of software development. There are cases where having a background in math or quantitative analysis can be hugely helpful, but many of the problems are straightforward programming or operations problems. The big problems tend to be scaling each of the layers of the stack to accommodate the volume of data being collected, and doing the hardcore debugging and analysis required to manage data loss effectively.
That’s a quick description of what life in data engineering is like. I am planning on writing a lot more about this topic. If you have questions, please comment.
One explanation of the hype behind Big Data
Cam Davidson-Pilon talks about 21st Century Problems. Here’s how he describes most of the great technological leaps of the 20th century:
And here’s the truth behind the hype about Big Data we see so much of these days: