Last year I started working in the world of Big Data, and at the time, I didn’t know that “data science” and “data engineering” were separate things. At some point, I looked at what my team is working on and realized that the distinction between the two is important, and that the team is firmly entrenched in the data engineering camp.
Data scientists get all the glory and attention, but without data engineering, there’s no way for data scientists to practice real science. I’ll talk more about this in another post.
In this first post, I want to talk about the four basic layers of the data engineering stack. These apply whether you’re working to enable people to collect analytic data for a Web-based business, or building the infrastructure for scientists to analyze rainfall patterns. The layers are:
- Instrumentation
- Data crunching
- Data warehousing
- End-user tools
At Etsy, we have our own custom instrumentation, our own Hadoop jobs to crunch the logs the instruments write to, our own data warehouse, and, for the most part, end-user tools for exploring that data that we wrote ourselves.
All of the data engineering team’s projects involve at least one layer of the stack. For example, we worked with our mobile team to add instrumentation to our native iOS and Android apps, and then we made changes to our Hadoop jobs to make sure that the new incoming data was handled correctly. The new mobile data also has implications for our end-user tools.
In terms of skills and daily work, data engineering is not much different than other areas of software development. There are cases where having a background in math or quantitative analysis can be hugely helpful, but many of the problems are straightforward programming or operations problems. The big problems tend to be scaling each of the layers of the stack to accommodate the volume of data being collected, and doing the hardcore debugging and analysis required to manage data loss effectively.
That’s a quick description of what life in data engineering is like. I am planning on writing a lot more about this topic. If you have questions, please comment.
July 25, 2013 at 10:07 am
Is what you’ve described really any different from other software engineering? Lots of software goes through the four layers you’ve described. I’m not seeing a clear distinction that makes something data engineering. Is it simply the fact that a lot of data is being processed? Is it just that the emphasis is on ensuring data isn’t lost?
Data scientist I understand, because there is a methodology to finding information in all the data. Data engineering I’m not clear on (perhaps because this is the first time I’ve seen anyone specifically mention it).
July 26, 2013 at 1:41 pm
On a day-to-day basis, it’s not fundamentally different than other software engineering. It’s just a problem set. To me the most difficult mental adjustment was in starting to think quantitatively about acceptable data loss. If you’re writing a credit card processing system, you need complete fidelity end to end. If a transaction fails, you need to log it, understand it, and hopefully fix it. When it comes to building analytics data systems like ours, data loss is inevitable. The tricky part is quantifying that loss, understanding the implications of it on your analysis, and keeping it to a manageable level.
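To make "quantifying that loss" a little more concrete, here is a minimal Python sketch of comparing event counts between layers of the stack. Everything in it is an assumption for illustration: the stage names, the event counts, and the 1% loss budget are made up, and the helpers `loss_rate` and `stage_losses` are hypothetical, not part of any real pipeline.

```python
def loss_rate(sent: int, received: int) -> float:
    """Fraction of events seen upstream that never arrived downstream."""
    if sent <= 0:
        raise ValueError("sent must be positive")
    # Clamp at zero so duplicated events downstream don't show up as negative loss.
    return max(sent - received, 0) / sent


def stage_losses(counts):
    """Given ordered (stage, event_count) pairs, return the loss rate of each hop."""
    losses = {}
    for (prev_name, prev_n), (name, n) in zip(counts, counts[1:]):
        losses[f"{prev_name} -> {name}"] = loss_rate(prev_n, n)
    return losses


# Hypothetical event counts observed at each layer for one day.
counts = [
    ("instrumentation", 1_000_000),
    ("crunching", 996_000),
    ("warehouse", 995_500),
]

ACCEPTABLE_LOSS = 0.01  # assumed budget: 1% end to end

per_hop = stage_losses(counts)
end_to_end = loss_rate(counts[0][1], counts[-1][1])

for hop, rate in per_hop.items():
    print(f"{hop}: {rate:.4%}")
print(f"end to end: {end_to_end:.4%} (within budget: {end_to_end <= ACCEPTABLE_LOSS})")
```

Breaking the loss out per hop is the useful part: it shows which layer is dropping data, which is usually the first question once the end-to-end number drifts past whatever budget you've decided is acceptable.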
July 29, 2013 at 11:37 am
Great breakdown, thanks!