
Tag: big data

Defining data engineering

Last year I started working in the world of Big Data, and at the time, I didn’t know that “data science” and “data engineering” were separate things. At some point, I looked at what my team was working on and realized that the distinction between the two is important, and that the team is firmly entrenched in the data engineering camp.

Data scientists get all the glory and attention, but without data engineering, there’s no way for them to practice real science. I’ll talk more about this in another post.

In this first post, I want to talk about the four basic layers of the data engineering stack. These apply whether you’re enabling a Web-based business to collect analytics data or building the infrastructure for scientists to analyze rainfall patterns. The layers are:

  1. Instrumentation
  2. Data crunching
  3. Data warehousing
  4. End-user tools

Let’s look at an example from Web analytics, because that’s what I understand best. A tool like Google Analytics spans all four layers, but end users have any measure of control over only two of them. When you add the Google Analytics JavaScript to your Web site, you’re setting up the instrumentation. Google crunches the data they collect, and they warehouse it for you. You can then view reports using the Web interface. Google Analytics is a great general-purpose tool, but that lack of control and visibility is what limits its potential.

At Etsy, we have our own custom instrumentation, our own Hadoop jobs to crunch the logs the instruments write to, our own data warehouse, and, for the most part, end-user tools for exploring that data that we wrote ourselves.
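
To make those layers a little more concrete, here’s a toy version of the handoff between the first two of them. The event format, the file names, and the awk one-liner standing in for a Hadoop job are all invented for illustration, not anything we actually run:

# Instrumentation: an instrument appends one tab-delimited event per page view.
printf '%s\tpage_view\t/listings/123\n' "$(date +%s)" >> events.log

# Data crunching: a stand-in for a real Hadoop job, counting views per page.
awk -F'\t' '$2 == "page_view" { views[$3]++ } END { for (p in views) print views[p], p }' events.log \
    | sort -rn > pageview_counts.txt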

All of the data engineering team’s projects involve at least one layer of the stack. For example, we worked with our mobile team to add instrumentation to our native iOS and Android apps, and then we made changes to our Hadoop jobs to make sure that the new incoming data was handled correctly. The new mobile data also has implications for our end-user tools.

Along with building up the data infrastructure, managing data quality is the other priority of data engineering. It’s possible to lose data at every layer of the stack. If your instrumentation is built using JavaScript, you lose data from browsers that don’t have it enabled. Your instruments usually log through calls to some kind of endpoint, and if that endpoint is down or the connection is unreliable, you lose data. If people close the browser window before the instruments load, you lose data. If your data crunching layer can’t properly process some of the data from the instruments (often due to corruption that’s beyond your control), it’s lost. Data can be lost between the data crunching layer and the data warehouse layer, and of course bugs in your end-user tools can give the appearance of data loss as well.

In terms of skills and daily work, data engineering is not much different than other areas of software development. There are cases where having a background in math or quantitative analysis can be hugely helpful, but many of the problems are straightforward programming or operations problems. The big problems tend to be scaling each of the layers of the stack to accommodate the volume of data being collected, and doing the hardcore debugging and analysis required to manage data loss effectively.

That’s a quick description of what life in data engineering is like. I am planning on writing a lot more about this topic. If you have questions, please comment.

One explanation of the hype behind Big Data

Cam Davidson-Pilon talks about 21st Century Problems. Here’s how he describes most of the great technological leaps of the 20th century:

What these technologies have in common is that they are all deterministic engineering solutions. By that, I mean they have been created by techniques in mathematics, physics and engineering: often being modeled in a mathematical language, guided by physics’ calculus and constrained and brought to life by engineering. I argue that these types of problems, of modeling deterministically, are problems that our fathers had the luxury of solving.

And here’s the truth behind the hype about Big Data we see so much of these days:

Statistical problems describe the space we haven’t explored yet. Statistical problems are not new: they are likely as old as deterministic problems. What is new is our ability to solve them. Spear-headed by the (constantly increasing) tidal wave of data, practitioners are able to solve new problems otherwise thought impossible.

What to do with data scientists

I’ve been thinking a lot lately about where data scientists should reside on an engineering team. You can often find them on the analytics team or on a dedicated data science team, but I think that the best place for them to be is working as closely with product teams as possible, especially if they have software engineering skills.

Data scientists are, to me, essentially engineers with additional tools in the toolbox. When engineers work on a problem, they come up with engineering solutions that non-engineers may not see. They think of ways to make work more efficient with software. Data scientists do the same thing as engineers, but with data and mathematics. For example, they may see an opportunity to use a classifier where a regular software engineer may not. Or they may see a way to apply graph theory to efficiently solve a problem.

This is the point of the Javier Tordable presentation on Mathematics at Google that I’ve linked to before. The problem with having a dedicated data science team is a lack of exposure to inspiring problems. The best way to enable people to apply their specialized skills is to let them feel the pain of the problem firsthand. As they say, necessity is the mother of invention.

The risk, of course, is that a data scientist embedded on one team may have no exposure at all to problems faced by other teams that they could solve. In theory, putting data scientists on their own team and having them consult lets them engage with the problems where they’re most needed, but in practice I think it often keeps them too far from the front lines to be maximally useful.

It makes sense to have data scientists meet up regularly so that they can talk about what they’re doing and share ideas, but I think that most of the time, they’re better off collaborating with members of a product team.

Big Data and analytics link roundup

Here are a few things that have caught my eye lately from the world of Big Data and analytics.

Back in September, I explained why Web developers should care about analytics. This week I noticed a job opening for a Web developer at Grist that includes knowledge of analytics in the list of requirements. That doesn’t exactly make for a trend, but I expect to see a lot more of this going forward.

Also worth noting are the two data-related job openings at Rent the Runway. They have an opening for a data engineer and one for a data scientist. These two jobs are frequently conflated, and there is some overlap in the skill sets, but they’re not the same thing. For the most part what I do is data engineering, not data science.

If you do want to get started in data science, you could do worse than to read Hilary Mason’s short guide. Seth Brown has posted an excellent guide to basic data exploration in the Unix shell. I do this kind of stuff all the time.
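
Most of that kind of exploration boils down to a handful of piped commands, something like this (the log file and column here are hypothetical, not taken from Seth’s post):

cut -f3 access.log | sort | uniq -c | sort -rn | head -10

That one-liner prints the ten most common values in the third column of a tab-delimited log, which covers a surprising share of day-to-day data spelunking.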

Here are a couple of contrary takes on Big Data. In the New York Times, Steve Lohr has a trend piece on Big Data, Sure, Big Data Is Great. But So Is Intuition. Maybe it’s different on Wall Street, but I don’t see too many people divorcing Big Data from intuition. Usually intuition leads us to ask a question, and then we try to answer that question using quantitative analysis. That’s Big Data to me. For a more technical take on the same subject, see Data-driven science is a failure of imagination from Petr Keil.

On a lighter note, Sean Taylor writes about the Statistics Software Signal.

How will society adjust to ever-easier data collection?

The New York Times ran two opinion pieces this weekend, right next to each other, that both stand at the intersection of how government and politics work and the social change that results from technological change. In the first, Joe Nocera argues that the big question in the resignation of David Petraeus is whether we’re comfortable with the FBI snooping through our email on relatively flimsy grounds:

But the Petraeus scandal could well end up teaching some very different lessons. If the most admired military man in a generation can have his e-mail hacked by F.B.I. agents, then none of us are safe from the post-9/11 surveillance machine. And if an affair is all it takes to force such a man from office, then we truly have lost all sense of proportion.

The second was about what increased use of data in political campaigns means long-term. As I’ve mentioned, I’ve been working in the analytics world this year, so this topic is highly relevant to me. It’s also very complicated. On one hand, improving our ability to collect and analyze data enables us to better understand what people want and expect from our products, or, in the case of campaigns, our politicians. On the other hand, combining our more advanced understanding of human behavior with deeper data sets creates the opportunity for more effective manipulation in addition to more effective communication.

While the people creating big data tools may not be evil, the organizations that use those tools going forward may not share the same principles. The common thread in both the Petraeus case and the use of big data by campaigns is that, regardless of how comfortable we are with the government, campaigns, or companies knowing so much about us, we don’t really have any control over the gathering of that information.

Big Data and civil rights

Alistair Croll’s post about Big Data and civil rights is important, but I don’t think he has it exactly right. The article does a good job of explaining how Big Data differs from data warehousing as it was traditionally done. It also illustrates how Big Data can be misused as a vector of discrimination, violating people’s civil rights.

Big Data may enable people to discriminate in innovative new ways, or to make better guesses about who to discriminate against, but I don’t think this is necessarily a new civil rights challenge. Discrimination against protected groups is already illegal, at least in the United States. In fact, at least when it comes to employment law and housing, the doctrine of disparate impact also applies. It holds that you can be liable for discrimination even if your practices are not intentionally discriminatory, if those practices have a disproportionately adverse effect on members of a protected group.

It is not the use of Big Data to implement discriminatory practices but rather the discrimination itself that is the fundamental problem. The challenge in fighting discrimination is what it has always been: proving that the discrimination happened in the first place. If anything, in the era of Big Data, the data and algorithms leave a more concrete paper trail than the unspoken, unrecorded discrimination that still occurs every day.

Defining Big Data

Here’s an interesting definition of Big Data, from Douglas Patterson:

One useful definition of big data — for those who, like us, don’t think it’s best to tie it to particular technologies — is that big data is data big enough to raise practical rather than merely theoretical concerns about the effectiveness of anonymization.

Distributed systems do not provide free reliability

Distributed database creator and LinkedIn engineer Jay Kreps pokes a hole in the widespread myth that systems that scale horizontally are inherently reliable. This is an important post, because intuitively it seems like a system that scales horizontally and makes provisions for fault tolerance should be reliable. Indeed, that’s the value proposition for many people. Rather than having to be smart enough to provision big servers intelligently and figure out how to make them fault tolerant, you can just throw commodity hardware at the problem and be ready to swap out systems when you encounter the occasional hardware failure.

If you enjoy that article, move on to Daniel Abadi’s post on Replication and the latency-consistency tradeoff. It’s not about system failure but about the performance characteristics of distributed systems. This is the sort of real-world issue that often gets glossed over in conversations about distributed systems.

The hype about the new distributed database systems is that they make life easy. The truth is that they’re incredibly complex, but they make it possible for small companies to achieve things that were out of reach for all but the largest companies until very recently. I’m just starting to wrestle with some of this stuff and you can expect more posts about this topic.

Big Data demands better shell skills

At work, I’ve been experimenting with Apache Solr to see whether it’s the best choice for searching a very large data set that we need to access. The first step was to just set it up and put a little bit of data into it in order to make sure that it meets our current and anticipated future requirements. Once I’d figured that out, the next step was to start loading lots of data into Solr to see how well it performs, and to test import performance as well.

Before I could do that, though, I generated about 33 million records to import, which take up about 10 gigabytes of disk space. That’s not even 5% of the space that the full data set will take up, but it’s a start.

What I’m quickly learning is that when it comes to dealing with Big Data, knowledge of the Unix shell is a huge advantage. To give an example, I’m currently using Solr’s CSV import feature to import the test data. If we wind up using it in production, writing our own DataImportHandler will certainly be the way to go, but I’m just trying to get things done right now.

Here’s the command the documentation suggests you use to load a CSV file into Solr:

curl http://localhost:8983/solr/update/csv --data-binary @books.csv \
    -H 'Content-type:text/plain; charset=utf-8'

I quickly found out that when you tell curl to post a 10 gigabyte file to a URL, it runs out of memory, at least on my laptop.

These are the kinds of problems for which Unix provides ready solutions. I used the split command to break my single 10 gigabyte file into 33 files of a million lines each; split helpfully named them xaa, xab, and so on, all the way through xbh. (You can use command line arguments to tell split to use more meaningful names, and the invocation itself is sketched after the loop below.) Then I used a for loop to iterate over each of the files, using curl to submit them:

for file in x* ; do
    curl http://localhost:8983/solr/update/csv --data-binary @$file \
        -H 'Content-type:text/plain; charset=utf-8'
done
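
For reference, the split invocation itself was a one-liner along these lines (the input file name is a placeholder; -l sets the number of lines per chunk, and split’s default output names are the xaa, xab, and so on mentioned above):

split -l 1000000 records.csv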

That loop would have worked brilliantly, except that Solr expects the field names to appear on the first row of each CSV file, so only the first file imported successfully. I wound up opening all of the others in vim* and pasting in the header row rather than writing a script, proving that I need to brush up on my shell skills as well, because prepending a line to a file from the shell is easy, if not elegant.
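
For the record, a loop like the following would have done the job (a sketch; it assumes the header row is still the first line of xaa and that none of the chunk file names contain spaces):

# Save the header row from the first chunk, which still has it.
head -n 1 xaa > header.csv

# Prepend the header to every other chunk, then swap the new copy into place.
for file in x*; do
    if [ "$file" != "xaa" ]; then
        cat header.csv "$file" > "$file.tmp" && mv "$file.tmp" "$file"
    fi
done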

Once the files were updated, I reran the curl loop above to import the data.

When it comes to working with big data sets, there are many, many tasks like these. Just being able to use pipes to make sure that your very large data files are always compressed can be a life-saver. Understanding shell scripting is the difference between accomplishing a lot in a day through automation and doing lots of manual work that makes you hate your job.
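
A couple of examples of what I mean (the file names and the awk filter are made up for illustration): keeping raw files gzipped and streaming them through pipes means the uncompressed copy never has to touch the disk:

# Count records without ever writing the uncompressed data to disk.
gunzip -c records.csv.gz | wc -l

# Filter on a column and write the result back out compressed.
gunzip -c records.csv.gz | awk -F, '$3 == "US"' | gzip > us-records.csv.gz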

* I should add that MacVim gets extra credit for opening 33 files of 252 megabytes each at once without complaining. I just typed mvim x* and up popped a MacVim window with 33 buffers. Unix is heavy duty.

How the New York Times’ Netflix toy was made

So many people posted about how fascinated they were by the New York Times’ feature A Peek Into Netflix Queues that I didn’t even bother to blog about it at the time. It shows the popularity of movies on a map by zip code. It’s a great toy that will eat an hour of your time before you know it.

The Society for News Design published a piece last week explaining how the Netflix toy was built, along with a copy of the static version that was included in the paper, which I didn’t even know existed. It’s a fascinating case study, as interesting to me as the visualization itself.

Via Simon Willison.
