rc3.org Rafe Colburn on software development (and other topics)

Posts Tagged ‘big data’

Distributed systems do not provide free reliability

Distributed database creator and LinkedIn engineer Jay Kreps pokes a hole in the widespread myth that systems that scale horizontally are inherently reliable. This is an important post, because intuitively it seems like a system that scales horizontally and makes provisions for fault tolerance should be reliable. Indeed, that’s the value proposition for many people. Rather than having to be smart enough to provision big servers intelligently and figure out how to make them fault tolerant, you can just throw commodity hardware at the problem and be ready to swap out systems when you encounter the occasional hardware failure.

If you enjoy that article, move on to Daniel Abadi’s post on Replication and the latency-consistency tradeoff. It’s not about system failure but about the performance characteristics of distributed systems. This is the sort of real-world issue that often gets glossed over when people talk about distributed systems.

The hype about the new distributed database systems is that they make life easy. The truth is that they’re incredibly complex, but they make it possible for small companies to achieve things that were out of reach for all but the largest companies until very recently. I’m just starting to wrestle with some of this stuff and you can expect more posts about this topic.

Big Data demands better shell skills

At work, I’ve been experimenting with Apache Solr to see whether it’s the best choice for searching a very large data set that we need to access. The first step was to just set it up and put a little bit of data into it in order to make sure that it meets our current and anticipated future requirements. Once I’d figured that out, the next step was to start loading lots of data into Solr to see how well it performs, and to test out import performance as well.

Before I could do that, though, I generated about 33 million records to import, which take up about 10 gigabytes of disk space. That’s not even 5% of the space that the full data set will take up, but it’s a start.

What I’m quickly learning is that when it comes to dealing with Big Data, knowledge of the Unix shell is a huge advantage. To give an example, I’m currently using Solr’s CSV import feature to import the test data. If we wind up using it in production, writing our own DataImportHandler will certainly be the way to go, but I’m just trying to get things done right now.

Here’s the command the documentation suggests you use to load a CSV file into Solr:

curl http://localhost:8983/solr/update/csv --data-binary @books.csv \
    -H 'Content-type:text/plain; charset=utf-8'

I quickly found out that when you tell curl to post a 10 gigabyte file to a URL, it runs out of memory, at least on my laptop.

These are the kinds of problems for which Unix provides ready solutions. I used the split command to split my single 10 gigabyte file into 33 files, each a million lines long. split helpfully named them xaa, xab, and so on, all the way through xbh. You can use command line arguments to tell split to use more meaningful names. Anyway, then I used a for loop to iterate over each of the files, using curl to submit them:

for file in x* ; do
    curl http://localhost:8983/solr/update/csv --data-binary "@$file" \
        -H 'Content-type:text/plain; charset=utf-8'
done

That would have worked brilliantly, except that Solr wants you to list the fields in the file on the first row of your CSV file, so only the first file imported successfully. I wound up opening all of the others in vim* and copying the headers over rather than writing a script, proving that I need to brush up on my shell skills as well, because prepending a line to a file is easy if not elegant.

Once the files were updated, I used the loop above to import the data.
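For what it's worth, here is a minimal sketch of how the splitting and header-copying steps could be scripted instead of done by hand. It is not what I actually ran. The function names (split_csv, add_header), the file name records.csv, and the part_ prefix are all made up for illustration; the only tools involved are split, head, cat, and mv.

```shell
# Hypothetical helper: split FILE into chunks of LINES lines each,
# named PREFIXaaa, PREFIXaab, and so on.
# Usage: split_csv FILE LINES PREFIX
split_csv() {
    # -l sets the number of lines per chunk; -a 3 asks for
    # three-letter suffixes so there is room for many chunks.
    split -l "$2" -a 3 "$1" "$3"
}

# Hypothetical helper: copy the first line (the CSV header) of
# FIRST_CHUNK onto the top of every other chunk listed.
# Usage: add_header FIRST_CHUNK CHUNK...
add_header() {
    first=$1
    shift
    header=$(head -n 1 "$first")
    for file in "$@"; do
        # The first chunk already starts with the header, so skip it.
        [ "$file" = "$first" ] && continue
        # Write header plus original contents to a temporary file,
        # then move it over the original.
        { printf '%s\n' "$header"; cat "$file"; } > "$file.tmp" &&
            mv "$file.tmp" "$file"
    done
}

# Example (assumed names):
#   split_csv records.csv 1000000 part_
#   add_header part_aaa part_*
```

Each chunk is rewritten in full, so this is no faster than doing it by hand for a single file, but it scales to any number of chunks and needs disk space for only one extra chunk at a time.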

When it comes to working with big data sets, there are many, many tasks like these. Just being able to use pipes to make sure that your very large data files are always compressed can be a life-saver. Understanding shell scripting is the difference between accomplishing a lot in a day through automation or doing lots of manual work that makes you hate your job.
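As a rough illustration of the pipes point, the sketch below works on a gzip-compressed file without ever writing the uncompressed data to disk. The function names and file names are assumptions for the example; gzip -dc decompresses to standard output, and passing - to split tells it to read standard input.

```shell
# Hypothetical helper: count the lines in a gzip-compressed file.
# Usage: count_lines_gz FILE.gz
count_lines_gz() {
    gzip -dc "$1" | wc -l
}

# Hypothetical helper: split a gzip-compressed file into chunks of
# LINES lines each, reading the decompressed data from a pipe.
# Usage: split_gz FILE.gz LINES PREFIX
split_gz() {
    gzip -dc "$1" | split -l "$2" - "$3"
}

# Example (assumed names):
#   count_lines_gz records.csv.gz
#   split_gz records.csv.gz 1000000 part_
```

The chunks that split_gz writes are uncompressed, so in practice you would compress or delete each one once it has been imported.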

* I should add that MacVim gets extra credit for opening 33 files of 252 megabytes each at once without complaining. I just typed mvim x* and up popped a MacVim window with 33 buffers. Unix is heavy duty.

How the New York Times’ Netflix toy was made

So many people posted about how fascinated they were by the New York Times’ feature A Peek Into Netflix Queues that I didn’t even bother to blog about it at the time. It shows the popularity of movies on a map by zip code. It’s a great toy that will eat an hour of your time before you know it.

The Society for News Design published a piece last week explaining how the Netflix toy was built, along with a copy of the static version that was included in the paper, which I didn’t even know existed. It’s a fascinating case study, as interesting to me as the visualization itself.

Via Simon Willison.