Normally the performance of utilities and scripting languages really isn’t an issue – they’re all fast enough for the task at hand, but sometimes, that isn’t the case. For example, my team has built a database replication system that copies many millions of records from a set of sharded databases to a data warehouse every day. When it exports data from the originating databases, it needs to add a database identifier to every record, represented by a line in a TSV file.
The easiest way to do this is to pipe the output of the mysql command to a script that simply appends a value to each line. I started by using sed, for reasons of simplicity. This command appends a tab and the number 1 to every line of input:
sed 's/$/\t1/'
Unfortunately, as the amount of data being replicated increased, we found that the CPU on the box running the replication script was pegged at 100%, and sed
was using most of it, so I started experimenting with alternatives.
To test, I used a 50 million line text file. The table the sample was taken from has over 3 billion rows, so you can see why the performance of this simple piece of code becomes important.
My approach to testing is simple, cat the file through the transformation and then just redirect the output to /dev/null. Here’s the baseline (the test was run on my MacBook Pro):
$ time cat test.tsv > /dev/null
real 0m0.615s
user 0m0.006s
sys 0m0.608s
Here’s how sed performs:
$ time cat test.tsv | sed 's/$/\t1/' > /dev/null
real 0m57.405s
user 0m56.845s
sys 0m1.970s
I read on one of the Stack Exchange sites that Awk might be faster, so I tried that:
$ time cat test.tsv | awk '{print $0 "\t2"}' > /dev/null
real 3m51.618s
user 3m50.367s
sys 0m3.676s
As you can see, Awk is a lot slower than sed, and it doesn’t even use a regular expression. I also read that using Bash with no external commands might be faster, so I tried this out:
$ time cat test_5m.tsv | while read line; do echo "$line\t2"; done > /dev/null
real 7m24.761s
user 3m16.709s
sys 5m54.428s
Those results are from a test file with 5 million lines, 1/10 the size of the other tests. The Bash solution is roughly 10 times slower than the Awk solution. At this point, I felt a little stuck. Nothing I tried outperformed the sed approach that we were already using. For some reason, I thought it might be worth it to try a Perl one-liner. Perl is known for having good performance for a scripting language, but at the same time, I couldn’t imagine that Perl could outperform sed, which is much simpler. First, I tried a direct translation of the sed solution:
$ time cat test.tsv | perl -pe 's/$/\t2/' > /dev/null
real 0m42.030s
user 0m41.296s
sys 0m2.805s
I was surprised by this result, Perl beat sed handily. I’ve run these tests a number of times, and I’ve found that Perl reliably outperforms the sed equivalent in an apples to apples comparison. Of course, I didn’t have to use a regular expression here, I was just matching the end of the line. What happens when I leave the regex out?
$ time cat test.tsv | perl -ne 'chomp; print "$_\t2\n"' > /dev/null
real 0m12.938s
user 0m12.344s
sys 0m2.280s
This time I just strip the line ending, print out the line with the text I want to append, and then add the line ending back in. The original sed command is more than four times slower than this Perl one-liner, that’s a massive improvement.
There are a couple of lessons here. The first is that when you’re doing simple text processing, you may as well just use Perl one-liners. The idea that sed and awk are superior because they are smaller and simpler is not borne out by real-world results. (They may be faster for some things, but it’s clearly no sure thing.) Perl is mature and is obviously highly optimized.
The second is that while premature optimization may be the root of all evil, when you’re performing the same operation billions of times, even very small gains in efficiency can have huge impact. When the CPU on the server was pegged at 100% for a few days and the load average spiked at over 200, every gain in efficiency became hugely important.
If you want to dig into Perl one liners, the perlrun man page is one place to start.
Update: For the tests above, I used the default OS X versions of these tools. The versions were Perl 5.16.2, Awk 20070501, and some version of BSD sed from 2005.
Here are some other numbers, using GNU sed (4.2.2) and Awk (4.1.1), installed via Homebrew (rather than the old, default versions that are installed on OS X.) Perl still wins against Awk, but it’s a lot closer:
$ time cat test.tsv | gawk '{print $0 "\t2"}' > /dev/null
real 0m23.503s
user 0m23.234s
sys 0m1.596s
$ time cat test.tsv | gsed 's/$/\t1/' > /dev/null
real 2m32.154s
user 2m31.332s
sys 0m2.014s
On the other hand, the latest GNU sed takes it on the chin. It’s slower than Perl, Awk, and the old OS X default version of sed.
The strengths of low variance political configurations
Today I got around to listening to the A Not So Simple Majority episode of This American Life. The story is about the school board in East Ramapo in New York, where a group of people who all send their children to private religious schools took over the public school board, and have since been gutting the public school system and funneling the money to their private schools. The religion of the group in question isn’t really important. The story is infuriating on many levels, and at the end it left me thinking about how to prevent this sort of thing from happening. I think that the lesson is in the dangers of small-scale democracy. The school board in East Ramapo has a lot of power, not just to manage schools but also to set local property tax rates, and was subject to capture by a relatively small group of people.
At the other end of the spectrum we have the US Presidency. It’s a nationwide election, and separation of powers insures that the President can’t do that much anyway. The election cycle is long and painful. This all leads to low variance outcomes — President Bush and President Obama may not personally have that much in common, but America has not been a radically different place under one of them than the other. The entire system is built to reduce the variance between Presidents. Generally speaking, the smaller the electorate, the higher the variance. That’s why the House of Representatives features a much broader ideological spectrum than the Senate, for example.
Getting back to East Ramapo, I was reminded of the article about the small municipalities around St. Louis, Missouri that I recently linked to. They’re really too small to be well-governed or even governable. Similarly, there was the recent case of Bell, California, where the elected officials in a town of 38,000 made themselves the highest-paid municipal officials in the country. I wonder whether the problems in East Ramapo School District would never have occurred if the entire county had a single school district, rather than the nine it currently has.
People seem to reflexively romanticize small-scale democracy, but it’s exploitable and breakable in many ways. We should be warier of it.