rc3.org

Strong opinions, weakly held

Month: August 2014

Surprisingly, Perl outperforms sed and Awk

Normally the performance of utilities and scripting languages isn’t an issue – they’re all fast enough for the task at hand. Sometimes, though, that isn’t the case. For example, my team has built a database replication system that copies many millions of records from a set of sharded databases to a data warehouse every day. When it exports data from the originating databases, it needs to add a database identifier to every record, where each record is a line in a TSV file.

The easiest way to do this is to pipe the output of the mysql command to a script that simply appends a value to each line. I started by using sed, for reasons of simplicity. This command appends a tab and the number 1 to every line of input:

sed 's/$/\t1/'

Unfortunately, as the amount of data being replicated increased, we found that the CPU on the box running the replication script was pegged at 100%, and sed was using most of it, so I started experimenting with alternatives.

To test, I used a 50 million line text file. The table the sample was taken from has over 3 billion rows, so you can see why the performance of this simple piece of code becomes important.
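If you want to reproduce these numbers, any large TSV will do. Here’s one quick way to generate a synthetic test file – the contents below are made up for illustration and don’t materially affect the results:

```shell
# Generate a synthetic 5-million-line TSV for benchmarking.
# (My real file came from a production table; any tab-separated
# content exercises the same code paths.)
seq 1 5000000 | awk '{printf "%d\tsome_value\n", $1}' > test_5m.tsv
```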

My approach to testing is simple: cat the file through the transformation and redirect the output to /dev/null. Here’s the baseline (the tests were run on my MacBook Pro):

$ time cat test.tsv > /dev/null

real    0m0.615s
user    0m0.006s
sys 0m0.608s

Here’s how sed performs:

$ time cat test.tsv | sed 's/$/\t1/' > /dev/null

real    0m57.405s
user    0m56.845s
sys 0m1.970s

I read on one of the Stack Exchange sites that Awk might be faster, so I tried that:

$ time cat test.tsv | awk '{print $0 "\t2"}' > /dev/null

real    3m51.618s
user    3m50.367s
sys 0m3.676s

As you can see, Awk is a lot slower than sed, and it doesn’t even use a regular expression. I also read that using Bash with no external commands might be faster, so I tried this out:

$ time cat test_5m.tsv | while read line; do echo "$line\t2"; done > /dev/null

real    7m24.761s
user    3m16.709s
sys 5m54.428s
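As an aside, that Bash loop has correctness problems independent of its speed: plain echo in Bash doesn’t expand \t, and a bare read strips backslashes and leading whitespace. A safer version looks like this, though it won’t be any faster:

```shell
# IFS= and read -r keep whitespace and backslashes intact, and
# printf expands \t portably, which plain echo in Bash does not.
while IFS= read -r line; do
    printf '%s\t2\n' "$line"
done < test_5m.tsv > /dev/null
```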

Those results are from a test file with 5 million lines, 1/10 the size of the other tests, so by wall-clock time the Bash solution is roughly 20 times slower than the Awk solution. At this point, I felt a little stuck. Nothing I tried outperformed the sed approach that we were already using. For some reason, I thought it might be worth trying a Perl one-liner. Perl is known for having good performance for a scripting language, but at the same time, I couldn’t imagine that Perl could outperform sed, which is much simpler. First, I tried a direct translation of the sed solution:

$ time cat test.tsv | perl -pe 's/$/\t2/'  > /dev/null

real    0m42.030s
user    0m41.296s
sys 0m2.805s

I was surprised by this result: Perl beat sed handily. I’ve run these tests a number of times, and I’ve found that Perl reliably outperforms the sed equivalent in an apples-to-apples comparison. Of course, I didn’t have to use a regular expression here – I was just matching the end of the line. What happens when I leave the regex out?

$ time cat test.tsv | perl -ne 'chomp; print "$_\t2\n"' > /dev/null

real    0m12.938s
user    0m12.344s
sys 0m2.280s  

This time I just strip the line ending, print the line with the text I want to append, and add the line ending back. The original sed command is more than four times slower than this Perl one-liner – that’s a massive improvement.

There are a couple of lessons here. The first is that when you’re doing simple text processing, you may as well just use Perl one-liners. The idea that sed and awk are superior because they are smaller and simpler is not borne out by real-world results. (They may be faster for some things, but it’s clearly no sure thing.) Perl is mature and is obviously highly optimized.

The second is that while premature optimization may be the root of all evil, when you’re performing the same operation billions of times, even very small gains in efficiency can have huge impact. When the CPU on the server was pegged at 100% for a few days and the load average spiked at over 200, every gain in efficiency became hugely important.

If you want to dig into Perl one-liners, the perlrun man page is one place to start.

Update: For the tests above, I used the default OS X versions of these tools. The versions were Perl 5.16.2, Awk 20070501, and some version of BSD sed from 2005.

Here are some other numbers, using GNU sed (4.2.2) and Awk (4.1.1), installed via Homebrew (rather than the old default versions that ship with OS X). Perl still wins against Awk, but it’s a lot closer:

$ time cat test.tsv | gawk '{print $0 "\t2"}' > /dev/null

real    0m23.503s
user    0m23.234s
sys 0m1.596s

$ time cat test.tsv | gsed 's/$/\t1/' > /dev/null

real    2m32.154s
user    2m31.332s
sys 0m2.014s

On the other hand, the latest GNU sed takes it on the chin. It’s slower than Perl, Awk, and the old OS X default version of sed.

The difficulty of tracking police violence

Journalist D. Brian Burghart is compiling a database of police killings in the United States. He has a theory as to why it’s up to him to compile this database:

The biggest thing I’ve taken away from this project is something I’ll never be able to prove, but I’m convinced to my core: The lack of such a database is intentional. No government—not the federal government, and not the thousands of municipalities that give their police forces license to use deadly force—wants you to know how many people it kills and why.

He then goes on to explain all the ways people have attempted to thwart him in his effort to compile such a database on his own.

The article is really interesting, and more importantly, he needs help from volunteers to fill in the details in his database. In addition to shining yet another bright light on the issue of race (and racism) in America, projects like his also underscore the need for greater accountability for law enforcement, and databases like Fatal Encounters are one way to increase this accountability.

My favorite thing about President Obama

In an article about Presidential vacations, Ezra Klein makes this observation about the President:

Obama has limited patience for the idea that he should act like he’s doing his job rather than just do his job. A Democratic political consultant once told me that the problem with President Obama is that he doesn’t care to "get caught trying." That is to say, he doesn’t put on a show of trying to get things done when he doesn’t think the show will help get the thing done.

I know this gets him into all kinds of trouble (and behaving the same way has probably gotten me into all kinds of trouble as well), but I really wish more people were like this.

Policing based on fear can’t work

I’m a bit fixated on cases where the personal advice one might give to a friend or family member does not align with the policy we should pursue (see this post). Sunil Dutta, a 17-year veteran of the LAPD, recommends offering no resistance when you are detained by police if you want to prevent a bad outcome:

Even though it might sound harsh and impolitic, here is the bottom line: if you don’t want to get shot, tased, pepper-sprayed, struck with a baton or thrown to the ground, just do what I tell you. Don’t argue with me, don’t call me names, don’t tell me that I can’t stop you, don’t say I’m a racist pig, don’t threaten that you’ll sue me and take away my badge. Don’t scream at me that you pay my salary, and don’t even think of aggressively walking towards me. Most field stops are complete in minutes. How difficult is it to cooperate for that long?

That’s the least nuanced paragraph in the piece, but I think it’s worth talking about. This is good advice from one person to another. I’ve been pulled over a number of times and this is the strategy I’ve always followed. I don’t want to go to jail or even get a ticket – what can I do to put the cop at ease so that I hopefully get the outcome I want?

My nephew is going to get his driver’s license soon. This is the advice I’d give him. An altercation with a cop is a no-win scenario for the other party.

On the other hand, it’s completely troubling to see a cop say this to the public. Basically, this cop is warning all of us that unless you can make a policeman feel safe, you’re liable to be subjected to violence, potentially deadly violence, until the policeman does feel safe. Bear in mind that in the vast majority of interactions between police and the public, the policeman is the only armed party. Even more troubling is the fact that the biases of the police officer bear greatly on how safe the cop feels, regardless of the situation, making it vastly more likely that members of some age and ethnic groups will be subjected to violence or mistreatment than others.

The policy question is, how do we create a system where interactions between the police and civilians are governed by more than the cop’s feeling of being threatened? Maybe that means going back to partner policing. Maybe that means changing policy so that cops encounter armed citizens less often. It seems like a lot of cops shoot unarmed people out of fear that they are going for the cop’s gun – maybe cops shouldn’t carry guns as a matter of course. Whatever the solution is, serving notice to the public that “if you scare us, we will hurt you” can’t be part of it.

We’re seeing that on a mass scale in Ferguson, Missouri right now. An entire community has scared the police, and they’re responding by inflicting harm on that community on a daily basis until it stops. Policing that’s governed by fear can’t be effective. For another approach, see Jason Kottke’s post, Policing by Consent.

The ethics of Web experiments

Creating tools that facilitate online A/B experiments is a big part of my job. My team makes sure that we’re collecting data as accurately as possible, and we also created a tool that aggregates the results of experiments and performs statistical analysis of them to ensure that our analysis is valid. Needless to say, the controversy over experiments run by Facebook and OkCupid has been interesting to watch from a distance.

For some background on my involvement with Web experiments, you can read this post a member of my team wrote about experiments at Etsy back in 2012. I think it holds up pretty well.

Last week Christian Rudder wrote about OkCupid’s experiments on the OkTrends blog, in a provocative post entitled We Experiment On Human Beings! It was written in his inimitable style, with a pugilistic tone. OkCupid ran some pretty radical experiments, and Rudder isn’t apologizing for any of them. He was then interviewed on NPR and refused to apologize for anything OkCupid did.

I am a big believer in iterating on products through experimentation. As I wrote a couple of years ago, quantifying user behavior and analyzing it is what liberates us to some degree from the realm of anecdote and opinion. That said, there’s a reason why there are so many ethical guidelines in academia for experiments on human beings.

Writing at Kottke.org, Tim Carmody has the best argument I’ve read for why OkCupid’s experiments were problematic. I think that everyone who’s responsible for experimenting on the Web ought to read it and think about how it bears on the kinds of experiments they’re running.

Experimentation is a singularly powerful tool for refining ideas and testing the viability of features on the Web, but it’s also easy to abuse, especially in a social context. Fortunately, in the world of e-commerce, experiments are usually about making it easier to check out or testing out changes to search that hopefully make it easier for customers to find items they want to buy.

We’ve seen how a cavalier attitude toward user privacy on the part of Web companies has led to restrictions on cookies that make it more difficult to track user activity. These regulations restrict many kinds of bad behavior, but they also make it more difficult to do legitimate analysis. I worry that a cavalier attitude about the ethics of experimentation will lead to regulations in that area that make it problematic to run any kind of Web experiment.

Many people are already suspicious of any kind of data-driven approach to problem solving. I’m as cynical as anyone about industry self-regulation. While it makes sense not to publicize experiments, we should discuss the kinds of experiments we run, and the role that ethical considerations play in experiment design.

© 2024 rc3.org