At work, I’ve been experimenting with Apache Solr to see whether it’s the best choice for searching a very large data set that we need to access. The first step was to just set it up and put a little bit of data into it in order to make sure that it meets our current and anticipated future requirements. Once I’d figured that out, the next step was to start loading lots of data into Solr to see how well it performs, and to test import performance as well.
Before I could do that, though, I generated about 33 million records to import, which take up about 10 gigabytes of disk space. That’s not even 5% of the space that the full data set will take up, but it’s a start.
What I’m quickly learning is that when it comes to dealing with Big Data, knowledge of the Unix shell is a huge advantage. To give an example, I’m currently using Solr’s CSV import feature to import the test data. If we wind up using it in production, writing our own DataImportHandler will certainly be the way to go, but I’m just trying to get things done right now.
Here’s the command the documentation suggests you use to load a CSV file into Solr:
curl http://localhost:8983/solr/update/csv --data-binary @books.csv \
    -H 'Content-type:text/plain; charset=utf-8'
I quickly found out that when you tell curl to post a 10 gigabyte file to a URL, it runs out of memory, at least on my laptop.
These are the kinds of problems for which Unix provides ready solutions. I used the split command to split my single 10 gigabyte file into 33 files, each a million lines long. split helpfully named them things like xaa, xab, etcetera, all the way through xbh. You can use command line arguments to tell split to use more meaningful names.
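For reference, the invocation is about as simple as it gets; something along these lines does the job (records.csv is a stand-in for my generated file, -l sets the number of lines per chunk, and the optional prefix argument is how you get names friendlier than xaa):

split -l 1000000 records.csv
split -l 1000000 records.csv chunk_

The second form names the chunks chunk_aa, chunk_ab, and so on.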
Anyway, then I used a for loop to iterate over each of the files, using curl to submit them:
for file in x* ; do
    curl http://localhost:8983/solr/update/csv --data-binary @$file \
        -H 'Content-type:text/plain; charset=utf-8'
done
That would have worked brilliantly, except that Solr wants you to list the fields on the first row of your CSV file, so only the first file imported successfully. I wound up opening all of the others in vim* and copying the header row over rather than writing a script, proving that I need to brush up on my shell skills as well, because prepending a line to a file from the shell is easy, if not elegant.
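For what it’s worth, a rough shell sketch of that fix would only be a few lines (header.csv here is just a scratch file name I’m making up, and xaa gets skipped because it already has the header row):

head -1 xaa > header.csv
for file in x*; do
    [ "$file" = xaa ] && continue    # first chunk already has the header
    cat header.csv "$file" > "$file.tmp" && mv "$file.tmp" "$file"
done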
Once the files were updated, I used the loop above to import the data.
When it comes to working with big data sets, there are many, many tasks like these. Just being able to use pipes to make sure that your very large data files are always compressed can be a life-saver. Understanding shell scripting is the difference between accomplishing a lot in a day through automation and doing lots of manual work that makes you hate your job.
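To give one concrete (if hypothetical) example of the kind of pipe I mean, you can keep the raw export gzipped on disk and stream it straight into split without ever writing the uncompressed file out:

gunzip -c records.csv.gz | split -l 1000000 - chunk_

Here records.csv.gz is a made-up file name, the - tells split to read from standard input, and chunk_ is the output name prefix.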
* I should add that MacVim gets extra credit for opening thirty-three 252-megabyte files at once without complaining. I just typed mvim x* and up popped a MacVim window with 33 buffers. Unix is heavy duty.
November 22, 2011 at 1:44 pm
Concur. This is one of the more useful links I’ve found in this vein:
http://gregable.com/2010/09/why-you-should-know-just-little-awk.html
November 22, 2011 at 1:46 pm
I was talking to someone about awk not long ago, and we both agreed that we didn’t know anyone who learned awk after they had learned Perl. Once you know Perl, you just start using it for everything in this vein, even if awk (or sed, or basic shell scripting) would do just as well or better.
November 22, 2011 at 6:10 pm
Maybe I’m the only person who ever learned awk after perl. I work in the shell all the time these days (much of it on Somewhat Large Data, mostly extremely verbose log files). Perl does have the benefit of being incredibly fast, in addition to having stolen lots of sweet tricks from awk, sed, and bash, but I find that for ad hoc purposes (which is about 80% of how I spend my time), learning awk and sed has gone a long way toward making things easy that would have seemed intimidating even in perl.
As for your header needs, sed can do what you want very easily:
$ sed -i '1ifield1,field2,field3,field4,field5' *.csv
“i” is the command to insert a line before the current position, “1” tells it to operate only on line 1. “-i” tells it to overwrite the files in place, so beware of that! Make it “-i.bak” if you want backup files.
November 22, 2011 at 6:41 pm
Oh yeah, the best thing about awk is that its default behavior is to split lines at any and all whitespace, so if you have files with a mix of tabs and varying runs of spaces between fields (for example, formatted output like that from ls or df), awk figures all that out for you and maps the fields as you’d expect. Want a directory listing of all files bigger than 100,000 bytes?
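Something along these lines does it (in ls -l output the file size lands in the fifth column, so $5 is the size field):

ls -l | awk '$5 > 100000'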
I don’t know of a faster way to pull off that kind of filter.
October 12, 2012 at 6:19 pm
IMO awk is short for awkward.
@daveadams: ever tried to use find . -maxdepth 1 -size +100k instead of awking?