rc3.org

Strong opinions, weakly held

Author: Rafe

Big Data and analytics link roundup

Here are a few things that have caught my eye lately from the world of Big Data and analytics.

Back in September, I explained why Web developers should care about analytics. This week I noticed a job opening for a Web developer at Grist that includes knowledge of analytics in the list of requirements. That doesn’t exactly make for a trend, but I expect to see a lot more of this going forward.

Also worth noting are the two data-related job openings at Rent the Runway. They have an opening for a data engineer and one for a data scientist. These two jobs are frequently conflated, and there is some overlap in the skill sets, but they’re not the same thing. For the most part what I do is data engineering, not data science.

If you do want to get started in data science, you could do worse than to read Hilary Mason’s short guide. Seth Brown has posted an excellent guide to basic data exploration in the Unix shell. I do this kind of stuff all the time.

Here are a couple of contrary takes on Big Data. In the New York Times, Steve Lohr has a trend piece on Big Data, “Sure, Big Data Is Great. But So Is Intuition.” Maybe it’s different on Wall Street, but I don’t see too many people divorcing Big Data from intuition. Usually intuition leads us to ask a question, and then we try to answer that question using quantitative analysis. That’s Big Data to me. For a more technical take on the same subject, see “Data-driven science is a failure of imagination” from Petr Keil.

On a lighter note, Sean Taylor writes about the Statistics Software Signal.

Vim and ctags

I’ve been living in Vim for the past year or so, and doing so has yielded great rewards in terms of productivity and one decent post on the grammar of Vim. I’ve tried to be miserly with Vim plugins. There are tons of great plugins for Vim, but I feel like it’s more important to master the built-in features of the editor before I start adding new ones. For example, I’m still not a proficient user of registers, and I still miss opportunities to perform operations using the inside and around grammar.

Ctags is a really powerful external tool that I don’t feel violates my self-imposed plugin moratorium. It creates an index of your code so that you can navigate within and between files based on identifiers in the code. If you’re accustomed to using an IDE like Eclipse, this is functionality you come to take for granted. In the Vim world, you need to use Ctags to get it. I’ve gotten by all year (and for the past couple of decades of vi usage) without bothering with Ctags, but searching for a method definition with Ack for the billionth time finally led me into getting serious about setting up Ctags for my projects.

Getting started with Ctags is easy. Once you’ve installed the Ctags package, just go to the directory with your source code and run Ctags like this:

ctags *.rb

This produces tags for all of the Ruby files in the current directory, helpfully storing them in a file named tags. It’s a plain ASCII file, so you can view it if you want to. Once the tags file exists, you can open Vim (or another editor that supports Ctags) and use the tags to navigate around. You can use :tag to jump to a specific tag, or Control-] to jump to whatever tag is under the cursor. Once you’ve jumped to a tag, you can jump back to the previous location with Control-T. There’s a Vim tip that explains the tag-related commands.
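Because the tags file is plain text, ordinary shell tools work on it too. Here is a minimal sketch, assuming the usual Ctags layout of one tag per line with tab-separated fields (tag name, file name, then a search pattern or line number). The helper name tag_files is made up for this example and is not part of Ctags.

```sh
# tag_files NAME [TAGSFILE]: print the file(s) in which NAME is defined.
# Assumes tab-separated fields: tag name, file name, address.
# TAGSFILE defaults to ./tags.
tag_files() {
  awk -F '\t' -v name="$1" '$1 == name { print $2 }' "${2:-tags}"
}
```

Run from a directory that contains a tags file, tag_files UserHistory would print the name of the file that defines UserHistory, which is the same lookup Vim performs before it jumps.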

The static nature of tag files means that some customization of your environment is necessary to get things working smoothly. As soon as you start changing your files or you pull in updates from version control, the tags are out of date, so you need a way to automatically keep them up to date. You also need to configure Vim so that it can look up tags for files in other directories.

Let’s talk about the second challenge first, because you need to figure out how you’re going to solve it before you can solve the first problem. You can generate tags recursively using the -R flag with ctags. To generate the tags for all of the files in the current directory and its subdirectories, you can run:

ctags -R .

This may seem like a good idea, but there are some issues that aren’t worth going into in this blog post.

Another option is to put a tags file in each directory, and I prefer it for two reasons. One is that having tags in each directory facilitates keeping your tags up to date automatically. I’ll discuss that shortly. The other relates to how Vim locates the tags in other directories. You can configure Vim with the locations to search for tag files using the tags setting. By default, it looks like this:

tags=./tags,./TAGS,tags,TAGS

Vim keeps track of two “current” directories, which may be the same. The first is Vim’s own working directory, which starts out as the current directory of the shell used to open Vim. You can find out what it is by using the :pwd command. The second is the directory of the file in the current buffer. You can show the full path to the current buffer with the command :echo expand('%:p'). This configuration indicates that Vim should first check the tags file in the directory of the file in the current buffer (the entries that start with ./), and then check the tags file in its own current directory (the bare tags and TAGS entries).

This works fine for simple projects where all of the files are in the same directory. In many cases, though, projects span a nested directory structure and you may want to look up a tag in one directory that’s in a source file in another directory. Here’s my tags setting:

set tags=./tags,tags;

I got rid of the capitalized tag file names because I don’t roll that way. I also added a trailing semicolon to the setting. According to Vim’s file-searching documentation, a trailing semicolon makes Vim search upward: it looks for a tags file in its current directory and then in each parent directory in turn. (Descending into subdirectories is what the ** wildcard is for.) The tag search path can be broadened as much as you like, but that’s sufficient for me. The only catch is that I have to open Vim from my project’s home directory if I want to be able to search the tags for the whole project.

This scheme works perfectly with the “one tags file per directory” approach. Now the trick is to generate all of the files and keep them up to date. There are a number of approaches you can take, helpfully listed in the Exuberant Ctags FAQ. I’m using strategy #3, because it’s the one the author of Ctags recommends and I don’t know anything he doesn’t.

I have a script called projtags that generates tag files in every directory under the current directory. When I want to use ctags for a project, I switch to the project directory and run this script. You can find it in this Gist.
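This is not the code from the Gist, just a minimal sketch of the idea, assuming Ctags is on your PATH and that .git directories should be skipped. The CTAGS variable is a hypothetical hook I added so the command can be swapped out.

```sh
# projtags sketch: write one tags file into every directory under a root.
# CTAGS is a hypothetical override; by default the ctags on your PATH is used.
projtags() (
  root=${1:-.}
  find "$root" -type d ! -path '*/.git' ! -path '*/.git/*' |
  while IFS= read -r dir; do
    # Run ctags inside each directory so the entries hold bare file names,
    # which is what a per-directory ./tags lookup expects.
    ( cd "$dir" && ${CTAGS:-ctags} -f tags * )
  done
)
```

Calling projtags with no argument from the project’s home directory covers the whole tree. Directories with no source files will make Ctags complain about the unmatched *, which is harmless for a sketch.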

To update the tags for a file when I save a file in Vim, I use an autocommand in my Vim configuration. The source for that is in another Gist that you can copy. The function updates the tags whenever you save a file and there’s a tags file in the directory of the file being saved. This prevents new tag files from being created in random directories that aren’t part of projects. The function deletes the tags for the file being saved from the tags file using sed and then uses ctags -a to append fresh tags for that file. This is faster than generating tags for all of the files in the directory. You can just paste the contents of the Gist into your .vimrc file.
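The delete-then-append step can be sketched in shell. This is an illustration under stated assumptions, not the Gist’s function: it uses awk rather than sed so the file name does not need regex escaping, it assumes the tab-separated tags layout with the file name in the second field, and both retag_file and the CTAGS override are names made up for the example.

```sh
# retag_file PATH: refresh the entries for one source file in its directory's tags file.
# Does nothing when that directory has no tags file yet.
# CTAGS is a hypothetical override; by default the ctags on your PATH is used.
retag_file() (
  dir=$(dirname "$1")
  base=$(basename "$1")
  cd "$dir" || exit 1
  [ -f tags ] || exit 0
  # Field 2 of each tags line is the file name; keep only lines for other files.
  awk -F '\t' -v f="$base" '$2 != f' tags > tags.tmp && mv tags.tmp tags
  # Append fresh entries for just this file.
  ${CTAGS:-ctags} -a -f tags "$base"
)
```

The early exit when no tags file exists is what keeps stray tags files from appearing in directories that are not part of a project.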

I also want to update my tags whenever I pull in other people’s changes from version control. I could just run my projtags script when I pull new files, but for one of our projects, it takes about 40 seconds to run. Too slow. Instead, I have a script called updatetags that finds all of the directories where the tags file is not the newest file in the directory and regenerates the tags file for those directories. It also generates tags in directories that were added since the last run. (It’s in a Gist as well.)
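Here is a minimal sketch of that staleness check, again my own illustration rather than the Gist’s code. It assumes a find that supports -maxdepth (GNU and BSD find both do), treats a directory with no tags file as stale, and skips .git. The function names and the CTAGS override are hypothetical.

```sh
# stale_tag_dirs [ROOT]: list directories whose tags file is missing or is
# older than some other file sitting beside it.
stale_tag_dirs() (
  find "${1:-.}" -type d ! -path '*/.git' ! -path '*/.git/*' |
  while IFS= read -r dir; do
    if [ ! -f "$dir/tags" ] ||
       [ -n "$(find "$dir" -maxdepth 1 -type f -newer "$dir/tags" | head -n 1)" ]; then
      printf '%s\n' "$dir"
    fi
  done
)

# updatetags sketch: regenerate tags only in the stale directories.
# CTAGS is a hypothetical override; by default the ctags on your PATH is used.
updatetags() (
  stale_tag_dirs "${1:-.}" | while IFS= read -r dir; do
    ( cd "$dir" && ${CTAGS:-ctags} -f tags * )
  done
)
```

Saved as a script, the file would end with a bare updatetags call so that it works on whatever directory it is run from.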

The final step is invoking the script. There are a lot of ways to do so, but I use Git, and I want the script to run automatically after I pull in code from remote repositories. To cover all cases, I run the following commands (from the home directory of the repository):

ln -s $SCRIPT_DIR/updatetags .git/hooks/post-checkout
ln -s $SCRIPT_DIR/updatetags .git/hooks/post-merge

The $SCRIPT_DIR variable is just a placeholder for the actual directory where updatetags lives.

I should add that one special bonus when you have your tags set up properly is that you can open a tag from the command line rather than a file, using the -t flag. So if you want to open a class named UserHistory you can just type:

vim -t UserHistory

I immediately found this to be fantastically efficient.

This system of managing tag files may be grossly inefficient. If you have a better way of managing your tags, I’d love to hear about it in the comments.

For more information:

G. K. Chesterton on software development

John D. Cook blogs a great quotation from G. K. Chesterton that advises caution before removing something. In essence, he challenges people who would remove something unnecessary to go and figure out why it was erected in the first place. Needless to say, this is an issue we deal with a lot in writing software.

In the comments of the post, an interesting debate plays out about the role of code, automated tests, and documentation in helping people figure out why the code that they want to delete was originally written.

Written properly, your code should be self-documenting in terms of its basic workings. Donald Knuth’s literate programming takes this approach to an extreme, but we can move a long way toward it just by naming things well and structuring our programs to emphasize readability. Comments still have their place in areas where efficiency trumps readability, but in general we should prefer writing our code so that it’s comprehensible without them.

Unit tests verify that code is functioning properly. They have some use as documentation, but their purpose is to enable you to refactor your code with the assurance that you haven’t broken anything as long as the tests still pass.

Documentation in comments or elsewhere should be the story of why the code works the way it does. Why was a particular algorithm chosen? What compromises did you have to make for performance? What’s likely to fail under strain? These are the sorts of questions that are difficult to answer in the code itself, but are important to anyone who’s expected to maintain the code in the future.

For another take on learning about why things are as they are before making changes, check out my post from last February on getting off to a successful start at a new job. My advice to myself served me pretty well in 2012.

More on functional programming

Yesterday I linked to Uncle Bob’s intro to functional programming. There are some interesting reactions floating around today. Tim Bray agrees that FP is important, but doesn’t like the magical example or the fact that it’s written in Lisp. The reliably cranky Cédric Beust pushes back on the article with vigor, mainly on the point that functional programming is the “next big thing” in software engineering.

For what it’s worth, I think that functional programming is worth learning more about because it will make you a better programmer in whatever language you use, not because one day you’ll be using FP and abandoning the use of mutable variables.

Getting started with functional programming

I fully intend to write a post talking about stuff I learned in 2012, but in truth, I’ll probably never get around to it. The blog suffered last year because I was so busy stuffing new stuff into my head that I didn’t have the energy to write much of it down.

One of the big things I learned was that while I’ve programmed in a lot of languages, they all came from the same family and I used them all in the same way. That left a huge gaping hole in my experience called “functional programming.” There’s another hole involving functions as first class objects that I’m trying to fill up as well.

If you know nothing about functional programming, Uncle Bob has a short and useful introduction that’s worth reading. If you want to master the concepts, I recommend The Little Schemer.

I still don’t do much functional programming in my day-to-day life beyond the occasional bit of Scala hacking, but I find that functional concepts make it really easy to break down certain kinds of problems regardless of which language I’m using. For example, it’s really easy to write a binary search implementation using a functional approach.
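To make the binary search point concrete, here is a minimal sketch of a recursive version. It is written in plain shell only to match the other examples on this page, and it assumes a sorted list of integers passed as arguments. The names nth and bsearch are made up for the example. The functional part is that nothing is ever reassigned: each step simply calls the function again on a narrower range.

```sh
# nth INDEX VALUES...: print the zero-based INDEX-th value.
nth() ( shift "$(( $1 + 1 ))"; printf '%s\n' "$1" )

# bsearch TARGET LO HI VALUES...: print the index of TARGET within the sorted
# integers VALUES[LO..HI], or -1 if it is absent.
bsearch() (
  target=$1 lo=$2 hi=$3
  shift 3
  # An empty range means the target is not in the list.
  [ "$lo" -le "$hi" ] || { echo -1; exit 0; }
  mid=$(( (lo + hi) / 2 ))
  value=$(nth "$mid" "$@")
  if [ "$value" -eq "$target" ]; then
    echo "$mid"
  elif [ "$value" -lt "$target" ]; then
    bsearch "$target" "$(( mid + 1 ))" "$hi" "$@"   # search the upper half
  else
    bsearch "$target" "$lo" "$(( mid - 1 ))" "$@"   # search the lower half
  fi
)
```

For the five values 1 3 5 7 9, the call bsearch 7 0 4 1 3 5 7 9 prints 3, and searching for 4 prints -1.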

In the larger scheme of things, I was able to get away with ignoring functional programming for a long time, but I don’t think that’s possible any more. Not only are functional languages picking up steam, but functional techniques are everywhere these days if you know where to look for them.

How big a Web business can you build with no engineers?

I was reading Andrew Sullivan’s announcement that he is going independent and starting his own company to publish his blog, and what interested me most is the employee list. He has editors and interns, and that’s it. No designer. No engineers.

My understanding is that his blog has been published using Movable Type forever, and probably will continue to be published using Movable Type. Maybe that’s incorrect, but I assume it’s some canned publishing tool. He’s going to be using TinyPass to collect subscription fees, and there won’t be any advertising so he doesn’t need advertising technology.

What I wondered was, what’s the biggest Web-based business that has no in-house engineering resources? There are merchants on Amazon, eBay sellers, and Etsy sellers that have pretty big businesses. There are also existing blogs that don’t appear to have engineering resources that do well, like Daring Fireball and The Wirecutter. There are certainly other types of businesses as well, like musicians and writers who offer their products only through digital downloads, or people who produce videos and make their money by publishing them on YouTube.

This is another aspect of the Web as an industry. You can build a big Web-based business without actually digging into any of the details of how to build things on the Web.

Update: Here’s a post from Andre Torrez on a related theme.

Setting Apache MaxClients for small servers

Every once in a while, the server this blog runs on chews up all of its RAM and swap space and becomes unresponsive, forcing a hard reboot. The problem is always the same: too many Apache workers running at the same time. It happened this morning and there were well over 50 Apache workers running, each consuming about 15 megs of RAM. The server (a virtual machine provided by Linode) has 512 megs of RAM, and 50 workers at 15 megs apiece comes to about 750 megs, so Apache is consuming all of the VM’s memory on its own.

At first I decided to attack the problem through monitoring. I had Monit running on the VM but it wasn’t actually monitoring anything. I figured that I’d just have it monitor Apache and restart it whenever it starts consuming too many resources. I did set that up, but I wondered how Apache was able to get itself into such a state in the first place.

The problem was that Apache was configured very poorly for my VM. Because I’m running PHP apps with the PHP module, I’m running Apache using the prefork module. For more information on Apache’s Multi-Processing Modules, check out the docs. Basically, prefork doesn’t use threads, so you don’t have to make sure your applications and libraries are thread-safe.

Anyway, here are the default settings for Apache in Ubuntu when it comes to resource limits:

StartServers          5
MinSpareServers       5
MaxSpareServers      10
MaxClients          150
MaxRequestsPerChild   0

In prefork mode, Apache can handle one incoming request per process. So in this case, when Apache starts, it starts five worker processes. It also tries to keep at least five spare servers idle for incoming demand. If it has more than ten idle servers, it starts shutting down processes until the number of idle servers is back down to ten. Finally, MaxClients is the hard limit on the number of workers Apache is allowed to start. So on my little VM, Apache feels free to start up to 150 workers, at 15 megs of RAM apiece, using up to 2.25 gigabytes of RAM, which is more than enough to consume all of the machine’s RAM and swap space.
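The arithmetic is simple enough to keep around as a pair of shell helpers. This is a back-of-the-envelope sketch: the 15 megs per worker is just the figure I observed on this server, the memory budget is whatever you decide Apache may use, and both function names are made up for the example.

```sh
# worst_case_mb WORKERS MB_PER_WORKER: memory used if every worker is running.
worst_case_mb() { echo "$(( $1 * $2 ))"; }

# workers_that_fit BUDGET_MB MB_PER_WORKER: largest worker count within a budget.
workers_that_fit() { echo "$(( $1 / $2 ))"; }
```

Here worst_case_mb 150 15 prints 2250, the 2.25 gigabytes mentioned above. Going the other way, workers_that_fit 512 15 prints 34, so even a VM that gave Apache every last meg of its 512 could not hold a quarter of the default limit.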

This number is far, far, far too high for my machine. I had to do this once before but when I migrated from Slicehost to Linode some time ago, I forgot to manually change the Apache settings. I wound up setting my machine to a relatively conservative MaxClients setting of 8. I’m still tweaking the other settings, but for a server that’s dedicated to Web hosting, you may as well set the StartServers setting to the same as the MaxClients setting so that it never has to bother spinning up new server processes to meet increasing demand.

Currently my configuration looks like this:

StartServers          8
MinSpareServers       1
MaxSpareServers       8
MaxClients            8
MaxRequestsPerChild   0

The only danger with this low setting is that if there are more than 8 simultaneous incoming requests, the additional requests will wait until a worker becomes available, which could make the site really slow for users. Right now I only have about 60 megs of free RAM, though, so to increase capacity I’d need to either get a larger VM, move my static resources to S3, or set up a reverse proxy like Varnish and serve static resources that way.

Garann Means on JavaScript templates

using javascript templates

Garann Means explains the whys and wherefores of JavaScript templates. Reading this article reminds me that my front end development skills are so ten years ago.

The Web as an industry

Andre Torrez makes the observation that making things on the Internet isn’t for enthusiastic amateurs any more:

I think the thing that is eating industries: newspapers, music, movies, second rate mobile phone manufacturers…it’s eating us too. Being literate in tech isn’t enough anymore. As Robin said above, knowing how to put up a web page or write a little web app is fine for a niche hobby or an amateur pursuit, but if you want things to look good and work and be something more than a semi-broken thing you have to invest a real amount time and thought.

When I got started working on the Web, you could find gainful employment simply by being a person who really liked messing around with computers. I’m not talking about jobs making Web pages for local businesses, either, but for big businesses. Those days are definitely gone.

Back when I started, there were only “full stack” developers. Everybody did pretty much everything. Then the industry evolved to have “designers” and “engineers.” Things have become more specialized since. Now it’s not uncommon to find people developing software at Web companies who don’t know HTML at all. I would never have predicted that ten years ago.

Tony Horwitz on the gun lobby’s historical forebear

The NRA and the “Positive Good” of Maximum Guns

As you know, I’m fascinated by historical patterns, so I pretty much had to link to this guest post by historian Tony Horwitz at Ta-Nehisi Coates’ blog. Comparing the “gun power” to the “slave power” is inflammatory, but the parallels are unmistakable. In fact, this pattern can be seen in most movements that fear their prerogatives being rolled back incrementally. Simply holding your ground is never enough.
