The long journey toward production

Last week one of the data analysts at work asked me to help him out with a script he was writing. The script generates a CSV file and uploads it to an FTP server. He had one file containing a sequence of SQL queries, and a shell script that runs those queries and then uploads the results via FTP. I thought it would be fun to write up what it took to convert those bits of code into something that meets the definition of a production service in our environment.

The first, most obvious problem was that the script was running on the analyst’s development VM, not on a production server. Relatedly, it was running from his personal crontab. The only traces that this production service even existed were in his personal space. That seemed wrong. It also queried tables in his personal schema and had the credentials for his database account hard-coded in the script.

Fortunately, we already have a production cron server that’s hooked into our deployment system, along with a version-controlled directory for the scripts that run as cron jobs.

Relatedly, we are mostly a PHP shop, and we write cron jobs as command-line PHP scripts. This may not be to your taste (or mine), but it’s what we do. So the script needed to be ported to PHP. It also needed to extend our standard base class for cron jobs. That class provides basic features like logging to our centralized log management system, as well as conveniences like locking to prevent overlapping runs of the script and the ability to accept an email address to which alerts are sent.
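For readers who have not seen this kind of locking, here is a minimal sketch of the idea. It is written in Python rather than PHP and is not the base class described above, whose implementation is not shown in this post. It assumes a POSIX system and an advisory file lock; the function name and the lock path are made up for illustration.

```python
import fcntl


def acquire_run_lock(lock_path):
    """Return an open, locked file handle, or None if another run holds the lock.

    Hypothetical helper: the name and the lock-file approach are assumptions,
    not the behavior of any particular cron base class.
    """
    handle = open(lock_path, "w")
    try:
        # LOCK_NB makes the call fail immediately instead of waiting.
        fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        handle.close()
        return None
    # The caller must keep the handle open; closing it releases the lock.
    return handle
```

A job would call this once at startup and exit quietly if it gets `None` back, since that means the previous run is still going.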

To get all of this working I had to rewrite the script in PHP, implementing the functionality to generate the CSV file and then send it via FTP. The SQL queries required to collect the data create a couple of temporary tables and then run a query against those tables. My first thought was that I would just run the queries natively from PHP through our database library, but temporary tables only last for the duration of a database session, and there’s no guarantee that the queries will be run within the context of a single session, so the tables were disappearing before I could retrieve data from them.

Instead I had to put all of the queries into one variable and then run them through the command line client for the database using PHP’s proc_open function, piping the contents of the variable to the external process. I also switched things up to use the appropriate database credentials, which required the analyst to update the permissions for that table. Ideally, we’ll eventually change things up so that the data is stored in a production schema.
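The shape of that workaround can be sketched as follows. This is Python, with `subprocess.run` standing in for PHP's proc_open, and it is an illustration under assumptions rather than the script itself: the function name, the timeout, and the way failures are reported are all hypothetical, and the client command is whatever your database's command-line tool happens to be.

```python
import subprocess


def run_sql_script(client_cmd, sql_script, timeout=300):
    """Pipe a whole SQL script to a command-line client over stdin.

    Everything runs inside one client process, and so one database session,
    which is why temporary tables created early in the script are still
    there when the final query runs. (Hypothetical helper; names assumed.)
    """
    result = subprocess.run(
        client_cmd,           # e.g. a list such as [client, option, ...]
        input=sql_script,     # the queries, held in a single variable
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    if result.returncode != 0:
        raise RuntimeError(
            f"client exited with {result.returncode}: {result.stderr.strip()}"
        )
    return result.stdout
```

Passing the command as a list avoids going through a shell, so nothing in the SQL or the credentials needs shell quoting. The returned text is whatever the client printed, which still has to be turned into CSV.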

At that point, I had a script that would work, but it didn’t have any error handling and it wasn’t a subclass of the base cron script we use. Adapting it to use the base cron script was pretty straightforward. Error handling for these types of scripts is a bit more complex. I opted to do one check to see whether the CSV file was created successfully, and then to catch any errors that occurred during the FTP upload and send an alert. Fortunately, the base cron script makes it easy to send email when failures occur, so I didn’t have to write that part.
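Those two checks look roughly like this. The sketch is in Python with the standard `ftplib` module rather than PHP, and the function names, the `alert` callback, and the decision to treat an empty file as a failure are assumptions made for illustration. In the real script, alerting is handled by the base cron class.

```python
import ftplib
import os


def ftp_upload(csv_path, host, user, password):
    """Upload one file over FTP using the standard library client."""
    with ftplib.FTP(host, timeout=60) as ftp:
        ftp.login(user, password)
        with open(csv_path, "rb") as fh:
            ftp.storbinary(f"STOR {os.path.basename(csv_path)}", fh)


def publish_report(csv_path, upload, alert):
    """Check the CSV, upload it, and alert on failure. Returns True on success.

    `upload` and `alert` are callables supplied by the caller (hypothetical
    stand-ins for the FTP step and the base class's email alert).
    """
    # Check 1: was the CSV actually produced? (Empty counts as failure here.)
    if not os.path.isfile(csv_path) or os.path.getsize(csv_path) == 0:
        alert(f"CSV missing or empty: {csv_path}")
        return False
    # Check 2: catch anything the FTP step can raise and alert instead.
    try:
        upload(csv_path)
    except ftplib.all_errors as exc:
        alert(f"FTP upload failed for {csv_path}: {exc}")
        return False
    return True
```

Taking the upload and alert steps as parameters keeps the checks testable without a live FTP server or mail setup.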

Finally, I just had to pick a time for the script to run, add the crontab entry, and then push the script through our deployment system. Or at least that was the idea. For whatever reason, the script works when I run it manually but it does not appear to be running through cron, so I’m running it manually every day for now. I also realized that if the script runs before the big data job that generates its data finishes, or if that job fails for any reason, then the output of the script will be wrong. That means I need another layer of error handling to detect problems with the big data job and send an alert rather than uploading invalid data.
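One possible shape for that extra check, sketched in Python rather than PHP. It rests on a loud assumption that nothing in this post confirms: that the upstream job can touch a marker file when it finishes successfully. The function name, the marker file, and the age threshold are all hypothetical.

```python
import os
import time


def upstream_is_fresh(marker_path, max_age_seconds, now=None):
    """Return True if the upstream job's completion marker exists and is recent.

    Assumes (hypothetically) that the upstream job writes or touches
    `marker_path` only after it completes successfully.
    """
    now = time.time() if now is None else now
    try:
        finished_at = os.path.getmtime(marker_path)
    except OSError:
        # No marker at all: the job has not finished, or it failed.
        return False
    # A stale marker means today's run has not completed yet.
    return (now - finished_at) <= max_age_seconds
```

The export would run this first and, on `False`, send an alert and skip the upload instead of publishing stale or partial data.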

Why write this up? It’s to point out that for most projects, getting something to work is just a small, small part of building a production service. Exporting a CSV file from a database query and uploading it to an FTP server takes just a few minutes. Converting that into a service that runs within the standard infrastructure and handles failure conditions smoothly takes hours.

There are a few takeaways here. The first is that anything we can do to make it easier to build production services is almost certainly worth the investment. Having a proper cron base script was really helpful. I’m creating a subclass of that base class that’s designed just for these specific kinds of jobs to make this work easier next time.

The second is that everyone involved in a project needs to acknowledge that getting something working is just the beginning, not the end. The work of making a service production-ready isn’t fun or glamorous, but it’s what separates the hacker from the software engineer. Managers need to account for the time it takes to get something ready for production when allocating resources. And everybody needs to be smart about figuring out the level of reliability any service needs. If a service is going to run indefinitely, you need to be certain that it will work when the person who wrote it goes on vacation.

The third is that at any company, people are building services like this all the time outside the production context. You usually find out things went wrong with them at the worst possible time.

3 Comments

John
March 31, 2013 at 9:08 pm

Great article! This is one thing that many newer engineers wildly underestimate.

Chris Adams
April 1, 2013 at 9:34 am

Great post – and something I wish was as widely discussed as language or platform minutiae.

Your last point touches on something too few senior managers truly grasp: lack of resources to do things The Right Way, however that is defined locally, means that people will do what they need, not that they’ll wait for official support. The real skill is keeping an eye on this and figuring out how to make engineering resources available as efficiently as possible to clean up successful small projects.

daveadams
April 1, 2013 at 11:27 am

Great writeup, Rafe. Hoping for more of these in the future!

“For whatever reason, the script works when I run it manually but it does not appear to be running through cron, so I’m running it manually every day for now.”

Almost certainly this is an environment variable issue. I’ve run into this with PATH, LD_LIBRARY_PATH, and any number of other things. My own bash cronjob-wrapper script (for centralizing logging, notifications, etc, similar to your parent PHP class) runs a subset of the user environment script explicitly to avoid this kind of thing.

Simplest way to find out would be to put a dump of $_ENV into your code (or just a throwaway script) and compare the output when run manually and when run via your production cron environment.

rc3.org

Strong opinions, weakly held
