Last week one of the data analysts at work asked me to help him out with a script he was writing. The script generates a CSV file and uploads it to an FTP server. He had one file containing a sequence of SQL queries and a shell script that runs those queries and then uploads the results via FTP. I thought it would be fun to write up what it took to convert those bits of code into something that meets the definition of a production service in our environment.
The first and most obvious problem was that the script was running on the analyst’s development VM, not on a production server. Relatedly, it was running from his personal crontab. The only traces that this production service even existed were in his personal space. That seemed wrong. It also queried tables in his personal schema and had the credentials for his database account hard-coded in the script.
Fortunately, we already have a production cron server that’s hooked into our deployment system, along with a version-controlled directory of scripts and their cron schedules.
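For illustration, once the script is deployed, scheduling it comes down to an entry along these lines (the path, schedule, and alert flag here are all hypothetical):

```
# Runs the export every morning; emails the analytics team if the job fails.
30 6 * * * /usr/bin/php /var/prod/crons/CsvFtpExport.php --alert-email=analytics@example.com
```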
Relatedly, we are mostly a PHP shop, and we write cron jobs as command line PHP scripts. This may not be to your tastes (or mine), but it’s what we do. So the script needed to be ported to PHP. It also needed to extend our standard base class for crons. This provides basic features like logging to our centralized log management system, as well as conveniences like locking to prevent overlapping runs of the script and the ability to accept an email address to which to send alerts.
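To make that concrete, here’s a minimal sketch of what a job built on such a base class might look like. The class names and hooks are invented for illustration; our real base class differs in the details.

```php
<?php
// Invented stand-in for our real base cron class, which provides logging
// to the centralized log system, lock files to prevent overlapping runs,
// and failure alert email.
abstract class BaseCron
{
    abstract protected function run(): void;

    public function main(): void
    {
        // The real class acquires a lock, parses flags like --alert-email,
        // and emails an alert if run() throws.
        $this->log('lock acquired');
        $this->run();
        $this->log('lock released');
    }

    protected function log(string $message): void
    {
        fwrite(STDERR, date('c') . " $message\n");
    }
}

// The actual job only has to fill in run().
class CsvFtpExport extends BaseCron
{
    protected function run(): void
    {
        $this->log('Starting CSV export');
        // ... generate the CSV and upload it via FTP ...
        $this->log('Export complete');
    }
}

(new CsvFtpExport())->main();
```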
To get all of this working I had to rewrite the script in PHP, implementing the functionality to generate the CSV file and then send it via FTP. The SQL required to collect the data creates a couple of temporary tables and then runs a query against them. My first thought was that I would just run the queries natively from PHP through our database library, but temporary tables only last for the duration of a database session, and there’s no guarantee that successive queries will run within a single session, so the tables were disappearing before I could retrieve data from them.
Instead I had to put all of the queries into one variable and then run them through the command line client for the database using PHP’s proc_open function, piping the contents of the variable to the external process. I also switched the script to use the appropriate database credentials, which required the analyst to update the permissions for that table. Ideally, we’ll eventually move the data into a production schema.
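Here’s a rough sketch of that approach, assuming a MySQL-style command line client; the client invocation, the credentials file, and the queries are stand-ins for the real ones.

```php
<?php
// Pipe a batch of SQL through the database's CLI client via proc_open,
// so every statement (including the temporary table creation) runs in
// one client session.
$sql = <<<EOSQL
CREATE TEMPORARY TABLE tmp_summary AS
    SELECT account_id, SUM(amount) AS total FROM staging.payments GROUP BY account_id;
SELECT account_id, total FROM tmp_summary ORDER BY total DESC;
EOSQL;

$descriptors = [
    0 => ['pipe', 'r'],   // stdin: we write the SQL here
    1 => ['pipe', 'w'],   // stdout: tab-separated results come back here
    2 => ['pipe', 'w'],   // stderr: errors from the client
];

$proc = proc_open(
    'mysql --batch --defaults-extra-file=/etc/reports/creds.cnf reportdb',
    $descriptors,
    $pipes
);
if (!is_resource($proc)) {
    throw new RuntimeException('Could not start the database client');
}

fwrite($pipes[0], $sql);
fclose($pipes[0]);                        // EOF tells the client to execute

$output = stream_get_contents($pipes[1]);
$errors = stream_get_contents($pipes[2]);
fclose($pipes[1]);
fclose($pipes[2]);

if (proc_close($proc) !== 0) {
    throw new RuntimeException("Query failed: $errors");
}
// $output now holds the rows to write out as CSV.
```

Writing the SQL to stdin and closing the pipe guarantees that every statement runs in a single session, so the temporary tables survive until the final SELECT.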
At that point I had a script that worked, but it had no error handling and it wasn’t a subclass of the base cron script we use. Adapting it to use the base cron script was pretty straightforward. Error handling for these types of scripts is a bit more complex. I opted for one check to verify that the CSV file was created successfully, and then to catch any errors that occurred during the FTP transfer and send an alert. Fortunately, the base cron script makes it easy to send email when failures occur, so I didn’t have to write that part.
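In sketch form, the two checks look something like this; the alert() helper stands in for the base class’s email alerting, and the hostnames, credentials, and paths are made up.

```php
<?php
// Sketch of the two failure checks: did the CSV get written, and did the
// FTP transfer succeed?
function alert(string $message): void
{
    // In the real script this goes through the base class's alert email.
    fwrite(STDERR, $message . "\n");
    exit(1);
}

$csvPath = '/tmp/export.csv';
if (!file_exists($csvPath) || filesize($csvPath) === 0) {
    alert("CSV file was not generated: $csvPath");
}

$conn = ftp_connect('ftp.example.com');
if ($conn === false || !ftp_login($conn, 'reportuser', getenv('FTP_PASSWORD') ?: '')) {
    alert('Could not log in to the FTP server');
}
ftp_pasv($conn, true);   // passive mode is friendlier to firewalls

if (!ftp_put($conn, 'incoming/export.csv', $csvPath, FTP_ASCII)) {
    alert('FTP upload failed');
}
ftp_close($conn);
```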
Finally, I just had to pick a time for the script to run, add the crontab entry, and then push the script through our deployment system. Or at least that was the idea. For whatever reason, the script works when I run it manually but does not appear to be running through cron, so I’m running it by hand every day for now. I also realized that if the script runs before the big data job that generates its data finishes, or if that job fails for any reason, then the output of the script will be wrong. That means I need another layer of error handling to detect problems with the big data job and send an alert rather than uploading invalid data.
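That guard might look something like this; the table and column names are invented, and run_query() is the proc_open wrapper sketched above, stubbed here so the sketch stands alone.

```php
<?php
// Hypothetical freshness guard: before exporting, confirm the upstream
// big data job actually produced today's rows.
function run_query(string $sql): string
{
    // In the real script this pipes $sql through the CLI client.
    return '0';
}

$count = (int) trim(run_query(
    'SELECT COUNT(*) FROM analytics.daily_summary WHERE report_date = CURRENT_DATE'
));
if ($count === 0) {
    fwrite(STDERR, "Upstream job has not produced today's data; skipping upload\n");
    exit(1);   // the real script would send an alert via the base class
}
```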
Why write this up? It’s to point out that for most projects, getting something to work is just a small, small part of building a production service. Exporting a CSV file from a database query and uploading it to an FTP server takes just a few minutes. Converting that into a service that runs within the standard infrastructure and handles failure conditions gracefully takes hours.
There are a few takeaways here. The first is that anything we can do to make it easier to build production services is almost certainly worth the investment. Having a proper cron base script was really helpful. I’m creating a subclass of that base class that’s designed just for these specific kinds of jobs, to make this work easier next time.
The second is that everyone involved in a project needs to acknowledge that getting something working is just the beginning, not the end. The work of making a service production-ready isn’t fun or glamorous, but it’s what separates the hacker from the software engineer. Managers need to account for the time it takes to get something ready for production when allocating resources. And everybody needs to be smart about figuring out the level of reliability any given service needs. If a service is going to run indefinitely, you need to be certain that it will work even when the person who wrote it goes on vacation.
The third is that at any company, people are building services like this all the time outside the production context. You usually find out things went wrong with them at the worst possible time.
The argument for the Java ecosystem
There’s a lot to hate about Cade Metz’s article on server-side Java from last month’s Wired magazine. For one thing, Java has been more successful on the server than on the client for at least 15 years. The fact that you can use it to build large-scale Web applications is not news, nor is the fact that the JVM is a great host for languages other than Java.
That said, the tradeoffs that go into choosing a software development platform are interesting. The Wired article is mostly concerned with scaling at levels that are unrealistic for nearly anyone to plan for. And indeed, the platform is only a small part of the scaling discussion. I’m sure that most of the software written for Healthcare.gov was written in Java, and it didn’t solve any of the problems that are endemic to that sort of project (about which more some other time).
In the early stages, any Web company should focus almost solely on developer productivity. The hard part is building software people want to use, and iterating rapidly on your ideas. (I think everybody already knows this.) You’ll probably have rewritten everything by the time you even start to approach scaling issues of the kind faced by companies like Pinterest or Twitter, to say nothing of the really big companies like Facebook and Google. Just choose whatever helps you build software as efficiently as possible.
If it were up to me, though, I’d build software using a JVM-based language, for reasons that the Wired article doesn’t really get into. The fact that Java and the JVM are heavily used at big companies like Google and a lot of others that are significantly more boring means that the platform will continue to grow and evolve. In that sense, it’s an incredibly safe choice.
Java also has an incredibly robust open source ecosystem, and mature libraries for almost everything. Non-Java languages that run on the JVM can make use of these libraries as well. One of the first things Java developers notice when they move to non-Java platforms is that the key libraries they take for granted in the Java ecosystem tend to be immature or missing entirely.
Another key advantage is that the world of Big Data is built to a huge degree on Java. Hadoop is written in Java, as are most Hadoop jobs. Storm, Hadoop’s realtime processing cousin, runs on the JVM as well. If you’re already writing your software in Java or some other JVM-based language, you can transfer those skills to writing Hadoop jobs. More importantly, you can share library code between your standard software and your Big Data jobs, which can be a huge advantage. Again, in the early stages of a company, your “Big Data” can easily be processed using R or Python and you won’t need Hadoop at all, but it’s likely you will eventually.
Finally, all businesses eventually have to integrate with “enterprise” software to some degree. Maybe you’re experimenting with the Community Edition of Vertica for data warehousing, or your HR department has purchased some special software for managing benefits. The bottom line is that every enterprise software package supports Java (and Microsoft’s platform) first; in many cases, those are the only options for integration. If you’re not on one of those platforms, you’re stuck with awful options like the Unix ODBC implementations. This isn’t reason enough to choose Java on its own, but it’s something you get for free.
Every company of significant size is eventually going to find its way to supporting some JVM-based software. Maybe they’re running Hadoop, or they use Solr for search, or Cassandra as a data store. Why resist?
There are a lot of other advantages to choosing the JVM. There are lots of developers out there who know Java. The Java development tools are very mature. Write once/run anywhere was always overstated, but it’s still very easy to write your code on OS X or Windows and then deploy and run it on Unix servers without any tweaks or even a new build. Because Java is so heavily used by big companies, API stability over time is valued, so you don’t often have to do much work to upgrade to newer versions of the JVM when they arrive. Indeed, I’ve worked on projects where my development JVM differed from the production JVM and never ran into problems.
There are great arguments to be made for all of the other popular platforms. I’ve built software using PHP, Ruby on Rails, Python and Django, and plenty of other languages and libraries, and they all have real strengths. However, I think that in most cases the advantages of betting on the JVM outweigh those of the alternatives.