After the server crash I mentioned the other day, I solicited suggestions on how to keep my server from running into the same problems again. Monit was one package that several people suggested, so I decided to give it a try. Monit is a monitoring package, written in C, that can be run as a daemon on a server. It has facilities for monitoring the entire server, specific applications, files and directories, and network services. For example, you can configure it to alert you when your server’s CPU utilization goes above 80%, or when your server’s memory usage stays above 75% for five minutes, or when a specific file or directory is missing or has changed.

At the application level, it can automatically restart services that go down, or restart them if they start behaving incorrectly. For example, you can tell it to restart Apache if it spawns too many processes or starts using too much memory. Needless to say, configuring Monit is pretty complex, but it provides one of the more readable configuration file formats I’ve seen, and there are recipes available for most common applications.

There’s a whole lot more flexibility available as well. Monit runs at intervals (I have it set up to run its checks every two minutes), and you can tell it to only alert you if a check fails a certain number of times in a row, or even if it fails two out of three times or three out of five times. You can cap the number of times it will try to restart a service before giving up. The fact that it’s able to do all of these things without being impossible to actually work with is a credit to the developers.
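To make the "two out of three" idea concrete, here is a minimal Python sketch of that kind of failure window. It is only an illustration of the logic described above, not how Monit implements it, and the names `FailureWindow` and `alerts_for` are made up for this example.

```python
from collections import deque


class FailureWindow:
    """Remember the last `cycles` check results and signal an alert when
    at least `failures` of them failed. Hypothetical helper, not Monit code."""

    def __init__(self, failures, cycles):
        if not 1 <= failures <= cycles:
            raise ValueError("need 1 <= failures <= cycles")
        self.failures = failures
        # A bounded deque drops the oldest result once `cycles` are stored.
        self.results = deque(maxlen=cycles)

    def record(self, failed):
        """Store one check result (True means the check failed) and
        return True if the alert condition is currently met."""
        self.results.append(bool(failed))
        return sum(self.results) >= self.failures


def alerts_for(outcomes, failures, cycles):
    """Replay a sequence of check outcomes and return, per cycle,
    whether an alert would fire."""
    window = FailureWindow(failures, cycles)
    return [window.record(failed) for failed in outcomes]
```

With `failures=2, cycles=3`, a single failed check never triggers an alert on its own. Setting `failures` equal to `cycles` gives the "fails N times in a row" behavior.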

To make sure my server doesn’t use up all its memory again, I have set up monitoring to alert me if the server’s memory usage goes above 50%. (It usually runs at about 20% memory utilization.) I also have a few other basic alerts set up that were in the default configuration file. Then I set it up to make sure that MySQL, Apache, sshd, and Postfix are running and that they are listening on the proper ports.
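For a sense of what "listening on the proper ports" means in practice, here is a small Python sketch of that kind of check. It only tests whether a TCP connection can be opened. The port numbers are the usual defaults for each service and are an assumption on my part, not something taken from my actual configuration. The function names are hypothetical too.

```python
import socket

# Assumed default ports; adjust if your services listen elsewhere.
DEFAULT_SERVICES = {
    "mysql": 3306,
    "apache": 80,
    "sshd": 22,
    "postfix": 25,
}


def is_listening(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False


def down_services(services, host="127.0.0.1", timeout=3.0):
    """Return the sorted names of services whose port is not accepting
    connections."""
    return sorted(
        name
        for name, port in services.items()
        if not is_listening(host, port, timeout)
    )
```

Calling `down_services(DEFAULT_SERVICES)` returns the names of any services that did not accept a connection. A successful connection only shows that something is listening on the port, not that the service behind it is healthy.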

The funny thing about monitoring is that I kind of want something to go wrong just to make sure Monit is doing its job. I wound up setting the alert thresholds to an artificially low level briefly to make sure that Monit would generate alerts, and it did, but I’d like to see it trap an event in the wild to make sure everything is working. Maybe I’ll kill Apache tonight and see if Monit starts it back up.

Monitoring services is a problem that, as a developer, I’ve never solved to my satisfaction. I’ve written scripts that test a service and generate an email if things go wrong, but they’re always one-offs, and keeping them up to date can be a pain. I think that next I’m going to investigate how to extend Monit to add my own monitoring scripts to the mix. Since it’s already hooked into the services on the server, it would be great to be able to set it up to automatically restart those services as needed based on the results of my own script, or just send alerts in cases where nothing can be done automatically but the administrator needs to know something has gone wrong.
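I haven’t worked out the Monit side of this yet, but the script side is easy to sketch. The usual convention is for a check script to exit with status 0 when everything is fine and a non-zero status when it isn’t, so that whatever runs it can act on the result. The Python example below assumes a hypothetical application that touches a heartbeat file while it is healthy. The file, the function names, and the exit codes are all made up for illustration.

```python
import os
import sys
import time


def heartbeat_is_fresh(path, max_age_seconds, now=None):
    """Return True if `path` exists and was modified within the last
    `max_age_seconds` seconds."""
    now = time.time() if now is None else now
    try:
        age = now - os.path.getmtime(path)
    except OSError:
        # A missing or unreadable file counts as a failed check.
        return False
    return age <= max_age_seconds


def main(argv):
    """Return an exit status: 0 = healthy, 1 = check failed, 2 = bad usage."""
    if len(argv) != 3:
        print("usage: check_heartbeat.py PATH MAX_AGE_SECONDS", file=sys.stderr)
        return 2
    path, max_age = argv[1], float(argv[2])
    if heartbeat_is_fresh(path, max_age):
        return 0
    print(f"heartbeat {path} is missing or older than {max_age:g}s",
          file=sys.stderr)
    return 1

# To use this as a standalone script, end the file with:
#     sys.exit(main(sys.argv))
```

Newer Monit releases include a `check program` service type that can react to a program’s exit status, which looks like the natural hook for a script like this. Whether it is available depends on the Monit version, so that needs checking first.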

I expect that my Monit-related adventures will continue. I’ll keep you updated.