Over the past two days I’ve been linked to by Daring Fireball and BoingBoing. I’m running WordPress on a virtual server from SliceHost with 512 megs of RAM. Because I’m an incompetent systems administrator and just run everything with the defaults, the server did not react well to the additional traffic. Here’s a list of the things I did to whip the server into shape.
The first problem was that the load average on the server was spiking and it was becoming non-responsive. Even logging into the server via SSH took minutes, and Web pages weren’t really loading at all. WordPress was not using the database efficiently and the database load was killing the server. I attacked this problem by taking advantage of caching.
I discovered that query caching was not properly enabled in MySQL. The cache was enabled, but the cache size was set to zero, so nothing was being cached. After tweaking things a bit, I wound up giving MySQL a 10 megabyte cache. (You can read about setting up the MySQL cache in this article.) Since my server often runs into RAM problems, I didn’t want to allocate too much RAM to a new feature.
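Roughly what that looks like in my.cnf (the file location varies by distro; the query_cache_limit line is just an illustrative extra, not something mentioned above):

```ini
# In the [mysqld] section of my.cnf (e.g. /etc/mysql/my.cnf on Debian/Ubuntu)
[mysqld]
# 1 = cache results of cacheable SELECT statements
query_cache_type  = 1
# The 10 megabyte cache; a size of 0 means nothing gets cached
query_cache_size  = 10M
# Optional: skip caching any single result bigger than this
query_cache_limit = 1M
```

After restarting MySQL, running SHOW STATUS LIKE 'Qcache%' will tell you whether the cache is actually getting hits.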
I also set up caching inside WordPress, using the DB Cache Reloaded plugin. I like it a bit better than plugins that cache entire pages, like WP Super Cache. Those plugins are probably worth it for really big sites that get millions of hits a day, but my traffic is relatively low most of the time, so my goal is just to make sure the server doesn’t blow up entirely when traffic spikes. DB Cache Reloaded does the job.
That made things work, mostly. However, I also ran into problems with MySQL going away; when that happens, WordPress just generates a page saying it can’t connect to the database. I’ve seen that happen during traffic spikes before, and I’m still not sure what causes it. My guess is some kind of lock contention issue: WordPress uses MyISAM tables, which don’t support row-level locking. I may switch them over to InnoDB over the holiday. I had to log in and restart MySQL a couple of times over the past 24 hours, but it hasn’t happened yet today.
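If I do get around to the InnoDB switch, the conversion itself is just an ALTER TABLE per table. A sketch, assuming WordPress’s default wp_ table prefix and a database named wordpress:

```sql
-- See which storage engine each table currently uses
-- (the database name is whatever your WordPress install actually uses)
SHOW TABLE STATUS FROM wordpress;

-- Convert the busiest tables; repeat for each wp_* table
ALTER TABLE wp_posts    ENGINE = InnoDB;
ALTER TABLE wp_comments ENGINE = InnoDB;
ALTER TABLE wp_options  ENGINE = InnoDB;
```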
Once I stopped overtaxing the database, things started slowing down because Apache was spawning so many processes that it used all of the memory on the server. Basically, when Apache spawns more than 50 processes, the server starts getting low on memory, which slows things down, which causes Apache to take longer to serve requests, which causes even more processes to be spawned as incoming requests pile up, until the server grinds to a stop. I looked at my Apache configuration and saw that it was allowed to spawn as many as 150 processes. Given that they consume about 25 megabytes of memory each, this did not work well with my puny server. Cranking the MaxClients setting down to 25 did the trick here.
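On Apache 2.2 with the prefork MPM, that change lives in the prefork block of the Apache config. The values other than MaxClients below are just plausible numbers for a small VPS, not necessarily what you should use:

```apache
# Apache 2.2, prefork MPM: cap worker processes so 25 processes
# at roughly 25 MB each stays comfortably under 512 MB of RAM.
<IfModule mpm_prefork_module>
    StartServers          2
    MinSpareServers       2
    MaxSpareServers       5
    # Hard limit on simultaneous worker processes
    MaxClients           25
    # Recycle workers periodically so PHP memory growth doesn't accumulate
    MaxRequestsPerChild 500
</IfModule>
```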
When I changed that setting, I also lowered the KeepAliveTimeout setting to 5 seconds. When KeepAlive is enabled, the server allows a browser to submit multiple requests over the same connection if it asks to do so. When a browser opens a persistent connection, it keeps its claim on the process serving its requests until it closes the connection or the timeout expires. Because I lowered the number of processes, I lowered the timeout as well, so that ill-behaved browsers that aren’t actually going to request more content don’t block other people from connecting.
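Both directives live in the main Apache configuration (httpd.conf or apache2.conf, depending on the distro); MaxKeepAliveRequests is shown at its stock value just for context:

```apache
# Allow persistent connections, but don't let an idle browser
# hold onto a worker process for long.
KeepAlive On
MaxKeepAliveRequests 100
KeepAliveTimeout 5
```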
Things are working better right now, and I’d be much happier if I knew what was causing the intermittent failures I am seeing with MySQL.
I should also probably do a better job of monitoring the server. The only diagnostic tools I used throughout the process were top and reloading the home page to see whether the server was responsive.
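Even a dumb cron job that exercises the two failure modes I actually hit (an unresponsive front page and a dead database) would be an improvement. A rough sketch, with a placeholder hostname and email address:

```sh
#!/bin/sh
# Rough monitoring sketch: complain by email if the home page or MySQL
# stops answering. Replace example.com and the address with real values;
# mysqladmin picks up credentials from ~/.my.cnf if present.

if ! curl -sf --max-time 15 -o /dev/null http://example.com/; then
    echo "home page did not load" | mail -s "web check failed" admin@example.com
fi

if ! mysqladmin ping >/dev/null 2>&1; then
    echo "mysqld is not answering" | mail -s "mysql check failed" admin@example.com
fi
```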
November 27, 2010 at 12:16 pm
One thing I’ve done on my slice is set up a Pingdom check, specifically to see whether I’m able to establish a DB connection.
That’s tended to work a bit better than monit or munin, which are still helpful for monitoring. The Pingdom check is useful specifically because it’s external and acts as a dead man’s switch: the test must explicitly pass for the server to be considered “up.” I’ve had a few cases where monit fell over due to memory or CPU issues and I didn’t find out until much later that I’d had an “event.”
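The check itself can be a tiny PHP page Pingdom polls, one that only returns a 200 when the database actually answers. Something along these lines, with placeholder credentials (pull the real ones from wp-config.php or a file outside the web root):

```php
<?php
// db-check.php: a URL for Pingdom that only succeeds when MySQL is reachable.
// The credentials below are placeholders.
$link = @mysqli_connect('localhost', 'db_user', 'db_password', 'wordpress');

if ($link === false) {
    header('HTTP/1.1 500 Internal Server Error');
    echo 'DB DOWN';
} else {
    mysqli_close($link);
    echo 'OK';
}
```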
November 29, 2010 at 2:09 pm
Have you considered simply putting a cache like Varnish in front of your site? I’ve been doing that for a while with high-traffic sites, and it’s been a LOT easier than various invasive back-end work: it’s completely transparent, it solves several DoS scenarios (e.g. Slowloris), and it automatically avoids dog-piling a slow backend (grace periods are nice, too, since your visitors can at least get stale content rather than a 500 page). As long as you’re setting decent cache-control headers, it’s a 5-minute install.
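The config can be minimal, too. Something like this sketch (Varnish 2.1/3.0-era VCL, assuming Apache has been moved to port 8080 so Varnish can listen on 80; the grace value is just an example):

```vcl
backend default {
    .host = "127.0.0.1";
    .port = "8080";
}

sub vcl_recv {
    # Accept content up to an hour stale if the backend is slow or down.
    set req.grace = 1h;
}

sub vcl_fetch {
    # Keep objects past their TTL so grace mode has something to serve.
    set beresp.grace = 1h;
}
```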
November 29, 2010 at 2:13 pm
Also: http://www.webpagetest.org/result/101129_FG2/1/performance_optimization/ suggests some client optimizations which appear to be quite safe since the URLs incorporate a version.
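Since the asset URLs are versioned, far-future expiry headers are safe to add; with mod_expires enabled it's only a few lines (the types and lifetimes here are just an example):

```apache
<IfModule mod_expires.c>
    ExpiresActive On
    # Versioned static assets can be cached aggressively by browsers
    ExpiresByType text/css "access plus 1 year"
    ExpiresByType application/x-javascript "access plus 1 year"
    ExpiresByType image/png "access plus 1 year"
</IfModule>
```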
November 30, 2010 at 12:55 pm
You mentioned some concerns with a full-page cache; anything in particular?
November 30, 2010 at 2:41 pm
I just worry about people being served stale pages.
November 30, 2010 at 6:01 pm
Stale pages are certainly a possibility. Like so many other things, caching is a trade-off, but in most cases you can configure a reasonable balance that improves performance.
For full-page caching you could look at a combination of things: timing and actions.
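By “actions” I mean tying cache invalidation to the events that actually change content, and letting a time-based expiry cover everything else. A rough WordPress-flavored sketch; purge_full_page_cache() is a placeholder for whatever invalidation your particular cache provides:

```php
<?php
// Sketch: pair a short TTL on cached pages with explicit purges when content changes.
function purge_full_page_cache( $unused = null ) {
    // e.g. send a PURGE request to Varnish, or clear the page-cache plugin's files
}

// Purge when a post is published or edited, and when a comment comes in.
add_action( 'save_post',    'purge_full_page_cache' );
add_action( 'comment_post', 'purge_full_page_cache' );
```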