Had a client frantically calling and emailing me that his website wasn't loading and was displaying a MySQL "too many connections" error message.
So I dutifully logged on via SSH and tried to crank up the mysql client to have a look at the processlist, only to get "no space on /var/lib/mysql to perform this operation".
Hmm, that's a new one ... how about disk space? Oops, 100% full. Which is strange on a dedicated box that had only ever been 4% full in its two years of operation.
After a while spent trying to find out where my disk space had gone, I happened across the Apache error logs. All 150 of them archived, and the current one at about 300GB - which in itself is strange, as the logrotate cron job should be rotating the logs out after 7 archives. Then I noticed that all these error logs had timestamps within the last 24 hours, and the active log was filled with the same string repeated over and over again:
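Tracking down a space hog like this is the usual du-and-find dance. A sketch of roughly what I ran (the paths are just examples - start wherever your big partitions are mounted):

```shell
#!/bin/sh
# Hunt for whatever is eating the disk.  -x / -xdev keep both
# commands on one filesystem so they don't wander into /proc etc.

# Biggest directories first, human-readable, largest at the top:
du -xh /var 2>/dev/null | sort -rh | head -n 20

# Then the individual monsters -- any single file over a gigabyte:
find /var -xdev -type f -size +1G -exec ls -lh {} + 2>/dev/null || true
```

The `sort -rh` trick (reverse human-numeric sort) pairs with `du -h` so "300G" sorts above "24K", which is exactly what you want when one file is the whole story.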
"PANIC: fatal region error detected; run recovery"
So it looks like Apache had been pumping god-knows-how-many gigabytes of this string into the error logs, which were then gzipped and archived into 150 neat little "zip bombs".
So I killed the Apache process, deleted all the error logs and restarted Apache - and within 5 seconds the error log was back up to 100MB. Killed Apache again.
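One gotcha worth noting about the delete-and-restart dance: removing a log file that a running daemon still holds open does not actually free the space - the inode stays allocated until the process closes its file handle. Truncating the file in place releases the space immediately. A sketch (the path is the Debian default; adjust for your layout):

```shell
#!/bin/sh
# Empty the live log without disturbing Apache's open file handle.
# Deleting the file instead would leave the space allocated until
# Apache closes it, because the unlinked inode stays alive.
LOG="${LOG:-/var/log/apache2/error.log}"
if [ -f "$LOG" ]; then
    : > "$LOG"    # truncate to zero bytes, in place
fi
```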
Google to the rescue, and it was the work of a moment to discover that the problem lay in the mod_gnutls module, an alternative to the usual mod_ssl module. It uses a Berkeley DB "cache", and this cache had somehow become corrupted - bear in mind the cache itself is only 24 kilobytes.
When the cache becomes corrupted, mod_gnutls writes a helpful (sic) little message to the error log which identifies neither the module that is failing, nor the real problem, nor any suggested solution. It then retries the same operation that led to the error in the first place. Again, and again, and again, until the end of the universe.
Turns out all that is needed is to delete the corrupted database cache thingy and restart Apache, whereupon the module makes a brand-new cache database and sanity is restored.
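For the record, the whole fix boils down to something like the sketch below. The cache path here is a guess on my part - the real location comes from the GnuTLSCache directive in your Apache config, so check there first:

```shell
#!/bin/sh
# Recovery sketch: remove the corrupted mod_gnutls session cache and
# let the module rebuild it on startup.  CACHE is an assumed path --
# look up the GnuTLSCache directive in your config for the real one.
CACHE="${CACHE:-/var/cache/apache2/gnutls_cache}"

# Stop Apache first so nothing holds the Berkeley DB open
# (guarded so the sketch doesn't fall over where apachectl is absent).
command -v apachectl >/dev/null 2>&1 && apachectl stop

rm -f "$CACHE"    # mod_gnutls creates a fresh cache on the next start

command -v apachectl >/dev/null 2>&1 && apachectl start || true
```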
So WHY can't this be automated as the solution to the error, instead of spamming the error log until the entire hard disk is filled with gzipped error logs ?
This has been known about since April 2010, and it seems that nine minor revisions later it still hasn't been fixed.
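For reference, the rotation the cron job was supposed to be enforcing looks something like the stanza below. This is a sketch matching the 7-archive behaviour I described above, not a copy of any distro's stock /etc/logrotate.d/apache2, so treat the details as approximate:

```
# Sketch of a logrotate stanza keeping 7 compressed archives.
/var/log/apache2/*.log {
        daily
        rotate 7
        compress
        delaycompress
        missingok
        notifempty
}
```

Of course, none of that helps when a single log grows to 300GB between rotation runs - size-based rotation has no chance against a module writing the same line in a tight loop.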
http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=576676
I can understand the fail-retry graceful-recovery attempt, but wouldn't you think that after 2^63-1 attempts it might start thinking "hey, this doesn't seem to be working"?
Arghh.