I played with Splunk over a year ago and liked what I saw, and we finally got the green light to go forward with an implementation at work. Splunk is a log aggregation and search tool. Basically it's a really fancy way of grep'ing through log files without having to log in to each server first. One search covers the logs across all the servers.
Right now I am in the process of estimating our peak daily volume, which more or less means I am ssh'ing in to each app server and Apache server and looking at the file sizes for the various log files. The licensing for Splunk is tiered based on peak daily volume, with a free version that will index up to 500 MB a day. Their pricing model seems like a real bargain when you compare it to other enterprise software vendors out there, especially when you see the license fee is a "perpetual license" -- you pay once and it is good forever. If your logging volume increases, you can upgrade the license without screwing around with the software installation, and you only pay the difference between the old and new license prices.
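For what it's worth, the eyeballing step could be scripted. The following is only a rough sketch, not what I actually ran: it adds up the sizes of log files modified in the last 24 hours on one server. The names `daily_log_volume`, `fits_free_tier` and `FREE_TIER_BYTES` are made up for illustration, and reading "500 MB" as 500 * 1024 * 1024 bytes is my assumption. For logs that are not rotated daily the number overestimates, because the whole file is counted rather than just the day's growth.

```python
import os
import time

# Assumption: the free tier's "500 MB a day" is read here as 500 * 1024 * 1024 bytes.
FREE_TIER_BYTES = 500 * 1024 * 1024


def daily_log_volume(paths, window_seconds=24 * 60 * 60, now=None):
    """Return the total size in bytes of the files modified within the window.

    This is an upper bound on one day's volume: a file that has been
    growing for a week still contributes its full size.
    """
    now = time.time() if now is None else now
    total = 0
    for path in paths:
        try:
            info = os.stat(path)
        except OSError:
            # Skip files that are missing or unreadable.
            continue
        if now - info.st_mtime <= window_seconds:
            total += info.st_size
    return total


def fits_free_tier(volume_bytes):
    """True if the estimated volume stays within the assumed free-tier limit."""
    return volume_bytes <= FREE_TIER_BYTES
```

Calling `daily_log_volume` with the list of log paths on each server and summing the results gives a ballpark daily figure to compare against the license tiers.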
Everyone I have dealt with over at Splunk has been really cool to work with so far as well. Not a bunch of pushy sales-droid types, and they seem more than willing to work with you for unique configurations.
Splunk should make it a lot easier to find and pinpoint issues across our three JBoss clusters, the handful of stand-alone JBoss and Tomcat servers, and our multiple Apache servers.
Apache Log File Management
And speaking of logs, we had an issue the other day where the mod_jk.log file hit the 2 GB file size limit (32-bit Linux) on one of our Apache servers. That was fun to track down, since some stuff seemed to work and other stuff did not. Our web applications were working fine, but web service calls into our boxes were failing. According to our JBoss logs, we were processing the WS calls and returning valid responses, but the clients were getting errors about the connection being closed while reading the response, or null objects being returned, depending on which platform they were connecting from.
So now I am working on getting logrotate set up for our Apache installations. While I am in this mode, I think I'll whip up some Nagios checks to alert if the file sizes grow to, say, 1.5G (which shouldn't happen with daily log rotation and compression).
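A check like that could look something like the sketch below. It is a minimal example rather than a finished plugin, and it relies on the usual Nagios plugin convention of exit codes 0, 1, 2 and 3 for OK, WARNING, CRITICAL and UNKNOWN. The 1.5G warning level comes from the idea above; the critical level of 1.9G and the names `check_file_size` and `main` are hypothetical choices for illustration. The 2 GB ceiling itself is 2^31 - 1 bytes, the largest value a signed 32-bit file offset can hold.

```python
import os
import sys

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

GIB = 1024 ** 3
LIMIT_32BIT = 2 ** 31 - 1      # largest signed 32-bit file offset, in bytes
WARN_BYTES = 3 * GIB // 2      # the 1.5G warning level
CRIT_BYTES = 19 * GIB // 10    # hypothetical critical level, just under the limit


def check_file_size(path, warn_bytes=WARN_BYTES, crit_bytes=CRIT_BYTES):
    """Return (exit_code, status_line) for the size of a single log file."""
    try:
        size = os.path.getsize(path)
    except OSError as exc:
        return UNKNOWN, f"UNKNOWN - cannot stat {path}: {exc}"
    if size >= crit_bytes:
        return CRITICAL, f"CRITICAL - {path} is {size} bytes"
    if size >= warn_bytes:
        return WARNING, f"WARNING - {path} is {size} bytes"
    return OK, f"OK - {path} is {size} bytes"


def main(argv):
    """Plugin entry point: expects the log file path as the only argument."""
    if len(argv) != 2:
        print(f"UNKNOWN - usage: {argv[0]} /path/to/logfile")
        return UNKNOWN
    code, message = check_file_size(argv[1])
    print(message)
    return code
```

To run it as an actual plugin, the script would end with `sys.exit(main(sys.argv))`, and Nagios would be pointed at it with the path to mod_jk.log (or any other log) as the argument.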
Setting up logrotate is pretty easy. I edited the /etc/logrotate.d/httpd file to add the following: