Wednesday, November 28, 2007

Splunkin' Like A Madman And Log Management


After playing with Splunk over a year ago (and liking what I saw), we finally got the green light to go forward with an implementation at work. Splunk is a log aggregation and search tool. Basically it's a really fancy way of grep'ing through log files without logging in to each server individually -- one search covers the logs on all the servers.

Right now I am in the process of estimating our peak daily volume, which more or less means I am ssh'ing in to each app server and Apache server and looking at the file sizes for the various logfiles. The licensing for Splunk is tiered based on peak daily volume, with a free version that will index up to 500MB a day. Their pricing model seems like a real bargain when you compare it to other enterprise software vendors out there, especially when you see the license fee is a "perpetual license" -- you pay once and it is good forever. If your logging volume increases, you can upgrade the license without screwing around with the software installation, and the cost is prorated as the difference between the old and new tiers.
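For a single box, the per-day number can be roughed out with du. This is just a sketch -- the directory and filenames below are stand-ins (it even creates demo data so it runs as-is); point it at your real Apache and JBoss log directories instead:

```shell
# Sketch: estimate one day's log volume on one server.
# LOGDIR is a stand-in; substitute your real log directories.
LOGDIR=${LOGDIR:-/tmp/volume_demo}
mkdir -p "$LOGDIR"
printf 'GET / 200\n' > "$LOGDIR/access_log"      # demo data so the sketch runs
printf '[error] oops\n' > "$LOGDIR/error_log"

# du -c prints a grand total line; -k reports kilobytes
total_kb=$(du -ck "$LOGDIR"/*log | awk '/total$/ {print $1}')
echo "approx ${total_kb} KB per day on this box"
```

Run that against yesterday's logs on each server, add the numbers up, and you have a ballpark figure for the license tier.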

Everyone I have dealt with over at Splunk has been really cool to work with so far as well. Not a bunch of pushy sales-droid types, and they seem more than willing to work with you for unique configurations.

This should make it a lot easier to find and pinpoint issues across our three JBoss clusters, the handful of stand-alone JBoss and Tomcat servers, and our multiple Apache servers.

Apache Log File Management

And speaking of logs, we had an issue the other day where the mod_jk.log file hit the 2GB file size limit (32-bit Linux) on one of our Apache servers. That was fun to track down, since some stuff seemed to work and other stuff did not. Our web applications were working fine, but web services calls into our boxes were failing. According to our JBoss logs, we were processing the WS calls and returning valid responses, but the clients were getting errors about the connection being closed while reading the response, or null objects being returned, depending on which platform they were connecting from.

So now I am working on getting logrotate set up for our Apache installations. While I am in this mode, I think I'll whip up some Nagios checks to alert if the file sizes grow to, say, 1.5G (which shouldn't happen with daily log rotation and compression).

Setting it up is pretty easy. I edited /etc/logrotate.d/httpd to add the following:

/app/j2http/apache-2.0.55/logs/*log {
    daily
    rotate 10
    compress
    copytruncate
}
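The Nagios check could be a small shell plugin along these lines. This is a rough sketch, not our finished check -- the function name, thresholds, and the mod_jk.log path in the example call are all just illustrations:

```shell
# Hypothetical Nagios-style plugin: complain before a log file
# hits the 2GB limit. Exit/return codes follow the Nagios convention:
# 0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN.
check_logsize() {
    file=$1; warn=$2; crit=$3
    [ -f "$file" ] || { echo "UNKNOWN: $file missing"; return 3; }
    size=$(wc -c < "$file")
    if [ "$size" -ge "$crit" ]; then
        echo "CRITICAL: $file is $size bytes"; return 2
    elif [ "$size" -ge "$warn" ]; then
        echo "WARNING: $file is $size bytes"; return 1
    fi
    echo "OK: $file is $size bytes"; return 0
}

# Example: warn at 1.5G, go critical at 1.8G (byte values)
check_logsize /app/j2http/apache-2.0.55/logs/mod_jk.log 1610612736 1932735283
```

With daily rotation in place this should never fire, which is exactly the kind of check you want it to be.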


Stoner said...

You need to have logrotate restart Apache after it rotates the files, or else Apache will keep writing at the file offset it last wrote to. A graceful restart will cause Apache to reopen its log files.

If you don't want to do daily rotates, you can use the size option to logrotate so it'll only rotate if the log file gets to a certain size.
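A stanza combining both of those suggestions might look something like this -- the path is from the post, but the size threshold and the apachectl location are just examples you'd adjust:

```
/app/j2http/apache-2.0.55/logs/*log {
    size 1500M
    rotate 10
    sharedscripts
    postrotate
        /usr/sbin/apachectl graceful
    endscript
}
```

sharedscripts makes the graceful restart run once per rotation instead of once per matched file.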

Robb said...

I've heard that, but the way we set it up, it copies the current file to the new file name (log.1, log.2, ...) and then truncates the current one. Apache can keep writing to the current file after it gets truncated, so no restart is needed. Tested this morning and it works like a charm.

Slestak said...

To use with Splunk, though, is it advised to suffix the logfiles with dated names instead of an integer?
As the initial data moves into log.1, then log.2, etc., I imagine the indexing can get screwy.

Robb said...

Splunk indexes in real time and keeps its own copy of the original log data. It doesn't need to access the archived log files at all, so it doesn't matter what naming convention you use.
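In practice that means you just point Splunk at the live files. A minimal inputs.conf monitor stanza might look like this -- the path is from my setup, and the blacklist regex (which keeps the rotated log.1, log.2 copies from being re-read) is an illustration:

```
# inputs.conf sketch -- path and blacklist are examples
[monitor:///app/j2http/apache-2.0.55/logs]
# skip rotated copies like access_log.1, access_log.2, ...
blacklist = \.\d+$
```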