Ever have one of your customers call you about a transaction that failed in some way, or a message that never reached its destination, or some other problem? Of course you have. And that usually means tracking down the issue through log files, and lots of them. You have a cluster of application servers, sitting behind load balancers, talking to queue managers and legacy systems, each of which has its own log files to pore over. Did this particular transaction hit node 1 of the cluster, or node 2, 3, 4, ..., X? Did it get to the cluster or fail on the load balancer?
The old way would be to log in to each server and grep through log files. The new way is Splunk (www.splunk.com). It's a log aggregation and indexing tool with a clever search front end. Imagine having a Google-like search engine for ALL of your logs, all in one place.
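To make the contrast concrete, here is roughly what the old way looks like once you get tired of doing it by hand and script it. This is only a sketch, not anything Splunk does: the function name `grep_logs` is made up, and it assumes each node's log directory is already reachable from one machine (mounted or copied over) and that the files end in `.log`.

```python
import re
from pathlib import Path


def grep_logs(log_dirs, pattern):
    """Yield (path, line_number, line) for every log line matching pattern.

    log_dirs is a list of directories, one per node (an assumption of this
    sketch); every *.log file beneath them is scanned from start to finish.
    """
    regex = re.compile(pattern)
    for log_dir in log_dirs:
        for path in sorted(Path(log_dir).rglob("*.log")):
            # errors="replace" keeps a stray bad byte from aborting the scan.
            with path.open(errors="replace") as handle:
                for line_number, line in enumerate(handle, start=1):
                    if regex.search(line):
                        yield path, line_number, line.rstrip("\n")
```

Calling `grep_logs([dir_for_node1, dir_for_node2], "TX-42")` walks every file on every node for each question you ask, which is exactly the repeated work an index is meant to avoid.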
I am playing with the free server version (1.1) right now, and it is slick. There is also a paid version that includes more features and support. The front end leans heavily on AJAX, so you get niceties like a pop-up menu that suggests search terms as you type along with the number of occurrences of each, boolean searches for more complex queries, and search results where every term is clickable. For example, hover over a word or phrase in your search results, and every occurrence in the result set is highlighted. Clicking searches on that term, control-clicking ANDs that term onto the current search criteria, and control-alt-clicking NOTs it, excluding that term from the results.
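If you are wondering how an index makes the as-you-type counts and the AND/NOT clicks cheap, here is a toy version of the idea. It is my own illustration and says nothing about how Splunk is actually built: the `LogIndex` class and its methods are hypothetical, terms are just lowercase runs of word characters, and the count shown is the number of lines containing a term rather than the total number of occurrences.

```python
import re
from collections import defaultdict


class LogIndex:
    """A tiny in-memory inverted index over log lines (illustrative only)."""

    def __init__(self):
        self.lines = []
        self.postings = defaultdict(set)  # term -> ids of lines containing it

    def add(self, line):
        line_id = len(self.lines)
        self.lines.append(line)
        for term in set(re.findall(r"\w+", line.lower())):
            self.postings[term].add(line_id)

    def suggest(self, prefix):
        """Return (term, line_count) pairs for terms starting with prefix."""
        prefix = prefix.lower()
        return sorted(
            (term, len(ids))
            for term, ids in self.postings.items()
            if term.startswith(prefix)
        )

    def search(self, include=(), exclude=()):
        """Return lines with every include term and none of the exclude terms."""
        hits = set(range(len(self.lines)))
        for term in include:  # each extra term is an AND
            hits &= self.postings.get(term.lower(), set())
        for term in exclude:  # each excluded term is a NOT
            hits -= self.postings.get(term.lower(), set())
        return [self.lines[i] for i in sorted(hits)]
```

In these terms, control-clicking a word would add it to `include` and control-alt-clicking would add it to `exclude`; either way the query is answered with set operations instead of a rescan of the logs.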
The backend is highly configurable. For example, you can set up directory monitors to watch a directory for new files, or set up tailing processors to index a live file in real time. There are other options as well, but I'm still new at Splunking. I can already see that this could be an invaluable tool for troubleshooting complex systems, though. I would definitely recommend checking it out soon.
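For anyone who has not run into the two input styles mentioned above, here is a bare-bones sketch of what "watch a directory" and "tail a live file" mean in practice. It is a polling approximation I put together for illustration, not Splunk's implementation or configuration; the function names are hypothetical, and it ignores log rotation and truncation.

```python
import os
import time


def new_files(directory, seen):
    """Return files in directory that are not yet in seen, and record them."""
    current = {entry.path for entry in os.scandir(directory) if entry.is_file()}
    fresh = sorted(current - seen)
    seen.update(fresh)
    return fresh


def read_new_lines(path, offset):
    """Return (complete lines appended since offset, updated offset)."""
    with open(path, "rb") as handle:
        handle.seek(offset)
        chunk = handle.read()
    # Consume only up to the last newline so a half-written line is retried.
    end = chunk.rfind(b"\n") + 1
    lines = chunk[:end].decode("utf-8", errors="replace").splitlines()
    return lines, offset + end


def follow(path, poll_seconds=1.0):
    """Yield lines appended to path indefinitely, much like `tail -f`."""
    offset = os.path.getsize(path)  # start at the end: only new data counts
    while True:
        lines, offset = read_new_lines(path, offset)
        for line in lines:
            yield line
        if not lines:
            time.sleep(poll_seconds)
```

A directory monitor would call `new_files` on a timer and hand each new file to the indexer, while a tailing input would loop over `follow(path)` and index each line as it arrives.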