One issue was that, even though nothing is running on this Linux box except a single instance of Tomcat (5.5) with this single web application deployed to it, we frequently get paged for "high CPU utilization". We log in to the machine, and the Java process is taking 100% CPU. It might stay like that for a few hours, or even a few days. Occasionally it clears up on its own, but usually the server stops responding and we have to restart Tomcat.
We have had another issue where we would get an OutOfMemoryError (Java heap space), and the JVM would stop running. The two issues weren't necessarily related, but we couldn't rule out the possibility either.
We don't have much visibility into the application and what's going on, since it is a vendor-built "black box". We have some customized source code, but most of it is off limits to us. The vendor rolled their own database connection pool, MVC framework, persistence framework, etc.
In comes JConsole. JConsole is an awesome tool that comes bundled with JDK 1.5 and above. It connects to the JVM and gives you all the info you could want on the various JVM memory pools, garbage collection, threads, and class loading, and it lets you manage anything exposed via JMX. It also has lots of pretty graphs, such as memory usage and garbage collections over time, for any or all of the memory pools (heap, non-heap, or individual pools such as PermGen and Eden space). The same goes for threads and loaded classes: current number, peak, and total created.
The best thing about JConsole is the ability to connect to remote JVMs, so you don't add too much overhead on the box being monitored. I have JConsole hooked up to my test and production servers, and it helped me prove that the two issues above were connected. There is some condition in the application (yet to be found; the first step was proving what is really happening) that causes a substantial memory leak. Once memory usage reaches its ceiling, the garbage collector basically runs constantly, reclaiming some trivial amount of memory that fills right back up and triggers another GC.
With an average of 100 active sessions at a time plus full garbage collection running non-stop, the CPU gets consumed quickly. I monitored the app for about a week and a half, and memory and CPU looked great; there were a bit over 200 full GCs in that time period. Then this past weekend, we had to restart because of a DNS issue. That was Friday evening, and by Monday morning over 2000 full GCs had been performed, and I was restarting a non-responsive server by lunchtime.
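The numbers JConsole graphs here (heap usage against its ceiling, and collection counts per collector) are also available from inside any JVM through the standard java.lang.management API. Here is a minimal sketch of reading them; the class name GcSnapshot and its helper methods are made up for illustration and have nothing to do with the vendor application.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

// Hypothetical helper: prints the same kind of figures JConsole charts.
public class GcSnapshot {

    /** Fraction of the heap ceiling currently in use, or -1 if the JVM reports no maximum. */
    public static double heapUsedFraction() {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        long max = heap.getMax(); // -1 when the maximum is undefined
        return max <= 0 ? -1.0 : (double) heap.getUsed() / max;
    }

    /** Total collections since JVM start, summed over every collector that reports a count. */
    public static long totalCollections() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long count = gc.getCollectionCount(); // -1 when the collector does not report it
            if (count > 0) {
                total += count;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.printf("heap used: %.1f%% of max%n", heapUsedFraction() * 100);
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.println(gc.getName() + ": " + gc.getCollectionCount()
                    + " collections, " + gc.getCollectionTime() + " ms total");
        }
    }
}
```

A heap fraction that sits near 1.0 while the collection count for the old-generation collector keeps climbing is the pattern described above. Collector names vary by JVM and GC settings, so which one counts as "full GC" is something to check on your own setup.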
Below are two screenshots. The first shows several days of "normal" memory usage: notice the gradual rise and then the sharp decrease at each full GC, all the while staying well below the JVM's allotted memory ceiling. The second shows this past weekend's issue, where memory hovers at the ceiling and a full GC doesn't reclaim much. Also notice that the old generation memory pool is quite full.
MAKING IT HAPPEN
To set up JConsole to run on a local JVM, you only need to pass one extra argument to the JVM:
-Dcom.sun.management.jmxremote
To set up remote (with no security), it is a matter of adding a few more parameters to the remote JVM at startup:
-Dcom.sun.management.jmxremote.port=8004 -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false
Then when you start up JConsole, go to the Remote tab and enter the server name and the port that you specified (in this case, 8004).
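The same remote endpoint can be reached from code, which is handy if you ever want to script a check instead of watching graphs. Below is a rough sketch using the standard javax.management.remote API. It assumes the JVM was started with the unsecured settings above; the class name RemoteHeapCheck and the serviceUrl helper are my own inventions for illustration.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import javax.management.MBeanServerConnection;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

// Hypothetical example: connect to a remote JVM over JMX and print its heap usage.
public class RemoteHeapCheck {

    /** Builds the standard RMI-based JMX service URL for a host and jmxremote port. */
    public static String serviceUrl(String host, int port) {
        return "service:jmx:rmi:///jndi/rmi://" + host + ":" + port + "/jmxrmi";
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 1) {
            System.err.println("usage: java RemoteHeapCheck <host> [port]");
            return;
        }
        String host = args[0];
        int port = args.length > 1 ? Integer.parseInt(args[1]) : 8004; // port from the flags above

        // No credentials are passed, which only works with authenticate=false and ssl=false.
        JMXConnector connector = JMXConnectorFactory.connect(new JMXServiceURL(serviceUrl(host, port)));
        try {
            MBeanServerConnection conn = connector.getMBeanServerConnection();
            // Proxy for the remote JVM's memory MXBean.
            MemoryMXBean memory = ManagementFactory.newPlatformMXBeanProxy(
                    conn, ManagementFactory.MEMORY_MXBEAN_NAME, MemoryMXBean.class);
            System.out.println("remote heap: " + memory.getHeapMemoryUsage());
        } finally {
            connector.close();
        }
    }
}
```

Keep in mind that with authentication and SSL turned off, anyone who can reach that port can do the same thing, so this setup only makes sense on a trusted network.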
Simple as that. It starts collecting stats immediately and the graphs appear. As you explore the JMX tree, you will notice you can click on some of the stats and the simple integer display "opens up" into a full graph. I'm doing this to watch the active session patterns through the day and week.
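Graphing a stat like that boils down to polling a JMX attribute at an interval. Here is a small sketch of the idea, run against the local JVM's thread count so it works anywhere. The class AttributeSampler and its method are hypothetical. For Tomcat session counts you would pass a remote connection and, I believe, a pattern along the lines of Catalina:type=Manager,* with the attribute activeSessions, but treat those names as an assumption and confirm them in JConsole's MBean tree first.

```java
import java.lang.management.ManagementFactory;
import java.util.Set;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;

// Hypothetical example: poll a numeric JMX attribute, the way a JConsole graph does.
public class AttributeSampler {

    /** Sums a numeric attribute across every MBean whose name matches the pattern. */
    public static long sumAttribute(MBeanServerConnection conn, String pattern, String attribute) {
        try {
            long total = 0;
            Set<ObjectName> names = conn.queryNames(new ObjectName(pattern), null);
            for (ObjectName name : names) {
                Object value = conn.getAttribute(name, attribute);
                if (value instanceof Number) {
                    total += ((Number) value).longValue();
                }
            }
            return total;
        } catch (Exception e) {
            throw new IllegalStateException("could not read " + attribute + " from " + pattern, e);
        }
    }

    public static void main(String[] args) throws InterruptedException {
        // The local platform MBean server stands in for a remote connection here.
        MBeanServerConnection local = ManagementFactory.getPlatformMBeanServer();
        for (int i = 0; i < 3; i++) {
            long threads = sumAttribute(local, "java.lang:type=Threading", "ThreadCount");
            System.out.println(System.currentTimeMillis() + " live threads: " + threads);
            Thread.sleep(1000); // one sample per second
        }
    }
}
```

Summing over a pattern rather than one exact name means a server hosting several web applications would report a combined figure; pass a fully qualified name if you only want one.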