Peter Johnson wrote:Pretty much anything that can make a URL would work. I've even seen JMeter used for stuff like this. However, it might be better if you tried to figure out the cause for the "hang". Things that I have seen (and how I figured out the root cause) are:
I have a few ideas about what could be causing the problem, but getting it all nailed down and validated is going to take some time. Since I also realize that problems like this may occur again in the future, I want something in place to help us monitor the server, collect crash statistics and logs, alert us, and possibly issue a restart.
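As a rough idea of what I mean by monitoring, here is a minimal sketch of a URL probe that only raises an alert after several consecutive failures. The class name, the threshold, the 5-second timeout and the alert message are all my own assumptions for illustration, and the restart step is left out entirely:

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;

/** Hypothetical availability probe: a hung server shows up as a timeout, not an error page. */
public class HangWatchdog {

    /** Returns true if the URL answers with a 2xx/3xx status within timeoutMillis. */
    static boolean isResponsive(String url, int timeoutMillis) {
        HttpURLConnection conn = null;
        try {
            conn = (HttpURLConnection) URI.create(url).toURL().openConnection();
            conn.setConnectTimeout(timeoutMillis);
            conn.setReadTimeout(timeoutMillis);
            conn.setRequestMethod("GET");
            int status = conn.getResponseCode();
            return status >= 200 && status < 400;
        } catch (IOException | IllegalArgumentException e) {
            // Timeouts, refused connections and malformed URLs all count as "not responsive".
            return false;
        } finally {
            if (conn != null) {
                conn.disconnect();
            }
        }
    }

    /** Counts consecutive failures so that one slow response does not trigger an alert. */
    static final class FailureCounter {
        private final int threshold;
        private int consecutiveFailures;

        FailureCounter(int threshold) {
            this.threshold = threshold;
        }

        /** Records one probe result; returns true while the failure streak is at or above the threshold. */
        boolean record(boolean healthy) {
            consecutiveFailures = healthy ? 0 : consecutiveFailures + 1;
            return consecutiveFailures >= threshold;
        }
    }

    /** Probes up to `probes` times, `intervalMillis` apart; returns true if the alert threshold was hit. */
    static boolean watch(String url, int probes, long intervalMillis, int threshold) {
        FailureCounter counter = new FailureCounter(threshold);
        for (int i = 0; i < probes; i++) {
            if (counter.record(isResponsive(url, 5_000))) {
                // Assumption: a real setup would send mail or page someone here.
                System.err.println("ALERT: " + url + " unresponsive; take thread dumps before restarting");
                return true;
            }
            try {
                Thread.sleep(intervalMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return false;
    }
}
```

Something like this could be run from cron or a small scheduler. The important part is to capture thread dumps and logs before any automatic restart, otherwise the evidence is gone.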
I suppose another key area I'd like to improve is developing a way to simulate production-level system activity, to hopefully bring these issues to the forefront before they manifest on the main system. Doing this with realistic tests seems like it would take a fair amount of effort. Even then, you have those edge cases where people are using the system in ways you haven't conceived of; perhaps a method of recording production activity for a period of time, so it can be "replayed" in tests, would be useful.
Peter Johnson wrote:a) An infinite loop in the application code. The developers swore to me (or actually, to the person I was interacting with) that there was no such infinite loop. I had them take several JVM thread dumps, several seconds apart and look for threads that were always busy in the same location. They found the loop. Such an issue can "steal" threads from your thread pool (because the threads are never released). And users often won't complain - all they'll see is that their request is taking a long time to complete so they'll try it again and maybe this time they won't hit the combination of factors that causes a loop.
I believe the most likely explanation is an infinite loop. I have seen occasional log messages indicating a stack overflow when our Struts action.findForward() appears to be redirecting indefinitely between two pages. If this is the cause, however, it isn't always being logged.
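The thread dump comparison Peter describes is normally done with external dumps (for example from jstack), but the idea can be sketched in-process as well. The following is only an illustration under my own assumptions: the class and method names are hypothetical, and it compares the class and method of the top stack frame rather than the exact line. Threads waiting in native I/O also report as RUNNABLE, so the result is a list of candidates to inspect, not proof of a loop.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper: finds threads that stay RUNNABLE in the same method across several samples. */
public class StuckThreadFinder {

    /** Captures "class.method" of the top frame for every RUNNABLE thread except the caller. */
    static Map<Thread, String> snapshot() {
        Map<Thread, String> top = new HashMap<>();
        for (Map.Entry<Thread, StackTraceElement[]> e : Thread.getAllStackTraces().entrySet()) {
            Thread t = e.getKey();
            StackTraceElement[] stack = e.getValue();
            if (t != Thread.currentThread()
                    && t.getState() == Thread.State.RUNNABLE
                    && stack.length > 0) {
                top.put(t, stack[0].getClassName() + "." + stack[0].getMethodName());
            }
        }
        return top;
    }

    /**
     * Takes `samples` snapshots, `intervalMillis` apart, and keeps only the threads
     * that were RUNNABLE in the same method in every one of them.
     */
    static Map<Thread, String> findStuck(int samples, long intervalMillis) {
        Map<Thread, String> candidates = snapshot();
        for (int i = 1; i < samples; i++) {
            try {
                Thread.sleep(intervalMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            }
            Map<Thread, String> next = snapshot();
            // Drop any thread that moved elsewhere, stopped running, or exited.
            candidates.entrySet().removeIf(c -> !c.getValue().equals(next.get(c.getKey())));
        }
        return candidates;
    }
}
```

Calling something like `StuckThreadFinder.findStuck(5, 3000)` from an admin page or JMX hook would give a short list of thread names and methods to check against the forwarding code.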
Peter Johnson wrote:b) Heap space issues. Gather garbage collection statistics and use them to right-size your heap. If you end up filling up the heap, then the JVM will constantly perform major collections, which slows things down to a crawl.
We've experienced these in the past; fortunately, they tend to generate log messages. The server hardware has a large amount of memory, far in excess of what could be maxed out under most conditions, so I doubt this is the cause, but it is worth looking at those garbage collection statistics again for sure.
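For gathering the garbage collection statistics, one option is to read them from the JVM's standard management beans, which is worth doing since what matters is the configured maximum heap rather than the physical memory on the box. This is a minimal sketch; the class name, method names and output format are my own assumptions:

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

/** Hypothetical helper that reports heap usage and cumulative GC activity for the current JVM. */
public class GcStats {

    /** One-line summary: heap sizes in MB, then count and total time for each collector. */
    static String summary() {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        StringBuilder sb = new StringBuilder();
        sb.append("heap used=").append(heap.getUsed() >> 20).append("MB")
          .append(" committed=").append(heap.getCommitted() >> 20).append("MB")
          .append(" max=")
          .append(heap.getMax() < 0 ? "undefined" : (heap.getMax() >> 20) + "MB");
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            sb.append(" | ").append(gc.getName())
              .append(" count=").append(gc.getCollectionCount())
              .append(" timeMs=").append(gc.getCollectionTime());
        }
        return sb.toString();
    }

    /** Total milliseconds spent in all collectors since JVM start (undefined values are skipped). */
    static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime(); // -1 means the collector does not report it
            if (t > 0) {
                total += t;
            }
        }
        return total;
    }

    /** Fraction of a sampling window spent in GC, given two totalGcMillis() readings. */
    static double gcTimeFraction(long gcMillisBefore, long gcMillisAfter, long elapsedMillis) {
        return (gcMillisAfter - gcMillisBefore) / (double) elapsedMillis;
    }
}
```

Logging `summary()` once a minute, and watching whether the GC time fraction climbs toward 1.0 while used heap stays close to max, should show fairly quickly whether the JVM is stuck in back-to-back major collections.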
Peter Johnson wrote:c) Poor database access schemes, poorly written queries or poorly planned database updates. You need to gather database statistics to track these down.
Fortunately, this is an area we are typically okay on. I have been tuning the database part of the application for some time and using monitoring tools such as innotop to monitor the system. The application has overcome many hurdles related to a design that wouldn't scale well at first.
I'm thinking the issue is infinite loops, and fortunately we have a later version of the module that gets us into the loops. The later version, instead of forwarding between pages, uses more pop-up modal windows, which reduces the complexity. I'm happy to say that we are moving in the direction of heavier use of modal windows and away from all of the forwarding logic, which can get so cumbersome on complex screens.