Chris Case

Greenhorn
since Sep 25, 2011

Recent posts by Chris Case

As for detecting a crash, the following script is what I'm going to use to detect this kind of a crash in the future:

13 years ago
Well, I did eventually learn how to discover a crash; better yet, I learned why the server crashed and I think I have it fixed now. I did a thread dump during a crash and saw many threads with threadState BLOCKED and WAITING. When I examined the thread with the WAITING threadState, I saw that it was waiting for a connection from Hibernate's c3p0 thread pool. This pool was too small and was creating a bottleneck during times when usage spikes.

For the full detail, please refer to this thread:

https://coderanch.com/t/559684/EJB-JEE/java/Separating-JMS-Producer-Consumer-JBoss
13 years ago
It looks like the thread dump helped to find the root cause and ultimately solve the problem once and for all.

The problem appears to have been our connection pool's max_size: the c3p0 pool configured for Hibernate, the ORM middleware we are using, was too small. We were using a c3p0.max_size of 20, which could not keep up with the amount of activity on that server. This has been changed to 150.
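To make the bottleneck concrete, here is a minimal sketch that models a connection pool as a plain semaphore. It is not c3p0 and not the application's code; the class name, method name, and the spike of 60 simultaneous requests are made up for illustration. It only shows the arithmetic: every request beyond max_size has to wait until a connection is released.

```java
import java.util.concurrent.Semaphore;

class PoolExhaustionSketch {
    /**
     * Models a connection pool as a semaphore with maxSize permits and
     * returns how many of the concurrent requests could NOT get a
     * connection immediately. In a real pool those callers would wait.
     */
    static int requestsLeftWaiting(int maxSize, int concurrentRequests) {
        Semaphore pool = new Semaphore(maxSize);
        int waiting = 0;
        for (int i = 0; i < concurrentRequests; i++) {
            // tryAcquire() never blocks: false means no connection is free.
            if (!pool.tryAcquire()) {
                waiting++;
            }
        }
        return waiting;
    }

    public static void main(String[] args) {
        int spike = 60; // hypothetical number of simultaneous requests
        System.out.println("max_size=20:  " + requestsLeftWaiting(20, spike) + " waiting");
        System.out.println("max_size=150: " + requestsLeftWaiting(150, spike) + " waiting");
    }
}
```

With a spike of 60 requests, a pool of 20 leaves 40 callers waiting, while a pool of 150 leaves none. The right max_size still depends on what the database itself can handle, so 150 is a value for this server, not a general recommendation.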

Without the thread dump, this was an elusive problem, as there were no clear indicators of what was freezing the server. However, once I saw in the thread dump that everything that was blocked was waiting on a HibernateSession, and that the stack trace of the WAITING thread showed it waiting for a connection, a few Google searches later I had an answer.

Here is a snippet of that stack trace which gave the vital clue (the full stack trace is in the previous message):



This is the hibernate.cfg.xml file I was referring to, in case anyone needs it for the context of what I'm talking about:



Anyhow, I appreciate the help you guys gave me very much. This issue has been bugging me for many months and I was wondering how, if ever, I was going to get past it. Like most things, the answer was very simple, once I knew where to look. The use of thread dumps is sure to be of great use in the future.

I've already written a script which I am going to use to monitor for blocked threads in the future. I'll include it here for feedback, or if anyone else wants it for their own use. I run it as a cron job every minute.

I did a thread dump during an instance where the system was "locked up" and I think this more or less provides the information I need.

I see about 500 different threads, most of them in threadState:BLOCKED, held up in one way or another by the getHibernateSession() call we use to open up a Hibernate session to read from the database.
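With hundreds of threads in a dump, a quick tally per state is easier to read than scrolling. The sketch below counts states in a dump that has been saved as text. It assumes each thread entry contains a marker of the form "threadState:BLOCKED", as in the dump described here; the class name and the sample text in main are invented for the example.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ThreadStateTally {
    // Assumed dump format: every thread entry contains "threadState:<STATE>".
    private static final Pattern STATE = Pattern.compile("threadState:\\s*([A-Z_]+)");

    /** Counts how often each thread state appears in the dump text. */
    static Map<String, Integer> tally(String dumpText) {
        Map<String, Integer> counts = new TreeMap<>();
        Matcher m = STATE.matcher(dumpText);
        while (m.find()) {
            counts.merge(m.group(1), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Made-up sample text, not a real dump.
        String sample = "worker-1 threadState:BLOCKED\n"
                + "worker-2 threadState:BLOCKED\n"
                + "worker-3 threadState:WAITING\n";
        System.out.println(tally(sample));
    }
}
```

A monitoring job could alert when the BLOCKED count passes some threshold, but what counts as abnormal has to come from comparing against a dump taken during normal operation.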

Here is one of the stack traces from the dump:



My first instinct is to go through some of these Actions and look for situations where we can avoid having to open and use a Hibernate session. I'm sure there are situations where these become nested: say you have a Hibernate session opened for general use, then you call a function which opens another Hibernate session, and so on. I can already see, after reviewing the code, that there are places where we'd be better off loading this information from a session variable instead of loading from the database.

EDIT: When I looked towards the beginning of the "thread group: main", where these stacks first start to appear, I see this. I see a WAITING thread (related to hibernate session) with an awaitAcquire method near the top, followed by a deluge of BLOCKED threads. Not sure yet if this is significant; but it is worth noting.

Baski Reddy wrote:What version of JDK is in use? May be the standard JDK monitoring tools like jstack,jconsole... You need two things to troubleshoot further
- Thread dumps
- GC/Heap Dumps



We're currently using OpenJDK. Here are the specifics:



Thanks for the tip. I found a way to take a thread dump using the built-in utility "twiddle".



I have attached links showing what the thread dump looks like during normal operation.

I'm going to take one of these snapshots during the next crash. Also, I'll be sure that garbage collection logging is taking place.

I've got it running with a command line arg similar to:



What I may do, as a general rule, is have a script which runs when a crash is reported. I could have it take the thread dump, tail certain log files, write various other info, zip it up and email it to me. It will be interesting to see what the thread dump shows next time I have to do a restart.

Peter Johnson wrote:Pretty much anything that can make a URL would work. I've even seen JMeter used for stuff like this. However, it might be better if you tried to figure out the cause for the "hang". Things that I have seen (and how I figured out the root cause) are:



I have a few ideas of what could be causing the problem; but getting it all nailed down and validated is going to take some time. Since I also realize that problems like this may occur in the future, I want to have something to help us monitor it, collect crash statistics/logs, alert us and possibly issue a restart.

I suppose another key area I'd like to improve is developing a way to simulate production-level system activity, to hopefully bring these issues to the forefront before they manifest on the main system. This seems like it would require a fair amount of effort to accomplish with realistic tests. Even still, you have those edge cases where people are using the system in ways you haven't conceived of; perhaps a method of recording production activity for a period of time so it can be "replayed" in tests, would be useful.

Peter Johnson wrote:a) An infinite loop in the application code. The developers swore to me (or actually, to the person I was interacting with) that there was no such infinite loop. I had them take several JVM thread dumps, several seconds apart and look for threads that were always busy in the same location. They found the loop. Such an issue can "steal" threads from your thread pool (because the threads are never released). And users often won't complain - all they'll see is that their request is taking a long time to complete so they'll try it again and maybe this time they won't hit the combination of factors that causes a loop.



I believe the most likely explanation is an infinite loop. I have seen occasional log messages indicating a stack overflow when our Struts action.findForward() appears to be redirecting indefinitely between two pages. If this is the case, however, it isn't always being logged.
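The "several dumps, several seconds apart" technique can be sketched in plain Java. This is an in-JVM illustration only, assuming Java 8 or later: it would have to run inside the server's JVM to be useful there, and the class and method names are made up. It takes a few snapshots with Thread.getAllStackTraces() and keeps the threads that are RUNNABLE in the same method every time.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class StuckThreadSketch {
    static volatile boolean stop = false;

    /** Deliberately endless loop, used only to have something to detect. */
    static void spin() {
        while (!stop) {
            // busy-wait
        }
    }

    /** Maps thread id to "name in Class.method" for RUNNABLE threads, skipping the caller. */
    static Map<Long, String> topFrames() {
        Map<Long, String> frames = new HashMap<>();
        for (Map.Entry<Thread, StackTraceElement[]> entry : Thread.getAllStackTraces().entrySet()) {
            Thread t = entry.getKey();
            StackTraceElement[] stack = entry.getValue();
            if (t == Thread.currentThread() || stack.length == 0) {
                continue;
            }
            if (t.getState() == Thread.State.RUNNABLE) {
                frames.put(t.getId(), t.getName() + " in "
                        + stack[0].getClassName() + "." + stack[0].getMethodName());
            }
        }
        return frames;
    }

    /** Threads that were RUNNABLE in the same method in every snapshot. */
    static List<String> suspects(int snapshots, long pauseMillis) {
        Map<Long, String> common = topFrames();
        for (int i = 1; i < snapshots; i++) {
            try {
                Thread.sleep(pauseMillis);
            } catch (InterruptedException ex) {
                Thread.currentThread().interrupt();
                break;
            }
            Map<Long, String> next = topFrames();
            // Drop every thread whose top frame changed or disappeared.
            common.entrySet().removeIf(e -> !e.getValue().equals(next.get(e.getKey())));
        }
        return new ArrayList<>(common.values());
    }

    public static void main(String[] args) {
        Thread spinner = new Thread(StuckThreadSketch::spin, "demo-spinner");
        spinner.setDaemon(true);
        spinner.start();
        for (String s : suspects(3, 200)) {
            System.out.println(s);
        }
    }
}
```

Expect some false positives: JVM housekeeping threads and threads blocked in native socket reads also report RUNNABLE with an unchanging top frame. The interesting entries are the ones sitting in application code.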

Peter Johnson wrote:b) Heap space issues. Gather garbage collection statistics and use them to right-size your heap. If you end up filling up the heap, then the JVM will constantly perform major collections, which slows things down to a crawl.



We've experienced these in the past, fortunately they tend to generate log messages. The server hardware has a large amount of memory, far in excess of what could be maxed out under most conditions, so I doubt this is the case; but it is worth looking at those garbage collection statistics again for sure.
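As a complement to garbage collection logging, the JVM's management beans expose the same kind of numbers programmatically. The following is a small sketch using the standard java.lang.management API; it reports on the JVM it runs in, so for the application server it would have to run inside that JVM or be read over JMX. The class and method names are invented for the example.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

class GcSnapshotSketch {
    /** Heap in use as a percentage of the maximum, or -1 if no maximum is defined. */
    static double heapUsedPercent() {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        long max = heap.getMax(); // -1 when the JVM does not define a maximum
        return max <= 0 ? -1.0 : 100.0 * heap.getUsed() / max;
    }

    public static void main(String[] args) {
        System.out.printf("heap used: %.1f%%%n", heapUsedPercent());
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            // Both values are cumulative since JVM start; -1 means "not available".
            System.out.println(gc.getName() + ": " + gc.getCollectionCount()
                    + " collections, " + gc.getCollectionTime() + " ms total");
        }
    }
}
```

A heap that stays close to 100% while the collection count of the old-generation collector keeps climbing is the "constantly performing major collections" pattern described in the quote. Plenty of physical memory on the machine does not rule it out, because the JVM only uses what its maximum heap setting allows.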

Peter Johnson wrote:c) Poor database access schemes, poorly written queries or poorly planned database updates. You need to gather database statistics to track these down.



Fortunately, this is an area we are typically okay on. I have been tuning the database part of the application for some time and using monitoring tools such as innotop to monitor the system. The application has overcome many hurdles related to a design that wouldn't scale well at first.

I'm thinking there is an issue with infinite loops, and fortunately we have a later version of the module that gets us into these loops. The later version, instead of forwarding between pages, uses more pop-up modal windows, thus reducing the complexity. I'm happy to say that we are moving in the direction of heavier use of modal windows and away from all of the forwarding logic which can get so cumbersome on complex screens.
13 years ago
Not quite sure why, but most of the crashes we've been experiencing have no error log whatsoever. The only thing I have to go on, as a clue, is the last message logged. I think there are at least 2 different causes. I have a possible solution for the first cause, which I believe has to do with an occasional Struts mapping.findForward infinite loop redirect. The second probably has to do with the large excel reports we're generating with the bean, perhaps using up too many resources.

I suppose I should start by adding more log4j debug entries into the areas I suspect are problematic and perhaps even adding a logfile just for that area of the system. As for why the entire web server would be brought down, it seems like it must be a bug in JBoss AS 4.0.2, since I'd imagine the server should prevent such situations. Maybe it would be worth upgrading to at least JBoss AS 4.0.3, since we have other installations running on that successfully already. I'd like to get to the latest and greatest; but that's going to require some figuring out, as I hear it isn't exactly a trivial task.

Regarding our JMS consumers/producers and the maintenance requirements of running multiple application servers: the consumer is working a lot harder than the producer. That's why we are using JMS, so we can queue up large jobs such as the processing of hundreds of PDFs (possibly all at once at times) and large Excel reports. Currently it is a bit of a maintenance hassle because it is crashing once or more per week, interrupting the normal use of the application, so even having more application servers wouldn't be bad in comparison. If it keeps the web server from crashing and interrupting the user, it would be worth the extra work, until we can figure out exactly what is happening.

Anyone have any suggestions on how to figure out exactly what is happening aside from extra log entries? This is something I can't really figure out in the debugger because it works just fine 99% of the time.

Valery Lezhebokov wrote:

massimiliano cattaneo wrote:
The problem is that if the method throws an exception (RuntimeException), the associated transaction is rolled back and the message is also put back in the Queue (if a Queue is used as the destination of the message)



According to spec all RuntimeException (System Exceptions) should be logged, so I think it would be visible in the logs if that was the case.

I don't know this for sure, because the server often leaves no error logs around the time of the crash, it just becomes non-responsive, yet the process is still running. I will often see CPU in use, which I have found to be JMS related processes still running in the background. I don't know for sure, maybe they're unrelated; but I still want to separate the bean processes into their own application server.



I don't know the details of the architecture, but I guess that the majority of the work is done on the JMS consumer side. In this case, splitting up the job among several consumers does make sense, but before doing that you really need to be sure that it's necessary. Having everything in one JVM is often much simpler (especially from a maintenance point of view).


I'm trying to figure out a way to detect a web crash for jboss-4.0.2. The problem we're having is that the web layer just stops responding, yet the process continues to run and may even be processing JMS messages in the background; the web layer just becomes more or less useless, or too sluggish to use, until the application server is rebooted. Often this produces no logs; it just becomes hosed.

I am looking into several different approaches; but if there is one that is already proven, that would certainly be preferred. Does anyone know of something that I'm not thinking of which might work for this problem?

CURL

One I am contemplating is the use of CURL to log into the application. I haven't tried all of the options for this command, and I'm not sure that this is going to be very straightforward. When the app has crashed, CURL just sits there waiting for a response, wasting time. I need to know within a few seconds whether someone can log in.

Here's a sample of what I have tried with CURL:
curl -d "login=admin&password=test" http://localhost:8080/web/login.do
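The "need to know within a few seconds" requirement comes down to putting a hard timeout on the probe. Here is a rough Java sketch of that idea using the standard HttpURLConnection timeouts. It sends a plain GET rather than the login POST shown above, the 3000 ms limit is an arbitrary choice, and the class and method names are invented; it is a starting point, not a tested monitor.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;

class WebProbeSketch {
    /**
     * Returns true only if the URL answers with a non-5xx HTTP status.
     * timeoutMillis is applied to connecting and to reading separately,
     * so the worst case is roughly twice that value.
     */
    static boolean isResponsive(String url, int timeoutMillis) {
        HttpURLConnection conn = null;
        try {
            conn = (HttpURLConnection) URI.create(url).toURL().openConnection();
            conn.setConnectTimeout(timeoutMillis);
            conn.setReadTimeout(timeoutMillis);
            conn.setRequestMethod("GET");
            int status = conn.getResponseCode();
            return status >= 200 && status < 500;
        } catch (IOException e) {
            // Covers refused connections and SocketTimeoutException alike.
            return false;
        } finally {
            if (conn != null) {
                conn.disconnect();
            }
        }
    }

    public static void main(String[] args) {
        // Default URL is the login page from the curl sample.
        String url = args.length > 0 ? args[0] : "http://localhost:8080/web/login.do";
        boolean ok = isResponsive(url, 3000);
        System.out.println(ok ? "web layer responded" : "no response in time");
    }
}
```

The read timeout is the important part for this kind of hang: the port still accepts connections while the web layer is stuck, so a probe that only checks whether it can connect would report success.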

CURL and log monitoring

A hybrid that may allow this CURL option to work would be to login with CURL every minute or so and check the logs at the same interval. If it's been too long since the logs recorded a login, there's a problem.
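The "too long since the logs recorded a login" check can be approximated without parsing anything, by looking at when the log file was last written. The sketch below does that. It assumes the probe login is logged to a file that would otherwise go quiet when the web layer hangs; if the same file is also written by background JMS work, the modification time proves nothing and the last login line would have to be parsed instead. The path "server.log" and the two-minute limit are placeholders.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.time.Duration;
import java.time.Instant;

class LogStalenessSketch {
    /** True if the file was last modified more than maxAge ago, or cannot be read. */
    static boolean isStale(Path logFile, Duration maxAge) {
        try {
            Instant lastWrite = Files.getLastModifiedTime(logFile).toInstant();
            return Duration.between(lastWrite, Instant.now()).compareTo(maxAge) > 0;
        } catch (IOException e) {
            return true; // a missing or unreadable log counts as a problem
        }
    }

    public static void main(String[] args) {
        // Hypothetical default path; pass the real log file as the first argument.
        Path log = Paths.get(args.length > 0 ? args[0] : "server.log");
        System.out.println(isStale(log, Duration.ofMinutes(2)) ? "STALE" : "fresh");
    }
}
```

With a probe login every minute, a limit of two minutes tolerates one missed login before raising the alarm.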

Twiddle

Another that looks promising is twiddle. Perhaps there is some information in this which can tell me that the web layer isn't responding.

for example:



I would love it if the above script were all that was necessary; but it's not. It still produces the same output even if the web layer becomes hosed.

I have found that this will give me more info; but I'm not sure what to look for. There are a few other properties that can be loaded as well; but it's difficult to know which will tell me that there is a problem. Since I haven't yet gotten to where I can reliably reproduce the problem in a test environment, I have little opportunity to check the values and compare them to normal values.



13 years ago
Hello,

JMS is an area that I am relatively new to at the moment. I help maintain an application which uses JMS messages to queue up operations for background processing; but it has always "just worked" and I haven't had to change much about that, so I never got the opportunity to learn much about it other than the fact that it was a nice way of setting up queues to run processes.

Well, here lately, I'm starting to suspect that some of the larger processes we have running in our bean are crashing the server. Basically, the web layer of the server is just becoming unresponsive 1-3 times per week. The frequency of these problems only increased after I added several large excel report generating classes to our bean, which can cause some fairly CPU/memory intensive processes to be invoked. I don't know this for sure, because the server often leaves no error logs around the time of the crash, it just becomes non-responsive, yet the process is still running. I will often see CPU in use, which I have found to be JMS related processes still running in the background. I don't know for sure, maybe they're unrelated; but I still want to separate the bean processes into their own application server.

What I'd like to do is start by separating the producer/consumer. So I want one JBoss application server node to produce messages and a completely different one to consume those messages. Once I figure this out and get this working in production, I'd like to have it cluster, so I can have one or more producers and a cluster of consumers, to help with our redundancy.

I have been able to spin up 2 nodes and move our bean jar file to the other node; but I can't get it to consume the messages. The only way it will consume them is if I restart the server. Somehow it's allowed to consume when the server first starts up, but not afterwards.

I am starting to find that, even with the documentation, this isn't as simple as I had hoped. Compounding this problem is the fact that I am not familiar enough with the technology to know what to look for very effectively, or even where to start. That's why I wanted to get some feedback from the smart folks at this forum. Any words of wisdom for a greenhorn who wants to set something like this up?

I'll provide the info here that I think is necessary to understand what's going on; please let me know if anything else is needed:

Here's how we're sending the message.



Here is our jms-ds.xml file