
catching out of memory

 
Harald Kirsch
Ranch Hand
Posts: 37
Hi,

in a tcp server, I create a new thread for each connection. Should the thread throw an unchecked exception, I catch it with catch(Throwable t) and simply dismiss that thread to keep the server up and running.
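
A minimal sketch of the setup described above, assuming a plain `ServerSocket` accept loop. `PerConnectionServer`, `ConnectionHandler`, `guarded`, and `serve` are hypothetical names for illustration, not the actual server code.

```java
import java.io.IOException;
import java.net.ServerSocket;
import java.net.Socket;

// Hypothetical names throughout; this is an illustration, not the poster's server.
class PerConnectionServer {

    // Placeholder for the real per-connection work.
    interface ConnectionHandler {
        void handle(Socket socket) throws Exception;
    }

    // Wraps the handler so that nothing thrown inside it escapes the thread.
    static Runnable guarded(final Socket socket, final ConnectionHandler handler) {
        return () -> {
            try {
                handler.handle(socket);
            } catch (Throwable t) {
                // Catches Errors such as OutOfMemoryError as well as exceptions.
                System.err.println("connection failed: " + t);
            } finally {
                // Give the socket back whether or not the handler succeeded.
                try {
                    socket.close();
                } catch (IOException ignored) {
                    // Nothing useful can be done if close() itself fails.
                }
            }
        };
    }

    // One new thread per accepted connection.
    static void serve(ServerSocket server, ConnectionHandler handler) throws IOException {
        while (!server.isClosed()) {
            Socket socket = server.accept();
            new Thread(guarded(socket, handler)).start();
        }
    }
}
```

The `finally` block only releases the socket. Any state the handler shares with other threads is not repaired by it.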

During stress testing I managed to trigger an out-of-memory exception in one thread. It was properly caught, the thread dismissed (ignored, forgotten) and the server kept running properly for several hundred more connections.

Then, booom, I get a NullPointerException in one of the threads in a stack of code that runs (nearly) exactly identical for each connection. With "nearly" I mean that a hash is involved, i.e. a hashCode, i.e. dependency on memory layout. Apart from this, the piece of code that bombs out is well tested with unit tests proven with code coverage analysis to cover each line of code.

Question 1: Did anyone come across something like this where a caught out-of-memory situation correlates with strange behaviour much later?

Question 2: Is it wrong to assume that the server should recover from out-of-memory if the offending thread is dismissed properly and thereby garbage collected eventually?

Harald.
 
Stan James
(instanceof Sidekick)
Ranch Hand
Posts: 8791
Hmmm, at first this rang no bells but then I recalled I got out of memory exceptions on a particular page in my Wiki and it was able to catch and log them and keep right on going for weeks. I'm emotionally prepared for someone to tell me this is A Bad Thing and there could be corrupted memory out there that will bite me later, but I haven't seen any problems. Any chance the thread that went bad corrupted some shared (global) resource?
 
Harald Kirsch
Ranch Hand
Posts: 37
Originally posted by Stan James:
Any chance the thread that went bad corrupted some shared (global) resource?


That is what makes it so strange. Imagine code like this:
All the objects are freshly created, the same code runs and runs for so many other connections but suddenly the s in the last line is null.

Well, it is not completely as straightforward as the above code, so this will develop into a head-scratching exercise where I have to track down whatever the messed up global resource is.
 
Jim Yingst
Wanderer
Sheriff
Posts: 18671
Question 1: No, I haven't seen or heard of anything like this myself.

Question 2: No, I don't think your approach is wrong in general; I've successfully dealt with OOME in the past with no similar issues. I think it's at least worth a try to attempt recovery when you get an OOME, assuming that there is anything else for you to do after the one thread fails. (I.e. for some things there's really no point in continuing if a certain method call cannot complete; for many others though it's possible to do something else.) But, while attempting recovery is not unreasonable, I suppose it's not a complete surprise that in some situations there may be lingering bugs (whether in your code or somewhere in the JVM) which become exposed.

So, the hashCode() is in a class which inherits the method from Object, presumably depending on the memory position of the object it's called on? Is that object something that the now-dead thread had access to as well? And is it likely that the NPE is occurring on the first invocation of hashCode() on this object after the previous OOME and thread death? Can you do anything to cause this call to occur either sooner or later after the thread death, to verify that the NPE consistently occurs upon this first invocation?

Also (perhaps I should have asked this first) can you clearly see the code at the top of the stack trace? The code you show is only "like" the real thing, and it involves nothing but local variables and interned literal Strings; it's hard to imagine how code like that could suddenly fail. However if you could show a more precise replication of what's going on (specifically at the very top of the NPE stack trace), that would probably be helpful.

Good luck...
[ October 28, 2005: Message edited by: Jim Yingst ]
 
Mr. C Lamont Gilbert
Ranch Hand
Posts: 1170
Eclipse IDE Hibernate Ubuntu
Just closing the thread is not enough. What of the resources the thread was accessing at the time it crashed? Have you then checked each object accessible by that thread to ensure its state is valid?
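
To illustrate the concern, here is a small sketch of how an error in the middle of an update can leave a shared object invalid even though the failing thread is caught and dismissed. `SharedRegistry` and its `failure` hook are hypothetical and exist only to simulate an Error between two steps; real code would fail at an allocation instead.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical shared object with an invariant: count == names.size().
class SharedRegistry {
    private final List<String> names = new ArrayList<>();
    private int count = 0;

    // Two-step update. The `failure` hook stands in for an allocation that
    // might throw OutOfMemoryError between the two steps.
    synchronized void add(String name, Runnable failure) {
        names.add(name);
        failure.run(); // if this throws, count is never incremented
        count++;
    }

    // The kind of validity check one could run after dismissing a failed thread.
    synchronized boolean isConsistent() {
        return count == names.size();
    }
}
```

If the hook throws and the caller swallows the error with `catch (Throwable t)`, the lock is released normally but `isConsistent()` returns false from then on, and every later thread sees the half-finished update.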
 
Tony Morris
Ranch Hand
Posts: 1608
Originally posted by Harald Kirsch:
Hi,

in a tcp server, I create a new thread for each connection. Should the thread throw an unchecked exception, I catch it with catch(Throwable t) and simply dismiss that thread to keep the server up and running.


This is your fundamental problem. You should not be doing this. Instead, provide the ability for other threads to register as a callback when your Runnable.run() method throws an exception - or something like that at least - I even think the core API provides something of that nature these days.
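
The core API facility that fits this description is probably `Thread.UncaughtExceptionHandler`, available since Java 5. A minimal sketch, assuming that is what was meant; `UncaughtDemo` and `runAndCapture` are hypothetical names.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical helper showing Thread.setUncaughtExceptionHandler in use.
class UncaughtDemo {

    // Runs the task on a new thread and returns whatever escaped run(),
    // or null if the task finished normally.
    static Throwable runAndCapture(Runnable task) {
        final AtomicReference<Throwable> seen = new AtomicReference<>();
        Thread worker = new Thread(task, "connection-worker");
        // The JVM calls the handler on the dying thread when run()
        // terminates because of an uncaught Throwable.
        worker.setUncaughtExceptionHandler((thread, error) -> seen.set(error));
        worker.start();
        try {
            worker.join(); // wait until the worker, and its handler, are done
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt status
        }
        return seen.get();
    }
}
```

A real server would log the error or notify a supervisor in the handler rather than just store it. The handler replaces the `catch (Throwable t)` inside `run()`, but it does not by itself make recovery from an `OutOfMemoryError` any safer.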


During stress testing I managed to trigger an out-of-memory exception in one thread. It was properly caught, the thread dismissed (ignored, forgotten) and the server kept running properly for several hundred more connections.

Assuming you mean java.lang.OutOfMemoryError, this occurs when a malloc (memory allocation) fails. This is not the same as exhausting memory. For example, I could attempt to allocate 43 gazillion bytes of memory, fail (OutOfMemoryError), recover, then go on my merry way with whatever memory I do indeed have. Unfortunately, the API Specification makes it pretty clear (in its roundabout kind of way) that this kind of behaviour is not preferred. Of course, conceding to the API Specification is a blind man's sport, so choose your path.
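
A rough sketch of that "one oversized request fails, smaller ones still succeed" scenario. `BigAllocation` and `tryAllocate` are hypothetical names, and the behaviour assumes a typical JVM whose heap or array-size limit cannot accommodate an array of `Integer.MAX_VALUE` longs (about 16 GB).

```java
// Hypothetical helper: attempts a single array allocation and reports the outcome.
class BigAllocation {

    // Returns true if an array of `length` longs could be allocated,
    // false if the JVM answered with OutOfMemoryError.
    static boolean tryAllocate(int length) {
        try {
            long[] block = new long[length];
            return block.length == length;
        } catch (OutOfMemoryError e) {
            // Only this one request failed; the heap that was free before
            // the attempt is still available afterwards.
            return false;
        }
    }
}
```

On such a JVM, `tryAllocate(Integer.MAX_VALUE)` is expected to fail while a following `tryAllocate(1024)` succeeds. This is the benign case; an OOME caused by many small allocations adding up is a different situation.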

Then, booom, I get a NullPointerException in one of the threads in a stack of code that runs (nearly) exactly identical for each connection. With "nearly" I mean that a hash is involved, i.e. a hashCode, i.e. dependency on memory layout. Apart from this, the piece of code that bombs out is well tested with unit tests proven with code coverage analysis to cover each line of code.

Code coverage analysis doesn't prove anything, other than that every line of code executed (depending on your tool, that every VM instruction executed). It is important to realise that this does not prove anything. A lot of people believe that they have a complete requirement specification when they achieve 100% code coverage. They certainly have a better requirement specification than most software that I get to see, but it's certainly not complete. A complete requirement specification achieves 100% system state coverage. I won't go on about it any more, other than to ask out of curiosity, have you noticed all of the design flaws (requirement defects) within the J2SE API Specification and the language itself as a result of your use of code coverage? I ask because I find that it is an excellent method to derive such information.
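
A tiny illustration of the gap between line coverage and state coverage. `Coverage.average` is a hypothetical method made up for this example.

```java
// Hypothetical example: full line coverage, incomplete state coverage.
class Coverage {

    // A single statement, so any one call executes 100% of the lines.
    static int average(int total, int count) {
        return total / count; // throws ArithmeticException when count == 0
    }
}
```

A single test calling `average(10, 2)` reports full line coverage, yet the input state `count == 0` was never exercised and still fails.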

Oh, by the way, I think your NullPointerException is because at this point, you have indeed exhausted memory.


Question 1: Did anyone come across something like this where a caught out-of-memory situation correlates with strange behaviour much later?

Question 2: Is it wrong to assume that the server should recover from out-of-memory if the offending thread is dismissed properly and thereby garbage collected eventually?

Harald.

[ October 28, 2005: Message edited by: Tony Morris ]
 
Harald Kirsch
Ranch Hand
Posts: 37
Thanks for all the answers. It seems I am really looking at something weird here. Just to keep you calm:-)

Yes, trying to continue on the OOME is reasonable since the threads are just independently handling socket connections.

Yes, I know that resources need to be given back before the offending thread is really dismissed. But even if I don't, the NPE should not happen.

And yes, I know that 100% code coverage still leaves room for bugs. I just wanted to describe that I tried my best not to bother anyone with an NPE for trivial reasons.

Have you noticed all of the design flaws (requirement defects) within the J2SE API Specification and the language itself as a result of your use of code coverage?


Couldn't say so. I come across things I find bizarre (buffers in nio come to mind) once in a while, but this is not exactly related to the unit test/coverage analysis.

Finally: The reason for the initial OOME was a bug anyway. After that was fixed, I got no more OOME and no more NPE. It leaves a stale taste and may bite me in the future, but I currently don't have the time for hunting down this bug.
 