We've been running load tests on our machines and have noticed some very baffling behavior. If we load the machines so that the server runs near CPU capacity (80+% CPU utilization for extended periods), Tomcat will not expire old sessions. We can run the load generators against the machine until it starts thrashing on Full Garbage Collection activity (i.e., a Full GC takes 20+ seconds and happens every 22-25 seconds, so we only get 2-5 seconds of 'run time' between GCs).
What we've found is that the GC activity is caused by sessions filling up the heap. If we pause the load generators for 2-3 minutes, the sessions go away as expected; we can then restart the load generators and everything runs smoothly until more sessions build up.
Our environment is clustered, so the problem is even more significant: each server essentially holds copies of the sessions from every other server in the cluster. QA has 2 members in the cluster and production has 4, so the problem becomes very significant when we get to prod!
Even going into the "/manager/html" webapp and manually clicking the "Expire" button (for sessions with >= 1 min idle time) doesn't seem to affect the session expiration when the machine is under load. But as soon as we remove the load, the machine begins to recover and expire the sessions.
FYI, for testing purposes, I've set our session timeouts in the 'web.xml' file to 1 minute, but we actually control the session duration in code and have it set to 180 seconds in the code configuration. (When I examine the sessions - they seem to have the correct expiration times - 3 mins - and they even show that they have been idle for "20+" mins.)
We're wondering if there is a setting in Tomcat to tell it to alter the priority of session expiration, so that it will take precedence over the page requests.
In case it matters, we're using Tapestry 5.5 for our UI, and we store nothing in the session if you aren't authenticated, but about 10K of data if you are. Our load generation runs 200 threads fetching a plain, unauthenticated page, and 40 threads performing registration and login actions, which store about 10K per iteration... We're running with -Xmx6g and -Xms2g and a 1g PermSize (although we've never really gone more than about 600KB in Perm space used).
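For a rough sense of scale, here is a back-of-the-envelope sketch of how much heap the authenticated sessions could occupy. The 10K per session, the 180-second timeout, the 20-minute idle time and the cluster sizes come from the description above; the login rate of 100 per second per node is purely a hypothetical placeholder (I haven't measured ours), and the class and method names are made up for illustration. It also ignores per-session overhead and everything else on the heap.

```java
/** Rough estimate of heap held by replicated sessions; all inputs are assumptions. */
public class SessionMemoryEstimate {

    /**
     * Bytes of session data one node holds if every node replicates every session.
     *
     * @param loginsPerSecondPerNode hypothetical rate of new authenticated sessions per node
     * @param clusterMembers         number of nodes whose sessions are copied to each node
     * @param retentionSeconds       how long a session actually stays in memory
     * @param bytesPerSession        payload stored per authenticated session
     */
    static long steadyStateBytes(double loginsPerSecondPerNode, int clusterMembers,
                                 long retentionSeconds, long bytesPerSession) {
        double liveSessions = loginsPerSecondPerNode * clusterMembers * retentionSeconds;
        return Math.round(liveSessions * bytesPerSession);
    }

    public static void main(String[] args) {
        long perSession = 10L * 1024L;   // "about 10K" per authenticated session
        double assumedRate = 100.0;      // hypothetical logins/sec per node, not measured

        for (int members : new int[] {2, 4}) {          // QA and production cluster sizes
            for (long retention : new long[] {180L, 1200L}) { // timely expiry vs. 20 min idle
                long bytes = steadyStateBytes(assumedRate, members, retention, perSession);
                System.out.printf("members=%d retention=%ds -> %.1f MiB%n",
                        members, retention, bytes / (1024.0 * 1024.0));
            }
        }
    }
}
```

The point is only that the retained volume grows linearly with how long expired sessions linger, so sessions sitting idle for 20 minutes instead of 3 cost several times the memory.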
Our site serves several million pages a day and handles a substantial number of registrations each day (more than 10K), which is why we're evaluating load conditions - to make sure our production equipment is sized correctly...
This sounds related to the following bug, which is supposed to be fixed as of Tomcat 5.5.21.
37356: Ensure sessions time out correctly. This has been fixed by removing the accessCount feature by default. This feature prevents the session from timing out whilst requests that last longer than the session time out are being processed. This feature is enabled by setting the Java option -Dorg.apache.catalina.STRICT_SERVLET_COMPLIANCE=true. The feature is now implemented with synchronization which addresses the thread safety issues associated with the original bug report.
OK, adding the strict servlet compliance option (-Dorg.apache.catalina.STRICT_SERVLET_COMPLIANCE=true) doesn't help...
After reading the bug details, this doesn't look like the same problem I'm experiencing. The bug says that sessions just aren't expiring (ever). But my problem is that sessions don't expire while the machine is under load - as soon as I remove the load generators, the sessions start expiring properly (without any other action on my part).
So my problem seems to be that session expiration runs at a lower priority than the servicing of incoming page requests (Tapestry pages).
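To make that theory concrete, here is a toy model of a periodic expiry sweep. As far as I understand it, Tomcat expires sessions from a background thread that wakes up periodically (the backgroundProcessorDelay setting appears to govern the interval) and walks the session list, but this sketch is not Tomcat's code: the class, the FakeSession type and the method names are all hypothetical. It only illustrates that a session past its timeout stays in memory until a sweep actually gets to run.

```java
import java.util.Iterator;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Toy model of a periodic session-expiry sweep; not Tomcat's actual implementation. */
public class SessionSweepSketch {

    /** Hypothetical stand-in for a session: only the fields the sweep needs. */
    static final class FakeSession {
        final long lastAccessedMillis;
        final int maxInactiveSeconds;

        FakeSession(long lastAccessedMillis, int maxInactiveSeconds) {
            this.lastAccessedMillis = lastAccessedMillis;
            this.maxInactiveSeconds = maxInactiveSeconds;
        }

        /** A session is eligible for expiry once its idle time reaches the timeout. */
        boolean isExpired(long nowMillis) {
            return (nowMillis - lastAccessedMillis) / 1000L >= maxInactiveSeconds;
        }
    }

    final Map<String, FakeSession> sessions = new ConcurrentHashMap<>();

    /** One sweep: drop every session idle past its timeout and return how many were dropped. */
    int sweep(long nowMillis) {
        int removed = 0;
        Iterator<FakeSession> it = sessions.values().iterator();
        while (it.hasNext()) {
            if (it.next().isExpired(nowMillis)) {
                it.remove();   // memory is only reclaimable after this point
                removed++;
            }
        }
        return removed;
    }

    public static void main(String[] args) {
        SessionSweepSketch manager = new SessionSweepSketch();
        long now = 30L * 60L * 1000L; // pretend the clock reads 30 minutes

        // 180-second timeout, as configured in our code.
        manager.sessions.put("idle-20-min", new FakeSession(now - 20L * 60L * 1000L, 180));
        manager.sessions.put("idle-1-min", new FakeSession(now - 60L * 1000L, 180));

        int removed = manager.sweep(now);
        System.out.println("removed=" + removed + " remaining=" + manager.sessions.size());
    }
}
```

If the thread that calls sweep() is starved of CPU or stalled behind back-to-back Full GCs, the "idle-20-min" kind of session would pile up exactly as I'm seeing, even though each one is individually eligible for expiry.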
The requests are coming over an AJP connection (if that makes a difference).
The ultimate question is: how do I increase the priority of session expiration? Even if it slows the machine down a little, that would be better than letting sessions build up to a crash/hang.
My goal is to be able to run our load generators at full speed indefinitely, and for our Tomcat server to handle the high load conditions without failure. Stopping the traffic would allow Tomcat to recover, but I'm looking for a way to get Tomcat to properly clear the sessions that are already expired, so that they don't continue to eat up memory.
After I figure that out, then I will adjust the memory on the box to handle the max sustained load (plus some headroom) we expect to encounter on our production environment, and tune our session durations to work well for both our customers and our machine capacity.
Worst case scenario, we will implement something akin to your suggestion - where our external load balancers shift load away from each Tomcat in our cluster for some time-slice out of each hour, so that it has time to recover the session memory. But that would be a nasty "kludge" solution to something that Tomcat should handle correctly in the first place.