Long GC pauses are undesirable for applications. They affect your SLAs, result in poor customer experiences, and can cause severe damage to mission-critical applications. In this article, I lay out the key reasons for long GC pauses and potential solutions for each.
1. High Object Creation Rate
If your application’s object creation rate is very high, the garbage collection rate must also be very high to keep up with it. A high garbage collection rate increases the GC pause time as well. Thus, optimizing the application to create fewer objects is THE EFFECTIVE strategy for reducing long GC pauses. This might be a time-consuming exercise, but it is 100% worth doing. To optimize the object creation rate in the application, consider using Java profilers such as JProfiler, YourKit, or JVisualVM. These profilers will report:
• Which objects are being created?
• What is the rate at which these objects are created?
• What is the amount of space they are occupying in memory?
• Who is creating them?
Always try to optimize the objects that occupy the most memory. Go after the big fish in the pond.
Upload your GC log to GCeasy, the universal garbage collection log analyzer tool. The ‘Avg creation rate’ field in the ‘Object Stats’ section of its report shows the object creation rate. Strive to keep this value as low as possible. See the image (an excerpt from a GCeasy-generated report), which shows an ‘Avg creation rate’ of 8.83 MB/sec.
Tit-bit: How to figure out object creation rate?
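A rough way to estimate the object creation rate by hand is to look at young generation occupancy across consecutive GC events: whatever appeared in the young generation between the end of one collection and the start of the next was allocated by the application in that interval. The sketch below assumes you have already extracted, for each young GC event, a timestamp and the young generation occupancy before and after the collection. The class and field names are hypothetical, the sample values are made up, and this is not necessarily how GCeasy computes its figure. It also ignores objects allocated directly in the old generation.

```java
import java.util.Arrays;
import java.util.List;

/** Rough estimate of object creation rate from young generation occupancy. */
final class AllocationRateEstimator {

    /** One young GC event; the field names are illustrative, not a GC log format. */
    static final class GcEvent {
        final double timestampSec;
        final long youngBeforeBytes;
        final long youngAfterBytes;

        GcEvent(double timestampSec, long youngBeforeBytes, long youngAfterBytes) {
            this.timestampSec = timestampSec;
            this.youngBeforeBytes = youngBeforeBytes;
            this.youngAfterBytes = youngAfterBytes;
        }
    }

    /**
     * Bytes allocated between two consecutive collections is approximately the
     * occupancy just before the later GC minus the occupancy just after the earlier one.
     */
    static double averageBytesPerSecond(List<GcEvent> events) {
        if (events.size() < 2) {
            throw new IllegalArgumentException("need at least two GC events");
        }
        long allocated = 0;
        for (int i = 1; i < events.size(); i++) {
            allocated += events.get(i).youngBeforeBytes - events.get(i - 1).youngAfterBytes;
        }
        double elapsedSec = events.get(events.size() - 1).timestampSec - events.get(0).timestampSec;
        if (elapsedSec <= 0) {
            throw new IllegalArgumentException("events must span a positive time interval");
        }
        return allocated / elapsedSec;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        // Hypothetical values, not taken from a real GC log.
        List<GcEvent> events = Arrays.asList(
            new GcEvent(0.0, 100 * mb, 10 * mb),
            new GcEvent(10.0, 98 * mb, 12 * mb),
            new GcEvent(20.0, 100 * mb, 11 * mb));
        System.out.println(averageBytesPerSecond(events) / mb + " MB/sec");
    }
}
```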
2. Undersized Young Generation
When the young generation is undersized, objects are prematurely promoted to the old generation. Collecting garbage from the old generation takes more time than collecting it from the young generation. Thus, increasing the young generation size has the potential to reduce long GC pauses. The young generation size can be increased by setting either of these two JVM arguments:
-Xmn: specifies the size of the young generation
-XX:NewRatio: specifies the ratio between the old and young generations. For example, setting -XX:NewRatio=3 means that the ratio between the old and young generations is 3:1, i.e. the young generation will be one fourth of the overall heap. If the heap size is 2 GB, the young generation size would be 0.5 GB.
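The NewRatio arithmetic above can be written as a small helper: with a ratio of old:young = N:1, the young generation gets 1/(N+1) of the heap. This is a minimal sketch with hypothetical class and method names. The JVM may round or align the actual sizes, so treat the result as an approximation.

```java
/** Approximate young generation size implied by -XX:NewRatio. */
final class YoungGenSizing {

    /** With old:young = newRatio:1, the young generation is heap / (newRatio + 1). */
    static long youngGenBytes(long heapBytes, int newRatio) {
        if (newRatio < 1) {
            throw new IllegalArgumentException("newRatio must be at least 1");
        }
        return heapBytes / (newRatio + 1);
    }

    public static void main(String[] args) {
        long gb = 1024L * 1024L * 1024L;
        long young = youngGenBytes(2 * gb, 3);
        System.out.println("Young generation: " + (young / (1024 * 1024)) + " MB");
    }
}
```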
3. Choice of GC Algorithm
The choice of GC algorithm has a major influence on GC pause time. If you are a GC expert, intend to become one, or have a GC expert on your team, you can tune GC settings to obtain optimal GC pause times. If you don’t have GC expertise, I would recommend the G1 GC algorithm because of its auto-tuning capability. In G1 GC, you can set the GC pause time goal using the JVM argument ‘-XX:MaxGCPauseMillis’. Example:
-XX:+UseG1GC -XX:MaxGCPauseMillis=200
As per the above example, the maximum GC pause time is set to 200 milliseconds. This is a soft goal, which the JVM will try its best to meet. If you are already using the G1 GC algorithm and still experience high pause times, refer to this article.
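Before tuning, it can help to confirm which collector the running JVM actually selected. A minimal sketch using the standard java.lang.management API is shown below; the class name is hypothetical. On HotSpot with G1 enabled, the reported names typically include "G1 Young Generation" and "G1 Old Generation", but the exact names depend on the JVM vendor and version.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;

/** Lists the garbage collectors registered in the running JVM. */
final class ActiveCollectors {

    /** Names of the collectors as reported by the JVM's management beans. */
    static List<String> collectorNames() {
        List<String> names = new ArrayList<>();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            names.add(gc.getName());
        }
        return names;
    }

    public static void main(String[] args) {
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            // Collection count and accumulated collection time (ms) since JVM start.
            System.out.println(gc.getName()
                + ": count=" + gc.getCollectionCount()
                + ", totalTimeMs=" + gc.getCollectionTime());
        }
    }
}
```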
4. Process Swapping
Sometimes, due to a lack of memory (RAM), the operating system may swap your application out of memory. Swapping is very expensive because it requires disk accesses, which are much slower than physical memory accesses. In my humble opinion, no serious application in a production environment should be swapping. When a process swaps, GC takes a long time to complete.
Below is the script obtained from Stack Overflow (thanks to the author) which, when executed, shows all the processes that are being swapped. Please make sure your process is not getting swapped.
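As a rough alternative sketch of the same idea in Java (this is not the Stack Overflow script itself), the code below checks a single process rather than all of them. It assumes a Linux host where /proc/<pid>/status contains a VmSwap line reporting swapped-out memory in kB. The class and method names are hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.OptionalLong;

/** Reports how much of a process is swapped out, based on /proc/<pid>/status (Linux only). */
final class SwapCheck {

    /** Extracts the VmSwap value in kB from the lines of a /proc/<pid>/status file. */
    static OptionalLong parseVmSwapKb(List<String> statusLines) {
        for (String line : statusLines) {
            if (line.startsWith("VmSwap:")) {
                // Typical form: "VmSwap:       128 kB"
                String[] parts = line.substring("VmSwap:".length()).trim().split("\\s+");
                return OptionalLong.of(Long.parseLong(parts[0]));
            }
        }
        return OptionalLong.empty();
    }

    public static void main(String[] args) throws IOException {
        // Pass a PID as the first argument, or omit it to inspect this JVM itself.
        String pid = args.length > 0 ? args[0] : "self";
        Path status = Paths.get("/proc", pid, "status");
        if (!Files.isReadable(status)) {
            System.out.println(status + " is not readable (not Linux, or no such process)");
            return;
        }
        OptionalLong swapKb = parseVmSwapKb(Files.readAllLines(status));
        if (swapKb.isPresent()) {
            System.out.println("PID " + pid + " swapped: " + swapKb.getAsLong() + " kB");
        } else {
            System.out.println("No VmSwap entry found for PID " + pid);
        }
    }
}
```

Any value above 0 kB for your Java process is worth investigating.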
If you find that your process is swapping, do one of the following:
a. Allocate more RAM to the server
b. Reduce the number of processes that are running on the server, to free up memory (RAM).
c. Reduce the heap size of your application (which I wouldn’t recommend, as it can cause other side effects).
5. Fewer GC Threads
For every GC event reported in the GC log, the user, sys, and real times are printed. Example:
To understand the difference between these times, please read the article. (I highly encourage you to read it before continuing with this section.) If you consistently notice in the GC events that ‘real’ time isn’t significantly less than ‘user’ time, it might indicate that there aren’t enough GC threads. Consider increasing the GC thread count. Suppose ‘user’ time is 25 seconds and you have configured the GC thread count to be 5; then the real time should be close to 5 seconds (because 25 seconds / 5 threads = 5 seconds).
WARNING: Adding too many GC threads will consume a lot of CPU and take resources away from your application. Thus, you need to conduct thorough testing before increasing the GC thread count.
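The thread arithmetic from this section can be sketched as two small helpers: one for the ideal ‘real’ time given a thread count, and one for the parallelism a GC event actually achieved (‘user’ divided by ‘real’). The class and method names are hypothetical, and the model assumes GC work is spread evenly across threads, which real collections only approximate. On HotSpot, the number of stop-the-world GC worker threads is normally set with -XX:ParallelGCThreads.

```java
/** Back-of-the-envelope checks for GC thread utilization. */
final class GcThreadCheck {

    /** Ideal wall-clock time if the GC work were spread evenly over all GC threads. */
    static double idealRealSeconds(double userSeconds, int gcThreads) {
        if (gcThreads < 1) {
            throw new IllegalArgumentException("gcThreads must be at least 1");
        }
        return userSeconds / gcThreads;
    }

    /** Approximate number of threads that were effectively busy during the event. */
    static double effectiveParallelism(double userSeconds, double realSeconds) {
        if (realSeconds <= 0) {
            throw new IllegalArgumentException("realSeconds must be positive");
        }
        return userSeconds / realSeconds;
    }

    public static void main(String[] args) {
        // The example from the text: 25 seconds of 'user' time across 5 GC threads.
        System.out.println("Ideal real time: " + idealRealSeconds(25.0, 5) + " s");
        // Hypothetical event where 'real' is close to 'user', hinting at too few threads.
        System.out.println("Effective parallelism: " + effectiveParallelism(25.0, 20.0));
    }
}
```

An effective parallelism close to 1 means the collection ran almost single-threaded.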
6. Background I/O Traffic
If there is heavy file system I/O activity (i.e. a lot of reads and writes are happening), it can also cause long GC pauses. This heavy file system I/O activity may not be caused by your application. It may be caused by another process running on the same server, yet it can still cause your application to suffer from long GC pauses. Here is a brilliant article from LinkedIn engineers, which walks through this problem in detail.
When there is heavy I/O activity, you will notice the ‘real’ time to be significantly more than the ‘user’ time. Example:
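To spot this pattern across a whole GC log, a small sketch like the following can scan each line and flag events whose ‘real’ time is well above the CPU time. It assumes the JDK 8-style suffix "[Times: user=... sys=..., real=... secs]"; newer JDKs with unified logging print these values differently. It compares ‘real’ against ‘user’ plus ‘sys’, a slightly stricter variant of the comparison described above. The class name, the threshold factor, and the sample lines are all hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Flags GC log lines where wall-clock time far exceeds CPU time. */
final class GcTimesCheck {

    // Matches the JDK 8-style suffix, e.g. "[Times: user=0.20 sys=0.01, real=18.45 secs]".
    private static final Pattern TIMES = Pattern.compile(
        "\\[Times: user=([0-9.]+) sys=([0-9.]+), real=([0-9.]+) secs\\]");

    /**
     * Returns true when 'real' exceeds (user + sys) by more than the given factor.
     * Lines without a Times section are never flagged.
     */
    static boolean realExceedsCpu(String logLine, double factor) {
        Matcher m = TIMES.matcher(logLine);
        if (!m.find()) {
            return false;
        }
        double user = Double.parseDouble(m.group(1));
        double sys = Double.parseDouble(m.group(2));
        double real = Double.parseDouble(m.group(3));
        return real > factor * (user + sys);
    }

    public static void main(String[] args) {
        // Hypothetical log fragments, not taken from a real GC log.
        String suspicious = "[Times: user=0.20 sys=0.01, real=18.45 secs]";
        String healthy = "[Times: user=25.56 sys=0.35, real=5.20 secs]";
        System.out.println("suspicious flagged: " + realExceedsCpu(suspicious, 2.0));
        System.out.println("healthy flagged: " + realExceedsCpu(healthy, 2.0));
    }
}
```

A flagged event is only a hint: swapping or CPU starvation can produce the same signature, so check I/O and memory statistics for the same time window before drawing conclusions.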