Even unpredictable weather is now forecast. But after all these technological advancements, can we forecast our application's performance and availability? Can we forecast even the next 20 minutes? Can you say whether, in the next 20 minutes, your application is going to experience an OutOfMemoryError, CPU spikes, or crashes? Most likely not. That is because we focus only on macro-metrics:
These are great metrics, but they can’t act as lead indicators to forecast the performance/availability characteristics of your application. Now let’s discuss a few micrometrics that can forecast your application’s performance/availability characteristics.
Fig: You can notice repeated full GCs triggered (graph from GCeasy.io)
Let’s start this discussion with an example. This application experienced an OutOfMemoryError. Look at the heap usage graph (generated by parsing garbage collection logs). You can see heap usage climbing higher and higher despite full GCs running repeatedly. The application experienced the OutOfMemoryError around 10:00am, whereas the repeated full GCs started right around 08:00am. From 08:00am until 10:00am, the application was doing nothing but repeated full GCs. If the DevOps team had monitored garbage collection activity, they could have forecast the OutOfMemoryError a couple of hours in advance.
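The pattern in this example (many full GCs in a short span, with heap usage not coming down) can be checked mechanically. Below is a minimal sketch, assuming the full GC events have already been extracted from the GC log; the `FullGcEvent` type, the function name, and the default window and event-count thresholds are illustrative assumptions, not values from the post.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FullGcEvent:
    timestamp_sec: float   # seconds since JVM start (assumed field)
    heap_after_mb: float   # heap occupancy right after the full GC (assumed field)


def repeated_full_gc_alert(events: List[FullGcEvent],
                           window_sec: float = 600.0,
                           min_events: int = 5) -> bool:
    """Return True if at least `min_events` full GCs fall within one
    `window_sec` window and the heap after GC did not shrink between
    the first and last event of that window."""
    ordered = sorted(events, key=lambda e: e.timestamp_sec)
    start = 0
    for end in range(len(ordered)):
        # Slide the window start forward until it spans at most window_sec.
        while ordered[end].timestamp_sec - ordered[start].timestamp_sec > window_sec:
            start += 1
        count = end - start + 1
        not_reclaiming = ordered[end].heap_after_mb >= ordered[start].heap_after_mb
        if count >= min_events and not_reclaiming:
            return True
    return False
```

A check like this could run every few minutes against the latest GC log and raise an alert long before the OutOfMemoryError itself.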
Memory related micrometrics
There are 4 memory/garbage collection related micrometrics that you can monitor:
Garbage collection Throughput
Garbage collection Pause time
Object creation rate
Peak heap size
Let’s discuss them in this section.
# 1. GARBAGE COLLECTION THROUGHPUT
Garbage collection throughput is the percentage of time the application spends processing customer transactions, as opposed to the time it spends doing garbage collection.
Let’s say your application has been running for 60 minutes. Of these 60 minutes, 2 minutes are spent on GC activities.
It means the application has spent 3.33% of its time on GC activities (i.e. (2 / 60) * 100).
It means the garbage collection throughput is 96.67% (i.e. 100 - 3.33).
When GC throughput degrades, it’s an indication that some sort of memory problem is brewing in the application.
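The arithmetic above can be written as a small helper. This is only a sketch; the function name is an assumption, and the inputs are the total runtime and the cumulative GC time in the same unit.

```python
def gc_throughput_percent(total_runtime_sec: float, gc_time_sec: float) -> float:
    """Percentage of runtime NOT spent in garbage collection."""
    if total_runtime_sec <= 0:
        raise ValueError("total runtime must be positive")
    if not 0 <= gc_time_sec <= total_runtime_sec:
        raise ValueError("GC time must be between 0 and the total runtime")
    return 100.0 * (1.0 - gc_time_sec / total_runtime_sec)


# The example from the text: 60 minutes of runtime, 2 minutes of GC.
throughput = gc_throughput_percent(60 * 60, 2 * 60)
print(f"{throughput:.2f}%")  # prints 96.67%
```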
# 2. GARBAGE COLLECTION LATENCY
Fig: GC Throughput & GC Latency micrometric
When certain phases of a garbage collection event run, the entire application pauses. This pause is what is referred to as latency. Some garbage collection events might take a few milliseconds, whereas others can take several seconds to minutes. You need to monitor GC pause times. If GC pause times start to go higher, it will impact the user’s experience.
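One way to watch pause times is to summarize them per reporting interval and compare the worst pause against a budget. The sketch below assumes the pause durations (in milliseconds) have already been extracted from the GC log; the function names and the 500 ms default budget are illustrative assumptions.

```python
import math
from typing import Dict, List


def pause_summary(pauses_ms: List[float]) -> Dict[str, float]:
    """Average, 95th percentile (nearest-rank) and maximum GC pause."""
    if not pauses_ms:
        raise ValueError("no pause samples")
    ordered = sorted(pauses_ms)
    rank = math.ceil(0.95 * len(ordered))  # nearest-rank percentile, 1-based
    return {
        "avg_ms": sum(ordered) / len(ordered),
        "p95_ms": ordered[rank - 1],
        "max_ms": ordered[-1],
    }


def pause_budget_exceeded(pauses_ms: List[float], max_allowed_ms: float = 500.0) -> bool:
    """True if the longest pause is above the allowed budget."""
    return pause_summary(pauses_ms)["max_ms"] > max_allowed_ms
```

Tracking the maximum alongside the average matters here, because a single multi-second pause can hide behind a healthy-looking average.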
# 3. OBJECT CREATION RATE
Fig: Object creation rate micrometric
Object creation rate is the average amount of objects created by your application per unit of time. Suppose your application was creating 100 MB/sec, and recently it started to create 150 MB/sec without any increase in traffic volume. That is an indication of some problem brewing in the application. This additional object creation has the potential to trigger more GC activity, increase CPU consumption, and degrade response time.
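Object creation rate can be approximated from GC log data: whatever the heap grew by between the end of one GC and the start of the next was allocated in that interval. The sketch below assumes each GC event is available as a (timestamp, heap before GC, heap after GC) tuple; the tuple layout and function name are assumptions for illustration.

```python
from typing import List, Tuple

# (timestamp_sec, heap_before_gc_mb, heap_after_gc_mb) for each GC event
GcEvent = Tuple[float, float, float]


def object_creation_rate_mb_per_sec(events: List[GcEvent]) -> float:
    """Approximate allocation rate over the span covered by the GC events."""
    if len(events) < 2:
        raise ValueError("need at least two GC events")
    ordered = sorted(events)
    allocated_mb = 0.0
    for (_, _, prev_after), (_, cur_before, _) in zip(ordered, ordered[1:]):
        # Growth between the previous GC's end and this GC's start.
        allocated_mb += max(0.0, cur_before - prev_after)
    elapsed_sec = ordered[-1][0] - ordered[0][0]
    if elapsed_sec <= 0:
        raise ValueError("GC events must span a positive time interval")
    return allocated_mb / elapsed_sec
```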
You can use this same metric in your CI/CD pipeline as well, to measure the quality of a code commit. Say that as of your previous code commit your application was creating 50 MB/sec. If, starting from a recent code commit, your application creates 75 MB/sec for the same amount of traffic volume, then it’s an indication that some inefficient code was committed to your repository.
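A CI/CD gate built on this idea could look like the sketch below. It assumes the pipeline already measures the object creation rate of the previous and the current build under the same traffic volume; the function name and the 10% default tolerance are hypothetical.

```python
def creation_rate_regressed(baseline_mb_per_sec: float,
                            current_mb_per_sec: float,
                            tolerance_percent: float = 10.0) -> bool:
    """True if the current build allocates more than `tolerance_percent`
    above the baseline build for the same traffic volume."""
    if baseline_mb_per_sec <= 0:
        raise ValueError("baseline rate must be positive")
    increase_percent = 100.0 * (current_mb_per_sec - baseline_mb_per_sec) / baseline_mb_per_sec
    return increase_percent > tolerance_percent


# The example from the text: 50 MB/sec before, 75 MB/sec after (a 50% increase).
print(creation_rate_regressed(50.0, 75.0))  # prints True
```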
# 4. PEAK HEAP SIZE
Fig: Peak Heap size micrometric
Peak heap size is the maximum amount of memory consumed by your application. If peak heap size goes beyond a limit, you must investigate it. Maybe there is a memory leak in the application, or newly introduced code (or 3rd-party libraries/frameworks) is consuming a lot of memory.
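The limit check described above is simple to automate. A minimal sketch, assuming heap usage samples (in MB) are already available from the GC log and that the limit is chosen by the team; both names are illustrative.

```python
from typing import Iterable


def peak_heap_exceeds_limit(heap_samples_mb: Iterable[float], limit_mb: float) -> bool:
    """True if the largest observed heap usage is above the agreed limit."""
    samples = list(heap_samples_mb)
    if not samples:
        raise ValueError("no heap samples")
    return max(samples) > limit_mb
```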
How to generate memory related micrometrics?
All the memory related micrometrics can be sourced from garbage collection logs.
(1). You can enable garbage collection logs by passing the following JVM arguments:
(2). Once garbage collection logs are generated, you can either manually upload the GC logs to a GC log analysis tool such as GCeasy.io or use its programmatic REST API. The REST API is useful when you want to automate the report generation process. It can be used in a CI/CD pipeline as well.
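For step (1), here is a minimal sketch that builds the GC logging arguments, assuming a HotSpot JVM. `-XX:+PrintGCDetails`, `-XX:+PrintGCDateStamps` and `-Xloggc` are the standard flags up to Java 8, and unified logging (`-Xlog:gc*`) replaces them from Java 9 onward. The helper name and the log path are hypothetical.

```python
from typing import List


def gc_logging_flags(java_major_version: int, log_path: str) -> List[str]:
    """Return HotSpot JVM arguments that write GC activity to `log_path`."""
    if java_major_version >= 9:
        # Unified JVM logging, available from Java 9.
        return [f"-Xlog:gc*:file={log_path}"]
    # Legacy GC logging flags for Java 8 and earlier.
    return ["-XX:+PrintGCDetails", "-XX:+PrintGCDateStamps", f"-Xloggc:{log_path}"]


# Hypothetical launch command for a Java 11 application.
print(" ".join(["java"] + gc_logging_flags(11, "/tmp/gc.log") + ["-jar", "app.jar"]))
```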
EXAMPLE 2
A few hours after launch, a major financial application started to experience ‘OutOfMemoryError: unable to create new native thread’. The application had turned ON a new feature in its JDBC (Java Database Connectivity) driver. Apparently this feature had a bug, due to which the JDBC driver started to spawn new threads repeatedly (instead of re-using the same threads). Thus, within a short time, the application started to experience ‘OutOfMemoryError: unable to create new native thread’. If the team had monitored thread count and thread states, they could have caught the problem early and prevented the outage. Here are the actual thread dumps captured from the application. You can see the RUNNABLE state thread count growing with each successive thread dump.
Fig: Growing RUNNABLE state Thread count (graph from fastThread.io)
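Thread state counts can also be tracked without a graphing tool. The sketch below assumes thread dumps in the usual HotSpot text format (as produced by `jstack`), where each thread's stack is preceded by a `java.lang.Thread.State: <STATE>` line; the function names are illustrative.

```python
import re
from collections import Counter
from typing import List

# Matches lines such as "   java.lang.Thread.State: TIMED_WAITING (parking)".
_STATE_RE = re.compile(r"java\.lang\.Thread\.State:\s+([A-Z_]+)")


def thread_state_counts(dump_text: str) -> Counter:
    """Count threads per state in one thread dump."""
    return Counter(_STATE_RE.findall(dump_text))


def state_keeps_growing(dumps: List[str], state: str = "RUNNABLE") -> bool:
    """True if the count for `state` rises in every successive dump."""
    counts = [thread_state_counts(d)[state] for d in dumps]
    return len(counts) >= 2 and all(b > a for a, b in zip(counts, counts[1:]))
```

Capturing a dump every few minutes and feeding the series to `state_keeps_growing` would have flagged the thread growth described in this example well before the JVM ran out of native threads.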