jim terry

Ranch Hand
since Nov 18, 2018

Recent posts by jim terry

Recently we were troubleshooting a popular SaaS application that was slowing down intermittently. To recover from the problem, the application had to be restarted. The slowdowns happened sometimes during high-traffic periods and sometimes during low-traffic periods; there was no cohesive pattern.

This cycle of slowdowns and restarts had been going on for a while when we were engaged to troubleshoot the problem. We uncovered something interesting and thought you might benefit from our findings, hence this article.

Technology Stack

This popular SaaS application was running on the Azure cloud. Below is its technology stack:

  • Spring Framework
  • GlassFish Application Server
  • Java 8
  • Azure cloud


Troubleshooting

    When we were informed about this problem, we captured a thread dump from the application right when the slowdown was happening. There are multiple options to capture a thread dump; we chose the 'jstack' tool. Note: it's critical that you obtain the thread dump right when the problem is happening. Thread dumps captured outside the problem window aren't useful.
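    For reference, the jstack tool is typically invoked from the command line as 'jstack <pid> > thread-dump.txt'. As a sketch of an alternative approach (not the one we used here), the same information can also be captured programmatically from inside the JVM using the JDK's ThreadMXBean; class and file names below are illustrative:

    <code:start>
    import java.lang.management.ManagementFactory;
    import java.lang.management.ThreadInfo;
    import java.lang.management.ThreadMXBean;

    public class ThreadDumpCapture {

        // Dump all live threads, including the monitors and synchronizers
        // they hold or are waiting on (the information a BLOCKED-thread
        // analysis needs).
        public static String capture() {
            ThreadMXBean threadMXBean = ManagementFactory.getThreadMXBean();
            StringBuilder dump = new StringBuilder();
            for (ThreadInfo info : threadMXBean.dumpAllThreads(true, true)) {
                dump.append(info.toString());
            }
            return dump.toString();
        }

        public static void main(String[] args) {
            System.out.println(capture());
        }
    }
    <code:end>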

    We then uploaded the captured thread dump to fastThread.io, an online thread dump analysis tool. The tool instantly generated this report. (We encourage you to click on the hyperlink to see the generated report so that you can get first-hand experience.)

    The report instantly narrowed down the root cause of the problem. The fastThread.io tool highlighted that the 'http-nio-8080-exec-121' thread was blocking 134 application threads. Below is the transitive dependency graph showing the BLOCKED threads:



    Fig: fastThread.io showing transitive dependency of the BLOCKED threads

    From the graph you can see that 134 application threads are BLOCKED by the 'http-nio-8080-exec-121' thread (the first one from the left). Clicking the 'http-nio-8080-exec-121' hyperlink in the graph printed the stack trace of the thread:



    Fig: http-nio-8080-exec-121 obtained org.apache.log4j.Logger lock

    Take a close look at the highlighted section of the stack trace. You can see the thread acquiring the org.apache.log4j.Logger lock and then proceeding to write log records to Azure cloud storage.

    Now let's take a look at the stack trace of the 'http-nio-8080-exec-56' thread (one of the 134 threads which were BLOCKED):



    Fig: http-nio-8080-exec-56 waiting to obtain org.apache.log4j.Logger lock

    Take a look at the highlighted section of the above stack trace: the thread is waiting to acquire the org.apache.log4j.Logger lock. The 'http-nio-8080-exec-56' thread is in the BLOCKED state because 'http-nio-8080-exec-121' acquired the org.apache.log4j.Logger lock and hadn't released it.

    The remaining threads were likewise stuck waiting for the 'org.apache.log4j.Logger' lock. Whenever any application thread attempted to log, it got into this BLOCKED state; that's how 134 application threads ended up BLOCKED.
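    To see why a single lock can freeze so many threads, here is a minimal, self-contained sketch of the same pattern (this is an illustration, not Log4j's actual source): every thread that wants to log must enter the same monitor, so one slow write made while holding the lock leaves all the other logging threads BLOCKED.

    <code:start>
    public class LoggerLockContention {

        // Stands in for the org.apache.log4j.Logger monitor seen in the thread dump.
        private static final Object LOGGER_LOCK = new Object();

        static void log(String message) {
            synchronized (LOGGER_LOCK) {     // every logging thread funnels through this one monitor
                slowRemoteWrite(message);    // slow I/O (e.g. remote storage) while holding the lock
            }
        }

        static void slowRemoteWrite(String message) {
            try {
                Thread.sleep(1000);          // simulate a slow appender write
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }

        public static void main(String[] args) {
            for (int i = 0; i < 5; i++) {
                int id = i;
                new Thread(() -> log("request-" + id), "http-nio-8080-exec-" + id).start();
            }
            // A thread dump taken now shows one RUNNABLE thread inside slowRemoteWrite()
            // and the rest BLOCKED on LOGGER_LOCK, the same shape as the report above.
        }
    }
    <code:end>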

    We then googled for 'org.apache.log4j.Logger BLOCKED thread' and stumbled upon this interesting defect reported in the Apache Log4j bug database.

    It turned out to be a known bug in the Log4j framework, and one of the primary reasons why the new Log4j 2 framework was developed. Below is an excerpt from the defect description:

    <code:start>
    There is no temporary fix for this issue and is one of the reasons Log4j 2 came about. The only fix is to upgrade to Log4j 2.

    :
    :

    Yes, I am saying that the code in Log4j 2 is much different and locking is handled much differently. There is no lock on the root logger or on the appender loop.
    <code:end>

    Due to this bug, any thread that tried to log got into the BLOCKED state, bringing the entire application to a grinding halt. Once the application was migrated from Log4j to the Log4j 2 framework, the problem was resolved.

    Conclusion

    Log4j reached end of life (EOL) in August 2015 and is no longer supported. If your application is still using the Log4j framework, we highly recommend upgrading to the Apache Log4j 2 framework. Here is the migration guide. Log4j 2 isn't just the next version of the Log4j framework; it's a new framework written from scratch, with many performance improvements.
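    After migrating, the logging calls themselves stay familiar. Below is a minimal sketch of what application code looks like against the Log4j 2 API (the class name is illustrative; it assumes the log4j-api and log4j-core dependencies are on the classpath):

    <code:start>
    import org.apache.logging.log4j.LogManager;
    import org.apache.logging.log4j.Logger;

    public class PaymentService {

        // Log4j 2 logger, obtained through LogManager instead of org.apache.log4j.Logger
        private static final Logger logger = LogManager.getLogger(PaymentService.class);

        public void charge(String orderId) {
            // Parameterized logging: the message is only formatted if the level is enabled
            logger.info("Charging order {}", orderId);
        }
    }
    <code:end>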

    We hope this walkthrough also showed you how to troubleshoot an unresponsive application.

    2 days ago
    There are excellent heap dump analysis tools like Eclipse MAT, JProfiler, etc. These tools are handy when you want to debug/troubleshoot OutOfMemoryError. However, HeapHero has the following unique capabilities which aren't available in those tools:

    1. How much memory is wasted?

    HeapHero tells you how much memory your application is wasting because of inefficient programming practices. Memory is commonly wasted for reasons like:

    a. Duplication of strings
    b. Overallocation and underutilization of data structures
    c. Boxed numbers
    d. Several more reasons

    You can see HeapHero reporting how much memory is wasted even in a vanilla Pet Clinic Spring Boot application. Other tools don't provide this vital metric.

    2. First cloud application for heap dump analysis

    Today's memory profiling tools need to be installed on your desktop or laptop; they can't run in the cloud. HeapHero can run on:

    a. Public cloud (AWS, Azure,..)
    b. Your private data center
    c. Local machine

    Your organization can install one instance of HeapHero on a central server, and everyone in the organization can upload and analyze heap dumps from that one server.

    3. CI/CD pipeline Integration

    As part of their CI/CD pipeline, several organizations do static code analysis using tools like Coverity, Veracode, etc. Using HeapHero, you can do runtime memory analysis as well. HeapHero provides a REST API that returns a JSON response containing key metrics about your application's memory utilization. You can invoke this API from your CI/CD pipeline and see whether your code quality is improving or regressing with each code commit.
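    As a rough sketch of such a pipeline step (the endpoint URL, API key parameter and JSON field names here are illustrative placeholders, not the documented HeapHero API; check the HeapHero API documentation for the real values), a heap dump produced during a test run could be posted and the build failed when the response flags a problem:

    <code:start>
    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.nio.file.Path;

    public class HeapCheckStep {

        public static void main(String[] args) throws Exception {
            // Hypothetical endpoint and key - replace with the values from the HeapHero API docs.
            String endpoint = "https://heaphero.example/api/analyze?apiKey=YOUR_API_KEY";

            HttpRequest request = HttpRequest.newBuilder()
                    .uri(URI.create(endpoint))
                    .header("Content-Type", "application/octet-stream")
                    .POST(HttpRequest.BodyPublishers.ofFile(Path.of("build/heap-dumps/app.hprof")))
                    .build();

            HttpResponse<String> response = HttpClient.newHttpClient()
                    .send(request, HttpResponse.BodyHandlers.ofString());

            // Naive check on the JSON body; the "isProblem" field name is illustrative.
            if (response.body().contains("\"isProblem\":true")) {
                System.err.println("Heap analysis reported a problem:\n" + response.body());
                System.exit(1);   // fail the CI/CD stage
            }
        }
    }
    <code:end>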

    4. Instant RCA in production

    Debugging OutOfMemoryError in production is a tedious, challenging exercise. You can automate the end-to-end analysis of OutOfMemoryError using HeapHero. If your application's memory consumption goes beyond certain limits or it experiences an OutOfMemoryError, you can capture heap dumps, analyze them instantly using our REST API, and generate an instant root cause analysis report. Production troubleshooting tools like yCrash leverage the HeapHero REST API to do this analysis for you.

    5. Analyzing heap dumps from a remote location

    Heap dump files are large (several GB). To troubleshoot a heap dump, you normally have to transmit the file from your production server to your local machine and then upload it to your tool. Sometimes the heap dump is stored/archived on a remote server or in AWS S3 storage; in those circumstances you have to download it from that remote location and then upload it to the tool once again. HeapHero simplifies this process: you can pass the heap dump's remote URL as input to the HeapHero API or to the web interface directly, and HeapHero will download the heap dump from that location and analyze it for you.

    6. Report Sharing & Team collaboration

    Sharing heap dumps among a team is a cumbersome process. Finding a proper location to store the heap dump file is the first challenge, and the team member with whom you are sharing it needs the heap dump analysis tool installed locally just to open the file and see the analysis report. HeapHero simplifies this: it gives you a hyperlink like this, which can be embedded in your emails or JIRA tickets and circulated among your team. When a team member clicks the hyperlink, they can see the entire heap dump analysis report in their browser.

    HeapHero also lets you export your heap dump report as a PDF file, which can likewise be circulated among your team members.

    7. Analyzing large heap dumps

    Several memory profilers are good at analyzing smaller heap dumps but struggle with large ones. HeapHero is geared to analyze large heap dumps easily.
    1 week ago
    Eclipse MAT (Memory Analyzer Tool) is a powerful tool to analyze heap dumps. It comes in quite handy when you are trying to debug memory-related problems. In Eclipse MAT, two types of object size are reported:

    Shallow Heap
    Retained Heap

    In this article, let's study the difference between them and how they are calculated.

    It's easier to learn new concepts through an example. Let's say your application has the object model shown in Fig #1:


    Object A holds references to objects B and C.
    Object B holds references to objects D and E.
    Object C holds references to objects F and G.

    Let’s say each object occupies 10 bytes of memory. Now with this context let’s begin our study.
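    For concreteness, here is a small Java sketch of that object graph (class and field names are illustrative; the "10 bytes per object" figure is just the working assumption of this article, not what a real JVM would report):

    <code:start>
    // Fig #1 as code: A -> (B, C), B -> (D, E), C -> (F, G)
    class Node {
        final String name;
        Node left;
        Node right;

        Node(String name) {
            this.name = name;
        }
    }

    public class ObjectGraph {
        public static void main(String[] args) {
            Node d = new Node("D"), e = new Node("E"), f = new Node("F"), g = new Node("G");
            Node b = new Node("B"); b.left = d; b.right = e;
            Node c = new Node("C"); c.left = f; c.right = g;
            Node a = new Node("A"); a.left = b; a.right = c;

            // While 'a' is reachable, a heap dump shows the shallow size of each Node
            // and a retained size for 'a' that covers A, B, C, D, E, F and G.
            System.out.println("Root of the graph: " + a.name);
        }
    }
    <code:end>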

    Shallow Heap size
    The shallow heap of an object is its own size in memory. Since in our example each object occupies 10 bytes, the shallow heap size of each object is 10 bytes. Very simple.

    Retained Heap size of B

    From Fig #1, you can see that object B holds references to objects D and E. So, if object B is garbage collected, there will be no more active references to D and E, which means D and E can also be garbage collected. Retained heap is the amount of memory that will be freed when a particular object is garbage collected. Thus, the retained heap size of B is:

    = B’s shallow heap size + D’s shallow heap size + E’s shallow heap size

    = 10 bytes + 10 bytes + 10 bytes

    = 30 bytes

    Thus, retained heap size of B is 30 bytes.

    Retained Heap size of C

    Object C holds references to objects F and G. So, if object C is garbage collected, there will be no more references to F and G, which means F and G can also be garbage collected. Thus, the retained heap size of C is:

    = C’s shallow heap size + F’s shallow heap size + G’s shallow heap size

    = 10 bytes + 10 bytes + 10 bytes

    = 30 bytes

    Thus, the retained heap size of C is 30 bytes as well.



    Retained Heap size of A
    Object A holds references to objects B and C, which in turn hold references to objects D, E, F and G. Thus, if object A is garbage collected, there will be no more references to objects B, C, D, E, F and G. With this understanding, let's calculate the retained heap size of A.

    Thus, retained heap size of A is:

    = A’s shallow heap size + B’s shallow heap size + C’s shallow heap size + D’s shallow heap size + E’s shallow heap size + F’s shallow heap size + G’s shallow heap size

    = 10 bytes + 10 bytes + 10 bytes + 10 bytes + 10 bytes + 10 bytes + 10 bytes

    = 70 bytes

    Thus, retained heap size of A is 70 bytes.

    Retained heap size of D, E, F and G

    The retained heap size of D is just 10 bytes, i.e. its shallow size, because D doesn't hold any active references to other objects. Thus, if D gets garbage collected, no other objects will be removed from memory. By the same reasoning, the retained heap sizes of E, F and G are also 10 bytes each.

    Let’s make our study more interesting

    Now let's make our study a little more interesting, so that you gain a thorough understanding of shallow heap and retained heap sizes. Let's have object H start holding a reference to B in the example. Note that object B is already referenced by object A, so now both A and H hold references to object B. Let's study what happens to our retained heap calculation in this circumstance.


    In this circumstance, the retained heap size of object A goes down to 40 bytes. Surprising? Puzzling? 🙂 Read on. If object A gets garbage collected, only objects C, F and G lose their last reference, so only C, F and G will be garbage collected. Objects B, D and E, on the other hand, will continue to live in memory because H holds an active reference to B. Thus B, D and E will not be removed from memory even when A gets garbage collected.

    Thus, retained heap size of A is:

    = A’s shallow heap size + C’s shallow heap size + F’s shallow heap size + G’s shallow heap size

    = 10 bytes + 10 bytes + 10 bytes + 10 bytes

    = 40 bytes.

    Thus, the retained heap size of A becomes 40 bytes. All other objects' retained heap sizes remain undisturbed, because there is no change in their references.
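    Reusing the Node class from the earlier sketch, the changed situation can be expressed like this (again purely illustrative):

    <code:start>
    public class ObjectGraphWithH {
        public static void main(String[] args) {
            Node d = new Node("D"), e = new Node("E"), f = new Node("F"), g = new Node("G");
            Node b = new Node("B"); b.left = d; b.right = e;
            Node c = new Node("C"); c.left = f; c.right = g;
            Node a = new Node("A"); a.left = b; a.right = c;
            Node h = new Node("H"); h.left = b;   // B now has two referrers: A and H

            a = null;   // drop the root reference to A
            // Only A, C, F and G become unreachable; B, D and E stay alive through 'h',
            // which is why A's retained size is 40 bytes in this variant.
            System.out.println("H still reaches: " + h.left.name);
        }
    }
    <code:end>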

    We hope this article helped clarify shallow heap and retained heap size calculation in Eclipse MAT. You might also consider exploring HeapHero, another powerful heap dump analysis tool, which shows the amount of memory wasted due to inefficient programming practices such as duplication of objects, overallocation and underutilization of data structures, suboptimal data type definitions, and more.




    1 month ago
    Recently we experienced an interesting production problem. The application was running on multiple AWS EC2 instances behind an Elastic Load Balancer, on GNU/Linux with Java 8 and the Tomcat 8 application server. All of a sudden, one of the application instances became unresponsive while all the other instances were handling traffic properly. Whenever an HTTP request was sent to this instance from the browser, we got the following response printed in the browser:



    We used our APM (Application Performance Monitoring) tool to examine the problem. From the APM tool, we could see that CPU and memory utilization were perfectly normal, yet traffic wasn't coming into this particular application instance. It was really puzzling. Why wasn't traffic coming in?

    We logged in to the problematic AWS EC2 instance and executed the vmstat, iostat, netstat, top and df commands to see whether we could uncover any anomaly. To our surprise, none of these great tools reported any issue.

    As the next step, we restarted the Tomcat application server in which this application was running. It didn’t make any difference either. Still, this application instance wasn’t responding at all.

    DMESG command
    Then we issued the 'dmesg' command on this EC2 instance. This command prints the kernel's message buffer, whose output typically contains messages produced by device drivers. In the output, we noticed the following interesting message printed repeatedly:



    We were intrigued to see this error message: "TCP: out of memory — consider tuning tcp_mem". It means an out-of-memory error was happening at the TCP level. We had always thought out-of-memory errors happen only at the application level and never at the TCP level.

    The problem was intriguing because we breathe OutOfMemoryError day in and day out. We have built troubleshooting tools like GCeasy and HeapHero to help engineers debug OutOfMemoryErrors that happen at the application level (Java, Android, Scala, Jython, ... applications), and we have written several blogs on the topic. But we were stumped to see an out-of-memory condition at the device driver level. We never thought there would be a problem at that level, and that too in the stable Linux operating system. Stumped by this problem, we weren't sure how to proceed.

    Thus, we resorted to the Google god's help 😊. Googling the search term "TCP: out of memory — consider tuning tcp_mem" showed only 12 results, and except for one article, none of them had much content ☹. Even that one article was written in a foreign language that we couldn't understand. So we still weren't sure how to troubleshoot this problem.

    Left with no other solution, we went ahead and implemented the universal fix, i.e. a restart. We restarted the EC2 instance to put out the immediate fire. Hurray!! Restarting the server cleared the problem immediately. Apparently, this server hadn't been restarted for a long time (more than 70 days); perhaps because of that, the application had saturated the TCP memory limits.

    We reached out for help to an intelligent friend who works for a world-class technology company. He asked us what values we had set for the kernel properties below:

    * net.core.netdev_max_backlog
    * net.core.rmem_max
    * net.core.wmem_max
    * net.ipv4.tcp_max_syn_backlog
    * net.ipv4.tcp_rmem
    * net.ipv4.tcp_wmem

    Honestly, this was the first time we had heard about these properties. We found that the following values were set for them on the server:



    Our friend suggested changing the values as given below:



    He mentioned that setting these values would eliminate the problem we had faced. We're sharing them here, as they may be of help to you. Apparently, our values were very low compared to the values he provided.

    Conclusion
    Here are a few conclusions that we would like to draw:

    * Even modern, industry-standard APM (Application Performance Monitoring) tools don't fully answer the application performance problems we face today.
    * The 'dmesg' command is your friend. You might want to execute it when your application becomes unresponsive; it may point you to valuable information.
    * Memory problems don't have to happen only in the code that we write 😊; they can happen even at the TCP/kernel level.



    4 months ago

    In this modern world, garbage collection logs are still analyzed in a tedious, manual way: you have to get hold of the DevOps engineer who has access to the production servers, he mails you the application's GC logs, you upload the logs to a GC analysis tool, and then you apply your intelligence to analyze them. There has been no programmatic way to analyze garbage collection logs proactively. To eliminate this hassle, gceasy.io is introducing a RESTful API to analyze garbage collection logs. With one line of code you can get your GC logs analyzed instantly.

    Here are a few use cases where this API can be extremely useful.

    Use case 1: Automatic Root Cause Analysis
    Most DevOps teams use a simple HTTP ping or APM tools to monitor application health. A ping is good for detecting whether the application is alive or not. APM tools are great at telling you that the application's CPU spiked by x%, memory utilization increased by y%, or response time degraded by z milliseconds, but they won't tell you what caused the CPU to spike, the memory utilization to increase, or the response time to degrade. If you configure a cron job to capture thread dumps/GC logs at periodic intervals and invoke our REST API, we apply our intelligent patterns and machine learning algorithms to instantly identify the root cause of the problem.

    Advantage 1: Whenever this sort of production problem happens, in the heat of the moment the DevOps team recycles the servers without capturing the thread dumps and GC logs. You need to capture thread dumps and GC logs at the moment the problem is happening in order to diagnose it. With this new strategy you don't have to worry about that, because your cron job is capturing thread dumps/GC logs at periodic intervals and invoking the REST API, so all your thread dumps/GC logs are archived on our servers.

    Advantage 2: Unlike APM tools, which claim to add less than 3% overhead but in reality add much more, the beauty of this strategy is that it adds no overhead (or negligible overhead), because the entire analysis of the thread dumps/GC logs is done on our servers and not on your production servers.

    Use case 2: Performance Tests
    When you conduct performance tests, you might want to take thread dumps/GC logs periodically and get them analyzed through the API. If the thread count goes beyond a threshold, too many threads are WAITING, any threads are BLOCKED for a prolonged period, a lock isn't getting released, frequent full GC activity is happening, or GC pause times exceed thresholds, you need visibility right then and there, before the code hits production. In such circumstances this API comes in very handy.

    Use case 3: Continuous Integration
    As part of continuous integration, it's highly encouraged to execute performance tests. Thread dumps/GC logs should be captured and can be analyzed using the API. If the API reports any problems, the build can be failed. In this way, you can catch performance degradation right at code commit time instead of catching it in the performance lab or in production.

    How to invoke the Garbage Collection log analysis API?

    Invoking the Garbage Collection log analysis API is very simple:

    1). Register with us. We will email you the API key. This is a one-time setup. Note: if you have purchased the enterprise version with the API, you don't have to register; the API key will be provided as part of the installation instructions.
    2). POST an HTTP request to https://api.gceasy.io/analyzeGC?apiKey={API_KEY_SENT_IN_EMAIL}
    3). The body of the HTTP request should contain the garbage collection log that needs to be analyzed.
    4). The HTTP response is sent back in JSON format. The JSON has several important stats about the GC log. The primary element to look at in the JSON response is "isProblem"; it will be "true" if any memory/performance problems have been discovered, and the "problem" element will contain a detailed description of the problem.

    CURL command

    Assuming your GC log file is located at "./my-app-gc.log", you can invoke the API with a single cURL command that POSTs the log file as the request body (for example, using curl's --data-binary option) to the endpoint shown in step 2.

    It can't get any simpler than that, can it?
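    If you would rather invoke the API from Java than from cURL, here is a minimal sketch using java.net.http.HttpClient. The endpoint, the API key placeholder and the "isProblem"/"problem" fields come from the steps above; the Content-Type header is an assumption, so check the API documentation for the exact value:

    <code:start>
    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.nio.file.Path;

    public class GcLogAnalysis {
        public static void main(String[] args) throws Exception {
            String apiKey = "API_KEY_SENT_IN_EMAIL";   // the key you received after registering

            // POST the raw GC log file as the request body (steps 2 and 3 above).
            HttpRequest request = HttpRequest.newBuilder()
                    .uri(URI.create("https://api.gceasy.io/analyzeGC?apiKey=" + apiKey))
                    .header("Content-Type", "text/plain")   // assumption: plain-text body
                    .POST(HttpRequest.BodyPublishers.ofFile(Path.of("./my-app-gc.log")))
                    .build();

            HttpResponse<String> response = HttpClient.newHttpClient()
                    .send(request, HttpResponse.BodyHandlers.ofString());

            // The JSON response contains the "isProblem" and "problem" elements (step 4).
            System.out.println(response.body());
        }
    }
    <code:end>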

    How to invoke Java Garbage Collection log analysis API
    5 months ago
    Hi

    The source is: https://www.youtube.com/watch?v=uJLOlCuOR4k&t=26s and it is allowed to be posted under copyright regulations.
    5 months ago

    Java Thread Dump Analyzer: troubleshoot JVM crashes, slowdowns, memory leaks, freezes, CPU spikes
    https://community.atlassian.com/t5/Marketplace-Apps-Integrations/How-do-you-analyze-GC-logs-thread-dumps-and-head-dumps/ba-p/985787

    5 months ago