Recently we experienced an interesting production problem. The application was running on multiple AWS EC2 instances behind an Elastic Load Balancer, on GNU/Linux with Java 8 and the Tomcat 8 application server. All of a sudden, one of the application instances became unresponsive, while all the other instances kept handling traffic properly. Whenever an HTTP request was sent to this instance from the browser, we got the following response in the browser:
We used our APM (Application Performance Monitoring) tool to examine the problem. According to the APM tool, CPU and memory utilization looked perfectly normal. On the other hand, the same tool showed that traffic wasn’t coming into this particular application instance. It was really puzzling. Why wasn’t traffic coming in?
We logged in to the problematic AWS EC2 instance and ran the vmstat, iostat, netstat, top, and df commands to see whether we could uncover any anomaly. To our surprise, none of these great tools reported any issue.
As the next step, we restarted the Tomcat application server in which the application was running. It didn’t make any difference: the application instance still wasn’t responding at all.
Then we issued the ‘dmesg’ command on this EC2 instance. This command prints the kernel’s message buffer, which typically contains messages produced by the kernel and its device drivers. In the output of this command, we noticed the following interesting messages printed repeatedly:
We were intrigued to see this error message: “TCP: out of memory -- consider tuning tcp_mem”. It means an out-of-memory error is happening at the TCP level. We had always thought that out-of-memory errors happen only at the application level and never at the TCP level.
The problem was intriguing because we live and breathe OutOfMemoryError day in and day out. We have built troubleshooting tools such as GCeasy and HeapHero to help engineers debug OutOfMemoryError at the application level (Java, Android, Scala, Jython… applications), and we have written several blog posts on the topic. Yet we were stumped to see an out-of-memory condition at the kernel level. We never thought such a problem could occur there, least of all in a stable Linux operating system. We weren’t sure how to proceed.
Thus, we resorted to the Google god’s help 😊. Googling for the search term “TCP: out of memory -- consider tuning tcp_mem” showed only 12 search results. Except for one article, none of them had much content ☹. Even that one article was written in a foreign language that we couldn’t understand. So we still weren’t sure how to troubleshoot this problem.
Left with no other options, we went ahead and applied the universal solution, i.e. “restart”. We restarted the EC2 instance to put out the immediate fire. Hurray!! Restarting the server cleared the problem immediately. Apparently, this server hadn’t been restarted for more than 70 days; perhaps over that time the application had saturated the TCP memory limits.
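In hindsight, the saturation hypothesis can be checked directly on a Linux host. The sketch below is a minimal example and not what we ran during the incident. It assumes the standard files /proc/sys/net/ipv4/tcp_mem (three page counts: low, pressure, high) and /proc/net/sockstat (whose TCP line carries a `mem` field, also in pages). The function names are our own, hypothetical helpers.

```python
from pathlib import Path


def parse_tcp_mem(text):
    """Parse the contents of /proc/sys/net/ipv4/tcp_mem.

    The file holds three whitespace-separated page counts:
    low, pressure, and high (the hard limit).
    """
    low, pressure, high = (int(value) for value in text.split())
    return {"low": low, "pressure": pressure, "high": high}


def parse_sockstat_tcp_pages(text):
    """Return the 'mem' field (pages used by TCP) from /proc/net/sockstat text."""
    for line in text.splitlines():
        if line.startswith("TCP:"):
            fields = line.split()[1:]  # alternating key/value tokens
            stats = dict(zip(fields[0::2], fields[1::2]))
            return int(stats["mem"])
    raise ValueError("no TCP line found in sockstat text")


def tcp_memory_state(used_pages, limits):
    """Classify TCP memory usage against the tcp_mem thresholds."""
    if used_pages >= limits["high"]:
        return "exhausted"
    if used_pages >= limits["pressure"]:
        return "pressure"
    return "ok"


def current_tcp_memory_state():
    """Read both /proc files on the local Linux host and classify usage."""
    limits = parse_tcp_mem(Path("/proc/sys/net/ipv4/tcp_mem").read_text())
    used = parse_sockstat_tcp_pages(Path("/proc/net/sockstat").read_text())
    return used, limits, tcp_memory_state(used, limits)
```

Calling `current_tcp_memory_state()` periodically on each instance, and alerting when the state leaves "ok", would probably have flagged this condition well before the instance stopped accepting traffic.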
We reached out for help to one of our intelligent friends, who works for a world-class technology company. This friend asked us what values we had set for the following kernel properties:
Honestly, this was the first time we had heard about these properties. We found that the following values were set for them on the server:
Our friend suggested changing the values as given below:
He said that setting these values would eliminate the problem we had faced. We are sharing them here in case they help you. Apparently, our values were very low compared to the ones he provided.
Here are a few conclusions that we would like to draw:
- Even modern, industry-standard APM (Application Performance Monitoring) tools don’t fully answer the application performance problems that we face today.
- The ‘dmesg’ command is your friend. Consider running it when your application becomes unresponsive; it may point you to valuable information.
- Memory problems don’t have to originate in the code that we write 😊; they can happen even at the TCP/kernel level.
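To make the ‘dmesg’ check routine rather than a last resort, the kernel log can be scanned automatically. The following is a minimal sketch under a few assumptions: the `dmesg` command is on the PATH and the current user is allowed to read the kernel log (some systems restrict this to root). The helper names and the marker constant are our own, hypothetical choices.

```python
import subprocess

# Prefix of the kernel message we saw during the incident.
TCP_OOM_MARKER = "TCP: out of memory"


def find_kernel_warnings(log_text, marker=TCP_OOM_MARKER):
    """Return every line of the kernel log text that contains the marker."""
    return [line for line in log_text.splitlines() if marker in line]


def read_kernel_log():
    """Run dmesg and return its output as text (may require root)."""
    result = subprocess.run(
        ["dmesg"], capture_output=True, text=True, check=True
    )
    return result.stdout
```

For example, `find_kernel_warnings(read_kernel_log())` returns the matching lines, and a non-empty result is a reasonable trigger for an alert.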