AnyMind Group

[Tech Blog] How to deal with JVM OOM issues in GKE

Hello, my name is Anton, and I’m a tech lead at AnyMind. In this article, we discuss memory issues that you might face while developing software systems in Google Cloud and K8S in particular. The article covers the very basics of GC in JVM and possible ways of solving OOM problems.

137 exit code

You wake up in the morning, take a sip of coffee, and run kubectl get pods. Everything seems OK, but some pods were restarted by K8S for some reason. You run kubectl describe pod <your_pod> and see the following:

  Last State:   Terminated
    Reason:     OOMKilled
    Exit Code:  137

You google "exit code 137" and realize that your Java/Kotlin app hit the container memory limit and was killed (137 is 128 + 9, that is, SIGKILL). As a quick fix, you go to the K8S deployment, give your containers some additional memory, and raise the maximum heap size:

resources:
    requests:
        memory: "3500Mi"
    limits:
        memory: "3500Mi"
...
env:
  # Dockerfile 'ENTRYPOINT exec java $JAVA_OPTS -jar app.jar'
  - name: JAVA_OPTS
    value: -Xmx3000m

Time goes on, but nothing seems to help: your pods still get killed by K8S. You’ve checked the code, you’ve read the logs, and you still have no clue what’s wrong. Familiar situation? If so, let’s dive in.

GC in JVM

In short, GC reclaims the objects that are not reachable from the set of root objects (for example, references held on a thread stack). There are many GC algorithms available for JVMs right now. Some of them were developed for client applications, some for better throughput, some for multicore CPUs, but the classic ones have something in common: generations.

As most Java objects die young, the heap is split into 3 sections: eden space (this is where new objects typically reside), survivor space (s0 and s1), and tenured (old, long-living objects). Because young and old objects have different "life expectancies", there are 2 kinds of garbage collection: minor, which collects young objects, and major, which collects all objects, both old and young.

A typical minor garbage collection consists of the following steps:

  1. Find all the objects in eden and s0 (it’s empty on the first collection) that are reachable from the root set.
  2. Copy these objects to s1.
  3. Clear eden and s0, as all surviving objects have migrated to s1.

The process repeats after some time, and this time the surviving objects move from s1 and eden to s0. As a result, surviving objects migrate from s0 to s1 and back. If an object has survived multiple collections (migrated from s0 to s1 and back several times), it gets promoted and moves to tenured. Old objects are expected to live long, so GC doesn’t bother them until the heap is running low on free memory. In this situation, a major (aka full) GC takes place.
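
To see minor and major collections happen, you can ask the JVM for its collector statistics. The sketch below uses the standard java.lang.management API; the class name GcCounters and the allocation loop are made up for illustration. Collector names depend on the GC in use (for example, "G1 Young Generation" and "G1 Old Generation" with G1), so treat the printed names as environment-specific.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcCounters {
    /** Total number of collections over all collectors; undefined counts (-1) are skipped. */
    static long totalCollections() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long count = gc.getCollectionCount();
            if (count > 0) {
                total += count;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        // Create a lot of short-lived garbage so that eden fills up repeatedly.
        long allocatedBytes = 0;
        for (int i = 0; i < 200_000; i++) {
            byte[] chunk = new byte[10 * 1024];
            allocatedBytes += chunk.length;
        }
        System.out.println("Allocated " + (allocatedBytes / (1024 * 1024)) + " MiB of garbage");

        // One bean per collector: typically one for the young and one for the old generation.
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.println(gc.getName()
                    + ": collections=" + gc.getCollectionCount()
                    + ", total time=" + gc.getCollectionTime() + " ms");
        }
    }
}
```

In a healthy app, the young-generation collector usually shows many cheap collections, while the old-generation collector runs rarely.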

How to monitor memory in JVM?

Now you know how GC works and are ready for some action. In a local environment, you just run a Java/Kotlin app and connect to it using one of the available JVM monitoring tools, some of which ship with the JDK. Just go to the "bin" folder of your JDK and run either jconsole or Java VisualVM (bundled with the JDK up to version 8, a separate download since then). Both are equally good, but for simplicity we will use jconsole, which lets you monitor any JVM process.

jconsole interface

jconsole, like any other JVM monitoring tool, lets you examine almost everything: threads, memory, CPU, and even loaded classes. Moreover, you can get separate memory metrics for each generation.

All generations that are available for monitoring in jconsole
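
The same per-generation numbers that jconsole charts are exposed by the JVM as memory pools, so you can also print them from code. This is a minimal sketch using the standard java.lang.management API; the class name MemoryPools is made up, and the pool names you get depend on the GC algorithm in use.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;
import java.lang.management.MemoryUsage;
import java.util.ArrayList;
import java.util.List;

public class MemoryPools {
    static final long MIB = 1024L * 1024L;

    /** Names of the heap pools (eden, survivor, old/tenured) for the current GC. */
    static List<String> heapPoolNames() {
        List<String> names = new ArrayList<>();
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if (pool.getType() == MemoryType.HEAP) {
                names.add(pool.getName());
            }
        }
        return names;
    }

    public static void main(String[] args) {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            MemoryUsage usage = pool.getUsage();
            if (usage == null) {
                continue; // the pool is no longer valid
            }
            // getMax() returns -1 when the maximum is undefined.
            String max = usage.getMax() < 0 ? "undefined" : (usage.getMax() / MIB) + " MiB";
            System.out.printf("%-35s %-16s used=%d MiB, max=%s%n",
                    pool.getName(), pool.getType(), usage.getUsed() / MIB, max);
        }
    }
}
```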

How to spot a leak in JVM?

Most of the time, memory leaks in Java/Kotlin apps are caused by storing references to objects that the application no longer uses. For example, you may cache responses of "heavy" SQL queries for better performance in a static final HashMap<> and forget to clear the cache for those SQL queries whose results are not needed anymore. As a result, you may run out of memory at some point. In that sense, memory leaks are like time bombs: everything is perfectly fine, no major problems are seen during testing, but after a couple of days the app "blows up" in production.

In most situations, "leaky" objects reside in tenured, as they tend to survive all collections: technically, you still have references to those objects somewhere. Monitoring tenured might help to spot a problem. Just open jconsole, choose an app, click the "memory" tab, select tenured in the "chart selector" (it may be called differently depending on the GC algorithm), leave it for a while (you may even trigger a major GC by clicking "perform GC"), and see how the chart goes. If you see a "horizontal saw", then it’s a good sign. This is how it should be:

A schematic potential memory line chart from jconsole. Old objects die eventually

But if you see a "rising saw", then something is wrong. It means that you get more and more old objects that are not willing to "die" and leave some space for the younger generation.

A schematic potential memory line chart from jconsole. Old objects are not willing to die
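
To make the cache example concrete, here is a minimal sketch of the leaky pattern next to one possible remedy, a size-bounded LRU map built on LinkedHashMap. The class name QueryCache, the query strings, and the limit of 1000 entries are all hypothetical. The bounded map is not thread-safe, so a real service would need synchronization or a dedicated caching library.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

public class QueryCache {
    // Leaky variant: entries are added but never removed, so they all end up in tenured.
    static final Map<String, String> LEAKY = new HashMap<>();

    /** Bounded variant: evicts the least recently used entry once maxEntries is exceeded. */
    static <K, V> Map<K, V> boundedLru(final int maxEntries) {
        // accessOrder = true keeps the least recently accessed entry first.
        return new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                return size() > maxEntries;
            }
        };
    }

    public static void main(String[] args) {
        Map<String, String> bounded = boundedLru(1000);
        for (int i = 0; i < 100_000; i++) {
            String sql = "SELECT * FROM report WHERE id = " + i; // hypothetical query
            LEAKY.put(sql, "result-" + i);
            bounded.put(sql, "result-" + i);
        }
        System.out.println("Leaky cache size:   " + LEAKY.size());   // prints 100000
        System.out.println("Bounded cache size: " + bounded.size()); // prints 1000
    }
}
```

With the leaky map, the tenured chart keeps rising as long as new queries arrive; with the bounded map, it levels off once the cache is full.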

Getting a heap dump in JVM

All right, now you know you have a memory leak. What is next? Where is it? MAT (Eclipse Memory Analyzer Tool) is here to help you answer this question. First, you need to get a heap dump file. You can get it easily by following these steps:

  1. Go to jconsole and choose your app
  2. Go to the "MBeans" tab
  3. Select "com.sun.management" > "HotSpotDiagnostic" > "Operations" > "dumpHeap"
  4. In the window, enter the filename of the heap dump with the ".hprof" extension. Don’t forget to specify the folder too, as otherwise the dump file is created in the working directory of the monitored JVM (the root folder in our case).
  5. Click "dumpHeap"

This is how you get heap dumps in jconsole
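
The "dumpHeap" operation from the steps above belongs to the HotSpotDiagnostic MBean, so on a HotSpot-based JVM it can also be called from inside the app. The following is a sketch under that assumption; the class name HeapDumper and the target path are made up for illustration.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.lang.management.ManagementFactory;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class HeapDumper {
    /**
     * Writes a heap dump of the current JVM to the given ".hprof" file.
     * With liveOnly = true, only objects that are still reachable are dumped.
     */
    static Path dumpHeap(Path target, boolean liveOnly) {
        try {
            HotSpotDiagnosticMXBean diagnostic =
                    ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
            // dumpHeap fails if the file already exists, so remove a stale dump first.
            Files.deleteIfExists(target);
            diagnostic.dumpHeap(target.toAbsolutePath().toString(), liveOnly);
            return target;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Assumption: the temp directory is writable; choose any folder you like.
        Path target = Paths.get(System.getProperty("java.io.tmpdir"), "dump.hprof");
        dumpHeap(target, true);
        System.out.println("Heap dump written to " + target.toAbsolutePath());
    }
}
```

Passing an absolute path avoids any doubt about where the file ends up.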

You should now have a heap dump file with the data about the current state of your Java heap. Next, you need to analyze it:

  1. Open MAT
  2. Click "File" > "Open file" and choose your dump file.
  3. In the "Getting started wizard", choose "Leak suspects report".

As a result, you might get the following picture:

Memory leak suspects in MAT

In the first window, you might see classes that MAT suspects of being "leaky". In addition, I recommend looking at the histogram of objects sorted by "Retained heap" (see "Eclipse MAT: Shallow Heap Vs. Retained Heap"). That might greatly help in spotting unexpectedly heavy objects.

Histogram of heavy objects

MAT, unfortunately, can’t tell you what to do next. It only spots problematic memory areas without any clue on how to fix the problem. From here, you are on your own: check the suspects, investigate the code that works with heavy objects, analyze the source code of external libraries, and so on.

Memory monitoring in GKE

OK, the local environment is cool, but you surely don’t run the system on your laptop. Everything is in the cloud now. So how do you monitor remote JVM processes? Usually, it’s done through JMX (Java Management Extensions). The first thing you need to do is to open a JMX port on your remote process using the following JVM arguments:

    -Dcom.sun.management.jmxremote 
    -Dcom.sun.management.jmxremote.authenticate=false 
    -Dcom.sun.management.jmxremote.ssl=false 
    -Dcom.sun.management.jmxremote.local.only=false
    -Dcom.sun.management.jmxremote.port=<your port> 
    -Dcom.sun.management.jmxremote.rmi.port=<your port> 
    -Djava.rmi.server.hostname=127.0.0.1

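You may notice that authentication and SSL are disabled (-Dcom.sun.management.jmxremote.authenticate=false and -Dcom.sun.management.jmxremote.ssl=false), but don’t worry. It’s OK since we will not expose the port in K8S. It will be used within the cluster only. The JMX port and the RMI port are set to the same value so that a single port is enough, and -Djava.rmi.server.hostname=127.0.0.1 makes the JVM advertise localhost, which is the address your client will connect to through port-forwarding.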

The next step is port-forwarding. Since the JMX port is not accessible from the outside, you need to get inside the cluster:

    kubectl port-forward <pod-name> <your port>:<your port>

This command forwards a local port to the JMX port of the pod of interest. Then, run jconsole:

  1. Click the "Remote process" radio button
  2. Enter "127.0.0.1:<your_port>"

After the connection has been established, you will see a very familiar interface with all the memory, thread, and CPU line charts. If you want to get a heap dump, just follow the same steps you went through for the local environment. There is only one catch: the dump file is created inside your pod and not on your local machine. You need to copy the file first:

    kubectl cp <pod-name>:/dump.hprof ~/work/dump.hprof

Finally, you can open the file in MAT and investigate the problem in detail.
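
If you prefer a script over a GUI, the forwarded JMX port can also be read with a small client built on the standard javax.management.remote API. This is only a sketch: the class name RemoteHeapProbe is made up, the port is whatever you passed to kubectl port-forward, and it assumes the JVM was started with the unauthenticated, non-SSL settings shown earlier.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;
import javax.management.MBeanServerConnection;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class RemoteHeapProbe {
    /** Builds the JMX-over-RMI URL; this is the same format jconsole uses for "host:port". */
    static String serviceUrl(String host, int port) {
        return "service:jmx:rmi:///jndi/rmi://" + host + ":" + port + "/jmxrmi";
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.out.println("Usage: java RemoteHeapProbe <forwarded port>");
            return;
        }
        int port = Integer.parseInt(args[0]);
        JMXServiceURL url = new JMXServiceURL(serviceUrl("127.0.0.1", port));

        // The connector is closed automatically at the end of the block.
        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection connection = connector.getMBeanServerConnection();
            // Proxy for the remote java.lang:type=Memory MBean.
            MemoryMXBean memory = ManagementFactory.newPlatformMXBeanProxy(
                    connection, ManagementFactory.MEMORY_MXBEAN_NAME, MemoryMXBean.class);
            MemoryUsage heap = memory.getHeapMemoryUsage();
            System.out.printf("Remote heap: used=%d MiB, committed=%d MiB%n",
                    heap.getUsed() / (1024 * 1024), heap.getCommitted() / (1024 * 1024));
        }
    }
}
```

Run it while kubectl port-forward is active, passing the forwarded port as the only argument.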

Summary

If you are having JVM memory issues in GKE, don’t let the problem get you down. Connect to the JVM process remotely using jconsole (or Java VisualVM), monitor the heap, get the dump, and analyze it in MAT.
