- How to Fix OOMKilled Kubernetes Error (Exit Code 137)
- How Does the OOM Killer Mechanism Work?
- OOMKilled: Common Causes
- OOMKilled: Diagnosis and Resolution
- Step 1: Gather Information
- Step 2: Check Pod Events Output for Exit Code 137
- Step 3: Troubleshooting
- Solving Kubernetes Errors Once and for All with Komodor
- Exit code 137
- 137 = killed by SIGKILL
- Physical memory vs JVM heap
- What to do
- Kubernetes Pods Terminated — Exit Code 137
How to Fix OOMKilled Kubernetes Error (Exit Code 137)
The OOMKilled error, also indicated by exit code 137, means that a container or pod was terminated because it used more memory than it was allowed. OOM stands for “Out Of Memory”.
Kubernetes allows pods to limit the resources their containers are allowed to utilize on the host machine. A pod can specify a memory limit – the maximum amount of memory the container is allowed to use, and a memory request – the minimum memory the container is expected to use.
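As a minimal sketch (the pod name, image, and values below are illustrative, not taken from any specific workload), both values are set per container under resources in the pod specification:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: my-pod-1              # illustrative name
spec:
  containers:
    - name: app
      image: nginx:1.25       # placeholder image
      resources:
        requests:
          memory: "256Mi"     # memory the scheduler reserves for this container
        limits:
          memory: "512Mi"     # exceeding this value gets the container OOMKilled
```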
If a container uses more memory than its memory limit, it is terminated with an OOMKilled status. Similarly, if the overall memory usage of all containers in a pod, or of all pods on the node, exceeds the available memory, one or more pods may be terminated.
You can identify the error by running the kubectl get pods command; the pod status will appear as OOMKilled:

```
NAME       READY   STATUS      RESTARTS   AGE
my-pod-1   0/1     OOMKilled   0          3m12s
```
We’ll provide a general process for identifying and resolving OOMKilled . More complex cases will require advanced diagnosis and troubleshooting, which is beyond the scope of this article.
How Does the OOM Killer Mechanism Work?
OOMKilled is not actually native to Kubernetes; it is a feature of the Linux kernel, known as the OOM Killer, which Kubernetes uses to manage container lifecycles. The OOM Killer mechanism monitors node memory and selects processes that are taking up too much memory and should be killed. It is important to realize that the OOM Killer may kill a process even if there is free memory on the node.
The Linux kernel maintains an oom_score for each process running on the host. The higher this score, the greater the chance that the process will be killed. Another value, called oom_score_adj , allows users to customize the OOM process and define when processes should be terminated.
Kubernetes uses the oom_score_adj value when defining a Quality of Service (QoS) class for a pod. There are three QoS classes that may be assigned to a pod:
- Guaranteed
- Burstable
- BestEffort
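In short (per the standard Kubernetes QoS rules): a pod is Guaranteed when every container sets CPU and memory requests equal to its limits, BestEffort when no container sets any requests or limits, and Burstable in every other case (at least one container sets a request or a limit, but the pod does not meet the Guaranteed criteria).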
Each QoS class has a matching value for oom_score_adj :
Quality of Service | oom_score_adj |
---|---|
Guaranteed | -997 |
BestEffort | 1000 |
Burstable | min(max(2, 1000 - (1000 * memoryRequestBytes) / machineMemoryCapacityBytes), 999) |
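To make the Burstable formula concrete (the numbers are illustrative): a pod that requests 1 GiB of memory on a node with 4 GiB of capacity gets oom_score_adj = min(max(2, 1000 - (1000 * 1) / 4), 999) = min(max(2, 750), 999) = 750, placing it between Guaranteed (-997) and BestEffort (1000) pods.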
Because “Guaranteed” pods have a lower value, they are the last to be killed on a node that is running out of memory. “BestEffort” pods are the first to be killed.
A pod that is killed due to a memory issue is not necessarily evicted from the node: if its restart policy is set to “Always”, the kubelet will try to restart it.
To see the QoS class of a pod, run the following command:
kubectl get pod <pod-name> -o jsonpath='{.status.qosClass}'
To see the oom_score of a pod:
- Run kubectl exec -it <pod-name> -- /bin/bash
- To see the oom_score , run cat /proc/<pid>/oom_score
- To see the oom_score_adj , run cat /proc/<pid>/oom_score_adj
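Roughly speaking, the kernel derives a process’s oom_score from the share of the node’s memory it is using and then adds oom_score_adj to it, so the QoS-based adjustment shifts which processes look most expendable to the OOM Killer.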
The pod with the highest oom_score is the first to be killed when the node runs out of memory.
OOMKilled: Common Causes
The following table shows the common causes of this error and how to resolve it. However, note there are many more causes of OOMKilled errors, and many cases are difficult to diagnose and troubleshoot.
Cause | Resolution |
---|---|
Container memory limit was reached, and the application is experiencing higher load than normal | Increase memory limit in pod specifications |
Container memory limit was reached, and application is experiencing a memory leak | Debug the application and resolve the memory leak |
Node is overcommitted—the pods scheduled on the node can together use more memory than the node has | Adjust memory requests (minimum guaranteed memory) and memory limits (maximum allowed memory) in your containers |
OOMKilled: Diagnosis and Resolution
Step 1: Gather Information
Run kubectl describe pod [name] and save the content to a text file for future reference:
kubectl describe pod [name] > /tmp/troubleshooting_describe_pod.txt
Step 2: Check Pod Events Output for Exit Code 137
Check the describe pod text file, and look for the following in the container’s State section:

```
State:          Running
  Started:      Thu, 10 Oct 2019 11:14:13 +0200
Last State:     Terminated
  Reason:       OOMKilled
  Exit Code:    137
```
Exit code 137 indicates that the container was terminated due to an out of memory issue. Now look through the events in the pod’s recent history, and try to determine what caused the OOMKilled error:
- The pod was terminated because a container limit was reached.
- The pod was terminated because the node was “overcommitted”—the pods scheduled to the node, taken together, request more memory than is available on the node.
Step 3: Troubleshooting
If the pod was terminated because a container limit was reached:
- Determine if your application really needs more memory. For example, if the application is a website that is experiencing additional load, it may need more memory than originally specified. In this case, to resolve the error, increase the memory limit for the container in the pod specification.
- If memory use suddenly increases, and does not seem to be related to application loads, the application may be experiencing a memory leak. Debug the application and resolve the memory leak. In this case you should not increase the memory limit, because this will cause the application to use up too many resources on the nodes.
If the pod was terminated because of overcommit on the node:
- Overcommit on a node can occur because pods are allowed to schedule on a node if their memory requests value—the minimal memory value—is less than the memory available on the node.
- For example, Kubernetes may run 10 containers with a memory request value of 1 GB on a node with 10 GB of memory. However, if these containers have a memory limit of 1.5 GB, some of the pods may use more than the minimum memory; in the worst case they could demand 10 x 1.5 GB = 15 GB on a 10 GB node, and the node will run out of memory and need to kill some of the pods.
- You need to determine why Kubernetes decided to terminate the pod with the OOMKilled error, and adjust memory requests and limit values to ensure that the node is not overcommitted.
When adjusting memory requests and limits, keep in mind that when a node is overcommitted, Kubernetes terminates pods according to the following priority order:
- Pods that do not have requests or limits
- Pods that have requests, but not limits
- Pods that are using more than their memory request value—minimal memory specified—but under their memory limit
- Pods that are using more than their memory limit
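One way to take a critical workload out of the overcommit equation (a sketch with illustrative names and values, not a universal recommendation) is to set the container’s memory request equal to its limit, so the scheduler reserves exactly as much memory as the container is ever allowed to use:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: critical-pod          # illustrative name
spec:
  containers:
    - name: app
      image: nginx:1.25       # placeholder image
      resources:
        requests:
          memory: "1Gi"       # request == limit ...
          cpu: "500m"
        limits:
          memory: "1Gi"       # ... so the node never promises memory it cannot deliver
          cpu: "500m"         # CPU must also match for the Guaranteed QoS class
```

With matching CPU and memory requests and limits, the pod is also classed as Guaranteed, which places it last in line for the OOM Killer.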
To fully diagnose and resolve Kubernetes memory issues, you’ll need to monitor your environment, understand the memory behavior of pods and containers compared to the limits, and fine tune your settings. This can be a complex, unwieldy process without the right tooling.
Solving Kubernetes Errors Once and for All with Komodor
The troubleshooting process in Kubernetes is complex and, without the right tools, can be stressful, ineffective and time-consuming. Some best practices can help minimize the chances of things breaking down, but eventually something will go wrong—simply because it can.
This is the reason why we created Komodor, a tool that helps dev and ops teams stop wasting their precious time looking for needles in (hay)stacks every time things go wrong.
Acting as a single source of truth (SSOT) for all of your k8s troubleshooting needs, Komodor offers:
- Change intelligence: Every issue is a result of a change. Within seconds we can help you understand exactly who did what and when.
- In-depth visibility: A complete activity timeline, showing all code and config changes, deployments, alerts, code diffs, pod logs, and more, all within one pane of glass with easy drill-down options.
- Insights into service dependencies: An easy way to understand cross-service changes and visualize their ripple effects across your entire system.
- Seamless notifications: Direct integration with your existing communication channels (e.g., Slack) so you’ll have all the information you need, when you need it.
Exit code 137
On a large project you may sometimes encounter a mysterious build failure with no clear error message. The logs say something about java exiting with exit code 1, and on further investigation you find that the reason was that another java exited with error code 137. Usually this happens on a CI machine (Jenkins, TeamCity, GitLab). What does it mean and how do you fix it?
137 = killed by SIGKILL
Exit code 137 is Linux-specific and means that your process was killed by a signal, namely SIGKILL . The main reason for a process getting killed by SIGKILL on Linux (unless you do it yourself) is running out of memory.¹
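To spell out the number: when a process is terminated by a signal, its exit status is reported as 128 plus the signal number, and SIGKILL is signal 9, so 128 + 9 = 137.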
Physical memory vs JVM heap
It is important here to understand that the process was killed by the operating system, not the JVM. In fact, when the JVM runs out of heap, it throws an OutOfMemoryError and you get a nice stack trace, whereas exit code 137 means that the process was killed abruptly without any chance to produce a stack trace.
The JVM has a well-known -Xmx option to limit its heap usage. If your process gets killed with exit code 137, you want to lower the heap limit, not raise it, as you want your process to be constrained by the JVM (to get the nice stack trace and diagnostics) and not the kernel.
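For instance (the numbers here are assumed for illustration): on a CI agent with 4 GB of RAM, running the build with -Xmx3g leaves roughly 1 GB for the JVM’s non-heap memory (metaspace, thread stacks, native buffers) and for the rest of the system, whereas -Xmx4g makes an abrupt SIGKILL from the kernel far more likely than a clean OutOfMemoryError.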
What to do
In summary, to fix error 137, you need to take the following measures:
- Talk to your CI administrators to find out how much memory the agent has available.
- Make sure that any heap limits you pass to the JVM are lower than the amount of memory on the machine.
- If your build starts taking too long or fails due to an OutOfMemoryError coming from the JVM, you have to either ask the CI team to give your machine more memory (letting you increase the JVM heap limit), or optimize the memory-hungry part of the build.
1. The actual picture is a bit more complicated. When the Linux kernel is about to run out of memory, it will start killing processes to free some memory up. It uses heuristics to select which process to kill, which means that, in theory, your Java build could be an innocent victim of another process’ memory hunger. However, CI machines don’t usually run much else beyond the builds, so if your build was killed, it is likely that it was the hungry one. ↩︎
Kubernetes Pods Terminated — Exit Code 137
I need some advice on an issue I am facing with k8s 1.14 and running GitLab pipelines on it. Many jobs are throwing exit code 137 errors, and I found that it means that the container is being terminated abruptly.

Cluster information:
- Kubernetes version: 1.14
- Cloud being used: AWS EKS
- Node: C5.4xLarge

After digging in, I found the below logs:

```
kubelet: I0114 03:37:08.639450  4721 image_gc_manager.go:300] [imageGCManager]: Disk usage on image filesystem is at 95% which is over the high threshold (85%). Trying to free 3022784921 bytes down to the low threshold (80%).
kubelet: E0114 03:37:08.653132  4721 kubelet.go:1282] Image garbage collection failed once. Stats initialization may not have completed yet: failed to garbage collect required amount of images. Wanted to free 3022784921 bytes, but freed 0 bytes
kubelet: W0114 03:37:23.240990  4721 eviction_manager.go:397] eviction manager: timed out waiting for pods runner-u4zrz1by-project-12123209-concurrent-4zz892_gitlab-managed-apps(d9331870-367e-11ea-b638-0673fa95f662) to be cleaned up
kubelet: W0114 00:15:51.106881  4781 eviction_manager.go:333] eviction manager: attempting to reclaim ephemeral-storage
kubelet: I0114 00:15:51.106907  4781 container_gc.go:85] attempting to delete unused containers
kubelet: I0114 00:15:51.116286  4781 image_gc_manager.go:317] attempting to delete unused images
kubelet: I0114 00:15:51.130499  4781 eviction_manager.go:344] eviction manager: must evict pod(s) to reclaim ephemeral-storage
kubelet: I0114 00:15:51.130648  4781 eviction_manager.go:362] eviction manager: pods ranked for eviction:
 1. runner-u4zrz1by-project-10310692-concurrent-1mqrmt_gitlab-managed-apps(d16238f0-3661-11ea-b638-0673fa95f662)
 2. runner-u4zrz1by-project-10310692-concurrent-0hnnlm_gitlab-managed-apps(d1017c51-3661-11ea-b638-0673fa95f662)
 3. runner-u4zrz1by-project-13074486-concurrent-0dlcxb_gitlab-managed-apps(63d78af9-3662-11ea-b638-0673fa95f662)
 4. prometheus-deployment-66885d86f-6j9vt_prometheus(da2788bb-3651-11ea-b638-0673fa95f662)
 5. nginx-ingress-controller-7dcc95dfbf-ld67q_ingress-nginx(6bf8d8e0-35ca-11ea-b638-0673fa95f662)
```
And then the pods get terminated resulting in the exit code 137s. Can anyone help me understand the reason and a possible solution to overcome this? Thank you 🙂