A Gentle Inspection of OOMKilled in Kubernetes
Quality of Service in Kubernetes
Kubernetes automatically assigns one of three Quality of Service (often referred to as QoS) classes to every Pod when it is created. When a node runs short on resources, Kubernetes uses these QoS classes to decide which Pods should be evicted or rescheduled first.
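If you want to check which class a running Pod was assigned, one quick way (assuming a Pod named mypod already exists) is to read it straight from the Pod's status:

kubectl get pod mypod -o jsonpath='{.status.qosClass}'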
We’ll examine each class in depth and build a few samples along the way. These classes are:
- Guaranteed
For a Pod to be given a QoS class of Guaranteed:
- Every Container in the Pod must have a memory limit and a memory request.
- For every Container in the Pod, the memory limit must equal the memory request.
- Every Container in the Pod must have a CPU limit and a CPU request.
- For every Container in the Pod, the CPU limit must equal the CPU request.
apiVersion: v1
kind: Pod
metadata:
  labels:
    run: guaranteed
  name: guaranteed
spec:
  containers:
  - image: busybox
    name: someotherapp
    # busybox exits immediately without a command, so keep the container alive
    command: ["sleep", "3600"]
    resources:
      limits:
        cpu: "20m"
        memory: "10Mi"
      requests:
        cpu: "20m"
        memory: "10Mi"
  - image: nginx
    name: guaranteed
    resources:
      limits:
        cpu: "10m"
        memory: "50Mi"
      requests:
        cpu: "10m"
        memory: "50Mi"
Apply the above declaration now:
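Assuming you saved the manifest above as guaranteed.yaml:

kubectl apply -f guaranteed.yaml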
As you can see above, every container in the Pod specifies its resources, with requests equal to limits.
In this scenario, Kubernetes won’t evict this Pod first; in the event of a resource shortage it will look for a Burstable or BestEffort Pod to kill instead.
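You can confirm the assigned class with a quick describe (output abbreviated; the comment shows what you should expect to see):

kubectl describe pod guaranteed | grep "QoS Class"
# QoS Class:  Guaranteed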
If you look at the Kubernetes source code, this classification is implemented in the GetPodQOS function (linked in the resources at the end of this post).
Let’s investigate this further and see how the kernel responds. Since in Linux everything is a file, a file for these events should exist on the K8S node (a Linux machine). This helps you truly understand how and why OOMKilled so commonly occurs in a K8S cluster.
Consider the following scenario:
- You created a Pod via
kubectl run --image=nginx mypod --command -- sleep 1033
K8S created the Pod via the kubelet, which hands it over to containerd (or whichever CRI-compatible container runtime the node uses).
Assume you have access to the K8S node.
Some background info:
In Linux, every process has a unique ID (PID), and for each process a directory named after its PID is created under /proc. Every process also has a so-called "oom_score_adj" value, its Out of Memory score adjustment. When memory runs short, the kernel consults this value and, based on the resulting score, decides which process to kill. The value ranges from -1000 to +1000: the lower it is, the harder the kernel tries to keep the process alive.
Get a list of all active processes on the K8S node, then grep for the one that belongs to the newly created Pod. In the screenshot below you can see the process ID running the sleep 1033 command; this is the Pod we created earlier. Now that we know the process ID, we can view everything regarding that process under /proc/<PID>.
We first determined the process ID of our container and then verified its oom_score_adj, which is 1000. So as soon as the node runs into a memory shortage, the kernel will kill this process first and the Pod will be terminated with an OOMKilled error.
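Here is a minimal sketch of those two steps, assuming you have shell access to the node; 12345 is only a placeholder for the PID you actually find:

# find the container's main process (the sleep command from our Pod)
ps -ef | grep "sleep 1033"

# check its OOM score adjustment (replace 12345 with the real PID)
cat /proc/12345/oom_score_adj
# 1000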
Now let's define dedicated resources for our Pod and see how it will change the oom score:
apiVersion: v1
kind: Pod
metadata:
  labels:
    run: mypod
  name: mypod
spec:
  containers:
  - command:
    - sleep
    - "1033"
    image: nginx
    name: mypod
    resources:
      limits:
        cpu: 10m
        memory: 10Mi
      requests:
        cpu: 10m
        memory: 10Mi
Now the Pod will be created with the Guaranteed QoS class. It also gets a new container ID, and therefore a new process ID on the node. If you cat oom_score_adj again, it now reads -997.
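A rough way to verify this end to end, assuming you saved the manifest above as mypod.yaml and still have shell access to the node (<PID> is a placeholder):

kubectl delete pod mypod --ignore-not-found
kubectl apply -f mypod.yaml
kubectl get pod mypod -o jsonpath='{.status.qosClass}'   # Guaranteed

# back on the node: find the new PID and read the score again
ps -ef | grep "sleep 1033"
cat /proc/<PID>/oom_score_adj
# -997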
The kernel will avoid killing this process for as long as it can, so K8S will keep the Pod running as long as possible.
- Guaranteed
- Burstable
For a Pod to be given a QoS class of Burstable:
- The Pod does not meet the criteria for QoS class Guaranteed.
- At least one Container in the Pod has a memory or CPU request or limit.
If I remove the CPU limit and request from the Pod definition:
apiVersion: v1
kind: Pod
metadata:
  labels:
    run: mypod
  name: mypod
spec:
  containers:
  - command:
    - sleep
    - "1033"
    image: nginx
    name: mypod
    resources:
      # no CPU limit or request: the Pod no longer qualifies as Guaranteed
      limits:
        memory: 10Mi
      requests:
        memory: 10Mi
The QoS is changed to:
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: node.kubernetes.io/memory-pressure:NoSchedule op=Exists
             node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
And let's validate the oom score:
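For a Burstable Pod the kubelet derives oom_score_adj from the container's memory request relative to the node's total memory (roughly 1000 - 1000 * request / capacity), so the exact number depends on your node size; a tiny 10Mi request lands close to, but below, 1000. A sketch of the same check (<PID> is a placeholder):

ps -ef | grep "sleep 1033"
cat /proc/<PID>/oom_score_adj
# somewhere between ~3 and 999, e.g. around 998 for a 10Mi request on an 8GB node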
- Guaranteed
- Burstable
- BestEffort → assigned when no CPU or memory requests or limits are set
For a Pod to be given a QoS class of BestEffort, the Containers in the Pod must not have any memory or CPU limits or requests.
kubectl run --image=nginx mypod --command -- sleep 1033
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Validating the oom_score_adj:
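Same check as before, assuming you can still get a shell on the node (<PID> is a placeholder); for a BestEffort Pod you should be back at the maximum value:

ps -ef | grep "sleep 1033"
cat /proc/<PID>/oom_score_adj
# 1000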
Conclusion:
As is always the case in software engineering, the answer to which method you should use is literally IT DEPENDS on your workload and workflow requirements.
Does your workflow require stability and reliability?
What is the target uptime for your service?
Is the Pod running in dev, staging, or prod?
How does your workload's resource usage aggregate over time?
…
…
and so on.
A TAKEAWAY:
In K8S, cluster resources are not one homogeneous pool. Take a cluster of three nodes, each with 8GB RAM and 4 CPU cores: the total capacity is 24GB RAM and 12 CPU cores. But if you deploy an app with a resource request of 9GB of memory, no single node can satisfy it, so the Pod won’t be scheduled and will remain in the “pending” state.
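If you hit this, the scheduler's events tell you why the Pod is stuck. A hypothetical example (the Pod name bigapp is made up, and the exact wording varies by Kubernetes version):

kubectl describe pod bigapp | grep -A 3 Events
# Warning  FailedScheduling  ...  0/3 nodes are available: 3 Insufficient memory.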
resources:
- https://kubernetes.io/docs/tasks/configure-pod-container/quality-service-pod/#qos-classes
- https://github.com/kubernetes/kubernetes/blob/master/pkg/apis/core/helper/qos/qos.go#L38
- https://mkdev.me/posts/kubernetes-capacity-and-resource-management-it-s-not-what-you-think-it-is