Part 1: How Kube-Scheduler Works in Kubernetes
This is the first part of a series on Kubernetes core components.
Kubernetes consists of multiple moving parts, and the kube-scheduler is one of the core components of the Kubernetes Control Plane, alongside the kube-apiserver, etcd, and the controller manager. Let's take a deeper look at how the Scheduler works. This knowledge will enhance your troubleshooting skills and might also inspire you to write a custom scheduler for your specific workload, where that makes sense.
Kubernetes is driven by a series of events that are either triggered or generated by controllers. An event generated by one controller may trigger another controller to perform a certain task. The Scheduler is itself a controller, composed of many control loops that handle different events.
As we know, everything in Kubernetes is pluggable, or at least tends to be. Starting with version 1.15, the Kubernetes developers refactored the Scheduler around a pluggable scheduling framework. This allows most scheduling features to be implemented as plugins, making the kube-scheduler extensible and maintainable.
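To get a feel for what such a plugin looks like, here is a minimal sketch of a Filter plugin written against the in-tree framework package. The `ZoneFilter` name and its logic are invented for illustration, and the exact interface signatures vary slightly between Kubernetes versions:

```go
package main

import (
	"context"

	v1 "k8s.io/api/core/v1"
	"k8s.io/kubernetes/pkg/scheduler/framework"
)

// ZoneFilter is an illustrative Filter plugin: it rejects nodes that
// are missing a zone label. Name and logic are invented for this sketch.
type ZoneFilter struct{}

// Compile-time check that ZoneFilter satisfies the Filter extension point.
var _ framework.FilterPlugin = &ZoneFilter{}

// Name is how the plugin is referenced in the scheduler configuration.
func (z *ZoneFilter) Name() string { return "ZoneFilter" }

// Filter is called once per candidate node during the scheduling cycle;
// returning an Unschedulable status removes that node from consideration.
func (z *ZoneFilter) Filter(ctx context.Context, state *framework.CycleState, pod *v1.Pod, nodeInfo *framework.NodeInfo) *framework.Status {
	if _, ok := nodeInfo.Node().Labels["topology.kubernetes.io/zone"]; !ok {
		return framework.NewStatus(framework.Unschedulable, "node has no zone label")
	}
	return nil // a nil status means the node passes this filter
}
```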
Let’s take a closer look at the default kube-scheduler architecture:
Scheduling starts with the QueueSort plugin, which orders the Pods waiting in the scheduling queue (by default, by Pod priority):
Queues are an essential part of the scheduler. When we define a Pod (for example, as part of a Deployment, StatefulSet, or DaemonSet), we often set constraints such as pod anti-affinity, tolerations, persistent volumes, or node names to fit the deployment environment. These constraints must be met before the Pod can move to the next phase. The queue mechanism consists of three subqueues:
1. ActiveQ: Holds Pods that are ready to be scheduled; by default, newly created Pods land here.
2. UnschedulableQ: Holds Pods that cannot be scheduled due to certain conditions such as insufficient resources or taints. The `flushUnschedulableQLeftover` function flushes Pods from this queue every 30 seconds so they get another scheduling attempt.
3. BackoffQ: Holds Pods that failed a scheduling attempt, such as when a PersistentVolumeClaim (PVC) was not ready, and postpones them for a backoff period. The `flushBackoffQCompleted` function moves Pods whose backoff has completed back to the ActiveQ.
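To make the flow between these queues concrete, here is a toy model in Go. The real implementation (pkg/scheduler/internal/queue) uses heaps, per-Pod backoff timers, and event-driven moves; this sketch only illustrates how Pods travel between the three subqueues:

```go
package main

import "fmt"

// toyQueue sketches the three subqueues. The real implementation uses
// heaps and per-pod backoff timers; this only illustrates the flow.
type toyQueue struct {
	activeQ        []string // Pods ready to be scheduled
	unschedulableQ []string // Pods that failed scheduling (e.g. taints)
	backoffQ       []string // Pods waiting out a backoff period
}

// flushUnschedulableLeftover mimics flushUnschedulableQLeftover:
// Pods that sat in unschedulableQ long enough are moved out
// so they get another scheduling attempt.
func (q *toyQueue) flushUnschedulableLeftover() {
	q.backoffQ = append(q.backoffQ, q.unschedulableQ...)
	q.unschedulableQ = nil
}

// flushBackoffCompleted mimics flushBackoffQCompleted: Pods whose
// backoff period has expired move back to the ActiveQ.
func (q *toyQueue) flushBackoffCompleted() {
	q.activeQ = append(q.activeQ, q.backoffQ...)
	q.backoffQ = nil
}

func main() {
	q := &toyQueue{unschedulableQ: []string{"web-0"}}
	// In kube-scheduler these run on timers (wait.Until loops);
	// here we call them once to show the movement between queues.
	q.flushUnschedulableLeftover()
	q.flushBackoffCompleted()
	fmt.Println("activeQ:", q.activeQ) // activeQ: [web-0]
}
```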
Every time we create a Pod, either directly (kind: Pod) or indirectly as part of a Deployment, StatefulSet, etc., the scheduling process is split into two phases:
- Scheduling Cycle
This phase is the extensible part of the scheduler, where we can adjust workflows (plugins) programmatically. The scheduling cycle consists of the following plugins, run in order:
1. PreFilter: Pre-processes Pod information and checks cluster/Pod conditions before scheduling.
2. Filter: Filters out nodes that cannot run the Pod.
3. PostFilter: Handles the case where no feasible nodes are found (for example, by attempting preemption).
4. PreScore: Performs internal pre-processing of tasks before scoring.
5. Score: Ranks the remaining nodes by suitability.
6. Reserve: Reserves resources for the Pod on the chosen node.
7. Permit: Approves, denies, or delays the binding of the Pod.
After the Permit plugin approves the Pod, the scheduler sends it to the binding cycle.
Essentially, the goal of the scheduling cycle is to find a node that meets the Pod's conditions.
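The following toy loop captures that filter-then-score essence. Real plugins share state through CycleState and scores are normalized across plugins; the node type and memory-based scoring below are invented for illustration:

```go
package main

import "fmt"

// node is a stand-in for framework.NodeInfo in this toy example.
type node struct {
	name    string
	freeMiB int
}

// feasible mimics the Filter phase: drop nodes that cannot fit the Pod.
func feasible(nodes []node, requestMiB int) []node {
	var out []node
	for _, n := range nodes {
		if n.freeMiB >= requestMiB {
			out = append(out, n)
		}
	}
	return out
}

// pick mimics the Score phase: rank survivors; here, most free memory wins.
func pick(nodes []node) (node, bool) {
	var best node
	found := false
	for _, n := range nodes {
		if !found || n.freeMiB > best.freeMiB {
			best, found = n, true
		}
	}
	return best, found
}

func main() {
	nodes := []node{{"node-a", 512}, {"node-b", 4096}, {"node-c", 1024}}
	// A Pod requesting 1000 MiB: node-a is filtered out, node-b scores best.
	if n, ok := pick(feasible(nodes, 1000)); ok {
		fmt.Println("reserved:", n.name) // reserved: node-b
	}
}
```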
- Binding Cycle
Unlike scheduling cycles, which run serially, binding cycles may run concurrently: while one Pod is being bound, the scheduler can already start scheduling the next one.
1. PreBind: Fulfills requirements before the binding phase, e.g., provisioning a network volume.
2. Bind: Performs the actual binding of the Pod to the Node.
3. PostBind: Runs at the end of the binding cycle to tidy up associated resources.
At the end of the binding cycle, the scheduler saves the binding information to the API server.
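Under the hood, binding is simply a write of a Binding subresource to the API server. Here is a sketch with client-go, assuming in-cluster credentials and with error handling trimmed; the Pod and node names are made up:

```go
package main

import (
	"context"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// bindPod shows what a Bind plugin ultimately does: create a Binding
// subresource that sets the Pod's nodeName on the API server.
func bindPod(ctx context.Context, cs kubernetes.Interface, ns, podName, nodeName string) error {
	binding := &v1.Binding{
		ObjectMeta: metav1.ObjectMeta{Namespace: ns, Name: podName},
		Target:     v1.ObjectReference{Kind: "Node", Name: nodeName},
	}
	return cs.CoreV1().Pods(ns).Bind(ctx, binding, metav1.CreateOptions{})
}

func main() {
	cfg, err := rest.InClusterConfig() // assumes we run inside the cluster
	if err != nil {
		panic(err)
	}
	cs := kubernetes.NewForConfigOrDie(cfg)
	// "node-b" is a made-up node name for the sketch.
	if err := bindPod(context.Background(), cs, "default", "nginxserver", "node-b"); err != nil {
		panic(err)
	}
}
```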
Conclusion
Let’s say you’ve created a Pod via kubectl:
kubectl run nginxserver --image nginx --port 80
1. Kubectl sends a request to the API server for the new Pod.
2. API Server: The API server processes the request and saves the new Pod's configuration in the etcd datastore.
3. Scheduler: The scheduler watches for the new Pod and starts processing it as explained above.
3.1. Queue the Pod
3.2. Pod goes through Scheduling Cycle
3.3. The Node reserved during the Scheduling Cycle is used to bind the Pod.
4. The scheduler updates the Pod's spec in the API server, assigning the selected Node (spec.nodeName).
5. The Kubelet on the selected node detects the new Pod assignment, communicates with the container runtime, and handles the rest of the process.
6. Finally, the Kubelet continuously monitors the Pods and reports their status to the API Server.
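Step 3's "watches for the new Pod" can be made concrete: the scheduler effectively watches for Pods whose spec.nodeName is still empty. A minimal client-go sketch, again assuming in-cluster credentials:

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig() // assumes in-cluster credentials
	if err != nil {
		panic(err)
	}
	cs := kubernetes.NewForConfigOrDie(cfg)

	// An empty spec.nodeName means the Pod has not been scheduled yet;
	// this field selector is how "unscheduled" is expressed to the API.
	w, err := cs.CoreV1().Pods("").Watch(context.Background(), metav1.ListOptions{
		FieldSelector: "spec.nodeName=",
	})
	if err != nil {
		panic(err)
	}
	for ev := range w.ResultChan() {
		fmt.Println("event:", ev.Type) // every new unscheduled Pod shows up here
	}
}
```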
Since the Scheduler Framework is pluggable, you can customize the scheduler or write your own to better fit your workflows. The default scheduler is designed to be general-purpose, covering as many workflows as possible. For more information about custom schedulers, see the references below.
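If you do go down the custom-scheduler road, the usual entry point is to wrap the upstream scheduler command and register your plugin with it. A sketch assuming the ZoneFilter type from the earlier example lives in the same package; note that the plugin factory signature has changed across Kubernetes releases:

```go
package main

import (
	"context"
	"os"

	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/kubernetes/cmd/kube-scheduler/app"
	"k8s.io/kubernetes/pkg/scheduler/framework"
)

// New is the plugin factory; the exact factory signature has changed
// across Kubernetes releases, so match it to the version you vendor.
// &ZoneFilter{} refers to the Filter plugin sketched earlier, assumed
// to live in this same package.
func New(_ context.Context, _ runtime.Object, _ framework.Handle) (framework.Plugin, error) {
	return &ZoneFilter{}, nil
}

func main() {
	// Register the plugin under a name that a KubeSchedulerConfiguration
	// profile can then enable at the filter extension point.
	cmd := app.NewSchedulerCommand(app.WithPlugin("ZoneFilter", New))
	if err := cmd.Execute(); err != nil {
		os.Exit(1)
	}
}
```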
References:
From the links below, you can read about each step of the scheduling process in detail.
1. https://github.com/kubernetes/community/blob/f03b6d5692bd979f07dd472e7b6836b2dad0fd9b/contributors/devel/sig-scheduling/scheduler_queues.md
2. Core Kubernetes, https://www.manning.com/books/core-kubernetes
3. https://kubernetes.io/docs/concepts/scheduling-eviction/scheduling-framework/
4. https://github.com/kubernetes/enhancements/tree/master/keps/sig-scheduling/624-scheduling-framework