Customizing K8S scheduler

Mar 22, 2022 Patrick Fu

Photo by Alvaro Reyes on Unsplash

Author:Patrick Fu(CEO of Gemini Open Cloud)

The default scheduler is a very matured component of Kubernetes cluster management. It is working quite well for most of the cases. However, there are still some use cases that the Kube scheduler is not designed for. For example:

  1. The default Kube scheduler is a Task scheduler, not a Job scheduler. The default scheduler schedules one POD at a time based on the scheduler policy. But if you have a job that requires multiple tasks (PODs) to coordinate with each other, there is no control on when the tasks will be completed. This can lead to unnecessary waiting and overall decrease in performance for the job.
  2. A lot of modern computing tasks (e.g. Machine Learning) requires accelerators, such as GPU or FPGA. However, the default scheduler only considers worker node resources such as CPU core, memory, network, and disk. So the scheduler may end up selecting a sub-optimal worker node for a POD.

Kubernetes does allow users to extend the default scheduler. In fact, Kubernetes allows multiple schedulers to co-exist in a cluster, so you can still run the default scheduler for most of the pods, but run your own custom scheduler for some special pods. There are a couple of ways to customize the default scheduler.

Custom Resource Definition (CRD)

CRD is a mechanism in Kubernetes to allow users to define their own controller for custom objects. The idea of this approach is to implement your own scheduler for special type of K8S POD resources.

The advantage of the CRD approach is that it is entirely isolated. Such scheduler change is applied only to custom resources. It will not impact the system behavior on existing object. Also, the CRD controller is independent of the default controller. The only touchpoint is via the API server. W hen Kubernetes has to be upgraded, it does not require extra code modification and testing of the CRD controller.

The disadvantage of the CRD approach is that it required the application to modify its POD specification to use the special custom defined resources. Also, the CRD scheduler co-exist with the default Kube scheduler. Although they are independent processes, the co-existence of multiple schedulers might have more problems than it sounds like, such as a distributed lock and cache synchronization.

Scheduler Extender

The other approach to customize Kube scheduler is via the Scheduler Extender. The mechanism here is a webhook. A webhook is a method of augmenting or altering the behavior of a web application with custom callbacks. The scheduler extender approach is the most viable solution at this moment to extend scheduler with minimal efforts and is also compatible with the upstream scheduler. The phrase “scheduler extender” simply means configurable webhooks, which corresponds to the two major phases (“Predicates” and “Priorities”) in a scheduling cycle.

But there are disadvantages to the extender approach as well. First, the scheduler extender only accommodates for limited extension points. It can only involve at the end of the Filtering and Scoring phase, so it may not provide enough flexibility for customization.

The other drawback is communications cost. For webhook, the data is transferred in HTTP between the default scheduler and the extender. The IPC cost may impact the overall performance of the Kube scheduler. We will describe the new scheduler Framework in the next section. The scheduler framework employs Plugins instead of webhooks. The Plugins are compiled into the same binary as the scheduler, so there is no overhead in inter-process communications.

K8S scheduler framework

In 1.19, Kubernetes introduced the new Scheduler Framework. This is a framework to allow the user to extend the behavior of Kube scheduler at much finer granularity (known as Stages) than the previous versions. Users can customize scheduler Profiles for specific POD kinds. Each stage is exposed as an Extension Point. Plugins provide scheduler customizations by implementing one or more of these extension points. Users define profile through the kube-scheduler configuration. Each profile configuration is separated into two parts:

The Scheduling and Binding cycle are composed of stages that are executed sequentially to calculate the pod placement. The native scheduling behavior is implemented using the Plugin pattern as well, in the same way as the custom extensions, making the core of the kube-scheduler lightweight as the main scheduling logic is placed in the plugins.

Figure 3 shows an instance where the user specifies 2 profiles for the same scheduler. As you can see, the new scheduler framework allows users to defined plugins for 10 different extension points. A brief description of the extension points follows:

Figure 3 Kubernetes Scheduler Framework
Figure 3 Kubernetes Scheduler Framework

Extension points & plugin’s

We also customized k8s scheduler to fit machine learning environment. Please refer to out next article to understand how to leverage it.

K8s Scheduler Series Reference


Gemini Open Cloud is a CNCF member and a CNCF-certified Kubernetes service provider. With more than ten years of experience in cloud technology, Gemini Open Cloud is an early leader in cloud technology in Taiwan.

We are currently a national-level AI cloud software and Kubernetes technology and service provider. We are also the best partner for many enterprises to import and management container platforms.

Except for the existing products “Gemini AI Console” and “GOC API Gateway”, Gemini Open Cloud also provides enterprises consulting and importing cloud native and Kubernetes-related technical services. Help enterprises embrace Cloud Native and achieve the goal of digital transformation.


相關文章

回雙子星技術部落格列表

Gemini AI Console

熱門文章


kubernetes professional service

關於我們

雙子星雲端是混合多雲技術的領導者,是國際認證之 KCSP - Kubernetes 服務提供商,同時也為 CNCF 雲原生計算基金會會員。

雙子星的雲端專家擁有 Kubernetes、OpenStack 與 Google Cloud Platform 等多項證照,我們的軟體至今已為上百家機構和數千台的 CPU/GPU 伺服器提供雲端服務。