Authors: Aldo Culquicondor (Google)
The Kubernetes 1.26 release includes a stable implementation of the Jobcontroller that can reliably track a large amount of Jobs with high levels of parallelism. SIG Appsand WG Batchhave worked on this foundational improvement since Kubernetes 1.22. After multiple iterations and scale verifications, this is now the default implementation of the Job controller.
Paired with the Indexed completion mode, the Job controller can handle massively parallel batch Jobs, supporting up to 100k concurrent Pods.
The new implementation also made possible the development of Pod failure policy, which is in beta in the 1.26 release.
How do I use this feature?
To use Job tracking with finalizers, upgrade to Kubernetes 1.25 or newer and create new Jobs. You can also use this feature in v1.23 and v1.24, if you have the ability to enable the
JobTrackingWithFinalizers feature gate.
If your cluster runs Kubernetes 1.26, Job tracking with finalizers is a stable feature. For v1.25, it's behind that feature gate, and your cluster administrators may have explicitly disabled it - for example, if you have a policy of not using beta features.
Jobs created before the upgrade will still be tracked using the legacy behavior. This is to avoid retroactively adding finalizers to running Pods, which might introduce race conditions.
For maximum performance on large Jobs, the Kubernetes project recommends using the Indexed completion mode. In this mode, the control plane is able to track Job progress with less API calls.
If you are a developer of operator(s) for batch, HPC,AI, MLor related workloads, we encourage you to use the Job API to delegate accurate progress tracking to Kubernetes. If there is something missing in the Job API that forces you to manage plain Pods, the Working Group Batchwelcomes your feedback and contributions.
During the development of the feature, the control plane added the annotation
batch.kubernetes.io/job-trackingto the Jobs that were created when the feature was enabled. This allowed a safe transition for older Jobs, but it was never meant to stay.
In the 1.26 release, we deprecated the annotation
batch.kubernetes.io/job-trackingand the control plane will stop adding it in Kubernetes 1.27. Along with that change, we will remove the legacy Job tracking implementation. As a result, the Job controller will track all Jobs using finalizers and it will ignore Pods that don't have the aforementioned finalizer.
Before you upgrade your cluster to 1.27, we recommend that you verify that there are no running Jobs that don't have the annotation, or you wait for those jobs to complete. Otherwise, you might observe the control plane recreating some Pods. We expect that this shouldn't affect any users, as the feature is enabled by default since Kubernetes 1.25, giving enough buffer for old jobs to complete.
What problem does the new implementation solve?
Generally, Kubernetes workload controllers, such as ReplicaSet or StatefulSet, rely on the existence of Pods or other objects in the API to determine the status of the workload and whether replacements are needed. For example, if a Pod that belonged to a ReplicaSet terminates or ceases to exist, the ReplicaSet controller needs to create a replacement Pod to satisfy the desired number of replicas (
Since its inception, the Job controller also relied on the existence of Pods in the API to track Job status. A Job has completionand failure handlingpolicies, requiring the end state of a finished Pod to determine whether to create a replacement Pod or mark the Job as completed or failed. As a result, the Job controller depended on Pods, even terminated ones, to remain in the API in order to keep track of the status.
This dependency made the tracking of Job status unreliable, because Pods can be deleted from the API for a number of reasons, including:
- The garbage collector removing orphan Pods when a Node goes down.
- The garbage collector removing terminated Pods when they reach a threshold.
- The Kubernetes scheduler preempting a Pod to accomodate higher priority Pods.
- The taint manager evicting a Pod that doesn't tolerate a
- External controllers, not included as part of Kubernetes, or humans deleting Pods.
The new implementation
When a controller needs to take an action on objects before they are removed, it should add a finalizerto the objects that it manages. A finalizer prevents the objects from being deleted from the API until the finalizers are removed. Once the controller is done with the cleanup and accounting for the deleted object, it can remove the finalizer from the object and the control plane removes the object from the API.
This is what the new Job controller is doing: adding a finalizer during Pod creation, and removing the finalizer after the Pod has terminated and has been accounted for in the Job status. However, it wasn't that simple.
The main challenge is that there are at least two objects involved: the Pod and the Job. While the finalizer lives in the Pod object, the accounting lives in the Job object. There is no mechanism to atomically remove the finalizer in the Pod and update the counters in the Job status. Additionally, there could be more than one terminated Pod at a given time.
To solve this problem, we implemented a three staged approach, each translating to an API call.
- For each terminated Pod, add the unique ID (UID) of the Pod into short-lived lists stored in the
.statusof the owning Job (.status.uncountedTerminatedPods).
- Remove the finalizer from the Pods(s).
- Atomically do the following operations:
- remove UIDs from the short-lived lists
- increment the overall
failedcounters in the
statusof the Job.
Additional complications come from the fact that the Job controller might receive the results of the API changes in steps 1 and 2 out of order. We solved this by adding an in-memory cache for removed finalizers.
Still, we faced some issues during the beta stage, leaving some pods stuck with finalizers in some conditions (#108645,#109485, and#111646). As a result, we decided to switch that feature gate to be disabled by default for the 1.23 and 1.24 releases.
Once resolved, we re-enabled the feature for the 1.25 release. Since then, we have received reports from our customers running tens of thousands of Pods at a time in their clusters through the Job API. Seeing this success, we decided to graduate the feature to stable in 1.26, as part of our long term commitment to make the Job API the best way to run large batch Jobs in a Kubernetes cluster.
To learn more about the feature, you can read the KEP.
As with any Kubernetes feature, multiple people contributed to getting this done, from testing and filing bugs to reviewing code.
On behalf of SIG Apps, I would like to especially thank Jordan Liggitt (Google) for helping me debug and brainstorm solutions for more than one race condition and Maciej Szulik (Red Hat) for his thorough reviews.