Kubernetes v1.36: Workload-Aware Scheduling Evolved with PodGroup API
Kubernetes v1.36 brings a major leap forward in scheduling capabilities, especially for AI/ML and batch workloads. The new architecture cleanly separates API concerns: the Workload API now serves as a static template, while the new PodGroup API handles all runtime state. Alongside this split, the release introduces topology-aware scheduling, workload-aware preemption, dynamic resource allocation (DRA) for PodGroups, and the first phase of Job controller integration. Below, we answer key questions about these advancements.
What are the major scheduling improvements in Kubernetes v1.36?
Kubernetes v1.36 introduces a significant architectural evolution by separating API concerns. The Workload API becomes a static template, while the new PodGroup API manages runtime state, including per-replica sharding of status updates. The kube-scheduler gains a dedicated PodGroup scheduling cycle that enables atomic workload processing, paving the way for future enhancements. This release also debuts the first iterations of topology-aware scheduling and workload-aware preemption. Additionally, ResourceClaim support unlocks Dynamic Resource Allocation (DRA) for PodGroups. Finally, the first phase of integration between the Job controller and the new APIs demonstrates real-world readiness.
How has the Workload API changed in v1.36?
In v1.35, Pod groups and their runtime states were embedded within the Workload resource. Kubernetes v1.36 decouples these concepts: the Workload now serves solely as a static template object, while the PodGroup manages the runtime state. This separation improves performance and scalability because the PodGroup API allows status updates to be sharded per replica. The scheduler can directly read the PodGroup—which contains all the information it requires—without needing to watch or parse the Workload object itself. This streamlined design is part of the scheduling.k8s.io/v1alpha2 API group, which completely replaces the previous v1alpha1 version.
What is the PodGroup API and how does it work?
The PodGroup API is a new runtime object that holds the actual scheduling policy and references the template from which it was created. Controllers (like the Job controller) define a Workload object with podGroupTemplates that contain fields like minCount for gang scheduling. They then stamp out runtime PodGroup instances based on those templates. Each PodGroup has a status containing conditions that mirror the states of individual Pods, reflecting the overall scheduling state of the group. This runtime object is what the scheduler uses to make decisions, enabling atomic handling of groups rather than individual Pods.
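To make the template/runtime split concrete, here is a minimal sketch of a Workload template and a PodGroup stamped from it. The field names (`podGroupTemplates`, `workloadRef`, the condition types) follow the shapes described in this article and are illustrative; the final v1alpha2 schema may differ.

```yaml
# Static template owned by a controller such as the Job controller.
apiVersion: scheduling.k8s.io/v1alpha2
kind: Workload
metadata:
  name: trainer
spec:
  podGroupTemplates:
  - name: workers
    schedulingPolicy:
      gang:
        minCount: 4        # gang is schedulable only if 4 Pods fit
---
# Runtime instance stamped out from the "workers" template above.
apiVersion: scheduling.k8s.io/v1alpha2
kind: PodGroup
metadata:
  name: trainer-workers-0
spec:
  workloadRef:             # back-reference to the template (assumed field name)
    name: trainer
    podGroupTemplate: workers
  schedulingPolicy:
    gang:
      minCount: 4
status:
  conditions:
  - type: Scheduled        # mirrors the aggregate state of the member Pods
    status: "False"
```

Because the PodGroup carries the full scheduling policy, the scheduler watches only PodGroups, never the Workload template itself.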
How does gang scheduling work with the new architecture?
Gang scheduling ensures that all Pods in a group are scheduled together or none at all. In v1.36, it is implemented using the PodGroup API's schedulingPolicy.gang.minCount field. For example, a Workload template can specify minCount: 4, meaning the gang is schedulable only if at least 4 Pods can run concurrently. The scheduler’s new PodGroup scheduling cycle processes the entire group atomically. This approach builds on the basic gang scheduling introduced in v1.35 but now uses a cleaner, more scalable model where the Workload is just a template and the PodGroup holds the runtime state.
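A member Pod has to be associated with its group so the scheduler can count it toward `minCount`. The article does not specify the linkage mechanism, so the label below is purely hypothetical; the real API may use a dedicated Pod spec field instead.

```yaml
# Hypothetical sketch: one worker Pod joining the gang via an assumed label.
apiVersion: v1
kind: Pod
metadata:
  name: trainer-worker-0
  labels:
    scheduling.k8s.io/pod-group: trainer-workers-0   # assumed linkage, not confirmed
spec:
  restartPolicy: Never
  containers:
  - name: worker
    image: registry.example.com/trainer:latest       # placeholder image
    resources:
      requests:
        cpu: "4"
```

With `minCount: 4`, the scheduler's PodGroup cycle binds none of these Pods until it has found feasible nodes for at least four of them.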
What topology-aware scheduling features are introduced?
Kubernetes v1.36 debuts the first iterations of topology-aware scheduling for workloads. While the feature is still in its early stages, it aims to consider the physical or logical topology of the cluster (e.g., nodes, zones, racks) when scheduling PodGroups. This is particularly important for distributed AI/ML training jobs or latency-sensitive batch workloads that benefit from Pods being placed close together or spread across failure domains. The current release lays the foundation for more sophisticated topology handling in future versions.
How does workload-aware preemption improve scheduling?
Workload-aware preemption extends Kubernetes’ existing preemption logic to consider workload-level semantics. In v1.36, the scheduler can preempt lower-priority Pods from different PodGroups to make room for a higher-priority group, all while respecting gang constraints and minimum availability. This is a critical improvement for batch workloads where preempting individual Pods could break group integrity. The new preemption logic is designed to work hand-in-hand with the PodGroup API, ensuring that preemption decisions are made holistically rather than on a per-Pod basis.
What is the significance of ResourceClaim support for PodGroups?
With ResourceClaim support, Kubernetes v1.36 unlocks Dynamic Resource Allocation (DRA) for PodGroups. DRA allows workloads to request dedicated hardware resources (e.g., GPUs, FPGAs, or specialized network interfaces) in a flexible, driver-managed manner. Previously, DRA was only available for individual Pods. Now, a PodGroup can declare a ResourceClaim that is shared or distributed among its member Pods, enabling sophisticated resource management for AI/ML training jobs that require consistent access to specialized hardware across all replicas.
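A rough sketch of what this could look like: a DRA ResourceClaim requesting a device, referenced from a PodGroup. The ResourceClaim shape follows the existing `resource.k8s.io` API; the `resourceClaims` field on PodGroup is an assumption based on this article's description, not a confirmed schema.

```yaml
# DRA claim for one device from a hypothetical "gpu.example.com" device class.
apiVersion: resource.k8s.io/v1
kind: ResourceClaim
metadata:
  name: trainer-gpus
spec:
  devices:
    requests:
    - name: gpu
      exactly:
        deviceClassName: gpu.example.com
        count: 1
---
apiVersion: scheduling.k8s.io/v1alpha2
kind: PodGroup
metadata:
  name: trainer-workers-0
spec:
  schedulingPolicy:
    gang:
      minCount: 4
  resourceClaims:                  # assumed field: claim shared by member Pods
  - name: gpus
    resourceClaimName: trainer-gpus
```

The point of group-level claims is that allocation is decided together with gang admission, so a training job never starts with only some replicas holding their devices.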
How does the Job controller integrate with the new APIs?
The v1.36 release delivers the first phase of integration between the Kubernetes Job controller and the new Workload/PodGroup APIs. This integration demonstrates real-world readiness by allowing Jobs to automatically create Workload templates and manage PodGroup instances. For example, when a user submits a Job, the controller can generate a PodGroup based on the Job's parallelism and completion requirements, leveraging gang scheduling and other advanced features without manual API calls. This paves the way for seamless adoption of workload-aware scheduling in existing batch and data-processing pipelines.
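As an illustration of that flow, consider an ordinary Job; how the controller opts a Job into group scheduling is not detailed in this article, so treat the derivation below as a sketch rather than the confirmed mechanism.

```yaml
# A standard batch/v1 Job with four parallel replicas.
apiVersion: batch/v1
kind: Job
metadata:
  name: distributed-train
spec:
  parallelism: 4
  completions: 4
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: trainer
        image: registry.example.com/trainer:latest   # placeholder image
# In the first-phase integration, the Job controller would derive a Workload
# template and stamp out a PodGroup with gang minCount equal to parallelism
# (here 4), so all replicas are admitted together or not at all.
```

Existing pipelines would thus gain gang semantics from the Job spec they already write, without issuing Workload or PodGroup API calls themselves.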