Kubernetes v1.36: Workload-Aware Scheduling: AI/ML Workloads Finally Get Fair Treatment
Pod-by-Pod scheduling is outdated
If you run AI/ML training on Kubernetes, you've probably seen this: 10 Pods need to run together for model training, but the scheduler only manages to schedule 8. The remaining 2 wait for resources while those 8 Pods consume resources without doing anything useful. Classic resource waste.
Kubernetes has always used Pod-by-Pod scheduling: the scheduler evaluates each Pod individually, finds a suitable Node, and binds it. This works fine for typical microservices, but it's completely wrong for batch/AI workloads where a group of Pods must run together or not run at all.
Kubernetes v1.36, released on May 13, 2026, solves this by introducing Workload-Aware Scheduling, a scheduler that understands the relationships between Pods within a workload.
PodGroup API: Separating template from runtime
The biggest architectural change in v1.36 is separating the Workload API from the PodGroup API.
In v1.35, both the template and runtime state lived in the same Workload resource. This created a problem: the scheduler had to watch and parse the entire Workload object even though it only needed scheduling information.
V1.36 solves this by splitting them:
- Workload API: static template only (defines what a group of Pods looks like)
- PodGroup API: manages runtime state (actual scheduling status)
    # Workload: static template
    apiVersion: scheduling.k8s.io/v1alpha2
    kind: Workload
    metadata:
      name: training-job-workload
    spec:
      podGroupTemplates:
      - name: workers
        schedulingPolicy:
          gang:
            minCount: 4  # Need at least 4 Pods running together
    ---
    # PodGroup: runtime state
    apiVersion: scheduling.k8s.io/v1alpha2
    kind: PodGroup
    metadata:
      name: training-job-workers-pg
    spec:
      podGroupTemplateRef:
        workload:
          workloadName: training-job-workload
          podGroupTemplateName: workers
      schedulingPolicy:
        gang:
          minCount: 4
This separation lets the scheduler read only PodGroup objects and skip parsing Workloads, which means less work per scheduling cycle, especially at scale.
Gang scheduling: All or nothing
This is the feature I've been waiting for. Gang scheduling ensures that either all Pods in a group get scheduled, or none of them do.
Previously, if you needed 8 GPU Pods for distributed training:
- Scheduler might schedule 5 Pods → 5 Pods consume GPUs but training can't run
- 3 Pods wait → resources wasted
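To make the failure mode concrete, here is a minimal sketch of Pod-by-Pod binding. It is a toy model, not the real scheduler: the function name, the dict-based cluster, and the first-fit rule are all assumptions made for illustration.

```python
def schedule_pod_by_pod(free_gpus, pod_names, gpus_per_pod=1):
    """Bind each Pod independently (toy model); return (bound, pending).

    free_gpus: {node_name: free GPU count}, mutated as Pods are bound.
    """
    bound, pending = {}, []
    for pod in pod_names:
        # First node with enough free GPUs wins; no knowledge of the group.
        node = next(
            (n for n, free in free_gpus.items() if free >= gpus_per_pod), None
        )
        if node is None:
            pending.append(pod)
            continue
        free_gpus[node] -= gpus_per_pod  # the GPU is held from this point on
        bound[pod] = node
    return bound, pending


cluster = {"gpu-node-1": 3, "gpu-node-2": 2}  # 5 free GPUs in total
workers = [f"worker-{i}" for i in range(8)]   # training needs all 8
bound, pending = schedule_pod_by_pod(cluster, workers)
print(len(bound), pending)  # 5 ['worker-5', 'worker-6', 'worker-7']
```

Five workers now hold every GPU in the toy cluster while the job still cannot start, which is exactly the waste described above.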
With gang scheduling in v1.36:
- Scheduler evaluates the entire group in one atomic cycle
- If resources are insufficient for minCount Pods → all wait
- If sufficient → all get bound simultaneously
How it works:
- Scheduler takes a cluster state snapshot (avoids race conditions)
- Runs PodGroup scheduling algorithm: finds Node placement for all Pods
- Applies the decision atomically: success → bind all; failure → return to queue
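The three steps above can be sketched in a few lines of Python. This is only an illustration of the all-or-nothing idea under simplifying assumptions: the function name, the dict-based cluster model, and the first-fit placement are mine, not the scheduler's actual algorithm.

```python
def schedule_gang(free_gpus, pod_requests, min_count):
    """Return {pod: node} for the group, or None if fewer than min_count fit.

    free_gpus: {node_name: free GPU count} (hypothetical cluster model)
    pod_requests: {pod_name: GPUs requested}
    """
    # Step 1: plan against a snapshot so live state is untouched on failure.
    snapshot = dict(free_gpus)
    placement = {}
    # Step 2: try to place every Pod in the group (first-fit, illustrative).
    for pod, need in pod_requests.items():
        for node, free in snapshot.items():
            if free >= need:
                snapshot[node] = free - need
                placement[pod] = node
                break
    # Step 3: all-or-nothing decision.
    if len(placement) < min_count:
        return None  # nothing is bound; the group returns to the queue
    free_gpus.update(snapshot)  # commit every binding in one step
    return placement


cluster = {"node-a": 2, "node-b": 1}
group = {"w0": 1, "w1": 1, "w2": 1, "w3": 1}
print(schedule_gang(cluster, group, min_count=4))  # None
print(cluster)  # {'node-a': 2, 'node-b': 1}
```

Because the plan runs on a copy, a failed attempt leaves the cluster exactly as it was: no GPU is held by a group that cannot start.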
The nice part: if new Pods are added to a group after some Pods are already running, the scheduler evaluates the new Pods without evicting already-running ones.
Topology-aware scheduling: Pods together, not scattered
Distributed training is sensitive to network latency. If 8 GPU Pods are scattered across the cluster, each on a different rack, inter-Pod bandwidth suffers and training slows down.
Topology-aware scheduling lets you bind a PodGroup to a specific topology domain:
    apiVersion: scheduling.k8s.io/v1alpha2
    kind: PodGroup
    metadata:
      name: topology-aware-workers-pg
    spec:
      schedulingPolicy:
        gang:
          minCount: 4
      schedulingConstraints:
        topology:
        - key: topology.kubernetes.io/rack
The scheduler finds Node combinations within the same rack, evaluates whether the entire PodGroup fits, then selects optimal placement based on resource efficiency.
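As a rough mental model of that selection, the sketch below groups nodes by rack label, discards racks that cannot hold the whole group, and picks the tightest fit. The function name, the tuple-based node model, and the "least leftover GPUs" scoring are assumptions for illustration; the article does not say how the real scheduler scores resource efficiency.

```python
from collections import defaultdict


def pick_rack(nodes, pod_gpus, pod_count):
    """Return (rack, [node, ...]) placing pod_count Pods in one rack, or None.

    nodes: list of (node_name, rack_label, free_gpus) tuples (hypothetical model)
    pod_gpus: GPUs each Pod needs (must be > 0)
    """
    by_rack = defaultdict(list)
    for name, rack, free in nodes:
        by_rack[rack].append((name, free))

    best = None
    for rack, members in sorted(by_rack.items()):
        # One entry per Pod-sized slot available on each node in this rack.
        assignment = []
        for name, free in members:
            assignment.extend([name] * (free // pod_gpus))
        if len(assignment) < pod_count:
            continue  # the whole group does not fit in this rack
        leftover = sum(free for _, free in members) - pod_count * pod_gpus
        # Assumed scoring: prefer the rack with the fewest GPUs left over.
        if best is None or leftover < best[0]:
            best = (leftover, rack, assignment[:pod_count])
    return None if best is None else (best[1], best[2])


nodes = [
    ("n1", "rack-a", 2), ("n2", "rack-a", 1),
    ("n3", "rack-b", 4), ("n4", "rack-b", 4),
]
print(pick_rack(nodes, pod_gpus=1, pod_count=4))
# ('rack-b', ['n3', 'n3', 'n3', 'n3'])
```

With 4 Pods, rack-a (3 free GPUs) is rejected outright rather than partially filled, so the group never straddles racks.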
Currently, topology-aware scheduling doesn't trigger preemption (evicting other Pods to make room). Support for that is expected in v1.37.
Workload-aware preemption: Priority done right
Preemption in K8s isn't new: high-priority Pods can evict low-priority ones. But old preemption worked on individual Nodes only.
Workload-aware preemption in v1.36 treats the entire PodGroup as a single preemptor unit. Instead of finding victims on each Node separately, the scheduler searches across the entire cluster, preempting Pods from multiple Nodes simultaneously to make room for the PodGroup.
Two new concepts:
- PodGroup priority: overrides individual Pod priority
- PodGroup disruptionMode: dictates behavior when a PodGroup gets preempted
ResourceClaim for PodGroup: Smarter GPU allocation
V1.36 extends Dynamic Resource Allocation (DRA) to PodGroups. You can request GPUs, FPGAs, or specialized hardware for an entire group of Pods instead of individual Pods.
This is particularly useful for AI training when you need to allocate the same GPU type across all worker nodes.
Personal take: A necessary update
I've been running K8s for AI workloads for over 2 years. The gang scheduling problem has been the biggest pain point, especially when using Kubeflow or Ray on K8s.
Before v1.36, the common workaround was custom schedulers (like Volcano or the Coscheduling plugin). Not terrible, but high maintenance burden and poor integration with upstream features.
V1.36 makes gang scheduling a first-class citizen. No more custom schedulers. No more workarounds.
However, there are a few limitations to note:
- Works well for homogeneous Pod groups (identical Pods)
- Heterogeneous Pod groups and inter-Pod dependencies aren't guaranteed to find placement
- API is still v1alpha2: not stable yet, may change in later releases
If you're using Volcano or Coscheduling plugins, no need to migrate immediately. But you should start testing v1.36 on staging clusters to be ready when the API stabilizes.
Summary: Real-world impact
| Feature | Before v1.36 | After v1.36 |
|---|---|---|
| Gang scheduling | Custom scheduler (Volcano) | Built-in, first-class |
| Pod group state | Mixed in Workload | Separated via PodGroup API |
| Topology constraints | Manual affinity rules | Declarative on PodGroup |
| Preemption | Per-Node | Per-PodGroup, cluster-wide |
| DRA for groups | Not supported | ResourceClaim support |
If you run AI/ML training on K8s, v1.36 is worth upgrading to. If you only run typical microservices, you can wait for the next stable release.