Kubernetes v1.36: Workload-Aware Scheduling - AI/ML Workloads Finally Get Fair Treatment

Karify98 & Amy 🌸

Pod-by-Pod scheduling is outdated

If you run AI/ML training on Kubernetes, you've probably seen this: 10 Pods need to run together for model training, but the scheduler only manages to schedule 8. The remaining 2 wait for resources while those 8 Pods consume resources without doing anything useful. Classic resource waste.

Kubernetes has always used Pod-by-Pod scheduling: the scheduler evaluates each Pod individually, finds a suitable Node, and binds it. This works fine for typical microservices, but it's completely wrong for batch/AI workloads where a group of Pods must run together or not run at all.
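
To make the failure mode concrete, here is a minimal Python sketch of greedy per-Pod binding. The function name and the cluster numbers are made up for illustration; this is a toy model, not the real kube-scheduler logic.

```python
def schedule_pod_by_pod(free_gpus_per_node, pods_needed, gpus_per_pod=1):
    """Greedily bind Pods one at a time, the way a per-Pod scheduler would.

    Returns the number of Pods bound. Bound Pods keep their GPUs even if
    the rest of the group never fits.
    """
    free = list(free_gpus_per_node)  # copy so the caller's list is untouched
    bound = 0
    for _ in range(pods_needed):
        for i, gpus in enumerate(free):
            if gpus >= gpus_per_pod:
                free[i] -= gpus_per_pod
                bound += 1
                break
        else:
            break  # no Node fits this Pod, so it (and the rest) stay Pending
    return bound


# Hypothetical cluster: 4 Nodes with 2 free GPUs each; the job needs 10 Pods.
bound = schedule_pod_by_pod([2, 2, 2, 2], pods_needed=10)
print(f"bound={bound}, pending={10 - bound}")  # bound=8, pending=2
```

Eight Pods hold eight GPUs while the job makes no progress, which is exactly the situation described above.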

Kubernetes v1.36, released on May 13, 2026, addresses this by introducing Workload-Aware Scheduling: the scheduler now understands the relationships between Pods within a workload.

PodGroup API: Separating template from runtime

The biggest architectural change in v1.36 is separating the Workload API from the PodGroup API.

In v1.35, both the template and runtime state lived in the same Workload resource. This created a problem: the scheduler had to watch and parse the entire Workload object even though it only needed scheduling information.

v1.36 solves this by splitting them:

  • Workload API → static template only (defines what a group of Pods looks like)
  • PodGroup API → manages runtime state (actual scheduling status)

# Workload: static template
apiVersion: scheduling.k8s.io/v1alpha2
kind: Workload
metadata:
  name: training-job-workload
spec:
  podGroupTemplates:
    - name: workers
      schedulingPolicy:
        gang:
          minCount: 4  # Need at least 4 Pods running together

# PodGroup: runtime state
apiVersion: scheduling.k8s.io/v1alpha2
kind: PodGroup
metadata:
  name: training-job-workers-pg
spec:
  podGroupTemplateRef:
    workload:
      workloadName: training-job-workload
      podGroupTemplateName: workers
  schedulingPolicy:
    gang:
      minCount: 4

This separation lets the scheduler read only PodGroup objects and skip parsing Workloads entirely. Better performance, especially at scale.
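
The split is easier to see as data structures. Below is a rough Python model, assuming a frozen template object and a mutable runtime object; the class names mirror the YAML kinds, but `instantiate_pod_group` and the `scheduled_pods` field are hypothetical and not part of the real API.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class PodGroupTemplate:
    name: str
    min_count: int  # mirrors schedulingPolicy.gang.minCount


@dataclass(frozen=True)
class Workload:
    """Static template: never mutated after creation."""
    name: str
    pod_group_templates: tuple


@dataclass
class PodGroup:
    """Runtime object: the only thing this toy scheduler needs to read."""
    name: str
    workload_name: str
    template_name: str
    min_count: int
    scheduled_pods: list = field(default_factory=list)  # hypothetical status

    def gang_satisfied(self) -> bool:
        return len(self.scheduled_pods) >= self.min_count


def instantiate_pod_group(workload: Workload, template_name: str) -> PodGroup:
    """Hypothetical helper: copy the policy out of the named template."""
    template = next(
        t for t in workload.pod_group_templates if t.name == template_name
    )
    return PodGroup(
        name=f"{workload.name}-{template_name}-pg",
        workload_name=workload.name,
        template_name=template.name,
        min_count=template.min_count,
    )


workload = Workload("training-job", (PodGroupTemplate("workers", 4),))
pg = instantiate_pod_group(workload, "workers")
pg.scheduled_pods.extend(["worker-0", "worker-1"])
print(pg.name, pg.gang_satisfied())  # training-job-workers-pg False
```

Because the policy is copied into the PodGroup, everything the scheduling decision needs lives on the runtime object, and the template can stay untouched.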

Gang scheduling: All or nothing

This is the feature I've been waiting for. Gang scheduling ensures that either the whole group gets scheduled together (at least minCount Pods), or none of its Pods do.

Previously, if you needed 8 GPU Pods for distributed training:

  • Scheduler might schedule 5 Pods → 5 Pods consume GPUs but training can't run
  • 3 Pods wait → resources wasted

With gang scheduling in v1.36:

  • Scheduler evaluates the entire group in one atomic cycle
  • If resources are insufficient for minCount Pods → all wait
  • If sufficient → all get bound simultaneously

How it works:

  1. Scheduler takes a cluster state snapshot (avoids race conditions)
  2. Runs PodGroup scheduling algorithm: finds Node placement for all Pods
  3. Applies the decision atomically: success → bind all; failure → return to queue
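
The three steps above can be sketched in a few lines of Python. This is a simplified model under my own assumptions (first-fit placement, GPUs as the only resource, a plain dict as cluster state); `schedule_gang` is a hypothetical name, and the real scheduler's algorithm is more involved.

```python
import copy


def schedule_gang(cluster_free_gpus, pod_names, min_count, gpus_per_pod=1):
    """Place a whole group against a snapshot, then commit all or nothing.

    cluster_free_gpus: dict of Node name -> free GPUs (mutated only on success).
    Returns a dict of Pod name -> Node name, or None if the gang does not fit.
    """
    snapshot = copy.deepcopy(cluster_free_gpus)  # step 1: work on a snapshot
    placement = {}
    for pod in pod_names:  # step 2: look for a Node for every Pod in the group
        node = next(
            (n for n, free in snapshot.items() if free >= gpus_per_pod), None
        )
        if node is None:
            continue  # this Pod does not fit anywhere in the snapshot
        snapshot[node] -= gpus_per_pod
        placement[pod] = node

    if len(placement) < min_count:
        return None  # step 3, failure: live state untouched, group re-queued
    cluster_free_gpus.clear()  # step 3, success: commit the whole decision
    cluster_free_gpus.update(snapshot)
    return placement


cluster = {"node-a": 2, "node-b": 3}
workers = [f"worker-{i}" for i in range(8)]

print(schedule_gang(cluster, workers, min_count=8))  # None
print(cluster)  # {'node-a': 2, 'node-b': 3}
print(schedule_gang(cluster, workers[:4], min_count=4))
print(cluster)  # {'node-a': 0, 'node-b': 1}
```

The important property is that a failed attempt leaves the live cluster state exactly as it was: no Pod from the group holds resources while waiting.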

The nice part: if new Pods are added to a group after some Pods are already running, the scheduler evaluates the new Pods without evicting already-running ones.

Topology-aware scheduling: Pods together, not scattered

Distributed training is sensitive to network latency. If 8 GPU Pods are scattered across the cluster, each on a different rack, inter-Pod bandwidth suffers and training slows down.

Topology-aware scheduling lets you bind a PodGroup to a specific topology domain:

apiVersion: scheduling.k8s.io/v1alpha2
kind: PodGroup
metadata:
  name: topology-aware-workers-pg
spec:
  schedulingPolicy:
    gang:
      minCount: 4
    schedulingConstraints:
      topology:
        - key: topology.kubernetes.io/rack

The scheduler finds Node combinations within the same rack, evaluates whether the entire PodGroup fits, then selects optimal placement based on resource efficiency.
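
As a rough illustration of that selection, here is a Python sketch that groups Nodes by domain, keeps the domains where the whole group fits, and picks the tightest fit. The "least spare capacity" rule is my assumption standing in for "resource efficiency"; the actual scoring used by the scheduler may differ, and `pick_topology_domain` is a hypothetical name.

```python
from collections import defaultdict


def pick_topology_domain(nodes, pods_needed, gpus_per_pod=1):
    """Return the domain (e.g. a rack) where the whole group fits, or None.

    nodes: list of (node_name, domain, free_gpus) tuples.
    Among the domains that fit, prefer the one with the least spare
    capacity (a simple bin-packing heuristic, assumed for illustration).
    """
    slots_per_domain = defaultdict(int)
    for _name, domain, free_gpus in nodes:
        # Count whole Pods per Node: capacity cannot be pooled across Nodes.
        slots_per_domain[domain] += free_gpus // gpus_per_pod

    fitting = {d: s for d, s in slots_per_domain.items() if s >= pods_needed}
    if not fitting:
        return None
    # Tie-break on the domain name so the result is deterministic.
    return min(fitting, key=lambda d: (fitting[d] - pods_needed, d))


nodes = [
    ("n1", "rack-a", 2),
    ("n2", "rack-a", 1),
    ("n3", "rack-b", 4),
    ("n4", "rack-c", 8),
]
print(pick_topology_domain(nodes, pods_needed=4))  # rack-b
print(pick_topology_domain(nodes, pods_needed=9))  # None
```

With 4 Pods, rack-a is rejected (only 3 slots), and rack-b wins over rack-c because it leaves no capacity stranded.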

Currently, topology-aware scheduling doesn't trigger preemption (evicting other Pods to make room). Support for that is expected in v1.37.

Workload-aware preemption: Priority done right

Preemption in K8s isn't new: high-priority Pods can evict low-priority ones. But the old preemption logic worked for one pending Pod at a time and looked for victims on individual Nodes only.

Workload-aware preemption in v1.36 treats the entire PodGroup as a single preemptor unit. Instead of finding victims on each Node separately, the scheduler searches across the entire cluster, preempting Pods from multiple Nodes simultaneously to make room for the PodGroup.

Two new concepts:

  • PodGroup priority: overrides individual Pod priority
  • PodGroup disruptionMode: dictates behavior when a PodGroup gets preempted
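
Here is a minimal Python sketch of the cluster-wide victim search, assuming a single group-level priority and treating GPUs as one pooled number. That pooling is a deliberate simplification (a real scheduler has to check per-Node fit), and `select_victims` is a hypothetical name, not a scheduler API.

```python
def select_victims(running_pods, free_gpus, gpus_needed, group_priority):
    """Pick victims cluster-wide so a whole PodGroup fits, or return None.

    running_pods: list of (pod_name, node_name, priority, gpus) tuples.
    Only Pods with priority below the group's priority are candidates.
    """
    shortfall = gpus_needed - free_gpus
    if shortfall <= 0:
        return []  # the group already fits; nothing to preempt

    candidates = sorted(
        (p for p in running_pods if p[2] < group_priority),
        key=lambda p: (p[2], p[0]),  # lowest priority first, name as tie-break
    )
    victims, freed = [], 0
    for name, _node, _priority, gpus in candidates:
        if freed >= shortfall:
            break
        victims.append(name)
        freed += gpus
    # All or nothing: evicting Pods is pointless if the group still cannot fit.
    return victims if freed >= shortfall else None


running = [
    ("web-1", "node-a", 100, 1),
    ("batch-1", "node-b", 10, 2),
    ("batch-2", "node-c", 10, 2),
    ("db-1", "node-a", 1000, 4),
]
print(select_victims(running, free_gpus=1, gpus_needed=5, group_priority=500))
# ['batch-1', 'batch-2']
print(select_victims(running, free_gpus=1, gpus_needed=12, group_priority=500))
# None
```

Note that the victims sit on two different Nodes, and that the second call evicts nothing because even preempting every lower-priority Pod would not make room for the group.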

ResourceClaim for PodGroup: Smarter GPU allocation

v1.36 extends Dynamic Resource Allocation (DRA) to PodGroups. You can request GPUs, FPGAs, or specialized hardware for an entire group of Pods instead of individual Pods.

This is particularly useful for AI training when you need to allocate the same GPU type across all workers in the group.

Personal take: A necessary update

I've been running K8s for AI workloads for over 2 years. The gang scheduling problem has been the biggest pain point, especially when using Kubeflow or Ray on K8s.

Before v1.36, the common workaround was a custom scheduler or scheduler plugin (like Volcano or the Coscheduling plugin). Not terrible, but it came with a high maintenance burden and poor integration with upstream features.

v1.36 makes gang scheduling a first-class citizen in the upstream scheduler. No more custom schedulers. No more workarounds.

However, there are a few limitations to note:

  • Works well for homogeneous Pod groups (identical Pods)
  • Heterogeneous Pod groups and inter-Pod dependencies aren't guaranteed to find placement
  • API is still v1alpha2: not stable yet, may change in later releases

If you're using Volcano or Coscheduling plugins, no need to migrate immediately. But you should start testing v1.36 on staging clusters to be ready when the API stabilizes.

Summary: Real-world impact

Feature              | Before v1.36               | After v1.36
---------------------|----------------------------|----------------------------
Gang scheduling      | Custom scheduler (Volcano) | Built-in, first-class
Pod group state      | Mixed in Workload          | Separated via PodGroup API
Topology constraints | Manual affinity rules      | Declarative on PodGroup
Preemption           | Per-Node                   | Per-PodGroup, cluster-wide
DRA for groups       | Not supported              | ResourceClaim support

If you run AI/ML training on K8s, v1.36 is worth upgrading to. If you only run typical microservices, you can wait for the next stable release.

