A100s sitting idle at 2 a.m. while training jobs wait in Pending is the kind of waste that makes platform teams quietly furious. Kubernetes 1.36 changes that equation. If you run shared AI/ML infrastructure, this release matters less for the headline and more for the mechanics: Kubernetes v1.36, released April 22, 2026 under the codename ハル/Haru, gives you better control over partitionable accelerators, saner preemption for distributed jobs, and tighter node-level memory behavior where dense inference clusters usually crack first.
The release notes tell you what shipped. They don’t tell you how to wire it into production without creating a different class of outage. Dynamic Resource Allocation (DRA), workload-aware preemption, PSI metrics, CRI list streaming, mixed-version proxy, external ServiceAccount token signer, and memory QoS enhancements are all useful. Some are ready now. Some need restraint. And one fixes a mess many GPU operators created by being too clever with device plugins and ad hoc labels.
My short take: DRA is the centerpiece for AI/ML clusters; workload-aware preemption is promising but still alpha, so treat it like a sharp tool; and PSI plus memory QoS are the pair that will save you from blaming CUDA for what is really kernel pressure behavior.
Kubernetes 1.36 Dynamic Resource Allocation in real clusters
DRA has been inching toward practical use for a while; in Kubernetes 1.36 the important change for accelerator operators is that device taints and tolerations graduate to beta. Sounds minor. It isn’t if you run expensive GPUs that need partitioning and isolation across unrelated workloads. In older setups, teams often stuffed scheduling intent into node labels or custom admission logic and hoped nobody accidentally co-located an incompatible job on a partially allocated device slice. It worked right up until it didn’t.
The better mental model is this: stop thinking of a GPU as a binary node capability and start treating it as an allocatable object with state, policy, and per-claim constraints. DRA lets your resource driver expose devices or partitions as first-class resources; device taints/tolerations let those resources express incompatibility directly to workloads instead of through fragile scheduler side channels.

Picture the scheduling path in prose. A training pod arrives with a claim for two accelerator slices with exclusive memory regions and a topology hint requiring same-host placement for NCCL efficiency. The scheduler asks the DRA-capable driver about candidate allocations rather than guessing from node labels alone. The driver returns feasible partitions on nodes where those slices exist; taints attached to specific devices exclude workloads that can’t share firmware mode or MIG profile; then kube-scheduler places the pod using actual allocatable state instead of stale intent copied into annotations last week.
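To make the device-taint half of that concrete, here is a rough sketch of how a device-level taint might be declared so that only claims carrying a matching toleration can be allocated those devices. The group/version and field names follow the shape of the current alpha device-taints API and are likely to shift as the feature settles into beta; gpu.example.com and the taint key are placeholders, so read this as a shape rather than a copy-paste manifest.

# Assumed shape of the alpha device-taints API: taint every device exposed by
# this (hypothetical) driver while it runs in an exclusive firmware/MIG mode.
apiVersion: resource.k8s.io/v1alpha3
kind: DeviceTaintRule
metadata:
  name: exclusive-firmware-mode
spec:
  deviceSelector:
    driver: gpu.example.com/dra-driver
  taint:
    key: gpu.example.com/firmware-mode
    value: exclusive
    effect: NoSchedule

The toleration itself rides on the claim's device request, mirroring how node taints and pod tolerations behave, which is exactly what lets incompatibility travel with the device instead of with node labels stapled on after the fact.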
A concrete DRA shape for partitionable GPUs
The API details will vary by vendor driver implementation, but the pattern stays consistent: define a ResourceClass that represents how allocation should happen, request it through claims, then bind pods to those claims rather than raw nvidia.com/gpu-style extended resources. I’d avoid mixing both models in production unless you enjoy debugging scheduler edge cases at odd hours.

apiVersion: resource.k8s.io/v1beta1
kind: ResourceClass
metadata:
  name: gpu-mig-20gb-exclusive
spec:
  driverName: gpu.example.com/dra-driver
  # Vendor-specific CRD describing how the driver should carve and fence partitions.
  parametersRef:
    apiGroup: gpu.example.com
    kind: GpuPartitionPolicy
    name: mig-20gb-exclusive
---
apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaimTemplate
metadata:
  name: trainer-gpu-claim-template
spec:
  spec:
    resourceClassName: gpu-mig-20gb-exclusive
    allocationMode: Immediate
---
apiVersion: v1
kind: Pod
metadata:
  generateName: bisect-trainer-
spec:
  containers:
  - name: trainer
    image: registry.example.com/trainer:latest  # placeholder image
    resources:
      claims:
      - name: accelerator0  # the container consumes the claim declared below
  resourceClaims:
  - name: accelerator0
    source:
      resourceClaimTemplateName: trainer-gpu-claim-template
# Illustrative only, not a drop-in manifest: exact kinds, fields, and API versions
# vary across DRA drivers and Kubernetes releases.
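Once something like the above is applied, the fastest sanity check is to look at the generated claim rather than the pod: if the driver never reports a feasible allocation, the pod just sits in Pending and the claim tells you why. The commands below assume the illustrative names from the manifests above.

# Has the claim been allocated, and by which driver?
kubectl get resourceclaims -o wide

# Inspect the allocation recorded on the generated claim.
kubectl describe resourceclaim <generated-claim-name>

# If the pod is stuck Pending, the scheduler events explain which constraint failed.
kubectl describe pod <bisect-trainer-pod-name>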
