AI / ML Infrastructure
GPU Cluster Scheduler / Resource Manager
A GPU cluster scheduler decides which job runs on which GPUs, when, across a fleet of tens of thousands of accelerators. Unlike a stateless web scheduler, the workloads are distributed training jobs that need many GPUs at once and are useless with a partial allocation — so the scheduler must place them gang style (all-or-nothing), keep ultra-expensive hardware highly utilized, share it fairly across competing teams, and survive constant node and GPU failures. This is the control plane behind systems like Kubernetes + Volcano/Kueue, Slurm, Borg/Omega, and Ray.
Requirements
Treat the scheduler as a resource market: jobs bid for GPUs, the scheduler clears the market under priority, quota, and topology constraints while keeping utilization high. The hard part is not packing one job — it is packing gang jobs without deadlocking the cluster.
| Functional | Non-functional |
|---|---|
| Submit a job requesting N GPUs (+ CPU/RAM) of a given type (A100/H100), with affinity/topology constraints. | High utilization — GPUs cost ~$2-4/hr each; idle or fragmented GPUs are the dominant cost. Target high allocated-and-busy %. |
| Queues & priorities — multiple queues per org/team, priority classes (production > research > best-effort). | Fairness — no team starves; shares honored across tenants even under contention. |
| Multi-node gang jobs — a distributed-training job needs all of its tasks scheduled together or none (all-or-nothing). | Low scheduling latency — placement decisions in ms-to-seconds even at 100k GPUs and deep queues. |
Lifecycle — queue → schedule → run →
complete/fail; plus suspend, resume,
preempt, requeue.
|
Fault tolerance — survive node/GPU failures and scheduler crashes; no lost or double-bound GPUs. |
| Quotas & reservations — guaranteed + burst capacity per team; reserve capacity for a pending large gang. | Scalability — 10k-100k GPUs, thousands of jobs/day, elastic with cluster growth. |
| Topology awareness — co-locate a job's GPUs on the same NVLink island / rack for fast all-reduce. | Predictability / SLOs — high-priority jobs get bounded wait; decisions are auditable and reproducible. |
The defining constraint
A 1024-GPU pretraining job is worthless with 1000 GPUs — every rank must start before training can begin. That single fact (all-or-nothing demand) is what separates a GPU scheduler from a generic container scheduler and forces gang scheduling, topology awareness, and preemption to the center of the design.
Scale & estimates
Numbers for a single large cluster. The point is to show that per-task scheduling throughput and cluster state size are both tractable on one host, so the scheduler can keep a consistent global view — which is exactly what gang placement needs.
| Dimension | Estimate | How / notes |
|---|---|---|
| GPUs in cluster | 10k - 100k | e.g. 12,500 nodes × 8 GPUs = 100k GPUs. |
| Nodes | ~1,250 - 12,500 | 8 GPUs/node typical (DGX-style). Node is the unit of placement/failure. |
| Jobs submitted / day | thousands - tens of thousands | Mix of short evals/notebooks and long training runs. |
| Job size (GPUs) | 1 → thousands | Heavy long tail: many 1-8 GPU jobs, a few 1024-4096 GPU gangs that dominate capacity. |
| Job duration | minutes → weeks | Eval = minutes; LLM pretraining = days-to-weeks. Long jobs make preemption/checkpointing matter. |
| Pending queue depth | hundreds - thousands of tasks | Scheduler must reason over the whole queue to do fair-share + backfill. |
| Scheduling decisions | ~1k - 10k task placements/sec (peak) | Bursty: a 4,096-GPU gang admits thousands of tasks in one cycle. |
| Node heartbeats | ~2,500 / sec | 12,500 nodes × every ~5s. Drives failure detection and free-capacity view. |
| Hot cluster state | tens of MB | 100k GPUs × small records → fits in memory; snapshot to a replicated store (etcd-class). |
Takeaway: the live state is small enough to hold in RAM on the scheduler, so a single active scheduler with a consistent snapshot can make global gang decisions; throughput is bursty, not sustained-huge, so the design optimizes for good packing decisions over raw decisions/sec.
Core entities
Six entities carry the whole model. A Job owns a gang of Tasks; Tasks consume GPUs on Nodes; Queues order jobs and carry policy; Quotas bound how much a tenant may take.
| Entity | Key fields | Notes |
|---|---|---|
| Job |
id, owner, queue,
priority,
min_members/desired,
per_task_request, constraints,
status
|
The gang unit. min_members = how many tasks must
start together (gang size). Carries topology/affinity hints.
|
| Task (Pod) |
id, job_id, node_id,
gpu_ids[], status,
restart_count
|
One schedulable replica/rank. Bound to a node and a specific set of GPUs. |
| Node |
id, rack/island,
total_gpus, allocatable,
gpu_type, nvlink_topology,
health, labels/taints
|
Unit of placement and of failure. Topology fields drive co-location. |
| GPU |
id, node_id,
type (A100/H100), memory,
allocated_to, health (XID/ECC)
|
The scarce resource. May be sliced (MIG) for small jobs or shared for inference. |
| Queue |
name, parent, weight,
priority, ordering_policy
|
Hierarchical (org → team → user). Holds fair-share weight and ordering (FIFO/priority/DRF). |
| Quota |
scope, guaranteed, max,
used
|
guaranteed = always available; max =
burst ceiling when idle capacity exists (reclaimable via
preemption).
|
High-level design
The API admits jobs (quota/validation) into queues. A single active scheduler core runs a scheduling cycle: pick the next job by policy, ask the placement / bin-packer for a topology-fit set of GPUs, optionally invoke the preemption controller to free capacity, then the binder writes the assignment into the cluster state store. Node agents watch the store, launch containers on the assigned GPUs, and stream heartbeats/health back.
flowchart LR
U["ML Engineer"] --> API["Scheduler API"]
API --> Q["Queues (priority + quota)"]
Q --> SCHED["Scheduler Core"]
SCHED --> PLACE["Placement / Bin-packer"]
SCHED --> PRE["Preemption Controller"]
PLACE --> STATE["Cluster State Store"]
PRE --> STATE
STATE --> BIND["Binder"]
BIND --> A1["Node Agent"]
BIND --> A2["Node Agent"]
A1 --> G1["8x GPU"]
A2 --> G2["8x GPU"]
A1 -->|heartbeat| STATE
A2 -->|heartbeat| STATE
- Two-level / shared-state pattern. Like Omega/Kubernetes, the scheduler reads a consistent snapshot of free capacity, computes placements optimistically, and commits with a version check. This keeps the global view needed for gang decisions while allowing fast cycles.
- Binder is the source of truth, not the agent. A GPU is "taken" the moment the bind is committed to the state store (optimistic concurrency), so two cycles can't double-allocate.
- Node agents are reconcilers. They converge the node toward its assigned tasks and report actual state; the scheduler never RPCs GPUs directly.
- Preemption is a first-class subsystem. To admit a high-priority gang the scheduler may select victims (lower-priority/over-quota tasks), checkpoint or kill them, and reclaim their GPUs.
Deep dive: gang scheduling
Distributed training is bulk-synchronous: every rank must be alive to complete an all-reduce step. So tasks of a job must be scheduled all-or-nothing. If you instead allocate GPUs greedily one task at a time, you get the classic partial-allocation deadlock: two 64-GPU jobs each grab 50 of the 100 free GPUs, neither can reach 64, and both wait forever while 100 GPUs sit "used" but idle.
Partial-allocation deadlock
Greedy per-task placement lets jobs hoard resources
they can't yet use. Gang scheduling fixes this: a job is admitted
only if all min_members fit at once,
committed atomically; otherwise zero GPUs are taken
and the job stays queued.
flowchart TD
J["Gang job needs 64 GPUs"] --> CHK{"All 64 fit together?"}
CHK -->|yes| RES["Reserve all tasks atomically"]
CHK -->|no| HOLD["Hold in queue, no partial placement"]
RES --> TOPO["Topology-aware bind (same NVLink island / rack)"]
TOPO --> RUN["Launch all tasks together"]
HOLD --> BACK["Backfill smaller jobs meanwhile"]
BACK --> CHK
Three mechanisms make gang scheduling work in practice:
-
All-or-nothing admission. The scheduler tentatively
reserves a placement for every task; it commits only if the full set
is satisfiable in one consistent snapshot. Partial reservations are
released. (Kubernetes calls this a
PodGroup; Volcano/Kueue and Slurm have direct gang/coscheduling support.) - Topology-aware placement. All-reduce bandwidth dominates training throughput, so the binder prefers GPUs on the same NVLink island, then same rack/leaf switch, then same spine — minimizing cross-rack hops. A job split across the network can run 2-5× slower.
- Bin-packing. Pack jobs tightly onto fewest nodes/islands to leave large contiguous holes for future big gangs, rather than scattering and fragmenting the cluster.
The admission flow across nodes, including the all-or-nothing commit and rollback:
sequenceDiagram
participant SC as Scheduler
participant N1 as Node A
participant N2 as Node B
participant ST as State Store
SC->>ST: Read free GPUs snapshot
SC->>SC: Try to place 64-GPU gang
alt All tasks fit
SC->>N1: Reserve 32 GPUs
SC->>N2: Reserve 32 GPUs
SC->>ST: Commit gang all-or-nothing
N1-->>SC: Tasks started
N2-->>SC: Tasks started
else Not enough free
SC->>ST: Release tentative reservations
SC->>SC: Requeue and age up priority
end
The cost of all-or-nothing is wait: a big gang may sit idle while it accumulates enough free GPUs. Left unmanaged it both wastes capacity and head-of-line-blocks small jobs — which is exactly what reservations + backfill (next section) exist to fix.
Deep dive: fairness & quotas
Many teams share one cluster, so ordering can't be plain FIFO. Fairness has two layers: which job goes next (share/priority) and how much a tenant may hold (hierarchical quota), with preemption and backfill to enforce and optimize them.
- Dominant Resource Fairness (DRF). Jobs request multiple resources (GPU, CPU, RAM); DRF equalizes each user's dominant share — the largest fraction of any single resource they hold. On GPU clusters the dominant resource is almost always the GPU, so DRF effectively gives max-min fair GPU shares while still accounting for CPU/RAM-heavy jobs.
-
Hierarchical queues. org → team → user, each with a
weightand aguaranteed/maxquota. Idleguaranteedcapacity is lent to busy siblings (elastic), then reclaimed via preemption when the owner returns. -
Preemption. To admit a higher-priority or
under-quota job, evict victims (lowest priority / most over-quota
first). Prefer graceful preemption — signal the job
to
checkpointthen stop — over hard kill, so long training runs lose minimal work. - Backfill. When a large gang is waiting for capacity, the scheduler makes a reservation for it at a future time, then lets smaller/shorter jobs run in the gaps as long as they finish before the reservation (EASY/conservative backfill). This raises utilization and stops a giant gang from starving everyone behind it.
| Policy / knob | Pro | Con / risk | When to use |
|---|---|---|---|
| Priority-only (FIFO within priority) | Simple, predictable for prod. | Starves low priority; ignores cross-team fairness. | Small/homogeneous clusters; strict prod-first. |
| DRF fair-share | No team starves; multi-resource fair. | More complex; can fragment if not topology-aware. | Multi-tenant clusters with mixed job shapes. |
| Strict gang admission | No deadlock; jobs start whole. | Big gangs wait; idle GPUs while accumulating. | Always for distributed training. |
| Backfill + reservation | High utilization; small jobs flow. | Needs runtime estimates; mis-estimates delay the gang. | Whenever large gangs coexist with small jobs. |
| Preemption (graceful) | Reclaims lent capacity; honors SLOs. | Lost work if no checkpoint; churn. | Elastic quotas; prod must reclaim from research. |
| Bin-pack vs spread | Pack → room for big gangs + utilization. | Pack → bigger blast radius per node failure. | Pack training; spread replicated inference. |
The central tension: gang + bin-packing maximize the chance a huge job can ever run, but they raise wait times and fragmentation pressure. Backfill recovers the wasted capacity, and DRF + quotas + preemption keep the sharing fair while it happens.
Fault tolerance
At 100k GPUs something is always broken: nodes crash, GPUs throw
XID/ECC errors, NICs flap. The scheduler must detect
failures fast, recover gangs correctly, and never leave the cluster
with stale or double-bound placements — all while the
scheduler process itself can die.
sequenceDiagram
participant FD as Failure Detector
participant L as Scheduler Leader
participant ST as State Store
participant Q as Queue
participant A as New Node Agent
FD->>L: Node N heartbeat timed out
L->>ST: Mark node N lost
Note over L: A gang task ran on node N
L->>L: Kill surviving sibling tasks
L->>Q: Requeue the whole gang
Q-->>L: Capacity freed elsewhere
L->>A: Bind all tasks again
-
Failure detection. Missed node heartbeats and GPU
health signals (
XID, ECC, NVLink down) mark a nodedegraded/lost. The failing GPU is cordoned so nothing new schedules onto it. - Gang-aware recovery. Because training is all-or-nothing, losing one task usually kills the whole step. The scheduler tears down surviving siblings and requeues the entire gang (unless the framework supports elastic/fault-tolerant training). Checkpoint-restart on the job side caps lost work for week-long runs.
- Scheduler HA via leader election. Exactly one active scheduler at a time (active/standby) holds a lease in etcd/ZooKeeper. If it crashes, the lease expires and a standby takes over and rebuilds state from the store. A single writer avoids conflicting placement decisions.
- Avoiding stale placements. Binds use optimistic concurrency on a resource version: if free capacity changed since the snapshot, the commit fails and the cycle retries. A node agent rejects a bind for a GPU it already considers taken. Fencing tokens stop an old (paused) leader from committing once a new leader has taken over.
- Bin-pack vs spread, revisited. Packing maximizes contiguous capacity for big gangs but concentrates blast radius — one node failure can kill a tightly-packed gang. For replicated inference/services, prefer spread (anti-affinity) so a single node loss never takes down all replicas. Choose per workload class.
Bottlenecks & scaling
A GPU scheduler rarely falls over on raw QPS — it degrades through poor packing, head-of-line blocking, and contention on the global state. The mitigations keep the single-writer global view while removing the serial chokepoints.
| Bottleneck | Why it hurts | Mitigation |
|---|---|---|
| Scheduler throughput | One serial decision loop must reason over the whole queue + cluster for gang fit. | Batch admits per cycle; cache node scores; incremental updates; shard by pool/partition; shared-state (Omega) optimistic commits; equivalence classes for identical tasks. |
| Fragmentation | Free GPUs scattered across nodes → no contiguous island for the next big gang. | Bin-packing; topology-aware placement; defragmentation via preemption; size nodes to common gang shapes; reserve islands for large jobs. |
| Head-of-line blocking by huge gangs | A 4,096-GPU job at the front stalls everything behind it while it accumulates capacity. | Reservation + backfill small jobs into the gaps; separate large-job queue; DRF so small tenants still make progress; aging to bound wait. |
| Hotspots (popular GPU type) | Everyone wants H100; A100 pool idles → skewed utilization + unfair waits. | Per-type quotas + fair-share; MIG slicing for small jobs; encourage/auto-fallback to cheaper pools; price/priority incentives. |
| State store contention | Every bind + heartbeat hits the consistent store → write hotspot. | Keep hot state in scheduler RAM; commit in batches; optimistic concurrency; partition state; event-driven agent watches instead of polling. |
| Heartbeat / watch storm | 12,500 nodes × frequent heartbeats can swamp the control plane. | Jittered intervals; aggregate/coalesce; lease-based liveness; tune interval vs detection latency. |
Summary — what a staff answer nails
Lead with the defining constraint: distributed training is all-or-nothing, so the scheduler must gang-schedule tasks atomically and place them topology-aware (same NVLink island/rack) to keep all-reduce fast. Keep ultra-expensive GPUs busy with bin-packing + backfill, share them with DRF fair-share, hierarchical quotas, and preemption, and prevent giant gangs from starving small jobs via reservations. Make it robust with a single leader-elected scheduler over a consistent state store, optimistic concurrency + fencing to avoid stale/double binds, and gang-aware recovery + checkpointing when nodes or GPUs fail.