AI / ML Infrastructure

GPU Cluster Scheduler / Resource Manager

A GPU cluster scheduler decides which job runs on which GPUs, when, across a fleet of tens of thousands of accelerators. Unlike a stateless web scheduler, the workloads are distributed training jobs that need many GPUs at once and are useless with a partial allocation — so the scheduler must place them gang style (all-or-nothing), keep ultra-expensive hardware highly utilized, share it fairly across competing teams, and survive constant node and GPU failures. This is the control plane behind systems like Kubernetes + Volcano/Kueue, Slurm, Borg/Omega, and Ray.

Requirements

Treat the scheduler as a resource market: jobs bid for GPUs, the scheduler clears the market under priority, quota, and topology constraints while keeping utilization high. The hard part is not packing one job — it is packing gang jobs without deadlocking the cluster.

Functional	Non-functional
Submit a job requesting N GPUs (+ CPU/RAM) of a given type (A100/H100), with affinity/topology constraints.	High utilization — GPUs cost ~$2-4/hr each; idle or fragmented GPUs are the dominant cost. Target high allocated-and-busy %.
Queues & priorities — multiple queues per org/team, priority classes (production > research > best-effort).	Fairness — no team starves; shares honored across tenants even under contention.
Multi-node gang jobs — a distributed-training job needs all of its tasks scheduled together or none (all-or-nothing).	Low scheduling latency — placement decisions in ms-to-seconds even at 100k GPUs and deep queues.
Lifecycle — queue → schedule → run → complete/fail; plus `suspend`, `resume`, `preempt`, `requeue`.	Fault tolerance — survive node/GPU failures and scheduler crashes; no lost or double-bound GPUs.
Quotas & reservations — guaranteed + burst capacity per team; reserve capacity for a pending large gang.	Scalability — 10k-100k GPUs, thousands of jobs/day, elastic with cluster growth.
Topology awareness — co-locate a job's GPUs on the same NVLink island / rack for fast all-reduce.	Predictability / SLOs — high-priority jobs get bounded wait; decisions are auditable and reproducible.

The defining constraint

A 1024-GPU pretraining job is worthless with 1000 GPUs — every rank must start before training can begin. That single fact (all-or-nothing demand) is what separates a GPU scheduler from a generic container scheduler and forces gang scheduling, topology awareness, and preemption to the center of the design.

Scale & estimates

Numbers for a single large cluster. The point is to show that per-task scheduling throughput and cluster state size are both tractable on one host, so the scheduler can keep a consistent global view — which is exactly what gang placement needs.

Dimension	Estimate	How / notes
GPUs in cluster	10k - 100k	e.g. 12,500 nodes × 8 GPUs = 100k GPUs.
Nodes	~1,250 - 12,500	8 GPUs/node typical (DGX-style). Node is the unit of placement/failure.
Jobs submitted / day	thousands - tens of thousands	Mix of short evals/notebooks and long training runs.
Job size (GPUs)	1 → thousands	Heavy long tail: many 1-8 GPU jobs, a few 1024-4096 GPU gangs that dominate capacity.
Job duration	minutes → weeks	Eval = minutes; LLM pretraining = days-to-weeks. Long jobs make preemption/checkpointing matter.
Pending queue depth	hundreds - thousands of tasks	Scheduler must reason over the whole queue to do fair-share + backfill.
Scheduling decisions	~1k - 10k task placements/sec (peak)	Bursty: a 4,096-GPU gang admits thousands of tasks in one cycle.
Node heartbeats	~2,500 / sec	12,500 nodes × every ~5s. Drives failure detection and free-capacity view.
Hot cluster state	tens of MB	100k GPUs × small records → fits in memory; snapshot to a replicated store (etcd-class).

Takeaway: the live state is small enough to hold in RAM on the scheduler, so a single active scheduler with a consistent snapshot can make global gang decisions; throughput is bursty, not sustained-huge, so the design optimizes for good packing decisions over raw decisions/sec.

Core entities

Six entities carry the whole model. A Job owns a gang of Tasks; Tasks consume GPUs on Nodes; Queues order jobs and carry policy; Quotas bound how much a tenant may take.

Entity	Key fields	Notes
Job	`id`, `owner`, `queue`, `priority`, `min_members`/`desired`, `per_task_request`, `constraints`, `status`	The gang unit. `min_members` = how many tasks must start together (gang size). Carries topology/affinity hints.
Task (Pod)	`id`, `job_id`, `node_id`, `gpu_ids[]`, `status`, `restart_count`	One schedulable replica/rank. Bound to a node and a specific set of GPUs.
Node	`id`, `rack`/`island`, `total_gpus`, `allocatable`, `gpu_type`, `nvlink_topology`, `health`, `labels`/`taints`	Unit of placement and of failure. Topology fields drive co-location.
GPU	`id`, `node_id`, `type` (A100/H100), `memory`, `allocated_to`, `health` (XID/ECC)	The scarce resource. May be sliced (MIG) for small jobs or shared for inference.
Queue	`name`, `parent`, `weight`, `priority`, `ordering_policy`	Hierarchical (org → team → user). Holds fair-share weight and ordering (FIFO/priority/DRF).
Quota	`scope`, `guaranteed`, `max`, `used`	`guaranteed` = always available; `max` = burst ceiling when idle capacity exists (reclaimable via preemption).

High-level design

The API admits jobs (quota/validation) into queues. A single active scheduler core runs a scheduling cycle: pick the next job by policy, ask the placement / bin-packer for a topology-fit set of GPUs, optionally invoke the preemption controller to free capacity, then the binder writes the assignment into the cluster state store. Node agents watch the store, launch containers on the assigned GPUs, and stream heartbeats/health back.

flowchart LR
    U["ML Engineer"] --> API["Scheduler API"]
    API --> Q["Queues (priority + quota)"]
    Q --> SCHED["Scheduler Core"]
    SCHED --> PLACE["Placement / Bin-packer"]
    SCHED --> PRE["Preemption Controller"]
    PLACE --> STATE["Cluster State Store"]
    PRE --> STATE
    STATE --> BIND["Binder"]
    BIND --> A1["Node Agent"]
    BIND --> A2["Node Agent"]
    A1 --> G1["8x GPU"]
    A2 --> G2["8x GPU"]
    A1 -->|heartbeat| STATE
    A2 -->|heartbeat| STATE

Two-level / shared-state pattern. Like Omega/Kubernetes, the scheduler reads a consistent snapshot of free capacity, computes placements optimistically, and commits with a version check. This keeps the global view needed for gang decisions while allowing fast cycles.
Binder is the source of truth, not the agent. A GPU is "taken" the moment the bind is committed to the state store (optimistic concurrency), so two cycles can't double-allocate.
Node agents are reconcilers. They converge the node toward its assigned tasks and report actual state; the scheduler never RPCs GPUs directly.
Preemption is a first-class subsystem. To admit a high-priority gang the scheduler may select victims (lower-priority/over-quota tasks), checkpoint or kill them, and reclaim their GPUs.

Deep dive: gang scheduling

Distributed training is bulk-synchronous: every rank must be alive to complete an all-reduce step. So tasks of a job must be scheduled all-or-nothing. If you instead allocate GPUs greedily one task at a time, you get the classic partial-allocation deadlock: two 64-GPU jobs each grab 50 of the 100 free GPUs, neither can reach 64, and both wait forever while 100 GPUs sit "used" but idle.

Partial-allocation deadlock

Greedy per-task placement lets jobs hoard resources they can't yet use. Gang scheduling fixes this: a job is admitted only if all min_members fit at once, committed atomically; otherwise zero GPUs are taken and the job stays queued.

flowchart TD
    J["Gang job needs 64 GPUs"] --> CHK{"All 64 fit together?"}
    CHK -->|yes| RES["Reserve all tasks atomically"]
    CHK -->|no| HOLD["Hold in queue, no partial placement"]
    RES --> TOPO["Topology-aware bind (same NVLink island / rack)"]
    TOPO --> RUN["Launch all tasks together"]
    HOLD --> BACK["Backfill smaller jobs meanwhile"]
    BACK --> CHK

Three mechanisms make gang scheduling work in practice:

All-or-nothing admission. The scheduler tentatively reserves a placement for every task; it commits only if the full set is satisfiable in one consistent snapshot. Partial reservations are released. (Kubernetes calls this a PodGroup; Volcano/Kueue and Slurm have direct gang/coscheduling support.)
Topology-aware placement. All-reduce bandwidth dominates training throughput, so the binder prefers GPUs on the same NVLink island, then same rack/leaf switch, then same spine — minimizing cross-rack hops. A job split across the network can run 2-5× slower.
Bin-packing. Pack jobs tightly onto fewest nodes/islands to leave large contiguous holes for future big gangs, rather than scattering and fragmenting the cluster.

The admission flow across nodes, including the all-or-nothing commit and rollback:

sequenceDiagram
    participant SC as Scheduler
    participant N1 as Node A
    participant N2 as Node B
    participant ST as State Store
    SC->>ST: Read free GPUs snapshot
    SC->>SC: Try to place 64-GPU gang
    alt All tasks fit
        SC->>N1: Reserve 32 GPUs
        SC->>N2: Reserve 32 GPUs
        SC->>ST: Commit gang all-or-nothing
        N1-->>SC: Tasks started
        N2-->>SC: Tasks started
    else Not enough free
        SC->>ST: Release tentative reservations
        SC->>SC: Requeue and age up priority
    end

The cost of all-or-nothing is wait: a big gang may sit idle while it accumulates enough free GPUs. Left unmanaged it both wastes capacity and head-of-line-blocks small jobs — which is exactly what reservations + backfill (next section) exist to fix.

Deep dive: fairness & quotas

Many teams share one cluster, so ordering can't be plain FIFO. Fairness has two layers: which job goes next (share/priority) and how much a tenant may hold (hierarchical quota), with preemption and backfill to enforce and optimize them.

Dominant Resource Fairness (DRF). Jobs request multiple resources (GPU, CPU, RAM); DRF equalizes each user's dominant share — the largest fraction of any single resource they hold. On GPU clusters the dominant resource is almost always the GPU, so DRF effectively gives max-min fair GPU shares while still accounting for CPU/RAM-heavy jobs.
Hierarchical queues. org → team → user, each with a weight and a guaranteed/max quota. Idle guaranteed capacity is lent to busy siblings (elastic), then reclaimed via preemption when the owner returns.
Preemption. To admit a higher-priority or under-quota job, evict victims (lowest priority / most over-quota first). Prefer graceful preemption — signal the job to checkpoint then stop — over hard kill, so long training runs lose minimal work.
Backfill. When a large gang is waiting for capacity, the scheduler makes a reservation for it at a future time, then lets smaller/shorter jobs run in the gaps as long as they finish before the reservation (EASY/conservative backfill). This raises utilization and stops a giant gang from starving everyone behind it.

Policy / knob	Pro	Con / risk	When to use
Priority-only (FIFO within priority)	Simple, predictable for prod.	Starves low priority; ignores cross-team fairness.	Small/homogeneous clusters; strict prod-first.
DRF fair-share	No team starves; multi-resource fair.	More complex; can fragment if not topology-aware.	Multi-tenant clusters with mixed job shapes.
Strict gang admission	No deadlock; jobs start whole.	Big gangs wait; idle GPUs while accumulating.	Always for distributed training.
Backfill + reservation	High utilization; small jobs flow.	Needs runtime estimates; mis-estimates delay the gang.	Whenever large gangs coexist with small jobs.
Preemption (graceful)	Reclaims lent capacity; honors SLOs.	Lost work if no checkpoint; churn.	Elastic quotas; prod must reclaim from research.
Bin-pack vs spread	Pack → room for big gangs + utilization.	Pack → bigger blast radius per node failure.	Pack training; spread replicated inference.

The central tension: gang + bin-packing maximize the chance a huge job can ever run, but they raise wait times and fragmentation pressure. Backfill recovers the wasted capacity, and DRF + quotas + preemption keep the sharing fair while it happens.

Fault tolerance

At 100k GPUs something is always broken: nodes crash, GPUs throw XID/ECC errors, NICs flap. The scheduler must detect failures fast, recover gangs correctly, and never leave the cluster with stale or double-bound placements — all while the scheduler process itself can die.

sequenceDiagram
    participant FD as Failure Detector
    participant L as Scheduler Leader
    participant ST as State Store
    participant Q as Queue
    participant A as New Node Agent
    FD->>L: Node N heartbeat timed out
    L->>ST: Mark node N lost
    Note over L: A gang task ran on node N
    L->>L: Kill surviving sibling tasks
    L->>Q: Requeue the whole gang
    Q-->>L: Capacity freed elsewhere
    L->>A: Bind all tasks again

Failure detection. Missed node heartbeats and GPU health signals (XID, ECC, NVLink down) mark a node degraded/lost. The failing GPU is cordoned so nothing new schedules onto it.
Gang-aware recovery. Because training is all-or-nothing, losing one task usually kills the whole step. The scheduler tears down surviving siblings and requeues the entire gang (unless the framework supports elastic/fault-tolerant training). Checkpoint-restart on the job side caps lost work for week-long runs.
Scheduler HA via leader election. Exactly one active scheduler at a time (active/standby) holds a lease in etcd/ZooKeeper. If it crashes, the lease expires and a standby takes over and rebuilds state from the store. A single writer avoids conflicting placement decisions.
Avoiding stale placements. Binds use optimistic concurrency on a resource version: if free capacity changed since the snapshot, the commit fails and the cycle retries. A node agent rejects a bind for a GPU it already considers taken. Fencing tokens stop an old (paused) leader from committing once a new leader has taken over.
Bin-pack vs spread, revisited. Packing maximizes contiguous capacity for big gangs but concentrates blast radius — one node failure can kill a tightly-packed gang. For replicated inference/services, prefer spread (anti-affinity) so a single node loss never takes down all replicas. Choose per workload class.

Bottlenecks & scaling

A GPU scheduler rarely falls over on raw QPS — it degrades through poor packing, head-of-line blocking, and contention on the global state. The mitigations keep the single-writer global view while removing the serial chokepoints.

Bottleneck	Why it hurts	Mitigation
Scheduler throughput	One serial decision loop must reason over the whole queue + cluster for gang fit.	Batch admits per cycle; cache node scores; incremental updates; shard by pool/partition; shared-state (Omega) optimistic commits; equivalence classes for identical tasks.
Fragmentation	Free GPUs scattered across nodes → no contiguous island for the next big gang.	Bin-packing; topology-aware placement; defragmentation via preemption; size nodes to common gang shapes; reserve islands for large jobs.
Head-of-line blocking by huge gangs	A 4,096-GPU job at the front stalls everything behind it while it accumulates capacity.	Reservation + backfill small jobs into the gaps; separate large-job queue; DRF so small tenants still make progress; aging to bound wait.
Hotspots (popular GPU type)	Everyone wants H100; A100 pool idles → skewed utilization + unfair waits.	Per-type quotas + fair-share; MIG slicing for small jobs; encourage/auto-fallback to cheaper pools; price/priority incentives.
State store contention	Every bind + heartbeat hits the consistent store → write hotspot.	Keep hot state in scheduler RAM; commit in batches; optimistic concurrency; partition state; event-driven agent watches instead of polling.
Heartbeat / watch storm	12,500 nodes × frequent heartbeats can swamp the control plane.	Jittered intervals; aggregate/coalesce; lease-based liveness; tune interval vs detection latency.

Summary — what a staff answer nails

Lead with the defining constraint: distributed training is all-or-nothing, so the scheduler must gang-schedule tasks atomically and place them topology-aware (same NVLink island/rack) to keep all-reduce fast. Keep ultra-expensive GPUs busy with bin-packing + backfill, share them with DRF fair-share, hierarchical quotas, and preemption, and prevent giant gangs from starving small jobs via reservations. Make it robust with a single leader-elected scheduler over a consistent state store, optimistic concurrency + fencing to avoid stale/double binds, and gang-aware recovery + checkpointing when nodes or GPUs fail.