System Design Notes

AI / ML Infrastructure

GPU Cluster Scheduler / Resource Manager

A GPU cluster scheduler decides which job runs on which GPUs, and when, across a fleet of tens of thousands of accelerators. Unlike a stateless web scheduler, the workloads are distributed training jobs that need many GPUs at once and are useless with a partial allocation. The scheduler must therefore place them gang-style (all-or-nothing), keep ultra-expensive hardware highly utilized, share it fairly across competing teams, and survive constant node and GPU failures. This is the control plane behind systems like Kubernetes + Volcano/Kueue, Slurm, Borg/Omega, and Ray.

Requirements

Treat the scheduler as a resource market: jobs bid for GPUs, and the scheduler clears the market under priority, quota, and topology constraints while keeping utilization high. The hard part is not packing one job; it is packing gang jobs without deadlocking the cluster.

| Functional | Non-functional |
|---|---|
| Submit a job requesting N GPUs (+ CPU/RAM) of a given type (A100/H100), with affinity/topology constraints. | High utilization: GPUs cost ~$2-4/hr each; idle or fragmented GPUs are the dominant cost. Target high allocated-and-busy %. |
| Queues & priorities: multiple queues per org/team, priority classes (production > research > best-effort). | Fairness: no team starves; shares honored across tenants even under contention. |
| Multi-node gang jobs: a distributed-training job needs all of its tasks scheduled together or none (all-or-nothing). | Low scheduling latency: placement decisions in ms-to-seconds even at 100k GPUs and deep queues. |
| Lifecycle: queue → schedule → run → complete/fail; plus suspend, resume, preempt, requeue. | Fault tolerance: survive node/GPU failures and scheduler crashes; no lost or double-bound GPUs. |
| Quotas & reservations: guaranteed + burst capacity per team; reserve capacity for a pending large gang. | Scalability: 10k-100k GPUs, thousands of jobs/day, elastic with cluster growth. |
| Topology awareness: co-locate a job's GPUs on the same NVLink island / rack for fast all-reduce. | Predictability / SLOs: high-priority jobs get bounded wait; decisions are auditable and reproducible. |

The defining constraint

A 1024-GPU pretraining job is worthless with 1000 GPUs — every rank must start before training can begin. That single fact (all-or-nothing demand) is what separates a GPU scheduler from a generic container scheduler and forces gang scheduling, topology awareness, and preemption to the center of the design.

Scale & estimates

Numbers for a single large cluster. The point is to show that per-task scheduling throughput and cluster state size are both tractable on one host, so the scheduler can keep a consistent global view — which is exactly what gang placement needs.

| Dimension | Estimate | How / notes |
|---|---|---|
| GPUs in cluster | 10k - 100k | e.g. 12,500 nodes × 8 GPUs = 100k GPUs. |
| Nodes | ~1,250 - 12,500 | 8 GPUs/node typical (DGX-style). Node is the unit of placement/failure. |
| Jobs submitted / day | thousands - tens of thousands | Mix of short evals/notebooks and long training runs. |
| Job size (GPUs) | 1 → thousands | Heavy long tail: many 1-8 GPU jobs, a few 1024-4096 GPU gangs that dominate capacity. |
| Job duration | minutes → weeks | Eval = minutes; LLM pretraining = days-to-weeks. Long jobs make preemption/checkpointing matter. |
| Pending queue depth | hundreds - thousands of tasks | Scheduler must reason over the whole queue to do fair-share + backfill. |
| Scheduling decisions | ~1k - 10k task placements/sec (peak) | Bursty: a 4,096-GPU gang admits thousands of tasks in one cycle. |
| Node heartbeats | ~2,500 / sec | 12,500 nodes, one heartbeat every ~5s. Drives failure detection and free-capacity view. |
| Hot cluster state | tens of MB | 100k GPUs × small records → fits in memory; snapshot to a replicated store (etcd-class). |

Takeaway: the live state is small enough to hold in RAM on the scheduler, so a single active scheduler with a consistent snapshot can make global gang decisions; throughput is bursty, not sustained-huge, so the design optimizes for good packing decisions over raw decisions/sec.
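
The arithmetic behind the heartbeat and state-size rows can be checked in a few lines. The node count, GPUs per node, and heartbeat interval come from the table; the 256-byte record size is an assumption chosen only to illustrate "small records".

```python
# Back-of-envelope check of the estimates in the table above.
NODES = 12_500
GPUS_PER_NODE = 8
HEARTBEAT_INTERVAL_S = 5
BYTES_PER_GPU_RECORD = 256  # assumption: not stated in the notes

total_gpus = NODES * GPUS_PER_NODE
heartbeats_per_sec = NODES / HEARTBEAT_INTERVAL_S
hot_state_mb = total_gpus * BYTES_PER_GPU_RECORD / 1e6

print(total_gpus)          # 100000
print(heartbeats_per_sec)  # 2500.0
print(hot_state_mb)        # 25.6
```

Even with records several times larger, the hot state stays well within the memory of a single scheduler host.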

Core entities

Six entities carry the whole model. A Job owns a gang of Tasks; Tasks consume GPUs on Nodes; Queues order jobs and carry policy; Quotas bound how much a tenant may take.

| Entity | Key fields | Notes |
|---|---|---|
| Job | id, owner, queue, priority, min_members/desired, per_task_request, constraints, status | The gang unit. min_members = how many tasks must start together (gang size). Carries topology/affinity hints. |
| Task (Pod) | id, job_id, node_id, gpu_ids[], status, restart_count | One schedulable replica/rank. Bound to a node and a specific set of GPUs. |
| Node | id, rack/island, total_gpus, allocatable, gpu_type, nvlink_topology, health, labels/taints | Unit of placement and of failure. Topology fields drive co-location. |
| GPU | id, node_id, type (A100/H100), memory, allocated_to, health (XID/ECC) | The scarce resource. May be sliced (MIG) for small jobs or shared for inference. |
| Queue | name, parent, weight, priority, ordering_policy | Hierarchical (org → team → user). Holds fair-share weight and ordering (FIFO/priority/DRF). |
| Quota | scope, guaranteed, max, used | guaranteed = always available; max = burst ceiling when idle capacity exists (reclaimable via preemption). |
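
As a rough sketch, three of these entities might be modelled as below. The field names follow the table where possible; `gpus_per_task` is a simplification of per_task_request, `burst_max` stands in for the table's `max`, and the helper methods are illustrative assumptions rather than part of the original model.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Quota:
    scope: str
    guaranteed: int   # GPUs always available to this scope
    burst_max: int    # the table's "max": ceiling when idle capacity exists
    used: int = 0

    def can_admit(self, gpus: int) -> bool:
        """A request is admissible while it stays under the burst ceiling."""
        return self.used + gpus <= self.burst_max

    def borrowed(self) -> int:
        """GPUs held above the guarantee; reclaimable via preemption."""
        return max(0, self.used - self.guaranteed)


@dataclass
class Task:
    id: str
    job_id: str
    node_id: Optional[str] = None            # None until bound
    gpu_ids: List[int] = field(default_factory=list)
    status: str = "pending"


@dataclass
class Job:
    id: str
    queue: str
    priority: int
    min_members: int      # gang size: tasks that must start together
    gpus_per_task: int    # simplified per_task_request (GPUs only)
    tasks: List[Task] = field(default_factory=list)

    def gang_gpus(self) -> int:
        return self.min_members * self.gpus_per_task

    def is_fully_bound(self) -> bool:
        bound = sum(1 for t in self.tasks if t.node_id is not None)
        return bound >= self.min_members


job = Job("pretrain-1", "research", priority=10, min_members=4, gpus_per_task=8)
job.tasks = [Task(f"t{i}", job.id) for i in range(4)]
quota = Quota("research", guaranteed=16, burst_max=64, used=40)
print(job.gang_gpus(), quota.can_admit(job.gang_gpus()), quota.borrowed())
# 32 False 24
```

Here the 32-GPU gang is rejected at admission because it would push the team past its burst ceiling, and 24 of the team's current GPUs are already borrowed.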

High-level design

The API admits jobs (quota/validation) into queues. A single active scheduler core runs a scheduling cycle: pick the next job by policy, ask the placement / bin-packer for a topology-fit set of GPUs, optionally invoke the preemption controller to free capacity, then the binder writes the assignment into the cluster state store. Node agents watch the store, launch containers on the assigned GPUs, and stream heartbeats/health back.
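
One possible shape for the placement / bin-packer step is sketched below: pick a single island that can hold the whole gang with the least leftover capacity, then fill its most-occupied nodes first. The function name, the island/node dictionaries, and the tie-breaking rules are assumptions for illustration; a real placer would also weigh rack distance, taints, and GPU type.

```python
from typing import Dict, Optional, Tuple


def place_gang(
    free_by_island: Dict[str, Dict[str, int]],
    tasks: int,
    gpus_per_task: int,
) -> Optional[Tuple[str, Dict[str, int]]]:
    """Return (island, {node: task_count}) or None if no single island fits."""
    best = None
    for island, nodes in free_by_island.items():
        # How many whole tasks each node can host with its free GPUs.
        slots = {n: free // gpus_per_task for n, free in nodes.items()}
        capacity = sum(slots.values())
        if capacity < tasks:
            continue  # gang does not fit in this island
        leftover = capacity - tasks
        if best is None or leftover < best[0]:
            best = (leftover, island, slots)  # tightest fit so far
    if best is None:
        return None
    _, island, slots = best
    assignment: Dict[str, int] = {}
    remaining = tasks
    # Best-fit within the island: use nodes with the fewest free slots first,
    # so that emptier nodes stay whole for future large gangs.
    for node, n_slots in sorted(slots.items(), key=lambda kv: (kv[1], kv[0])):
        if remaining == 0:
            break
        if n_slots == 0:
            continue
        take = min(n_slots, remaining)
        assignment[node] = take
        remaining -= take
    return island, assignment


free = {
    "island-a": {"a1": 8, "a2": 8, "a3": 4},
    "island-b": {"b1": 8, "b2": 2},
}
print(place_gang(free, tasks=2, gpus_per_task=4))  # ('island-b', {'b1': 2})
print(place_gang(free, tasks=4, gpus_per_task=4))  # ('island-a', {'a3': 1, 'a1': 2, 'a2': 1})
print(place_gang(free, tasks=6, gpus_per_task=4))  # None
```

The small gang goes to the island it fills exactly, which leaves the larger island intact for the next big job.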

flowchart LR
    U["ML Engineer"] --> API["Scheduler API"]
    API --> Q["Queues (priority + quota)"]
    Q --> SCHED["Scheduler Core"]
    SCHED --> PLACE["Placement / Bin-packer"]
    SCHED --> PRE["Preemption Controller"]
    PLACE --> BIND["Binder"]
    PRE --> BIND
    BIND --> STATE["Cluster State Store"]
    STATE -->|watch| A1["Node Agent"]
    STATE -->|watch| A2["Node Agent"]
    A1 --> G1["8x GPU"]
    A2 --> G2["8x GPU"]
    A1 -->|heartbeat| STATE
    A2 -->|heartbeat| STATE
      

Deep dive: gang scheduling

Distributed training is bulk-synchronous: every rank must be alive to complete an all-reduce step. So tasks of a job must be scheduled all-or-nothing. If you instead allocate GPUs greedily one task at a time, you get the classic partial-allocation deadlock: two 64-GPU jobs each grab 50 of the 100 free GPUs, neither can reach 64, and both wait forever while 100 GPUs sit "used" but idle.

Partial-allocation deadlock

Greedy per-task placement lets jobs hoard resources they can't yet use. Gang scheduling fixes this: a job is admitted only if all min_members fit at once, committed atomically; otherwise zero GPUs are taken and the job stays queued.
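
The 100-GPU example can be reproduced with a toy model that counts GPUs only. Both function names are hypothetical, and the round-robin order in the greedy version is an assumption chosen to mimic two jobs racing for capacity.

```python
from typing import Dict


def greedy_per_task(free: int, demands: Dict[str, int]) -> Dict[str, int]:
    """Hand out one GPU at a time, round-robin; jobs keep partial allocations."""
    held = {job: 0 for job in demands}
    progress = True
    while free > 0 and progress:
        progress = False
        for job, need in demands.items():
            if free > 0 and held[job] < need:
                held[job] += 1
                free -= 1
                progress = True
    return held


def gang_admit(free: int, demands: Dict[str, int]) -> Dict[str, int]:
    """Admit a job only if its whole demand fits; otherwise it takes nothing."""
    held = {}
    for job, need in demands.items():
        if need <= free:
            held[job] = need
            free -= need
        else:
            held[job] = 0
    return held


demands = {"job-a": 64, "job-b": 64}
print(greedy_per_task(100, demands))  # {'job-a': 50, 'job-b': 50}
print(gang_admit(100, demands))       # {'job-a': 64, 'job-b': 0}
```

Under greedy placement neither job reaches 64 and all 100 GPUs are held but idle. Under gang admission one job runs and 36 GPUs remain free for smaller work.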

flowchart TD
    J["Gang job needs 64 GPUs"] --> CHK{"All 64 fit together?"}
    CHK -->|yes| RES["Reserve all tasks atomically"]
    CHK -->|no| HOLD["Hold in queue, no partial placement"]
    RES --> TOPO["Topology-aware bind (same NVLink island / rack)"]
    TOPO --> RUN["Launch all tasks together"]
    HOLD --> BACK["Backfill smaller jobs meanwhile"]
    BACK --> CHK
      

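Three mechanisms from the flowchart above make gang scheduling work in practice: atomic all-or-nothing reservation of the whole gang, topology-aware binding, and backfill of smaller jobs while a gang waits.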

The admission flow across nodes, including the all-or-nothing commit and rollback:

sequenceDiagram
    participant SC as Scheduler
    participant N1 as Node A
    participant N2 as Node B
    participant ST as State Store
    SC->>ST: Read free GPUs snapshot
    SC->>SC: Try to place 64-GPU gang
    alt All tasks fit
        SC->>N1: Reserve 32 GPUs
        SC->>N2: Reserve 32 GPUs
        SC->>ST: Commit gang all-or-nothing
        N1-->>SC: Tasks started
        N2-->>SC: Tasks started
    else Not enough free
        SC->>ST: Release tentative reservations
        SC->>SC: Requeue and age up priority
    end
      

The cost of all-or-nothing is wait: a big gang may sit idle while it accumulates enough free GPUs. Left unmanaged it both wastes capacity and head-of-line-blocks small jobs — which is exactly what reservations + backfill (next section) exist to fix.
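
A minimal sketch of reservation plus backfill, in the style of conservative "EASY" backfill, is shown below. It assumes the head gang does not fit right now and that runtime estimates are available; the class and function names are hypothetical, and only GPU counts are modelled (no topology).

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Running:
    gpus: int
    ends_at: float      # estimated completion time


@dataclass
class Pending:
    name: str
    gpus: int
    est_runtime: float


def reservation(free_now: int, running: List[Running],
                head_gpus: int) -> Optional[Tuple[float, int]]:
    """Earliest time the head gang fits, plus GPUs left spare at that time."""
    free = free_now
    for r in sorted(running, key=lambda r: r.ends_at):
        free += r.gpus
        if free >= head_gpus:
            return r.ends_at, free - head_gpus
    return None  # gang never fits, even after every running job ends


def backfill(now: float, free_now: int, running: List[Running],
             head_gpus: int, candidates: List[Pending]) -> List[str]:
    """Start small jobs that cannot delay the head gang's reserved start."""
    res = reservation(free_now, running, head_gpus)
    if res is None:
        return []
    start_time, spare = res
    started = []
    for job in candidates:
        if job.gpus > free_now:
            continue
        ends_before = now + job.est_runtime <= start_time
        if ends_before or job.gpus <= spare:
            started.append(job.name)
            free_now -= job.gpus
            if not ends_before:
                spare -= job.gpus  # it will still hold GPUs at start_time
    return started


running = [Running(32, 60.0), Running(16, 120.0)]
queue = [Pending("eval", 8, 30.0), Pending("notebook", 8, 600.0),
         Pending("sweep", 8, 100.0)]
print(backfill(0.0, 16, running, head_gpus=64, candidates=queue))
# ['eval', 'sweep']
```

The 64-GPU gang is reserved for t=120 with no spare capacity, so the two jobs that finish before then are backfilled while the long notebook is held back. If a runtime estimate is too low, the backfilled job overruns the reservation and delays the gang, which is the mis-estimate risk noted for this policy.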

Deep dive: fairness & quotas

Many teams share one cluster, so ordering can't be plain FIFO. Fairness has two layers: which job goes next (share/priority) and how much a tenant may hold (hierarchical quota), with preemption and backfill to enforce and optimize them.

| Policy / knob | Pro | Con / risk | When to use |
|---|---|---|---|
| Priority-only (FIFO within priority) | Simple, predictable for prod. | Starves low priority; ignores cross-team fairness. | Small/homogeneous clusters; strict prod-first. |
| DRF fair-share | No team starves; multi-resource fair. | More complex; can fragment if not topology-aware. | Multi-tenant clusters with mixed job shapes. |
| Strict gang admission | No deadlock; jobs start whole. | Big gangs wait; idle GPUs while accumulating. | Always for distributed training. |
| Backfill + reservation | High utilization; small jobs flow. | Needs runtime estimates; mis-estimates delay the gang. | Whenever large gangs coexist with small jobs. |
| Preemption (graceful) | Reclaims lent capacity; honors SLOs. | Lost work if no checkpoint; churn. | Elastic quotas; prod must reclaim from research. |
| Bin-pack vs spread | Pack → room for big gangs + utilization. | Pack → bigger blast radius per node failure. | Pack training; spread replicated inference. |

The central tension: gang + bin-packing maximize the chance a huge job can ever run, but they raise wait times and fragmentation pressure. Backfill recovers the wasted capacity, and DRF + quotas + preemption keep the sharing fair while it happens.
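
For DRF, the core calculation is each tenant's dominant share: its largest fractional usage across resource types. The sketch below serves the tenant with the lowest weighted dominant share next. The tenant names, capacities, and the optional weight handling are illustrative assumptions.

```python
from typing import Dict, Optional


def dominant_share(used: Dict[str, float], capacity: Dict[str, float]) -> float:
    """Largest fraction of any single resource this tenant holds."""
    return max(used.get(r, 0) / capacity[r] for r in capacity)


def next_tenant(usage: Dict[str, Dict[str, float]],
                capacity: Dict[str, float],
                weights: Optional[Dict[str, float]] = None) -> str:
    """Pick the tenant with the lowest weighted dominant share."""
    weights = weights or {}
    return min(
        usage,
        key=lambda t: (dominant_share(usage[t], capacity) / weights.get(t, 1.0), t),
    )


capacity = {"gpu": 1000, "cpu": 16000}
usage = {
    "research": {"gpu": 400, "cpu": 3200},  # dominant: gpu, 0.40
    "prod":     {"gpu": 100, "cpu": 8000},  # dominant: cpu, 0.50
    "eval":     {"gpu": 50,  "cpu": 400},   # dominant: gpu, 0.05
}
print(next_tenant(usage, capacity))  # eval
```

Note that prod holds few GPUs but is still ranked last because its CPU usage dominates, which is the multi-resource aspect that a GPU-only fair share would miss.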

Fault tolerance

At 100k GPUs something is always broken: nodes crash, GPUs throw XID/ECC errors, NICs flap. The scheduler must detect failures fast, recover gangs correctly, and never leave the cluster with stale or double-bound placements — all while the scheduler process itself can die.
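
Gang-aware recovery can be sketched as two small steps: detect nodes whose heartbeat has timed out, then requeue every gang that had a task on a lost node and kill that gang's surviving siblings. The function names, the plain dictionaries, and the 15-second timeout are assumptions for illustration.

```python
from typing import Dict, List, Set, Tuple


def detect_lost_nodes(last_heartbeat: Dict[str, float],
                      now: float, timeout_s: float) -> Set[str]:
    """Nodes whose most recent heartbeat is older than the timeout."""
    return {n for n, t in last_heartbeat.items() if now - t > timeout_s}


def plan_recovery(lost: Set[str],
                  task_node: Dict[str, str],
                  task_job: Dict[str, str]) -> Tuple[List[str], List[str]]:
    """Return (jobs to requeue as a whole, surviving sibling tasks to kill)."""
    failed_jobs = {task_job[t] for t, n in task_node.items() if n in lost}
    siblings = sorted(
        t for t, n in task_node.items()
        if task_job[t] in failed_jobs and n not in lost
    )
    return sorted(failed_jobs), siblings


heartbeats = {"n1": 100.0, "n2": 80.0, "n3": 99.0}
task_node = {"t1": "n1", "t2": "n2", "t3": "n3"}
task_job = {"t1": "job-a", "t2": "job-a", "t3": "job-b"}

lost = detect_lost_nodes(heartbeats, now=101.0, timeout_s=15.0)
print(lost)                                       # {'n2'}
print(plan_recovery(lost, task_node, task_job))   # (['job-a'], ['t1'])
```

Only job-a is disturbed: its surviving task on n1 is killed so the gang can restart whole, while job-b on n3 keeps running.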

sequenceDiagram
    participant FD as Failure Detector
    participant L as Scheduler Leader
    participant ST as State Store
    participant Q as Queue
    participant A as New Node Agent
    FD->>L: Node N heartbeat timed out
    L->>ST: Mark node N lost
    Note over L: A gang task ran on node N
    L->>L: Kill surviving sibling tasks
    L->>Q: Requeue the whole gang
    Q-->>L: Capacity freed elsewhere
    L->>A: Bind all tasks again
      

Bottlenecks & scaling

A GPU scheduler rarely falls over on raw QPS — it degrades through poor packing, head-of-line blocking, and contention on the global state. The mitigations keep the single-writer global view while removing the serial chokepoints.

| Bottleneck | Why it hurts | Mitigation |
|---|---|---|
| Scheduler throughput | One serial decision loop must reason over the whole queue + cluster for gang fit. | Batch admits per cycle; cache node scores; incremental updates; shard by pool/partition; shared-state (Omega) optimistic commits; equivalence classes for identical tasks. |
| Fragmentation | Free GPUs scattered across nodes → no contiguous island for the next big gang. | Bin-packing; topology-aware placement; defragmentation via preemption; size nodes to common gang shapes; reserve islands for large jobs. |
| Head-of-line blocking by huge gangs | A 4,096-GPU job at the front stalls everything behind it while it accumulates capacity. | Reservation + backfill small jobs into the gaps; separate large-job queue; DRF so small tenants still make progress; aging to bound wait. |
| Hotspots (popular GPU type) | Everyone wants H100; A100 pool idles → skewed utilization + unfair waits. | Per-type quotas + fair-share; MIG slicing for small jobs; encourage/auto-fallback to cheaper pools; price/priority incentives. |
| State store contention | Every bind + heartbeat hits the consistent store → write hotspot. | Keep hot state in scheduler RAM; commit in batches; optimistic concurrency; partition state; event-driven agent watches instead of polling. |
| Heartbeat / watch storm | 12,500 nodes × frequent heartbeats can swamp the control plane. | Jittered intervals; aggregate/coalesce; lease-based liveness; tune interval vs detection latency. |
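
The "commit in batches" and "optimistic concurrency" mitigations can be combined: a scheduler plans against a versioned snapshot and commits a whole gang's binds in one conditional write. The `StateStore` class below is an in-memory stand-in written for this sketch, not the API of etcd or any real store, and a single global version is a deliberately coarse assumption.

```python
from typing import Dict, Tuple


class StateStore:
    """Toy versioned store: one global version, GPU -> task bindings."""

    def __init__(self) -> None:
        self.version = 0
        self.bound: Dict[str, str] = {}

    def snapshot(self) -> Tuple[int, Dict[str, str]]:
        return self.version, dict(self.bound)

    def commit(self, read_version: int, binds: Dict[str, str]) -> bool:
        """Apply all binds atomically, only if nothing changed since the read."""
        if read_version != self.version:
            return False  # stale snapshot: caller must re-read and re-plan
        if any(gpu in self.bound for gpu in binds):
            return False  # would double-bind a GPU
        self.bound.update(binds)
        self.version += 1
        return True


store = StateStore()
v_a, _ = store.snapshot()   # scheduler A reads version 0
v_b, _ = store.snapshot()   # scheduler B reads version 0

ok_a = store.commit(v_a, {"n1/gpu0": "job-a/t0", "n1/gpu1": "job-a/t1"})
ok_b = store.commit(v_b, {"n1/gpu1": "job-b/t0"})   # stale, rejected

v_b2, seen = store.snapshot()                       # B re-reads and re-plans
ok_b_retry = store.commit(v_b2, {"n2/gpu0": "job-b/t0"})
print(ok_a, ok_b, ok_b_retry)  # True False True
```

Because the gang's binds land in one write, a rejected commit leaves no partial placement behind, and the loser simply retries against fresh state.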

Summary — what a staff answer nails

Lead with the defining constraint: distributed training is all-or-nothing, so the scheduler must gang-schedule tasks atomically and place them topology-aware (same NVLink island/rack) to keep all-reduce fast. Keep ultra-expensive GPUs busy with bin-packing + backfill, share them with DRF fair-share, hierarchical quotas, and preemption, and prevent giant gangs from starving small jobs via reservations. Make it robust with a single leader-elected scheduler over a consistent state store, optimistic concurrency + fencing to avoid stale/double binds, and gang-aware recovery + checkpointing when nodes or GPUs fail.