AI / ML Infrastructure

Multi-Tenant GPU Cluster

Hundreds of teams share one fleet of expensive, scarce GPUs to run everything from 512-GPU training runs to fractional inference pods. The whole problem is a single tension: strong isolation and fair quotas (so tenants don't see, starve, or interfere with each other) versus high utilization (so a billion-dollar fleet isn't sitting at 30-40% utilization). The answer is a layered system: hardware/software isolation levels (whole-GPU, MIG, MPS, time-slicing), hierarchical quotas with DRF fair-share and preemptive borrowing, noisy-neighbor control, and per-tenant metering that feeds chargeback.

Requirements

Functional

Non-functional

Scale & back-of-the-envelope

A fleet this size is a capital question before it is an engineering one: ~50k H100-class GPUs is on the order of $1.5-2B of hardware. At an internal price of ~$2-4 / GPU-hour, every 1% of utilization you win (or waste) is worth roughly $9-18M a year (50,000 GPUs × 8,760 h × $2-4 × 1%), and the gap between a 30-40% baseline and a 60-80% target is worth hundreds of millions, which is exactly why we will trade some isolation purity for packing density.
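
The per-point value can be checked with a few lines of arithmetic. This is a minimal sketch using only the fleet size and internal price range stated above; the function name is illustrative.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760


def annual_value_of_utilization_point(gpus: int, price_per_gpu_hour: float) -> float:
    """Dollar value per year of one percentage point of fleet utilization."""
    return gpus * HOURS_PER_YEAR * price_per_gpu_hour * 0.01


low = annual_value_of_utilization_point(50_000, 2.0)
high = annual_value_of_utilization_point(50_000, 4.0)
print(f"1% of utilization is worth ${low / 1e6:.1f}M to ${high / 1e6:.1f}M per year")
```

At these inputs one point of utilization comes to roughly $9M to $18M a year, so a 30-point improvement is a nine-figure number.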

| Dimension | Estimate | Notes |
|---|---|---|
| Fleet | 50,000 GPUs (H100 / A100 / L40S mix) | spread across DCs & availability zones |
| Tenants | 300+ teams, 5,000+ users | nested orgs → teams → projects |
| Jobs / day | ~30k (training runs + inference deploys) | bursty submission, long-tailed durations |
| Training | 8-512 GPUs each, hours-days | gang-scheduled, topology-aware, NVLink/IB |
| Inference | fractional (1g.10gb MIG / MPS), 24×7 | latency-sensitive, tight p99 SLOs |
| Guaranteed quota | ~70% of fleet (35k) reserved to teams | the non-preemptible floor |
| Burst pool | ~30% (15k) borrowable, preemptible | reclaimed on demand by owners |
| Utilization target | 60-80% (from 30-40% baseline) | via bin-packing + fractional + preemption |
| Oversubscription | 2-4× logical:physical for dev/inference | time-sliced, best-effort only |

Where the idle goes

Baseline GPU clusters waste capacity in three ways: reserved but idle (a team holds a guarantee it isn't using), allocated but under-driven (a notebook owns a whole GPU at 5% SM occupancy), and fragmentation (free GPUs are scattered so no 8-GPU gang fits). The design attacks all three: borrowing reclaims reserved-idle, fractional sharing fixes under-driven, and topology-aware bin-packing fixes fragmentation.
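
The fragmentation case is easy to see with numbers. The sketch below assumes 8-GPU nodes and a gang that must land on a single node; both helper functions are hypothetical illustrations, not part of any scheduler API.

```python
from typing import List, Optional


def fits_on_one_node(free_per_node: List[int], gang_size: int) -> bool:
    """True if some single node has enough free GPUs for the whole gang."""
    return any(free >= gang_size for free in free_per_node)


def best_fit_node(free_per_node: List[int], request: int) -> Optional[int]:
    """Index of the node with the fewest free GPUs that still fits the request.

    Best-fit keeps large holes intact for future gangs instead of nibbling them.
    """
    candidates = [(free, i) for i, free in enumerate(free_per_node) if free >= request]
    return min(candidates)[1] if candidates else None


scattered = [2, 2, 2, 2, 2, 2, 2, 2]  # 16 free GPUs, but only 2 per node
packed = [8, 8, 0, 0, 0, 0, 0, 0]     # the same 16 free GPUs, consolidated

print(fits_on_one_node(scattered, 8))  # no node can host an 8-GPU gang
print(fits_on_one_node(packed, 8))     # two nodes can
```

Both fleets have the same free count; only the packed one can admit the gang, which is why small jobs should be steered into existing holes.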

Deep dive · GPU sharing & isolation levels

This is the core decision. A GPU can be handed to tenants at four increasingly granular levels, trading isolation strength for utilization. Picking the right level per workload class and per trust boundary is what makes multi-tenancy safe and dense.

| Level | Mechanism | Isolation | Utilization | Use-case |
|---|---|---|---|---|
| Whole-GPU | device plugin assigns one full GPU to one container | Strongest: dedicated hardware | Low for small jobs | large training; untrusted tenants |
| MIG | hardware partition (A100/H100) into up to 7 instances | Strong: dedicated SMs, L2 slice, memory & mem-bandwidth; fault containment | Medium-high | inference, notebooks, mixed tenants |
| MPS | spatial SM sharing across processes (avoids context-switch time-slicing) | Weak: no hardware partition of memory capacity or bandwidth, no fault isolation | High | trusted, cooperative co-location |
| Time-slicing | plugin advertises N logical replicas; temporal sharing | None: shared everything, oversubscribed | Highest | dev / notebooks, bursty inference |

MIG (Multi-Instance GPU) is the workhorse for multi-tenancy because it is the only hardware partition: an H100/A100 is sliced into instances such as 1g.10gb, 2g.20gb, 3g.40gb, 7g.80gb, each getting its own SM compute slices, its own L2-cache slice and memory controllers, and its own slice of HBM bandwidth. A fault or ECC error in one slice is contained and does not crash the others.
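
A requested mix of slices can be sanity-checked against the GPU's budget before it reaches the node. The sketch below assumes an 80 GB part with 7 compute slices and parses profile names of the form `Ng.Mgb`. It checks only the totals, which is a necessary condition; real hardware also restricts where each profile may be placed, and that is not modelled here. The function names are hypothetical.

```python
import re
from typing import Iterable, Tuple


def parse_profile(name: str) -> Tuple[int, int]:
    """Split a MIG profile name such as '3g.40gb' into (compute_slices, memory_gb)."""
    match = re.fullmatch(r"(\d+)g\.(\d+)gb", name)
    if match is None:
        raise ValueError(f"unrecognized MIG profile: {name!r}")
    return int(match.group(1)), int(match.group(2))


def fits_budget(profiles: Iterable[str], max_compute: int = 7, max_mem_gb: int = 80) -> bool:
    """Necessary (not sufficient) check: total compute and memory within the GPU."""
    parsed = [parse_profile(p) for p in profiles]
    compute = sum(c for c, _ in parsed)
    memory = sum(m for _, m in parsed)
    return compute <= max_compute and memory <= max_mem_gb


print(fits_budget(["3g.40gb", "2g.20gb", "1g.10gb", "1g.10gb"]))  # 7 slices, 80 GB
print(fits_budget(["3g.40gb", "3g.40gb", "1g.10gb"]))             # 90 GB of memory
```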

flowchart TD
    GPU["H100 80GB physical GPU"] --> MIG["MIG mode enabled"]
    MIG --> S1["Slice 3g.40gb · Team A training"]
    MIG --> S2["Slice 2g.20gb · Team B inference"]
    MIG --> S3["Slice 1g.10gb · Team C notebook"]
    MIG --> S4["Slice 1g.10gb · Team D notebook"]
    S1 --> HW1["Dedicated SMs + L2 slice + mem BW"]
    S2 --> HW2["Dedicated SMs + L2 slice + mem BW"]
    S3 --> HW3["Dedicated SMs + L2 slice + mem BW"]
    S4 --> HW4["Dedicated SMs + L2 slice + mem BW"]
      

Security boundaries

Trust dictates the floor. For untrusted or external tenants, only whole-GPU or MIG are acceptable: both give hardware-enforced memory isolation, and MIG adds error containment between slices. MPS and time-slicing do not partition memory capacity or bandwidth in hardware and provide no fault isolation, and MPS clients additionally depend on one shared server process. A buggy or malicious process can therefore exhaust GPU memory for its neighbors or crash the shared MPS server, and the blast radius is every co-located tenant. Both modes are only for mutually trusting, cooperative workloads.

Memory isolation

Residual data must never leak across tenants. The driver is expected to clear memory when a context is torn down, but the platform should still verify or scrub between allocations and treat shared-SM timing side-channels as a risk for untrusted co-tenancy. At the container layer the NVIDIA Container Toolkit exposes only the assigned device or MIG UUID; cgroups cap CPU/RAM, and seccomp/AppArmor plus network policy fence the rest. Crucially, MIG partitions memory bandwidth too, which is what shields it from the dominant noisy-neighbor effect that MPS and time-slicing suffer (covered in the noisy-neighbor deep dive).

Rule of thumb

Untrusted or latency-critical → whole-GPU or MIG. Trusted batch that under-drives a GPU → MPS to recover utilization. Dev notebooks and spiky best-effort inference → time-slicing with oversubscription. One cluster runs all three, chosen per pool.
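
The rule of thumb can be written down as a small decision function. This is a sketch of the policy as stated, not a complete placement engine; the flag names and the `needs_full_gpu` input are assumptions added for illustration.

```python
from enum import Enum


class Isolation(Enum):
    WHOLE_GPU = "whole-gpu"
    MIG = "mig"
    MPS = "mps"
    TIME_SLICING = "time-slicing"


def choose_isolation(trusted: bool, latency_critical: bool,
                     needs_full_gpu: bool, best_effort: bool) -> Isolation:
    """Pick the weakest isolation level the workload's trust and SLO allow."""
    if not trusted or latency_critical:
        # Hardware isolation only: a full device or a MIG slice.
        return Isolation.WHOLE_GPU if needs_full_gpu else Isolation.MIG
    if best_effort:
        # Dev notebooks and spiky best-effort inference tolerate oversubscription.
        return Isolation.TIME_SLICING
    # Trusted batch that under-drives a GPU.
    return Isolation.MPS


print(choose_isolation(trusted=False, latency_critical=False,
                       needs_full_gpu=False, best_effort=False))
```

Checking trust and latency first means a best-effort flag can never downgrade an untrusted tenant onto a shared GPU.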

Deep dive · quotas & fair-share

Quotas decide who may use what; fair-share decides who runs next when demand exceeds supply. The goal is to give teams predictable floors while still lending out every idle GPU.

Hierarchical quotas

Capacity is a tree: cluster → org → team → project → user. Each node carries a guarantee (min reserved, never preempted) and a max (hard ceiling). This is what Kubernetes ResourceQuota + hierarchical schedulers (Kueue cohorts, Volcano/YuniKorn queues) or Slurm associations/QOS encode.
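
One way to picture the quota tree is a recursive record with two invariants: a node's guarantee never exceeds its ceiling, and children's guarantees never add up to more than their parent's. The sketch below is an assumption about how such a tree could be validated; the class, field names, and numbers are illustrative and do not correspond to any specific scheduler's schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class QuotaNode:
    name: str
    guarantee: int  # reserved floor, in GPUs
    ceiling: int    # hard maximum, in GPUs
    children: List["QuotaNode"] = field(default_factory=list)


def validate(node: QuotaNode) -> List[str]:
    """Return a list of invariant violations found in the subtree."""
    errors = []
    if node.guarantee > node.ceiling:
        errors.append(f"{node.name}: guarantee exceeds ceiling")
    child_floor = sum(child.guarantee for child in node.children)
    if child_floor > node.guarantee:
        errors.append(f"{node.name}: children guarantee {child_floor} > {node.guarantee}")
    for child in node.children:
        errors.extend(validate(child))
    return errors


tree = QuotaNode("cluster", 100, 100, [
    QuotaNode("org-a", 60, 80, [QuotaNode("team-1", 40, 60), QuotaNode("team-2", 20, 40)]),
    QuotaNode("org-b", 40, 70, [QuotaNode("team-3", 50, 60)]),  # over-promised floor
])
print(validate(tree))
```

Ceilings are allowed to overlap across siblings, since that overlap is exactly the borrowable headroom; only the floors must be backed by real capacity.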

flowchart TD
    ROOT["Cluster · 50k GPUs"] --> ORG1["Org Research · guarantee 15k"]
    ROOT --> ORG2["Org Product · guarantee 20k"]
    ROOT --> POOL["Shared burst pool · 15k"]
    ORG1 --> T1["Team NLP · min 6k / max 11k"]
    ORG1 --> T2["Team Vision · min 9k / max 14k"]
    ORG2 --> T3["Team Ads · min 12k / max 18k"]
    ORG2 --> T4["Team Search · min 8k / max 13k"]
    POOL -. borrow idle .-> T1
    POOL -. borrow idle .-> T3
      

Guaranteed vs. burstable

Every job is admitted as guaranteed (within the team's floor → reserved, non-preemptible, predictable) or burstable (above the floor → borrows idle capacity up to max, but is preemptible the moment the lender reclaims). This resembles the Guaranteed/Burstable split in Kubernetes QoS classes and the production/batch priority bands in Borg, and it is what lets the fleet run hot without breaking anyone's SLA.
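
The admission decision reduces to a few comparisons. The sketch below makes two simplifying assumptions that are not in the text: a job is classified as a whole (a job that straddles the floor is treated as entirely burstable), and two extra outcomes, `pending` and `rejected`, cover the cases where nothing can be borrowed or the ceiling would be exceeded.

```python
def classify(request: int, used: int, guarantee: int, ceiling: int,
             idle_borrowable: int) -> str:
    """Classify a job request for one team; all quantities are GPU counts."""
    after = used + request
    if after <= guarantee:
        return "guaranteed"   # fits inside the floor: non-preemptible
    if after > ceiling:
        return "rejected"     # would exceed the team's hard max
    if request <= idle_borrowable:
        return "burstable"    # borrows idle capacity: preemptible
    return "pending"          # allowed by quota, but nothing idle to borrow


print(classify(request=8, used=0, guarantee=16, ceiling=32, idle_borrowable=0))
print(classify(request=8, used=12, guarantee=16, ceiling=32, idle_borrowable=100))
```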

DRF fair-share

With multiple resource types, fairness can't be a single number. Dominant Resource Fairness (DRF) computes each tenant's dominant share (the largest fraction of any one resource it holds — GPU, CPU, or RAM) and equalizes that across tenants. It stops a CPU-heavy data team from crowding out a GPU-heavy training team and vice-versa.
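
DRF is usually implemented by progressive filling: repeatedly give one more task to the tenant with the smallest dominant share. The sketch below is a minimal version under simplifying assumptions (fixed per-task demands, a tenant drops out once its next task no longer fits, ties broken by name). The tenant names and capacities are illustrative.

```python
from typing import Dict


def drf_allocate(capacity: Dict[str, int],
                 demands: Dict[str, Dict[str, int]]) -> Dict[str, int]:
    """Return the number of tasks granted to each tenant under DRF."""
    used = {resource: 0 for resource in capacity}
    tasks = {tenant: 0 for tenant in demands}

    def dominant_share(tenant: str) -> float:
        # Largest fraction of any single resource this tenant currently holds.
        return max(tasks[tenant] * demands[tenant].get(r, 0) / capacity[r]
                   for r in capacity)

    active = set(demands)
    while active:
        tenant = min(sorted(active), key=dominant_share)
        need = demands[tenant]
        if all(used[r] + need.get(r, 0) <= capacity[r] for r in capacity):
            for r in capacity:
                used[r] += need.get(r, 0)
            tasks[tenant] += 1
        else:
            active.remove(tenant)  # next task does not fit; stop growing this tenant
    return tasks


capacity = {"gpu": 18, "cpu": 9}
demands = {
    "train": {"gpu": 4, "cpu": 1},  # GPU-dominant
    "data": {"gpu": 1, "cpu": 3},   # CPU-dominant
}
print(drf_allocate(capacity, demands))
```

With these inputs the GPU-heavy tenant ends with 3 tasks (12 of 18 GPUs) and the CPU-heavy tenant with 2 tasks (6 of 9 CPUs), so both hold a dominant share of two thirds.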

Borrowing, lending & preemption

Under-used guarantees are lent into the burst pool; cohorts share reclaimable capacity. When an owner returns, the scheduler preempts the lowest-priority burstable jobs — preferring graceful preemption (signal → checkpoint → requeue) over kill, and gang-preempting whole distributed jobs so a half-killed training run doesn't waste the survivors.
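
Victim selection for a reclaim can be sketched as a sort followed by a greedy take. The assumptions here are hypothetical: a lower `priority` number means preempted first, ties prefer larger jobs so fewer jobs are disturbed, and each job is taken whole, which is what gang-preemption means for a distributed run.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Job:
    name: str
    gpus: int
    priority: int    # lower value is preempted first (assumption)
    burstable: bool  # guaranteed jobs are never candidates


def select_victims(jobs: List[Job], gpus_needed: int) -> Tuple[List[Job], int]:
    """Pick whole burstable jobs, lowest priority first, until enough GPUs are freed."""
    candidates = sorted((j for j in jobs if j.burstable),
                        key=lambda j: (j.priority, -j.gpus))
    victims: List[Job] = []
    freed = 0
    for job in candidates:
        if freed >= gpus_needed:
            break
        victims.append(job)  # the entire gang goes, never a subset of its workers
        freed += job.gpus
    return victims, freed


running = [
    Job("sweep", 8, priority=1, burstable=True),
    Job("pretrain-borrowed", 16, priority=0, burstable=True),
    Job("prod-train", 64, priority=0, burstable=False),
    Job("notebook", 4, priority=2, burstable=True),
]
victims, freed = select_victims(running, gpus_needed=20)
print([v.name for v in victims], freed)
```

Each selected victim would then receive the graceful sequence described above (signal, checkpoint, requeue) rather than an immediate kill.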

| Aspect | Static reservation | Fair-share + borrowing (chosen) |
|---|---|---|
| Utilization | Low: idle guarantees stranded | High: idle lent out |
| Predictability | High: capacity always there | Lower: borrowed capacity is preemptible |
| Fairness | Coarse, by admin fiat | Dynamic DRF across resources |
| Complexity | Simple | High: preemption, checkpoint, accounting |
| Best for | Hard SLAs, untrusted isolation | Maximizing fleet ROI |

Chargeback / showback

Metering closes the loop: bill guaranteed capacity by reservation (you pay for the floor you hold) and burst usage by consumption (GPU-seconds × type, with a premium for NVLink/topology). Start with showback (report only) to build trust, then move to chargeback (real budget) — which is what finally makes teams release idle reservations voluntarily.
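
The two-part pricing model can be sketched as follows. The rates and the premium multiplier are made-up placeholders, not real prices, and the function name is hypothetical.

```python
from typing import Dict

# Hypothetical internal rates in dollars per GPU-hour.
RATES_PER_GPU_HOUR = {"H100": 4.0, "A100": 2.5, "L40S": 1.5}


def bill(reserved_gpus: Dict[str, int], burst_gpu_seconds: Dict[str, float],
         period_hours: float, topology_premium: float = 1.0) -> Dict[str, float]:
    """Charge the floor by reservation and the burst by metered consumption."""
    reservation = sum(count * period_hours * RATES_PER_GPU_HOUR[gpu_type]
                      for gpu_type, count in reserved_gpus.items())
    burst = sum(seconds / 3600 * RATES_PER_GPU_HOUR[gpu_type] * topology_premium
                for gpu_type, seconds in burst_gpu_seconds.items())
    return {"reservation": reservation, "burst": burst, "total": reservation + burst}


invoice = bill(reserved_gpus={"H100": 10}, burst_gpu_seconds={"A100": 7200},
               period_hours=100)
print(invoice)
```

Because the reservation term is charged whether or not the GPUs are used, a team holding an idle floor sees the cost directly, which is the incentive the text describes.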

Deep dive · noisy-neighbor control

Even with quotas correct, co-located tenants can still degrade each other through shared physical paths the scheduler doesn't see. The contention points, roughly in order of pain:

The fundamental isolation vs. utilization trade-off reappears here: hard partitions (MIG, whole-GPU, dedicated NVLink domains) eliminate interference but strand capacity; soft sharing lifts utilization but reintroduces variance — deadly for latency-sensitive inference.

Per-tenant monitoring

Observability is the control loop. DCGM (Data Center GPU Manager) exports per-GPU and per-MIG-instance metrics: SM occupancy, memory utilization, memory bandwidth, NVLink/PCIe throughput, power, and ECC errors. DCGM itself knows nothing about tenants, so the platform joins each series to a tenant label from the scheduler's placement records (on Kubernetes, for example, via pod and namespace labels attached at export time). The platform watches for inference p99 regressions that correlate with a co-tenant's arrival, then reacts: migrate the victim, harden it onto a MIG slice or whole GPU, or use topology-aware scheduling to stop placing different tenants in the same NVLink domain in the first place.
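
The "regression that correlates with a co-tenant's arrival" check can be approximated by comparing latency before and after the arrival timestamp. This is a deliberately crude sketch: the 1.2× threshold, the minimum sample count, and the function name are assumptions, and a production detector would need proper statistics and controls for load changes.

```python
from statistics import median
from typing import List, Tuple


def regressed_after(samples: List[Tuple[float, float]], arrival_ts: float,
                    threshold: float = 1.2, min_points: int = 3) -> bool:
    """True if median p99 after the co-tenant arrived exceeds threshold x the median before.

    samples: (timestamp_seconds, p99_latency_ms) pairs for the victim workload.
    """
    before = [value for ts, value in samples if ts < arrival_ts]
    after = [value for ts, value in samples if ts >= arrival_ts]
    if len(before) < min_points or len(after) < min_points:
        return False  # not enough evidence either way
    return median(after) > threshold * median(before)


p99 = [(0, 10.0), (1, 11.0), (2, 10.0), (3, 15.0), (4, 16.0), (5, 14.0)]
print(regressed_after(p99, arrival_ts=3))
```

A positive result is a trigger for the reactions listed above (migrate, harden onto MIG or a whole GPU), not proof of causation.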

High-level design

Jobs flow through a submission API into an admission + quota controller (validates hierarchical min/max, runs DRF, decides guaranteed vs. burstable) and then a gang + topology-aware scheduler that binds work onto partitioned GPU nodes (MIG / whole-GPU / MPS pools). DCGM telemetry streams into a metering service that produces GPU-seconds for billing — and feeds utilization back to the scheduler to inform preemption.
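
The metering step turns placement intervals into GPU-seconds per tenant. The sketch below assumes each interval carries a GPU fraction, with a MIG slice counted as its compute slices divided by 7; that attribution rule and the tuple layout are illustrative assumptions, not a DCGM feature.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# (tenant, start_seconds, end_seconds, gpu_fraction)
Interval = Tuple[str, float, float, float]


def gpu_seconds(intervals: Iterable[Interval]) -> Dict[str, float]:
    """Sum duration x GPU fraction per tenant."""
    totals: Dict[str, float] = defaultdict(float)
    for tenant, start, end, fraction in intervals:
        if end < start:
            raise ValueError(f"interval for {tenant} ends before it starts")
        totals[tenant] += (end - start) * fraction
    return dict(totals)


usage = gpu_seconds([
    ("team-a", 0, 3600, 8.0),     # 8 whole GPUs for an hour
    ("team-b", 0, 7000, 1 / 7),   # one 1g MIG slice, counted as 1/7 of a GPU
    ("team-a", 3600, 3700, 8.0),  # the same job, a later sample window
])
print(usage)
```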

flowchart LR
    subgraph Tenants
      T1["Team A jobs"]
      T2["Team B jobs"]
    end
    T1 --> API["Submission API / CLI"]
    T2 --> API
    API --> ADM["Admission + Quota Controller"]
    ADM -->|"check min/max, DRF"| Q[("Quota + usage store")]
    ADM --> SCHED["Gang + topology scheduler"]
    SCHED -->|"bind to MIG slice"| N1["GPU node 1 · MIG"]
    SCHED --> N2["GPU node 2 · whole GPU"]
    SCHED --> N3["GPU node 3 · MPS"]
    N1 --> DCGM["DCGM telemetry"]
    N2 --> DCGM
    N3 --> DCGM
    DCGM --> MET["Usage metering · GPU-seconds"]
    MET --> BILL["Billing / chargeback"]
    MET -. feedback .-> SCHED
      

The interesting path is admission of a burstable job that borrows idle capacity, and what happens when the lender reclaims it:

sequenceDiagram
    participant U as Team (burstable job)
    participant A as Admission Controller
    participant S as Scheduler
    participant P as Preemptor
    participant N as GPU Node
    U->>A: submit job, needs 8 GPUs
    A->>A: check team min + borrowable idle
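    alt within quota or idle capacity
        A->>S: admit and enqueue
        S->>N: gang-schedule 8 GPUs
        N-->>U: running
    else over max or no idle capacity
        A-->>U: held in queue
    end
    opt owner later reclaims lent GPUs
        S->>P: pick lowest-priority burstable jobs
        P->>N: preempt burstable job, checkpoint
        N-->>U: evicted and requeued
    end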
      

Bottlenecks & scaling

| Bottleneck | Why it hurts | Mitigation |
|---|---|---|
| Utilization vs. isolation | Hard partitions strand capacity; soft sharing leaks interference | Tiered pools: MIG/whole-GPU for untrusted & latency; MPS/time-slice for trusted batch; pack small jobs into MIG slices |
| Quota fragmentation | Static MIG profiles + gang jobs strand partial slices; an 8-GPU gang can't land on scattered frees | Defrag / bin-pack; re-profile MIG by demand; topology-aware gang scheduling; backfill small jobs into holes |
| Fairness vs. efficiency | DRF + preemption lower packing density; preemption wastes compute | Checkpoint/requeue; preempt lowest-priority burstable first; cooldowns to damp churn |
| Security / blast radius | Shared-memory tenants (MPS/time-slice) can leak or DoS neighbors | Untrusted → whole-GPU/MIG only; scrub memory; per-tenant namespace, network policy, seccomp |
| Metering accuracy | Chargeback disputes; fractional usage hard to attribute | Per-tenant/per-MIG DCGM GPU-seconds; reconcile; reserve-priced floors + usage-priced burst |
| Control-plane scale | 50k GPUs × 30k jobs/day strains one scheduler + quota tree | Shard the scheduler; cache the quota tree; async admission; eventually-consistent usage accounting |

Key decisions recap

One fleet, tiered isolation: whole-GPU/MIG for untrusted & latency-critical, MPS/time-slicing to recover utilization on trusted batch. Hierarchical quotas with a non-preemptible guarantee floor and a preemptible burst ceiling; DRF for cross-resource fairness; idle capacity lent and reclaimed via graceful gang-preemption. MIG partitions memory bandwidth, which is what tames the worst noisy neighbor; DCGM per-tenant telemetry drives both interference response and chargeback. The through-line: spend isolation only where trust or SLOs demand it, and pack everything else to push the fleet to 60-80%.