AI / ML Infrastructure
Multi-Tenant GPU Cluster
Hundreds of teams share one fleet of extremely expensive, extremely scarce GPUs to run everything from 512-GPU training runs to fractional inference pods. The whole problem is a single tension: strong isolation and fair quotas (so tenants don't see each other, starve each other, or interfere) versus high utilization (so a billion-dollar fleet isn't sitting 30% idle). The answer is a layered system — hardware/software isolation levels (whole-GPU, MIG, MPS, time-slicing), hierarchical quotas with DRF fair-share and preemptive borrowing, noisy-neighbor control, and per-tenant metering that feeds chargeback.
Requirements
Functional
- Many teams submit both training (multi-GPU, gang-scheduled, hours-to-days, batch) and inference (fractional GPU, latency-sensitive, long-running services) through one common API/CLI.
- Per-team quotas: each team has a guaranteed GPU allotment plus an optional burst limit; admins manage hierarchical quotas (org → team → project → user).
- Chargeback / showback: meter actual usage (GPU-seconds × type) and attribute cost back to the owning team.
- Self-service: a team picks GPU type and topology (e.g. 8×H100 on one NVLink domain), picks an isolation level, and sees its own utilization.
- Idle capacity is borrowable by other teams and reclaimable (preemption) when the owner returns.
Non-functional
- Isolation & security: hard boundaries between tenants for memory, faults, and performance — tenants may be mutually untrusted.
- Fairness: no team starves; share fairly across multiple resource types (GPU, CPU, RAM) via DRF.
- High utilization: drive the fleet from a typical 30-40% to 60-80% allocated/busy.
- No noisy neighbors: bound cross-tenant interference on memory bandwidth, PCIe, NVLink, and the network fabric so inference SLOs hold.
- Elastic, highly available control plane; deep observability; accurate, dispute-proof metering.
Scale & back-of-the-envelope
A fleet this size is a capital question before it is an engineering
one: ~50k H100-class GPUs is on the order of
$1.5-2B of hardware. At an internal price of
~$2-4 / GPU-hour, every
1% of utilization you win (or waste) is 500 GPUs × 8,760 hours,
or roughly $9-18M a year, which is exactly
why we will trade some isolation purity for packing density.
| Dimension | Estimate | Notes |
|---|---|---|
| Fleet | 50,000 GPUs (H100 / A100 / L40S mix) | spread across DCs & availability zones |
| Tenants | 300+ teams, 5,000+ users | nested orgs → teams → projects |
| Jobs / day | ~30k (training runs + inference deploys) | bursty submission, long-tailed durations |
| Training | 8-512 GPUs each, hours-days | gang-scheduled, topology-aware, NVLink/IB |
| Inference | fractional (1g.10gb MIG / MPS), 24×7 | latency-sensitive, tight p99 SLOs |
| Guaranteed quota | ~70% of fleet (35k) reserved to teams | the non-preemptible floor |
| Burst pool | ~30% (15k) borrowable, preemptible | reclaimed on demand by owners |
| Utilization target | 60-80% (from 30-40% baseline) | via bin-packing + fractional + preemption |
| Oversubscription | 2-4× logical:physical for dev/inference | time-sliced, best-effort only |
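The capital argument above can be checked with a few lines of arithmetic. This is a minimal sketch using only the fleet size and internal price range quoted in this section; the function name `annual_value` is a hypothetical helper, not part of any platform API.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours


def annual_value(fleet_gpus: int, utilization_delta: float, price_per_gpu_hour: float) -> float:
    """Dollar value per year of shifting fleet utilization by `utilization_delta`."""
    gpu_hours = fleet_gpus * utilization_delta * HOURS_PER_YEAR
    return gpu_hours * price_per_gpu_hour


# One percentage point of a 50k-GPU fleet at the low and high internal prices.
low = annual_value(50_000, 0.01, 2.0)
high = annual_value(50_000, 0.01, 4.0)
print(f"1% of utilization: ${low / 1e6:.1f}M to ${high / 1e6:.1f}M per year")

# Moving from a 35% baseline to a 70% target is 35 such percentage points.
uplift_low = annual_value(50_000, 0.35, 2.0)
```

The same function scales linearly, so closing the gap between the 30-40% baseline and the 60-80% target is worth dozens of these one-point increments.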
Where the idle goes
Baseline GPU clusters waste capacity in three ways: reserved but idle (a team holds a guarantee it isn't using), allocated but under-driven (a notebook owns a whole GPU at 5% SM occupancy), and fragmentation (free GPUs are scattered so no 8-GPU gang fits). The design attacks all three: borrowing reclaims reserved-idle, fractional sharing fixes under-driven, and topology-aware bin-packing fixes fragmentation.
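To make the fragmentation case concrete, here is a minimal best-fit placement sketch. It assumes a gang must land on a single node and that the scheduler only knows a free-GPU count per node; the function `best_fit_node` and the node names are hypothetical, and a real topology-aware scheduler would also weigh NVLink domains and racks.

```python
from typing import Optional


def best_fit_node(free_gpus_by_node: dict[str, int], gang_size: int) -> Optional[str]:
    """Return the node with the fewest free GPUs that still fits the whole gang."""
    candidates = [
        (free, name) for name, free in free_gpus_by_node.items() if free >= gang_size
    ]
    # min() on (free, name) picks the tightest fit, breaking ties by node name.
    return min(candidates)[1] if candidates else None


# Eight GPUs are free in total, but scattered: no 8-GPU gang fits anywhere.
fragmented = {"node-a": 3, "node-b": 3, "node-c": 2}
no_fit = best_fit_node(fragmented, 8)

# Best-fit sends a 2-GPU job to the smallest hole, keeping the empty node whole.
packed = {"node-a": 8, "node-b": 3, "node-c": 2}
small_job_node = best_fit_node(packed, 2)
big_job_node = best_fit_node(packed, 8)
```

Placing small jobs in the tightest hole is what keeps whole nodes available for the next gang.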
Deep dive · GPU sharing & isolation levels
This is the core decision. A GPU can be handed to tenants at four increasingly granular levels, trading isolation strength for utilization. Picking the right level per workload class and per trust boundary is what makes multi-tenancy safe and dense.
| Level | Mechanism | Isolation | Utilization | Use-case |
|---|---|---|---|---|
| Whole-GPU | device plugin assigns one full GPU to one container | Strongest — dedicated hardware | Low for small jobs | large training; untrusted tenants |
| MIG | hardware partition (A100/H100) into up to 7 instances | Strong — dedicated SMs, L2 slice, memory & mem-bandwidth; fault containment | Medium-high | inference, notebooks, mixed tenants |
| MPS | spatial SM sharing across processes (time-slice avoidance) | Weak — shared memory space, no fault isolation | High | trusted, cooperative co-location |
| Time-slicing | plugin advertises N logical replicas; temporal sharing | None — shared everything, oversubscribed | Highest | dev / notebooks, bursty inference |
MIG (Multi-Instance GPU) is the workhorse for
multi-tenancy because it is the only hardware partition: an
H100/A100 is sliced into instances such as 1g.10gb,
2g.20gb, 3g.40gb, 7g.80gb, each
getting its own SM compute slices, its own L2-cache slice and memory
controllers, and its own slice of HBM bandwidth. A
fault or ECC error in one slice is contained and does not crash the
others.
```mermaid
flowchart TD
    GPU["H100 80GB physical GPU"] --> MIG["MIG mode enabled"]
    MIG --> S1["Slice 3g.40gb · Team A training"]
    MIG --> S2["Slice 2g.20gb · Team B inference"]
    MIG --> S3["Slice 1g.10gb · Team C notebook"]
    MIG --> S4["Slice 1g.10gb · Team D notebook"]
    S1 --> HW1["Dedicated SMs + L2 slice + mem BW"]
    S2 --> HW2["Dedicated SMs + L2 slice + mem BW"]
    S3 --> HW3["Dedicated SMs + L2 slice + mem BW"]
    S4 --> HW4["Dedicated SMs + L2 slice + mem BW"]
```
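The layout in the diagram can be sanity-checked with a simple capacity test. The sketch below is a deliberately simplified assumption: it only sums the compute and memory encoded in the profile names against one 80 GB GPU with 7 compute slices. Real MIG placement has additional rules about which slice positions a profile may occupy, so passing this check is necessary but not sufficient. `parse_profile` and `fits_on_one_gpu` are hypothetical helpers.

```python
import re

COMPUTE_SLICES = 7  # "up to 7 instances" per GPU
MEMORY_GB = 80      # H100 80GB, as in the diagram


def parse_profile(profile: str) -> tuple[int, int]:
    """Split a name like '3g.40gb' into (compute slices, memory in GB)."""
    match = re.fullmatch(r"(\d+)g\.(\d+)gb", profile)
    if match is None:
        raise ValueError(f"unrecognized MIG profile name: {profile!r}")
    return int(match.group(1)), int(match.group(2))


def fits_on_one_gpu(profiles: list[str]) -> bool:
    """Capacity-only check; ignores the placement-position rules of real MIG."""
    compute = memory = 0
    for profile in profiles:
        c, gb = parse_profile(profile)
        compute += c
        memory += gb
    return compute <= COMPUTE_SLICES and memory <= MEMORY_GB


diagram_layout = ["3g.40gb", "2g.20gb", "1g.10gb", "1g.10gb"]
layout_ok = fits_on_one_gpu(diagram_layout)
one_too_many = fits_on_one_gpu(diagram_layout + ["1g.10gb"])
```

The four slices in the diagram use exactly 7 compute slices and 80 GB, so a fifth tenant cannot be added to that GPU without re-profiling.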
Security boundaries
Trust dictates the floor. For untrusted or external tenants, only whole-GPU or MIG are acceptable: both give hardware-enforced memory isolation, and MIG adds error containment. MPS and time-slicing enforce no hardware memory or fault boundary between co-tenants, and MPS clients additionally share one server process, so a buggy or malicious process can OOM its neighbors or crash the shared MPS server; the blast radius is every co-located tenant. They are only for mutually trusting, cooperative workloads.
Memory isolation
Residual data must never leak across tenants: the driver zeroes memory
on context teardown, but the platform should
verify/scrub
between allocations and treat shared-SM timing side-channels as a risk
for untrusted co-tenancy. At the container layer the
NVIDIA Container Toolkit exposes only the assigned
device or MIG UUID; cgroups cap CPU/RAM, and
seccomp/AppArmor + network policy fence the rest. Crucially, MIG
partitions memory bandwidth too — which is what makes
it immune to the dominant noisy-neighbor effect that MPS/time-slicing
suffer (next section).
Rule of thumb
Untrusted or latency-critical → whole-GPU or MIG. Trusted batch that under-drives a GPU → MPS to recover utilization. Dev notebooks and spiky best-effort inference → time-slicing with oversubscription. One cluster runs all three tiers, chosen per pool.
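The rule of thumb can be written down as a small decision function. This is an illustrative sketch only: the `Workload` fields and the returned pool names are assumptions made for the example, and a real platform would likely drive this from pool labels or admission policy rather than code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    trusted: bool            # tenant is inside the mutual-trust boundary
    latency_critical: bool   # has a tight p99 SLO
    best_effort: bool        # dev notebook or spiky, non-critical inference
    needs_full_gpu: bool     # e.g. large training that saturates a device


def isolation_level(w: Workload) -> str:
    """Map a workload to a pool following the rule of thumb above."""
    if w.needs_full_gpu:
        return "whole-gpu"
    if not w.trusted or w.latency_critical:
        return "mig"            # hardware partition for untrusted or SLO-bound work
    if w.best_effort:
        return "time-slicing"   # oversubscribed, no isolation
    return "mps"                # trusted batch that under-drives a GPU


external_inference = isolation_level(Workload(False, True, False, False))
trusted_batch = isolation_level(Workload(True, False, False, False))
notebook = isolation_level(Workload(True, False, True, False))
```

The order of the checks matters: trust and latency are tested before utilization, so a workload never lands in a shared pool just because it is small.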
Deep dive · quotas & fair-share
Quotas decide who may use what; fair-share decides who runs next when demand exceeds supply. The goal is to give teams predictable floors while still lending out every idle GPU.
Hierarchical quotas
Capacity is a tree:
cluster → org → team → project → user. Each node carries
a guarantee (min reserved, never preempted) and a
max (hard ceiling). This is what Kubernetes
ResourceQuota + hierarchical schedulers (Kueue
cohorts, Volcano/YuniKorn queues) or
Slurm associations/QOS encode.
```mermaid
flowchart TD
    ROOT["Cluster · 50k GPUs"] --> ORG1["Org Research · guarantee 20k"]
    ROOT --> ORG2["Org Product · guarantee 25k"]
    ROOT --> POOL["Shared burst pool · 5k"]
    ORG1 --> T1["Team NLP · min 8k / max 15k"]
    ORG1 --> T2["Team Vision · min 12k / max 18k"]
    ORG2 --> T3["Team Ads · min 15k / max 22k"]
    ORG2 --> T4["Team Search · min 10k / max 16k"]
    POOL -. borrow idle .-> T1
    POOL -. borrow idle .-> T3
```
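A quota tree like the one in the diagram is easy to model and validate. The sketch below assumes three invariants that seem implied by the design: a node's guarantee never exceeds its max, children's guarantees never sum to more than the parent's, and no child max exceeds the parent max. `QuotaNode` and `validate` are hypothetical names, and the org-level max of 25k is an assumption (guarantee plus the 5k shared pool), since the diagram does not state one.

```python
from dataclasses import dataclass, field


@dataclass
class QuotaNode:
    name: str
    guarantee: int   # min reserved GPUs, never preempted
    max_gpus: int    # hard ceiling
    children: list["QuotaNode"] = field(default_factory=list)


def validate(node: QuotaNode) -> list[str]:
    """Return a list of violations; an empty list means the subtree is consistent."""
    errors = []
    if node.guarantee > node.max_gpus:
        errors.append(f"{node.name}: guarantee exceeds max")
    child_sum = sum(c.guarantee for c in node.children)
    if child_sum > node.guarantee:
        errors.append(f"{node.name}: children guarantee {child_sum} > {node.guarantee}")
    for child in node.children:
        if child.max_gpus > node.max_gpus:
            errors.append(f"{child.name}: max exceeds parent max")
        errors.extend(validate(child))
    return errors


# Org Research from the diagram; the 25_000 ceiling is an assumed value.
research = QuotaNode("org-research", 20_000, 25_000, [
    QuotaNode("team-nlp", 8_000, 15_000),
    QuotaNode("team-vision", 12_000, 18_000),
])
research_errors = validate(research)
```

Running a check like this whenever an admin edits the tree keeps over-promised guarantees from reaching the scheduler.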
Guaranteed vs. burstable
Every job is admitted as guaranteed (within the
team's floor → reserved, non-preemptible, predictable) or
burstable (above the floor → borrows idle capacity up
to max, but is preemptible the moment
the lender reclaims). This mirrors Borg/Kubernetes QoS classes and is
what lets the fleet run hot without breaking anyone's SLA.
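The guaranteed-versus-burstable decision reduces to a comparison against the team's floor and ceiling. A minimal sketch, assuming a job is classified as a whole and that the controller knows how many idle GPUs are currently borrowable; `admission_class` and the `"queued"` outcome are hypothetical names for the example.

```python
def admission_class(in_use: int, request: int, guarantee: int,
                    max_gpus: int, idle_borrowable: int) -> str:
    """Classify a job as 'guaranteed', 'burstable', or 'queued'."""
    total = in_use + request
    if total <= guarantee:
        return "guaranteed"            # fits under the team's floor
    if total > max_gpus:
        return "queued"                # would break the hard ceiling
    # Only the part above the floor (or above current usage) is newly borrowed.
    newly_borrowed = total - max(in_use, guarantee)
    if newly_borrowed <= idle_borrowable:
        return "burstable"             # admitted, but preemptible on reclaim
    return "queued"                    # not enough idle capacity to lend


# Team NLP from the diagram: min 8k / max 15k.
within_floor = admission_class(6_000, 512, 8_000, 15_000, 1_000)
above_floor = admission_class(7_990, 64, 8_000, 15_000, 1_000)
over_ceiling = admission_class(14_900, 512, 8_000, 15_000, 1_000)
```

Classifying the whole job as burstable when it straddles the floor keeps the preemption contract simple: a gang is either fully protected or fully reclaimable.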
DRF fair-share
With multiple resource types, fairness can't be a single number. Dominant Resource Fairness (DRF) computes each tenant's dominant share (the largest fraction of any one resource it holds — GPU, CPU, or RAM) and equalizes that across tenants. It stops a CPU-heavy data team from crowding out a GPU-heavy training team and vice-versa.
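DRF is commonly implemented as progressive filling: repeatedly give one more task to the tenant with the smallest dominant share. The sketch below is a simplified variant that assumes each tenant runs identical tasks with a fixed per-task demand, and that a tenant whose next task no longer fits simply drops out while the others keep filling. The tenant names, capacities, and the function `drf_allocate` are hypothetical.

```python
def drf_allocate(capacity: dict[str, float],
                 demands: dict[str, dict[str, float]]) -> dict[str, int]:
    """Progressive-filling DRF; returns the number of tasks granted per tenant."""
    if any(max(d.get(r, 0) for r in capacity) <= 0 for d in demands.values()):
        raise ValueError("every tenant must demand some resource")
    used = {r: 0.0 for r in capacity}
    tasks = {t: 0 for t in demands}
    active = sorted(demands)  # sorted so ties break deterministically by name

    def dominant_share(tenant: str) -> float:
        # Largest fraction of any single resource this tenant currently holds.
        return max(tasks[tenant] * demands[tenant].get(r, 0) / capacity[r]
                   for r in capacity)

    while active:
        tenant = min(active, key=dominant_share)
        need = demands[tenant]
        if all(used[r] + need.get(r, 0) <= capacity[r] for r in capacity):
            for r in capacity:
                used[r] += need.get(r, 0)
            tasks[tenant] += 1
        else:
            active.remove(tenant)  # saturated; keep filling the others
    return tasks


allocation = drf_allocate(
    capacity={"gpu": 16, "cpu": 128},
    demands={"training": {"gpu": 2, "cpu": 8}, "dataprep": {"cpu": 16}},
)
```

In this example GPU is the dominant resource for `training` and CPU for `dataprep`. The two advance in lockstep until CPU runs out, ending with 6 training tasks and 5 data-prep tasks, so neither side crowds out the other.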
Borrowing, lending & preemption
Under-used guarantees are lent into the burst pool; cohorts share reclaimable capacity. When an owner returns, the scheduler preempts the lowest-priority burstable jobs — preferring graceful preemption (signal → checkpoint → requeue) over kill, and gang-preempting whole distributed jobs so a half-killed training run doesn't waste the survivors.
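Victim selection for a reclaim can be sketched as a sort over burstable gangs. This assumes a lower `priority` number means "preempt first" and that each job is evicted as a whole gang, never partially. `RunningJob` and `select_victims` are hypothetical names, and the checkpoint and requeue steps are left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunningJob:
    job_id: str
    priority: int     # lower value is preempted first (assumed convention)
    gpus: int         # size of the whole gang
    burstable: bool   # guaranteed jobs are never candidates


def select_victims(jobs: list[RunningJob], gpus_needed: int) -> list[RunningJob]:
    """Pick whole burstable gangs, lowest priority first, until enough GPUs are freed."""
    victims: list[RunningJob] = []
    freed = 0
    candidates = sorted((j for j in jobs if j.burstable),
                        key=lambda j: (j.priority, j.job_id))
    for job in candidates:
        if freed >= gpus_needed:
            break
        victims.append(job)   # evict the entire gang, not individual workers
        freed += job.gpus
    return victims


running = [
    RunningJob("a", priority=1, gpus=8, burstable=True),
    RunningJob("b", priority=0, gpus=64, burstable=True),
    RunningJob("c", priority=0, gpus=16, burstable=False),
]
victim_ids = [j.job_id for j in select_victims(running, 70)]
```

Because gangs are indivisible, the scheduler may free slightly more than was asked for (72 GPUs for a 70-GPU reclaim here); the surplus returns to the burst pool.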
| Aspect | Static reservation | Fair-share + borrowing (chosen) |
|---|---|---|
| Utilization | Low — idle guarantees stranded | High — idle lent out |
| Predictability | High — capacity always there | Lower — borrowed capacity is preemptible |
| Fairness | Coarse, by admin fiat | Dynamic DRF across resources |
| Complexity | Simple | High — preemption, checkpoint, accounting |
| Best for | Hard SLAs, untrusted isolation | Maximizing fleet ROI |
Chargeback / showback
Metering closes the loop: bill guaranteed capacity by
reservation (you pay for the floor you hold) and
burst usage by consumption (GPU-seconds × type, with a premium for NVLink/topology). Start with
showback (report only) to build trust, then move to
chargeback (real budget) — which is what finally
makes teams release idle reservations voluntarily.
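The two-part pricing model can be expressed in a few lines. The rates and the topology premium below are hypothetical placeholders chosen inside the $2-4 / GPU-hour range mentioned earlier, not real prices, and `bill` is an illustrative helper.

```python
# Hypothetical internal rates in dollars per GPU-hour; not a real price list.
RATE_PER_GPU_HOUR = {"H100": 4.0, "A100": 2.5, "L40S": 2.0}
TOPOLOGY_PREMIUM = 1.25  # hypothetical multiplier for NVLink-domain placement


def bill(reserved_gpus: dict[str, int], period_hours: float,
         burst_usage: list[tuple[str, float, bool]]) -> dict[str, float]:
    """Reservation is billed by held capacity, burst by metered GPU-seconds."""
    reservation = sum(count * period_hours * RATE_PER_GPU_HOUR[gpu_type]
                      for gpu_type, count in reserved_gpus.items())
    burst = sum(
        gpu_seconds / 3600 * RATE_PER_GPU_HOUR[gpu_type]
        * (TOPOLOGY_PREMIUM if nvlink else 1.0)
        for gpu_type, gpu_seconds, nvlink in burst_usage
    )
    return {"reservation": reservation, "burst": burst, "total": reservation + burst}


invoice = bill(
    reserved_gpus={"H100": 10},
    period_hours=100,
    burst_usage=[("H100", 7200, True), ("A100", 3600, False)],
)
```

Note that the reservation line is charged whether or not the GPUs were busy, which is the incentive that pushes teams to hand back floors they do not use.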
Deep dive · noisy-neighbor control
Even with quotas correct, co-located tenants can still degrade each other through shared physical paths the scheduler doesn't see. The contention points, roughly in order of pain:
- HBM memory bandwidth — the worst offender for MPS/time-sliced co-tenancy: two kernels can each show high "GPU util %" while real throughput collapses because they're fighting for the same memory controllers. MIG fixes this by partitioning bandwidth; MPS/time-slicing do not.
- PCIe — host↔device copies share lanes; a data-loading-heavy job starves a neighbor's transfers.
- NVLink / NVSwitch — collective all-reduce in distributed training saturates the intra-node fabric; two big jobs on the same NVLink domain interfere.
- Network fabric — InfiniBand / RoCE RDMA for multi-node training; cross-tenant flows contend for switch bandwidth.
The fundamental isolation vs. utilization trade-off reappears here: hard partitions (MIG, whole-GPU, dedicated NVLink domains) eliminate interference but strand capacity; soft sharing lifts utilization but reintroduces variance — deadly for latency-sensitive inference.
Per-tenant monitoring
Observability is the control loop. DCGM (Data Center
GPU Manager) exports per-GPU and
per-MIG-instance metrics — SM occupancy, memory
utilization, memory bandwidth, NVLink/PCIe
throughput, power, ECC errors — all tagged with a tenant label. The
platform watches for inference p99 regressions that
correlate with a co-tenant's arrival, then reacts: migrate the victim,
harden it onto a MIG slice or whole GPU, or use
topology-aware scheduling to stop placing different
tenants in the same NVLink domain in the first place.
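One simple way to flag a p99 regression that lines up with a co-tenant's arrival is to compare latency windows before and after the arrival time. This is a rough sketch under stated assumptions: samples are `(timestamp_seconds, p99_ms)` pairs already exported by the monitoring stack, and the window length and 1.5× threshold are arbitrary illustrative values. It does not call any DCGM API.

```python
from statistics import median


def regressed_after_arrival(samples: list[tuple[float, float]], arrival_ts: float,
                            window_s: float = 300.0, threshold: float = 1.5) -> bool:
    """True if median p99 after the co-tenant arrived exceeds threshold x the median before."""
    before = [p99 for ts, p99 in samples if arrival_ts - window_s <= ts < arrival_ts]
    after = [p99 for ts, p99 in samples if arrival_ts <= ts < arrival_ts + window_s]
    if not before or not after:
        return False  # not enough data on one side to judge
    return median(after) > threshold * median(before)


# p99 sits near 20 ms, then jumps to about 45 ms once a neighbor lands at t=600.
latency = [(t, 20.0) for t in range(300, 600, 60)] + [(t, 45.0) for t in range(600, 900, 60)]
victim_flagged = regressed_after_arrival(latency, arrival_ts=600)
quiet_neighbor = regressed_after_arrival([(t, 20.0) for t in range(300, 900, 60)], 600)
```

A flag from a check like this is only a correlation, so it is best used to trigger the reactions listed above (migrate, harden, or re-place) rather than to assign blame.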
High-level design
Jobs flow through a submission API into an
admission + quota controller (validates hierarchical
min/max, runs DRF, decides guaranteed vs. burstable) and then a
gang + topology-aware scheduler that binds work onto
partitioned GPU nodes (MIG / whole-GPU / MPS pools).
DCGM telemetry streams into a
metering service that produces
GPU-seconds for billing — and feeds
utilization back to the scheduler to inform preemption.
```mermaid
flowchart LR
    subgraph Tenants
        T1["Team A jobs"]
        T2["Team B jobs"]
    end
    T1 --> API["Submission API / CLI"]
    T2 --> API
    API --> ADM["Admission + Quota Controller"]
    ADM -->|"check min/max, DRF"| Q[("Quota + usage store")]
    ADM --> SCHED["Gang + topology scheduler"]
    SCHED -->|"bind to MIG slice"| N1["GPU node 1 · MIG"]
    SCHED --> N2["GPU node 2 · whole GPU"]
    SCHED --> N3["GPU node 3 · MPS"]
    N1 --> DCGM["DCGM telemetry"]
    N2 --> DCGM
    N3 --> DCGM
    DCGM --> MET["Usage metering · GPU-seconds"]
    MET --> BILL["Billing / chargeback"]
    MET -. feedback .-> SCHED
```
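The metering step in the diagram amounts to integrating allocation samples over time. A minimal sketch, assuming the telemetry pipeline emits one sample per allocation per fixed interval and that fractional allocations are expressed as a fraction of a physical GPU; the sample shape and the function `gpu_seconds` are assumptions for illustration.

```python
from collections import defaultdict


def gpu_seconds(samples: list[tuple[str, str, float]],
                interval_s: float) -> dict[tuple[str, str], float]:
    """Sum GPU-seconds per (tenant, gpu_type) from fixed-interval samples.

    Each sample is (tenant, gpu_type, gpu_fraction), where gpu_fraction is 1.0
    for a whole GPU and less than 1.0 for a fractional slice.
    """
    totals: dict[tuple[str, str], float] = defaultdict(float)
    for tenant, gpu_type, fraction in samples:
        totals[(tenant, gpu_type)] += fraction * interval_s
    return dict(totals)


# Two 30-second scrapes: team-a holds a whole H100, team-b holds half of one.
scrapes = [
    ("team-a", "H100", 1.0), ("team-b", "H100", 0.5),
    ("team-a", "H100", 1.0), ("team-b", "H100", 0.5),
]
usage = gpu_seconds(scrapes, interval_s=30)
```

Keying the totals by tenant and GPU type is what lets billing apply a per-type rate later without re-reading raw telemetry.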
The interesting path is admission of a burstable job that borrows idle capacity, and what happens when the lender reclaims it:
```mermaid
sequenceDiagram
    participant U as Team (burstable job)
    participant A as Admission Controller
    participant S as Scheduler
    participant P as Preemptor
    participant N as GPU Node
    U->>A: submit job, needs 8 GPUs
    A->>A: check team min + borrowable idle
    alt within quota or idle capacity
        A->>S: admit and enqueue
        S->>N: gang-schedule 8 GPUs
        N-->>U: running
    else owner reclaims lent GPUs
        P->>N: preempt burstable job, checkpoint
        N-->>U: evicted and requeued
    end
```
Bottlenecks & scaling
| Bottleneck | Why it hurts | Mitigation |
|---|---|---|
| Utilization vs. isolation | Hard partitions strand capacity; soft sharing leaks interference | Tiered pools — MIG/whole-GPU for untrusted & latency; MPS/time-slice for trusted batch; pack small jobs into MIG slices |
| Quota fragmentation | Static MIG profiles + gang jobs strand partial slices; an 8-GPU gang can't land on scattered frees | Defrag / bin-pack; re-profile MIG by demand; topology-aware gang scheduling; backfill small jobs into holes |
| Fairness vs. efficiency | DRF + preemption lower packing density; preemption wastes compute | Checkpoint/requeue; preempt lowest-priority burstable first; cooldowns to damp churn |
| Security / blast radius | Shared-memory tenants (MPS/time-slice) can leak or DoS neighbors | Untrusted → whole-GPU/MIG only; scrub memory; per-tenant namespace, network policy, seccomp |
| Metering accuracy | Chargeback disputes; fractional usage hard to attribute | Per-tenant/per-MIG DCGM GPU-seconds; reconcile; reserve-priced floors + usage-priced burst |
| Control-plane scale | 50k GPUs × 30k jobs/day strains one scheduler + quota tree | Shard the scheduler; cache the quota tree; async admission; eventually-consistent usage accounting |
Key decisions recap
One fleet, tiered isolation: whole-GPU/MIG for untrusted & latency-critical, MPS/time-slicing to recover utilization on trusted batch. Hierarchical quotas with a non-preemptible guarantee floor and a preemptible burst ceiling; DRF for cross-resource fairness; idle capacity lent and reclaimed via graceful gang-preemption. MIG partitions memory bandwidth, which is what tames the worst noisy neighbor; DCGM per-tenant telemetry drives both interference response and chargeback. The through-line: spend isolation only where trust or SLOs demand it, and pack everything else to push the fleet to 60-80%.