AI / ML Infrastructure

Multi-Tenant GPU Cluster

Hundreds of teams share one fleet of expensive, scarce GPUs to run everything from 512-GPU training runs to fractional inference pods. The whole problem is a single tension: strong isolation and fair quotas (so tenants can't see, starve, or interfere with each other) versus high utilization (so a billion-dollar fleet isn't sitting at 30% utilization). The answer is a layered system: hardware/software isolation levels (whole-GPU, MIG, MPS, time-slicing), hierarchical quotas with DRF fair-share and preemptive borrowing, noisy-neighbor control, and per-tenant metering that feeds chargeback.

Requirements

Functional

Many teams submit both training (multi-GPU, gang-scheduled, hours-to-days, batch) and inference (fractional GPU, latency-sensitive, long-running services) through one common API/CLI.
Per-team quotas: each team has a guaranteed GPU allotment plus an optional burst limit; admins manage hierarchical quotas (org → team → project → user).
Chargeback / showback: meter actual usage (GPU-seconds × type) and attribute cost back to the owning team.
Self-service: a team picks GPU type and topology (e.g. 8×H100 on one NVLink domain), picks an isolation level, and sees its own utilization.
Idle capacity is borrowable by other teams and reclaimable (preemption) when the owner returns.

Non-functional

Isolation & security: hard boundaries between tenants for memory, faults, and performance — tenants may be mutually untrusted.
Fairness: no team starves; share fairly across multiple resource types (GPU, CPU, RAM) via DRF.
High utilization: drive the fleet from a typical 30-40% to 60-80% allocated/busy.
No noisy neighbors: bound cross-tenant interference on memory bandwidth, PCIe, NVLink, and the network fabric so inference SLOs hold.
Elastic, highly available control plane; deep observability; accurate, dispute-proof metering.

Scale & back-of-the-envelope

A fleet this size is a capital question before it is an engineering one: ~50k H100-class GPUs is on the order of $1.5-2B of hardware. At an internal price of ~$2-4 / GPU-hour, every 1% of utilization you win (or waste) is 500 GPUs × 8,760 hours, or roughly $9-18M a year, which is exactly why we will trade some isolation purity for packing density.
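The arithmetic behind the per-point figure can be checked in a few lines. This is a rough sketch that uses only the numbers quoted above (50k GPUs, $2-4 per GPU-hour); the function name is illustrative and not part of any real tool.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours

def value_of_utilization_point(fleet_gpus: int, price_per_gpu_hour: float) -> float:
    """Annual dollar value of one percentage point of fleet utilization."""
    gpus_per_point = fleet_gpus / 100  # 1% of the fleet
    return gpus_per_point * HOURS_PER_YEAR * price_per_gpu_hour

low = value_of_utilization_point(50_000, 2.0)
high = value_of_utilization_point(50_000, 4.0)
print(f"1% of utilization is worth ${low / 1e6:.1f}M to ${high / 1e6:.1f}M per year")
```

Moving the fleet from the 30-40% baseline to the 60-80% target is dozens of such points, which is where the business case comes from.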

Dimension	Estimate	Notes
Fleet	50,000 GPUs (H100 / A100 / L40S mix)	spread across DCs & availability zones
Tenants	300+ teams, 5,000+ users	nested orgs → teams → projects
Jobs / day	~30k (training runs + inference deploys)	bursty submission, long-tailed durations
Training	8-512 GPUs each, hours-days	gang-scheduled, topology-aware, NVLink/IB
Inference	fractional (`1g.10gb` MIG / MPS), 24×7	latency-sensitive, tight p99 SLOs
Guaranteed quota	~70% of fleet (35k) reserved to teams	the non-preemptible floor
Burst pool	~30% (15k) borrowable, preemptible	reclaimed on demand by owners
Utilization target	60-80% (from 30-40% baseline)	via bin-packing + fractional + preemption
Oversubscription	2-4× logical:physical for dev/inference	time-sliced, best-effort only

Where the idle goes

Baseline GPU clusters waste capacity in three ways: reserved but idle (a team holds a guarantee it isn't using), allocated but under-driven (a notebook owns a whole GPU at 5% SM occupancy), and fragmentation (free GPUs are scattered so no 8-GPU gang fits). The design attacks all three: borrowing reclaims reserved-idle, fractional sharing fixes under-driven, and topology-aware bin-packing fixes fragmentation.
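Fragmentation is the least intuitive of the three, so here is a minimal illustration. It assumes an 8-GPU gang must land on a single node (one NVLink domain); the node names and free counts are made up for the example.

```python
def can_place_gang(free_gpus_per_node: dict[str, int], gang_size: int) -> bool:
    """True if at least one node has enough free GPUs to host the whole gang."""
    return any(free >= gang_size for free in free_gpus_per_node.values())

# Hypothetical snapshot: plenty of free GPUs in total, but scattered across nodes.
free = {"node-a": 3, "node-b": 2, "node-c": 4, "node-d": 3}
total_free = sum(free.values())

print(total_free)               # 12 free GPUs fleet-wide
print(can_place_gang(free, 8))  # False: no single node can host the gang
```

Twelve GPUs are free, yet the 8-GPU job cannot start. Bin-packing small jobs onto already-busy nodes is what keeps whole nodes empty for gangs.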

Deep dive · GPU sharing & isolation levels

This is the core decision. A GPU can be handed to tenants at four increasingly granular levels, trading isolation strength for utilization. Picking the right level per workload class and per trust boundary is what makes multi-tenancy safe and dense.

Level	Mechanism	Isolation	Utilization	Use-case
Whole-GPU	device plugin assigns one full GPU to one container	Strongest — dedicated hardware	Low for small jobs	large training; untrusted tenants
MIG	hardware partition (A100/H100) into up to 7 instances	Strong — dedicated SMs, L2 slice, memory & mem-bandwidth; fault containment	Medium-high	inference, notebooks, mixed tenants
MPS	spatial SM sharing across processes (time-slice avoidance)	Weak — shared memory space, no fault isolation	High	trusted, cooperative co-location
Time-slicing	plugin advertises N logical replicas; temporal sharing	None — shared everything, oversubscribed	Highest	dev / notebooks, bursty inference

MIG (Multi-Instance GPU) is the workhorse for multi-tenancy because it is the only hardware partition: an H100/A100 is sliced into instances such as 1g.10gb, 2g.20gb, 3g.40gb, 7g.80gb, each getting its own SM compute slices, its own L2-cache slice and memory controllers, and its own slice of HBM bandwidth. A fault or ECC error in one slice is contained and does not crash the others.

flowchart TD
    GPU["H100 80GB physical GPU"] --> MIG["MIG mode enabled"]
    MIG --> S1["Slice 3g.40gb · Team A training"]
    MIG --> S2["Slice 2g.20gb · Team B inference"]
    MIG --> S3["Slice 1g.10gb · Team C notebook"]
    MIG --> S4["Slice 1g.10gb · Team D notebook"]
    S1 --> HW1["Dedicated SMs + L2 slice + mem BW"]
    S2 --> HW2["Dedicated SMs + L2 slice + mem BW"]
    S3 --> HW3["Dedicated SMs + L2 slice + mem BW"]
    S4 --> HW4["Dedicated SMs + L2 slice + mem BW"]
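A quick way to sanity-check a layout like the one in the diagram is to count slices. The sketch below assumes an 80 GB part exposes 7 compute slices and 8 memory slices, with each profile consuming the amounts listed in the table; it deliberately ignores the hardware's placement rules, so passing this check is necessary but not sufficient for a valid configuration.

```python
# (compute slices, memory slices) consumed per profile on an 80 GB GPU.
PROFILES = {
    "1g.10gb": (1, 1),
    "2g.20gb": (2, 2),
    "3g.40gb": (3, 4),
    "7g.80gb": (7, 8),
}
MAX_COMPUTE_SLICES = 7
MAX_MEMORY_SLICES = 8

def fits_on_one_gpu(requested: list[str]) -> bool:
    """Slice-count check only; real MIG also constrains where instances sit."""
    compute = sum(PROFILES[p][0] for p in requested)
    memory = sum(PROFILES[p][1] for p in requested)
    return compute <= MAX_COMPUTE_SLICES and memory <= MAX_MEMORY_SLICES

diagram_layout = ["3g.40gb", "2g.20gb", "1g.10gb", "1g.10gb"]
print(fits_on_one_gpu(diagram_layout))                     # True: 7 compute, 8 memory
print(fits_on_one_gpu(["3g.40gb", "3g.40gb", "1g.10gb"]))  # False: memory runs out first
```

The second case shows why static profiles strand capacity: two 3g.40gb instances consume all the memory and leave one compute slice unusable.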

Security boundaries

Trust dictates the floor. For untrusted or external tenants, only whole-GPU or MIG are acceptable: both give hardware-enforced memory isolation, and MIG adds error containment. MPS and time-slicing put several tenants on one unpartitioned GPU, so a buggy or malicious process can exhaust device memory for its neighbors, and under MPS a fatal fault in one client can take down the shared MPS server. The blast radius is every co-located tenant, so both are only for mutually-trusting, cooperative workloads.

Memory isolation

Residual data must never leak across tenants: the driver zeroes memory on context teardown, but the platform should verify/scrub between allocations and treat shared-SM timing side-channels as a risk for untrusted co-tenancy. At the container layer the NVIDIA Container Toolkit exposes only the assigned device or MIG UUID; cgroups cap CPU/RAM, and seccomp/AppArmor + network policy fence the rest. Crucially, MIG partitions memory bandwidth too — which is what makes it immune to the dominant noisy-neighbor effect that MPS/time-slicing suffer (next section).

Rule of thumb

Untrusted or latency-critical → whole-GPU or MIG. Trusted batch that under-drives a GPU → MPS to recover utilization. Dev notebooks and spiky best-effort inference → time-slicing with oversubscription. One cluster runs all three tiers, chosen per pool.
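The rule of thumb can be written down as a small decision function. This is one possible encoding, not a prescribed policy: the `Workload` fields and the `kind` labels are assumptions introduced for the example.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Workload:
    trusted: bool
    latency_critical: bool
    needs_full_gpu: bool
    kind: str  # "training", "batch", "inference", or "notebook" (assumed labels)

def isolation_level(w: Workload) -> str:
    """Pick the weakest isolation level that trust and SLOs allow."""
    if w.needs_full_gpu:
        return "whole-gpu"       # nothing to share
    if not w.trusted or w.latency_critical:
        return "mig"             # hardware partition for untrusted or SLO-bound work
    if w.kind == "batch":
        return "mps"             # trusted batch that under-drives a GPU
    return "time-slicing"        # dev notebooks, spiky best-effort inference

print(isolation_level(Workload(False, False, False, "inference")))  # mig
print(isolation_level(Workload(True, False, False, "notebook")))    # time-slicing
```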

Deep dive · quotas & fair-share

Quotas decide who may use what; fair-share decides who runs next when demand exceeds supply. The goal is to give teams predictable floors while still lending out every idle GPU.

Hierarchical quotas

Capacity is a tree: cluster → org → team → project → user. Each node carries a guarantee (min reserved, never preempted) and a max (hard ceiling). This is what Kubernetes ResourceQuota + hierarchical schedulers (Kueue cohorts, Volcano/YuniKorn queues) or Slurm associations/QOS encode.

flowchart TD
    ROOT["Cluster · 50k GPUs"] --> ORG1["Org Research · guarantee 20k"]
    ROOT --> ORG2["Org Product · guarantee 25k"]
    ROOT --> POOL["Shared burst pool · 5k"]
    ORG1 --> T1["Team NLP · min 8k / max 15k"]
    ORG1 --> T2["Team Vision · min 12k / max 18k"]
    ORG2 --> T3["Team Ads · min 15k / max 22k"]
    ORG2 --> T4["Team Search · min 10k / max 16k"]
    POOL -. borrow idle .-> T1
    POOL -. borrow idle .-> T3
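One invariant worth enforcing on such a tree is that children never promise more guaranteed capacity than their parent holds. The sketch below checks that using the numbers from the diagram; `QuotaNode` and `overcommitted` are illustrative names, not the API of Kueue, Volcano, YuniKorn, or Slurm.

```python
from dataclasses import dataclass, field

@dataclass
class QuotaNode:
    name: str
    guarantee: int
    children: list["QuotaNode"] = field(default_factory=list)

def overcommitted(node: QuotaNode) -> list[str]:
    """Names of nodes whose children's guarantees sum to more than their own."""
    bad = []
    if sum(c.guarantee for c in node.children) > node.guarantee:
        bad.append(node.name)
    for child in node.children:
        bad.extend(overcommitted(child))
    return bad

nlp = QuotaNode("Team NLP", 8_000)
research = QuotaNode("Org Research", 20_000, [nlp, QuotaNode("Team Vision", 12_000)])
product = QuotaNode("Org Product", 25_000,
                    [QuotaNode("Team Ads", 15_000), QuotaNode("Team Search", 10_000)])
root = QuotaNode("Cluster", 50_000, [research, product, QuotaNode("Burst pool", 5_000)])

print(overcommitted(root))   # [] : every level adds up
nlp.guarantee = 9_000        # an admin bumps one team without touching the org
print(overcommitted(root))   # ['Org Research']
```

The per-team max values are intentionally left out of the check: ceilings may sum to more than the parent because burst capacity is shared and preemptible.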

Guaranteed vs. burstable

Every job is admitted as guaranteed (within the team's floor → reserved, non-preemptible, predictable) or burstable (above the floor → borrows idle capacity up to max, but is preemptible the moment the lender reclaims). This mirrors Borg/Kubernetes QoS classes and is what lets the fleet run hot without breaking anyone's SLA.
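As a sketch of that admission decision, the function below classifies a request against a team's floor and ceiling. It assumes jobs are all-or-nothing (a job that crosses the floor is burstable as a whole) and that the caller supplies the currently idle, borrowable GPU count; the "queued" and "rejected" outcomes are assumptions added to make the function total.

```python
def classify(used: int, request: int, floor: int, ceiling: int,
             idle_borrowable: int) -> str:
    """Classify a GPU request as guaranteed, burstable, queued, or rejected."""
    after = used + request
    if after <= floor:
        return "guaranteed"            # fits inside the team's reserved floor
    if after > ceiling:
        return "rejected"              # would break the hard max
    borrowed = after - max(used, floor)  # GPUs that must come from idle capacity
    return "burstable" if borrowed <= idle_borrowable else "queued"

# Team NLP from the quota tree: min 8k, max 15k.
print(classify(used=7_990, request=8, floor=8_000, ceiling=15_000, idle_borrowable=100))
print(classify(used=7_996, request=8, floor=8_000, ceiling=15_000, idle_borrowable=100))
```

The first call stays under the floor and is admitted as guaranteed; the second needs 4 borrowed GPUs and is admitted as burstable.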

DRF fair-share

With multiple resource types, fairness can't be a single number. Dominant Resource Fairness (DRF) computes each tenant's dominant share (the largest fraction of any one resource it holds — GPU, CPU, or RAM) and equalizes that across tenants. It stops a CPU-heavy data team from crowding out a GPU-heavy training team and vice-versa.
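A small worked example helps here. The sketch below implements DRF by progressive filling: repeatedly give one more task to the tenant with the lowest dominant share until nothing else fits. The capacities and per-task demands are toy numbers chosen for readability, and dropping a tenant once its next task no longer fits is a simplification of what a real scheduler would do.

```python
def drf_allocate(capacity: dict[str, int],
                 demands: dict[str, dict[str, int]]) -> dict[str, int]:
    """Return tasks granted per tenant under Dominant Resource Fairness."""
    used = {r: 0 for r in capacity}
    tasks = {t: 0 for t in demands}
    active = set(demands)

    def dominant_share(tenant: str) -> float:
        # Largest fraction of any single resource this tenant currently holds.
        return max(tasks[tenant] * demands[tenant][r] / capacity[r] for r in capacity)

    while active:
        tenant = min(sorted(active), key=dominant_share)  # sorted() makes ties stable
        need = demands[tenant]
        if all(used[r] + need[r] <= capacity[r] for r in capacity):
            for r in capacity:
                used[r] += need[r]
            tasks[tenant] += 1
        else:
            active.remove(tenant)  # its next task no longer fits
    return tasks

capacity = {"gpu": 9, "cpu": 18}
demands = {
    "data": {"gpu": 1, "cpu": 4},      # CPU-heavy: dominant resource is CPU
    "training": {"gpu": 3, "cpu": 1},  # GPU-heavy: dominant resource is GPU
}
result = drf_allocate(capacity, demands)
print(result)  # {'data': 3, 'training': 2}
```

Both tenants end at a dominant share of 2/3 (12 of 18 CPUs for the data team, 6 of 9 GPUs for training), even though they hold very different resource mixes.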

Borrowing, lending & preemption

Under-used guarantees are lent into the burst pool; cohorts share reclaimable capacity. When an owner returns, the scheduler preempts the lowest-priority burstable jobs — preferring graceful preemption (signal → checkpoint → requeue) over kill, and gang-preempting whole distributed jobs so a half-killed training run doesn't waste the survivors.
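Victim selection for a reclaim can be sketched as follows. It assumes lower `priority` numbers are preempted first and that each `Job` record represents an entire gang, so evicting a job always frees all of its GPUs; both conventions are assumptions for the example.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Job:
    name: str
    gpus: int
    priority: int    # lower number = preempted first (assumed convention)
    burstable: bool

def pick_victims(jobs: list[Job], gpus_to_reclaim: int) -> list[Job]:
    """Choose whole burstable gangs, lowest priority first, until enough is freed."""
    victims: list[Job] = []
    freed = 0
    candidates = sorted((j for j in jobs if j.burstable), key=lambda j: j.priority)
    for job in candidates:
        if freed >= gpus_to_reclaim:
            break
        victims.append(job)   # the whole gang goes, never a subset of its workers
        freed += job.gpus
    return victims

jobs = [
    Job("owner-train", 64, 0, burstable=False),  # guaranteed: never a candidate
    Job("sweep", 16, 1, burstable=True),
    Job("eval", 8, 5, burstable=True),
    Job("finetune", 32, 3, burstable=True),
]
print([j.name for j in pick_victims(jobs, 40)])  # ['sweep', 'finetune']
```

In practice each victim would then receive a signal, checkpoint, and be requeued rather than killed outright.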

Aspect	Static reservation	Fair-share + borrowing (chosen)
Utilization	Low — idle guarantees stranded	High — idle lent out
Predictability	High — capacity always there	Lower — borrowed capacity is preemptible
Fairness	Coarse, by admin fiat	Dynamic DRF across resources
Complexity	Simple	High — preemption, checkpoint, accounting
Best for	Hard SLAs, untrusted isolation	Maximizing fleet ROI

Chargeback / showback

Metering closes the loop: bill guaranteed capacity by reservation (you pay for the floor you hold) and burst usage by consumption (GPU-seconds × type, with a premium for NVLink/topology). Start with showback (report only) to build trust, then move to chargeback (real budget) — which is what finally makes teams release idle reservations voluntarily.
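The two-part pricing can be sketched as a simple bill calculation. The per-type rates and the topology premium below are hypothetical placeholders (picked inside the $2-4 / GPU-hour range quoted earlier), and the burst record layout is an assumption.

```python
RATE_PER_GPU_HOUR = {"H100": 4.0, "A100": 2.5, "L40S": 2.0}  # hypothetical prices
TOPOLOGY_PREMIUM = 1.25  # hypothetical multiplier for NVLink-domain placement

def monthly_bill(reserved_gpus: int, reserved_type: str, hours_in_month: int,
                 burst_records: list[tuple[float, str, bool]]) -> dict[str, float]:
    """Reservation is billed for the floor held; burst is billed per GPU-second used.

    Each burst record is (gpu_seconds, gpu_type, used_nvlink_topology).
    """
    reservation = reserved_gpus * hours_in_month * RATE_PER_GPU_HOUR[reserved_type]
    burst = 0.0
    for gpu_seconds, gpu_type, nvlink in burst_records:
        cost = gpu_seconds / 3600 * RATE_PER_GPU_HOUR[gpu_type]
        if nvlink:
            cost *= TOPOLOGY_PREMIUM
        burst += cost
    return {"reservation": reservation, "burst": burst, "total": reservation + burst}

bill = monthly_bill(8, "H100", 720, [(36_000, "A100", False), (7_200, "H100", True)])
print(bill)
```

Note how the reservation line dominates: a team pays for the floor whether or not it runs anything, which is the incentive to hand back unused guarantees.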

Deep dive · noisy-neighbor control

Even with quotas correct, co-located tenants can still degrade each other through shared physical paths the scheduler doesn't see. The contention points, roughly in order of pain:

HBM memory bandwidth — the worst offender for MPS/time-sliced co-tenancy: two kernels can each show high "GPU util %" while real throughput collapses because they're fighting for the same memory controllers. MIG fixes this by partitioning bandwidth; MPS/time-slicing do not.
PCIe — host↔device copies share lanes; a data-loading-heavy job starves a neighbor's transfers.
NVLink / NVSwitch — collective all-reduce in distributed training saturates the intra-node fabric; two big jobs on the same NVLink domain interfere.
Network fabric — InfiniBand / RoCE RDMA for multi-node training; cross-tenant flows contend for switch bandwidth.

The fundamental isolation vs. utilization trade-off reappears here: hard partitions (MIG, whole-GPU, dedicated NVLink domains) eliminate interference but strand capacity; soft sharing lifts utilization but reintroduces variance — deadly for latency-sensitive inference.

Per-tenant monitoring

Observability is the control loop. DCGM (Data Center GPU Manager) exports per-GPU and per-MIG-instance metrics (SM occupancy, memory utilization, memory bandwidth, NVLink/PCIe throughput, power, ECC errors), and the platform tags each series with the tenant that owns the device or slice. It then watches for inference p99 regressions that correlate with a co-tenant's arrival and reacts: migrate the victim, harden it onto a MIG slice or whole GPU, or use topology-aware scheduling to stop placing different tenants in the same NVLink domain in the first place.
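The "p99 regression that correlates with a co-tenant's arrival" check can be approximated by comparing tail latency before and after the arrival timestamp. This is a deliberately naive sketch: the 1.2× threshold, the sample format, and the function names are assumptions, and a production detector would also need windowing and noise handling.

```python
def p99(samples: list[float]) -> float:
    """Nearest-rank 99th percentile."""
    ordered = sorted(samples)
    rank = (99 * len(ordered) + 99) // 100  # integer ceil(0.99 * n)
    return ordered[rank - 1]

def regressed_after_arrival(latencies: list[tuple[int, float]], arrival_ts: int,
                            ratio: float = 1.2) -> bool:
    """True if p99 latency after a co-tenant arrived exceeds ratio × p99 before."""
    before = [ms for ts, ms in latencies if ts < arrival_ts]
    after = [ms for ts, ms in latencies if ts >= arrival_ts]
    if not before or not after:
        return False  # not enough data on one side to compare
    return p99(after) > ratio * p99(before)

# Synthetic (timestamp, latency_ms) samples: steady 10 ms, then 5% of requests hit 30 ms.
quiet = [(t, 10.0) for t in range(100)]
noisy = [(100 + t, 30.0 if t % 20 == 0 else 10.0) for t in range(100)]
print(regressed_after_arrival(quiet + noisy, arrival_ts=100))  # True
```

The median barely moves in this example, which is why the check is on the tail rather than the mean.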

High-level design

Jobs flow through a submission API into an admission + quota controller (validates hierarchical min/max, runs DRF, decides guaranteed vs. burstable) and then a gang + topology-aware scheduler that binds work onto partitioned GPU nodes (MIG / whole-GPU / MPS pools). DCGM telemetry streams into a metering service that produces GPU-seconds for billing — and feeds utilization back to the scheduler to inform preemption.

flowchart LR
    subgraph Tenants
      T1["Team A jobs"]
      T2["Team B jobs"]
    end
    T1 --> API["Submission API / CLI"]
    T2 --> API
    API --> ADM["Admission + Quota Controller"]
    ADM -->|"check min/max, DRF"| Q[("Quota + usage store")]
    ADM --> SCHED["Gang + topology scheduler"]
    SCHED -->|"bind to MIG slice"| N1["GPU node 1 · MIG"]
    SCHED --> N2["GPU node 2 · whole GPU"]
    SCHED --> N3["GPU node 3 · MPS"]
    N1 --> DCGM["DCGM telemetry"]
    N2 --> DCGM
    N3 --> DCGM
    DCGM --> MET["Usage metering · GPU-seconds"]
    MET --> BILL["Billing / chargeback"]
    MET -. feedback .-> SCHED

The interesting path is admission of a burstable job that borrows idle capacity, and what happens when the lender reclaims it:

sequenceDiagram
    participant U as Team (burstable job)
    participant A as Admission Controller
    participant S as Scheduler
    participant P as Preemptor
    participant N as GPU Node
    U->>A: submit job, needs 8 GPUs
    A->>A: check team min + borrowable idle
    opt within quota or idle capacity
        A->>S: admit and enqueue
        S->>N: gang-schedule 8 GPUs
        N-->>U: running
    end
    opt owner reclaims lent GPUs
        P->>N: preempt burstable job, checkpoint
        N-->>U: evicted and requeued
    end

Bottlenecks & scaling

Bottleneck	Why it hurts	Mitigation
Utilization vs. isolation	Hard partitions strand capacity; soft sharing leaks interference	Tiered pools — MIG/whole-GPU for untrusted & latency; MPS/time-slice for trusted batch; pack small jobs into MIG slices
Quota fragmentation	Static MIG profiles + gang jobs strand partial slices; an 8-GPU gang can't land on scattered frees	Defrag / bin-pack; re-profile MIG by demand; topology-aware gang scheduling; backfill small jobs into holes
Fairness vs. efficiency	DRF + preemption lower packing density; preemption wastes compute	Checkpoint/requeue; preempt lowest-priority burstable first; cooldowns to damp churn
Security / blast radius	Shared-memory tenants (MPS/time-slice) can leak or DoS neighbors	Untrusted → whole-GPU/MIG only; scrub memory; per-tenant namespace, network policy, seccomp
Metering accuracy	Chargeback disputes; fractional usage hard to attribute	Per-tenant/per-MIG DCGM `GPU-seconds`; reconcile; reserve-priced floors + usage-priced burst
Control-plane scale	50k GPUs × 30k jobs/day strains one scheduler + quota tree	Shard the scheduler; cache the quota tree; async admission; eventually-consistent usage accounting

Key decisions recap

One fleet, tiered isolation: whole-GPU/MIG for untrusted & latency-critical, MPS/time-slicing to recover utilization on trusted batch. Hierarchical quotas with a non-preemptible guarantee floor and a preemptible burst ceiling; DRF for cross-resource fairness; idle capacity lent and reclaimed via graceful gang-preemption. MIG partitions memory bandwidth, which is what tames the worst noisy neighbor; DCGM per-tenant telemetry drives both interference response and chargeback. The through-line: spend isolation only where trust or SLOs demand it, and pack everything else to push the fleet to 60-80%.