CUDA Kernel Profiling & Auto-Tuning
A modern training or inference stack is, in the end, a pile of GPU kernels (matmuls, attention, layernorm, all-reduce), and a handful of them dominate wall-clock time. The job here is twofold: find the slow kernels (profile them, and prove why they are slow: memory-bound, compute-bound, or stalling on something), and then automatically pick the fastest configuration for each one. The same GEMM compiled with a different tile size, block shape, or unroll factor can run 3-10× faster, and the best choice changes with the input shape and the GPU generation. That is exactly the space Triton, CUTLASS, and TVM/Ansor autotuners explore. This page designs the platform that does both at fleet scale: a low-overhead profiling pipeline, a search-and-cache autotuner, and a CI regression gate.
Requirements
The system has one north-star goal: find and fix slow GPU kernels, and automatically choose the fastest kernel configuration for every shape and device the fleet cares about. "Fix" here is not hand-tuning PTX — it is steering an autotuner and then locking in the winner so the next thousand jobs inherit it for free. Two pressures shape every decision: profiling must be cheap enough to run on real workloads, and search must be cheap enough that we are not burning a GPU-week to shave a microsecond.
Two loops, one platform
The profiling loop answers "which kernel is slow and why?" by observing production and benchmark runs and ranking kernels by their share of the timeline. The autotuning loop answers "what is the fastest way to run this kernel?" by searching the config space offline and writing the winner into a (op, shape, gpu) → config cache. Profiling tells you where to point the autotuner; the autotuner produces artifacts the next run consumes; the regression gate makes sure neither silently rots.
| Functional | Non-functional |
|---|---|
| Capture kernel-level profiles — per-kernel duration, launch config, occupancy, achieved memory bandwidth and FLOPs, warp-stall reasons, and the launch timeline (via CUPTI / Nsight). | Low profiling overhead — instrumentation must not meaningfully perturb the workload it measures; lightweight tracing always-on, heavy metric replay sampled and time-boxed. |
| Identify bottleneck kernels — aggregate across runs and rank kernels by total time; classify each as memory-bound vs compute-bound and surface the top offenders. | Scale across a fleet — collect from thousands of GPUs and many model builds without a central profiler becoming the bottleneck; store and aggregate cheaply. |
| Autotune kernel params/configs — search tile sizes, block dims, unroll, pipeline depth, vector width, etc.; benchmark candidates; keep the fastest per shape+GPU. | Fast autotuning search — prune the combinatorial space with cost models / ML guidance so a tune finishes in minutes, not days; reuse results aggressively. |
| Regression-detect across builds — run a benchmark suite in CI, compare per-kernel timings to baselines, and flag slowdowns from a code, compiler, or driver change. | Reproducible — pin GPU model, clocks, driver, CUDA/compiler, and input shapes; lock clocks and warm up so a measured number means the same thing twice. |
Out of scope (worth saying out loud)
We are not writing the kernels — we are measuring and selecting them. We are also not a general APM/metrics product (that is GPU observability); this system is narrowly about kernel time and kernel configuration. It hands its verdicts (top kernels, regressions) to the broader observability stack rather than re-implementing dashboards from scratch.
Profiling concepts
You cannot tune what you cannot name. A profiler dumps dozens of counters; the skill is mapping them to a single question: is this kernel limited by doing math (compute-bound) or by moving bytes (memory-bound)? Everything else — occupancy, stalls, the timeline — either supports that classification or tells you why you are not yet hitting the relevant ceiling.
The roofline model
The roofline is the mental model that ties it together. Every kernel has an arithmetic intensity, AI = FLOPs / bytes moved, and the hardware has two ceilings: peak compute (FLOP/s) and peak memory bandwidth (bytes/s). Plot attainable performance against AI and you get a roof with two segments: a sloped one where performance is capped at AI × peak bandwidth, and a flat one at peak compute. They meet at the ridge point, AI = peak FLOP/s ÷ peak bandwidth. Left of the ridge point a kernel is bandwidth-limited (raising AI is the only way to lift its ceiling); right of it, it is compute-limited (you need more FLOP/s, e.g. tensor cores). Where your kernel sits versus the roof tells you both how far from optimal it is and which ceiling to push on.
flowchart TD
K["Kernel"] --> AI["Arithmetic intensity = FLOPs / bytes"]
AI --> CMP{"AI vs ridge point?"}
CMP -->|low AI| MEM["Memory-bound: capped by HBM bandwidth"]
CMP -->|high AI| CMPB["Compute-bound: capped by peak FLOPs"]
MEM --> FIX1["Fix: coalesce, reuse, fuse to raise AI"]
CMPB --> FIX2["Fix: occupancy, ILP, tensor cores"]
FIX1 --> ROOF["Plot on roofline, compare to peak"]
FIX2 --> ROOF
Two kernels can have identical runtime and need opposite fixes. A memory-bound softmax wants fusion and better reuse to raise AI; a compute-bound GEMM that falls short of the compute ceiling wants more occupancy and better tensor-core utilization. The roofline stops you from "optimizing" the wrong dimension: adding FLOP/s to a memory-bound kernel buys nothing.
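The classification can be written down in a few lines. This is a minimal sketch: the `Device` numbers are round illustrative values rather than any vendor's spec, the byte counts assume every operand is read or written exactly once (perfect reuse), and all names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Device:
    peak_flops: float      # FLOP/s (assumed, illustrative)
    peak_bandwidth: float  # bytes/s (assumed, illustrative)

    @property
    def ridge_point(self) -> float:
        # Arithmetic intensity at which the two ceilings meet.
        return self.peak_flops / self.peak_bandwidth


def classify(flops: float, bytes_moved: float, device: Device):
    """Return (arithmetic intensity, attainable FLOP/s, bound)."""
    ai = flops / bytes_moved
    # Roofline: the lower of the bandwidth slope and the compute ceiling.
    attainable = min(device.peak_flops, ai * device.peak_bandwidth)
    bound = "memory-bound" if ai < device.ridge_point else "compute-bound"
    return ai, attainable, bound


def gemm_cost(m: int, n: int, k: int, dtype_bytes: int = 2):
    # 2*M*N*K FLOPs; read A and B once, write C once.
    return 2 * m * n * k, dtype_bytes * (m * k + k * n + m * n)


def add_cost(n: int, dtype_bytes: int = 4):
    # One FLOP per element; two reads and one write per element.
    return n, 3 * n * dtype_bytes


DEVICE = Device(peak_flops=300e12, peak_bandwidth=2e12)

if __name__ == "__main__":
    for name, cost in [("gemm", gemm_cost(4096, 4096, 4096)),
                       ("add", add_cost(1 << 20))]:
        ai, attainable, bound = classify(*cost, DEVICE)
        print(f"{name}: AI={ai:.2f} FLOP/byte, {bound}")
```

Under these assumptions the ridge point is 150 FLOP/byte: the 4096-cubed half-precision GEMM (AI ≈ 1365) lands under the compute roof, while the elementwise add (AI ≈ 0.08) is pinned to the bandwidth slope.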
| Metric | What it tells you |
|---|---|
| Occupancy — active warps / max warps per SM | Whether you have enough resident threads to hide memory and pipeline latency. Low occupancy is often a register- or shared-memory-pressure problem; high occupancy is necessary but not sufficient for speed. |
| Arithmetic intensity — FLOPs / bytes | Which roofline ceiling applies. Low AI ⇒ memory-bound (raise reuse/fusion); high AI ⇒ compute-bound (chase FLOP/s and tensor cores). |
| Achieved memory bandwidth — GB/s vs HBM peak | For memory-bound kernels, how close you are to the bandwidth roof. Far from peak usually means uncoalesced or strided access patterns. |
| Achieved FLOP/s — vs peak (and tensor-core peak) | For compute-bound kernels, the gap to the compute roof; a big gap on a GEMM often means tensor cores are idle or the tile shape is wrong. |
| Warp-stall reasons — memory dep, exec dep, sync, etc. | Where warps actually wait. long_scoreboard stalls point at memory latency; barrier stalls point at over-synchronization; this is the "why is occupancy not helping?" detail. |
| Kernel timeline — launch order, gaps, overlap | Whether the GPU is busy at all. Idle gaps mean host-side launch overhead or data starvation; missing compute/copy overlap means streams are not pipelined. |
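To make the occupancy row concrete, here is a rough sketch of a theoretical-occupancy estimate. The per-SM limits (64 resident warps, 65,536 registers, 100 KiB of shared memory, 32 blocks) are illustrative assumptions, not the specs of any particular GPU, and the function name is hypothetical. It ignores allocation granularity, so treat the result as an upper bound rather than what a profiler would report.

```python
import math


def theoretical_occupancy(threads_per_block, regs_per_thread, smem_per_block,
                          max_warps_per_sm=64, regs_per_sm=65536,
                          smem_per_sm=100 * 1024, max_blocks_per_sm=32,
                          warp_size=32):
    """Return (occupancy, limiter) for one launch configuration.

    Each resource caps how many blocks can be resident on an SM; the
    tightest cap decides occupancy and names the thing to fix.
    """
    warps_per_block = math.ceil(threads_per_block / warp_size)
    limits = {
        "warps": max_warps_per_sm // warps_per_block,
        "registers": regs_per_sm // (regs_per_thread * warp_size * warps_per_block),
        "blocks": max_blocks_per_sm,
    }
    if smem_per_block > 0:
        limits["shared_memory"] = smem_per_sm // smem_per_block
    limiter = min(limits, key=limits.get)
    resident_warps = limits[limiter] * warps_per_block
    return resident_warps / max_warps_per_sm, limiter


if __name__ == "__main__":
    print(theoretical_occupancy(256, 32, 0))
    print(theoretical_occupancy(256, 128, 0))
    print(theoretical_occupancy(256, 32, 48 * 1024))
```

With these assumed limits, a 256-thread block that uses 128 registers per thread fits only two blocks per SM, which is 25% occupancy with registers as the limiter.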
Why the timeline matters as much as the counters
Per-kernel counters answer "is this kernel efficient?" but the timeline answers "is the GPU even working?". A fleet can be full of perfectly efficient kernels and still sit at 40% utilization because of launch overhead between thousands of tiny kernels, or because a copy and a compute stream never overlap. Kernel selection (this system) and kernel scheduling (the runtime) are different levers — read the timeline before you blame a kernel.
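Reading the timeline can start with something as simple as merging kernel intervals and measuring what is left over. The sketch below assumes launch records have already been reduced to `(start, end)` pairs in a consistent time unit; the function name is hypothetical, and it treats all streams as one GPU-wide timeline.

```python
def gpu_busy_fraction(intervals):
    """Return (busy fraction, idle gaps) for a list of (start, end) kernel spans.

    Overlapping kernels (e.g. on different streams) are merged so that
    concurrent work is not counted twice.
    """
    if not intervals:
        return 0.0, []
    ordered = sorted(intervals)
    window_start = ordered[0][0]
    window_end = max(end for _, end in ordered)
    span = window_end - window_start
    if span <= 0:
        return 0.0, []

    busy = 0
    gaps = []
    cur_start, cur_end = ordered[0]
    for start, end in ordered[1:]:
        if start > cur_end:
            # The GPU was idle between cur_end and start.
            busy += cur_end - cur_start
            gaps.append((cur_end, start))
            cur_start, cur_end = start, end
        else:
            cur_end = max(cur_end, end)
    busy += cur_end - cur_start
    return busy / span, gaps


if __name__ == "__main__":
    fraction, gaps = gpu_busy_fraction([(0, 10), (5, 15), (30, 40)])
    print(fraction, gaps)
```

A low busy fraction with many short gaps points at launch overhead; a few long gaps point at data starvation or a host-side stall.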
Profiling data pipeline
The naive approach — run Nsight Compute on production and read the report — does not survive contact with a fleet. Full kernel-metric replay can slow a kernel by 10-100× because the profiler re-runs each kernel many times to collect every counter. So the pipeline is layered by cost: cheap activity tracing is always on, expensive metric collection is sampled, and the heavy lifting (aggregation, ranking, regression) happens off the critical path in a store.
flowchart LR
subgraph FLEET["Instrumented runs (many GPUs)"]
APP["App + CUPTI / Nsight hook"]
SAMP["Sampler: 1 of N runs"]
end
APP --> COL["Trace collector"]
SAMP --> COL
COL --> STORE["Profile store + aggregate"]
STORE --> AN["Analysis: top kernels, regressions"]
AN --> DASH["Dashboards + alerts"]
DASH --> ENG["Performance engineer"]
ENG -->|drill in| STORE
- Instrument with CUPTI / Nsight. The CUPTI Activity API is the workhorse: it streams kernel launch records (name, grid/block, duration, stream) with low overhead, no replay. Reserve Nsight Compute-style metric collection (occupancy, stall reasons, bandwidth) for sampled deep dives, because it replays kernels and is expensive. Two tiers, two overhead budgets.
- Collect traces from many runs / GPUs. A node-local collector batches and compresses activity records and ships pointers to a central pipeline. Sampling keeps overhead low: trace 1-in-N runs, or arm deep metric collection only for a few iterations on a few ranks. The same kernel shows up across thousands of runs, so you do not need every run to characterize it.
- Store + aggregate. Raw traces are large and write-once → object storage. Per (kernel, shape, gpu, build) aggregates (p50/p95 duration, call count, total time, classification) go to a queryable store. Aggregation is where "300M launch records" becomes "the 20 kernels that own 80% of the time."
- Surface top kernels and regressions. Analysis ranks kernels by total fleet time (your tuning backlog, in priority order) and diffs each build against a baseline to flag slowdowns. The output is two short lists, "tune these" and "something got slower", not a wall of counters.
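The store-and-rank step above can be sketched as follows. The record fields (`kernel`, `shape`, `gpu`, `build`, `duration_us`) and function names are assumptions for illustration; a real pipeline would run this as a batch or streaming job rather than in memory.

```python
import math
from collections import defaultdict
from statistics import median


def nearest_rank(sorted_values, pct):
    """Nearest-rank percentile of an already sorted list."""
    rank = max(1, math.ceil(pct / 100 * len(sorted_values)))
    return sorted_values[rank - 1]


def aggregate(records):
    """Group launch records by (kernel, shape, gpu, build), ranked by total time."""
    groups = defaultdict(list)
    for r in records:
        key = (r["kernel"], r["shape"], r["gpu"], r["build"])
        groups[key].append(r["duration_us"])
    stats = []
    for key, durations in groups.items():
        durations.sort()
        stats.append({
            "key": key,
            "calls": len(durations),
            "total_us": sum(durations),
            "p50_us": median(durations),
            "p95_us": nearest_rank(durations, 95),
        })
    stats.sort(key=lambda s: s["total_us"], reverse=True)
    return stats


def top_by_share(stats, share=0.8):
    """Shortest prefix of the ranked list that covers `share` of total time."""
    total = sum(s["total_us"] for s in stats)
    picked, running = [], 0
    for s in stats:
        picked.append(s)
        running += s["total_us"]
        if running >= share * total:
            break
    return picked
```

`top_by_share(aggregate(records))` is the tuning backlog in priority order: the few keys that account for most of the fleet's kernel time.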
Overhead is the whole game
The reason this is a pipeline and not "just run the profiler" is that profiling perturbs what it measures. Always-on CUPTI activity tracing costs a few percent; full Nsight metric replay can slow a kernel by 10-100×. The design keeps the always-on tier cheap and treats deep profiling as a sampled, time-boxed, opt-in capture: the same discipline as production tracing, applied to kernels.
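One way to express the two tiers is a small decision function that every rank evaluates on every step. This is a sketch under assumptions: the 1-in-100 rate, the choice of rank 0, and the five-step window are placeholders, and the function names are hypothetical. Hashing the run ID keeps the decision deterministic, so every rank of a run agrees on whether the run is sampled.

```python
import hashlib


def run_is_sampled(run_id: str, one_in_n: int) -> bool:
    """Deterministically select roughly 1 in N runs by hashing the run ID."""
    digest = hashlib.sha256(run_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % one_in_n == 0


def profiling_tier(run_id: str, rank: int, step: int, *,
                   one_in_n: int = 100,
                   deep_ranks=(0,),
                   deep_steps=range(100, 105)) -> str:
    """Pick the profiling tier for this (run, rank, step).

    "trace": always-on, low-overhead activity tracing.
    "deep":  replay-based metric collection, armed only on sampled runs,
             on a few ranks, for a short window of steps.
    """
    if (run_is_sampled(run_id, one_in_n)
            and rank in deep_ranks
            and step in deep_steps):
        return "deep"
    return "trace"
```

Starting the deep window after step 100 is a deliberate choice in this sketch: it skips warm-up iterations so the expensive capture sees steady-state kernels.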
Deep dive: auto-tuning the kernel config space
This is the core of the system. A single logical op, say a GEMM of a given (M, N, K), can be compiled into thousands of distinct kernels that all compute the same result but at wildly different speeds. The knobs are things like tile sizes (BLOCK_M/N/K), block / warp dims, unroll factors, pipeline depth (software pipelining / multi-stage shared-memory prefetch), and vector width; in Triton, the warp count and pipeline depth surface as num_warps and num_stages. The best combination depends on the shape and the GPU, and no closed-form rule picks it, so we search. This is exactly what the Triton @autotune decorator, the CUTLASS profiler, and TVM/Ansor do.
The compile-and-benchmark loop
The atomic operation is: take a candidate config, compile the kernel variant, benchmark it on the target GPU (warm up, lock clocks, time many iterations, take a robust statistic), and keep the fastest. Compilation is not free and benchmarking burns real GPU time, so the loop is wrapped in caching and smart search to run it as few times as possible.
sequenceDiagram
participant T as Autotuner
participant Ca as Config cache
participant K as Compiler
participant G as Benchmark GPU
T->>Ca: Lookup best config for shape + GPU
Ca-->>T: Cache miss
loop Each candidate config
T->>K: Compile kernel variant
K-->>T: Binary
T->>G: Benchmark, timed run
G-->>T: Latency and throughput
end
T->>Ca: Store fastest config
Ca-->>T: Cached for reuse
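The loop in the diagram can be sketched in plain Python. Everything here is illustrative: `build` stands in for a hypothetical compile step that returns a callable, the cache is a plain dict, and `benchmark` times a host callable. GPU kernel launches are asynchronous, so a real harness would synchronize the device or use CUDA events around the timed region.

```python
import time
from statistics import median


def benchmark(fn, warmup=10, iters=50):
    """Median latency in seconds of calling fn(), after warm-up runs."""
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return median(samples)


def autotune(key, candidates, build, cache, bench=benchmark):
    """Return the fastest config for `key`, searching only on a cache miss."""
    hit = cache.get(key)
    if hit is not None:
        return hit["config"]            # warm start: zero search cost
    if not candidates:
        raise ValueError("no candidate configs to benchmark")

    best = None
    for config in candidates:
        kernel = build(config)          # hypothetical compile step
        latency = bench(kernel)         # measured on the target device
        if best is None or latency < best["latency_s"]:
            best = {"config": config, "latency_s": latency}
    cache[key] = best                   # store the winner for reuse
    return best["config"]
```

Passing `bench` as a parameter keeps the search logic testable without a GPU and lets the same loop run with a device-aware timer in production.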
Search strategies
The config space is combinatorial — easily 10^3 to
10^6 points once you cross all the knobs — and most
points are invalid (exceed shared memory or register budgets) or slow.
The strategy is how you avoid benchmarking all of them:
| Strategy | How it searches | When to use it |
|---|---|---|
| Grid / exhaustive | Enumerate a hand-curated candidate list and benchmark every one. | Small, well-understood spaces (Triton's default: a fixed list of configs). Simple and reproducible; explodes if the list grows. |
| Random | Sample N configs uniformly from the valid space. | Surprisingly strong baseline for big spaces; great for seeding a cost model. Cheap to parallelize across GPUs. |
| Evolutionary / genetic | Mutate and recombine the best configs found so far, generation by generation. | Large rugged spaces where good configs cluster (Ansor uses this). Finds strong points without a full model. |
| ML-guided / cost-model | A learned cost model predicts runtime from config features; only the most promising are actually compiled and benchmarked. | When real benchmarks are expensive and you tune the same op family often. Cuts measured trials by orders of magnitude (TVM/Ansor, learned cost models). |
In practice these compose: a cost model proposes, an evolutionary search explores around its predictions, and only the top handful are measured on hardware — because the measured benchmark is the ground truth and also the expensive part.
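A compact sketch of that composition follows: enumerate the space, prune configs that cannot fit, let a cost model rank a random sample, and measure only the top few. The knob values, the 96 KiB shared-memory budget, and the footprint formula (one A tile and one B tile per pipeline stage) are simplifying assumptions, and `cost_model` and `measure` are caller-supplied stand-ins for a learned model and a hardware benchmark. The evolutionary step is omitted for brevity.

```python
import itertools
import random

# Hypothetical GEMM config space: 4 * 4 * 3 * 4 = 192 points.
SPACE = {
    "BLOCK_M": [32, 64, 128, 256],
    "BLOCK_N": [32, 64, 128, 256],
    "BLOCK_K": [16, 32, 64],
    "num_stages": [2, 3, 4, 5],
}


def enumerate_space(space):
    names = list(space)
    for values in itertools.product(*(space[n] for n in names)):
        yield dict(zip(names, values))


def smem_bytes(cfg, dtype_bytes=2):
    # Assumed footprint: each stage buffers one A tile and one B tile.
    tile_elems = (cfg["BLOCK_M"] + cfg["BLOCK_N"]) * cfg["BLOCK_K"]
    return cfg["num_stages"] * tile_elems * dtype_bytes


def is_valid(cfg, smem_limit=96 * 1024):
    # Prune before compiling: a config over budget can never launch.
    return smem_bytes(cfg) <= smem_limit


def select_finalists(space, cost_model, measure, sample_size=64, top_k=4, seed=0):
    """Return (best measured config, number of valid configs)."""
    valid = [c for c in enumerate_space(space) if is_valid(c)]
    rng = random.Random(seed)
    sampled = rng.sample(valid, min(sample_size, len(valid)))
    # The cost model is cheap, so it scores every sampled config ...
    finalists = sorted(sampled, key=cost_model)[:top_k]
    # ... and only the finalists pay for a real measurement.
    return min(finalists, key=measure), len(valid)
```

The point of the structure is the ratio: hundreds of configs are scored by the model, but only `top_k` are ever compiled and timed.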
Caching best configs per shape + GPU
Tuning is amortized by
never tuning the same thing twice. The output of a
tune is a row in a config cache keyed by
(op, dtype, shape-bucket, gpu_arch, toolkit_version)
→ winning config + measured time. A few things make the cache
effective:
- Shape bucketing. Tuning every exact shape is hopeless, so shapes are bucketed (powers of two, padded to tile multiples). A new shape that hashes to a known bucket is an instant cache hit.
- Key on the GPU and toolkit. A config tuned for an A100 is often far from the fastest on an H100, and a compiler upgrade can shift the winner, so arch and version are part of the key, never assumed.
- Persist and share. The cache is a fleet-wide artifact (object store + small index), so one tune on one GPU benefits every subsequent job. Cold start tunes; warm start reads.
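The three points above fit in a small key function plus a lookup table. This sketch assumes power-of-two bucketing on every dimension and an in-memory dict in place of the fleet-wide store; the class, function names, and the example arch and version strings are all hypothetical.

```python
def next_pow2(n: int) -> int:
    """Smallest power of two that is >= n (1 for n <= 1)."""
    return 1 if n <= 1 else 1 << (n - 1).bit_length()


def cache_key(op, dtype, shape, gpu_arch, toolkit_version):
    """Bucket the shape so that nearby shapes share one tuned config."""
    bucket = tuple(next_pow2(d) for d in shape)
    return (op, dtype, bucket, gpu_arch, toolkit_version)


class ConfigCache:
    """In-memory stand-in for the shared (object store + index) cache."""

    def __init__(self):
        self._rows = {}

    def put(self, key, config, latency_us):
        self._rows[key] = {"config": config, "latency_us": latency_us}

    def get(self, key):
        # None means a miss: the caller should fall back to a default
        # config and queue this key for offline tuning.
        return self._rows.get(key)


if __name__ == "__main__":
    cache = ConfigCache()
    tuned = cache_key("gemm", "fp16", (1000, 4096, 500), "arch-a", "toolkit-1")
    cache.put(tuned, {"BLOCK_M": 128, "BLOCK_N": 128}, latency_us=41.0)
    lookup = cache_key("gemm", "fp16", (1024, 4096, 512), "arch-a", "toolkit-1")
    print(cache.get(lookup))
```

Coarser buckets raise the hit rate but widen the gap between the tuned shape and the shape actually run, so bucket width is itself something to validate against measured timings.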
The autotuner contract
Offline: for each hot
(op, shape, gpu), search the config space with a
model-guided strategy, benchmark the finalists on real hardware, and
write the winner to the cache. Online: a kernel
launch looks up its key and gets the pre-tuned config with
zero search cost. The profiling pipeline feeds the backlog
(which ops/shapes are hot), and the regression gate guards the
cached winners. That loop —
profile → tune → cache → guard — is the whole
product.
Performance regression detection
Correctness has CI; performance usually does not, which is why kernels silently get slower. A code refactor, a compiler bump, or a new driver can make a GEMM 20% slower and no test goes red. This system closes that gap with a performance CI: a curated benchmark suite, per-kernel timing baselines, and a gate that flags slowdowns across commits, compilers, and driver versions.
| Element | What it does |
|---|---|
| Benchmark suite in CI | A pinned set of representative kernels × shapes runs on dedicated, clock-locked GPUs on every relevant change. It is small enough to run often and representative enough to catch real regressions. |
| Per-kernel timing baselines | Each (kernel, shape, gpu) has a stored baseline distribution (p50/p95 over many runs), not a single number, so noise does not masquerade as a regression. |
| Slowdown flagging | A new result is compared to baseline; a statistically significant slowdown beyond a threshold (e.g. > 5% on p50) fails the gate and points at the offending commit. |
| Cross-version matrix | The same suite runs across CUDA toolkit and driver versions, so a regression introduced by the environment (not the code) is attributed correctly instead of blamed on the diff. |
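A gate of this kind might look like the sketch below. It compares medians against the 5% threshold from the table and adds a crude noise guard (the new median must also exceed the baseline's p95); that guard is an assumption standing in for a proper significance test, and the function names are hypothetical.

```python
import math
from statistics import median


def nearest_rank(values, pct):
    """Nearest-rank percentile."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def check_regression(baseline_us, candidate_us, threshold=0.05):
    """Compare a candidate timing sample against a stored baseline sample.

    A regression is flagged only if the median slowed down by more than
    `threshold` AND the new median sits above the baseline's p95, so that
    ordinary run-to-run noise does not fail the gate.
    """
    base_p50 = median(baseline_us)
    new_p50 = median(candidate_us)
    slowdown = new_p50 / base_p50 - 1.0
    noise_ceiling = nearest_rank(baseline_us, 95)
    return {
        "baseline_p50_us": base_p50,
        "candidate_p50_us": new_p50,
        "slowdown": slowdown,
        "regressed": slowdown > threshold and new_p50 > noise_ceiling,
    }


if __name__ == "__main__":
    baseline = [100, 101, 99, 100, 102, 98, 100, 101, 99, 100]
    print(check_regression(baseline, [110, 111, 109, 110, 112]))
    print(check_regression(baseline, [101, 102, 100, 101, 103]))
```

In CI this would run once per (kernel, shape, gpu) key, and a flagged result would carry the key and both distributions so the responder can see the evidence.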
Measure like you mean it (or measure noise)
GPU timing is noisy unless you control it: lock clocks (disable auto-boost so frequency does not drift with temperature and load), warm up before timing, run many iterations and take a robust statistic (median, not mean), pin the GPU model, and isolate the benchmark host. Skip this and your "regression detector" mostly detects thermal throttling. Reproducibility is a requirement precisely so that a flagged slowdown is real signal.
Cross-reference with observability. Regression detection and the profiling pipeline are two views of the same data. CI catches a slowdown before merge in a controlled harness; fleet observability catches the slowdowns CI missed — a shape CI never benchmarked, a driver rolled out to production, a kernel hot only under real traffic. A regression flagged in CI deep-links to the same per-kernel profile drill-down the observability stack uses, so the responder starts at the evidence, not a red checkmark.
Bottlenecks & scaling
Every part of this system has a failure mode that only shows up at fleet scale, and they all trace back to the same tension: real measurement is expensive (it perturbs workloads and burns GPU time), so the design is a series of moves to measure less while learning more — sample, cache, model, and bucket.
| Bottleneck | Why it happens | Mitigation |
|---|---|---|
| Profiling overhead | Full Nsight metric replay re-runs each kernel many times, slowing it 10-100× and distorting timing — unusable always-on. | Tier it: always-on CUPTI activity tracing (a few %); sampled, time-boxed metric collection on a few ranks/iterations; never replay in the hot path. |
| Search space size | Tile × block × unroll × stages × vector-width is 10^3-10^6 configs; exhaustive benchmarking is a GPU-week per op. | Model-guided search (cost model proposes, evolutionary explores, only finalists measured); prune invalid configs (resource limits) before compiling. |
| Benchmark GPU cost | Tuning competes with production for the very GPUs it is trying to speed up; each measured config costs compile + warm-up + timed runs. | Tune offline on a small dedicated pool; batch candidates; reuse compiled binaries; cap measured trials per tune via the cost model. |
| Config cache hit rate | Too-specific keys (exact shapes) → constant cache misses → constant re-tuning, defeating the point. | Bucket shapes (pad to tile multiples, power-of-two buckets); key on arch + toolkit, not exact device; persist and share the cache fleet-wide. |
| Hardware variance | Clock drift, thermals, silicon variation, and a heterogeneous fleet (A100 vs H100) make a tuned config non-portable and timings noisy. | Lock clocks; warm up; median over many iterations; tune per arch; treat each GPU generation as a distinct cache key. |
| Stale winners | A compiler or driver upgrade can change which config is fastest, silently degrading cached choices. | Version the cache key on toolkit/driver; re-validate hot keys in the regression suite; invalidate and re-tune on environment bumps. |
Summary
A CUDA profiling & auto-tuning system turns "the GPUs feel slow"
into a precise, repeatable loop:
profile → tune → cache → guard. Profiling — layered
CUPTI activity tracing plus sampled Nsight metrics, read through the
roofline lens — ranks kernels by fleet time and
tells you whether each is
memory-bound or compute-bound. The autotuner takes
the hot ones and searches the config space (tiles,
blocks, unroll, pipelining) with model-guided strategies borrowed
from Triton, CUTLASS, and TVM/Ansor, then writes
the winner into a (op, shape, gpu, toolkit)
config cache so the next thousand jobs pay zero
search cost. A performance CI with clock-locked,
reproducible benchmarks guards those winners against code, compiler,
and driver regressions. The unifying discipline is that real
measurement is expensive — so
sample, cache, model, and bucket to measure as
little as possible while still always knowing the fastest way to run
every kernel that matters.