CUDA Kernel Profiling & Auto-Tuning

A modern training or inference stack is, in the end, a pile of GPU kernels — matmuls, attention, layernorm, all-reduce — and a handful of them dominate the wall-clock. The job here is twofold: find the slow kernels (profile them, and prove why they are slow — memory-bound, compute-bound, or stalling on something), and then automatically pick the fastest configuration for each one. The same GEMM compiled with a different tile size, block shape, or unroll factor can run 3-10× faster, and the best choice changes with the input shape and the GPU generation. That is exactly the space Triton, CUTLASS, and TVM/Ansor autotuners explore. This page designs the platform that does both at fleet scale: a low-overhead profiling pipeline, a search-and-cache autotuner, and a CI regression gate.

Requirements

The system has one north-star goal: find and fix slow GPU kernels, and automatically choose the fastest kernel configuration for every shape and device the fleet cares about. "Fix" here is not hand-tuning PTX — it is steering an autotuner and then locking in the winner so the next thousand jobs inherit it for free. Two pressures shape every decision: profiling must be cheap enough to run on real workloads, and search must be cheap enough that we are not burning a GPU-week to shave a microsecond.

Two loops, one platform

The profiling loop answers "which kernel is slow and why?" — it observes production and benchmark runs and ranks kernels by their share of the timeline. The autotuning loop answers "what is the fastest way to run this kernel?" — it searches the config space offline and writes the winner into a (op, shape, gpu) → config cache. Profiling tells you where to point the autotuner; the autotuner produces artifacts the next run consumes; the regression gate makes sure neither silently rots.

Functional	Non-functional
Capture kernel-level profiles — per-kernel duration, launch config, occupancy, achieved memory bandwidth and FLOPs, warp-stall reasons, and the launch timeline (via CUPTI / Nsight).	Low profiling overhead — instrumentation must not meaningfully perturb the workload it measures; lightweight tracing always-on, heavy metric replay sampled and time-boxed.
Identify bottleneck kernels — aggregate across runs and rank kernels by total time; classify each as memory-bound vs compute-bound and surface the top offenders.	Scale across a fleet — collect from thousands of GPUs and many model builds without a central profiler becoming the bottleneck; store and aggregate cheaply.
Autotune kernel params/configs — search tile sizes, block dims, unroll, pipeline depth, vector width, etc.; benchmark candidates; keep the fastest per shape+GPU.	Fast autotuning search — prune the combinatorial space with cost models / ML guidance so a tune finishes in minutes, not days; reuse results aggressively.
Regression-detect across builds — run a benchmark suite in CI, compare per-kernel timings to baselines, and flag slowdowns from a code, compiler, or driver change.	Reproducible — pin GPU model, clocks, driver, CUDA/compiler, and input shapes; lock clocks and warm up so a measured number means the same thing twice.

Out of scope (worth saying out loud)

We are not writing the kernels — we are measuring and selecting them. We are also not a general APM/metrics product (that is GPU observability); this system is narrowly about kernel time and kernel configuration. It hands its verdicts (top kernels, regressions) to the broader observability stack rather than re-implementing dashboards from scratch.

Profiling concepts

You cannot tune what you cannot name. A profiler dumps dozens of counters; the skill is mapping them to a single question: is this kernel limited by doing math (compute-bound) or by moving bytes (memory-bound)? Everything else — occupancy, stalls, the timeline — either supports that classification or tells you why you are not yet hitting the relevant ceiling.

The roofline model

The roofline is the mental model that ties it together. Every kernel has an arithmetic intensity, AI = FLOPs / bytes moved, and the hardware has two ceilings: peak compute (FLOP/s) and peak memory bandwidth (bytes/s). Attainable performance is min(peak compute, AI × peak bandwidth), so plotting it against AI gives a roof with a sloped segment (bandwidth) that flattens into a plateau (compute) at the ridge point, AI = peak compute / peak bandwidth. Left of the ridge point a kernel is bandwidth-limited (raising AI is the only way to lift its ceiling); right of it, it is compute-limited (you need more FLOP/s, e.g. tensor cores). Where your kernel sits versus the roof tells you both how far from optimal it is and which ceiling to push on.

flowchart TD
    K["Kernel"] --> AI["Arithmetic intensity = FLOPs / bytes"]
    AI --> CMP{"AI vs ridge point?"}
    CMP -->|low AI| MEM["Memory-bound: capped by HBM bandwidth"]
    CMP -->|high AI| CMPB["Compute-bound: capped by peak FLOPs"]
    MEM --> FIX1["Fix: coalesce, reuse, fuse to raise AI"]
    CMPB --> FIX2["Fix: occupancy, ILP, tensor cores"]
    FIX1 --> ROOF["Plot on roofline, compare to peak"]
    FIX2 --> ROOF

Two kernels can have identical runtime and need opposite fixes. A memory-bound softmax wants fusion and better reuse to raise AI; a compute-bound GEMM sitting well below the compute ceiling wants better tensor-core utilization and enough occupancy and ILP to keep the math pipelines fed. The roofline stops you from "optimizing" the wrong dimension: adding FLOP/s to a memory-bound kernel buys nothing.
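
The roofline classification can be sketched in a few lines. This is a minimal illustration, not the platform's implementation: the `Device` and `classify` names are hypothetical, and the peak numbers in the usage example are round illustrative values, not the spec of any real GPU.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Device:
    peak_flops: float      # peak compute, FLOP/s
    peak_bandwidth: float  # peak memory bandwidth, bytes/s

    @property
    def ridge_point(self) -> float:
        # Arithmetic intensity (FLOPs/byte) where the two ceilings meet.
        return self.peak_flops / self.peak_bandwidth


def classify(flops: float, bytes_moved: float, seconds: float, dev: Device) -> dict:
    """Place one kernel on the roofline and report which ceiling applies."""
    ai = flops / bytes_moved
    # Attainable performance is the lower of the two ceilings at this AI.
    roof = min(dev.peak_flops, ai * dev.peak_bandwidth)
    achieved = flops / seconds
    return {
        "arithmetic_intensity": ai,
        "bound": "memory" if ai < dev.ridge_point else "compute",
        "attainable_flops": roof,
        "fraction_of_roof": achieved / roof,
    }


# Illustrative device: 100 TFLOP/s and 2 TB/s, so the ridge is 50 FLOPs/byte.
dev = Device(peak_flops=100e12, peak_bandwidth=2e12)

# Elementwise-style kernel: few FLOPs per byte moved.
softmax_like = classify(flops=5e9, bytes_moved=4e9, seconds=4e-3, dev=dev)

# fp16 GEMM with M = N = K = 4096: 2*M*N*K FLOPs over three 4096x4096 matrices.
gemm = classify(flops=2 * 4096**3, bytes_moved=3 * 4096**2 * 2, seconds=2e-3, dev=dev)
```

The `fraction_of_roof` field is the "how far from optimal" number; the `bound` field says which ceiling to push on.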

Metric	What it tells you
Occupancy — active warps / max warps per SM	Whether you have enough resident threads to hide memory and pipeline latency. Low occupancy is often a register- or shared-memory-pressure problem; high occupancy is neither sufficient nor always necessary for speed, because instruction-level parallelism can also hide latency.
Arithmetic intensity — `FLOPs / bytes`	Which roofline ceiling applies. Low AI ⇒ memory-bound (raise reuse/fusion); high AI ⇒ compute-bound (chase FLOP/s and tensor cores).
Achieved memory bandwidth — GB/s vs HBM peak	For memory-bound kernels, how close you are to the bandwidth roof. Far from peak usually means uncoalesced or strided access patterns.
Achieved FLOP/s — vs peak (and tensor-core peak)	For compute-bound kernels, the gap to the compute roof; a big gap on a GEMM often means tensor cores are idle or the tile shape is wrong.
Warp-stall reasons — memory dep, exec dep, sync, etc.	Where warps actually wait. `long_scoreboard` stalls point at memory latency; barrier stalls point at over-synchronization; this is the "why is occupancy not helping?" detail.
Kernel timeline — launch order, gaps, overlap	Whether the GPU is busy at all. Idle gaps mean host-side launch overhead or data starvation; missing compute/copy overlap means streams are not pipelined.

Why the timeline matters as much as the counters

Per-kernel counters answer "is this kernel efficient?" but the timeline answers "is the GPU even working?". A fleet can be full of perfectly efficient kernels and still sit at 40% utilization because of launch overhead between thousands of tiny kernels, or because a copy and a compute stream never overlap. Kernel selection (this system) and kernel scheduling (the runtime) are different levers — read the timeline before you blame a kernel.
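
As a rough sketch of the timeline check, the busy fraction and idle gaps can be computed from kernel start/end timestamps by merging overlapping intervals. The function name and the tuple layout are assumptions for illustration; real activity records carry more fields.

```python
def gpu_busy_fraction(intervals, window_start, window_end):
    """Return (busy fraction, idle gaps) for kernel (start, end) intervals.

    Intervals may overlap (kernels on different streams); overlapping time
    is counted once. Units are whatever the timestamps use.
    """
    merged = []
    for start, end in sorted(intervals):
        # Clip each kernel to the observation window.
        start, end = max(start, window_start), min(end, window_end)
        if start >= end:
            continue
        if merged and start <= merged[-1][1]:
            # Overlaps or touches the previous busy span: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    busy = sum(end - start for start, end in merged)
    # Idle gaps between consecutive busy spans.
    gaps = [(a[1], b[0]) for a, b in zip(merged, merged[1:])]
    return busy / (window_end - window_start), gaps


# Two overlapping kernels, a long idle gap, then a third kernel.
fraction, gaps = gpu_busy_fraction([(0, 10), (5, 15), (30, 40)], 0, 50)
```

A low busy fraction with many short gaps suggests launch overhead; a few long gaps suggest data starvation or a missing copy/compute overlap.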

Profiling data pipeline

The naive approach — run Nsight Compute on production and read the report — does not survive contact with a fleet. Full kernel-metric replay can slow a kernel by 10-100× because the profiler re-runs each kernel many times to collect every counter. So the pipeline is layered by cost: cheap activity tracing is always on, expensive metric collection is sampled, and the heavy lifting (aggregation, ranking, regression) happens off the critical path in a store.

flowchart LR
    subgraph FLEET["Instrumented runs (many GPUs)"]
        APP["App + CUPTI / Nsight hook"]
        SAMP["Sampler: 1 of N runs"]
    end
    APP --> COL["Trace collector"]
    SAMP --> COL
    COL --> STORE["Profile store + aggregate"]
    STORE --> AN["Analysis: top kernels, regressions"]
    AN --> DASH["Dashboards + alerts"]
    DASH --> ENG["Performance engineer"]
    ENG -->|drill in| STORE

Instrument with CUPTI / Nsight. The CUPTI Activity API is the workhorse: it streams kernel launch records (name, grid/block, duration, stream) with low overhead, no replay. Reserve Nsight Compute-style metric collection (occupancy, stall reasons, bandwidth) for sampled deep dives, because it replays kernels and is expensive. Two tiers, two overhead budgets.
Collect traces from many runs / GPUs. A node-local collector batches and compresses activity records, writes the batches to storage, and ships pointers to them to a central pipeline. Sampling keeps overhead low: trace 1-in-N runs, or arm deep metric collection only for a few iterations on a few ranks. The same kernel shows up across thousands of runs, so you do not need every run to characterize it.
Store + aggregate. Raw traces are large and write-once → object storage. The aggregate — per (kernel, shape, gpu, build) stats like p50/p95 duration, call count, total time, classification — goes to a queryable store. Aggregation is where "300M launch records" becomes "the 20 kernels that own 80% of the time."
Surface top kernels and regressions. Analysis ranks kernels by total fleet time (your tuning backlog, in priority order) and diffs each build against a baseline to flag slowdowns. The output is two short lists — tune these and something got slower — not a wall of counters.
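
The aggregation step above might look like the following sketch. The record layout (a key plus a duration in microseconds), the nearest-rank percentile, and the function names are assumptions made for illustration.

```python
import math
from collections import defaultdict


def percentile(sorted_vals, q):
    """Nearest-rank percentile of an already sorted, non-empty list."""
    rank = max(1, math.ceil(q / 100 * len(sorted_vals)))
    return sorted_vals[rank - 1]


def aggregate(records):
    """Collapse (key, duration_us) launch records into per-key stats.

    `key` would be something like (kernel, shape, gpu, build); here any
    hashable value works. Rows come back sorted by total time, descending.
    """
    durations = defaultdict(list)
    for key, duration_us in records:
        durations[key].append(duration_us)
    rows = []
    for key, vals in durations.items():
        vals.sort()
        rows.append({
            "key": key,
            "calls": len(vals),
            "total_us": sum(vals),
            "p50_us": percentile(vals, 50),
            "p95_us": percentile(vals, 95),
        })
    rows.sort(key=lambda r: r["total_us"], reverse=True)
    return rows


def top_offenders(rows, share=0.8):
    """Smallest prefix of the ranked rows that covers `share` of total time."""
    total = sum(r["total_us"] for r in rows)
    picked, running = [], 0
    for row in rows:
        if running >= share * total:
            break
        picked.append(row["key"])
        running += row["total_us"]
    return picked


records = [("gemm", 100)] * 3 + [("softmax", 50)] * 2 + [("layernorm", 20), ("copy", 5)]
rows = aggregate(records)
backlog = top_offenders(rows)
```

Here `backlog` is the tuning backlog in priority order: the few keys that together own the requested share of total time.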

Overhead is the whole game

The reason this is a pipeline and not "just run the profiler" is that profiling perturbs what it measures. Always-on CUPTI activity tracing typically costs a few percent; full Nsight metric replay can cost 10-100×. The design keeps the always-on tier cheap and treats deep profiling as a sampled, time-boxed, opt-in capture: the same discipline as production tracing, applied to kernels.
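
One way to keep the sampling decision cheap and reproducible is a pair of pure predicates evaluated on the node. This is a sketch under assumptions: the function names, the run-id hashing scheme, and the default rank/iteration window are all hypothetical.

```python
import zlib


def trace_run(run_id: str, one_in_n: int) -> bool:
    """Deterministically select roughly 1-in-N runs for trace shipping.

    Hashing the run id (instead of calling a random generator) means every
    rank of the same job reaches the same decision without coordination.
    """
    return zlib.crc32(run_id.encode("utf-8")) % one_in_n == 0


def deep_capture(rank: int, iteration: int, ranks=(0,), start_iter=100, num_iters=3) -> bool:
    """Arm expensive metric replay only on a few ranks for a few iterations.

    Starting after `start_iter` skips warm-up; `num_iters` is the time box.
    """
    return rank in ranks and start_iter <= iteration < start_iter + num_iters
```

For example, `deep_capture(rank=0, iteration=101)` is true while `deep_capture(rank=3, iteration=101)` is false, so only rank 0 pays the replay cost, and only for three iterations.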

Deep dive: auto-tuning the kernel config space

This is the core of the system. A single logical op, say a GEMM of a given (M, N, K), can be compiled into thousands of distinct kernels that all compute the same result but at wildly different speeds. The knobs are things like tile sizes (BLOCK_M/N/K), block / warp dims (num_warps in Triton), unroll factors, pipeline depth (software pipelining / multi-stage shared-memory prefetch, num_stages in Triton), and vector width. The best combination depends on the shape and the GPU, and no closed-form rule picks it, so we search. This is exactly what the Triton @autotune decorator, the CUTLASS profiler, and TVM/Ansor do.
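
To make the size of that space concrete, here is a small sketch that enumerates a GEMM config space and discards configs that cannot fit in shared memory before anything is compiled. The knob values, the shared-memory estimate (one A tile and one B tile buffered per pipeline stage), and the 48 KiB limit are illustrative assumptions; real limits depend on the architecture and the kernel.

```python
from itertools import product

# Hypothetical knob values for a tiled GEMM.
GEMM_SPACE = {
    "BLOCK_M": [32, 64, 128, 256],
    "BLOCK_N": [32, 64, 128, 256],
    "BLOCK_K": [16, 32, 64],
    "num_warps": [2, 4, 8],
    "num_stages": [2, 3, 4, 5],
}


def shared_mem_bytes(cfg, elem_bytes=2):
    """Rough shared-memory footprint: A and B tiles, buffered per stage."""
    a_tile = cfg["BLOCK_M"] * cfg["BLOCK_K"]
    b_tile = cfg["BLOCK_K"] * cfg["BLOCK_N"]
    return (a_tile + b_tile) * elem_bytes * cfg["num_stages"]


def valid_configs(space, smem_limit_bytes):
    """Yield every config in the cross product that fits the budget."""
    keys = list(space)
    for values in product(*space.values()):
        cfg = dict(zip(keys, values))
        if shared_mem_bytes(cfg) <= smem_limit_bytes:
            yield cfg


everything = list(valid_configs(GEMM_SPACE, smem_limit_bytes=10**9))
fits = list(valid_configs(GEMM_SPACE, smem_limit_bytes=48 * 1024))
```

Even this small example has 4 × 4 × 3 × 3 × 4 = 576 points before pruning, and a resource check removes a share of them at zero GPU cost.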

The compile-and-benchmark loop

The atomic operation is: take a candidate config, compile the kernel variant, benchmark it on the target GPU (warm up, lock clocks, time many iterations, take a robust statistic), and keep the fastest. Compilation is not free and benchmarking burns real GPU time, so the loop is wrapped in caching and smart search to run it as few times as possible.

sequenceDiagram
    participant T as Autotuner
    participant Ca as Config cache
    participant K as Compiler
    participant G as Benchmark GPU
    T->>Ca: Lookup best config for shape + GPU
    Ca-->>T: Cache miss
    loop Each candidate config
        T->>K: Compile kernel variant
        K-->>T: Binary
        T->>G: Benchmark, timed run
        G-->>T: Latency and throughput
    end
    T->>Ca: Store fastest config
    Ca-->>T: Cached for reuse
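
The sequence above can be sketched as a cache-first loop around a timing harness. All names here are hypothetical, and the harness times a plain Python callable with `time.perf_counter` as a stand-in: GPU launches are asynchronous, so a real harness has to synchronize (or use device-side event timers) around the timed region.

```python
import time
from statistics import median


def benchmark(fn, warmup=3, iters=20):
    """Warm up, time many iterations, return the median in seconds."""
    for _ in range(warmup):
        fn()  # warm caches and lazy initialization before timing
    samples = []
    for _ in range(iters):
        t0 = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - t0)
    return median(samples)  # robust to the occasional slow outlier


def autotune(key, candidates, compile_fn, cache, bench=benchmark):
    """Return the fastest config for `key`, searching only on a cache miss."""
    if key in cache:
        return cache[key]["config"]  # warm start: zero search cost
    best_cfg, best_time = None, float("inf")
    for cfg in candidates:
        kernel = compile_fn(cfg)  # compile this variant
        elapsed = bench(kernel)   # measure it on the target device
        if elapsed < best_time:
            best_cfg, best_time = cfg, elapsed
    cache[key] = {"config": best_cfg, "seconds": best_time}
    return best_cfg
```

Passing `bench` as a parameter keeps the loop testable without hardware and lets the same loop run with a stricter harness on the dedicated benchmark pool.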

Search strategies

The config space is combinatorial — easily 10^3 to 10^6 points once you cross all the knobs — and most points are invalid (exceed shared memory or register budgets) or slow. The strategy is how you avoid benchmarking all of them:

Strategy	How it searches	When to use it
Grid / exhaustive	Enumerate a hand-curated candidate list and benchmark every one.	Small, well-understood spaces (Triton's @autotune works this way by default: it benchmarks every config in a user-supplied list). Simple and reproducible; explodes if the list grows.
Random	Sample N configs uniformly from the valid space.	Surprisingly strong baseline for big spaces; great for seeding a cost model. Cheap to parallelize across GPUs.
Evolutionary / genetic	Mutate and recombine the best configs found so far, generation by generation.	Large rugged spaces where good configs cluster (Ansor uses this). Finds strong points without a full model.
ML-guided / cost-model	A learned cost model predicts runtime from config features; only the most promising are actually compiled and benchmarked.	When real benchmarks are expensive and you tune the same op family often. Cuts measured trials by orders of magnitude (TVM/Ansor, learned cost models).

In practice these compose: a cost model proposes, an evolutionary search explores around its predictions, and only the top handful are measured on hardware — because the measured benchmark is the ground truth and also the expensive part.
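
A toy version of that composition, assuming a `predict` callable stands in for the learned cost model and a `measure` callable stands in for the hardware benchmark. The search here is mutation-only (no recombination) to stay short, and every name and default is hypothetical.

```python
import random

SEARCH_SPACE = {
    "BLOCK_M": [32, 64, 128, 256],
    "BLOCK_N": [32, 64, 128, 256],
    "num_stages": [2, 3, 4, 5],
}


def mutate(cfg, space, rng):
    """Copy a config and re-draw one randomly chosen knob."""
    child = dict(cfg)
    knob = rng.choice(list(space))
    child[knob] = rng.choice(space[knob])
    return child


def guided_search(space, predict, measure, generations=5, population=16,
                  finalists=4, seed=0):
    """Evolve configs against the cheap model, then measure only the finalists."""
    rng = random.Random(seed)
    pop = [{k: rng.choice(v) for k, v in space.items()} for _ in range(population)]
    for _ in range(generations):
        pop.sort(key=predict)                # lower predicted time is better
        parents = pop[: population // 4]     # keep the best quarter
        children = [mutate(rng.choice(parents), space, rng)
                    for _ in range(population - len(parents))]
        pop = parents + children
    pop.sort(key=predict)
    # De-duplicate so each measured trial is spent on a distinct config.
    seen, top = set(), []
    for cfg in pop:
        ident = tuple(sorted(cfg.items()))
        if ident not in seen:
            seen.add(ident)
            top.append(cfg)
        if len(top) == finalists:
            break
    measured = [(measure(cfg), cfg) for cfg in top]  # the expensive step
    return min(measured, key=lambda pair: pair[0])   # (time, config)
```

The point of the structure is the call count: `predict` runs hundreds of times, while `measure` runs at most `finalists` times per tune.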

Caching best configs per shape + GPU

Tuning is amortized by never tuning the same thing twice. The output of a tune is a row in a config cache keyed by (op, dtype, shape-bucket, gpu_arch, toolkit_version) → winning config + measured time. A few things make the cache effective:

Shape bucketing. Tuning every exact shape is hopeless, so shapes are bucketed (powers of two, padded to tile multiples). A new shape that hashes to a known bucket is an instant cache hit.
Key on the GPU and toolkit. A config tuned for an A100 is often wrong on an H100, and a compiler upgrade can shift the winner — so arch and version are part of the key, never assumed.
Persist and share. The cache is a fleet-wide artifact (object store + small index), so one tune on one GPU benefits every subsequent job. Cold start tunes; warm start reads.
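
A minimal sketch of the key construction, assuming power-of-two bucketing on every dimension. The helper names are hypothetical, and the architecture and toolkit strings in the usage example are only illustrative values.

```python
def next_pow2(n: int) -> int:
    """Smallest power of two that is >= n (1 for n <= 1)."""
    return 1 if n <= 1 else 1 << (n - 1).bit_length()


def bucket_shape(shape):
    """Round every dimension up to its power-of-two bucket."""
    return tuple(next_pow2(dim) for dim in shape)


def cache_key(op, dtype, shape, gpu_arch, toolkit_version):
    """Key layout: (op, dtype, shape-bucket, gpu_arch, toolkit_version)."""
    return (op, dtype, bucket_shape(shape), gpu_arch, toolkit_version)


# A shape never seen before lands in an already tuned bucket.
tuned = cache_key("gemm", "fp16", (1024, 4096, 512), "sm_80", "12.4")
fresh = cache_key("gemm", "fp16", (1000, 4096, 500), "sm_80", "12.4")

# Same op and shape on a different architecture is a different key.
other_arch = cache_key("gemm", "fp16", (1024, 4096, 512), "sm_90", "12.4")
```

Coarser buckets raise the hit rate but make the cached config a worse fit for shapes near the bottom of the bucket, so the bucketing scheme is itself a tuning decision.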

The autotuner contract

Offline: for each hot (op, shape, gpu), search the config space with a model-guided strategy, benchmark the finalists on real hardware, and write the winner to the cache. Online: a kernel launch looks up its key and gets the pre-tuned config with zero search cost. The profiling pipeline feeds the backlog (which ops/shapes are hot), and the regression gate guards the cached winners. That loop — profile → tune → cache → guard — is the whole product.

Performance regression detection

Correctness has CI; performance usually does not — which is why kernels silently get slower. A code refactor, a compiler bump, or a new driver can knock 20% off a GEMM and no test goes red. This system closes that gap with a performance CI: a curated benchmark suite, per-kernel timing baselines, and a gate that flags slowdowns across commits, compiler, and driver versions.

Element	What it does
Benchmark suite in CI	A pinned set of representative kernels × shapes runs on dedicated, clock-locked GPUs on every relevant change. It is small enough to run often and representative enough to catch real regressions.
Per-kernel timing baselines	Each `(kernel, shape, gpu)` has a stored baseline distribution (p50/p95 over many runs), not a single number — so noise does not masquerade as a regression.
Slowdown flagging	A new result is compared to baseline; a statistically significant slowdown beyond a threshold (e.g. > 5% on p50) fails the gate and points at the offending commit.
Cross-version matrix	The same suite runs across CUDA toolkit and driver versions, so a regression introduced by the environment (not the code) is attributed correctly instead of blamed on the diff.

Measure like you mean it (or measure noise)

GPU timing is noisy unless you control it: lock clocks (so auto-boost and thermal drift do not shift the frequency between runs), warm up before timing, run many iterations and take a robust statistic (median, not mean), pin the GPU model, and isolate the benchmark host. Skip this and your "regression detector" mostly detects thermal throttling. Reproducibility is a requirement precisely so a flagged slowdown is real signal.
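
A gate along these lines could compare medians and require the slowdown to clear both a fixed threshold and the baseline's own noise. This is a heuristic sketch, not a formal significance test: the function names, the 5% default, and the use of median absolute deviation as the noise estimate are assumptions.

```python
from statistics import median


def mad(samples):
    """Median absolute deviation: a robust spread estimate."""
    center = median(samples)
    return median([abs(x - center) for x in samples])


def check_regression(baseline, candidate, threshold=0.05, noise_factor=3.0):
    """Return (regressed, relative_slowdown) for two lists of timings.

    A result counts as a regression only if the median slowed down by more
    than `threshold` AND by more than `noise_factor` baseline MADs, so a
    noisy baseline does not trip the gate on its own.
    """
    base_p50, cand_p50 = median(baseline), median(candidate)
    slowdown = (cand_p50 - base_p50) / base_p50
    noise = noise_factor * mad(baseline) / base_p50
    return (slowdown > threshold and slowdown > noise), slowdown


stable_baseline = [100, 101, 99, 100, 102, 98, 100]
noisy_baseline = [100, 80, 120, 90, 110, 100, 130]

real = check_regression(stable_baseline, [110, 111, 109, 110, 112])
small = check_regression(stable_baseline, [101, 102, 100, 101, 103])
masked = check_regression(noisy_baseline, [110, 111, 109, 110, 112])
```

The third case shows why the baseline is a distribution: the same 10% shift fails the gate against a stable baseline and is indistinguishable from noise against an unstable one, which is a signal to fix the harness first.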

Cross-reference with observability. Regression detection and the profiling pipeline are two views of the same data. CI catches a slowdown before merge in a controlled harness; fleet observability catches the slowdowns CI missed — a shape CI never benchmarked, a driver rolled out to production, a kernel hot only under real traffic. A regression flagged in CI deep-links to the same per-kernel profile drill-down the observability stack uses, so the responder starts at the evidence, not a red checkmark.

Bottlenecks & scaling

Every part of this system has a failure mode that only shows up at fleet scale, and they all trace back to the same tension: real measurement is expensive (it perturbs workloads and burns GPU time), so the design is a series of moves to measure less while learning more — sample, cache, model, and bucket.

Bottleneck	Why it happens	Mitigation
Profiling overhead	Full Nsight metric replay re-runs each kernel many times, slowing it 10-100× and distorting timing — unusable always-on.	Tier it: always-on CUPTI activity tracing (a few %); sampled, time-boxed metric collection on a few ranks/iterations; never replay in the hot path.
Search space size	Tile × block × unroll × stages × vector-width is `10^3`-`10^6` configs; exhaustive benchmarking is a GPU-week per op.	Model-guided search (cost model proposes, evolutionary explores, only finalists measured); prune invalid configs (resource limits) before compiling.
Benchmark GPU cost	Tuning competes with production for the very GPUs it is trying to speed up; each measured config costs compile + warm-up + timed runs.	Tune offline on a small dedicated pool; batch candidates; reuse compiled binaries; cap measured trials per tune via the cost model.
Config cache hit rate	Too-specific keys (exact shapes) → constant cache misses → constant re-tuning, defeating the point.	Bucket shapes (pad to tile multiples, power-of-two buckets); key on arch + toolkit, not exact device; persist and share the cache fleet-wide.
Hardware variance	Clock drift, thermals, silicon variation, and a heterogeneous fleet (A100 vs H100) make a tuned config non-portable and timings noisy.	Lock clocks; warm up; median over many iterations; tune per arch; treat each GPU generation as a distinct cache key.
Stale winners	A compiler or driver upgrade can change which config is fastest, silently degrading cached choices.	Version the cache key on toolkit/driver; re-validate hot keys in the regression suite; invalidate and re-tune on environment bumps.

Summary

A CUDA profiling & auto-tuning system turns "the GPUs feel slow" into a precise, repeatable loop: profile → tune → cache → guard. Profiling (layered CUPTI activity tracing plus sampled Nsight metrics, read through the roofline lens) ranks kernels by fleet time and tells you whether each is memory-bound or compute-bound. The autotuner takes the hot ones and searches the config space (tiles, blocks, unroll, pipelining) with search strategies borrowed from Triton, CUTLASS, and TVM/Ansor, model-guided where the space is large, then writes the winner into a (op, shape, gpu, toolkit) config cache so the next thousand jobs pay zero search cost. A performance CI with clock-locked, reproducible benchmarks guards those winners against code, compiler, and driver regressions. The unifying discipline is that real measurement is expensive, so sample, cache, model, and bucket to measure as little as possible while still knowing the fastest way to run every kernel that matters.