GPU Autotuning

SCAMP’s GPU path picks one of several pre-built kernel variants (different combinations of block size, diagonals-per-thread, tile height, etc.) for each (profile type, precision) tuple at launch time. Different GPUs prefer different variants — what wins on Ampere isn’t necessarily what wins on Pascal — so SCAMP carries an autotune cache that maps each device to its preferred variant. This page covers what the cache is, where it lives, and how to use it.

TL;DR

SCAMP runs on any supported GPU out of the box with a safe per-profile compile-time default. Running the autotuner once gets you the best variant for your specific GPU; the result is cached per-user and reused automatically by every subsequent SCAMP / pyscamp call.
```
# CLI:
$ SCAMP --autotune

# Python:
>>> import pyscamp
>>> pyscamp.autotune()
```
This takes a few minutes to run and persists its choices to disk. The default location is ~/.cache/scamp/autotune.txt on Linux/macOS and %LOCALAPPDATA%\scamp\autotune.txt on Windows. All subsequent SCAMP runs will pick up your tuned config automatically. See Default cache location for the full resolution rules.

How lookups work

When SCAMP launches a GPU kernel, it asks the autotuner for the best config for the current (device, profile_type, precision) tuple. The lookup tries these sources in order; the first hit wins:

1. Process-wide override. Used internally by the autotune benchmark loop to force a specific variant per timed trial, and by CI via the SCAMP_FORCE_VARIANT env var to exercise individual variants.
2. User cache: ~/.cache/scamp/autotune.txt on Linux/macOS or %LOCALAPPDATA%\scamp\autotune.txt on Windows by default (see Default cache location for the full resolution rules). Written by SCAMP --autotune / pyscamp.autotune(). If you’ve tuned for your GPU, this is what gets used.
3. Compile-time default. A safe per-profile-type variant. Works on every supported device with sensible out-of-the-box performance, but the variant picked by running --autotune once for your GPU is typically faster still.
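
The first-hit-wins order above can be sketched in a few lines of Python. This is an illustration of the documented lookup order only, not SCAMP's actual implementation: the function name, the dictionary-based cache, and the variant labels are all hypothetical.

```python
from typing import Optional

# (device_key, profile_type, precision); illustrative alias, not a SCAMP type.
Key = tuple[str, str, str]


def resolve_config(
    key: Key,
    forced_variant: Optional[str],
    user_cache: dict[Key, str],
    compile_time_defaults: dict[str, str],
) -> tuple[str, str]:
    """Return (config, source) following the documented lookup order."""
    # 1. Process-wide override (benchmark loop, or SCAMP_FORCE_VARIANT in CI).
    if forced_variant is not None:
        return forced_variant, "override"
    # 2. User cache written by SCAMP --autotune / pyscamp.autotune().
    if key in user_cache:
        return user_cache[key], "user cache"
    # 3. Compile-time default, chosen per profile type.
    _, profile_type, _ = key
    return compile_time_defaults[profile_type], "compile-time default"
```

With an empty cache and no override, the call falls through to the per-profile default; adding a cache entry for the same key changes the source to "user cache" without touching the defaults.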

Default cache location

--autotune writes to (and GetKernelConfigForDevice reads from) the first of these paths that resolves:

1. $SCAMP_AUTOTUNE_CACHE: when set, used verbatim (Linux, macOS, and Windows).
2. $XDG_CACHE_HOME/scamp/autotune.txt: when XDG_CACHE_HOME is set (Linux default for users following the XDG Base Directory spec; honored on Windows too if explicitly set).
3. Platform-specific user dir:
   - Linux / macOS: $HOME/.cache/scamp/autotune.txt.
   - Windows: %LOCALAPPDATA%\scamp\autotune.txt (typically C:\Users\<you>\AppData\Local\scamp\autotune.txt), falling back to %USERPROFILE%\.cache\scamp\autotune.txt if LOCALAPPDATA is unset.

The parent directory is created automatically by Save() if it doesn’t exist, so you don’t need to mkdir -p it yourself.
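
As a rough illustration, the resolution order can be mirrored in Python. The helper below is hypothetical and takes the environment as a plain mapping so it can be exercised without touching the real one. It assumes an empty variable counts as unset, which the text above does not specify.

```python
from typing import Mapping


def default_cache_path(env: Mapping[str, str], windows: bool) -> str:
    """Sketch of the documented cache-path resolution order."""
    sep = "\\" if windows else "/"
    # 1. Explicit override, used verbatim on every platform.
    #    Assumption: an empty value is treated the same as unset.
    if env.get("SCAMP_AUTOTUNE_CACHE"):
        return env["SCAMP_AUTOTUNE_CACHE"]
    # 2. XDG cache directory, when set.
    if env.get("XDG_CACHE_HOME"):
        return sep.join([env["XDG_CACHE_HOME"], "scamp", "autotune.txt"])
    # 3. Platform-specific user directory.
    if windows:
        if env.get("LOCALAPPDATA"):
            return sep.join([env["LOCALAPPDATA"], "scamp", "autotune.txt"])
        return sep.join([env["USERPROFILE"], ".cache", "scamp", "autotune.txt"])
    return sep.join([env["HOME"], ".cache", "scamp", "autotune.txt"])
```

Calling `default_cache_path(os.environ, windows=(os.name == "nt"))` gives a quick way to see which file a given shell environment would likely resolve to.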

Running the autotuner

SCAMP --autotune (or pyscamp.autotune()) sweeps every enabled variant × every supported block size for every (profile_type, precision) pair and persists the per-tuple winner to the user cache. A full sweep is the current variant count × 4 block sizes × 10 targets; with the 5 variants enabled today that’s 200 benchmark trials. With the default benchmark workload (256K-element synthetic self-join) the sweep takes ~10-20 minutes on a recent GPU; the output is verbose by default so you can see progress.
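
The trial count is a straight product, so it is easy to recompute when the variant list changes. The helper name below is made up for this page; the defaults come from the figures quoted above (4 block sizes, 10 targets).

```python
def sweep_trials(num_variants: int, num_block_sizes: int = 4, num_targets: int = 10) -> int:
    """Number of benchmark trials in a full autotune sweep."""
    return num_variants * num_block_sizes * num_targets


print(sweep_trials(5))  # 200, matching the 5 variants enabled today
```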

Choosing the benchmark workload size

The synthetic workload used per trial is sized via SCAMP_AUTOTUNE_INPUT_LENGTH (default 262144 = 256K elements). Work scales like n², so doubling the size roughly quadruples the sweep’s wall-clock cost — but the per-variant ranking gets tighter as n grows: at small n the FFT/stats prelude dominates and trial timings collapse toward the noise floor, while at production sizes the kernel work swamps the prelude and the ranking is dominated by what you actually care about.

Empirical comparison on an RTX 3080 across the standard 10 (profile, precision) targets:

Input length	Sweep wall	Cross-target geomean	Worst-case ratio	Per-target winners
65536 (64K)	~4 min	1.325	2.25	shift vs 128K
131072 (128K)	~8 min	1.308	3.47	shift vs 256K
262144 (256K)	~25 min	1.278	2.77	default

The geomean ratio above is the cross-target “best recommended default” score (lower is better — 1.000 would mean a single variant tied with every per-target winner). The 256K row is meaningfully tighter than 128K (worst-case ratio drops 3.47 → 2.77, a ~20% reduction), and importantly, the per-target winners themselves shift between the rows (e.g. SUM_THRESH/DOUBLE picks different variants at 64K vs 128K vs 256K), so a smaller-N autotune doesn’t just mis-rank the cross-target default — it picks suboptimal entries for individual cache rows.
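
The arithmetic behind these figures can be checked directly from the table. The snippet below copies the (approximate) sweep times and worst-case ratios from the rows above and compares the measured slowdown against the idealized n² prediction; the variable and function names are illustrative only.

```python
# Rows copied from the table: input length -> (sweep minutes, worst-case ratio).
rows = {65536: (4, 2.25), 131072: (8, 3.47), 262144: (25, 2.77)}


def quadratic_cost_ratio(n_small: int, n_large: int) -> float:
    """Predicted slowdown if work scales exactly like n squared."""
    return (n_large / n_small) ** 2


predicted = quadratic_cost_ratio(131072, 262144)             # 4.0
measured = rows[262144][0] / rows[131072][0]                 # 3.125
worst_case_reduction = 1 - rows[262144][1] / rows[131072][1]  # about 0.20

print(f"predicted {predicted:.1f}x, measured {measured:.2f}x")
print(f"worst-case ratio reduction: {worst_case_reduction:.1%}")
```

The measured 128K to 256K slowdown comes out somewhat below the quadratic prediction, which is what you would expect if a fixed per-trial overhead still contributes at these sizes.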

Run with a smaller value if 256K is impractically slow on your GPU (older Pascal or T-class cards can take well over an hour at 256K), or with a larger value when you want tighter rankings for a workload you know runs at large input sizes:

$ SCAMP_AUTOTUNE_INPUT_LENGTH=131072 SCAMP --autotune  # fast/casual
$ SCAMP_AUTOTUNE_INPUT_LENGTH=524288 SCAMP --autotune  # tighter still

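The trade-off is wall-clock time: because work scales like n², a 256K sweep takes roughly 4x longer than a 128K sweep (about 3x in the RTX 3080 measurements above).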

If you don’t plan to run large joins in practice, tuning with a smaller input size is more representative of your workload. There is no need to tune at a larger size than you will actually use.

Choosing the device(s)

Both the CLI and pyscamp.autotune() default to device 0 only — on a multi-GPU box with identical devices, tuning them all wastes time. Override explicitly if you really do need to tune a second physical GPU type:

$ SCAMP --autotune --gpus=0,1            # CLI
>>> pyscamp.autotune(devices=[0, 1])     # Python
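
On a box with a mix of identical and different GPUs, you only need one device index per distinct model. A small sketch of that selection follows; the helper is hypothetical, and how you obtain the (name, compute capability) list for your machine is left to you.

```python
def one_index_per_model(devices):
    """Pick the first device index for each distinct (name, compute capability).

    devices: list of (name, compute_capability) tuples in device-index order.
    """
    first_seen = {}
    for index, model in enumerate(devices):
        # setdefault keeps the lowest index seen for each model.
        first_seen.setdefault(model, index)
    return sorted(first_seen.values())


gpus = [
    ("NVIDIA GeForce RTX 3080", "sm_86"),
    ("NVIDIA GeForce RTX 3080", "sm_86"),
    ("Tesla P100", "sm_60"),
]
print(one_index_per_model(gpus))  # [0, 2]
```

The resulting list can then be passed as the `devices` argument shown above, e.g. `pyscamp.autotune(devices=[0, 2])`.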

Other autotune environment variables

A handful of additional env vars let you tune SCAMP’s autotune / launch-time behavior without rebuilding. Each is read on first use and its value is cached for the lifetime of the process, so changing or re-exporting a variable after first use has no effect.

SCAMP_AUTOTUNE_PRECISION_FILTER: SINGLE | DOUBLE | all (default). Restricts SCAMP --autotune (and pyscamp.autotune()) to one precision. Filtered targets are reported as SKIPPED and their cache entries are left untouched, so you can re-run for the other precision without losing the existing entries.
SCAMP_AUTOTUNE_VARIANT_FILTER: shfl | sliding-window (also sw / smem) | all (default). Restricts the autotune sweep to one variant family. shfl matches every variant with unrolled_rows == 0 (the cov-shuffle kernel); the other names match the sliding-window kernel. Useful when iterating on a specific kernel family without sweeping the other.
SCAMP_AUTOTUNE_WARMUP_RUNS: Per-trial warmup count for the autotuner’s bench function. Default 0: the first launch of a given (variant, blocksz) instantiation is typically only a few percent slower than steady-state because most JIT / module-load cost is amortized by the process-level first launch, and the cross-target geomean ranking tolerates a few percent of noise. Set to 1 (or more) when trial timings look noisy or on a colder GPU/driver where the first launch of a never-before-seen template instantiation takes significantly longer than steady-state. Value is cached at first autotune call.
SCAMP_FORCE_VARIANT: Index of a single GPU kernel variant to force for every (profile, precision) launch, bypassing the autotune cache and cold-start default. The precision still picks the cold-start blocksz; the full {64, 128, 256, 512} blocksz axis is NOT swept here. Used by CI to exercise each compiled variant against the correctness test suite without writing per-variant cache files. Valid indices are reported by SCAMP --list_variants. Out-of-range or malformed values silently fall through to the normal lookup path. Value is cached at first kernel launch.

For boolean-valued environment variables, the value is interpreted as truthy unless it’s exactly 0, false, FALSE, or empty.
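
The truthiness rule and the read-once caching described in this section can be sketched as follows. This is an illustration, not SCAMP's code: both helper names and the SCAMP_EXAMPLE_FLAG variable are hypothetical, and treating an unset variable as falsy is an assumption.

```python
import os
from functools import lru_cache

_FALSY = {"0", "false", "FALSE", ""}


def is_truthy(value):
    """Truthy unless the value is exactly 0, false, FALSE, or empty."""
    # Assumption: an unset variable (None) is treated as falsy.
    if value is None:
        return False
    return value not in _FALSY


@lru_cache(maxsize=None)
def read_env_once(name):
    """Read an env var on first use and cache it for the process lifetime."""
    return os.environ.get(name)


os.environ["SCAMP_EXAMPLE_FLAG"] = "1"   # hypothetical variable name
first = read_env_once("SCAMP_EXAMPLE_FLAG")
os.environ["SCAMP_EXAMPLE_FLAG"] = "0"   # changed after first use
second = read_env_once("SCAMP_EXAMPLE_FLAG")
print(first, second)  # 1 1 (the later change is not observed)
```

Note that the match is exact: a value such as `False` or `no` would still count as truthy under this rule.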

Clearing or resetting the cache

The user cache is a plain-text file at the location described in Default cache location. To start fresh:

# Linux / macOS:
$ rm ~/.cache/scamp/autotune.txt

# Windows (PowerShell):
> Remove-Item "$env:LOCALAPPDATA\scamp\autotune.txt"

The next SCAMP run will fall through to the compile-time default and emit a miss warning. Run --autotune again to regenerate the file.

If you suspect the user cache has a bad entry but don’t want to delete the file (e.g. it has good entries for some devices), you can edit it by hand — each line is one record, # starts a comment, and the format is documented in the file’s own header.

To bypass the cache entirely without deleting it, point SCAMP_AUTOTUNE_CACHE at an empty file:

$ touch /tmp/empty_cache.txt
$ SCAMP_AUTOTUNE_CACHE=/tmp/empty_cache.txt SCAMP ...
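
If you launch SCAMP from a Python script, the same trick can be applied by building the child environment yourself. The helper below is a hypothetical convenience; it only creates an empty file and sets SCAMP_AUTOTUNE_CACHE in a copy of the environment.

```python
import os
import tempfile


def env_with_empty_cache(base_env=None):
    """Return a copy of the environment pointing SCAMP at a fresh empty file."""
    env = dict(os.environ if base_env is None else base_env)
    # mkstemp creates the (empty) file and returns an open descriptor.
    fd, path = tempfile.mkstemp(prefix="scamp_empty_cache_", suffix=".txt")
    os.close(fd)
    env["SCAMP_AUTOTUNE_CACHE"] = path
    return env
```

Pass the returned mapping as the `env` argument of `subprocess.run` when invoking the SCAMP binary with your usual arguments, and delete the temporary file afterwards.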

What happens to my cache when I upgrade SCAMP?

By default, an existing user cache survives the upgrade — the file format is keyed by the variant geometry tuple (blocks_per_sm, diags_per_thread, unrolled_rows, outer_unrolled_rows, kernel_tile_iters), not by a position in some table, so cache entries that still match a current kernel variant continue to hit.

The three things that can happen to an existing entry after an upgrade:

1. The new SCAMP build still has your entry’s variant tuple. Lookup succeeds and you keep your tuned config. This is the common case when a release just adds new variants.
2. The new build retired your entry’s variant. The runtime rejects the entry (it doesn’t match any current variant) and falls through to the compile-time default. Other entries in the same cache file are unaffected; only the one(s) naming the retired variant fall through. Run --autotune again to refresh the affected (device, profile, precision) tuples.
3. The release bumped the cache file’s version header. This is reserved for hard-incompatibility changes (the file schema changed, or kernel semantics shifted enough that every tuned config is stale). The new SCAMP silently treats the file as empty, so --autotune is required to get back to a tuned state. SCAMP will not throw or refuse to run; it will just use the compile-time defaults until you re-tune.

You don’t need to delete your cache after a SCAMP upgrade. Run --autotune again if a release note tells you to; otherwise, your existing entries keep being used wherever they remain applicable.
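
The upgrade outcomes described in this section reduce to a simple filter. The sketch below is only a model of that behavior: the function, the integer version numbers, and the sample variant tuples are all hypothetical, and the real version-header format is not shown here.

```python
def surviving_entries(entries, current_variants, file_version, supported_version):
    """Return the cache entries that would still hit after an upgrade.

    entries: {(device, profile, precision): variant_tuple}
    variant_tuple: (blocks_per_sm, diags_per_thread, unrolled_rows,
                    outer_unrolled_rows, kernel_tile_iters)
    """
    # Version header bump: the whole file is treated as empty.
    if file_version != supported_version:
        return {}
    # Otherwise keep only entries whose variant tuple exists in this build;
    # the rest fall through to the compile-time default.
    return {k: v for k, v in entries.items() if v in current_variants}


kept = (2, 4, 0, 0, 1)      # hypothetical variant still shipped
retired = (1, 2, 4, 2, 1)   # hypothetical variant removed by the upgrade
cache = {
    ("gpu", "SUM_THRESH", "DOUBLE"): kept,
    ("gpu", "SUM_THRESH", "SINGLE"): retired,
}
print(surviving_entries(cache, {kept}, file_version=1, supported_version=1))
```

Only the entry naming the retired variant drops out; the other entry in the same file keeps hitting.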

Troubleshooting

“My configs aren’t being respected on a multi-GPU box.”: The cache is keyed by sanitized device name + compute capability (e.g. NVIDIA_GeForce_RTX_3080__sm_86). If you have two different GPU models, you need entries for both — autotune device 0 first, then re-run with --gpus=1 (or devices=[1]) for the second.
“I want to test a specific variant by hand.”: Edit the user cache directly (~/.cache/scamp/autotune.txt on Linux/macOS, %LOCALAPPDATA%\scamp\autotune.txt on Windows): each line is device_key|profile|precision|blocksz|bps|dpt|ur|our|kti. However, you must specify a valid variant defined in src/core/gpu_kernel/CMakeLists.txt; the variants have to be specified at BUILD time for them to be included in the SCAMP binary. Lines that name an unknown variant are silently rejected at lookup time and the next source is consulted, so it’s safe to experiment.
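
When editing the cache by hand, it can help to sanity-check a line before saving it. The parser below is a minimal sketch based on the field order quoted above. It assumes the six trailing fields are integers and that comments occupy whole lines, neither of which is confirmed here; the file's own header remains the authoritative format description. The sample values are invented.

```python
from typing import NamedTuple, Optional


class CacheEntry(NamedTuple):
    device_key: str
    profile: str
    precision: str
    blocksz: int
    bps: int   # blocks_per_sm
    dpt: int   # diags_per_thread
    ur: int    # unrolled_rows
    our: int   # outer_unrolled_rows
    kti: int   # kernel_tile_iters


def parse_cache_line(line: str) -> Optional[CacheEntry]:
    """Parse one record; return None for blanks, comments, or malformed lines."""
    text = line.strip()
    if not text or text.startswith("#"):
        return None
    fields = text.split("|")
    if len(fields) != 9:
        return None
    device_key, profile, precision, *numbers = fields
    try:
        values = [int(n) for n in numbers]  # assumption: numeric fields are ints
    except ValueError:
        return None
    return CacheEntry(device_key, profile, precision, *values)


entry = parse_cache_line("NVIDIA_GeForce_RTX_3080__sm_86|SUM_THRESH|DOUBLE|256|2|4|0|0|1")
print(entry.blocksz, entry.ur)  # 256 0
```

A line that parses cleanly can still be rejected at lookup time if its variant tuple was not compiled into your SCAMP binary.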