GPU Autotuning
SCAMP’s GPU path picks one of several pre-built kernel variants (different combinations of block size, diagonals-per-thread, tile height, etc.) for each (profile type, precision) tuple at launch time. Different GPUs prefer different variants — what wins on Ampere isn’t necessarily what wins on Pascal — so SCAMP carries an autotune cache that maps each device to its preferred variant. This page covers what the cache is, where it lives, and how to use it.
TL;DR
SCAMP runs on any supported GPU out of the box with a safe per-profile compile-time default. Running the autotuner once gets you the best variant for your specific GPU; the result is cached per-user and reused automatically by every subsequent SCAMP / pyscamp call.
# CLI:
$ SCAMP --autotune
# Python:
>>> import pyscamp
>>> pyscamp.autotune()
This takes a few minutes to run and persists its choices to disk. The default location is ~/.cache/scamp/autotune.txt on Linux/macOS and %LOCALAPPDATA%\scamp\autotune.txt on Windows. All subsequent SCAMP runs will pick up your tuned config automatically. See Default cache location for the full resolution rules.
How lookups work
When SCAMP launches a GPU kernel, it asks the autotuner for the best
config for the current (device, profile_type, precision) tuple.
The lookup tries these sources in order; the first hit wins:
1. Process-wide override. Used internally by the autotune benchmark loop to force a specific variant per timed trial, and by CI via the SCAMP_FORCE_VARIANT env var to exercise individual variants.
2. User cache. ~/.cache/scamp/autotune.txt on Linux/macOS or %LOCALAPPDATA%\scamp\autotune.txt on Windows by default (see Default cache location for the full resolution rules). Written by SCAMP --autotune / pyscamp.autotune(). If you’ve tuned for your GPU, this is what gets used.
3. Compile-time default. A safe per-profile-type variant. Works on every supported device with sensible out-of-the-box performance, but running --autotune once for your GPU is typically faster still.
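The lookup order above can be sketched as a simple first-hit-wins chain. This is a minimal illustration, not SCAMP's implementation: the function name, the dict-based cache, and the config shape are all assumptions made for the example.

```python
from typing import Dict, Optional, Tuple

# (device, profile_type, precision); a hypothetical key shape for this sketch.
Key = Tuple[str, str, str]


def lookup_config(
    key: Key,
    override: Optional[dict],
    user_cache: Dict[Key, dict],
    compile_time_defaults: Dict[str, dict],
) -> Tuple[str, dict]:
    """Return (source, config) from the first source that has an answer."""
    _device, profile_type, _precision = key
    # 1. A process-wide override beats everything else.
    if override is not None:
        return "override", override
    # 2. A tuned entry for this exact (device, profile_type, precision) tuple.
    if key in user_cache:
        return "user_cache", user_cache[key]
    # 3. The per-profile-type default always exists, so the chain never fails.
    return "compile_time_default", compile_time_defaults[profile_type]
```

Because the last source always answers, a missing or partial user cache only costs performance, never correctness.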
Default cache location
--autotune writes to (and GetKernelConfigForDevice reads from)
the first of these paths that resolves:
1. $SCAMP_AUTOTUNE_CACHE: when set, used verbatim (Linux, macOS, and Windows).
2. $XDG_CACHE_HOME/scamp/autotune.txt: when XDG_CACHE_HOME is set (Linux default for users following the XDG Base Directory spec; honored on Windows too if explicitly set).
3. Platform-specific user dir:
   - Linux / macOS: $HOME/.cache/scamp/autotune.txt.
   - Windows: %LOCALAPPDATA%\scamp\autotune.txt (typically C:\Users\<you>\AppData\Local\scamp\autotune.txt), falling back to %USERPROFILE%\.cache\scamp\autotune.txt if LOCALAPPDATA is unset.
The parent directory is created automatically by Save() if it
doesn’t exist, so you don’t need to mkdir -p it yourself.
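To check which path your environment would resolve to, the rules above can be mirrored in a few lines of Python. This is a sketch of the documented order only; `resolve_cache_path` is a hypothetical helper, not part of pyscamp, and treating an empty variable as unset is an assumption.

```python
import sys
from pathlib import PurePosixPath, PureWindowsPath


def resolve_cache_path(env, platform=sys.platform):
    """Mirror the documented resolution order; `env` is a mapping like os.environ."""
    # 1. Explicit override, used verbatim on every platform.
    if env.get("SCAMP_AUTOTUNE_CACHE"):
        return env["SCAMP_AUTOTUNE_CACHE"]

    is_windows = platform.startswith("win")
    path_cls = PureWindowsPath if is_windows else PurePosixPath

    # 2. XDG cache directory, honored on any platform when set.
    if env.get("XDG_CACHE_HOME"):
        return str(path_cls(env["XDG_CACHE_HOME"]) / "scamp" / "autotune.txt")

    # 3. Platform-specific user directory.
    if is_windows:
        if env.get("LOCALAPPDATA"):
            return str(path_cls(env["LOCALAPPDATA"]) / "scamp" / "autotune.txt")
        return str(path_cls(env["USERPROFILE"]) / ".cache" / "scamp" / "autotune.txt")
    return str(path_cls(env["HOME"]) / ".cache" / "scamp" / "autotune.txt")
```

Calling `resolve_cache_path(os.environ)` (after `import os`) reports the path for the current shell.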
Running the autotuner
SCAMP --autotune (or pyscamp.autotune()) sweeps every enabled
variant × every supported block size for every (profile_type,
precision) pair and persists the per-tuple winner to the user cache.
A full sweep is the current variant count × 4 block sizes × 10 targets;
with the 5 variants enabled today that’s 200 benchmark trials. With
the default benchmark workload (256K-element synthetic self-join) the
sweep takes ~10-20 minutes on a recent GPU; the output is verbose by
default so you can see progress.
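The trial count is a plain product, so it is easy to estimate how the sweep grows when a release enables more variants. A small sketch (the function name is illustrative; the defaults of 4 block sizes and 10 targets come from the text above):

```python
def sweep_trial_count(num_variants: int, num_block_sizes: int = 4, num_targets: int = 10) -> int:
    """Trials in a full sweep: every variant x every block size x every target."""
    return num_variants * num_block_sizes * num_targets


print(sweep_trial_count(5))  # 200, the figure quoted for today's 5 variants
```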
Choosing the benchmark workload size
The synthetic workload used per trial is sized via
SCAMP_AUTOTUNE_INPUT_LENGTH (default 262144 = 256K elements).
Work scales like n², so doubling the size roughly quadruples the
sweep’s wall-clock cost — but the per-variant ranking gets tighter as
n grows: at small n the FFT/stats prelude dominates and trial
timings collapse toward the noise floor, while at production sizes the
kernel work swamps the prelude and the ranking is dominated by what
you actually care about.
Empirical comparison on an RTX 3080 across the standard 10 (profile, precision) targets:
| Input length | Sweep wall time | Cross-target geomean | Worst-case ratio | Per-target winners |
|---|---|---|---|---|
| 65536 (64K) | ~4 min | 1.325 | 2.25 | shift vs 128K |
| 131072 (128K) | ~8 min | 1.308 | 3.47 | shift vs 256K |
| 262144 (256K) | ~25 min | 1.278 | 2.77 | default |
The geomean ratio above is the cross-target “best recommended default” score (lower is better — 1.000 would mean a single variant tied with every per-target winner). The 256K row is meaningfully tighter than 128K (worst-case ratio drops 3.47 → 2.77, a ~20% reduction), and importantly, the per-target winners themselves shift between the rows (e.g. SUM_THRESH/DOUBLE picks different variants at 64K vs 128K vs 256K), so a smaller-N autotune doesn’t just mis-rank the cross-target default — it picks suboptimal entries for individual cache rows.
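One way to read the geomean score: for each variant, take the ratio of its time to the per-target winner's time on every target, then take the geometric mean of those ratios. The sketch below assumes that definition and a `times[variant][target]` layout; both are illustrative and not taken from SCAMP's source.

```python
import math


def geomean_ratio(times, variant):
    """Geometric mean, over targets, of this variant's time / best time for that target."""
    ratios = []
    for target, seconds in times[variant].items():
        best = min(times[v][target] for v in times)
        ratios.append(seconds / best)
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))


def best_single_default(times):
    """Variant with the lowest cross-target geomean ratio."""
    return min(times, key=lambda v: geomean_ratio(times, v))
```

A score of exactly 1.0 means the variant is the winner (or tied) on every target, which matches the "1.000" interpretation given above.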
Run with a smaller value if 256K is impractically slow on your GPU (older Pascal or T-class cards can take well over an hour at 256K), or with a larger value when you want tighter rankings for a workload you know runs at large input sizes:
$ SCAMP_AUTOTUNE_INPUT_LENGTH=131072 SCAMP --autotune # fast/casual
$ SCAMP_AUTOTUNE_INPUT_LENGTH=524288 SCAMP --autotune # tighter still
The trade-off is wall-clock time: because work scales like n², the 524288 (512K) run takes roughly 16x longer than the 131072 (128K) run.
Also note that if you don’t plan to run large joins in practice, tuning with a smaller input size better reflects your workload. If you will only use SCAMP on smaller inputs, there is no need to tune for a larger size.
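For a rough budget before starting a sweep, the n² rule can be applied to a measured reference point. The sketch below uses the ~8 min at 128K figure from the RTX 3080 table as its default reference; the function name is illustrative, and a pure quadratic model ignores the fixed prelude cost, so treat the result as an estimate only.

```python
def estimated_sweep_minutes(input_length, ref_length=131072, ref_minutes=8.0):
    """Scale a measured sweep time quadratically with input length."""
    return ref_minutes * (input_length / ref_length) ** 2


for n in (65536, 262144, 524288):
    print(n, estimated_sweep_minutes(n))
```

The model gives 32 min at 256K against the measured ~25 min, and 2 min at 64K against the measured ~4 min, so measuring one sweep on your own GPU gives a better reference than the defaults here.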
Choosing the device(s)
Both the CLI and pyscamp.autotune() default to device 0 only —
on a multi-GPU box with identical devices, tuning them all wastes time.
Override explicitly if you really do need to tune a second physical GPU
type:
$ SCAMP --autotune --gpus=0,1 # CLI
>>> pyscamp.autotune(devices=[0, 1]) # Python
Other autotune environment variables
A handful of additional env vars let you adjust SCAMP’s autotune and launch-time behavior without rebuilding. Each is read on first use and its value is cached for the lifetime of the process, so changing or re-exporting a variable after first use has no effect.
- SCAMP_AUTOTUNE_PRECISION_FILTER: SINGLE | DOUBLE | all (default). Restricts SCAMP --autotune (and pyscamp.autotune()) to one precision. Filtered targets are reported as SKIPPED and their cache entries are left untouched, so you can re-run for the other precision without losing the existing entries.
- SCAMP_AUTOTUNE_VARIANT_FILTER: shfl | sliding-window (also sw / smem) | all (default). Restricts the autotune sweep to one variant family. shfl matches every variant with unrolled_rows == 0 (the cov-shuffle kernel); the other names match the sliding-window kernel. Useful when iterating on a specific kernel family without sweeping the other.
- SCAMP_AUTOTUNE_WARMUP_RUNS: Per-trial warmup count for the autotuner’s bench function. Default 0: the first launch of a given (variant, blocksz) instantiation is typically only a few percent slower than steady-state because most JIT / module-load cost is amortized by the process-level first launch, and the cross-target geomean ranking tolerates a few percent of noise. Set to 1 (or more) when trial timings look noisy or on a colder GPU/driver where the first launch of a never-before-seen template instantiation takes significantly longer than steady-state. Value is cached at first autotune call.
- SCAMP_FORCE_VARIANT: Index of a single GPU kernel variant to force for every (profile, precision) launch, bypassing the autotune cache and cold-start default. The precision still picks the cold-start blocksz; the full {64, 128, 256, 512} blocksz axis is NOT swept here. Used by CI to exercise each compiled variant against the correctness test suite without writing per-variant cache files. Valid indices are reported by SCAMP --list_variants. Out-of-range or malformed values silently fall through to the normal lookup path. Value is cached at first kernel launch.
The value is interpreted as truthy unless it’s exactly 0, false,
FALSE, or empty.
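The two behaviors described in this section, exact-match falsy values and read-once caching, can be illustrated together. This is a sketch under assumptions: the variable name SCAMP_EXAMPLE_FLAG is hypothetical, and treating an unset variable the same as an empty one is an assumption.

```python
import os
from functools import lru_cache

# Only these exact strings count as false; "False" or "no" would be truthy.
FALSY_VALUES = {"0", "false", "FALSE", ""}


def is_truthy(value):
    """Apply the exact-match rule; None (unset) is treated as empty here."""
    return (value or "") not in FALSY_VALUES


@lru_cache(maxsize=None)
def read_flag_once(name):
    """Read the variable on first use and cache the result for the process lifetime."""
    return is_truthy(os.environ.get(name))


os.environ["SCAMP_EXAMPLE_FLAG"] = "1"   # hypothetical variable name
first = read_flag_once("SCAMP_EXAMPLE_FLAG")
os.environ["SCAMP_EXAMPLE_FLAG"] = "0"   # too late: the first read is cached
second = read_flag_once("SCAMP_EXAMPLE_FLAG")
print(first, second)
```

The practical consequence is to set these variables before launching the process (or before the first pyscamp call), not partway through a session.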
Clearing or resetting the cache
The user cache is a plain-text file at the location described in Default cache location. To start fresh:
# Linux / macOS:
$ rm ~/.cache/scamp/autotune.txt
# Windows (PowerShell):
> Remove-Item "$env:LOCALAPPDATA\scamp\autotune.txt"
The next SCAMP run will fall through to the compile-time default and
emit a miss warning. Run --autotune again to regenerate the file.
If you suspect the user cache has a bad entry but don’t want to delete
the file (e.g. it has good entries for some devices), you can edit it
by hand — each line is one record, # starts a comment, and the
format is documented in the file’s own header.
To bypass the cache entirely without deleting it, point
SCAMP_AUTOTUNE_CACHE at an empty file:
$ touch /tmp/empty_cache.txt
$ SCAMP_AUTOTUNE_CACHE=/tmp/empty_cache.txt SCAMP ...
What happens to my cache when I upgrade SCAMP?
By default, an existing user cache survives the upgrade — the file
format is keyed by the variant geometry tuple
(blocks_per_sm, diags_per_thread, unrolled_rows, outer_unrolled_rows,
kernel_tile_iters), not by a position in some table, so cache entries
that still match a current kernel variant continue to hit.
The three things that can happen to an existing entry after an upgrade:
- The new SCAMP build still has your entry’s variant tuple. Lookup succeeds and you keep your tuned config. This is the common case when a release just adds new variants.
- The new build retired your entry’s variant. The runtime rejects the entry (it doesn’t match any current variant) and falls through to the compile-time default. Other entries in the same cache file are unaffected; only the one(s) naming the retired variant fall through. Run --autotune again to refresh the affected (device, profile, precision) tuples.
- The release bumped the cache file’s version header. This is reserved for hard-incompatibility changes (the file schema changed, or kernel semantics shifted enough that every tuned config is stale). The new SCAMP silently treats the file as empty, and --autotune is required to get back to a tuned state. SCAMP will not throw or refuse to run; it will just use the compile-time defaults until you re-tune.
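The three outcomes above reduce to two checks: does the file version match, and does each entry's geometry tuple still name a compiled variant. A minimal sketch, assuming entries are dicts with a "variant" tuple and that versions are plain integers (both are illustrative choices, not the real file schema):

```python
from typing import List, Set, Tuple

# (blocks_per_sm, diags_per_thread, unrolled_rows, outer_unrolled_rows, kernel_tile_iters)
Variant = Tuple[int, int, int, int, int]


def entries_still_usable(
    entries: List[dict],
    current_variants: Set[Variant],
    file_version: int,
    supported_version: int,
) -> List[dict]:
    """Return the cache entries that would still hit after an upgrade."""
    # A version bump means the whole file is treated as empty.
    if file_version != supported_version:
        return []
    # Otherwise each entry is judged independently by its geometry tuple.
    return [e for e in entries if e["variant"] in current_variants]
```

Because matching is by tuple rather than table position, adding new variants in a release never invalidates existing entries.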
You don’t need to delete your cache after a SCAMP upgrade. Run
--autotune again if a release note tells you to; otherwise, your
existing entries keep being used wherever they remain applicable.
Troubleshooting
- “My configs aren’t being respected on a multi-GPU box.” The cache is keyed by sanitized device name + compute capability (e.g. NVIDIA_GeForce_RTX_3080__sm_86). If you have two different GPU models, you need entries for both: autotune device 0 first, then re-run with --gpus=1 (or devices=[1]) for the second.
- “I want to test a specific variant by hand.” Edit the user cache directly (~/.cache/scamp/autotune.txt on Linux/macOS, %LOCALAPPDATA%\scamp\autotune.txt on Windows): each line is device_key|profile|precision|blocksz|bps|dpt|ur|our|kti. However, you must specify a valid variant defined in src/core/gpu_kernel/CMakeLists.txt; the variants have to be specified at BUILD time for them to be included in the SCAMP binary. Lines that name an unknown variant are silently rejected at lookup time and the next source is consulted, so it’s safe to experiment.
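Before hand-editing, it can help to parse a line and confirm it has the expected nine fields. The sketch below follows the pipe-separated layout quoted above; the parser itself is hypothetical, and it assumes the six trailing fields are integers and that `#` may also start a trailing comment. Check the file's own header for the authoritative format.

```python
from typing import NamedTuple, Optional


class CacheEntry(NamedTuple):
    device_key: str
    profile: str
    precision: str
    blocksz: int
    bps: int
    dpt: int
    ur: int
    our: int
    kti: int


def parse_cache_line(line: str) -> Optional[CacheEntry]:
    """Parse one record; return None for comments, blanks, and malformed lines."""
    text = line.split("#", 1)[0].strip()  # drop comments (assumed to run to end of line)
    if not text:
        return None
    fields = text.split("|")
    if len(fields) != 9:
        return None
    try:
        numbers = [int(f) for f in fields[3:]]
    except ValueError:
        return None
    return CacheEntry(fields[0], fields[1], fields[2], *numbers)
```

For example, `parse_cache_line("NVIDIA_GeForce_RTX_3080__sm_86|SUM_THRESH|DOUBLE|256|2|4|0|0|1")` yields an entry with `blocksz == 256`; the numeric values in that line are made up for illustration and may not name a compiled variant.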