GPU Autotuning
==============

SCAMP's GPU path picks one of several pre-built kernel variants (different
combinations of block size, diagonals-per-thread, tile height, etc.) for
each (profile type, precision) tuple at launch time. Different GPUs prefer
different variants — what wins on Ampere isn't necessarily what wins on
Pascal — so SCAMP carries an *autotune cache* that maps each device to
its preferred variant. This page covers what the cache is, where it
lives, and how to use it.

TL;DR
-----

* SCAMP runs on any supported GPU out of the box using a safe,
  per-profile compile-time default. Running the autotuner once finds the
  best variant for your specific GPU; the result is cached per user and
  reused automatically by every subsequent SCAMP / pyscamp call.

  .. code-block:: console

     # CLI:
     $ SCAMP --autotune

     # Python:
     >>> import pyscamp
     >>> pyscamp.autotune()

  With the default workload size this takes roughly 10 to 25 minutes on a
  recent GPU, and the choices are persisted to disk. The
  default location is ``~/.cache/scamp/autotune.txt`` on Linux/macOS and
  ``%LOCALAPPDATA%\scamp\autotune.txt`` on Windows. See
  :ref:`autotune-default-path` for the full resolution rules.

How lookups work
----------------

When SCAMP launches a GPU kernel, it asks the autotuner for the best
config for the current ``(device, profile_type, precision)`` tuple.
The lookup tries these sources in order; the first hit wins:

1. **Process-wide override.** Used internally by the autotune benchmark
   loop to force a specific variant per timed trial, and by CI via the
   ``SCAMP_FORCE_VARIANT`` env var to exercise individual variants.

2. **User cache.** By default ``~/.cache/scamp/autotune.txt`` on
   Linux/macOS or ``%LOCALAPPDATA%\scamp\autotune.txt`` on Windows (see
   :ref:`autotune-default-path` for the full resolution rules).
   Written by ``SCAMP --autotune`` / ``pyscamp.autotune()``. If you've
   tuned for your GPU, this is what gets used.

3. **Compile-time default.** A safe per-profile-type variant. Works on
   every supported device with sensible out-of-the-box performance, but
   the variant chosen by running ``--autotune`` once for your GPU is
   typically faster still.
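
The lookup order above can be sketched in Python as a first-hit-wins
chain. This is only an illustration of the documented order, not
SCAMP's actual implementation: ``lookup_config``,
``COMPILE_TIME_DEFAULTS``, and the string variant names are hypothetical
and exist only for this sketch.

.. code-block:: python

   from typing import Optional

   # Hypothetical stand-in for the per-profile-type compile-time defaults.
   COMPILE_TIME_DEFAULTS = {"SUM_THRESH": "default-variant"}


   def lookup_config(key: tuple, override: Optional[str], user_cache: dict) -> str:
       """Return the first hit among override, user cache, and default."""
       profile_type = key[1]  # key is (device, profile_type, precision)
       if override is not None:      # 1. process-wide override
           return override
       if key in user_cache:         # 2. user cache written by --autotune
           return user_cache[key]
       return COMPILE_TIME_DEFAULTS[profile_type]  # 3. compile-time default


   key = ("NVIDIA_GeForce_RTX_3080__sm_86", "SUM_THRESH", "DOUBLE")
   tuned = {key: "tuned-variant"}
   print(lookup_config(key, None, tuned))              # tuned-variant
   print(lookup_config(key, None, {}))                 # default-variant
   print(lookup_config(key, "forced-variant", tuned))  # forced-variant

An untuned device simply lands on the third source, which is why SCAMP
still runs before the autotuner has ever been invoked.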

.. _autotune-default-path:

Default cache location
----------------------

``--autotune`` writes to (and ``GetKernelConfigForDevice`` reads from)
the first of these paths that resolves:

1. ``$SCAMP_AUTOTUNE_CACHE`` — when set, used verbatim (Linux, macOS,
   and Windows).
2. ``$XDG_CACHE_HOME/scamp/autotune.txt`` — when ``XDG_CACHE_HOME`` is
   set (Linux default for users following the XDG Base Directory spec;
   honored on Windows too if explicitly set).
3. Platform-specific user dir:

   * Linux / macOS: ``$HOME/.cache/scamp/autotune.txt``.
   * Windows: ``%LOCALAPPDATA%\scamp\autotune.txt`` (typically
     ``C:\Users\<you>\AppData\Local\scamp\autotune.txt``), falling back
     to ``%USERPROFILE%\.cache\scamp\autotune.txt`` if ``LOCALAPPDATA``
     is unset.

The parent directory is created automatically by ``Save()`` if it
doesn't exist, so you don't need to ``mkdir -p`` it yourself.
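
To check which path a given environment would resolve to, the rules
above can be mirrored in a small Python helper. This is a sketch of the
documented order only; ``resolve_cache_path`` is a hypothetical name,
and two details are assumptions rather than documented behavior:
treating an empty variable as unset, and returning ``None`` when nothing
resolves.

.. code-block:: python

   import ntpath
   import posixpath
   from typing import Mapping, Optional


   def resolve_cache_path(env: Mapping[str, str], windows: bool) -> Optional[str]:
       """Apply the documented resolution order to an environment mapping."""
       join = ntpath.join if windows else posixpath.join
       if env.get("SCAMP_AUTOTUNE_CACHE"):   # 1. explicit path, used verbatim
           return env["SCAMP_AUTOTUNE_CACHE"]
       if env.get("XDG_CACHE_HOME"):         # 2. XDG cache directory
           return join(env["XDG_CACHE_HOME"], "scamp", "autotune.txt")
       if windows:                           # 3. platform-specific user dir
           if env.get("LOCALAPPDATA"):
               return join(env["LOCALAPPDATA"], "scamp", "autotune.txt")
           if env.get("USERPROFILE"):
               return join(env["USERPROFILE"], ".cache", "scamp", "autotune.txt")
           return None  # assumption: nothing resolved
       if env.get("HOME"):
           return join(env["HOME"], ".cache", "scamp", "autotune.txt")
       return None  # assumption: nothing resolved


   print(resolve_cache_path({"HOME": "/home/me"}, windows=False))
   print(resolve_cache_path({"LOCALAPPDATA": r"C:\Users\me\AppData\Local"}, windows=True))

Passing ``dict(os.environ)`` as ``env`` applies the same rules to the
current process environment.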

Running the autotuner
---------------------

``SCAMP --autotune`` (or ``pyscamp.autotune()``) sweeps every enabled
variant × every supported block size for every ``(profile_type,
precision)`` pair and persists the per-tuple winner to the user cache.
A full sweep is the current variant count × 4 block sizes × 10 targets;
with the 5 variants enabled today that's 200 benchmark trials. With
the default benchmark workload (256K-element synthetic self-join) the
sweep takes roughly 10-25 minutes on a recent GPU (about 25 minutes on
an RTX 3080); the output is verbose by default so you can see progress.
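
The trial count is plain multiplication, shown here as a quick Python
check. The block sizes are the ``{64, 128, 256, 512}`` axis mentioned
elsewhere on this page; ``sweep_trials`` is a hypothetical helper, not
part of SCAMP or pyscamp.

.. code-block:: python

   BLOCK_SIZES = (64, 128, 256, 512)  # the four block sizes swept
   NUM_TARGETS = 10                   # (profile_type, precision) pairs


   def sweep_trials(num_variants: int) -> int:
       """Number of benchmark trials in one full autotune sweep."""
       return num_variants * len(BLOCK_SIZES) * NUM_TARGETS


   print(sweep_trials(5))  # 200, the five variants enabled today
   print(sweep_trials(6))  # 240, if one more variant were enabled

Each additional enabled variant therefore adds 40 trials to a full
sweep.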

Choosing the benchmark workload size
""""""""""""""""""""""""""""""""""""

The synthetic workload used per trial is sized via
``SCAMP_AUTOTUNE_INPUT_LENGTH`` (default 262144 = 256K elements).
Work scales like *n²*, so doubling the size roughly quadruples the
sweep's wall-clock cost, but the per-variant ranking gets tighter as
*n* grows. At small *n* the FFT/stats prelude dominates and trial
timings collapse toward the noise floor; at production sizes the
kernel work swamps the prelude, so the ranking reflects the kernel
performance you actually care about.
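
As a rough planning aid, the quadratic scaling can be written down
directly. This is a back-of-the-envelope sketch under the *n²*
assumption only; ``relative_kernel_work`` is a hypothetical helper. It
ignores fixed per-trial costs such as the prelude, so measured sweep
times at small *n* shrink less than it suggests.

.. code-block:: python

   def relative_kernel_work(n: int, baseline: int = 262144) -> float:
       """Kernel work at input length n relative to the baseline, assuming n**2."""
       return (n / baseline) ** 2


   for n in (65536, 131072, 262144, 524288):
       print(n, relative_kernel_work(n))
   # 65536 0.0625
   # 131072 0.25
   # 262144 1.0
   # 524288 4.0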

Empirical comparison on an RTX 3080 across the standard 10
(profile, precision) targets:

============== ============== ============== ============== ===============
Input length   Sweep wall     Cross-target   Worst-case     Per-target
               time           geomean        ratio          winners
============== ============== ============== ============== ===============
65536 (64K)    ~4 min         1.325          2.25           shift vs 128K
131072 (128K)  ~8 min         1.308          3.47           shift vs 256K
262144 (256K)  ~25 min        1.278          2.77           default
============== ============== ============== ============== ===============

The geomean ratio above is the cross-target "best recommended default"
score (lower is better — 1.000 would mean a single variant tied with
every per-target winner). The 256K row is meaningfully tighter than
128K (worst-case ratio drops 3.47 → 2.77, a ~20% reduction), and
importantly, the per-target *winners* themselves shift between the
rows (e.g. SUM_THRESH/DOUBLE picks different variants at 64K vs 128K
vs 256K), so a smaller-N autotune doesn't just mis-rank the
cross-target default — it picks suboptimal entries for individual
cache rows.
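
One plausible reading of the two scores, sketched in Python: for a
candidate default variant, take its time divided by the winning time on
each target, then report the geometric mean and the maximum of those
ratios. The timings below are invented for illustration, and
``cross_target_scores`` is a hypothetical helper rather than the
autotuner's own code.

.. code-block:: python

   import math


   def cross_target_scores(times: dict, variant: str) -> tuple:
       """Geomean and worst-case slowdown of `variant` vs. each target's winner."""
       ratios = [t[variant] / min(t.values()) for t in times.values()]
       geomean = math.exp(sum(math.log(r) for r in ratios) / len(ratios))
       return geomean, max(ratios)


   # Hypothetical trial timings in seconds: target -> {variant name: time}.
   times = {
       ("SUM_THRESH", "DOUBLE"): {"A": 1.0, "B": 2.0},
       ("SUM_THRESH", "SINGLE"): {"A": 2.0, "B": 1.0},
   }
   geo, worst = cross_target_scores(times, "A")
   print(round(geo, 3), worst)  # 1.414 2.0

A variant that wins on every target has a ratio of 1 everywhere, so both
scores come out as exactly 1.0.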

Run with a smaller value if 256K is impractically slow on your GPU
(older Pascal or T-class cards can take well over an hour at 256K),
or with a larger value when you want tighter rankings for a workload
you know runs at large input sizes:

.. code-block:: console

   $ SCAMP_AUTOTUNE_INPUT_LENGTH=131072 SCAMP --autotune  # fast/casual
   $ SCAMP_AUTOTUNE_INPUT_LENGTH=524288 SCAMP --autotune  # tighter still

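The trade-off is wall-clock time: because work scales like *n²*, each
doubling of the input length costs up to about 4x (the table above
shows ~8 min at 128K versus ~25 min at 256K).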

Tune at a size that is representative of your workload. If you only run
SCAMP on smaller inputs, a smaller tuning size is the more relevant one
and there is no need to tune at a larger size.

Choosing the device(s)
""""""""""""""""""""""

Both the CLI and ``pyscamp.autotune()`` default to **device 0 only** —
on a multi-GPU box with identical devices, tuning them all wastes time.
Override explicitly if you really do need to tune a second physical GPU
type:

.. code-block:: console

   $ SCAMP --autotune --gpus=0,1            # CLI
   >>> pyscamp.autotune(devices=[0, 1])     # Python

Other autotune environment variables
------------------------------------

A handful of additional env vars let you adjust SCAMP's autotune and
launch-time behavior without rebuilding. Each is read on first use and
its value is cached for the lifetime of the process, so changing a
variable after first use has no effect.

``SCAMP_AUTOTUNE_PRECISION_FILTER``
    ``SINGLE`` | ``DOUBLE`` | ``all`` (default). Restricts ``SCAMP
    --autotune`` (and ``pyscamp.autotune()``) to one precision.
    Filtered targets are reported as ``SKIPPED`` and their cache entries
    are left untouched, so you can re-run for the other precision
    without losing the existing entries.

``SCAMP_AUTOTUNE_VARIANT_FILTER``
    ``shfl`` | ``sliding-window`` (also ``sw`` / ``smem``) | ``all``
    (default). Restricts the autotune sweep to one variant family.
    ``shfl`` matches every variant with ``unrolled_rows == 0`` (the
    cov-shuffle kernel); the other names match the sliding-window
    kernel. Useful when iterating on a specific kernel family without
    sweeping the other.

``SCAMP_AUTOTUNE_WARMUP_RUNS``
    Per-trial warmup count for the autotuner's bench function. Default
    ``0``: the first launch of a given ``(variant, blocksz)``
    instantiation is typically only a few percent slower than
    steady-state because most JIT / module-load cost is amortized by
    the process-level first launch, and the cross-target geomean
    ranking tolerates a few percent of noise. Set to ``1`` (or more)
    when trial timings look noisy or on a colder GPU/driver where the
    first launch of a never-before-seen template instantiation takes
    significantly longer than steady-state. Value is cached at first
    autotune call.

``SCAMP_FORCE_VARIANT``
    Index of a single GPU kernel variant to force for every
    ``(profile, precision)`` launch, bypassing the autotune cache and
    cold-start default. The precision still picks the cold-start
    blocksz; the full ``{64, 128, 256, 512}`` blocksz axis is NOT
    swept here. Used by CI to exercise each compiled variant against
    the correctness test suite without writing per-variant cache
    files. Valid indices are reported by ``SCAMP --list_variants``.
    Out-of-range or malformed values silently fall through to the
    normal lookup path. Value is cached at first kernel launch.
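
The fall-through behavior of ``SCAMP_FORCE_VARIANT`` can be pictured
with a short Python sketch. It is illustrative only:
``parse_force_variant`` is a hypothetical name, zero-based indexing is an
assumption, and Python's ``int()`` may accept or reject slightly
different spellings than SCAMP's own parser.

.. code-block:: python

   from typing import Optional


   def parse_force_variant(raw: Optional[str], num_variants: int) -> Optional[int]:
       """Return a forced variant index, or None to use the normal lookup path."""
       if raw is None:
           return None
       try:
           index = int(raw)
       except ValueError:
           return None  # malformed value: fall through silently
       if not 0 <= index < num_variants:
           return None  # out-of-range index: fall through silently
       return index


   print(parse_force_variant("3", 5))    # 3
   print(parse_force_variant("9", 5))    # None
   print(parse_force_variant("abc", 5))  # None

Because a bad value is ignored rather than reported, it is worth
checking the index against ``SCAMP --list_variants`` before relying on
it.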


Clearing or resetting the cache
-------------------------------

The user cache is a plain-text file at the location described in
:ref:`autotune-default-path`. To start fresh:

.. code-block:: console

   # Linux / macOS:
   $ rm ~/.cache/scamp/autotune.txt

   # Windows (PowerShell):
   > Remove-Item "$env:LOCALAPPDATA\scamp\autotune.txt"

The next SCAMP run will fall through to the compile-time default and
emit a miss warning. Run ``--autotune`` again to regenerate the file.

If you suspect the user cache has a bad entry but don't want to delete
the file (e.g. it has good entries for *some* devices), you can edit it
by hand — each line is one record, ``#`` starts a comment, and the
format is documented in the file's own header.

To bypass the cache entirely without deleting it, point
``SCAMP_AUTOTUNE_CACHE`` at an empty file:

.. code-block:: console

   $ touch /tmp/empty_cache.txt
   $ SCAMP_AUTOTUNE_CACHE=/tmp/empty_cache.txt SCAMP ...

What happens to my cache when I upgrade SCAMP?
----------------------------------------------

By default, an existing user cache survives the upgrade — the file
format is keyed by the variant geometry tuple
``(blocks_per_sm, diags_per_thread, unrolled_rows, outer_unrolled_rows,
kernel_tile_iters)``, not by a position in some table, so cache entries
that still match a current kernel variant continue to hit.

The three things that can happen to an existing entry after an upgrade:

* **The new SCAMP build still has your entry's variant tuple.** Lookup
  succeeds and you keep your tuned config. This is the common case
  when a release just adds new variants.
* **The new build retired your entry's variant.** The runtime rejects
  the entry (it doesn't match any current variant) and falls through to
  the compile-time default. Other entries in the same cache file are
  unaffected; only the one(s) naming the retired variant fall through.
  Run ``--autotune`` again to refresh the affected ``(device, profile,
  precision)`` tuples.
* **The release bumped the cache file's version header.** This is
  reserved for hard-incompatibility changes (the file schema changed,
  or kernel semantics shifted enough that *every* tuned config is
  stale). The new SCAMP silently treats the file as empty —
  ``--autotune`` is required to get back to a tuned state. SCAMP will
  not throw or refuse to run; it will just use the compile-time
  defaults until you re-tune.

You don't need to delete your cache after a SCAMP upgrade. Run
``--autotune`` again if a release note tells you to; otherwise, your
existing entries keep being used wherever they remain applicable.
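
The three outcomes can be summarized as a small decision function. This
is a sketch of the behavior described above, not SCAMP's code:
``classify_entry``, the integer version numbers, and the variant tuples
are all hypothetical values chosen for illustration.

.. code-block:: python

   def classify_entry(entry_variant, file_version, current_version, current_variants):
       """Describe what happens to one cache entry after an upgrade."""
       if file_version != current_version:
           return "file treated as empty"   # version header was bumped
       if entry_variant in current_variants:
           return "hit"                     # tuple still matches a variant
       return "entry rejected"              # variant retired; use the default


   # Tuples are (blocks_per_sm, diags_per_thread, unrolled_rows,
   # outer_unrolled_rows, kernel_tile_iters); the values are made up.
   current = {(2, 4, 0, 0, 1), (2, 4, 4, 2, 1)}
   print(classify_entry((2, 4, 0, 0, 1), 1, 1, current))  # hit
   print(classify_entry((1, 2, 0, 0, 1), 1, 1, current))  # entry rejected
   print(classify_entry((2, 4, 0, 0, 1), 1, 2, current))  # file treated as empty

In the second case only that one entry falls back to the compile-time
default; other entries in the same file are evaluated independently.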

Troubleshooting
---------------

**"My configs aren't being respected on a multi-GPU box."**
   The cache is keyed by sanitized device name + compute capability
   (e.g. ``NVIDIA_GeForce_RTX_3080__sm_86``). If you have two different
   GPU models, you need entries for both — autotune device 0 first,
   then re-run with ``--gpus=1`` (or ``devices=[1]``) for the second.

**"I want to test a specific variant by hand."**
   Edit the user cache directly (``~/.cache/scamp/autotune.txt`` on
   Linux/macOS, ``%LOCALAPPDATA%\scamp\autotune.txt`` on Windows): each
   line is ``device_key|profile|precision|blocksz|bps|dpt|ur|our|kti``.
   The variant you name must be one defined in
   ``src/core/gpu_kernel/CMakeLists.txt``: variants are compiled in at
   build time, so only those are present in the SCAMP binary.
   Lines that name an unknown variant are silently rejected at lookup
   time and the next source is consulted, so it's safe to experiment.
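
Before hand-editing, it can help to sanity-check a line against the
nine-field layout. The parser below is a minimal sketch, not SCAMP's
reader: ``parse_cache_line`` and ``CacheRecord`` are hypothetical names,
the six trailing fields are assumed to be integers, and support for
trailing ``#`` comments is an assumption. The sample values are made up.

.. code-block:: python

   from typing import NamedTuple, Optional


   class CacheRecord(NamedTuple):
       device_key: str
       profile: str
       precision: str
       blocksz: int
       variant: tuple  # (bps, dpt, ur, our, kti)


   def parse_cache_line(line: str) -> Optional[CacheRecord]:
       """Parse one record, or return None for blank, comment, or bad lines."""
       text = line.split("#", 1)[0].strip()  # drop comments and whitespace
       if not text:
           return None
       fields = text.split("|")
       if len(fields) != 9:
           return None
       try:
           numbers = [int(f) for f in fields[3:]]
       except ValueError:
           return None
       return CacheRecord(fields[0], fields[1], fields[2],
                          numbers[0], tuple(numbers[1:]))


   record = parse_cache_line(
       "NVIDIA_GeForce_RTX_3080__sm_86|SUM_THRESH|DOUBLE|256|2|4|0|0|1")
   print(record.blocksz, record.variant)  # 256 (2, 4, 0, 0, 1)

A line that parses here can still be rejected by SCAMP if its variant
tuple does not match a compiled-in variant.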