10. Autotuning and benchmarking
10.1 Autotuner (autotune.py)
For RPA-based decoders, there are several knobs that directly impact runtime and quality:
- Beam size for Dumer-list (list_size / “beam”).
- Number of RPA iterations.
- Parameters for local SNAP refinement.
Picking them manually for each system and workload is tedious. The autotune.py module automates this by running small synthetic experiments and caching the results.
Effort parameter model
The autotuner uses two small dataclasses:
EffortParams collects the knobs for a particular configuration:
- beam – list size for Dumer-list (and the RPA beam).
- chase_limit – reserved for potential chase-based variants.
- rpa_iters – number of RPA-1 iterations.
- snap_pool, snap_t, snap_strong – SNAP neighborhood size and whether to enable branch-and-bound “strong SNAP”.
MappingEntry records the measured performance for a specific bucket:
- params – the chosen EffortParams.
- median_ms, mean_ms – timing statistics over a small sample.
- trials – how many sample words were used.
These are stored as plain JSON via asdict() so the cache is human-readable.
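As a rough sketch, the two dataclasses and their JSON round-trip might look like the block below; the field names follow the description above, but the default values and exact types are assumptions.

```python
# Minimal sketch of the two dataclasses (field names from the text above;
# the default values and exact types here are assumptions).
from dataclasses import dataclass, asdict
import json


@dataclass
class EffortParams:
    beam: int = 8              # Dumer-list / RPA beam size
    chase_limit: int = 16      # reserved for chase-based variants
    rpa_iters: int = 1         # number of RPA-1 iterations
    snap_pool: int = 12        # SNAP neighborhood size
    snap_t: int = 2
    snap_strong: bool = False  # enable branch-and-bound "strong SNAP"


@dataclass
class MappingEntry:
    params: EffortParams
    median_ms: float
    mean_ms: float
    trials: int


entry = MappingEntry(EffortParams(), median_ms=3.2, mean_ms=3.5, trials=4)
print(json.dumps(asdict(entry), indent=2))  # human-readable cache payload
```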
Cache format and bucketing
The cache lives in a single JSON file:
- The path is RM_AUTOTUNE_CACHE (if set) or ~/.rmsynth_autotune.json.
- _load_cache() and _save_cache() read and write that file; errors are swallowed so autotuning never crashes the main program.
Each calibration result is keyed by a string of the form:
n{n}/pre{lo}-{hi}/sel:{selector}/pd:{pareto_dist_slack}/pl:{pareto_latency_factor}
where:
- n is the number of qubits.
- pre{lo}-{hi} is a bucket for the pre-optimization T-count (odd weight), grouped in bins of 8 and capped at \(L = 2^n - 1\).
- selector is the selection strategy (quality-under-target or pareto).
- pd and pl encode the Pareto parameters so different tradeoff policies don’t collide.
This design ensures that if you ask for the same problem size, pre‑T range, and policy, you get consistent EffortParams across runs.
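For illustration, a hypothetical helper that assembles such a key from the quantities above could look as follows; the bin width of 8 and the cap at \(L\) follow the description, while the helper name itself is not from the source.

```python
# Hypothetical helper reproducing the cache-key format described above.
def bucket_key(n: int, pre_t: int, selector: str,
               pareto_dist_slack: float, pareto_latency_factor: float) -> str:
    L = 2 ** n - 1
    pre_t = min(pre_t, L)       # cap the pre-T count at L = 2^n - 1
    lo = (pre_t // 8) * 8       # bucket into bins of 8
    hi = min(lo + 7, L)
    return (f"n{n}/pre{lo}-{hi}/sel:{selector}"
            f"/pd:{pareto_dist_slack}/pl:{pareto_latency_factor}")


print(bucket_key(7, 19, "quality-under-target", 1, 1.10))
# n7/pre16-23/sel:quality-under-target/pd:1/pl:1.1
```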
Candidate grid and measurement
The core of autotuning builds a fixed “grid” of candidate EffortParams:
_candidate_grid(target_ms) enumerates beams in {4, 8, 16, 32}, RPA iterations in {1, 2, 3}, and SNAP pool sizes in {8, 12, 16, 24}, always with chase_limit=16 and snap_t=2.
The grid is sorted differently depending on target_ms so that very tight latency budgets prioritize small beams and few iterations first.
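A minimal sketch of that enumeration is shown below, using the values quoted above; the concrete sort rule and the threshold for what counts as a “tight” budget are assumptions.

```python
# Sketch of a candidate grid built from the values quoted above. The sort
# rule for tight latency budgets (small beams, few iterations first) and the
# threshold are assumptions.
from itertools import product
from typing import Optional


def candidate_grid(target_ms: Optional[float]):
    grid = [
        dict(beam=b, chase_limit=16, rpa_iters=it, snap_pool=p, snap_t=2)
        for b, it, p in product((4, 8, 16, 32), (1, 2, 3), (8, 12, 16, 24))
    ]
    if target_ms is not None and target_ms < 5.0:  # illustrative threshold
        grid.sort(key=lambda e: (e["beam"], e["rpa_iters"], e["snap_pool"]))
    return grid


print(len(candidate_grid(2.0)))  # 4 * 3 * 4 = 48 candidates
```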
Synthetic test words are created by _make_sample_words(n, flips, L, trials, seed):
- For each trial, it chooses flips random positions in [0..L-1] and sets those bits to 1, all others to 0.
- This models oddness vectors with a given sparsity (see the sketch below).
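A sketch of this generator, assuming it returns plain bit lists:

```python
# Sketch of the synthetic-word generator: `flips` random positions set to 1
# in a length-L bit vector, one vector per trial.
import random


def make_sample_words(n, flips, L, trials, seed):
    # `n` is kept only to mirror the documented signature.
    rng = random.Random(seed)
    words = []
    for _ in range(trials):
        bits = [0] * L
        for pos in rng.sample(range(L), min(flips, L)):
            bits[pos] = 1
        words.append(bits)
    return words


words = make_sample_words(n=6, flips=5, L=2 ** 6 - 1, trials=4, seed=123)
```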
_measure_config(...) runs a full decode loop for each synthetic word with a particular EffortParams:
- It uses decode_rm(..., "rpa-adv", list_size=beam, rpa_iters=rpa_iters, snap=..., ...).
- The first call is a warm-up (not timed) to amortize import and JIT overhead.
- It then records the per-trial latency (in milliseconds) and the resulting Hamming distances, and returns (median_ms, mean_ms, mean_dist), as in the timing sketch below.
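The measurement itself is the usual warm-up-then-time pattern; in the sketch below, decode_fn stands in for the decode_rm(..., "rpa-adv", ...) call, since its full argument list is not reproduced here.

```python
# Warm-up-then-measure sketch. `decode_fn` stands in for the strategy-specific
# decode_rm(..., "rpa-adv", ...) call, whose full argument list is not shown.
import statistics
import time


def measure(decode_fn, words):
    decode_fn(words[0])                       # warm-up call, not timed
    times_ms, dists = [], []
    for w in words:
        t0 = time.perf_counter()
        cw = decode_fn(w)
        times_ms.append((time.perf_counter() - t0) * 1e3)
        dists.append(sum(a != b for a, b in zip(w, cw)))  # Hamming distance
    return (statistics.median(times_ms),
            statistics.fmean(times_ms),
            statistics.fmean(dists))
```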
Selection strategies
calibrate_single(...) ties everything together for one bucket:
- It checks that rmcore is available; otherwise it returns a dummy entry marking the configuration as unusable (very large median, zero trials).
- It reads the selector and Pareto parameters from arguments or the environment:
  - RM_AUTOTUNE_SELECTOR (default "quality-under-target").
  - RM_AUTOTUNE_PARETO_DIST (default 1).
  - RM_AUTOTUNE_PARETO_LAT (default 1.10).
Two selection modes are supported:
"quality-under-target" (default)
For all configs with median_ms <= target_ms, it picks the one with smallest mean_dist, breaking ties by (median_ms, mean_ms). If no config meets the target, it simply picks the fastest median.
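A sketch of this rule, assuming each measured result is a (params, median_ms, mean_ms, mean_dist) tuple:

```python
# Sketch of "quality-under-target": best mean distance among configs that meet
# the latency target, ties broken by (median_ms, mean_ms). Results are assumed
# to be (params, median_ms, mean_ms, mean_dist) tuples.
def select_quality_under_target(results, target_ms):
    within = [r for r in results if r[1] <= target_ms]
    if not within:
        return min(results, key=lambda r: r[1])  # nothing meets the target: fastest wins
    return min(within, key=lambda r: (r[3], r[1], r[2]))
```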
"pareto"
Let f be the fastest median across all configs. The tuner defines a latency bound lat_ref = min(target_ms, f * pareto_latency_factor). It then:
- Keeps only configs with median_ms <= lat_ref.
- Finds the best mean_dist in this pool.
- Keeps configurations within pareto_dist_slack of this best distance.
- Among those, picks the one with the smallest median.
This gives a simple latency–quality Pareto selection while avoiding extreme outliers.
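On the same assumed tuples, the Pareto rule might be sketched as follows; the fallback for an unreachable latency target is an assumption.

```python
# Sketch of the "pareto" rule on the same assumed tuples; the fallback for an
# unreachable latency target is an assumption.
def select_pareto(results, target_ms, dist_slack=1, latency_factor=1.10):
    fastest = min(r[1] for r in results)
    lat_ref = min(target_ms, fastest * latency_factor)
    pool = [r for r in results if r[1] <= lat_ref]
    if not pool:
        pool = [min(results, key=lambda r: r[1])]   # assumed fallback
    best_dist = min(r[3] for r in pool)
    close = [r for r in pool if r[3] <= best_dist + dist_slack]
    return min(close, key=lambda r: r[1])           # smallest median among the near-best
```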
Finally, calibrate_single stores the chosen configuration into the cache with all relevant metadata and returns a MappingEntry.
suggest_params(...) is the user-facing entry point:
It computes the bucket key, looks up the cache, and if present, reconstructs an EffortParams object. On cache miss, it calls calibrate_single using environment defaults for number of trials (RM_AUTOTUNE_TRIALS, default 4) and RNG seed (RM_AUTOTUNE_SEED, default 123), then returns the calibrated parameters.
The Optimizer uses this through its “auto-latency” mode.
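A hypothetical call site could look like the following; the import path and keyword names are assumptions rather than the documented signature.

```python
# Hypothetical call site; the import path and keyword names are assumptions,
# not the documented signature of suggest_params.
from autotune import suggest_params

params = suggest_params(n=7, pre_t=19, target_ms=5.0)  # cache hit or fresh calibration
print(params.beam, params.rpa_iters, params.snap_pool, params.snap_strong)
```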
10.2 Benchmarks (bench.py)
The bench.py module provides a small synthetic benchmarking harness that can be used from Python (or tests) without installing the CLI.
Instance generators
All instance generators produce a length \(L = 2^n - 1\) vector over \(\mathbb{Z}_8\):
- gen_near_constant_one(n, flips, rng, randomize_positions=True): starts from the all-ones oddness vector (every coefficient \(\equiv 1 \pmod 2\)) and sets flips random positions to 0, modeling “near-identity” circuits. It then encodes this oddness into a \(\mathbb{Z}_8\) vector where only the least significant bit is used.
- gen_xor_two_triples(n, rng): requires \(n \ge 6\). It constructs an oddness pattern equal to the XOR of two degree-3 monomials, \((x_0 \wedge x_1 \wedge x_2) \oplus (x_3 \wedge x_4 \wedge x_5)\); the even part is zero. This is useful for probing behavior on low-order structured polynomials.
- gen_random_sparse(n, density, rng): draws odd bits i.i.d. Bernoulli(density), with even bits zero (a sketch follows after this list).
- gen_random_z8(n, density_odd, rng): draws each coefficient as an odd bit Bernoulli(density_odd) plus an even part chosen uniformly from \(\{0, 2, 4, 6\}\). This is closer to a generic phase-polynomial workload.
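As a concrete example, gen_random_sparse might be sketched roughly as below; only the odd (least significant) bit is populated, per the description above.

```python
# Rough sketch of gen_random_sparse: i.i.d. Bernoulli(density) odd bits, even
# parts zero, encoded as a length-(2^n - 1) vector over Z8 (only the least
# significant bit is used).
import random


def gen_random_sparse(n, density, rng):
    L = 2 ** n - 1
    return [1 if rng.random() < density else 0 for _ in range(L)]


vec = gen_random_sparse(6, density=0.1, rng=random.Random(0))
```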
The dispatcher _gen_instance(cfg, rng) chooses among these based on BenchConfig.gen.
Benchmark configuration and result types
BenchConfig is a dataclass describing one benchmark run:
- Problem parameters: n, trials.
- Instance generator and its parameters: gen, flips, density.
- Strategies to test: strategies (default: a mix of Dumer, Dumer-list, Chase, RPA beam, and RPA-adv).
- Mode: "decode" (default) or "pipeline".
- Effort and tuning: effort, list_size, chase_t, chase_limit, rpa_iters, snap_t, snap_pool.
- RNG seed.
TrialResult records, for each strategy and each instance:
- strategy – name used in decode_rm or Optimizer.
- dist – distance (decode mode) or after-T (pipeline mode).
- before_t, after_t – T-count-like quantities before and after the step.
- time_ms – runtime in milliseconds.
Aggregate holds per-strategy summary statistics:
- Mean, median, stdev, min, max of dist.
- Mean after_t.
- Mean latency in milliseconds.
Modes: decode vs pipeline
run_benchmark(cfg) orchestrates everything:
It first checks that rmcore is available if cfg.mode != "ml-exact". For \(n \ge 6\), a pure brute‑force ML decoder would be unacceptably slow, so the benchmark refuses to run without the C++ extension.
For each trial:
- It samples a vec according to cfg.gen.
- It sets r = n - 4, consistent with the main optimizer.
- It loops over cfg.strategies.
In "decode" mode:
It calls _decode_once(vec, n, r, strategy, cfg), which:
- Extracts w_bits as the oddness pattern.
- Maps the strategy string to an appropriate decode_rm call.
- Measures latency and returns (dist, elapsed_ms).

It records before_t = number of odd entries in vec and after_t = dist, so the decoding distance is interpreted directly as the achieved T-count.
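The decode-mode bookkeeping boils down to something like the sketch below, where decode_and_distance is a placeholder for the strategy-specific decode_rm call.

```python
# Decode-mode bookkeeping sketch. `decode_and_distance` is a placeholder for
# the strategy-specific decode_rm call returning a Hamming distance.
import time


def decode_once(vec, decode_and_distance):
    w_bits = [c & 1 for c in vec]     # oddness pattern of the Z8 vector
    before_t = sum(w_bits)            # number of odd entries
    t0 = time.perf_counter()
    dist = decode_and_distance(w_bits)
    elapsed_ms = (time.perf_counter() - t0) * 1e3
    return dict(before_t=before_t, after_t=dist, dist=dist, time_ms=elapsed_ms)
```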
In "pipeline" mode:
It calls _pipeline_once(vec, n, strategy, cfg), which:
- Synthesizes a circuit from vec.
- Wraps it in an Optimizer configured with the given strategy.
- Runs opt.optimize(circ) and records before_t and after_t from the report.

It records those T-counts and the full pipeline latency.
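Pipeline mode is analogous, with the whole synthesize-and-optimize path inside the timed region; in the sketch below, synthesize_from_vec and make_optimizer are placeholders, not the real API.

```python
# Pipeline-mode sketch only: `synthesize_from_vec` and `make_optimizer` are
# placeholders for the real synthesis/Optimizer API, which is not shown here.
import time


def pipeline_once(vec, n, strategy, synthesize_from_vec, make_optimizer):
    t0 = time.perf_counter()
    circ = synthesize_from_vec(vec, n)   # placeholder synthesis step
    opt = make_optimizer(strategy)       # placeholder Optimizer factory
    report = opt.optimize(circ)
    elapsed_ms = (time.perf_counter() - t0) * 1e3
    return dict(before_t=report.before_t, after_t=report.after_t,
                time_ms=elapsed_ms)
```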
At the end, run_benchmark:
- Groups TrialResult objects by strategy.
- Computes an Aggregate for each group.
- Sorts aggregates by (mean_dist, mean_time_ms) so “better and faster” strategies rise to the top.
This makes it easy to compare decoders both as raw decoders and as full T-count optimizers.
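A typical invocation from Python or a test could then look like the following; the generator and strategy spellings, and the assumption that run_benchmark returns the sorted aggregates, are not confirmed by the source.

```python
# Hedged usage sketch; the generator and strategy spellings, and the return
# value of run_benchmark (sorted aggregates), are assumptions.
from bench import BenchConfig, run_benchmark

cfg = BenchConfig(n=7, trials=20, gen="random_sparse", density=0.05,
                  strategies=["dumer-list", "rpa-adv"], mode="decode", seed=1)
for agg in run_benchmark(cfg):
    print(agg.strategy, agg.mean_dist, agg.mean_time_ms)
```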