ADR 005: Self-Benchmarking — Baselines and Regression Detection

Status: Proposed
Date: 2026-05-29

Context

ADR 002 forbids native code until a measurement of the user-visible path, against a named baseline, proves a limit Python cannot resolve. ADR 003 requires native load-generation work to benchmark the user-visible path — scheduling, metrics, reporting — not a function in isolation. The engineering policy already says performance claims must name a comparison baseline (trunk, tag, or release). None of that is enforceable unless rampa maintains a standing benchmark, a named baseline, and a way to notice when performance changes.

A load generator has a constraint most projects do not: wall-clock latency is simultaneously the product it sells and a quantity too noisy to decide a pull request on. A benchmark that fails a pull request because a shared runner was busy trains everyone to ignore it. Yet the per-request path is exactly where a silent regression would hide.

The resolution, drawn from how mature projects manage performance, is to keep two activities separate: detect regressions deterministically, without reading wall-clock time, and measure latency and throughput deliberately against a named baseline, away from the pull-request path.

Decision

rampa runs these as two separate activities and never lets one stand in for the other.

  • Deterministic regression detection runs on every pull request. It asserts counts (function calls, allocations, events, connections) against a checked-in baseline. Because it never reads wall-clock time, runner load cannot make it flake, so it can block a merge honestly.

  • Latency and throughput measurement runs deliberately — on a label or a schedule, on a controlled machine — not on every pull request. It compares against a named baseline and reports a geometric mean.

No native code is justified without a latency-or-throughput measurement of the user-visible path against a named baseline. This is the concrete form of ADR 002’s default rule and ADR 003’s benchmark policy.

Scope

This ADR governs how rampa benchmarks itself and notices performance regressions: the count-based regression checks, the latency/throughput measurement discipline, baseline naming and storage, and benchmark hygiene. It does not cover one-off profiling (ADR 006) or the test harness the benchmarks run on (ADR 004).

Requirements

1. Deterministic regression detection (every pull request)

The hot paths — request scheduling, per-sample metric ingestion, metric reduction — have assertions on counts, not time: function-call counts, allocation counts, and event or connection counts. Counts are deterministic, so a pull request can be blocked on them without flakiness. The baseline is checked into the repository and keyed by both environment and implementation path (whether the accelerator is present), so the pure-Python and native paths are checked separately, per ADR 001. A regenerate flag rewrites the baseline deliberately; the check fails when a count drifts beyond a documented tolerance.

SQLAlchemy is the reference: @profiling.function_call_count(variance=0.10) runs a function under cProfile, reads total_calls, and fails on drift from a per-environment baseline checked into the tree, so the git log of that baseline file doubles as the performance history. See lib/sqlalchemy/testing/profiling.py, the checked-in baseline test/profiles.txt, and test/aaa_profiling/. Django applies the same philosophy to database round-trips with assertNumQueries (django/test/testcases.py), the analog of rampa’s ADR 003 connection accounting.
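
The count check in this requirement could look roughly like the sketch below. It is a minimal illustration and not rampa's implementation. The helper names (count_calls, baseline_key, assert_call_count), the key format, the 5% default tolerance, and the in-memory dict standing in for the checked-in baseline file are all assumptions made for the example. Only cProfile.Profile.runcall and pstats.Stats.total_calls are real standard-library APIs.

```python
import cProfile
import pstats
import sys


def count_calls(fn):
    """Run fn under cProfile and return the total number of function calls."""
    profiler = cProfile.Profile()
    profiler.runcall(fn)
    return pstats.Stats(profiler).total_calls


def baseline_key(name, native_present):
    """Hypothetical key: check name, interpreter version, platform, implementation path."""
    env = f"py{sys.version_info.major}.{sys.version_info.minor}-{sys.platform}"
    path = "native" if native_present else "python"
    return f"{name}|{env}|{path}"


def assert_call_count(fn, name, baseline, *, native_present=False,
                      tolerance=0.05, regenerate=False):
    """Fail when the call count drifts beyond tolerance (a fraction) of the baseline."""
    key = baseline_key(name, native_present)
    observed = count_calls(fn)
    if regenerate:
        # Deliberate rewrite; the caller persists the mapping and the change is reviewed.
        baseline[key] = observed
        return observed
    if key not in baseline:
        raise AssertionError(f"no baseline for {key}; regenerate deliberately")
    expected = baseline[key]
    if abs(observed - expected) > expected * tolerance:
        raise AssertionError(
            f"{key}: {observed} calls, baseline {expected}, tolerance {tolerance:.0%}"
        )
    return observed
```

Because the key includes the interpreter version and the implementation path, a count recorded for the pure-Python path on one Python version is never compared against the native path or another interpreter. A missing key fails instead of silently recording a new baseline, which keeps regeneration a deliberate act.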

2. Latency and throughput measurement (named baseline, run deliberately)

Latency and throughput are measured against a named baseline: trunk or the merge-base for active development, a tag or release for release-facing claims. Results report a geometric mean with reproducibility controls (seeded randomness, pinned upstream state, a fixed machine). For micro-paths, prefer a deterministic measure such as instruction counts (cachegrind-style) over wall-clock; reserve wall-clock for a controlled machine. Include an end-to-end throughput measurement — rampa driving a local target — for the generator’s own ceiling. This runs on a label or a schedule, never on every pull request.

CodSpeed is the de-facto standard for hybrid Python/Rust projects: instruction-count measurement that is deterministic in CI, comparing each pull request against its base. ruff keeps one bench source runnable both locally and under CodSpeed behind a #[cfg(codspeed)] shim and a merge-base diff (crates/ruff_benchmark/src/criterion.rs, .github/workflows/ci.yaml), and uses a wall-time benchmark against real projects for the load-shaped case (crates/ruff_benchmark/benches/ty_walltime.rs). pydantic-core and pydantic run pytest-codspeed on the Python surface against a profiling-built wheel (pydantic-core/.github/workflows/codspeed.yml, pydantic/tests/benchmarks/test_model_validation.py). The end-to-end shape is axum’s: stand up a real server and point a separate load generator (rewrk) at it (axum/benches/benches.rs, lnx-search/rewrk).
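
One way to report a geometric mean against a named baseline is sketched below. It assumes each run produces a mapping from benchmark name to a positive measurement (seconds or instruction counts) and that the geometric mean is taken over per-benchmark candidate/baseline ratios. That interpretation, the function names, and the benchmark and tag names in the usage note are assumptions for illustration. statistics.geometric_mean is the standard-library function (Python 3.8+).

```python
import statistics


def geomean_ratio(baseline, candidate):
    """Geometric mean of candidate/baseline ratios over the benchmarks both runs share."""
    names = sorted(baseline.keys() & candidate.keys())
    if not names:
        raise ValueError("no benchmarks in common")
    ratios = [candidate[name] / baseline[name] for name in names]
    return statistics.geometric_mean(ratios)


def report(baseline_name, baseline, candidate):
    """Bundle the result with the name of the baseline it was measured against."""
    shared = baseline.keys() & candidate.keys()
    return {
        "baseline": baseline_name,  # e.g. "trunk", a merge-base SHA, or a tag
        "benchmarks": len(shared),
        "geomean_ratio": geomean_ratio(baseline, candidate),
    }
```

For example, report("v1.2.0", {"schedule": 2.0, "ingest": 8.0}, {"schedule": 1.0, "ingest": 16.0}) gives a geomean_ratio of 1.0: one benchmark halved and the other doubled, and the geometric mean treats those as cancelling. An arithmetic mean of the same ratios would report 1.25 and suggest a slowdown.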

3. Reproducibility and baseline naming

A claim names its baseline; an unnamed comparison is not a result. Determinism comes from seeded randomness, pinned upstream state, and a controlled machine. uv pins its benchmark inputs by priming a real cache and freezing the index with --exclude-newer (.github/workflows/bench.yml, BENCHMARKS.md); mypy compares commits by cloning each, building in parallel, and averaging N runs with a fixed PYTHONHASHSEED (misc/perf_compare.py). CPython states perf claims as a geometric mean over a named suite (pyperf/pyperformance), and ships a single-thread-vs-N-thread scaling benchmark with CPU-affinity pinning (Tools/ftscalingbench/ftscalingbench.py, Tools/scripts/sortperf.py).
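
Seeded randomness, the first of the controls above, can be illustrated with a small sketch. The make_workload function, its request shape, and the default paths are hypothetical. The point is only that a workload built from a local random.Random(seed) is identical on every run and on every commit being compared.

```python
import random


def make_workload(seed, n_requests, paths=("/", "/login", "/search")):
    """Build a repeatable list of (path, payload_size) requests from a fixed seed."""
    # A local generator is unaffected by other code that uses the global random module.
    rng = random.Random(seed)
    return [(rng.choice(paths), rng.randint(0, 1024)) for _ in range(n_requests)]
```

Hash randomization is a separate control: PYTHONHASHSEED must be set in the environment before the interpreter starts, so it belongs in the command that launches the benchmark and not in the benchmark code.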

4. Hygiene and storage

Benchmarks are disabled by default in the normal test run and runnable with one command; they emit structured JSON for tracking; large traces, dumps, and profiler captures stay out of tracked files (per the engineering policy). pydantic disables benchmarks by default and exposes a single make benchmark (Makefile). Expensive or hardware-sensitive latency runs may live in a sibling repository or behind a label, as Django keeps its ASV suite in an external repo triggered by a benchmark label (docs/internals/contributing/writing-code/submitting-patches.txt, django/django-asv) and polars runs its heavy benchmarks on a self-hosted machine against an external dataset (.github/workflows/benchmark-remote.yml).
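
A structured JSON result could be as small as the sketch below. The field names and the build_result and render_results helpers are assumptions, not an established rampa schema. The sketch only shows that each result carries its metric, its named baseline, and enough environment detail to be tracked over time.

```python
import json
import platform
import sys


def build_result(benchmark, metric, value, baseline_name):
    """One result record; every field name here is a hypothetical choice."""
    return {
        "benchmark": benchmark,
        "metric": metric,  # e.g. "call-count", "instruction-count", "wall-time"
        "value": value,
        "baseline": baseline_name,
        "python": platform.python_version(),
        "platform": sys.platform,
    }


def render_results(results):
    """Serialize results with sorted keys so repeated runs produce stable diffs."""
    return json.dumps(results, indent=2, sort_keys=True) + "\n"
```

Sorting keys and ending with a newline keeps the output stable for diffing and tracking. Large traces and profiler captures would still be written elsewhere, outside tracked files.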

Benchmark record

A pull request that adds or changes a benchmark, or that justifies native code, records:

kind:                    deterministic regression detection (counts) | latency | throughput
metric:                  call-count | allocation-count | event/connection-count |
                         instruction-count | wall-time
baseline:                trunk | merge-base | tag | release (named)
both paths:              python-only + native, keyed separately
tolerance:               documented drift allowed before the count check fails
reproducibility:         seeded RNG | pinned upstream | fixed machine | geometric mean
runs:                    every pull request | on a label | on a schedule
storage:                 checked-in baseline file | CI service | sibling repo
artifacts:               JSON results; large traces kept out of the tree

Pull request checklist

[ ] Native code (if any) is justified by a latency-or-throughput measurement of the user-visible path vs a named baseline.
[ ] Hot-path changes have deterministic count assertions (function calls / allocations / events / connections).
[ ] The count baseline is checked in and keyed by environment and by whether the accelerator is present.
[ ] Latency / throughput runs name their baseline and report a geometric mean with reproducibility controls.
[ ] Benchmarks are disabled by default and runnable in one command.
[ ] Large traces / dumps are kept out of tracked files.

Consequences

Positive

  • ADR 002’s “prove the bottleneck against a named baseline” becomes a concrete requirement, not a hope.

  • The count assertions notice per-request regressions deterministically, immune to CI timing noise.

  • Both implementation paths are checked separately, so native/Python drift surfaces immediately.

  • Latency claims always carry a named baseline.

Tradeoffs

  • Two separate activities are more machinery than a single pytest-benchmark run.

  • A checked-in count baseline must be regenerated deliberately and reviewed when it changes.

  • A controlled machine (or a CI service) is required for trustworthy latency numbers.

Risks

  • A tolerance set too wide makes the count assertions meaningless. Mitigation: document and review the tolerance.

  • Wall-clock creeping into the per-pull-request checks re-introduces flakiness. Mitigation: the per-pull-request checks assert counts only.

  • Stale baselines block legitimate change. Mitigation: a reviewed regenerate flow.

Relationship to ADR 001, 002, 003, 004, and 006

This ADR makes ADR 002’s default rule and ADR 003’s benchmark policy enforceable, and it checks both ADR 001 paths separately. It runs on the harness defined in ADR 004. ADR 006 covers the profiling used to investigate a regression these checks detect.

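Prior art

The practices in this ADR are drawn from the projects cited inline under Requirements: SQLAlchemy, Django, CodSpeed, ruff, pydantic and pydantic-core, axum, uv, mypy, CPython, and polars.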

Final position

rampa earns performance claims and native code by measurement, against a named baseline, using checks that do not lie about timing. Counts are asserted on every pull request; latency is measured deliberately, away from the pull-request path. A faster number that no baseline names is not a result.