
Honest benchmarking

Spec: ISO/IEC 25010 Spec: ISO/IEC 17025 Evidence: Benchmark-backed

A benchmark number without its method says almost nothing. “NextPDF renders a document in N milliseconds” is uninterpretable unless you know the document, the hardware, the run count, and the variance. This page explains how NextPDF measures performance and why it reports a gated signal rather than a headline speed figure.

Performance claims are easy to publish and easy to fake, usually by accident. A single warm run on an idle laptop, the fastest result from ten attempts, or a microbenchmark of a function nobody calls in a hot path can all produce a real number that predicts nothing about your workload. ISO/IEC 25010 defines performance efficiency as performing functions within time and throughput parameters under specified conditions (ISO/IEC 25010, §3.10). Remove “under specified conditions” and the number stops being a measurement. It becomes a figure without meaning.

There is a quieter failure too: a number that was true once. The moment you paste a benchmark result into prose, it freezes. Meanwhile the code, the runtime, and the hardware keep changing. A stale “fast” claim is not only unhelpful. It is wrong, and it is wrong silently.

  • A performance number is meaningless without its method: the input, the environment, the run count, the warmup policy, and the spread.
  • NextPDF measures with repeated runs, discards warmup iterations, and reports a distribution, not a single best-case figure.
  • Regression detection is statistical: a result is judged against a baseline with Welch’s t-test, so a change must be both statistically significant and large enough to matter before it counts as a regression.
  • An unstable environment is detected and reported, not silently averaged away — high run-to-run variance invalidates the result rather than hiding in the mean.
  • Performance is published as a living signal generated with the build, never as a frozen headline — which is why no millisecond figure appears on this page.

The engine’s performance gate is a statistical test, not a single timed run. The documented methodology runs the measured suite many times, discards a configured number of warmup runs, and computes the mean, standard deviation, and coefficient of variation for both duration and memory. It then compares the current result against a committed baseline:

  1. Repeat: run the measured suite N times under fixed conditions.
  2. Discard warmup: drop the first W runs so cold-start noise is excluded.
  3. Summarise: compute mean, standard deviation, and coefficient of variation for duration and memory.
  4. Test vs baseline: Welch's t-test (two-sample, unequal variance) against the committed baseline.
  5. Decide: significant AND effect over threshold → regression. Variance over threshold → unreliable, not a pass. Absolute ceiling breach → hard fail.
How NextPDF's benchmark gate decides pass, fail, or unreliable: take repeated measurements, drop warmups, summarise the distribution, then apply Welch's t-test against a committed baseline. A regression requires statistical significance AND a material effect size; excessive variance is reported as unreliable rather than averaged into a verdict.
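Steps 2 to 4 can be sketched in a few lines of PHP. This is a minimal illustration, not the engine's implementation: the function names `summarise` and `welch`, the sample values, and the idea that the baseline is available as raw samples are all assumptions made for the example. PHP's core has no t-distribution function, so turning the statistic into a p-value is left out here.

```php
<?php
declare(strict_types=1);

/**
 * Drop warmup runs, then summarise what remains.
 * Hypothetical helper for illustration; not the engine's API.
 *
 * @param list<float> $samples one metric (duration or memory), in run order
 * @return array{n: int, mean: float, sd: float, cv: float}
 */
function summarise(array $samples, int $warmup): array
{
    $kept = array_slice($samples, $warmup);
    $n = count($kept);
    if ($n < 2) {
        throw new InvalidArgumentException('Need at least two measured runs after warmup.');
    }
    $mean = array_sum($kept) / $n;
    $sumSquares = 0.0;
    foreach ($kept as $x) {
        $sumSquares += ($x - $mean) ** 2;
    }
    $sd = sqrt($sumSquares / ($n - 1));   // sample standard deviation
    $cv = $mean > 0 ? $sd / $mean : INF;  // unit-free spread
    return ['n' => $n, 'mean' => $mean, 'sd' => $sd, 'cv' => $cv];
}

/**
 * Welch's t statistic and Welch-Satterthwaite degrees of freedom
 * for two summaries produced by summarise().
 *
 * @return array{t: float, df: float}
 */
function welch(array $current, array $baseline): array
{
    $a = $current['sd'] ** 2 / $current['n'];    // squared standard error, current
    $b = $baseline['sd'] ** 2 / $baseline['n'];  // squared standard error, baseline
    if ($a + $b <= 0.0) {
        throw new DomainException('Both samples have zero variance; t is undefined.');
    }
    $t = ($current['mean'] - $baseline['mean']) / sqrt($a + $b);
    $df = ($a + $b) ** 2
        / ($a ** 2 / ($current['n'] - 1) + $b ** 2 / ($baseline['n'] - 1));
    return ['t' => $t, 'df' => $df];
}

// Made-up samples in arbitrary units; the first value of each is the warmup run.
$current  = summarise([120.0, 100.0, 102.0, 98.0, 100.0], 1);
$baseline = summarise([90.0, 94.0, 96.0, 94.0, 96.0], 1);
$result   = welch($current, $baseline);
printf("t = %.2f, df = %.2f\n", $result['t'], $result['df']); // t = 5.00, df = 5.40
```

Note that the slow first run (120.0) never reaches the mean: discarding it is what keeps cold-start cost out of the comparison.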

Three properties keep this honest. First, a regression needs two things at once: statistical significance (the difference is unlikely to be noise) and an effect size past a threshold (the difference is big enough to care about). A tiny, real slowdown does not raise a false alarm. A large one in noisy data is not missed. Second, instability is a verdict: when run-to-run variation exceeds a bound, the gate reports the environment as unreliable. It does not average noise into a meaningless mean and call it a pass. Third, there is still an absolute ceiling — a hard upper bound that fails the build regardless of the statistics. Because of it, “no significant regression” can never excuse an already-too-slow result.
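The three properties map onto a small decision function. The sketch below is hypothetical: the name `gateVerdict`, the keys of `$limits`, the use of relative change in the mean as the effect size, and the order in which the ceiling and stability checks run are assumptions for illustration. The real thresholds and ordering belong to the engine's benchmark configuration.

```php
<?php
declare(strict_types=1);

/**
 * Combine the three checks into one verdict.
 * Hypothetical helper; thresholds are supplied by the caller, not hard-coded.
 *
 * @param array{alpha: float, minEffect: float, maxCv: float, ceiling: float} $limits
 */
function gateVerdict(
    float $mean,
    float $baselineMean,
    float $cv,
    float $pValue,
    array $limits
): string {
    if ($baselineMean <= 0.0) {
        throw new InvalidArgumentException('Baseline mean must be positive.');
    }
    // Absolute ceiling: fails regardless of what the statistics say.
    if ($mean > $limits['ceiling']) {
        return 'FAIL';
    }
    // Instability is a verdict of its own, never a pass.
    if ($cv > $limits['maxCv']) {
        return 'UNSTABLE';
    }
    // A regression needs significance AND a material effect.
    $effect = ($mean - $baselineMean) / $baselineMean; // relative slowdown (assumed measure)
    $significant = $pValue < $limits['alpha'];
    $material = $effect > $limits['minEffect'];

    return ($significant && $material) ? 'REGRESSION' : 'PASS';
}
```

With placeholder limits of `alpha = 0.05` and `minEffect = 0.05`, a 1% slowdown with a very small p-value still returns `PASS`, while a 10% slowdown with the same p-value returns `REGRESSION`.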

This is the measurement discipline ISO/IEC 17025 describes for any credible metric: results obtained under predetermined conditions — repeatability within one environment (ISO/IEC 17025, §3.7) and reproducibility across them (ISO/IEC 17025, §3.5). A NextPDF performance figure is only meaningful as “this method, this baseline, this run”. That is exactly why it lives with the build that produced it, not in a sentence here.

Evidence: Benchmark-backed. The engine’s benchmark gate implements repeated measurement with warmup discard, mean / standard deviation / coefficient-of-variation summarisation, and a two-sample Welch’s t-test against a committed baseline, with explicit pass / regression / unstable / hard-ceiling outcomes. Performance benchmarks also exist as a dedicated, separately runnable suite and a PHPBench harness. Performance is measured deliberately, not estimated.

Evidence: Standard-backed. “Under specified conditions” is not editorial caution. It is the definition. ISO/IEC 25010, §3.10 ties performance efficiency to specified time, throughput, and resource conditions. A number without its conditions is not a weaker measurement. It is not a measurement.

Evidence: Standard-backed. The repeatability and reproducibility framing follows ISO/IEC 17025: a result is credible only relative to predetermined conditions, distinguishing same-environment repeatability from cross-environment reproducibility. A benchmark that cannot state its conditions cannot claim either.

What “honest” looks like in practice is a method statement, not a figure:

<?php
declare(strict_types=1);
// The gate is invoked with its conditions made explicit, e.g.:
//
//   php ci/scripts/benchmark-gate.php \
//       --runs=5 --warmup=1 --testsuite=Unit \
//       --baseline=<committed-baseline>
//
// It then reports, for duration AND memory:
// - mean, standard deviation, coefficient of variation (the spread)
// - Welch's t-test p-value and effect size vs the baseline
// - a verdict: PASS | REGRESSION | UNSTABLE | hard-ceiling FAIL
//
// An honest performance statement is therefore shaped like:
//   "<suite>, <runs> runs (<warmup> warmup), <hardware/runtime>,
//    no statistically significant regression vs baseline <id>;
//    coefficient of variation within bound."
//
// It is NEVER shaped like:
// "NextPDF is fast" — or a bare millisecond number with no method.

The deliverable of a benchmark is the reproducible method and the verdict. The raw millisecond value belongs to the build that produced it, where you can re-derive it — not transcribed into documentation where you cannot.

The first misconception is that a benchmark is a number. A benchmark is a procedure that yields a distribution. The number is only one draw from it. Reporting the best of several runs, or one warm run, is not optimistic. It is measuring a different thing (peak under ideal conditions) and labelling it as typical performance.
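A short sketch makes the gap between "best run" and "typical run" concrete. The data are made up, in arbitrary units, and are not NextPDF measurements. The helper name `bestVersusTypical` is hypothetical, and the sketch assumes positive measurements.

```php
<?php
declare(strict_types=1);

/**
 * Contrast the single best run with the summarised distribution.
 * Hypothetical helper for illustration only.
 *
 * @param list<float> $runs measurements in run order
 * @return array{best: float, mean: float, cv: float}
 */
function bestVersusTypical(array $runs, int $warmup): array
{
    $measured = array_slice($runs, $warmup);
    $n = count($measured);
    if ($n < 2) {
        throw new InvalidArgumentException('Need at least two measured runs after warmup.');
    }
    $mean = array_sum($measured) / $n;
    $squares = array_map(fn (float $x): float => ($x - $mean) ** 2, $measured);
    $sd = sqrt(array_sum($squares) / ($n - 1)); // sample standard deviation

    return ['best' => min($runs), 'mean' => $mean, 'cv' => $sd / $mean];
}

$r = bestVersusTypical([140.0, 101.0, 99.0, 100.0, 97.0, 103.0], 1);
printf("best = %.1f, mean = %.1f, cv = %.3f\n", $r['best'], $r['mean'], $r['cv']);
// best = 97.0, mean = 100.0, cv = 0.022
```

Quoting the best value alone would describe the single fastest run, which happened once in six attempts, and would say nothing about the spread around the mean.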

The second is that “no statistically significant regression” means “as fast as before, guaranteed”. It means the observed difference is within what the method can distinguish from noise, given this run count and this variance. That is a bounded, conditional statement, which is precisely why NextPDF keeps the absolute ceiling as an independent safeguard and refuses to compress the result into an unqualified claim, for itself or against anyone else.
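The dependence on run count and variance is visible in the standard form of Welch's statistic. The notation is introduced here for illustration: subscript c is the current result, subscript b the baseline, with sample means, sample variances, and run counts after warmup discard.

```latex
t = \frac{\bar{x}_c - \bar{x}_b}
         {\sqrt{\dfrac{s_c^2}{n_c} + \dfrac{s_b^2}{n_b}}},
\qquad
\nu \approx
\frac{\left(\dfrac{s_c^2}{n_c} + \dfrac{s_b^2}{n_b}\right)^{2}}
     {\dfrac{\left(s_c^2/n_c\right)^{2}}{n_c - 1}
      + \dfrac{\left(s_b^2/n_b\right)^{2}}{n_b - 1}}
```

For a fixed difference in means, the magnitude of t shrinks as either variance grows or either run count falls. The same real slowdown can therefore be significant on a quiet machine with many runs and indistinguishable from noise on a busy one with few.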

This page describes how NextPDF measures and reports performance. It states no throughput, latency, or memory figure on purpose. Those are living signals generated from continuous-integration artifacts under stated conditions, and the current values are published with the build. A number repeated here would be unconditioned and would go stale — the exact failure this page argues against. There is no stable performance constant to quote, so none is quoted. The discipline is the deliverable.

The run count, warmup policy, thresholds, and baseline are owned by the engine’s benchmark configuration and evolve with the engine and its hardware. That configuration is the authority if it ever disagrees with this explanation. NextPDF makes no performance comparison to any other library — favorable or unfavorable — because such a comparison without identical, stated conditions would be exactly the unqualified claim this page exists to refuse.

  • Golden-file testing — the same reproducibility discipline applied to output correctness, including the bitwise / structural / semantic profiles.
  • The NextPDF testing pyramid — where the performance tier sits and why it is opt-in rather than on every change.
  • Mutation testing, explained — another place NextPDF reports a gated signal rather than a vanity number.
  • Benchmark — a defined procedure that produces a distribution of measurements under stated conditions, not a single number.
  • Warmup run — an initial iteration discarded so cold-start effects (JIT, caches, autoload) do not contaminate the measured result.
  • Coefficient of variation — standard deviation divided by the mean; a unit-free measure of spread used to judge whether a run is stable enough to trust.
  • Welch’s t-test — a two-sample statistical test for unequal variances, used here to decide whether a result differs from the baseline beyond noise.
  • Effect size — how large a difference is, independent of statistical significance; NextPDF requires both before declaring a regression.
  • Repeatability / reproducibility — agreement of results under predetermined conditions within one environment (repeatability) or across environments (reproducibility), per ISO/IEC 17025.
  • Absolute ceiling — a hard upper bound that fails the build regardless of the statistical comparison.