Golden-file testing

Spec: ISO/IEC/IEEE 29119-4 · Spec: ISO/IEC 25010 · Evidence: Test-backed

A golden file is a recorded “this is what correct output looks like” that a test compares against on every run. NextPDF uses goldens to catch the change nobody meant to make: a stream that compressed differently, a paragraph that moved, or a coordinate that drifted. This page explains how that works and how a golden stays trustworthy instead of becoming an outdated reference nobody reads.

PDF generation is a long pipeline with many places to drift silently. A refactor that “changes nothing” can quietly reorder operators, alter a transform matrix, or shift a table cell by a tiny amount. Unit tests rarely catch this: they assert on the values you thought to check, not on the rest of the output, which you did not. Structure-based and specification-based techniques detect different errors, and neither subsumes the other (ISO/IEC/IEEE 29119-4, Annex A). A golden file is the specification-by-example that pins the whole output rather than a single assertion.

The risk runs both ways. A golden that is too strict fails on every harmless change and gets blindly re-blessed until it proves nothing. A golden that is too loose lets real regressions pass. Getting that balance right is the entire craft.

  • A golden file is a pinned reference output, generated from known-good engine behaviour and committed to the repository.
  • A golden test re-generates the output and diffs it against the pinned reference; any difference fails the test and demands a human decision.
  • NextPDF compares at the level that is meaningful but stable: extracted text and normalised structural operators, not raw bytes, because raw bytes carry noise (timestamps, subset ordering, compression) that is not a regression.
  • Updating a golden is a deliberate, reviewed act behind an explicit GOLDEN_UPDATE switch — never an automatic “accept whatever changed”.
  • A golden test differs from snapshot and characterization tests in one decisive way: a golden is never auto-updated.

The engine’s golden infrastructure uses an explicit two-layer diff rather than a byte comparison:

  1. Generate: render the fixture input through the current engine.
  2. Layer 1 (text): extract human-readable text from the content stream and diff it against the text golden. Catches dropped or reordered content and encoding regressions.
  3. Layer 2 (structure): extract the ordered PDF operators, normalise coordinates to a fixed precision, and diff them against the operator golden. Catches layout shifts and broken structure.
  4. Decide: any diff fails the test; a human judges whether it is a regression or an intended change.

How a NextPDF golden comparison runs: generate the PDF, extract the meaningful layers (text, then normalised structural operators), diff each against the pinned reference, and fail with a human-readable report on any difference.
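The steps above can be sketched in plain PHP. This is a minimal illustration, not NextPDF's comparator: the function name `compareToGolden`, the array keys, and the line-by-line diff are assumptions made for the example, and extraction of the text and operator layers is assumed to have happened already.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical sketch: diff already-extracted layers against pinned references.
 * Returns human-readable differences; an empty list means "no change".
 *
 * @param array{text: string, operators: string} $actual
 * @param array{text: string, operators: string} $golden
 * @return list<string>
 */
function compareToGolden(array $actual, array $golden): array
{
    $report = [];
    foreach (['text', 'operators'] as $layer) {
        $want = explode("\n", $golden[$layer]);
        $got  = explode("\n", $actual[$layer]);
        $max  = max(count($want), count($got));
        for ($i = 0; $i < $max; $i++) {
            $w = $want[$i] ?? '<missing>';
            $g = $got[$i] ?? '<missing>';
            if ($w !== $g) {
                $report[] = sprintf('%s line %d: expected "%s", got "%s"', $layer, $i + 1, $w, $g);
            }
        }
    }
    return $report;
}

// Illustrative data: the text layer is unchanged, one coordinate has moved.
$golden = ['text' => "Name\nAlice", 'operators' => "BT\n72.00 700.00 Td\nET"];
$actual = ['text' => "Name\nAlice", 'operators' => "BT\n72.00 698.50 Td\nET"];

foreach (compareToGolden($actual, $golden) as $line) {
    echo $line, PHP_EOL;
}
```

With this sample data the text layer passes and the operator layer reports one difference on line 2, which is the kind of layout shift Layer 2 exists to surface.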

What it deliberately does not compare is as important as what it does. Raw byte output is excluded because timestamps, font-subset ordering, and stream compression make it brittle without making it more correct. Pixel-level image diffing is excluded from this tier because it needs an external renderer and imports environment variance. Floating-point coordinates are normalised to a fixed precision so meaningless rounding noise does not masquerade as a regression. This is the difference between a golden that tests behaviour and one that tests transient environmental noise.
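To make the normalisation step concrete, here is a minimal sketch that rounds the numeric operands of one content-stream line to a fixed precision. The function name `normaliseOperatorLine`, the two-decimal precision, and the regular expression are assumptions for illustration; the engine's actual precision and comparator may differ.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: round every standalone numeric operand in one
 * operator line to a fixed number of decimals, so rounding noise does
 * not appear in a diff.
 */
function normaliseOperatorLine(string $line, int $precision = 2): string
{
    // Match numbers only when delimited by whitespace or line boundaries,
    // so digits inside names such as /F1 are left untouched.
    $result = preg_replace_callback(
        '/(?<![^\s])[+-]?(?:\d+\.?\d*|\.\d+)(?![^\s])/',
        static function (array $m) use ($precision): string {
            $rounded = round((float) $m[0], $precision);
            // Collapse negative zero so "-0.00" never appears.
            $rounded = $rounded == 0.0 ? 0.0 : $rounded;
            return number_format($rounded, $precision, '.', '');
        },
        $line
    );

    // preg_replace_callback returns null on a regex error; fall back to the input.
    return $result ?? $line;
}

echo normaliseOperatorLine('1 0 0 1 72.0000001 699.9999 cm'), PHP_EOL;
echo normaliseOperatorLine('/F1 12 Tf'), PHP_EOL;
```

Two lines that differ only in rounding noise, such as `72.0000001 700 Td` and `72 700.000003 Td`, normalise to the same string under this sketch, while a cell that really moved still produces a diff.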

This choice also names the page’s reproducibility profile: structural. NextPDF documents three profiles — bitwise (the exact bytes reproduce), structural (the object graph and operator sequence reproduce, allowing benign byte-level variance), and semantic (the meaning reproduces). Golden tests at this tier assert the structural profile by construction. That is why their references survive a compression-library bump but still fail a moved table.

Evidence: Test-backed. The two-layer comparison (text extraction, then normalised structural-operator comparison) is the engine’s own documented golden methodology, with raw-byte, image-diff, and exact-float comparison explicitly out of scope for the reasons above. The golden suite is a declared, separately runnable test suite, distinct from the snapshot and characterization suites.

Evidence: Test-backed. The honesty mechanism is concrete: golden references are generated artefacts, not hand-written, and overwriting them is gated behind an explicit GOLDEN_UPDATE environment switch documented as a rare, always-reviewed operation. By contrast, snapshot tests in the engine regenerate on first run and acknowledge drift via an update flag, and characterization tests lock in legacy behaviour without claiming it is correct. The three are intentionally different tools.

Evidence: Standard-backed. A golden is a specification-by-example. ISO/IEC/IEEE 29119-4, Annex A notes that specification-based and structure-based techniques catch different classes of error and that a strategy should combine them. That is why goldens sit alongside, not instead of, unit and structural tests in the testing pyramid.

A golden test is mechanically simple; the discipline is in the workflow around it:

<?php
declare(strict_types=1);

// 1. The fixture: a fixed HTML input committed next to the test.
//      tests/Golden/fixtures/html-inputs/002-basic-table.html
//
// 2. The pinned references, generated once from known-good behaviour:
//      002-basic-table.text.golden       (Layer 1: extracted text)
//      002-basic-table.operators.golden  (Layer 2: normalised operators)
//
// 3. The run compares; ANY difference fails:
//      vendor/bin/phpunit --testsuite Golden
//
// 4. An intended behaviour change is the ONLY time references move,
//    and it is explicit and reviewed, never automatic:
//      GOLDEN_UPDATE=1 vendor/bin/phpunit --testsuite Golden
//
// The regenerated *.golden files land in the diff of the same change
// that altered behaviour, so a reviewer sees the output delta next to
// the code delta and signs off on both together.

The example is the process. The test code only diffs. What makes the golden trustworthy is that a changed reference is reviewed as output in the same change that altered the engine.
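For readers who want to see what an explicit update gate might look like, here is a minimal sketch. Only the `GOLDEN_UPDATE` switch name comes from this page; the function name `checkAgainstGolden`, the `=== '1'` check, and the choice to fail on a missing reference are assumptions, not the engine's implementation.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical update gate. Returns true when the output matches the
 * reference, or when the reference was deliberately rewritten.
 * Throws on any difference that was not explicitly authorised.
 */
function checkAgainstGolden(string $goldenPath, string $actual): bool
{
    $updateRequested = getenv('GOLDEN_UPDATE') === '1';

    if ($updateRequested) {
        // Deliberate re-bless: the rewritten file appears in the change's
        // diff, where a reviewer can inspect it.
        file_put_contents($goldenPath, $actual);
        return true;
    }

    if (!is_file($goldenPath)) {
        // Assumption: unlike a snapshot, a missing reference is never
        // created silently on first run.
        throw new RuntimeException("Missing golden: {$goldenPath}");
    }

    if (file_get_contents($goldenPath) !== $actual) {
        throw new RuntimeException("Output differs from golden: {$goldenPath}");
    }

    return true;
}

// Illustrative run against a temporary reference file.
$path = tempnam(sys_get_temp_dir(), 'golden');
file_put_contents($path, "Name\nAlice");
putenv('GOLDEN_UPDATE'); // make sure the switch is unset

try {
    checkAgainstGolden($path, "Name\nBob");
} catch (RuntimeException $e) {
    echo $e->getMessage(), PHP_EOL;
}
```

The design point is that the default path can only pass or fail. Writing to the reference requires a switch that someone had to set on purpose, and the resulting file change is then visible in review.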

The most common mistake is treating golden testing as byte-for-byte testing. NextPDF’s goldens are not the file’s bytes — they are its extracted text and its normalised structural operators. Asserting raw bytes would fail on a new zlib version, a different subset tag, or a regenerated timestamp, none of which is a regression. The test would be re-blessed into uselessness within a week. (Where exact bytes genuinely must reproduce, that is the separate, stricter bitwise reproducibility profile, not a golden.)
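The brittleness of raw bytes is easy to demonstrate. The sketch below compresses one content stream at two different zlib levels, standing in for two library versions or configurations. It assumes PHP's zlib extension is available, and the sample stream is invented for the example.

```php
<?php
declare(strict_types=1);

// One logical content stream, repeated so there is something to compress.
$content = str_repeat("BT 72.00 700.00 Td (Hello) Tj ET\n", 50);

// Two "builds" that differ only in compression settings.
$buildA = gzcompress($content, 1);
$buildB = gzcompress($content, 9);

// Raw bytes differ, so a byte-level reference would fail here...
var_dump($buildA === $buildB);                             // bool(false)

// ...but the decoded stream, from which text and operators are extracted,
// is identical in both builds.
var_dump(gzuncompress($buildA) === gzuncompress($buildB)); // bool(true)
```

Nothing a reader of the PDF could observe changed between the two builds, yet a byte comparison fails. Comparing the extracted layers keeps the test quiet until behaviour actually changes.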

The second mistake is assuming a green golden suite proves correctness. It proves non-change. A golden generated from buggy output faithfully protects the bug. Goldens guard against regression from a known-good baseline; they do not establish that the baseline was good. That is what the unit, structural, and conformance tiers are for.

A golden test answers exactly one question: did the output change from the pinned reference? It does not say whether the reference was ever correct. It also does not measure performance, conformance, or security. Those are other tiers. The fixture corpus size, the suite’s pass rate, and any coverage figure are living quality signals generated from continuous-integration artifacts and published with the build. They are intentionally not stated here, where they could go stale.

The exact directory layout, comparator internals, and update switch are owned by the engine’s test infrastructure and may evolve. The test configuration is the authority if it ever disagrees with this explanation. This page makes no claim about any other library’s snapshot or golden tooling.

  • Golden file — a pinned reference output, generated from known-good engine behaviour and committed, that a test diffs against on every run. Never auto-updated.
  • Two-layer diff — NextPDF’s golden comparison: extracted text (Layer 1) plus normalised structural operators (Layer 2), instead of raw bytes.
  • Snapshot test — a related but distinct technique where the reference is regenerated on first run and drift is acknowledged via an update flag.
  • Characterization test — a test that locks in existing behaviour without asserting it is correct, typically to make a refactor safe.
  • Reproducibility profile — the level at which output must reproduce: bitwise (exact bytes), structural (object graph and operator sequence, benign byte variance allowed), or semantic (meaning). Golden tests here assert the structural profile.
  • GOLDEN_UPDATE — the explicit environment switch that authorises overwriting golden references; a rare, reviewed operation.