Accessibility: tagging primitives and the PDF/UA-2 structure model
At a glance
NextPDF Core provides primitives for accessible authoring: a logical structure tree, standard role mapping, marked-content tagging, and Best Current Practice (BCP) 47 language attributes that align with the structure-tree model in ISO 14289-2 (PDF/UA-2) and ISO 32000-2 §14.7. The produced file conforms only when the final document, the author’s content choices, and an external checker support that result. The library does not assert that guarantee for you.
Install
composer require nextpdf/core
Conceptual overview
A tagged Portable Document Format (PDF) file includes a logical structure tree
whose root holds a single Document structure element. Assistive technology
reads that tree to determine a meaningful reading order that does not depend on
the visual layout (ISO 32000-2 §14.7.2; ISO 14289-2 §8.2.5.2). NextPDF models
this with three cooperating types in the NextPDF\Accessibility namespace.
StructureTree owns the hierarchy. It allocates marked-content identifiers for
each page, tracks parent and child nesting, and serializes the structure-tree
root, structure elements, the parent tree, the role map, and the PDF 2.0
standard-structure namespace per ISO 32000-2 §14.7. createRoot() seeds the
mandatory single Document element with a language attribute. addElement()
attaches typed children. hasRoot() and rootHasChildren() report whether the
tree exists and whether it has descendants.
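
The bookkeeping described for StructureTree can be pictured with a small standalone model. This is only a sketch: ToyStructureTree, allocateMcid(), and ownerOf() are hypothetical names invented for illustration, not NextPDF's implementation, and the model assumes that marked-content identifiers restart at 0 on each page.

```php
<?php
declare(strict_types=1);

// Hypothetical toy model; it does not reproduce NextPDF's StructureTree.
final class ToyStructureTree
{
    /** @var list<array{type: string, parent: ?int, kids: list<int>}> */
    private array $elements = [];

    /** @var array<int, int> next free marked-content identifier per page */
    private array $nextMcid = [];

    /** @var array<int, array<int, int>> page index => [identifier => element index] */
    private array $parentTree = [];

    public function createRoot(): int
    {
        if ($this->elements !== []) {
            throw new LogicException('The root already exists.');
        }
        $this->elements[] = ['type' => 'Document', 'parent' => null, 'kids' => []];

        return 0;
    }

    public function addElement(int $parentIndex, string $type): int
    {
        if (!isset($this->elements[$parentIndex])) {
            throw new OutOfRangeException('Unknown parent index.');
        }
        $index = count($this->elements);
        $this->elements[] = ['type' => $type, 'parent' => $parentIndex, 'kids' => []];
        $this->elements[$parentIndex]['kids'][] = $index;

        return $index;
    }

    /** Allocate the next identifier on a page and record which element owns it. */
    public function allocateMcid(int $elementIndex, int $pageIndex): int
    {
        if (!isset($this->elements[$elementIndex])) {
            throw new OutOfRangeException('Unknown element index.');
        }
        $mcid = $this->nextMcid[$pageIndex] ?? 0;
        $this->nextMcid[$pageIndex] = $mcid + 1;
        $this->parentTree[$pageIndex][$mcid] = $elementIndex;

        return $mcid;
    }

    public function hasRoot(): bool
    {
        return $this->elements !== [];
    }

    public function rootHasChildren(): bool
    {
        return $this->hasRoot() && $this->elements[0]['kids'] !== [];
    }

    /** Reverse lookup, as a parent tree provides: which element owns this content? */
    public function ownerOf(int $pageIndex, int $mcid): ?int
    {
        return $this->parentTree[$pageIndex][$mcid] ?? null;
    }
}
```

In this model each allocation is a single array write, and the reverse lookup lets a reader go from a marked-content sequence on a page back to its structure element.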
StructureElement is the value object for one structure-element dictionary. It
stores the standard structure type (Table 368 names such as H1 through H6,
P, L, LI, Table, Figure, Link), marked-content identifier entries,
and optional accessibility attributes for alternative text, replacement text,
title, and language. A single element can span multiple pages. It accumulates
one identifier entry per page so the kids array references marked content
across page boundaries.
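
To make the multi-page case concrete, the following sketch models an element that collects one content reference per page. ToyStructureElement and its members are hypothetical and exist only for this illustration; the real StructureElement may expose different fields and methods.

```php
<?php
declare(strict_types=1);

// Hypothetical value object; field and method names are illustrative only.
final class ToyStructureElement
{
    /** @var list<array{page: int, mcid: int}> */
    private array $contentRefs = [];

    public function __construct(
        public readonly string $type,
        public readonly ?string $altText = null,
        public readonly ?string $lang = null,
    ) {
    }

    /** Record one marked-content sequence that belongs to this element. */
    public function addContentRef(int $pageIndex, int $mcid): void
    {
        $this->contentRefs[] = ['page' => $pageIndex, 'mcid' => $mcid];
    }

    /** @return list<int> distinct page indexes, in first-seen order */
    public function pages(): array
    {
        return array_values(array_unique(array_column($this->contentRefs, 'page')));
    }

    /** @return list<array{page: int, mcid: int}> entries destined for the kids array */
    public function kids(): array
    {
        return $this->contentRefs;
    }
}

// A paragraph that starts on page 0 and continues on page 1.
$paragraph = new ToyStructureElement('P', lang: 'en');
$paragraph->addContentRef(0, 4);
$paragraph->addContentRef(1, 0);
```

Because each entry carries its own page index, the kids list can point at content on any page without splitting the paragraph into two elements.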
TaggedContentEmitter connects the Hypertext Markup Language (HTML) pipeline
to the structure tree. When Document::enableTaggedPdf() is active, the HTML
renderer wires the emitter so block-level elements create paired
marked-content operators and matching structure-element nodes.
HtmlToStructureMap provides the table-driven mapping from HTML tags to PDF
structure types (ISO 14289-2 §8). The emitter routes decorative running
content, such as the HTML header and footer regions, to an artifact and keeps
it out of the reading order.
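
A table-driven mapping of this kind can be sketched as a plain lookup. The constants and the function below are hypothetical. Only the h1, p, ul, and li rows follow the mapping this page describes; the remaining rows are assumptions added for illustration and may differ from HtmlToStructureMap.

```php
<?php
declare(strict_types=1);

// Illustrative lookup table, not NextPDF's actual map.
const TOY_HTML_TO_STRUCTURE = [
    'h1' => 'H1', 'h2' => 'H2', 'h3' => 'H3',
    'h4' => 'H4', 'h5' => 'H5', 'h6' => 'H6',
    'p' => 'P',
    'ul' => 'L', 'ol' => 'L', 'li' => 'LI',
    'table' => 'Table', 'img' => 'Figure', 'a' => 'Link',
];

// Running content that should stay out of the reading order.
const TOY_ARTIFACT_TAGS = ['header', 'footer'];

/**
 * Resolve an HTML tag name to a structure type, to the marker 'Artifact',
 * or to null when the toy table has no entry for the tag.
 */
function toyStructureTypeFor(string $htmlTag): ?string
{
    $tag = strtolower($htmlTag);
    if (in_array($tag, TOY_ARTIFACT_TAGS, true)) {
        return 'Artifact';
    }

    return TOY_HTML_TO_STRUCTURE[$tag] ?? null;
}
```

Keeping the mapping in data rather than in branching code makes it easy to audit each row against the standard structure types.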
Bcp47Validator validates language tagging (Request for Comments (RFC) 5646).
It provides a well-formed syntactic check and a registry-backed validity check.
Strict mode (ConformancePolicy::strictUa2()) rejects malformed tags at the
application programming interface (API) boundary instead of dropping them
silently at write time. This matches the ISO 14289-2 §8.4.4 requirement that
the catalog language entry resolve to a specific language.
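
The difference between rejecting a tag at the boundary and dropping it later can be sketched as follows. Both functions are hypothetical, and the pattern is a deliberately simplified subset of the RFC 5646 grammar (language, script, region, variants). It ignores extended language subtags, extensions, private use, and grandfathered tags, and it performs no registry lookup, so it is not a substitute for Bcp47Validator.

```php
<?php
declare(strict_types=1);

/** Simplified syntactic check; an assumption-laden subset of RFC 5646. */
function toyIsWellFormedLangTag(string $tag): bool
{
    $pattern = '/^[a-z]{2,3}'                        // primary language
        . '(?:-[a-z]{4})?'                           // optional script
        . '(?:-(?:[a-z]{2}|[0-9]{3}))?'              // optional region
        . '(?:-(?:[a-z0-9]{5,8}|[0-9][a-z0-9]{3}))*' // optional variants
        . '$/iD';

    return preg_match($pattern, $tag) === 1;
}

/**
 * Strict mode throws at the call site; lenient mode returns null so the
 * caller can drop the tag later.
 */
function toyAcceptLanguage(string $tag, bool $strict): ?string
{
    if (toyIsWellFormedLangTag($tag)) {
        return $tag;
    }
    if ($strict) {
        throw new InvalidArgumentException("Malformed language tag: {$tag}");
    }

    return null;
}
```

Failing early keeps the bad value next to the code that supplied it, which is usually easier to diagnose than a missing language entry in the written file.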
API surface
| Symbol | Kind | Summary |
|---|---|---|
| `Document::enableTaggedPdf(string $lang = 'en', ?ConformancePolicy $policy = null): static` | method | Activate the structure tree and HTML bridge; set the mark-info and catalog language entries. |
| `Document::setLanguage(string $lang): static` | method | Set the document-level natural language (BCP 47). |
| `Document::isTaggedPdfEnabled(): bool` | method | Report whether the active conformance mode mandates structural tagging. |
| `StructureTree::createRoot(string $lang = 'en'): int` | method | Create the mandatory single Document root element. |
| `StructureTree::addElement(int $parentIndex, string $type, int $pageIndex, ...): int` | method | Attach a typed child structure element. |
| `StructureTree::hasRoot(): bool` and `rootHasChildren(): bool` | method | Report whether the tree exists and whether it has descendants. |
| `StructureElement` | final class | Value object for one structure element (alternative text, replacement text, title, language, identifiers). |
| `RoleMap::standard(): array<string,string>` | static | Return the standard structure-type vocabulary (ISO 32000-2 Table 368 plus PDF 2.0 types). |
| `Bcp47Validator::isWellFormed/isValid/validate/normalise` | method | Validate RFC 5646 language tags with syntactic and registry-backed checks. |
| `AccessibilityAutoFixerRegistry` | final class | Opt-in PHP Standards Recommendation (PSR)-11-style registry for heuristic structure fixers. |
Code sample — Quick start
<?php
declare(strict_types=1);

use NextPDF\Core\Document;

$doc = Document::createStandalone();

// The BCP 47 tag drives the catalog language entry and the
// structure-tree root language attribute.
$doc->enableTaggedPdf(lang: 'en');
$doc->setTitle('Tagged accessibility demo');
$doc->addPage();

// Semantic HTML maps to structure elements: h1 to /H1, p to /P,
// ul and li to /L plus /LI. Text runs are wrapped in
// marked-content operators with stable identifiers.
$doc->writeHtml('<h1>Document title</h1><p>Body paragraph.</p>');

$doc->save(__DIR__ . '/output/tagged.pdf');

Code sample — Production
<?php
declare(strict_types=1);

use NextPDF\Conformance\ConformancePolicy;
use NextPDF\Core\Document;
use NextPDF\Exception\InvalidConfigException;
use Psr\Log\LoggerInterface;

final class AccessibleReportWriter
{
    public function __construct(private readonly LoggerInterface $logger)
    {
    }

    public function render(string $html, string $bcp47Lang, string $outPath): void
    {
        $doc = Document::createStandalone();

        try {
            // strictUa2() rejects malformed BCP 47 tags at the API
            // boundary (ISO 14289-2 §8.4.4) instead of dropping silently.
            $doc->enableTaggedPdf($bcp47Lang, ConformancePolicy::strictUa2());
        } catch (InvalidConfigException $e) {
            $this->logger->error('Rejected language tag for tagged PDF', [
                'lang' => $bcp47Lang,
                'reason' => $e->getMessage(),
            ]);

            throw $e;
        }

        $doc->setTitle('Quarterly accessibility report')
            ->setLanguage($bcp47Lang)
            ->addPage();

        $doc->writeHtml($html);

        // The engine emits a Degraded / ComplianceRisk advisory directing
        // the caller to validate externally; surface it to operators
        // rather than treating tagged output as certified.
        foreach ($doc->getWarnings() as $warning) {
            $this->logger->warning('Tagged-PDF advisory', [
                'code' => $warning->code->value,
                'message' => $warning->message,
            ]);
        }

        $doc->save($outPath);
    }
}

Edge cases & gotchas
- Order of calls. Call enableTaggedPdf() before writeHtml(). The HTML pipeline checks the conformance mode when the parser is constructed and does not retroactively wire the emitter for content that has already rendered.
- Empty structure tree. A document with enableTaggedPdf() but no attached structure descendants does not advertise PDF/UA-2 in its metadata. The publication gate is rootHasChildren(), not hasRoot(), because validators reject a file that claims PDF/UA-2 with an empty structure tree (ISO 14289-2 §5; verified by EmptyTaggedPdfDoesNotAdvertisePdfUa2Test).
- Conformance-mode collapse. When you call enablePdfA() and enableTaggedPdf() on the same document, the single-valued conformance discriminator collapses to last-wins. Side effects (structure tree, mark-info) remain additive, and NextPDF emits a CONFORMANCE_MODE_CLOBBERED warning so the collapse is observable.
- Auto-fixers are not automatic. Built-in fixers (EmptyTagStripper, LegacyLangNormaliser, RootLangFallback) ship under NextPDF\Accessibility\AutoFixer\* but are never auto-registered. You must register them explicitly on AccessibilityAutoFixerRegistry.
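
The conformance-mode collapse described in this list is easier to reason about with a toy model. ToyConformanceState is hypothetical, and the mode and side-effect labels (including 'output-intent') are placeholders rather than NextPDF identifiers; only the warning code is taken from this page.

```php
<?php
declare(strict_types=1);

// Hypothetical model of a single-valued mode with additive side effects.
final class ToyConformanceState
{
    private ?string $mode = null;

    /** @var array<string, true> */
    private array $sideEffects = [];

    /** @var list<string> */
    private array $warnings = [];

    /** @param list<string> $effects */
    public function enable(string $mode, array $effects): void
    {
        // Overwriting a different mode is allowed but must be observable.
        if ($this->mode !== null && $this->mode !== $mode) {
            $this->warnings[] = 'CONFORMANCE_MODE_CLOBBERED';
        }
        $this->mode = $mode; // last call wins
        foreach ($effects as $effect) {
            $this->sideEffects[$effect] = true; // effects only accumulate
        }
    }

    public function mode(): ?string
    {
        return $this->mode;
    }

    /** @return list<string> */
    public function sideEffects(): array
    {
        return array_keys($this->sideEffects);
    }

    /** @return list<string> */
    public function warnings(): array
    {
        return $this->warnings;
    }
}

$state = new ToyConformanceState();
$state->enable('pdfa', ['output-intent']);
$state->enable('tagged', ['structure-tree', 'mark-info']);
```

After both calls the model reports the second mode only, keeps all three side effects, and carries one warning. That pattern is what to look for in a document's warning list when two profiles are enabled together.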
Known limitations
NextPDF emits structure consistent with the PDF/UA-2 structure-tree model, but it does not create semantics it cannot infer. You must supply markup or attributes for the following; NextPDF does not generate them for you:
- alternative text for images and other non-text content;
- table header scope and header-to-cell associations beyond what the HTML markup expresses;
- link purpose text when the visible link text is not self-describing;
- list semantics for content that is visually laid out as a list but lacks list markup;
- corrected reading order when the source order differs from the intended reading order;
- decorative-versus-meaningful classification for ambiguous content.
NextPDF performs no end-to-end PDF/UA-2 verification. At runtime, it emits a
Degraded / ComplianceRisk advisory (PDFUA2_FOUNDATIONAL) that directs the
caller to validate the output with an external checker before production
sign-off. Validate with a PDF/UA checker (for example, veraPDF). NextPDF does
not assert conformance on your behalf. Final-document conformance depends on
authoring choices and a validator, not on calling the API.
Performance
Structure-tree construction is linear in the number of structure elements.
Identifier allocation is amortized constant time per marked-content sequence.
Serialization is a single linear pass over the element set. For HTML-driven
tagging, the dominant cost is the HTML pipeline itself, not tag emission. The
per-recipe cap declared in performance_budget (1500 ms wall time, 64 MB peak)
applies to a typical multi-page semantic document. Large documents scale
linearly with element count rather than page count.
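
One way to check a render against such a budget is a small timing harness. This is a sketch only: toyMeasureAgainstBudget() is a hypothetical helper, the closure body is a placeholder for the render call under test, and the 64 MB limit is interpreted here as 64 * 1024 * 1024 bytes, which is an assumption. Note that memory_get_peak_usage() reports the peak for the whole process, not only for the measured call.

```php
<?php
declare(strict_types=1);

/**
 * Run a callable, then report elapsed wall time and process peak memory.
 *
 * @return array{ms: float, peakBytes: int, withinBudget: bool}
 */
function toyMeasureAgainstBudget(callable $work, float $maxMs, int $maxBytes): array
{
    $start = hrtime(true); // monotonic clock, nanoseconds
    $work();
    $elapsedMs = (hrtime(true) - $start) / 1e6;
    $peakBytes = memory_get_peak_usage(true);

    return [
        'ms' => $elapsedMs,
        'peakBytes' => $peakBytes,
        'withinBudget' => $elapsedMs <= $maxMs && $peakBytes <= $maxBytes,
    ];
}

$result = toyMeasureAgainstBudget(
    static function (): void {
        // Placeholder workload; replace with the render call under test.
        str_repeat('x', 1024);
    },
    1500.0,
    64 * 1024 * 1024,
);
```

Running the harness in a fresh process per measurement keeps the peak-memory figure from being inflated by earlier work.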
Security notes
Language tags and accessibility attributes flow into PDF name and string
objects. NextPDF escapes them through PdfStringEscaper, so malformed or
hostile language, alternative-text, replacement-text, and title values cannot
break out of their PDF object context. Strict mode also rejects
unregistered BCP 47 tags at the API boundary, narrowing the input surface
before it reaches the writer. Accessibility attributes can carry author-supplied
free text. Treat them as untrusted output and review them as you review other
document content. See the Conformance module
for profile-checker behavior.
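
The break-out risk is easiest to see for PDF literal strings, where an unescaped closing parenthesis would end the string early. The function below is a hypothetical sketch of the escaping rule for literal strings (backslash, parentheses, and line breaks). It is not PdfStringEscaper, and it does not handle text-string encoding for non-ASCII characters or the separate escaping rules for name objects.

```php
<?php
declare(strict_types=1);

/** Wrap a byte string as a PDF literal string with its delimiters escaped. */
function toyEscapePdfLiteralString(string $value): string
{
    // strtr() with an array replaces in a single pass, so the backslashes
    // it inserts are never escaped a second time.
    return '(' . strtr($value, [
        '\\' => '\\\\',
        '(' => '\\(',
        ')' => '\\)',
        "\r" => '\\r',
        "\n" => '\\n',
    ]) . ')';
}

// A hostile alternative-text value that tries to close the string early.
$escaped = toyEscapePdfLiteralString('a) /Lang (x');
```

With the parentheses escaped, the whole value stays inside one string object and the injected key never reaches the enclosing dictionary.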
Conformance
This page maps library behavior to clause identifiers. It does not assert that
your output conforms. The cited clauses are paraphrased, never quoted. See the
PDF/UA-2 specification mapping for the
provision-level table and explicit non-coverage. Citation chunk hashes are
recorded in docs/public/modules/core/_normative-evidence-a11y.md.