Metadata: XMP packet build and streaming read
At a glance
The Metadata module is the engine’s Extensible Metadata Platform (XMP) layer. It builds the XMP packet that a Portable Document Format (PDF) file carries as a metadata stream. It reads an existing packet without loading the whole document into memory. It emits the engine’s audit-trail XMP extension.
Install
composer require nextpdf/core:^3
Conceptual overview
A PDF stores document-level metadata as an XMP packet in a metadata stream
attached to the document catalog, as described by ISO 32000-2 §14.3.
This module owns the production and consumption of that packet. Its surface is
deliberately small and focused: three classes under NextPDF\Metadata\Xmp.
XmpMetadataBuilder produces the packet. It serializes a property set into a
well-formed XMP document wrapped in the standard <?xpacket?> processing
instructions. It uses the canonical packet globally unique identifier (GUID) and
byte-order mark fixed by the XMP specification. The output is the byte string
that the Writer embeds as the metadata stream, the in-PDF XMP representation
described in §14.3.
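
The packet wrapper described above can be sketched in a few lines. This is a minimal illustration, not the builder's implementation: `wrapXmpPacket()` and the two constants are hypothetical names defined here, and the `end="w"` trailer is an assumption (the XMP specification also allows `end="r"`). The packet id and the UTF-8 byte-order mark are the fixed values from the XMP specification.

```php
<?php
declare(strict_types=1);

// Hypothetical names for this sketch; they are not NextPDF symbols.
const SKETCH_XPACKET_ID = 'W5M0MpCehiHzreSzNTczkc9d'; // fixed packet id from the XMP specification
const SKETCH_XPACKET_BOM = "\xEF\xBB\xBF";            // U+FEFF encoded as UTF-8

/**
 * Wraps an already serialized x:xmpmeta document in xpacket processing instructions.
 */
function wrapXmpPacket(string $xmpMeta): string
{
    $header = '<?xpacket begin="' . SKETCH_XPACKET_BOM . '" id="' . SKETCH_XPACKET_ID . '"?>';
    // Assumption: "w" marks the packet as writable in place; "r" would mark it read-only.
    $trailer = '<?xpacket end="w"?>';

    return $header . "\n" . $xmpMeta . "\n" . $trailer;
}

$xmpMeta = <<<'XML'
<x:xmpmeta xmlns:x="adobe:ns:meta/">
  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
    <rdf:Description rdf:about="" xmlns:dc="http://purl.org/dc/elements/1.1/">
      <dc:format>application/pdf</dc:format>
    </rdf:Description>
  </rdf:RDF>
</x:xmpmeta>
XML;

$packet = wrapXmpPacket($xmpMeta);
```

The resulting string is what a writer would embed as the metadata stream; the processing instructions let tools locate the packet by scanning bytes, without parsing the surrounding PDF.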
XmpStreamReader consumes a packet. It is built for hostile input. The source
is streamed in 64 KB chunks to a bounded temporary file before parsing. The
reader enforces an aggregate byte cap during that write. The libxml entity loader
is set to null for the parse and restored afterward. A DOCTYPE triggers a hard
rejection. iterateProperties() returns a generator that yields (namespaceUri, localName, textContent) tuples for each leaf element without building the whole
tree in memory; only the current element and its text node are alive in the
parser at any moment. An oversized packet raises PacketTooLargeException;
malformed Extensible Markup Language (XML), a DOCTYPE, or non-UTF-8 input raises
InvalidConfigException.
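
The leaf-by-leaf iteration can be approximated with PHP's pull parser. The sketch below is an assumption about the general technique, not the reader's actual code: `iterateLeafProperties()` is a hypothetical function, it takes a file path where the real reader accepts a stream or string, and it throws a plain `RuntimeException` where the module raises its own typed exceptions.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical stand-in for the reader's parsing step.
 *
 * @return \Generator<int, array{string, string, string}>
 */
function iterateLeafProperties(string $path): \Generator
{
    $xml = new XMLReader();
    // LIBXML_NONET forbids network access while parsing.
    if (!$xml->open($path, 'UTF-8', LIBXML_NONET)) {
        throw new RuntimeException('Cannot open XMP source.');
    }

    try {
        $pending = null; // [namespaceUri, localName] of the element that may still be a leaf
        $text = '';

        while ($xml->read()) {
            switch ($xml->nodeType) {
                case XMLReader::DOC_TYPE:
                    throw new RuntimeException('DOCTYPE is not allowed in an XMP packet.');
                case XMLReader::ELEMENT:
                    // A new start tag means the previous pending element had a child.
                    // Self-closing elements carry no text and are skipped in this sketch.
                    $pending = $xml->isEmptyElement ? null : [$xml->namespaceURI, $xml->localName];
                    $text = '';
                    break;
                case XMLReader::TEXT:
                case XMLReader::CDATA:
                    $text .= $xml->value;
                    break;
                case XMLReader::END_ELEMENT:
                    if ($pending !== null) {
                        yield [$pending[0], $pending[1], $text];
                    }
                    $pending = null;
                    $text = '';
                    break;
            }
        }
    } finally {
        $xml->close();
    }
}
```

Because the function only tracks the current candidate element and its accumulated text, memory use follows the longest text run, not the packet size.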
XmpAuditFieldEmitter is the engine-specific extension. It renders an
AuditReport into a custom XMP field under the nextpdfAudit namespace, so a
document’s conformance audit travels with the file as standards-compliant XMP
instead of as a sidecar. The AuditReport it renders is not produced by the
emitter. The caller activates enrichment by running a render under
CssRenderingMode::Audit with a caller-supplied auditCollector configured
through Config(auditCollector: ...). The collector is caller-driven: the caller
feeds it, and the emitter renders whatever it has collected. The emitter is newer
than the core XMP surface (@since 5.4.0); the builder and reader are @since 2.0.0.
API surface
| Class | Key members | Role |
|---|---|---|
| XmpMetadataBuilder | build(): string, XPACKET_GUID, XPACKET_BOM | Serializes a property set into an XMP packet (@since 2.0.0) |
| XmpStreamReader | iterateProperties(mixed $source, int $byteCap = DEFAULT_BYTE_CAP): \Generator, DEFAULT_BYTE_CAP | Bounded, streaming, DOCTYPE-rejecting XMP reader (@since 2.0.0) |
| PacketTooLargeException | extends NextPdfException | Raised when an XMP packet exceeds the byte cap (@since 2.0.0) |
| XmpAuditFieldEmitter | render(?AuditReport $report): string, NAMESPACE_URI | Renders the audit trail as a custom XMP field (@since 5.4.0) |
Run composer docs:generate-api-php -- --module=Metadata to generate the full
PHPDoc table.
Code sample — Quick start
Stream properties out of an existing XMP packet under an explicit byte cap.
<?php
declare(strict_types=1);
require_once __DIR__ . '/../vendor/autoload.php';
use NextPDF\Metadata\Xmp\XmpStreamReader;
$reader = new XmpStreamReader();
foreach ($reader->iterateProperties(file_get_contents('/srv/in/xmp.xml'), byteCap: 1_048_576) as [$ns, $name, $value]) {
    printf("%s:%s = %s\n", $ns, $name, $value);
}
Code sample — Production
Read a packet defensively, and map the module’s typed failures to an application-level outcome instead of letting raw parser faults escape.
<?php
declare(strict_types=1);
require_once __DIR__ . '/../vendor/autoload.php';
use NextPDF\Exception\InvalidConfigException;
use NextPDF\Metadata\Xmp\PacketTooLargeException;
use NextPDF\Metadata\Xmp\XmpStreamReader;
use Psr\Log\LoggerInterface;

final readonly class XmpIngestService
{
    public function __construct(
        private XmpStreamReader $reader,
        private LoggerInterface $logger,
    ) {}

    /**
     * @param resource|string $source A stream resource or XMP byte string.
     *
     * @return array<string, string> Flattened "ns:name" => value map.
     */
    public function ingest(mixed $source): array
    {
        $properties = [];

        try {
            // Cap untrusted XMP at 4 MiB regardless of the 1 GiB default.
            foreach ($this->reader->iterateProperties($source, byteCap: 4_194_304) as [$ns, $name, $value]) {
                $properties["{$ns}:{$name}"] = $value;
            }
        } catch (PacketTooLargeException $e) {
            // Oversized packet: log and report "no metadata" to the caller.
            $this->logger->warning('XMP packet exceeded ingest cap; rejected.', ['error' => $e->getMessage()]);

            return [];
        } catch (InvalidConfigException $e) {
            // Malformed XML, a DOCTYPE, or non-UTF-8 input.
            $this->logger->warning('XMP packet malformed or unsafe; rejected.', ['error' => $e->getMessage()]);

            return [];
        }

        return $properties;
    }
}
Edge cases & gotchas
- XmpStreamReader rejects any DOCTYPE outright. This is an XML External Entity (XXE) defense, not a validation nicety; a packet that needs a DOCTYPE is not accepted. Sanitize it upstream.
- The byte cap defaults to 1 GiB (DEFAULT_BYTE_CAP). That default is a ceiling, not a recommendation. Pass a tight byteCap for untrusted input.
- iterateProperties() is a generator. Consume it once; iterating it twice does not replay.
- The reader sets the libxml entity loader to null for the parse and restores it. Do not run it concurrently with other libxml-based parsing in the same request if that parsing depends on the entity loader.
- XmpAuditFieldEmitter::render(null) is valid and yields an empty rendering; a null AuditReport means “no audit”, not an error.
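
The single-pass behavior is a property of PHP generators in general. As a minimal sketch, with `fakeProperties()` as a hypothetical stand-in for `iterateProperties()` and illustrative property values, the rows can be materialized once and then reused freely:

```php
<?php
declare(strict_types=1);

// Hypothetical stand-in for iterateProperties(); any PHP generator behaves this way.
function fakeProperties(): \Generator
{
    yield ['http://purl.org/dc/elements/1.1/', 'format', 'application/pdf'];
    yield ['http://ns.adobe.com/xap/1.0/', 'CreatorTool', 'ExampleTool'];
}

$generator = fakeProperties();

// Materialize once. Passing false for preserve_keys returns a plain list of rows.
$rows = iterator_to_array($generator, false);

// $rows can be walked as often as needed; $generator itself is now exhausted.
$byName = [];
foreach ($rows as [$ns, $name, $value]) {
    $byName["{$ns}:{$name}"] = $value;
}

// A second foreach over $generator would throw an \Exception instead of replaying.
```

Materializing trades the streaming memory profile for reusability, so it only makes sense after the byte cap has already bounded the input.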
Performance
The builder is linear in the property count. The reader’s memory use is
dominated by the longest single text run, not by document size, because only the
current element is alive in the parser; large packets stream instead of loading
into memory. The default reference workload sits within a 1500 ms wall / 64 MB
peak budget. The reproducibility profile is structural: an XMP packet records
modification timestamps. Two builds of the same logical metadata differ in those
fields, while their structure is identical.
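
One way to check structural reproducibility in a test is to blank the timestamp values before comparing two packets. The sketch below is an assumption, not part of the module: `structuralForm()` is a hypothetical helper, it only handles timestamps serialized as elements (not as attributes), and it treats `xmp:ModifyDate` and `xmp:MetadataDate` as the volatile fields.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: returns a canonical form of an XMP document with
 * volatile timestamp values blanked, so two builds can be compared.
 */
function structuralForm(string $xmp): string
{
    $doc = new DOMDocument();
    if (!$doc->loadXML($xmp, LIBXML_NONET)) {
        throw new RuntimeException('Not well-formed XML.');
    }

    $xpath = new DOMXPath($doc);
    $xpath->registerNamespace('xmp', 'http://ns.adobe.com/xap/1.0/');

    // Blank the text of the timestamp elements; the elements themselves stay in place.
    foreach ($xpath->query('//xmp:ModifyDate | //xmp:MetadataDate') as $node) {
        $node->nodeValue = '';
    }

    $canonical = $doc->C14N();
    if ($canonical === false) {
        throw new RuntimeException('Canonicalization failed.');
    }

    return $canonical;
}
```

Two packets built from the same logical metadata should then yield identical strings, while a change to any non-timestamp property still shows up as a difference.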
Security notes
XmpStreamReader parses untrusted XML and is hardened accordingly. Streamed
chunking with an enforced byte cap bounds a memory-amplification denial of
service. Rejecting DOCTYPE closes XXE. LIBXML_NONET blocks network entity
resolution. Non-UTF-8 input is refused. Still set a deployment-appropriate
byteCap for any externally sourced packet instead of relying on the gigabyte
default. Treat XMP property values as untrusted strings when they re-enter the
application. See the engine threat model in /modules/core/security/.
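
The chunked copy with an aggregate cap can be illustrated as follows. This is a sketch under stated assumptions, not the reader's code: `spoolWithCap()` is a hypothetical function, it throws a standard `LengthException` where the module raises PacketTooLargeException, and the 64 KB chunk size mirrors the figure given in the conceptual overview.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: copies a stream to a temporary file in fixed-size
 * chunks and aborts as soon as the running total exceeds the cap.
 *
 * @param resource $source Readable stream.
 *
 * @return resource Temporary file, rewound to the start.
 */
function spoolWithCap($source, int $byteCap, int $chunkSize = 65536)
{
    $spool = tmpfile();
    if ($spool === false) {
        throw new RuntimeException('Cannot create temporary file.');
    }

    $written = 0;
    while (!feof($source)) {
        $chunk = fread($source, $chunkSize);
        if ($chunk === false) {
            fclose($spool);
            throw new RuntimeException('Read failed.');
        }

        // Check the cap before writing, so the spool never grows past it.
        $written += strlen($chunk);
        if ($written > $byteCap) {
            fclose($spool); // tmpfile() removes the file when it is closed
            throw new LengthException("Packet exceeds the {$byteCap}-byte cap.");
        }

        fwrite($spool, $chunk);
    }

    rewind($spool);

    return $spool;
}
```

At most one chunk is held in memory at a time, and an oversized source is abandoned before it can fill the disk, which is the property that bounds the amplification risk.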
Conformance
The packet XmpMetadataBuilder produces is the in-PDF XMP metadata-stream
representation defined in ISO 32000-2 §14.3.
The XMP serialization form itself is governed by the XMP specification
(ISO 16684-1), which is not in the verifiable citation corpus. That requirement
is referenced by number, not chunk-pinned. These are implementation facts
produced by src/Metadata/Xmp/ and exercised by tests/Unit/Metadata/Xmp/.
End-to-end metadata conformance for a profile (PDF/A, PDF/UA) is validated by
the oracle and golden suites described in /modules/core/conformance/.
See also
- Document module: the DPart tree paired with Document Part Metadata (DPM).
- Audit module: produces the AuditReport the emitter renders.
- Writer module: embeds the packet as a metadata stream.
- Conformance overview
- Engine security model