Document: DParts, split / merge, and vendor extensions
At a glance
The Document module works with whole Portable Document Format (PDF) files, not page content. It builds the Document Part hierarchy that regulated workflows use to attach metadata. It splits a PDF into page-range segments, merges several PDFs into one, and registers developer extensions in the document catalog.
Install
composer require nextpdf/core:^3
Conceptual overview
This module sits above page content. Where Graphics and Content emit operators, Document works at the structural level: page trees, the document catalog, and the Document Part tree.
A Document Part (DPart) is a logical partition of a PDF. ISO 32000-2 defines
a DPart hierarchy whose nodes carry Document Part Metadata (DPM). A regulated
workflow, such as a pharmaceutical, legal, or archival workflow, can associate
metadata with a sub-range of pages instead of the whole file — §14.12.
DPart is an immutable readonly node: a leaf references a contiguous run of
page indices, and an intermediate node groups child DPart nodes into a tree.
DPartRoot is the tree root that the Writer serializes. A leaf node’s /Start
and /End entries are indirect references to page objects, not page-index
integers — §14.12.
DPart::resolveWithPageObjects() resolves those entries against a
writer-supplied page-index→object-number map and returns the
/Start (and optional /End) reference form. It falls back to the integer
form only on test paths where the map is unavailable.
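The resolution step described above can be sketched as follows. This is a minimal illustration, not the library's implementation: `resolveLeafEntries()` and its array-shaped return value are hypothetical, and the sketch assumes generation number 0 for every page object and omits /End for a single-page leaf.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper, not the NextPDF API: resolve a leaf's first and last
 * 0-based page indices to indirect-reference strings such as "12 0 R".
 *
 * @param list<int>       $pageIndices   Contiguous 0-based page indices.
 * @param array<int, int> $pageObjectMap Page index => PDF object number.
 * @return array<string, string|int>
 */
function resolveLeafEntries(array $pageIndices, array $pageObjectMap): array
{
    if ($pageIndices === []) {
        throw new InvalidArgumentException('A leaf needs at least one page.');
    }
    $first = $pageIndices[0];
    $last = $pageIndices[count($pageIndices) - 1];

    $resolve = static function (int $index) use ($pageObjectMap): string|int {
        if ($pageObjectMap === []) {
            // Fallback: without a map, keep the integer form (test paths only).
            return $index;
        }
        if (!isset($pageObjectMap[$index])) {
            throw new OutOfBoundsException("No page object for index {$index}.");
        }
        // Assumption: page objects use generation number 0.
        return sprintf('%d 0 R', $pageObjectMap[$index]);
    };

    $entries = ['Start' => $resolve($first)];
    if ($last !== $first) {
        // Assumption of this sketch: a single-page leaf carries no /End.
        $entries['End'] = $resolve($last);
    }
    return $entries;
}
```

With the map `[2 => 15, 3 => 18, 4 => 21]`, a leaf over indices 2 to 4 resolves to references to objects 15 and 21. Throwing on a missing map entry, rather than falling back per page, keeps a partially populated map from producing mixed output.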
PdfMerger and PdfSplitter are the document-composition surface. PdfMerger
combines page objects from multiple input PDFs, renumbers objects to avoid
collisions, and rebuilds a single page tree and cross-reference table. The page
tree it produces is a balanced Pages node with Kids and Count, plus the
inheritable attribute model that PDF defines for page-tree nodes — §7.7.3.
PdfSplitter does the inverse: it extracts page ranges into standalone
SplitDocument objects. PageRange is the value object both classes consume.
It is 1-based, validates its bounds, and answers contains(), count(), and
toArray().
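To make the renumbering idea concrete, here is a rough sketch of how object numbers from several inputs could be mapped onto disjoint numbers in one output. It is not how PdfMerger is implemented: `planRenumbering()` and `rewriteReferences()` are hypothetical names, and the regular-expression rewrite is deliberately naive (a real merger rewrites parsed objects, not raw text, and must not touch string or stream contents).

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical sketch, not PdfMerger itself: give each input document a
 * disjoint block of object numbers in the merged file.
 *
 * @param list<list<int>> $documents Object numbers used by each input PDF.
 * @return list<array<int, int>> Per document: old number => new number.
 */
function planRenumbering(array $documents): array
{
    $next = 1; // Object 0 is reserved as the head of the free list.
    $plans = [];
    foreach ($documents as $objectNumbers) {
        $plan = [];
        foreach ($objectNumbers as $old) {
            $plan[$old] = $next++;
        }
        $plans[] = $plan;
    }
    return $plans;
}

/**
 * Rewrite "N G R" references in a fragment of PDF syntax using one plan.
 *
 * @param array<int, int> $plan Old object number => new object number.
 */
function rewriteReferences(string $fragment, array $plan): string
{
    $rewritten = preg_replace_callback(
        '/\b(\d+) (\d+) R\b/',
        static function (array $m) use ($plan): string {
            $old = (int) $m[1];
            // Leave references the plan does not cover untouched.
            return isset($plan[$old]) ? sprintf('%d 0 R', $plan[$old]) : $m[0];
        },
        $fragment,
    );
    return $rewritten ?? $fragment;
}
```

For two inputs that both use object numbers starting at 1, the second document's objects are shifted past the first document's block, so a `/Kids [2 0 R]` entry from the second input would point at the shifted number after the rewrite.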
VendorExtensionRegistry, ExtensionsDictionary, and
DeveloperExtensionEntry model the developer-extensions dictionary in the
document catalog. An engine uses that dictionary to declare a vendor extension
level beyond the base specification. The registry rejects conflicting
re-registration of the same vendor prefix with
VendorExtensionRegistryConflictException. CollectionDictionary and
CollectionSort model a PDF collection (portable collection or portfolio)
catalog entry.
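The registry behaviour can be illustrated with a small stand-alone sketch. `SketchExtensionRegistry` is a hypothetical class, not the NextPDF one, and it throws a plain `LogicException` in place of VendorExtensionRegistryConflictException. It also assumes that repeating an identical registration is harmless and that only a differing declaration for the same prefix counts as a conflict. The /BaseVersion and /ExtensionLevel keys are the standard entries of a developer-extensions dictionary.

```php
<?php
declare(strict_types=1);

/** Hypothetical sketch of a developer-extensions registry, not the NextPDF classes. */
final class SketchExtensionRegistry
{
    /** @var array<string, array{baseVersion: string, level: int}> */
    private array $entries = [];

    public function register(string $prefix, string $baseVersion, int $level): void
    {
        $entry = ['baseVersion' => $baseVersion, 'level' => $level];
        if (isset($this->entries[$prefix]) && $this->entries[$prefix] !== $entry) {
            // Same prefix, different declaration: refuse rather than overwrite.
            throw new LogicException("Conflicting registration for prefix {$prefix}.");
        }
        $this->entries[$prefix] = $entry;
    }

    /** Serialize as the value of the catalog's /Extensions entry. */
    public function toPdfDictionary(): string
    {
        if ($this->entries === []) {
            return '<< >>';
        }
        $parts = [];
        foreach ($this->entries as $prefix => $entry) {
            $parts[] = sprintf(
                '/%s << /BaseVersion /%s /ExtensionLevel %d >>',
                $prefix,
                $entry['baseVersion'],
                $entry['level'],
            );
        }
        return '<< ' . implode(' ', $parts) . ' >>';
    }
}
```

Registering `ADBE` with base version `1.7` and level `3` serializes to `<< /ADBE << /BaseVersion /1.7 /ExtensionLevel 3 >> >>`. A later attempt to register `ADBE` at a different level throws instead of silently replacing the first entry.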
API surface
| Class | Key methods | Role |
|---|---|---|
| DPart | isLeaf(), hasMetadata(), resolveWithPageObjects(), write() | Immutable Document Part node (@since 1.12.0) |
| DPartRoot | isEmpty(), write() | DPart tree root the Writer serializes (@since 1.12.0) |
| PdfMerger | merge(array $pdfFiles, int $maxFiles = 100, int $maxTotalBytes = 200_000_000), append() | Multi-PDF merge with object renumbering (@since 1.9.0) |
| PdfSplitter | split(), splitEvery(), extractPages() | Page-range split into SplitDocument (@since 1.9.0) |
| PageRange | contains(int $page), count(), toArray() | 1-based page-range value object |
| MergeResult / SplitResult | isValid(), count(), document(), totalOutputSize() | Composition result objects |
| VendorExtensionRegistry | extension registration | Developer-extensions registry (@since 2.2.0) |
| ExtensionsDictionary | withEntry(), entries(), isEmpty(), toPdfDictionary() | Immutable extensions-dictionary builder (@since 2.0.0) |
| CollectionDictionary | toPdfDictionary() | Portable-collection catalog entry (@since 2.0.0) |
Run composer docs:generate-api-php -- --module=Document to generate the full
PHPDoc table.
Code sample — Quick start
Split a PDF into single-page documents, then inspect the result.
<?php
declare(strict_types=1);
require_once __DIR__ . '/../vendor/autoload.php';
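use NextPDF\Document\PageRange;
use NextPDF\Document\PdfSplitter;
// Read the input once; file_get_contents() returns false on failure.
$pdf = file_get_contents('/srv/in/report.pdf');
if ($pdf === false) {
    throw new \RuntimeException('Cannot read /srv/in/report.pdf.');
}
$splitter = new PdfSplitter();
// Split into one document per page.
$result = $splitter->splitEvery($pdf, 1);
// A for loop handles an empty result; range(0, -1) would yield [0, -1].
for ($index = 0; $index < $result->count(); $index++) {
    $segment = $result->document($index);
    file_put_contents("/srv/out/page-{$index}.pdf", $segment->pdfData);
}
// Extract pages 2 through 4 (1-based, inclusive) as one document.
$excerpt = $splitter->extractPages($pdf, new PageRange(2, 4));
Code sample — Production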
Merge several PDFs under an explicit input budget, then validate the result before writing the combined output.
<?php
declare(strict_types=1);
require_once __DIR__ . '/../vendor/autoload.php';
use NextPDF\Document\PdfMerger;
use NextPDF\Exception\PageLayoutException;
/** @var list<string> $sources Raw PDF byte strings to combine. */
$sources = array_map(
    static function (string $path): string {
        // file_get_contents() returns false on failure; fail loudly instead.
        $bytes = file_get_contents($path);
        if ($bytes === false) {
            throw new \RuntimeException("Cannot read {$path}.");
        }
        return $bytes;
    },
    glob('/srv/batch/*.pdf') ?: [],
);
$merger = new PdfMerger();
try {
    // Bound the merge: at most 50 files, 100 MB total.
    $merged = $merger->merge($sources, maxFiles: 50, maxTotalBytes: 100_000_000);
} catch (PageLayoutException $e) {
    throw new \RuntimeException('Merge rejected: empty or invalid input set.', previous: $e);
}
if (!$merged->isValid()) {
    throw new \RuntimeException('Merged document failed structural validation.');
}
file_put_contents('/srv/out/combined.pdf', $merged->pdfData);
Edge cases & gotchas
- PdfMerger::merge() and PdfSplitter::split() enforce input bounds through ResourceGuard. Inputs with too many files or too many bytes raise an exception instead of silently truncating. Set maxFiles / maxTotalBytes deliberately for your workload.
- An empty file list or empty range list raises PageLayoutException. Treat these as configuration errors, not empty results.
- PageRange is 1-based and inclusive. A leaf DPart’s pages list holds 0-based page indices. The two abstractions use different index bases. Convert explicitly when you cross them.
- DPart is readonly. To build a different tree, construct new nodes instead of mutating an existing one.
- resolveWithPageObjects() returns the integer-index fallback form only when the page-object map is empty. Do not rely on that path in production output.
- VendorExtensionRegistry raises VendorExtensionRegistryConflictException for a duplicate vendor prefix. Register each prefix once.
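The index-base mismatch between PageRange and a leaf's page list is easy to get wrong by one. As a small illustration, the conversion can be written as an explicit helper. `toZeroBasedIndices()` is a hypothetical function, not part of the module, and takes plain integers instead of a PageRange so that it runs on its own.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: convert a 1-based inclusive page range to the 0-based
 * page indices a leaf node would hold.
 *
 * @return list<int>
 */
function toZeroBasedIndices(int $firstPage, int $lastPage): array
{
    if ($firstPage < 1 || $lastPage < $firstPage) {
        throw new InvalidArgumentException("Invalid page range {$firstPage}-{$lastPage}.");
    }
    // Pages 2..4 (1-based, inclusive) map to indices 1..3.
    return range($firstPage - 1, $lastPage - 1);
}
```

Keeping the subtraction in one named function means the off-by-one lives in a single place that a unit test can pin down.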
Performance
Split and merge scale linearly with page count and are dominated by parsing and
object renumbering, not the module’s own bookkeeping. The default reference
workload fits within a budget of 1500 ms wall-clock time and 64 MB peak memory. Large merges are
constrained mainly by total input bytes. The maxTotalBytes guard keeps peak
memory bounded. The reproducibility profile is structural: a merged or split
PDF carries a fresh trailer and /ID, so two runs are structurally equal but
not byte-identical.
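Because two runs differ at least in /ID, a regression test cannot compare raw output hashes. One possible approach, sketched below under strong assumptions, is to blank the /ID array before hashing. `normalizeFileId()` and `equalIgnoringFileId()` are hypothetical helpers. They assume /ID is written as two hex strings in an uncompressed trailer dictionary and that nothing else, such as timestamps, varies between runs. If other fields vary, a structural comparison of parsed objects is needed instead.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical test helper: blank the trailer /ID array so two outputs can be
 * compared apart from their file identifiers.
 */
function normalizeFileId(string $pdfBytes): string
{
    $normalized = preg_replace(
        '~/ID\s*\[\s*<[0-9A-Fa-f]*>\s*<[0-9A-Fa-f]*>\s*\]~',
        '/ID [<> <>]',
        $pdfBytes,
    );
    // preg_replace() returns null on a regex engine error; keep the input then.
    return $normalized ?? $pdfBytes;
}

/** True when two outputs match once their /ID entries are ignored. */
function equalIgnoringFileId(string $a, string $b): bool
{
    return hash('sha256', normalizeFileId($a)) === hash('sha256', normalizeFileId($b));
}
```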
Security notes
PdfMerger::merge() and PdfSplitter::split() consume untrusted PDF bytes.
Before parsing, both pass input through ResourceGuard::assertSize() /
assertCount(), which bounds a decompression- or object-count-amplification
denial of service. Keep the maxFiles, maxTotalBytes, and maxBytes
arguments tight for the deployment rather than relying on the defaults. Treat
every input PDF as hostile. When sources are
user-supplied, run batch composition in a constrained worker. See the engine
threat model in /modules/core/security/ for the trust boundary.
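As a complement to the guards inside the module, a worker can reject an oversized batch before reading any file into memory. The sketch below is one way to do that. `assertBatchWithinBudget()` is a hypothetical function, not ResourceGuard, and the exception types are arbitrary choices from the PHP standard library.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical pre-flight check, not ResourceGuard: reject a batch by file
 * count and total on-disk size before any PDF bytes are read into memory.
 *
 * @param list<string> $paths
 */
function assertBatchWithinBudget(array $paths, int $maxFiles, int $maxTotalBytes): void
{
    if (count($paths) > $maxFiles) {
        throw new LengthException(
            sprintf('%d files exceeds the limit of %d.', count($paths), $maxFiles),
        );
    }
    $total = 0;
    foreach ($paths as $path) {
        $size = is_file($path) ? filesize($path) : false;
        if ($size === false) {
            throw new RuntimeException("Cannot stat {$path}.");
        }
        $total += $size;
        if ($total > $maxTotalBytes) {
            // Stop at the first file that pushes the batch over budget.
            throw new LengthException("Batch exceeds {$maxTotalBytes} bytes.");
        }
    }
}
```

On-disk size says nothing about how far compressed streams expand, so a check like this only narrows what reaches the parser. The limits passed to merge() and split() still need to be set for the deployment.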
Conformance
The DPart tree this module builds follows the Document Part model in ISO
32000-2 §14.12, with leaf /Start and /End entries emitted as indirect
references to page objects under the same clause.
Merged output uses the page-tree node structure defined in §7.7.3.
These are implementation facts produced by src/Document/ and exercised by
tests/Unit/Document/ (DPartTest, DPartRootTest, DPartPageRefTest,
DocumentPdfMergerDeepTest, DocumentPageRangeParseDeepTest). They are not a
statement of end-to-end PDF 2.0 conformance. Full-document conformance is
validated by the oracle and golden suites described in
/modules/core/conformance/.
See also
- Core module
- Writer module — serializes the DPart tree and page tree.
- Metadata module — Extensible Metadata Platform (XMP) that pairs with DPM.
- Navigation module
- Conformance overview
- Engine security model