Document: DParts, split / merge, and vendor extensions

At a glance

The Document module works with whole Portable Document Format (PDF) files, not page content. It builds the Document Part hierarchy that regulated workflows use to attach metadata. It splits a PDF into page-range segments, merges several PDFs into one, and registers developer extensions in the document catalog.

Install

composer require nextpdf/core:^3

Conceptual overview

This module sits above page content. Where Graphics and Content emit operators, Document works at the structural level: page trees, the document catalog, and the Document Part tree.

A Document Part (DPart) is a logical partition of a PDF. ISO 32000-2 defines a DPart hierarchy whose nodes carry Document Part Metadata (DPM). A regulated workflow, such as a pharmaceutical, legal, or archival workflow, can associate metadata with a sub-range of pages instead of the whole file — §14.12. DPart is an immutable readonly node: a leaf references a contiguous run of page indices, and an intermediate node groups child DPart nodes into a tree. DPartRoot is the tree root that the Writer serializes. A leaf node’s /Start and /End entries are indirect references to page objects, not page-index integers — §14.12. DPart::resolveWithPageObjects() resolves those entries against a writer-supplied page-index→object-number map and returns the /Start (and optional /End) reference form. It falls back to the integer form only on test paths where the map is unavailable.
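
The resolution step can be pictured with a small standalone sketch. This is not the implementation of DPart::resolveWithPageObjects(). The function name resolveLeafRefs, its return shape, and the generation number 0 are assumptions made for illustration. It maps a leaf's contiguous 0-based page indices to /Start and /End indirect references through a page-index to object-number map.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper: resolve a leaf's contiguous 0-based page indices to
 * indirect references ("N 0 R") using a page-index => object-number map.
 *
 * @param list<int>       $pageIndices   Contiguous 0-based page indices.
 * @param array<int, int> $pageObjectMap Page index => PDF object number.
 * @return array{Start: string, End?: string}
 */
function resolveLeafRefs(array $pageIndices, array $pageObjectMap): array
{
    if ($pageIndices === []) {
        throw new InvalidArgumentException('A leaf needs at least one page.');
    }

    $first = $pageIndices[0];
    $last = $pageIndices[count($pageIndices) - 1];

    foreach ([$first, $last] as $index) {
        if (!isset($pageObjectMap[$index])) {
            throw new OutOfBoundsException("No page object for index {$index}.");
        }
    }

    // Assumption: generation number 0 for freshly written objects.
    $refs = ['Start' => "{$pageObjectMap[$first]} 0 R"];

    // Assumption: /End is emitted only when the leaf spans more than one page.
    if ($last !== $first) {
        $refs['End'] = "{$pageObjectMap[$last]} 0 R";
    }

    return $refs;
}

// Illustrative map: page index 0 is object 12, index 1 is object 15, and so on.
$map = [0 => 12, 1 => 15, 2 => 18];
$refs = resolveLeafRefs([1, 2], $map);
```

A missing map entry raises an exception here instead of falling back to integers, which reflects the advice to keep the integer form out of production output.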

PdfMerger and PdfSplitter are the document-composition surface. PdfMerger combines page objects from multiple input PDFs, renumbers objects to avoid collisions, and rebuilds a single page tree and cross-reference table. The page tree it produces is a balanced Pages node with Kids and Count, plus the inheritable attribute model that PDF defines for page-tree nodes — §7.7.3. PdfSplitter does the inverse: it extracts page ranges into standalone SplitDocument objects. PageRange is the value object both classes consume. It is 1-based, validates its bounds, and answers contains(), count(), and toArray().
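
To make the value-object semantics concrete, here is a minimal standalone sketch of a 1-based, inclusive range with the three queries named above. SketchPageRange is a hypothetical class written for this illustration, not NextPDF\Document\PageRange. The real class's constructor signature and exception types may differ.

```php
<?php

declare(strict_types=1);

/** Hypothetical stand-in illustrating 1-based, inclusive range semantics. */
final class SketchPageRange
{
    public function __construct(
        public readonly int $first,
        public readonly int $last,
    ) {
        // Bounds are validated once, at construction.
        if ($first < 1 || $last < $first) {
            throw new InvalidArgumentException("Invalid page range {$first}-{$last}.");
        }
    }

    public function contains(int $page): bool
    {
        return $page >= $this->first && $page <= $this->last;
    }

    public function count(): int
    {
        // Inclusive on both ends, so 2..4 covers three pages.
        return $this->last - $this->first + 1;
    }

    /** @return list<int> */
    public function toArray(): array
    {
        return range($this->first, $this->last);
    }
}

$range = new SketchPageRange(2, 4);
```

Validating in the constructor means every instance that exists is already a legal range, so consumers such as a splitter would not need to re-check bounds.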

VendorExtensionRegistry, ExtensionsDictionary, and DeveloperExtensionEntry model the developer-extensions dictionary in the document catalog. An engine uses that dictionary to declare a vendor extension level beyond the base specification. The registry rejects conflicting re-registration of the same vendor prefix with VendorExtensionRegistryConflictException. CollectionDictionary and CollectionSort model a PDF collection (portable collection or portfolio) catalog entry.
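
The conflict rule can be illustrated with a standalone sketch. SketchExtensionRegistry and SketchConflictException are hypothetical names, not the library's classes, and the sketch assumes that any second registration of a prefix counts as a conflict. The ACME prefix and its values are made up. The serialized form uses the /BaseVersion and /ExtensionLevel keys that the PDF specification defines for a developer-extensions entry.

```php
<?php

declare(strict_types=1);

/** Hypothetical exception standing in for the library's conflict exception. */
final class SketchConflictException extends RuntimeException
{
}

/** Hypothetical registry: one developer-extension entry per vendor prefix. */
final class SketchExtensionRegistry
{
    /** @var array<string, array{baseVersion: string, level: int}> */
    private array $entries = [];

    public function register(string $prefix, string $baseVersion, int $level): void
    {
        // Assumption: any second registration of a prefix counts as a conflict.
        if (isset($this->entries[$prefix])) {
            throw new SketchConflictException("Prefix {$prefix} is already registered.");
        }

        $this->entries[$prefix] = ['baseVersion' => $baseVersion, 'level' => $level];
    }

    /** Serialize as the body of the catalog's /Extensions dictionary. */
    public function toPdfDictionary(): string
    {
        $parts = [];
        foreach ($this->entries as $prefix => $entry) {
            $parts[] = sprintf(
                '/%s << /BaseVersion /%s /ExtensionLevel %d >>',
                $prefix,
                $entry['baseVersion'],
                $entry['level'],
            );
        }

        return '<< ' . implode(' ', $parts) . ' >>';
    }
}

$registry = new SketchExtensionRegistry();
$registry->register('ACME', '2.0', 1);

try {
    $registry->register('ACME', '2.0', 2); // same prefix again
    $conflict = false;
} catch (SketchConflictException) {
    $conflict = true;
}
```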

API surface

| Class | Key methods | Role |
| --- | --- | --- |
| `DPart` | `isLeaf()`, `hasMetadata()`, `resolveWithPageObjects()`, `write()` | Immutable Document Part node (`@since 1.12.0`) |
| `DPartRoot` | `isEmpty()`, `write()` | DPart tree root the Writer serializes (`@since 1.12.0`) |
| `PdfMerger` | `merge(array $pdfFiles, int $maxFiles = 100, int $maxTotalBytes = 200_000_000)`, `append()` | Multi-PDF merge with object renumbering (`@since 1.9.0`) |
| `PdfSplitter` | `split()`, `splitEvery()`, `extractPages()` | Page-range split into `SplitDocument` (`@since 1.9.0`) |
| `PageRange` | `contains(int $page)`, `count()`, `toArray()` | 1-based page-range value object |
| `MergeResult` / `SplitResult` | `isValid()`, `count()`, `document()`, `totalOutputSize()` | Composition result objects |
| `VendorExtensionRegistry` | extension registration | Developer-extensions registry (`@since 2.2.0`) |
| `ExtensionsDictionary` | `withEntry()`, `entries()`, `isEmpty()`, `toPdfDictionary()` | Immutable extensions-dictionary builder (`@since 2.0.0`) |
| `CollectionDictionary` | `toPdfDictionary()` | Portable-collection catalog entry (`@since 2.0.0`) |

Run composer docs:generate-api-php -- --module=Document to generate the full PHPDoc table.

Code sample — Quick start

Split a PDF into single-page documents, write each one to disk, then extract pages 2 to 4 as a separate document.

<?php

declare(strict_types=1);

require_once __DIR__ . '/../vendor/autoload.php';

use NextPDF\Document\PageRange;
use NextPDF\Document\PdfSplitter;

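// Read the source once; file_get_contents() returns false on failure.
$pdf = file_get_contents('/srv/in/report.pdf');
if ($pdf === false) {
    throw new \RuntimeException('Cannot read /srv/in/report.pdf.');
}

$splitter = new PdfSplitter();
$result = $splitter->splitEvery($pdf, 1);

// A counted loop is safe for an empty result; range(0, -1) would yield [0, -1].
for ($index = 0; $index < $result->count(); $index++) {
    $segment = $result->document($index);
    file_put_contents("/srv/out/page-{$index}.pdf", $segment->pdfData);
}

// Extract pages 2 through 4 (1-based, inclusive) as one document.
$excerpt = $splitter->extractPages($pdf, new PageRange(2, 4));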

Code sample — Production

Merge several PDFs under an explicit input budget, then validate the result before writing the combined output.

<?php

declare(strict_types=1);

require_once __DIR__ . '/../vendor/autoload.php';

use NextPDF\Document\PdfMerger;
use NextPDF\Exception\PageLayoutException;

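/** @var list<string> $sources Raw PDF byte strings to combine. */
$sources = array_map(
    static function (string $path): string {
        // file_get_contents() returns false on failure; fail fast instead of
        // tripping the string return type under strict_types.
        $bytes = file_get_contents($path);
        if ($bytes === false) {
            throw new \RuntimeException("Cannot read {$path}.");
        }

        return $bytes;
    },
    glob('/srv/batch/*.pdf') ?: [],
);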

$merger = new PdfMerger();

try {
    // Bound the merge: at most 50 files, 100 MB total.
    $merged = $merger->merge($sources, maxFiles: 50, maxTotalBytes: 100_000_000);
} catch (PageLayoutException $e) {
    throw new \RuntimeException('Merge rejected: empty or invalid input set.', previous: $e);
}

if (!$merged->isValid()) {
    throw new \RuntimeException('Merged document failed structural validation.');
}

file_put_contents('/srv/out/combined.pdf', $merged->pdfData);

Edge cases & gotchas

PdfMerger::merge() and PdfSplitter::split() enforce input bounds through ResourceGuard. Inputs with too many files or too many bytes raise an exception instead of silently truncating. Set maxFiles / maxTotalBytes deliberately for your workload.
An empty file list or empty range list raises PageLayoutException. Treat these as configuration errors, not empty results.
PageRange is 1-based and inclusive, whereas a leaf DPart’s pages list holds 0-based page indices. Convert explicitly whenever you cross between the two.
DPart is readonly. To build a different tree, construct new nodes instead of mutating an existing one. resolveWithPageObjects() returns the integer-index fallback form only when the page-object map is empty. Do not rely on that path in production output.
VendorExtensionRegistry raises VendorExtensionRegistryConflictException for a duplicate vendor prefix. Register each prefix once.
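
The index-base gotcha above is easy to get wrong by one, so a minimal conversion sketch may help. toZeroBasedIndices is a hypothetical helper, not part of the module. It takes the first and last page of a 1-based, inclusive range as plain integers and returns the matching 0-based indices.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper: convert a 1-based, inclusive page range to the
 * 0-based page indices a leaf DPart is described as holding.
 *
 * @return list<int>
 */
function toZeroBasedIndices(int $firstPage, int $lastPage): array
{
    if ($firstPage < 1 || $lastPage < $firstPage) {
        throw new InvalidArgumentException('Expected 1 <= firstPage <= lastPage.');
    }

    // Page N (1-based) is index N - 1 (0-based).
    return range($firstPage - 1, $lastPage - 1);
}

// Pages 2 through 4 correspond to indices 1 through 3.
$indices = toZeroBasedIndices(2, 4);
```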

Performance

Split and merge scale linearly with page count and are dominated by parsing and object renumbering, not the module’s own bookkeeping. The default reference workload fits within a 1500 ms wall / 64 MB peak budget. Large merges are constrained mainly by total input bytes. The maxTotalBytes guard keeps peak memory bounded. The reproducibility profile is structural: a merged or split PDF carries a fresh trailer and /ID, so two runs are structurally equal but not byte-identical.
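
One way to check the "structurally equal but not byte-identical" property in a test is to mask the /ID array before hashing. The sketch below is an assumption-laden illustration, not a tool the module ships: hashIgnoringId is a hypothetical helper, it only handles hex-string /ID values, and it will still report a difference if anything else varies between runs (for example timestamps).

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper: hash PDF bytes with the trailer /ID array masked, so
 * two outputs that differ only in /ID compare equal.
 */
function hashIgnoringId(string $pdf): string
{
    // Match "/ID [ <hex> <hex> ]" and replace it with a fixed placeholder.
    $masked = preg_replace(
        '/\/ID\s*\[\s*<[0-9A-Fa-f]*>\s*<[0-9A-Fa-f]*>\s*\]/',
        '/ID [<00> <00>]',
        $pdf,
    );

    if ($masked === null) {
        throw new RuntimeException('Regex failure while masking /ID.');
    }

    return hash('sha256', $masked);
}

// Two toy trailers that differ only in their /ID values.
$runA = "trailer\n<< /Size 8 /Root 1 0 R /ID [<0A1B> <2C3D>] >>\n";
$runB = "trailer\n<< /Size 8 /Root 1 0 R /ID [<FFEE> <DDCC>] >>\n";

$sameStructure = hashIgnoringId($runA) === hashIgnoringId($runB);
```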

Security notes

PdfMerger::merge() and PdfSplitter::split() consume untrusted PDF bytes. Before parsing, both pass input through ResourceGuard::assertSize() / assertCount(), which limits exposure to denial-of-service attacks that rely on decompression or object-count amplification. Keep the maxFiles, maxTotalBytes, and maxBytes arguments tight for the deployment rather than relying on the defaults. Treat every input PDF as hostile. When sources are user-supplied, run batch composition in a constrained worker. See the engine threat model in /modules/core/security/ for the trust boundary.
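
The idea of bounding input before any parsing happens can be sketched in a few lines. assertWithinBudget is a hypothetical function that mirrors the intent of the guard, not the actual ResourceGuard API, and the exception type is an arbitrary choice for the illustration.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical pre-parse check: reject an input set that exceeds a file-count
 * or total-byte budget. Mirrors the idea of the guard, not its actual API.
 *
 * @param list<string> $pdfFiles Raw PDF byte strings.
 */
function assertWithinBudget(array $pdfFiles, int $maxFiles, int $maxTotalBytes): void
{
    if (count($pdfFiles) > $maxFiles) {
        throw new LengthException('Too many input files.');
    }

    $total = 0;
    foreach ($pdfFiles as $bytes) {
        $total += strlen($bytes);
        // Stop as soon as the running total exceeds the budget.
        if ($total > $maxTotalBytes) {
            throw new LengthException('Input exceeds the total byte budget.');
        }
    }
}

$inputs = [str_repeat('a', 600), str_repeat('b', 600)];

// 1,200 bytes in two files fits a 2,000-byte budget.
assertWithinBudget($inputs, maxFiles: 2, maxTotalBytes: 2_000);

try {
    // The same input does not fit a 1,000-byte budget.
    assertWithinBudget($inputs, maxFiles: 2, maxTotalBytes: 1_000);
    $rejected = false;
} catch (LengthException) {
    $rejected = true;
}
```

Running the check before the parser sees a single byte keeps the rejection path cheap, which matters when the caller is untrusted.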

Conformance

The DPart tree this module builds follows the Document Part model in ISO 32000-2 §14.12, with leaf /Start and /End entries emitted as indirect references to page objects under the same clause. Merged output uses the page-tree node structure defined in §7.7.3. These are implementation facts produced by src/Document/ and exercised by tests/Unit/Document/ (DPartTest, DPartRootTest, DPartPageRefTest, DocumentPdfMergerDeepTest, DocumentPageRangeParseDeepTest). They are not a statement of end-to-end PDF 2.0 conformance. Full-document conformance is validated by the oracle and golden suites described in /modules/core/conformance/.