Advanced PDF parser diagnostics
At a glance
The Artisan import path reads a Chrome-generated Portable Document Format (PDF) file and brings one page into a NextPDF document. When a difficult input breaks that import, look below PageImporter::import() to the parser classes that read the file byte by byte.
This guide covers the low-level parser surface in the NextPDF\Parser namespace: PdfReader, PdfTokenizer, CrossRefParser, StreamDecoder, ResourceCollector, RevisionExtractor, and the value objects PdfObject and RevisionXRefTable. Every symbol shown here exists in nextpdf/artisan. The guide describes the parser as it is built, not an idealized interface.
Use this guide as both explanation and how-to. It shows how the pieces fit, then walks you through inspecting an incremental-update revision. For the import boundary above this layer, see the Artisan developer guide.
When you need this
Use the parser surface only when the normal import path has already failed and you need to find the cause. Typical triggers include:
- PageImporter::import() throws NextPDF\Artisan\Exception\PdfParseException, and you need to know whether the cross-reference table, a stream filter, or the page tree is at fault.
- A Chrome upgrade changes the output format, such as when a traditional cross-reference table becomes a cross-reference stream, or vice versa, and your fixtures stop matching.
- You receive a third-party PDF that Chrome did not produce, and you want to confirm whether the parser can read it at all.
- You are analyzing an incrementally updated document and need per-revision byte ranges or object visibility.
If you are writing a normal renderer integration, you do not need this surface. The parser is an internal diagnostic tool, not a general-purpose PDF library. It does not support encrypted PDFs, linearized hint tables, or incremental updates with conflicting object redefinitions.
Parser surface
The parser is a small set of single-responsibility classes. PdfReader is the entry point. The other classes are collaborators it constructs or calls.
| Class | Responsibility | Key methods |
|---|---|---|
| PdfReader | Read the file structure, resolve objects, and traverse the page tree. | parse(), getObject(), getTrailer(), getObjectNumbers(), getPage(), getPageContentStream(), getPageResources(), getPageMediaBox(), resolveRef(), collectPageResources(), getRevisionCount(), getRevisionXRef(), getRevisions() |
| PdfTokenizer | Analyze lexical syntax per ISO 32000-2:2020 §7.2: names, strings, numbers, dictionaries, arrays, and references. | readToken(), readValue(), readName(), readNumber(), readDictionary(), readArray(), readStreamData(), peek(), skipWhitespace(), getOffset(), setOffset() |
| CrossRefParser | Parse traditional cross-reference tables and cross-reference streams. | parseXRefTable(), parseXRefStream() |
| StreamDecoder | Decode stream bytes by /Filter. | decode() (static) |
| ResourceCollector | Traverse a Resources tree recursively and collect every reachable indirect object. | traverse(), getCollected() |
| RevisionExtractor | Slice an incrementally updated file into per-revision byte ranges. | extractRevision() (static), getRevisionBoundaries() (static) |
| PdfObject | Immutable parsed indirect object (dictionary plus optional stream). | get(), getRef(), getArray(), getType(), getSubtype(), hasStream(), getDictionary(), getRawStreamData(), getRawDictionaryBytes() |
| RevisionXRefTable | Immutable per-revision cross-reference snapshot. | getObjectNumbers(), getActiveObjectCount(), hasRootUpdate(), getSize() |
PdfReader — the entry point
Construct \NextPDF\Parser\PdfReader with the raw PDF bytes, then call parse() before you call any other method. parse() checks the %PDF- header, finds startxref in the file tail, and walks the cross-reference chain by following /Prev links.
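To make the first two checks concrete, here is a minimal standalone sketch of a header check and a startxref lookup. It is not the package's implementation: findStartXrefOffset() and its $tailBytes parameter are hypothetical names, and the exception messages only imitate the wording used in this guide.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper (not part of nextpdf/artisan): return the byte offset
 * recorded after the last "startxref" keyword in the file tail.
 */
function findStartXrefOffset(string $pdfData, int $tailBytes = 1024): int
{
    if (!str_starts_with($pdfData, '%PDF-')) {
        throw new InvalidArgumentException('missing %PDF- header');
    }

    // Scan only the tail: the keyword sits just before the final %%EOF marker.
    $tail = substr($pdfData, -$tailBytes);
    $pos = strrpos($tail, 'startxref');
    if ($pos === false) {
        throw new InvalidArgumentException('Cannot find startxref marker');
    }

    // The keyword is followed by whitespace and a decimal byte offset.
    if (preg_match('/\Gstartxref\s+(\d+)/', $tail, $m, 0, $pos) !== 1) {
        throw new InvalidArgumentException('Invalid startxref offset');
    }

    $offset = (int) $m[1];
    if ($offset >= strlen($pdfData)) {
        throw new InvalidArgumentException('Invalid startxref offset');
    }

    return $offset;
}
```

The returned offset is where a cross-reference section should begin. Following /Prev links from there is left out of this sketch.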
After parse(), the reader exposes three method groups:
- Object access. getObject(int $objNum) returns a PdfObject, resolving Type 2 entries (objects stored inside an object stream) automatically. getObjectNumbers() returns a sorted list<int> of every non-free object number. resolveRef(mixed $value) follows one indirect reference. A direct value passes through unchanged.
- Page access. getPage(int $pageIndex) resolves the catalog, walks /Pages, and returns the page at the zero-based index. getPageContentStream(), getPageResources(), and getPageMediaBox() extract the parts PageImporter needs. collectPageResources() returns array<int, PdfObject> for every object reachable from the page’s Resources and Contents.
- Revision access. getRevisionCount() returns the number of incremental revisions. A single-revision file returns 1. getRevisionXRef(int $index) returns one RevisionXRefTable (index 0 is the most recent). getRevisions() returns the full list<RevisionXRefTable>.
PdfTokenizer — lexical analysis
PdfTokenizer reads the byte stream. You rarely construct it yourself because PdfReader and CrossRefParser own their instances. Inspect this layer when a parse fails on a malformed token. Two behaviors matter for diagnostics:
- Security limits are constants, not configuration. The tokenizer caps literal-string nesting, dictionary and array nesting, keyword length, and array element count. When input exceeds a limit, it throws PdfParseException and names the limit in the message. A crafted input that trips one of these limits is a defense working as designed, not a parser bug.
- readValue() routes parsing. It inspects the next byte and delegates to readName(), readLiteralString(), readHexString(), readArray(), readDictionary(), or a number/reference reader. An indirect reference N G R is returned as the array shape ['type' => 'ref', 'num' => N, 'gen' => G]. PdfObject::getRef() and PdfReader::resolveRef() recognize this shape.
CrossRefParser — cross-reference resolution
CrossRefParser parses both formats Chrome can emit:
- parseXRefTable() reads a traditional xref table (the classic format): subsection headers, fixed-width 20-byte entries, and then a trailer dictionary.
- parseXRefStream() reads a cross-reference stream (introduced in PDF 1.5 and specified in ISO 32000-2:2020 §7.5.8): an indirect object with /Type /XRef, a /W field-width array, and a binary stream of entries.
Both return the same shape: array{xref: array<int, ...>, trailer: array<string, mixed>, prevOffset: int|null}. PdfReader::parse() decides which parser to call by peeking at the four bytes at the cross-reference offset: xref selects the table parser, and anything else is treated as a stream object. Both parsers enforce a one-million-entry ceiling per subsection to reject forged counts that would otherwise make the parser run excessively.
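As an illustration of the fixed-width entry format and the four-byte peek, the following sketch decodes one 20-byte entry of a traditional table and picks a parser kind. Both function names are hypothetical and the code is independent of CrossRefParser; it shows the byte layout only.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: decode one 20-byte entry of a traditional xref table.
 *
 * @return array{offset: int, gen: int, inUse: bool}
 */
function parseXrefEntry(string $entry): array
{
    if (strlen($entry) !== 20) {
        throw new InvalidArgumentException('xref entry must be exactly 20 bytes');
    }

    // Layout: 10-digit offset, space, 5-digit generation, space, "n" or "f",
    // then a two-byte end-of-line (space+CR, space+LF, or CR+LF).
    if (preg_match('/^(\d{10}) (\d{5}) ([nf])(?: \r| \n|\r\n)$/D', $entry, $m) !== 1) {
        throw new InvalidArgumentException('Invalid xref entry');
    }

    return ['offset' => (int) $m[1], 'gen' => (int) $m[2], 'inUse' => $m[3] === 'n'];
}

/** Hypothetical dispatcher: "xref" at the offset means a table, anything else a stream. */
function xrefKindAt(string $pdfData, int $offset): string
{
    return substr($pdfData, $offset, 4) === 'xref' ? 'table' : 'stream';
}
```

A free entry (type f) carries the next free object number in the offset field, so callers should read the offset as a byte position only when inUse is true.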
StreamDecoder — stream filters
StreamDecoder::decode(string $data, string|array $filter) is static and applies one filter or a chained list of filters. It supports exactly the filters Chrome’s printToPDF emits:
- FlateDecode (zlib, with a raw-deflate fallback)
- ASCIIHexDecode
- ASCII85Decode
Any other filter name throws PdfParseException with Unsupported stream filter. The decoder caps decompressed output at 16 MiB to bound decompression-bomb risk. Oversized output throws rather than allocating without limit. When PdfReader reads a stream and decoding throws, it falls back to the raw stream bytes, so one bad filter does not abort the whole parse.
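The fallback and the size cap can be sketched with PHP's zlib extension. This is an assumption-laden illustration, not StreamDecoder's code: flateDecodeCapped() is a hypothetical name, and unlike the messages listed in this guide it does not distinguish invalid data from oversized output.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper (not part of nextpdf/artisan): inflate FlateDecode data
 * with a raw-deflate fallback and a hard cap on the decoded size.
 */
function flateDecodeCapped(string $data, int $maxBytes = 16 * 1024 * 1024): string
{
    // Try zlib-wrapped data (RFC 1950) first, then headerless raw deflate (RFC 1951).
    foreach (['gzuncompress', 'gzinflate'] as $inflate) {
        // The max_length argument keeps zlib from allocating far past the cap.
        // "@" hides the warning PHP emits on failure; the return value is checked instead.
        $decoded = @$inflate($data, $maxBytes);
        if ($decoded !== false && strlen($decoded) <= $maxBytes) {
            return $decoded;
        }
    }

    throw new RuntimeException('FlateDecode decompression failed or output exceeds the cap');
}
```

The explicit strlen() check is deliberate: the sketch does not rely on max_length alone to enforce an exact ceiling.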
ResourceCollector — deep resource traversal
ResourceCollector is constructed with the PdfReader and called through PdfReader::collectPageResources(). Its traverse() method walks a value recursively, follows every ['type' => 'ref'] reference through getObject(), and records each resolved object once in an array<int, PdfObject> keyed by object number. It caps recursion depth and silently skips references it cannot resolve, so one dangling reference yields a partial collection instead of a hard failure.
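The traversal strategy can be sketched over a plain array that stands in for the reader. Everything named here is hypothetical (collectReachable(), the MAX_DEPTH value, the $objects map replacing getObject()); only the reference shape comes from this guide.

```php
<?php
declare(strict_types=1);

const MAX_DEPTH = 64; // hypothetical cap; the package's actual limit is not stated here

/**
 * Hypothetical sketch of a reference-following traversal.
 *
 * @param array<int, mixed> $objects   object number => parsed value (stand-in for getObject())
 * @param array<int, mixed> $collected accumulates resolved objects, keyed by object number
 */
function collectReachable(mixed $value, array $objects, array &$collected, int $depth = 0): void
{
    if ($depth > MAX_DEPTH || !is_array($value)) {
        return; // scalars carry no references; overly deep chains are cut off
    }

    if (($value['type'] ?? null) === 'ref' && isset($value['num'])) {
        $num = (int) $value['num'];
        if (array_key_exists($num, $collected) || !array_key_exists($num, $objects)) {
            return; // already recorded (this also breaks cycles), or dangling: skip silently
        }
        $collected[$num] = $objects[$num];
        collectReachable($objects[$num], $objects, $collected, $depth + 1);
        return;
    }

    // A dictionary or array: visit every child value.
    foreach ($value as $child) {
        collectReachable($child, $objects, $collected, $depth + 1);
    }
}
```

Recording an object before descending into it is what makes cyclic structures, such as a child pointing back at its /Parent, terminate.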
RevisionExtractor — incremental updates and revisions
A PDF that was signed, annotated, or otherwise edited after creation carries incremental updates. Each edit appends a new cross-reference section and trailer, ending in a %%EOF marker. RevisionExtractor works entirely from static methods over a parsed PdfReader:
- extractRevision(string $pdfData, PdfReader $reader, int $revision) returns the file truncated at the requested revision’s %%EOF boundary. Revision 0 (most recent) returns the whole file; higher indices return progressively older snapshots.
- getRevisionBoundaries(string $pdfData, PdfReader $reader) returns a list<array{revision, startByte, endByte, sizeBytes}> describing the byte range each revision contributed.
This isolation is deliberate. Extracting an older revision exposes only the objects visible up to that point, which blocks hybrid cross-reference attacks where a later revision redefines an earlier object.
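To show what a per-revision byte range looks like, here is a simplified sketch that splits a file on its %%EOF markers. It is an approximation under stated assumptions: findEofBoundaries() is a hypothetical name, endByte is treated as exclusive, and the scan trusts every %%EOF it finds. The real extractor takes the parsed reader as an argument, so it presumably relies on more than a text scan.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical sketch: derive revision byte ranges from %%EOF markers alone.
 * Index 0 is the newest revision, matching the ordering described above.
 *
 * @return list<array{revision: int, startByte: int, endByte: int, sizeBytes: int}>
 */
function findEofBoundaries(string $pdfData): array
{
    $ranges = [];
    $start = 0;
    $pos = 0;

    while (($pos = strpos($pdfData, '%%EOF', $pos)) !== false) {
        $end = $pos + 5;
        // Include the end-of-line that follows the marker, if any.
        if (substr($pdfData, $end, 2) === "\r\n") {
            $end += 2;
        } elseif (in_array(substr($pdfData, $end, 1), ["\r", "\n"], true)) {
            $end += 1;
        }
        $ranges[] = ['startByte' => $start, 'endByte' => $end, 'sizeBytes' => $end - $start];
        $start = $end;
        $pos = $end;
    }

    // Ranges were gathered oldest first; flip them so index 0 is the newest.
    $result = [];
    foreach (array_reverse($ranges) as $index => $range) {
        $result[] = ['revision' => $index] + $range;
    }

    return $result;
}
```

With these ranges, substr($pdfData, 0, $boundaries[$n]['endByte']) yields the file as it stood at revision $n, which is the truncation idea behind extractRevision().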
Walkthrough: inspecting a revision
This procedure inspects the revision history of a PDF that may have been edited after Chrome produced it. The example is shaped for production: it declares strict types, uses full type hints, validates its input, and catches the most specific exception.
- Read the PDF bytes into memory, and reject empty input before constructing the reader.
- Construct \NextPDF\Parser\PdfReader and call parse().
- Read getRevisionCount(). A value of 1 means a single-revision file with no incremental updates.
- For each revision, read its RevisionXRefTable and inspect getActiveObjectCount(), hasRootUpdate(), and getSize().
- Compute per-revision byte ranges with RevisionExtractor::getRevisionBoundaries().
- Catch PdfParseException, the most specific exception the parser raises, and surface a diagnostic message.
```php
<?php

declare(strict_types=1);

namespace App\Pdf\Diagnostics;

use NextPDF\Artisan\Exception\PdfParseException;
use NextPDF\Parser\PdfReader;
use NextPDF\Parser\RevisionExtractor;
use NextPDF\Parser\RevisionXRefTable;

/**
 * Inspect the incremental-update history of a PDF file.
 *
 * @return list<array{revision: int, activeObjects: int, rootUpdate: bool, size: int, startByte: int, endByte: int, sizeBytes: int}>
 *
 * @throws PdfParseException If the file is not a readable PDF.
 */
function inspectRevisions(string $path): array
{
    $pdfData = \file_get_contents($path);

    if ($pdfData === false || $pdfData === '') {
        throw new PdfParseException("Cannot read PDF bytes from path: {$path}");
    }

    $reader = new PdfReader($pdfData);
    $reader->parse();

    $boundaries = RevisionExtractor::getRevisionBoundaries($pdfData, $reader);
    $report = [];

    foreach ($reader->getRevisions() as $table) {
        \assert($table instanceof RevisionXRefTable);

        $index = $table->index;
        $boundary = $boundaries[$index];

        $report[] = [
            'revision' => $index,
            'activeObjects' => $table->getActiveObjectCount(),
            'rootUpdate' => $table->hasRootUpdate(),
            'size' => $table->getSize(),
            'startByte' => $boundary['startByte'],
            'endByte' => $boundary['endByte'],
            'sizeBytes' => $boundary['sizeBytes'],
        ];
    }

    return $report;
}
```

The reader orders revisions from newest (index 0) to oldest. To extract one older snapshot as standalone bytes, for example to diff what an edit changed, call the extractor directly:
```php
<?php

declare(strict_types=1);

namespace App\Pdf\Diagnostics;

use NextPDF\Artisan\Exception\PdfParseException;
use NextPDF\Parser\PdfReader;
use NextPDF\Parser\RevisionExtractor;

/**
 * Extract one revision of a PDF as standalone bytes.
 *
 * @throws PdfParseException If the file is unreadable or the revision index is out of range.
 */
function extractRevision(string $pdfData, int $revision): string
{
    if ($pdfData === '') {
        throw new PdfParseException('Empty PDF input');
    }

    $reader = new PdfReader($pdfData);
    $reader->parse();

    // Throws PdfParseException with an "out of range" message for an invalid index.
    return RevisionExtractor::extractRevision($pdfData, $reader, $revision);
}
```

Failure handling
Every parser failure surfaces as NextPDF\Artisan\Exception\PdfParseException. The message identifies the cause. Use the table below to map a message fragment to the stage that raised it.
| Message fragment | Stage | What it means |
|---|---|---|
| missing %PDF- header | PdfReader::parse() | The bytes are not a PDF, or the input was truncated at the beginning. |
| Cannot find startxref marker / Invalid startxref offset | PdfReader::parse() | The file tail is corrupt, or the cross-reference pointer is out of bounds. |
| Expected 'xref' keyword / Invalid xref subsection header | CrossRefParser::parseXRefTable() | A traditional cross-reference table is malformed. |
| XRef stream ... /Type /XRef / invalid /W array | CrossRefParser::parseXRefStream() | A cross-reference stream is missing required dictionary entries. |
| exceeds limit of (xref or object-stream count) | CrossRefParser / PdfReader | A forged count tripped a denial-of-service guard. |
| Unsupported stream filter | StreamDecoder::decode() | The stream uses a filter outside the supported FlateDecode / ASCIIHexDecode / ASCII85Decode set. |
| FlateDecode decompression failed / output exceeds ... bytes limit | StreamDecoder | The compressed data is invalid or expands past the 16 MiB cap. |
| Maximum nesting depth ... exceeded / Keyword exceeds maximum length | PdfTokenizer | A crafted or pathological structure tripped a tokenizer limit. |
| Page index ... not found / out of range in subtree | PdfReader::getPage() | The requested page index does not exist in the page tree. |
| Revision index ... out of range | PdfReader / RevisionExtractor | The revision index is outside 0 to getRevisionCount() - 1. |
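If you triage many failures, the table can be turned into a small lookup. The sketch below is hypothetical (stageForMessage() is not part of the package) and copies only the literal fragments shown in the table, so the exact wording of real messages may differ. Treat an unknown result as a prompt to read the message yourself.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: map a PdfParseException message to the parser stage
 * named in the table above. Fragments are matched case-insensitively.
 */
function stageForMessage(string $message): string
{
    $stages = [
        'missing %PDF- header' => 'PdfReader::parse()',
        'Cannot find startxref marker' => 'PdfReader::parse()',
        'Invalid startxref offset' => 'PdfReader::parse()',
        "Expected 'xref' keyword" => 'CrossRefParser::parseXRefTable()',
        'Invalid xref subsection header' => 'CrossRefParser::parseXRefTable()',
        'XRef stream' => 'CrossRefParser::parseXRefStream()',
        'invalid /W array' => 'CrossRefParser::parseXRefStream()',
        'Unsupported stream filter' => 'StreamDecoder::decode()',
        'FlateDecode decompression failed' => 'StreamDecoder',
        'output exceeds' => 'StreamDecoder',
        'Maximum nesting depth' => 'PdfTokenizer',
        'Keyword exceeds maximum length' => 'PdfTokenizer',
        'Page index' => 'PdfReader::getPage()',
        'Revision index' => 'PdfReader / RevisionExtractor',
        'exceeds limit of' => 'CrossRefParser / PdfReader',
    ];

    // First matching fragment wins; the generic limit message is checked last.
    foreach ($stages as $fragment => $stage) {
        if (stripos($message, $fragment) !== false) {
            return $stage;
        }
    }

    return 'unknown';
}
```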
When you catch the exception, log the message and the source path, then either rethrow or return a defined error. Do not discard it silently. An empty catch block hides the one piece of information the parser worked to produce.
```php
<?php

declare(strict_types=1);

namespace App\Pdf\Diagnostics;

use NextPDF\Artisan\Exception\PdfParseException;
use NextPDF\Parser\PdfReader;
use Psr\Log\LoggerInterface;

/**
 * Parse a PDF, logging the precise parser-stage message on failure.
 *
 * @throws PdfParseException Rethrown after logging so the caller can decide policy.
 */
function parseWithDiagnostics(string $pdfData, LoggerInterface $logger): PdfReader
{
    if ($pdfData === '') {
        throw new PdfParseException('Empty PDF input');
    }

    $reader = new PdfReader($pdfData);

    try {
        $reader->parse();
    } catch (PdfParseException $exception) {
        $logger->error('PDF parse failed', [
            'reason' => $exception->getMessage(),
            'bytes' => \strlen($pdfData),
        ]);

        throw $exception;
    }

    return $reader;
}
```

Safe defaults
- Always call parse() first. Every accessor on PdfReader assumes the cross-reference chain is loaded. Calling getObject() or getPage() before parse() returns nothing useful.
- Treat the parser as read-only and Chrome-shaped. It targets the subset of PDF syntax that Chrome’s printToPDF emits. Encrypted PDFs, linearized hint tables, and conflicting incremental updates are out of scope by design. Do not extend it into a general PDF repair tool.
- Keep the security limits in place. The nesting, keyword-length, array-size, cross-reference-count, and decompression caps bound resource use on hostile input. A PdfParseException from a limit is the correct outcome for a crafted file. Raising a limit to accept such a file widens the attack surface.
- Default to page 0. getPage() and PageImporter::import() default to the first page. Choose another index only when the workflow deliberately needs it.
- Validate input before constructing the reader. Reject empty or unreadable bytes early, as the examples above do, so a clear application-level error appears before any parser exception.
- Catch PdfParseException, never bare \Exception. It is the single, specific type the parser raises. Catching it keeps unrelated failures from being masked.
See also
- Artisan developer guide: the import boundary above the parser, including ChromeHtmlRenderer, PageImporter, and the architecture layers.
- Artisan API reference: the published method tables for the package’s public surface.
- Artisan troubleshooting: symptom-first guidance for renderer and import failures.
- Chrome renderer setup: configuring the renderer that produces the PDFs this parser reads.
- ISO 32000-2:2020 §7.5 (file structure, cross-reference, incremental updates) and §7.2 (lexical conventions): the specification the tokenizer and cross-reference parser implement. Consult the published standard for the authoritative byte-level format.