
Parse and inspect a PDF for structural facts

This recipe uses the Core inspector Quick fallback to read structural facts from a Portable Document Format (PDF) file. You get the version, page count, encryption flag, signature flag, attachment flag, file size, and risk flags. Quick runs entirely in process, with no Spectrum sidecar and no network access. Use it for fast triage, not as a validator.

composer require nextpdf/core:^3

A PDF file records its version in the file header (ISO 32000-2 §7.5.2). The trailer carries a file identifier (/ID) as two byte strings (ISO 32000-2 §7.5.5). When a signature is present, a signature dictionary stores Distinguished Encoding Rules (DER)-encoded Cryptographic Message Syntax (CMS) SignedData in Contents (ISO 32000-2 §12.8.1). The Quick fallback uses a bounded scan of the document bytes to derive the version, a page-count estimate, and the encryption, signature, and attachment presence flags.
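
To make the idea of a bounded scan concrete, here is a minimal sketch in plain PHP. It is not the NextPDF implementation: the function name `sketchQuickScan`, the 1 MiB default limit, the marker patterns, and the assumption that the header sits within the first 1024 bytes are all illustrative choices.

```php
<?php
declare(strict_types=1);

/**
 * Illustrative only: NOT the NextPDF Quick fallback.
 * Scans at most $limit bytes and reports coarse structural markers.
 *
 * @return array{version: ?string, pageEstimate: int, encrypted: bool}
 */
function sketchQuickScan(string $bytes, int $limit = 1_048_576): array
{
    // Bound the work: never look past the first $limit bytes.
    $window = substr($bytes, 0, $limit);

    // The header looks like "%PDF-1.7"; capture the version digits.
    // Assumption: the header appears within the first 1024 bytes.
    $version = preg_match('/%PDF-(\d\.\d)/', substr($window, 0, 1024), $m) === 1
        ? $m[1]
        : null;

    // Count "/Type /Page" markers while skipping "/Type /Pages" tree nodes.
    $pages = preg_match_all('/\/Type\s*\/Page(?![A-Za-z])/', $window);

    return [
        'version'      => $version,
        'pageEstimate' => $pages === false ? 0 : $pages,
        // Presence of the marker only; nothing is decrypted.
        'encrypted'    => str_contains($window, '/Encrypt'),
    ];
}

// A hand-written fragment, not a valid PDF, used only to exercise the scan.
$sample = "%PDF-1.7\n1 0 obj << /Type /Pages /Count 2 >> endobj\n"
    . "2 0 obj << /Type /Page >> endobj\n3 0 obj << /Type /Page >> endobj\n";

$facts = sketchQuickScan($sample);
```

A marker count like this is why the page count is an estimate: page objects that the scan cannot see, or extra markers planted in a malformed file, change the number without changing the real page tree.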

Create new Inspector(), then call ->inspect(string $pdfData, InspectConfig::quick()). It returns an InspectResult with $pdfVersion, $pageCount, $isEncrypted, $hasSigned, $hasAttachments, $fileSizeBytes, $riskFlags, and the hasRisks() helper.

<?php
declare(strict_types=1);
require_once __DIR__ . '/vendor/autoload.php';
use NextPDF\Inspect\InspectConfig;
use NextPDF\Inspect\Inspector;
// file_get_contents() returns false on failure; inspect() expects a string.
$pdf = file_get_contents(__DIR__ . '/document.pdf');
if ($pdf === false) {
    fwrite(STDERR, "Cannot read document.pdf\n");
    exit(1);
}
$result = (new Inspector())->inspect($pdf, InspectConfig::quick());
// Print the structural facts on one line.
printf(
    "v%s, %d page(s), encrypted=%s, signed=%s\n",
    $result->pdfVersion ?? '?',
    $result->pageCount,
    $result->isEncrypted ? 'yes' : 'no',
    $result->hasSigned ? 'yes' : 'no',
);

This self-contained program runs in the cookbook harness. It mirrors examples/39-parse-and-inspect-pdf.php: it builds a small multi-page PDF in memory, reads its structural facts with the Quick fallback, and routes on those facts, never on a trust verdict. The routing branch is illustrative. Replace it with your own pipeline, verifier queue, and quarantine.

<?php
declare(strict_types=1);
require_once __DIR__ . '/vendor/autoload.php';
use NextPDF\Core\Document;
use NextPDF\Inspect\InspectConfig;
use NextPDF\Inspect\Inspector;
// A self-contained input so the program runs with no external file.
$doc = Document::createStandalone();
$doc->setTitle('Parse-and-inspect demo');
$doc->setAuthor('NextPDF Cookbook');
$doc->addPage();
$doc->setFont('helvetica', '', 12);
$doc->cell(0, 10, 'Page one of the parse-and-inspect demonstration.', newLine: true);
$doc->addPage();
$doc->cell(0, 10, 'Page two.', newLine: true);
$pdf = $doc->getPdfData();
$result = (new Inspector())->inspect($pdf, InspectConfig::quick());
echo 'PDF version : ' . ($result->pdfVersion ?? 'unknown') . "\n";
echo 'Pages : ' . $result->pageCount . "\n";
echo 'Encrypted : ' . ($result->isEncrypted ? 'yes' : 'no') . "\n";
echo 'Signed : ' . ($result->hasSigned ? 'yes' : 'no') . "\n";
echo 'Attachments : ' . ($result->hasAttachments ? 'yes' : 'no') . "\n";
echo 'File size : ' . $result->fileSizeBytes . " bytes\n";
echo 'Risk flags : ' . ($result->hasRisks() ? count($result->riskFlags) : 0) . "\n";
// Route on structural facts, not trust verdicts. Replace these calls with
// your own pipeline / verifier queue / quarantine.
if ($result->isEncrypted) {
    // $pipeline->decryptThenContinue($pdf);
    echo "Route: decrypt-then-continue\n";
} elseif ($result->hasSigned) {
    // $verifierQueue->enqueue($pdf); // see the signature-inspect recipe
    echo "Route: enqueue for cryptographic verification\n";
} elseif ($result->hasRisks()) {
    // $quarantine->hold($pdf, $result->riskFlags);
    echo "Route: quarantine (risk flags present)\n";
} else {
    // $pipeline->continue($pdf);
    echo "Route: continue (no risks, unsigned, unencrypted)\n";
}
// The harness sets NEXTPDF_COOKBOOK_OUTPUT and runs this script under the
// semantic profile; emit the document to the side-channel.
$out = getenv('NEXTPDF_COOKBOOK_OUTPUT');
file_put_contents($out !== false && $out !== '' ? $out : __DIR__ . '/inspected.pdf', $pdf);

Expected standard output (STDOUT). The version and size depend on the build; the demo PDF is unencrypted, unsigned, and risk-free:

PDF version : <version>
Pages : 2
Encrypted : no
Signed : no
Attachments : no
File size : <n> bytes
Risk flags : 0
Route: continue (no risks, unsigned, unencrypted)
  • Quick is triage, not validation. It reports what is present and what is absent. It does not verify signatures, decrypt content, or assert conformance. Treat the result as routing input.
  • Page count is an estimate. The Quick fallback counts page-object markers. A deliberately malformed object graph can skew the count. Use the Spectrum-backed depths when you need an exact count.
  • Standard/Full need the sidecar. new InspectConfig() (depth Standard) and InspectConfig::full() require the Spectrum sidecar. They throw INSPECT-SIDECAR-001 when it is unavailable and do not silently degrade to Quick.
  • Empty input. Passing an empty string throws an inspect exception with “PDF data must not be empty”.
  • Encryption flag scope. The flag reflects an /Encrypt trailer entry. A flagged file is not decrypted by the inspector.
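
The empty-input and sidecar caveats above both surface as exceptions, so a caller may want to convert them into a rejection record before routing. The sketch below shows one way to do that. The `tryInspect` helper and the stand-in closure are hypothetical and not part of NextPDF; the helper takes any callable so that it runs without the library installed.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical wrapper, not part of NextPDF: runs an inspection callable and
 * turns any failure into a rejection record instead of an uncaught exception.
 *
 * @param callable(string): object $inspect
 * @return array{ok: bool, result: ?object, error: ?string}
 */
function tryInspect(string $pdfData, callable $inspect): array
{
    // Reject empty input up front rather than relying on the thrown exception.
    if ($pdfData === '') {
        return ['ok' => false, 'result' => null, 'error' => 'empty input'];
    }

    try {
        return ['ok' => true, 'result' => $inspect($pdfData), 'error' => null];
    } catch (\Throwable $e) {
        // Catches any failure, including a sidecar-unavailable error.
        return ['ok' => false, 'result' => null, 'error' => $e->getMessage()];
    }
}

// Stand-in inspector for demonstration; it always fails the way an
// unavailable sidecar would.
$failing = static function (string $pdf): object {
    throw new \RuntimeException('INSPECT-SIDECAR-001');
};

$empty = tryInspect('', $failing);
$down  = tryInspect('%PDF-1.7', $failing);
```

With the library installed, the callable could be `fn (string $pdf) => (new Inspector())->inspect($pdf, InspectConfig::quick())`. Catching `\Throwable` is deliberately broad here because this recipe does not name the exception class; narrow it once you know the concrete type.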

The Quick fallback uses a bounded scan, not a full parse. Use it to pre-route high volumes of incoming files before heavier processing.

The inspector runs in process and reads only structural markers. No document bytes leave the host, and no document text is extracted. A risk flag, such as embedded JavaScript, is an advisory routing signal. It is not an assertion that the file is safe or unsafe.

Statement | Spec | Clause | reference_id
The file header records the PDF version. | ISO 32000-2 | §7.5.2 |
The trailer /ID is a file identifier of two byte strings. | ISO 32000-2 | §7.5.5 |
A signature dictionary Contents holds DER CMS SignedData. | ISO 32000-2 | §12.8.1 |

This recipe reports structural facts only. It does not assert that the file is valid, safe, or conformant.

Standard and Full inspection depths run through the Spectrum sidecar. They add richer object, font, and image analysis. The Quick fallback documented here is Core and offline.