Contracts / Extraction
At a glance
The extraction domain defines the contracts you use to read and validate Portable Document Format (PDF) files and turn their content into structured data. It includes the inspector, compliance validators, PDF/A manager, imported-object contracts, embedding and vector-index contracts, and the e-invoice validator sub-namespace.
Install
```bash
composer require nextpdf/core:^3
```

Conceptual overview
InspectorInterface reads raw PDF bytes and returns a structured InspectResult. The result lists the objects in the file. Use this contract for any tool that reads a PDF the engine did not write.
ExternalComplianceValidatorInterface connects the engine to an external checker such as veraPDF. The checker tests PDF/A and PDF/Universal Accessibility (PDF/UA). When no checker is configured, the null implementation returns an “unavailable” result. A site without veraPDF still runs. ProfileValidatorInterface checks the runtime against a deployment profile, including required and advised extensions. It returns a typed verdict.
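The null-implementation behaviour described above could look roughly like the sketch below. `Verdict`, `ComplianceChecker`, `NullComplianceChecker`, and `carrierStatus()` are hypothetical stand-ins written for this example; they are not the NextPDF types, and only the method names `isAvailable()` and `validate()` are borrowed from the contract.

```php
<?php

declare(strict_types=1);

// Hypothetical stand-ins for illustration; these are not the NextPDF types.
enum Verdict: string
{
    case Conformant = 'conformant';
    case NonConformant = 'non-conformant';
    case Unavailable = 'unavailable';
}

interface ComplianceChecker
{
    public function isAvailable(): bool;

    public function validate(string $pdfPath): Verdict;
}

// Null object: used when no external checker such as veraPDF is configured.
final class NullComplianceChecker implements ComplianceChecker
{
    public function isAvailable(): bool
    {
        return false;
    }

    public function validate(string $pdfPath): Verdict
    {
        // Never claims a pass or a failure; it only reports that no check ran.
        return Verdict::Unavailable;
    }
}

/** Map a checker outcome to a three-state status instead of a boolean. */
function carrierStatus(ComplianceChecker $checker, string $pdfPath): string
{
    if (!$checker->isAvailable()) {
        return 'skipped';
    }

    return $checker->validate($pdfPath) === Verdict::Conformant ? 'pass' : 'fail';
}
```

A three-state status keeps "no check ran" distinct from "the check failed", which a plain boolean cannot express.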
PdfAManagerInterface keeps a PDF/A file in spec while the writer builds it. It blocks JavaScript, JavaScript form actions, and built-in encryption. PDF/A bars all three. It also checks that every font is embedded, sets in-spec metadata, and writes the needed objects before the catalog. The real class ships in the Pro edition. Core finds it with class_exists() and casts it to the contract. The open-source engine has no paid dependency.
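The `class_exists()` lookup can be illustrated with a minimal sketch. The `ArchiveManager` contract, the `FallbackArchiveManager` class, and the `resolveArchiveManager()` helper are hypothetical names invented for this example, and the sketch assumes the optional class has a no-argument constructor. It shows the general pattern, not the engine's actual resolution code.

```php
<?php

declare(strict_types=1);

// Hypothetical contract and class names, used only to illustrate the lookup.
interface ArchiveManager
{
    public function name(): string;
}

final class FallbackArchiveManager implements ArchiveManager
{
    public function name(): string
    {
        return 'fallback';
    }
}

/**
 * Resolve an optional implementation by class name.
 *
 * Returns the fallback when the class is not installed or does not
 * implement the contract, so the caller never hard-depends on it.
 * Assumes the optional class can be constructed without arguments.
 */
function resolveArchiveManager(string $optionalClass, ArchiveManager $fallback): ArchiveManager
{
    // class_exists() triggers autoloading by default, so an installed
    // package is found without an explicit require.
    if (!class_exists($optionalClass)) {
        return $fallback;
    }

    // Check the contract on the class name before instantiating anything.
    if (!is_a($optionalClass, ArchiveManager::class, true)) {
        return $fallback;
    }

    return new $optionalClass();
}
```

Because the optional class is referenced only by its string name, the calling code loads and runs even when the package that provides it is absent.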
Two contracts cover imported objects: ImportedFormObjectInterface and EmbeddedPdfObjectInterface. They provide typed access to objects read from an existing PDF so the engine can re-embed them. The lossless path keeps raw dictionary bytes. The fallback path provides a parsed dictionary array for objects taken from object streams. Each re-embedded object is a PDF indirect object. An object number and a generation number identify it, as defined by ISO 32000-2 §7.3.10.
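To make the two paths concrete, here is a small sketch that wraps a dictionary in indirect-object syntax (`N G obj … endobj`). The function `renderIndirectObject()` and its parameters are hypothetical. The fallback serializer is deliberately tiny and handles only integer values and unescaped name values, which is far less than a real PDF writer needs.

```php
<?php

declare(strict_types=1);

/**
 * Wrap a dictionary body in PDF indirect-object syntax.
 *
 * $rawBytes models the lossless path; $parsed models the fallback path for
 * objects that came from an object stream. Both parameters are hypothetical,
 * and the fallback serializer handles only integers and simple names.
 *
 * @param array<string, int|string> $parsed
 */
function renderIndirectObject(int $objectNumber, int $generation, ?string $rawBytes, array $parsed): string
{
    if ($rawBytes !== null) {
        // Lossless path: reuse the source bytes untouched.
        $body = $rawBytes;
    } else {
        // Fallback path: rebuild the dictionary from parsed entries.
        $entries = [];
        foreach ($parsed as $key => $value) {
            $entries[] = '/' . $key . ' ' . (is_int($value) ? (string) $value : '/' . $value);
        }
        $body = '<< ' . implode(' ', $entries) . ' >>';
    }

    // The object number and generation number identify the object.
    return sprintf("%d %d obj\n%s\nendobj\n", $objectNumber, $generation, $body);
}
```

Calling `renderIndirectObject(12, 0, null, ['Type' => 'XObject', 'Length' => 42])` yields an object headed `12 0 obj` whose body is rebuilt from the parsed array. Passing raw bytes instead copies them through unchanged.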
The embedding contracts support search. EmbeddingServiceInterface turns text into a dense vector and reports the vector dimension and model name, so callers can adapt at runtime. The Pro edition runs a central processing unit (CPU) model. The Enterprise edition runs a graphics processing unit (GPU) model. VectorIndexInterface builds and searches a nearest-neighbor index. It is the small in-process index for core use. Larger search stays in an Enterprise-only contract.
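As a rough illustration of what a nearest-neighbor lookup over dense vectors involves, the sketch below ranks stored vectors by cosine similarity. `cosineSimilarity()` and `nearest()` are hypothetical helpers. This is a plain linear scan chosen for clarity; it is not the index structure behind VectorIndexInterface and makes no claim about how the engine searches.

```php
<?php

declare(strict_types=1);

/**
 * Cosine similarity between two equal-length vectors.
 *
 * @param list<float> $a
 * @param list<float> $b
 */
function cosineSimilarity(array $a, array $b): float
{
    if (count($a) !== count($b)) {
        throw new InvalidArgumentException('Vectors must share one dimension.');
    }

    $dot = 0.0;
    $normA = 0.0;
    $normB = 0.0;
    foreach ($a as $i => $value) {
        $dot += $value * $b[$i];
        $normA += $value * $value;
        $normB += $b[$i] * $b[$i];
    }

    if ($normA === 0.0 || $normB === 0.0) {
        return 0.0; // A zero vector has no direction to compare.
    }

    return $dot / sqrt($normA * $normB);
}

/**
 * Brute-force nearest-neighbor search over an id => vector map.
 *
 * @param array<string, list<float>> $index
 * @param list<float>                $query
 * @return list<string> Ids ordered from most to least similar.
 */
function nearest(array $index, array $query, int $k): array
{
    $scores = [];
    foreach ($index as $id => $vector) {
        $scores[$id] = cosineSimilarity($vector, $query);
    }
    arsort($scores); // Highest similarity first, keys preserved.

    return array_map('strval', array_slice(array_keys($scores), 0, $k));
}
```

The dimension is never hard-coded here: it is taken from the vectors themselves, which mirrors the advice to read the dimension at runtime.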
The EInvoice group holds the cross-tier e-invoice checker. ValidatorInterface runs preflight checks on a Cross Industry Invoice (CII) or Universal Business Language (UBL) payload. SchematronRunnerInterface runs the business-rule pass. ValidationResult collects findings and rule violations. The checker must reject bad input with a result, not an exception. It must also guard against payloads with a Document Type Declaration (DOCTYPE) and against oversized payloads.
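The "result, not exception" rule and the two guards can be sketched as a pre-parse check. `PreflightResult` and `preflight()` are hypothetical names for this example, and the 5 MiB default limit is an illustrative assumption rather than a documented value. The real ValidationResult is not reproduced here.

```php
<?php

declare(strict_types=1);

// Hypothetical result type; the real ValidationResult is not reproduced here.
final readonly class PreflightResult
{
    /** @param list<string> $errors */
    public function __construct(public bool $isValid, public array $errors = [])
    {
    }
}

/**
 * Reject risky payloads before any XML parser sees them.
 *
 * The 5 MiB default is an illustrative assumption, not a documented limit.
 */
function preflight(string $xml, int $maxBytes = 5 * 1024 * 1024): PreflightResult
{
    $errors = [];

    // Guard 1: refuse oversized payloads by byte length.
    if (strlen($xml) > $maxBytes) {
        $errors[] = 'Payload exceeds the size limit.';
    }

    // Guard 2: a DOCTYPE is what enables entity-expansion and
    // external-entity attacks, so refuse it outright.
    if (stripos($xml, '<!DOCTYPE') !== false) {
        $errors[] = 'DOCTYPE declarations are not allowed.';
    }

    // Bad input produces a failing result; nothing is thrown.
    return new PreflightResult($errors === [], $errors);
}
```

Callers then branch on `$result->isValid` and read `$result->errors`, with no try/catch around the call.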
API surface
| Type | Kind | Key members | Stability | Since |
|---|---|---|---|---|
| InspectorInterface | interface | inspect(string, InspectConfig): InspectResult | experimental | 2.2.0 |
| ExternalComplianceValidatorInterface | interface | validate(string, ComplianceFlavour), isAvailable() | experimental | 2.4.0 |
| ProfileValidatorInterface | interface | validate(DeploymentProfile): DeploymentProfileResult | experimental | 2.4.0 |
| PdfAManagerInterface | interface | validateNoJavaScript(), validateFont(), validateNoEncryption(), applyOutputProfile(), writeRequiredObjects() | stable | 1.10.0 |
| ImportedFormObjectInterface | interface | getWidth(), getHeight(), getEmbeddedObjects(), getResourcesDict(), getMediaBox(), getContentStream() | stable | 1.8.0 |
| EmbeddedPdfObjectInterface | interface | getRawDictionaryBytes(), getRawStreamData(), getDictionary() | stable | 1.8.0 |
| EmbeddingServiceInterface | interface | embed(), batchEmbed(), getDimension(), getModelName() | experimental | 2.1.0 |
| VectorIndexInterface | interface | build(), search(), delete(), count() | experimental | 2.1.0 |
| EInvoice\ValidatorInterface | interface | validate(string, ValidatorContext): ValidationResult | experimental | 5.1.0 |
| EInvoice\ValidationResult | final readonly class | $isValid, getErrors(), getWarnings(), fail() | experimental | 5.1.0 |
The EInvoice namespace also publishes SchematronRunnerInterface, ProfileInterface, ValidationFinding, RuleViolation, and the ProfileType, RuleSeverity, and ValidationFindingLevel enums.
Code sample — Quick start
```php
<?php

declare(strict_types=1);

require_once __DIR__ . '/../../vendor/autoload.php';

use NextPDF\Contracts\InspectorInterface;
use NextPDF\Inspect\InspectConfig;

/**
 * Inspect a PDF and return the structured result that lists its objects.
 *
 * @param InspectorInterface $inspector A configured inspector.
 * @param string             $pdfData   Raw PDF bytes.
 */
function describe(InspectorInterface $inspector, string $pdfData): \NextPDF\Inspect\InspectResult
{
    return $inspector->inspect($pdfData, new InspectConfig());
}
```

The function depends on the contract. Any inspector implementation can satisfy it.
Code sample — Production
```php
<?php

declare(strict_types=1);

require_once __DIR__ . '/../../vendor/autoload.php';

use NextPDF\Contracts\EInvoice\ValidatorInterface;
use NextPDF\Contracts\EInvoice\ValidatorContext;
use NextPDF\Contracts\ExternalComplianceValidatorInterface;
use NextPDF\ValueObjects\ComplianceFlavour;
use Psr\Log\LoggerInterface;

final readonly class InvoiceConformanceService
{
    public function __construct(
        private ValidatorInterface $invoiceValidator,
        private ExternalComplianceValidatorInterface $pdfaValidator,
        private LoggerInterface $logger,
    ) {}

    /**
     * Validate the invoice XML, then the PDF/A-3 carrier.
     *
     * @param string $xml     The CII or UBL invoice payload.
     * @param string $pdfPath Absolute path to the PDF/A-3 carrier.
     */
    public function validate(string $xml, string $pdfPath, ValidatorContext $ctx): bool
    {
        $result = $this->invoiceValidator->validate($xml, $ctx);

        if (!$result->isValid) {
            $this->logger->warning('Invoice XML invalid', [
                'errors' => \count($result->getErrors()),
            ]);

            return false;
        }

        if (!$this->pdfaValidator->isAvailable()) {
            $this->logger->info('PDF/A validator unavailable; skipping carrier check.');

            return true;
        }

        $carrier = $this->pdfaValidator->validate($pdfPath, ComplianceFlavour::PdfA3b);

        return $carrier->isConformant();
    }
}
```

The service handles the unavailable-validator case explicitly instead of assuming a validator is present.
Edge cases & gotchas
- `EInvoice\ValidatorInterface::validate()` returns a failing `ValidationResult` for malformed input. It does not throw for well-formedness violations. Check `$isValid`; do not wrap the call in a try/catch for that case.
- `ExternalComplianceValidatorInterface::isAvailable()` must be checked before you rely on a verdict. The null implementation returns “unavailable”. Treating that as “non-conformant” produces false negatives.
- `EmbeddedPdfObjectInterface::getRawDictionaryBytes()` returns `null` for objects taken from an object stream. Fall back to `getDictionary()`. Do not assume raw bytes exist.
- `EmbeddingServiceInterface::getDimension()` differs by tier. Code that allocates a fixed-width vector must read the dimension at runtime, not hard-code it.
- `VectorIndexInterface::build()` requires vector and id lists with equal length and consistent dimensions. A mismatch raises `InvalidArgumentException`. Validate the lists before you build the index.
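The pre-build validation for the vector index can be sketched as a standalone check. `assertBuildable()` is a hypothetical helper name, not part of the contract. It throws the same exception class the contract is described as raising, so a caller can surface the problem before calling `build()`.

```php
<?php

declare(strict_types=1);

/**
 * Check that ids and vectors line up before handing them to an index.
 *
 * @param list<string>      $ids
 * @param list<list<float>> $vectors
 * @return int The shared dimension (0 for empty input).
 */
function assertBuildable(array $ids, array $vectors): int
{
    if (count($ids) !== count($vectors)) {
        throw new InvalidArgumentException(
            sprintf('Got %d ids but %d vectors.', count($ids), count($vectors))
        );
    }

    $dimension = null;
    foreach ($vectors as $position => $vector) {
        // The first vector fixes the expected dimension for all the others.
        $dimension ??= count($vector);

        if (count($vector) !== $dimension) {
            throw new InvalidArgumentException(
                sprintf('Vector %d has %d components, expected %d.', $position, count($vector), $dimension)
            );
        }
    }

    return $dimension ?? 0;
}
```

Returning the shared dimension lets the caller compare it with the value reported by the embedding service instead of a hard-coded constant.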
Performance
Inspection and validation costs scale with document size and object count. The performance_budget of 1500 ms wall time and 64 MB peak memory covers one moderate-sized document. An external veraPDF call adds its own process time. That time is outside the engine budget and should run off the request path. Embedding cost scales with text length and is far cheaper in a batch than in a loop, especially on a GPU model. Prefer batchEmbed(). Vector search is sublinear in index size for the in-process index. The reproducibility profile is structural. A validation report records a timestamp and an environment fingerprint. Two runs differ in those fields while the conformance verdict stays identical.
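A structural comparison of two reports might look like the sketch below. The report is modelled as a plain array, and the key names `timestamp` and `environment` are assumptions made for this example. The real report shape is not documented here, so adjust the ignored keys to match it.

```php
<?php

declare(strict_types=1);

/**
 * Compare two validation reports while ignoring run-specific fields.
 *
 * @param array<string, mixed> $first
 * @param array<string, mixed> $second
 * @param list<string>         $volatileKeys Assumed names of fields that vary per run.
 */
function sameVerdict(array $first, array $second, array $volatileKeys = ['timestamp', 'environment']): bool
{
    $ignored = array_flip($volatileKeys);

    // Drop the fields that are expected to differ between runs.
    $stableFirst = array_diff_key($first, $ignored);
    $stableSecond = array_diff_key($second, $ignored);

    // Sort by key so that top-level field order does not affect the comparison.
    ksort($stableFirst);
    ksort($stableSecond);

    return $stableFirst === $stableSecond;
}
```

A regression test can then assert that two runs over the same input agree on everything except the timestamp and fingerprint.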
Security notes
Extraction reads documents the engine did not create, so every input is untrusted. The inspector and e-invoice validator both parse externally supplied bytes. Before parsing, the e-invoice validator must reject payloads that contain a Document Type Declaration (DOCTYPE), exceed the size limit, or contain forbidden control characters. This prevents Extensible Markup Language (XML) external-entity and billion-laughs attacks. Imported-object re-embedding copies bytes from a foreign PDF. A malicious source object can carry hostile content, so re-embedding preserves bytes without executing them. The PDF/A manager rejects JavaScript, JavaScript form actions, and encryption because all three are prohibited in the profile and are abuse vectors in a long-lived archival document. Treat inspected content, imported objects, and invoice XML as hostile input throughout.
Conformance
| Claim | Standard | Clause | Evidence |
|---|---|---|---|
| PDF/A-4 prohibits JavaScript and JavaScript form actions; the PDF/A manager rejects both. | ISO 19005-4 | §6.7.1 | cited by clause (not in corpus) |
| Every re-embedded object is a PDF indirect object identified by object number and generation. | ISO 32000-2 | §7.3.10 | glossary-pinned |
ISO 19005-4 is cited by clause. It is not in the verifiable citation corpus, so no reference_id is recorded. The ISO 32000-2 indirect-object claim is glossary-pinned. Both claims are paraphrased. The engine reproduces no normative text.
Commercial context
Core defines and freezes the extraction contracts. The production code behind PdfAManagerInterface, EmbeddingServiceInterface, and VectorIndexInterface ships in the Pro and Enterprise editions, including CPU and GPU embedding models and the full PDF/A enforcement path. Core resolves these at runtime with class_exists(). The open-source engine therefore carries no commercial dependency, and the application programming interface (API) does not change on upgrade.
See also
- Contracts: 41 public interfaces — the Service Provider Interface (SPI) overview and stability tiers.
- Contracts / Document — the document contracts that produce the PDF/A carrier.
- Contracts / Signing — signed archival that pairs with PDF/A enforcement.
- Inspect — the inspector implementation behind `InspectorInterface`.
- Text — text extraction that consumes inspected objects.
- Metadata — PDF/A metadata configured by the manager.