Produce text content that downstream tools can extract
At a glance
NextPDF Core is a PDF producer. It ships no public PDF-to-text reader. In the
Core context, “extract text content” means you create the document so its text
is extractable. The glyphs carry a /ToUnicode CMap, and the document has a
tagged logical structure. A conforming reader or downstream extraction tool can
then recover the Unicode text in reading order.
Reading text from an arbitrary third-party PDF is consumer work. Use the Inspect module sidecar or an external tool for that task, not the Core producer surface.
Install
composer require nextpdf/core:^3
Conceptual overview
Text-showing operators in the content stream place text on a page
(ISO 32000-2 §9.4.3). Glyph codes are not Unicode. A /ToUnicode CMap lets a
reader map those codes back to Unicode for extraction (ISO 32000-2 §9.10.2). A
tagged structure tree records the logical reading order, so extraction can
recover text in document order instead of paint order (ISO 32000-2 §14.8).
enableTaggedPdf() builds that structure tree and keeps the /ToUnicode CMap
for embedded-subset fonts. Together, those features make the output reliably
extractable.
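To make the glyph-code-to-Unicode step concrete, here is a minimal sketch of how a reader might apply a /ToUnicode CMap. It is not part of NextPDF: the helpers parseBfChar() and decodeHexString() are hypothetical names, the CMap fragment is hand-written, and the sketch assumes fixed two-byte codes and handles only bfchar entries (real CMaps also use bfrange and codespace ranges).

```php
<?php
declare(strict_types=1);

/**
 * Parse the bfchar section of a ToUnicode CMap into a lookup table.
 * Hypothetical helper for illustration; not a NextPDF API.
 *
 * @return array<string, string> upper-case hex glyph code => UTF-8 text
 */
function parseBfChar(string $cmap): array
{
    $map = [];
    // Only look between beginbfchar and endbfchar.
    if (preg_match('/beginbfchar(.*?)endbfchar/s', $cmap, $section) !== 1) {
        return $map;
    }
    // Each entry is <source code> <UTF-16BE target>, both written in hex.
    preg_match_all(
        '/<([0-9A-Fa-f]+)>\s*<((?:[0-9A-Fa-f]{2})+)>/',
        $section[1],
        $entries,
        PREG_SET_ORDER
    );
    foreach ($entries as [, $code, $utf16]) {
        $map[strtoupper($code)] = iconv('UTF-16BE', 'UTF-8', hex2bin($utf16));
    }
    return $map;
}

/**
 * Decode a hex string of fixed-width glyph codes using the lookup table.
 * Codes without a mapping become U+FFFD, which is what makes extraction
 * unreliable when the CMap is missing or incomplete.
 *
 * @param array<string, string> $map
 */
function decodeHexString(string $hex, array $map, int $codeBytes = 2): string
{
    $text = '';
    foreach (str_split(strtoupper($hex), $codeBytes * 2) as $code) {
        $text .= $map[$code] ?? "\u{FFFD}";
    }
    return $text;
}

// Hand-written fragment: glyph 0001 is "H", glyph 0002 is "i".
$cmap = <<<'CMAP'
2 beginbfchar
<0001> <0048>
<0002> <0069>
endbfchar
CMAP;

$map = parseBfChar($cmap);
echo decodeHexString('00010002', $map), "\n"; // prints "Hi"
```

The glyph codes 0001 and 0002 carry no meaning on their own. Only the mapping turns them back into text, which is why retaining the CMap matters for extraction.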
API surface
Document::enableTaggedPdf(string $lang = 'en') builds the structure tree and
sets the conformance mode that preserves the /ToUnicode CMap.
Document::setLanguage(string $lang) records the BCP 47 language tag. Call both
methods before you write content. Then write the text with the usual setFont() /
cell() / multiCell() surface.
Code sample — Quick start
<?php
declare(strict_types=1);
require_once __DIR__ . '/vendor/autoload.php';
use NextPDF\Core\Document;
$doc = Document::createStandalone();
$doc->setLanguage('en');
$doc->enableTaggedPdf('en'); // structure tree + ToUnicode retention
$doc->addPage();
$doc->setFont('helvetica', '', 12);
$doc->multiCell(0, 7, 'This text is extractable by a downstream reader.');
file_put_contents(__DIR__ . '/extractable.pdf', $doc->getPdfData());
Code sample — Production
This self-contained program runs in the harness. It mirrors
examples/38-extract-text-content.php.
It creates a tagged document whose text carries a /ToUnicode CMap and a
logical reading order. A downstream extractor can then recover the Unicode text
in order.
<?php
declare(strict_types=1);
require_once __DIR__ . '/vendor/autoload.php';
use NextPDF\Core\Document;
$paragraphs = [
    'NextPDF produces documents whose text content is extractable.',
    'A tagged structure tree records the logical reading order.',
    'The ToUnicode CMap lets a reader recover Unicode from glyph codes.',
];
$doc = Document::createStandalone();
$doc->setTitle('Extractable text content');
$doc->setAuthor('NextPDF Cookbook');
$doc->setLanguage('en'); // BCP 47; validated on enableTaggedPdf()
// Configure tagged mode BEFORE content so the structure tree captures the
// text in reading order and the /ToUnicode CMap is retained.
$doc->enableTaggedPdf('en');
$doc->addPage();
$doc->setFont('helvetica', '', 12);
foreach ($paragraphs as $p) {
    $doc->multiCell(0, 7, $p); // captured in reading order
    $doc->ln(2);
}
$pdf = $doc->getPdfData();
// $pdf contains a /StructTreeRoot and per-font /ToUnicode CMaps; an external
// extractor (or the Inspect sidecar) recovers the Unicode text in order.
echo "Wrote a tagged PDF with extractable text content\n";
echo 'Paragraphs authored: ' . count($paragraphs) . "\n";
echo "Text is recoverable via the /ToUnicode CMap + tagged reading order.\n";
// The harness sets NEXTPDF_COOKBOOK_OUTPUT and runs this script under the
// semantic profile; emit the document to the side-channel.
$out = getenv('NEXTPDF_COOKBOOK_OUTPUT');
file_put_contents($out !== false && $out !== '' ? $out : __DIR__ . '/extractable.pdf', $pdf);
Expected STDOUT:
Wrote a tagged PDF with extractable text content
Paragraphs authored: 3
Text is recoverable via the /ToUnicode CMap + tagged reading order.
Edge cases & gotchas
- Producer, not reader. Core has no public extractText(). Reading text out of an existing third-party PDF is a consumer task. Use the Inspect module with the Spectrum sidecar, or use an external extraction tool. This recipe makes your output extractable.
- Configure tagging first. Call enableTaggedPdf() before you write content, so the structure tree captures the text in reading order. A call made after content is added does not tag the prior content.
- Invalid language tag. enableTaggedPdf() validates the BCP 47 tag and throws InvalidConfigException when the tag is invalid. Use a registered tag, for example en, zh-Hant-TW, or ja.
- Plain (untagged) output. Without enableTaggedPdf(), plain output may suppress the /ToUnicode CMap for predefined CMap fonts to reduce size. Extraction is then unreliable for those fonts. Tag the document when you need extractable text.
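As a quick guard against the untagged-output case above, a build script could look for the relevant names in the produced bytes. The following is a rough sketch, not a NextPDF feature: findExtractionMarkers() is a hypothetical helper, and the sample bytes are a stand-in so the snippet runs without the library. It is a heuristic only. If a producer stores these dictionaries inside compressed object streams, the names do not appear in the raw bytes, so a negative result calls for a real extractor or the Inspect sidecar instead of a conclusion.

```php
<?php
declare(strict_types=1);

/**
 * Report which extraction-related names appear in raw PDF bytes.
 * Hypothetical helper for illustration; a plain substring scan,
 * not a PDF parser.
 *
 * @return array{structTreeRoot: bool, toUnicode: bool}
 */
function findExtractionMarkers(string $pdfBytes): array
{
    return [
        'structTreeRoot' => strpos($pdfBytes, '/StructTreeRoot') !== false,
        'toUnicode'      => strpos($pdfBytes, '/ToUnicode') !== false,
    ];
}

// Stand-in bytes (assumption): in practice, pass the string returned
// by $doc->getPdfData() instead.
$sample = "%PDF-2.0\n1 0 obj\n<< /Type /Catalog /StructTreeRoot 2 0 R >>\nendobj\n";

$markers = findExtractionMarkers($sample);
foreach ($markers as $name => $found) {
    echo $name . ': ' . ($found ? 'found' : 'not found') . "\n";
}
```

With the stand-in bytes, the scan finds the structure tree root but no /ToUnicode entry, which is the kind of result that should prompt a closer look at the fonts in use.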
Performance
Tagging adds the structure tree and retains /ToUnicode CMaps, so output size
increases modestly. The cost scales with content volume and does not change the
single-pass rendering model.
Security notes
Tagged text content is machine-readable by design. Do not place secrets in document text and expect them to stay hidden. Anyone with the file can extract the text. This is a producer-correctness recipe, not a confidentiality control. For confidentiality, see the encryption recipe.
Conformance
| Statement | Spec | Clause | reference_id |
|---|---|---|---|
| A ToUnicode CMap maps character codes to Unicode for text extraction. | ISO 32000-2 | §9.10.2 | |
| Text-showing operators place strings on the page in the content stream. | ISO 32000-2 | §9.4.3 | |
| A tagged structure tree records the logical reading order for extraction. | ISO 32000-2 | §14.8 | |
This recipe produces extractable text content. It does not assert PDF/UA-2 conformance, which a checker determines. See the accessibility recipe.