Produce text content that downstream tools can extract

At a glance

NextPDF Core is a PDF producer. It ships no public PDF-to-text reader. In the Core context, “extract text content” means you create the document so that its text is extractable: each font carries a /ToUnicode CMap, and the document has a tagged logical structure. A conforming reader or downstream extraction tool can then recover the Unicode text in reading order.

Reading text from an arbitrary third-party PDF is consumer work. For that task, use the Inspect module with the Spectrum sidecar or an external tool, not the Core producer surface.

Install

composer require nextpdf/core:^3

Conceptual overview

Text-showing operators in the content stream place text on a page (ISO 32000-2 §9.4.3). The character codes in those strings select glyphs in the font and are not necessarily Unicode. A /ToUnicode CMap lets a reader map those codes back to Unicode for extraction (ISO 32000-2 §9.10.2). A tagged structure tree records the logical reading order, so extraction can recover text in document order instead of paint order (ISO 32000-2 §14.8).
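
To make the code-to-Unicode step concrete, here is a minimal sketch of what a reader does with a /ToUnicode CMap. It is not part of NextPDF Core: the function names parseBfChar() and decodeCodes() and the sample CMap are hypothetical, and the parser handles only single-code bfchar entries (no bfrange). It assumes the iconv extension is available.

```php
<?php

declare(strict_types=1);

/**
 * Parse the bfchar sections of a ToUnicode CMap into [hex code => UTF-8 text].
 * Illustrative only: bfrange entries are ignored.
 */
function parseBfChar(string $cmap): array
{
    $map = [];
    // Look only inside beginbfchar ... endbfchar so that the codespace
    // range pairs are not mistaken for mappings.
    preg_match_all('/beginbfchar(.*?)endbfchar/s', $cmap, $sections);
    foreach ($sections[1] as $section) {
        // Each entry is <source code> <destination as UTF-16BE hex>.
        preg_match_all(
            '/<((?:[0-9A-Fa-f]{2})+)>\s*<((?:[0-9A-Fa-f]{2})+)>/',
            $section,
            $pairs,
            PREG_SET_ORDER
        );
        foreach ($pairs as [, $code, $utf16Hex]) {
            $text = iconv('UTF-16BE', 'UTF-8', hex2bin($utf16Hex));
            if ($text !== false) {
                $map[strtoupper($code)] = $text;
            }
        }
    }
    return $map;
}

/**
 * Decode a hex string of fixed-width character codes using the map.
 * Codes without a mapping become U+FFFD (replacement character).
 */
function decodeCodes(string $hexCodes, array $map, int $hexDigitsPerCode = 4): string
{
    $text = '';
    foreach (str_split(strtoupper($hexCodes), $hexDigitsPerCode) as $code) {
        $text .= $map[$code] ?? "\u{FFFD}";
    }
    return $text;
}

// Hypothetical CMap fragment: three 2-byte codes mapped to "H", "i", "!".
$cmap = <<<'CMAP'
1 begincodespacerange
<0000> <FFFF>
endcodespacerange
3 beginbfchar
<0003> <0048>
<0004> <0069>
<0005> <0021>
endbfchar
CMAP;

echo decodeCodes('000300040005', parseBfChar($cmap)), "\n"; // prints "Hi!"
```

Without the bfchar section, the codes 0003, 0004, and 0005 carry no Unicode meaning, which is why a document that omits the CMap cannot be extracted reliably.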

enableTaggedPdf() builds that structure tree and keeps the /ToUnicode CMap for embedded-subset fonts. Together, those features make the output reliably extractable.
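
The difference between paint order and reading order can be sketched with plain arrays. This is a conceptual illustration, not NextPDF's internal model: the arrays and the helper textInOrder() are hypothetical. It assumes each piece of text is marked content with a marked-content identifier (MCID), and that the structure tree refers to those identifiers in logical order.

```php
<?php

declare(strict_types=1);

// Marked content as it appears in the content stream: [MCID => text].
// Here the footer happens to be painted before the heading and the body.
$paintOrder = [
    2 => 'Page 1 of 1',
    0 => 'Quarterly report',
    1 => 'Revenue grew.',
];

// Leaves of the structure tree, listed in logical reading order by MCID.
$structureOrder = [0, 1, 2];

/** Join the text of the given MCIDs in the order they are listed. */
function textInOrder(array $mcids, array $content): string
{
    return implode("\n", array_map(
        static fn (int $id): string => $content[$id],
        $mcids
    ));
}

// Untagged extraction can only follow the content stream (paint order).
echo textInOrder(array_keys($paintOrder), $paintOrder), "\n\n";

// Tagged extraction follows the structure tree (reading order).
echo textInOrder($structureOrder, $paintOrder), "\n";
```

The first listing starts with the footer; the second starts with the heading, which is the order a reader of the document expects.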

API surface

Document::enableTaggedPdf(string $lang = 'en') builds the structure tree and sets the conformance mode that preserves the /ToUnicode CMap. Document::setLanguage(string $lang) records the BCP 47 language tag. Call both methods before you write content. Then write the text with the usual setFont() / cell() / multiCell() surface.

Code sample — Quick start

<?php

declare(strict_types=1);

require_once __DIR__ . '/vendor/autoload.php';

use NextPDF\Core\Document;

$doc = Document::createStandalone();
$doc->setLanguage('en');
$doc->enableTaggedPdf('en');     // structure tree + ToUnicode retention
$doc->addPage();
$doc->setFont('helvetica', '', 12);
$doc->multiCell(0, 7, 'This text is extractable by a downstream reader.');

file_put_contents(__DIR__ . '/extractable.pdf', $doc->getPdfData());

Code sample — Production

This self-contained program runs in the harness. It mirrors examples/38-extract-text-content.php. It creates a tagged document whose text carries a /ToUnicode CMap and a logical reading order. A downstream extractor can then recover the Unicode text in order.

<?php

declare(strict_types=1);

require_once __DIR__ . '/vendor/autoload.php';

use NextPDF\Core\Document;

$paragraphs = [
    'NextPDF produces documents whose text content is extractable.',
    'A tagged structure tree records the logical reading order.',
    'The ToUnicode CMap lets a reader recover Unicode from glyph codes.',
];

$doc = Document::createStandalone();
$doc->setTitle('Extractable text content');
$doc->setAuthor('NextPDF Cookbook');
$doc->setLanguage('en');             // BCP 47; validated on enableTaggedPdf()

// Configure tagged mode BEFORE content so the structure tree captures the
// text in reading order and the /ToUnicode CMap is retained.
$doc->enableTaggedPdf('en');

$doc->addPage();
$doc->setFont('helvetica', '', 12);
foreach ($paragraphs as $p) {
    $doc->multiCell(0, 7, $p);       // captured in reading order
    $doc->ln(2);
}

$pdf = $doc->getPdfData();
// $pdf contains a /StructTreeRoot and per-font /ToUnicode CMaps; an external
// extractor (or the Inspect sidecar) recovers the Unicode text in order.

echo "Wrote a tagged PDF with extractable text content\n";
echo 'Paragraphs authored: ' . count($paragraphs) . "\n";
echo "Text is recoverable via the /ToUnicode CMap + tagged reading order.\n";

// The harness sets NEXTPDF_COOKBOOK_OUTPUT and runs this script under the
// semantic profile; emit the document to the side-channel.
$out = getenv('NEXTPDF_COOKBOOK_OUTPUT');
$target = ($out !== false && $out !== '') ? $out : __DIR__ . '/extractable.pdf';
file_put_contents($target, $pdf);

Expected STDOUT:

Wrote a tagged PDF with extractable text content
Paragraphs authored: 3
Text is recoverable via the /ToUnicode CMap + tagged reading order.
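
As a quick smoke check on the produced bytes, you can look for the two names the program's comments mention. The sketch below is an assumption-laden convenience, not a NextPDF API: findExtractabilityMarkers() is a hypothetical helper, and the sample string stands in for real output. A plain byte search is inconclusive when it reports a miss, because a producer may place objects inside compressed streams; use a real extraction tool for a definitive answer.

```php
<?php

declare(strict_types=1);

/**
 * Report which extractability markers appear as plain bytes in a PDF.
 * A false value is inconclusive: the name may sit inside a compressed stream.
 */
function findExtractabilityMarkers(string $pdfBytes): array
{
    return [
        'structTreeRoot' => strpos($pdfBytes, '/StructTreeRoot') !== false,
        'toUnicode'      => strpos($pdfBytes, '/ToUnicode') !== false,
    ];
}

// Stand-in bytes for illustration; in practice, pass the real document,
// for example file_get_contents(__DIR__ . '/extractable.pdf').
$sample = "1 0 obj << /Type /Catalog /StructTreeRoot 5 0 R >> endobj\n"
        . "7 0 obj << /Type /Font /ToUnicode 9 0 R >> endobj\n";

foreach (findExtractabilityMarkers($sample) as $marker => $found) {
    echo $marker, ': ', $found ? 'found' : 'not found in plain bytes', "\n";
}
```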

Edge cases & gotchas

Producer, not reader. Core has no public extractText(). Reading text out of an existing third-party PDF is a consumer task. Use the Inspect module with the Spectrum sidecar, or use an external extraction tool. This recipe makes your output extractable.
Configure tagging first. Call enableTaggedPdf() before you write content, so the structure tree captures the text in reading order. A call made after content is added does not tag the prior content.
Invalid language tag. enableTaggedPdf() validates the BCP 47 tag and throws InvalidConfigException when the tag is invalid. Use a registered tag, for example en, zh-Hant-TW, or ja.
Plain (untagged) output. Without enableTaggedPdf(), plain output may suppress the /ToUnicode CMap for predefined CMap fonts to reduce size. Extraction is then unreliable for those fonts. Tag the document when you need extractable text.
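
If the language tag comes from user input or configuration, a cheap pre-check can reject obviously malformed values before they reach enableTaggedPdf(). The sketch below is not the library's validator: looksLikeLanguageTag() is a hypothetical helper that tests a simplified subset of BCP 47 well-formedness (language, optional script, optional region, optional variants). It does not cover extensions, private-use subtags, or grandfathered tags, and it does not check the subtag registry, so the library may still reject a tag that passes here.

```php
<?php

declare(strict_types=1);

/**
 * Simplified BCP 47 shape check: language[-Script][-REGION][-variant]*.
 * Illustrative only; it does not consult the IANA subtag registry.
 */
function looksLikeLanguageTag(string $tag): bool
{
    $pattern = '/^[a-z]{2,3}'                         // primary language
             . '(-[a-z]{4})?'                         // script, e.g. Hant
             . '(-(?:[a-z]{2}|\d{3}))?'               // region, e.g. TW or 419
             . '(-(?:[a-z0-9]{5,8}|\d[a-z0-9]{3}))*'  // variants, e.g. 1996
             . '$/i';

    return preg_match($pattern, $tag) === 1;
}

foreach (['en', 'zh-Hant-TW', 'ja', 'english_US', ''] as $candidate) {
    printf("%-12s %s\n", "'" . $candidate . "'", looksLikeLanguageTag($candidate) ? 'ok' : 'rejected');
}
```

The three tags named in the recipe pass the check, while 'english_US' and the empty string are rejected.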

Performance

Tagging adds the structure tree and retains /ToUnicode CMaps, so output size increases modestly. The cost scales with content volume and does not change the single-pass rendering model.

Security notes

Tagged text content is machine-readable by design. Do not place secrets in document text and expect them to stay hidden: anyone with the file can extract the text. This is a producer-correctness recipe, not a confidentiality control. For confidentiality, see the encryption recipe.

Conformance

Statement	Spec	Clause
A ToUnicode CMap maps character codes to Unicode for text extraction.	ISO 32000-2	§9.10.2
Text-showing operators place strings on the page in the content stream.	ISO 32000-2	§9.4.3
A tagged structure tree records the logical reading order for extraction.	ISO 32000-2	§14.8

This recipe produces extractable text content. It does not assert PDF/UA-2 conformance; only a conformance checker can determine that. See the accessibility recipe.