Set CJK text with cmap-aware encoding
At a glance
This recipe registers a Chinese, Japanese, and Korean (CJK) TrueType face, then encodes Traditional Chinese text through the cmap-aware FontInfo::encodeText() facade. The facade returns an Identity-H two-byte CID byte stream. The recipe follows examples/35-cjk-cmap-demo.php. Read the scope note before you rely on it.
Scope and status (read first)
The cmap-aware text-encoding architecture ships in phases (ADR-013). Phase 1 has landed: the FontInfo::encodeText() facade and cmap-aware encoding strategy are wired and reachable from userland. Phase 2 is in progress: it routes the renderer and writer through the facade. Phases 3 and 4 are pending: per-font /ToUnicode, /CIDSystemInfo, /Encoding, and /CIDToGIDMap emission, and the substitute-font resolver, are not yet wired into the writer.
Plan around these consequences:
- This recipe demonstrates the encoding facade, not a complete vertical-writing mode. The document surface today has no public writing-mode API, so there is no setWritingMode call and no vertical-rl setter.
- The backing example is, by its own header, an integration smoke test, not a conformance fixture. PDF/UA-2 and PDF/A-4 validation will regress for output produced this way until Phases 3 and 4 land. Do not state that output from this path conforms. A checker decides conformance, and it will not pass this output yet.
- The vertical-writing metrics infrastructure exists but is internal. It includes the CjkVerticalMetrics value object and the /W2 and /DW2 emitters. NextPDF does not expose it as a userland “write vertically” call, and the writer does not yet emit its dictionaries.
Install
```bash
composer require nextpdf/core:^3
```

The constraint matches the nextpdf/core package. The example runs on PHP 8.4. A bundled Noto Sans TC test fixture keeps this recipe self-contained.
Conceptual overview
ISO 32000-2 models text emission in three layers: Unicode codepoint, character code, and glyph ID. For a CJK TrueType face, the engine uses a composite Type 0 font with Identity-H encoding. With this encoding, the shown string uses byte pairs that index the CIDFont (ISO 32000-2).
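The three layers can be illustrated without the library. The following is a minimal sketch, not NextPDF's implementation: the function name encodeIdentityH and the $toyCmap table are hypothetical, the glyph IDs are made up rather than read from a font, and the sketch assumes an identity CID-to-GID mapping so that each two-byte code equals the glyph ID.

```php
<?php
declare(strict_types=1);

/**
 * Toy walk through the layers: Unicode codepoint -> glyph ID -> two-byte code.
 *
 * @param array<int,int> $cmap codepoint => glyph ID (hypothetical sample data)
 */
function encodeIdentityH(string $utf8Text, array $cmap): string
{
    $bytes = '';
    foreach (mb_str_split($utf8Text, 1, 'UTF-8') as $char) {
        $codepoint = mb_ord($char, 'UTF-8');
        // Unmapped codepoints fall back to glyph 0 (.notdef) in this sketch.
        $gid = $cmap[$codepoint] ?? 0;
        // Assumption: CID == GID, so the code is the glyph ID as a big-endian pair.
        $bytes .= pack('n', $gid);
    }

    return $bytes;
}

// U+5F15 and U+64CE mapped to made-up glyph IDs.
$toyCmap = [0x5F15 => 0x0A21, 0x64CE => 0x0B7C];
echo strtoupper(bin2hex(encodeIdentityH('引擎', $toyCmap))), "\n"; // 0A210B7C
```

Each input character yields exactly two bytes, which is why the byte length of an Identity-H run is twice its glyph count.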
FontRegistry::register() parses the face. FontInfo::encodeText($unicodeText) then resolves an encoding strategy through FontEncodingStrategyResolver. For a registered TrueType CJK face, it dispatches to TrueTypeCmapStrategy. The returned EncodedGlyphRun carries the Identity-H byte stream, the PDF string operand, per-glyph advance widths, the used codepoints, and the GID→Unicode map. CJK subsetting uses the codepoints per ADR-008. A future /ToUnicode stream will use the GID→Unicode map. The selected mode is EncodingMode::TwoByteCid.
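To make the role of a GID→Unicode map concrete, the sketch below turns such a map into bfchar lines of the kind a /ToUnicode CMap contains. It is an illustration only: toBfcharLines is a hypothetical helper, the sample map is invented, and nothing here reflects how NextPDF will emit the stream once Phases 3 and 4 land.

```php
<?php
declare(strict_types=1);

/**
 * Render glyph-ID-to-Unicode pairs as bfchar lines, sorted by glyph ID.
 *
 * @param array<int,int> $gidToUnicode glyph ID => codepoint (hypothetical sample data)
 * @return list<string>
 */
function toBfcharLines(array $gidToUnicode): array
{
    ksort($gidToUnicode);
    $lines = [];
    foreach ($gidToUnicode as $gid => $codepoint) {
        // The destination is UTF-16BE, so supplementary-plane characters
        // become a surrogate pair.
        $utf16 = mb_convert_encoding(mb_chr($codepoint, 'UTF-8'), 'UTF-16BE', 'UTF-8');
        $lines[] = sprintf('<%04X> <%s>', $gid, strtoupper(bin2hex($utf16)));
    }

    return $lines;
}

echo implode("\n", toBfcharLines([0x0B7C => 0x64CE, 0x0A21 => 0x5F15])), "\n";
// <0A21> <5F15>
// <0B7C> <64CE>
```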
Two CIDFont structures define vertical writing in PDF. The first is the /W2 per-glyph vertical-metrics array (ISO 32000-2). The second is the /DW2 default vertical metrics (ISO 32000-2). NextPDF provides the value object and emitters for both through CjkVerticalMetrics::toW2Array(), toW2RangeArray(), and toDw2Array(). They are internal, and the writer does not yet emit them. See the scope note.
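The shape of the two structures can be sketched in plain PHP. This is not the internal CjkVerticalMetrics code: buildW2 and toPdfArray are hypothetical helpers, and the per-glyph numbers are sample values. The sketch uses the ISO 32000-2 form in which a starting CID is followed by an array of (w1y, vx, vy) triples for consecutive CIDs, and it prints the specification's default /DW2 value of [880 -1000].

```php
<?php
declare(strict_types=1);

/**
 * Group consecutive CIDs into "c [w1y vx vy ...]" entries of a /W2 array.
 *
 * @param array<int,array{int,int,int}> $metrics CID => [w1y, vx, vy] (sample data)
 * @return list<int|list<int>>
 */
function buildW2(array $metrics): array
{
    ksort($metrics);
    $w2 = [];
    $runStart = null;
    $run = [];
    $prev = null;
    foreach ($metrics as $cid => $triple) {
        // A gap in the CID sequence closes the current run.
        if ($prev !== null && $cid !== $prev + 1) {
            $w2[] = $runStart;
            $w2[] = $run;
            $run = [];
            $runStart = null;
        }
        $runStart ??= $cid;
        array_push($run, ...$triple);
        $prev = $cid;
    }
    if ($runStart !== null) {
        $w2[] = $runStart;
        $w2[] = $run;
    }

    return $w2;
}

/** Serialise nested integer arrays in PDF array syntax. */
function toPdfArray(array $items): string
{
    $parts = array_map(
        static fn ($item): string => is_array($item) ? toPdfArray($item) : (string) $item,
        $items
    );

    return '[' . implode(' ', $parts) . ']';
}

$w2 = buildW2([10 => [-1000, 500, 880], 11 => [-1000, 500, 880], 20 => [-900, 450, 800]]);
echo '/W2 ', toPdfArray($w2), "\n";          // /W2 [10 [-1000 500 880 -1000 500 880] 20 [-900 450 800]]
echo '/DW2 ', toPdfArray([880, -1000]), "\n"; // /DW2 [880 -1000]
```

Glyphs that share the default metrics need no /W2 entry at all, so a real array typically lists only the exceptions.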
API surface
- FontRegistry::register(string $fontFile, string $alias = '', int $fontIndex = 0): FontInfo, in NextPDF\Typography\FontRegistry.
- FontInfo::encodeText(string $unicodeText): EncodedGlyphRun, in NextPDF\Typography\FontInfo. The Phase 1 facade.
- EncodedGlyphRun, in NextPDF\Typography\Encoding\EncodedGlyphRun. Members: byteStream, pdfStringOperand, mode, advanceWidths, toUnicodeMap, usedCodepoints, glyphCount().
- EncodingMode, in NextPDF\Typography\Encoding\EncodingMode. Cases: SingleByte, TwoByteCid.
- CjkVerticalMetrics, in NextPDF\Typography\CjkVerticalMetrics. Internal vertical-metrics value object. It is documented for transparency, not as a userland writing path.
The full PHPDoc table is generated from source.
Code sample — Quick start
```php
<?php
declare(strict_types=1);

require_once __DIR__ . '/vendor/autoload.php';

use NextPDF\Typography\Encoding\EncodingMode;
use NextPDF\Typography\FontRegistry;

$registry = new FontRegistry();
$font = $registry->register('/path/to/NotoSansTC-Regular.ttf', alias: 'NotoSansTC');

$encoded = $font->encodeText('PDF 2.0 引擎');

assert($encoded->mode === EncodingMode::TwoByteCid); // cmap-aware branch fired
echo $encoded->glyphCount() . " glyph run entries\n";
```

Code sample — Production
This sample is self-contained and harness-runnable. It mirrors examples/35-cjk-cmap-demo.php. First, register the bundled Noto Sans TC fixture. Next, confirm the cmap-aware facade is reachable. Then render through DocumentFactory so the document uses the registry you populated.

```php
<?php
declare(strict_types=1);

require_once __DIR__ . '/vendor/autoload.php';

use NextPDF\Core\DocumentFactory;
use NextPDF\Graphics\ImageRegistry;
use NextPDF\Typography\Encoding\EncodingMode;
use NextPDF\Typography\FontRegistry;

$cjkFontPath = dirname(__DIR__, 2) . '/fonts/test-fixtures/Noto Sans TC/NotoSansTC-Regular.ttf';
if (!is_file($cjkFontPath)) {
    fwrite(STDERR, "Missing CJK font fixture: {$cjkFontPath}\n");
    exit(1);
}

$fontRegistry = new FontRegistry();
$cjkFont = $fontRegistry->register($cjkFontPath, alias: 'NotoSansTC');

// Phase 1 facade: prove the cmap-aware path is reachable from userland.
$cjkSample = 'PDF 2.0 引擎 — 使用 CMap 編碼';
$encoded = $cjkFont->encodeText($cjkSample);

if ($encoded->mode !== EncodingMode::TwoByteCid) {
    fwrite(STDERR, "Expected TwoByteCid (TrueTypeCmapStrategy branch)\n");
    exit(2);
}

$imageRegistry = new ImageRegistry(maxCacheBytes: 0);
$documentFactory = new DocumentFactory($fontRegistry, $imageRegistry);

$doc = $documentFactory->create();
$doc->setTitle('NextPDF CJK CMap-Aware Encoding Demo');
$doc->setLanguage('zh-Hant');
$doc->addPage();

$doc->setFont('helvetica', 'B', 16);
$doc->cell(0, 12, 'CJK cmap-aware encoding (Phase 1 facade)', newLine: true);
$doc->setFont('helvetica', '', 10);
$doc->cell(0, 6, 'Mode: ' . $encoded->mode->name . ' (Identity-H, 2-byte CIDs)', newLine: true);
$doc->cell(0, 6, 'Glyphs: ' . $encoded->glyphCount() . ' run entries', newLine: true);
$doc->cell(0, 6, 'Bytes: ' . strlen($encoded->byteStream) . ' encoded bytes', newLine: true);
$doc->ln(4);

$doc->setFont('NotoSansTC', '', 18);
$doc->cell(0, 12, $cjkSample, newLine: true);

$out = getenv('NEXTPDF_COOKBOOK_OUTPUT');
$doc->save($out !== false ? $out : __DIR__ . '/cjk-vertical-writing.pdf');

echo "Wrote cjk-vertical-writing.pdf (Phase 1+2 dry-run; not a conformance fixture)\n";
```

Expected STDOUT:

```
Wrote cjk-vertical-writing.pdf (Phase 1+2 dry-run; not a conformance fixture)
```

Edge cases & gotchas
- Not a conformance fixture. Per the backing example’s own header, this output is an integration smoke test. PDF/UA-2 and PDF/A-4 checks regress for it until Phases 3 and 4 land. Do not register it as a conformance golden.
- No writing-mode API. No public call switches to vertical writing, which would cover vertical-rl and vertical-lr. The /W2 and /DW2 emitters exist internally. They are not exposed and are not yet written into the font dictionary.
- Registry ownership. Document::createStandalone() builds its own registry. Use DocumentFactory so the document reads the registry you populated with the CJK face.
- Final byte-stream path. Until Phase 2 closes, the visible content stream still routes through the legacy text path. The proven, reachable part today is the upstream encoding step: the cmap forward lookup plus the Identity-H byte stream.
- CJK subsetting cost. Large CJK faces subset through an isolated subprocess. That subprocess has a PHP-native fallback and a two-second timeout (ADR-008).
Performance
encodeText() makes a single cmap forward-lookup pass over the input. It is linear in codepoint count, O(n). The budget is wall_ms: 2000, peak_mb: 128. This budget is the highest in this set because CJK faces are large, and subsetting is the dominant cost. ADR-008 isolates that work so it cannot block the caller.
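One way to check a run against that budget locally is a small timing wrapper. The sketch below is an assumption about how such a check could look, not the project's harness: measureAgainstBudget is a hypothetical helper, and only the two limits (2000 ms, 128 MB) come from the budget above.

```php
<?php
declare(strict_types=1);

/**
 * Run a callable and compare wall time and peak memory with a budget.
 *
 * @return array{wall_ms: float, peak_mb: float, within_budget: bool}
 */
function measureAgainstBudget(callable $work, float $maxWallMs = 2000.0, float $maxPeakMb = 128.0): array
{
    $start = hrtime(true); // monotonic clock, nanoseconds
    $work();
    $wallMs = (hrtime(true) - $start) / 1e6;
    // Peak memory is process-wide, so it includes whatever ran before $work.
    $peakMb = memory_get_peak_usage(true) / (1024.0 * 1024.0);

    return [
        'wall_ms' => $wallMs,
        'peak_mb' => $peakMb,
        'within_budget' => $wallMs <= $maxWallMs && $peakMb <= $maxPeakMb,
    ];
}

// Stand-in workload; replace the closure body with the call you want to measure.
$result = measureAgainstBudget(static fn () => mb_str_split(str_repeat('引擎', 10_000)));
printf("wall_ms=%.2f peak_mb=%.1f\n", $result['wall_ms'], $result['peak_mb']);
```

Because the peak figure covers the whole process, measure in a fresh process when the memory number matters.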
Security notes
A CJK font file is untrusted binary input. The parser rejects stream-wrapper paths and null bytes. CJK subsetting runs in an isolated subprocess with no inherited state (ADR-008). Validate the provenance of end-user-supplied faces. CJK text content is rendered, not interpreted.
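A caller-side pre-check can reject obviously unsafe paths before they reach the registry. The sketch below is an illustration under stated assumptions, not the parser's own logic: isPlausibleFontPath is a hypothetical helper, and the extension allow-list is an assumption you should adapt. It complements the parser's checks and does not replace provenance validation.

```php
<?php
declare(strict_types=1);

/**
 * Cheap syntactic screen for an end-user-supplied font path.
 * It does not touch the filesystem and does not prove the file is a font.
 */
function isPlausibleFontPath(string $path): bool
{
    // Null bytes can truncate paths in lower layers.
    if ($path === '' || str_contains($path, "\0")) {
        return false;
    }
    // Anything shaped like scheme:// is treated as a stream wrapper.
    if (preg_match('~^[a-z][a-z0-9+.\-]*://~i', $path) === 1) {
        return false;
    }
    // Assumed allow-list of font file extensions.
    $extension = strtolower(pathinfo($path, PATHINFO_EXTENSION));

    return in_array($extension, ['ttf', 'otf', 'ttc'], true);
}

var_dump(isPlausibleFontPath('/fonts/NotoSansTC-Regular.ttf')); // bool(true)
var_dump(isPlausibleFontPath('phar://upload.phar/font.ttf'));   // bool(false)
```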
Conformance
Section titled “Conformance”| Statement | Spec | Clause | reference_id |
|---|---|---|---|
| For an Identity-H/Identity-V Type 0 font, the shown string is byte pairs indexing the CIDFont. | ISO 32000-2 | iso32000_2_sec9#x1.x49.p90 | |
| The W2 array gives per-glyph vertical-writing metrics and applies only to CIDFonts used for vertical writing. | ISO 32000-2 | iso32000_2_sec9#x1.x44.p23 | |
| The DW2 array gives the default vertical-writing metrics for a CIDFont. | ISO 32000-2 | iso32000_2_sec9#x1.x44.p22 | |
This recipe shows that the cmap-aware CJK encoding facade is reachable from userland (Phase 1). It does not claim vertical-writing output or PDF/UA-2 / PDF/A-4 conformance for the produced file. The writer-side /ToUnicode and vertical-metrics emission (Phases 3 and 4) are pending, and a checker would not pass this output today.
Commercial context
Not applicable.