Set CJK text with cmap-aware encoding

This recipe registers a Chinese, Japanese, and Korean (CJK) TrueType face, then encodes Traditional Chinese text through the cmap-aware FontInfo::encodeText() facade. The facade returns an Identity-H two-byte CID byte stream. The recipe follows examples/35-cjk-cmap-demo.php. Read the scope note before you rely on it.

The cmap-aware text-encoding architecture ships in phases (ADR-013). Phase 1 has landed: the FontInfo::encodeText() facade and cmap-aware encoding strategy are wired and reachable from userland. Phase 2 is in progress: it routes the renderer and writer through the facade. Phases 3 and 4 are pending: per-font /ToUnicode, /CIDSystemInfo, /Encoding, and /CIDToGIDMap emission, and the substitute-font resolver, are not yet wired into the writer.

Plan around these consequences:

  • This recipe demonstrates the encoding facade, not a complete vertical-writing mode. The document surface today has no public writing-mode API, so there is no setWritingMode call and no vertical-rl setter.
  • The backing example is, by its own header, an integration smoke test, not a conformance fixture. PDF/UA-2 and PDF/A-4 validation will regress for output produced this way until Phases 3 and 4 land. Do not state that output from this path conforms. A checker decides conformance, and it will not pass this output yet.
  • The vertical-writing metrics infrastructure exists but is internal. It includes the CjkVerticalMetrics value object and the /W2 and /DW2 emitters. NextPDF does not expose it as a userland “write vertically” call, and the writer does not yet emit its dictionaries.
composer require nextpdf/core:^3

The ^3 constraint accepts any 3.x release of the nextpdf/core package. The example runs on PHP 8.4. A bundled Noto Sans TC test fixture keeps this recipe self-contained.

ISO 32000-2 models text emission in three layers: Unicode codepoint, character code, and glyph ID. For a CJK TrueType face, the engine uses a composite Type 0 font with Identity-H encoding. With this encoding, the shown string uses byte pairs that index the CIDFont (ISO 32000-2).
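
The three layers can be illustrated with a minimal standalone sketch. It does not use NextPDF: the TOY_CMAP table, its glyph IDs, and the toyIdentityH() helper are invented for illustration, and the sketch assumes the character code equals the glyph ID (an identity CID-to-GID mapping).

```php
<?php
declare(strict_types=1);

/**
 * Toy forward lookup: character -> glyph ID. The GID values are invented
 * for illustration; a real table comes from the font's cmap.
 */
const TOY_CMAP = [
    'P' => 0x0031,
    'D' => 0x0025,
    'F' => 0x0027,
    '引' => 0x0A3C,
    '擎' => 0x0D7E,
];

/**
 * Encode text as an Identity-H style byte stream: one big-endian 16-bit
 * code per glyph. Characters missing from the toy cmap fall back to
 * GID 0 (.notdef).
 *
 * @param array<string, int> $cmap
 */
function toyIdentityH(string $text, array $cmap): string
{
    // Layer 1: split the UTF-8 input into individual characters.
    $chars = preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY);
    if ($chars === false) {
        throw new InvalidArgumentException('Input is not valid UTF-8.');
    }

    $bytes = '';
    foreach ($chars as $char) {
        // Layers 2 and 3: look up the glyph ID and use it as the character code.
        $gid = $cmap[$char] ?? 0;
        $bytes .= pack('n', $gid); // 'n' = unsigned 16-bit, big-endian
    }

    return $bytes;
}

$stream = toyIdentityH('PDF引擎', TOY_CMAP);
echo strlen($stream) . " bytes\n";                 // 5 glyphs x 2 bytes each
echo '<' . strtoupper(bin2hex($stream)) . ">\n";   // hex string form of the byte pairs
```

Each glyph always costs two bytes in this scheme, which is why the byte count is exactly twice the glyph count.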

FontRegistry::register() parses the face. FontInfo::encodeText($unicodeText) then resolves an encoding strategy through FontEncodingStrategyResolver. For a registered TrueType CJK face, it dispatches to TrueTypeCmapStrategy, and the selected mode is EncodingMode::TwoByteCid. The returned EncodedGlyphRun carries the Identity-H byte stream, the PDF string operand, per-glyph advance widths, the used codepoints, and the GID→Unicode map. CJK subsetting consumes the used codepoints (ADR-008). A future /ToUnicode stream will consume the GID→Unicode map.
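
To make the by-products of a run concrete, here is a minimal standalone sketch of one lookup pass that collects advance widths, used codepoints, and a GID→Unicode map. It is not TrueTypeCmapStrategy: toyCollectRun(), the lookup tables, the 1000-unit default width, and the array keys are assumptions chosen to echo the field names listed above. It also assumes the mbstring extension is loaded.

```php
<?php
declare(strict_types=1);

/**
 * One forward-lookup pass that collects the by-products a glyph run carries.
 * $cmap maps codepoint => GID and $advances maps GID => advance width.
 * Both tables are invented for this sketch.
 *
 * @param array<int, int> $cmap
 * @param array<int, int> $advances
 * @return array{advanceWidths: list<int>, usedCodepoints: list<int>, toUnicodeMap: array<int, int>}
 */
function toyCollectRun(string $text, array $cmap, array $advances): array
{
    $widths = [];
    $used = [];
    $toUnicode = [];

    foreach (mb_str_split($text, 1, 'UTF-8') as $char) {
        $codepoint = mb_ord($char, 'UTF-8');
        $gid = $cmap[$codepoint] ?? 0;           // 0 = .notdef
        $widths[] = $advances[$gid] ?? 1000;     // assumed default width
        $used[$codepoint] = true;                // de-duplicate codepoints
        if ($gid !== 0) {
            $toUnicode[$gid] = $codepoint;       // reverse map for text extraction
        }
    }

    return [
        'advanceWidths' => $widths,
        'usedCodepoints' => array_keys($used),
        'toUnicodeMap' => $toUnicode,
    ];
}

$toyCmap = [mb_ord('P', 'UTF-8') => 0x0031, mb_ord('引', 'UTF-8') => 0x0A3C];
$toyAdvances = [0x0031 => 600, 0x0A3C => 1000];

$run = toyCollectRun('P引P', $toyCmap, $toyAdvances);
echo count($run['advanceWidths']) . " widths, "
    . count($run['usedCodepoints']) . " used codepoints\n";
```

The widths list has one entry per glyph in input order, while the used-codepoint list is de-duplicated, which is the shape a subsetter would want.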

Two CIDFont structures define vertical writing in PDF. The first is the /W2 per-glyph vertical-metrics array (ISO 32000-2). The second is the /DW2 default vertical metrics (ISO 32000-2). NextPDF provides the value object and emitters for both through CjkVerticalMetrics::toW2Array(), toW2RangeArray(), and toDw2Array(). They are internal, and the writer does not yet emit them. See the scope note.
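
As a rough illustration of what those two entries look like once serialized, the following standalone sketch builds a /DW2 pair and a /W2 array in the "c [w1y vx vy ...]" form. It is not the CjkVerticalMetrics implementation: toyDw2(), toyW2(), and the sample metrics are hypothetical. The [880 -1000] default is the value ISO 32000-2 gives for /DW2.

```php
<?php
declare(strict_types=1);

/**
 * Serialize a /DW2 entry: the vertical component of the position vector,
 * then the vertical displacement. [880 -1000] is the default.
 */
function toyDw2(int $vy = 880, int $w1y = -1000): string
{
    return sprintf('/DW2 [%d %d]', $vy, $w1y);
}

/**
 * Serialize per-glyph vertical metrics as a /W2 array. Runs of consecutive
 * CIDs share one "start [w1y vx vy ...]" group.
 *
 * @param array<int, array{int, int, int}> $metrics CID => [w1y, vx, vy]
 */
function toyW2(array $metrics): string
{
    ksort($metrics);

    $groups = [];
    $previousCid = null;
    foreach ($metrics as $cid => [$w1y, $vx, $vy]) {
        // Start a new group whenever the CID sequence has a gap.
        if ($previousCid === null || $cid !== $previousCid + 1) {
            $groups[] = ['start' => $cid, 'values' => []];
        }
        $last = array_key_last($groups);
        array_push($groups[$last]['values'], $w1y, $vx, $vy);
        $previousCid = $cid;
    }

    $parts = [];
    foreach ($groups as $group) {
        $parts[] = $group['start'] . ' [' . implode(' ', $group['values']) . ']';
    }

    return '/W2 [' . implode(' ', $parts) . ']';
}

echo toyDw2() . "\n";
echo toyW2([
    120 => [-1000, 250, 772],
    121 => [-1000, 500, 880],
    7080 => [-1000, 500, 900],
]) . "\n";
```

Grouping consecutive CIDs keeps the array compact. The specification also allows a range form (first CID, last CID, then one metrics triple), which this sketch does not produce.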

  • FontRegistry::register(string $fontFile, string $alias = '', int $fontIndex = 0): FontInfo, in NextPDF\Typography\FontRegistry.
  • FontInfo::encodeText(string $unicodeText): EncodedGlyphRun, in NextPDF\Typography\FontInfo. The Phase 1 facade.
  • EncodedGlyphRun: NextPDF\Typography\Encoding\EncodedGlyphRun (byteStream, pdfStringOperand, mode, advanceWidths, toUnicodeMap, usedCodepoints, glyphCount()).
  • EncodingMode: NextPDF\Typography\Encoding\EncodingMode (SingleByte, TwoByteCid).
  • CjkVerticalMetrics: NextPDF\Typography\CjkVerticalMetrics. Internal vertical-metrics value object. It is documented for transparency, not as a userland writing path.

The full PHPDoc table is generated from source.

<?php
declare(strict_types=1);
require_once __DIR__ . '/vendor/autoload.php';
use NextPDF\Typography\Encoding\EncodingMode;
use NextPDF\Typography\FontRegistry;
$registry = new FontRegistry();
$font = $registry->register('/path/to/NotoSansTC-Regular.ttf', alias: 'NotoSansTC');
$encoded = $font->encodeText('PDF 2.0 引擎');
assert($encoded->mode === EncodingMode::TwoByteCid); // cmap-aware branch fired
echo $encoded->glyphCount() . " glyph run entries\n";

This sample is self-contained and harness-runnable. It mirrors examples/35-cjk-cmap-demo.php. First, register the bundled Noto Sans TC fixture. Next, confirm the cmap-aware facade is reachable. Then render through DocumentFactory so the document uses the registry you populated.

<?php
declare(strict_types=1);
require_once __DIR__ . '/vendor/autoload.php';
use NextPDF\Core\DocumentFactory;
use NextPDF\Graphics\ImageRegistry;
use NextPDF\Typography\Encoding\EncodingMode;
use NextPDF\Typography\FontRegistry;
$cjkFontPath = dirname(__DIR__, 2)
    . '/fonts/test-fixtures/Noto Sans TC/NotoSansTC-Regular.ttf';
if (!is_file($cjkFontPath)) {
    fwrite(STDERR, "Missing CJK font fixture: {$cjkFontPath}\n");
    exit(1);
}
$fontRegistry = new FontRegistry();
$cjkFont = $fontRegistry->register($cjkFontPath, alias: 'NotoSansTC');
// Phase 1 facade: prove the cmap-aware path is reachable from userland.
$cjkSample = 'PDF 2.0 引擎 — 使用 CMap 編碼';
$encoded = $cjkFont->encodeText($cjkSample);
if ($encoded->mode !== EncodingMode::TwoByteCid) {
    fwrite(STDERR, "Expected TwoByteCid (TrueTypeCmapStrategy branch)\n");
    exit(2);
}
$imageRegistry = new ImageRegistry(maxCacheBytes: 0);
$documentFactory = new DocumentFactory($fontRegistry, $imageRegistry);
$doc = $documentFactory->create();
$doc->setTitle('NextPDF CJK CMap-Aware Encoding Demo');
$doc->setLanguage('zh-Hant');
$doc->addPage();
$doc->setFont('helvetica', 'B', 16);
$doc->cell(0, 12, 'CJK cmap-aware encoding (Phase 1 facade)', newLine: true);
$doc->setFont('helvetica', '', 10);
$doc->cell(0, 6, 'Mode: ' . $encoded->mode->name . ' (Identity-H, 2-byte CIDs)', newLine: true);
$doc->cell(0, 6, 'Glyphs: ' . $encoded->glyphCount() . ' run entries', newLine: true);
$doc->cell(0, 6, 'Bytes: ' . strlen($encoded->byteStream) . ' encoded bytes', newLine: true);
$doc->ln(4);
$doc->setFont('NotoSansTC', '', 18);
$doc->cell(0, 12, $cjkSample, newLine: true);
$out = getenv('NEXTPDF_COOKBOOK_OUTPUT');
$doc->save($out !== false ? $out : __DIR__ . '/cjk-vertical-writing.pdf');
echo "Wrote cjk-vertical-writing.pdf (Phase 1+2 dry-run; not a conformance fixture)\n";

Expected STDOUT:

Wrote cjk-vertical-writing.pdf (Phase 1+2 dry-run; not a conformance fixture)
  • Not a conformance fixture. Per the backing example’s own header, this output is an integration smoke test. PDF/UA-2 and PDF/A-4 checks regress for it until Phases 3 and 4 land. Do not register it as a conformance golden.
  • No writing-mode API. No public call switches to vertical writing, which would cover vertical-rl and vertical-lr. The /W2 and /DW2 emitters exist internally. They are not exposed and are not yet written into the font dictionary.
  • Registry ownership. Document::createStandalone() builds its own registry. Use DocumentFactory so the document reads the registry you populated with the CJK face.
  • Final byte-stream path. Until Phase 2 closes, the visible content stream still routes through the legacy text path. The proven, reachable part today is the upstream encoding step: the cmap forward lookup plus the Identity-H byte stream.
  • CJK subsetting cost. Large CJK faces subset through an isolated subprocess. That subprocess has a PHP-native fallback and a two-second timeout (ADR-008).

encodeText() makes a single cmap forward-lookup pass over the input. It is linear in codepoint count, O(n). The budget is wall_ms: 2000, peak_mb: 128. This budget is the highest in this set because CJK faces are large, and subsetting is the dominant cost. ADR-008 isolates that work so it cannot block the caller.
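
One way to check a call against that budget is to wrap it in a small timer. The sketch below is an assumption about how such a check could look, not the harness that enforces the budget: toyMeasureBudget() is a hypothetical helper, and the workload is a stand-in for an encodeText() call. Note that memory_get_peak_usage() reports the peak for the whole process, not just the wrapped call.

```php
<?php
declare(strict_types=1);

/**
 * Run a callable and compare wall time and process peak memory with a budget.
 *
 * @return array{wall_ms: float, peak_mb: float, within_budget: bool}
 */
function toyMeasureBudget(
    callable $work,
    float $wallMsBudget = 2000.0,
    float $peakMbBudget = 128.0
): array {
    $start = hrtime(true);                          // monotonic clock, nanoseconds
    $work();
    $wallMs = (hrtime(true) - $start) / 1e6;        // nanoseconds -> milliseconds
    $peakMb = memory_get_peak_usage(true) / (1024 * 1024);

    return [
        'wall_ms' => $wallMs,
        'peak_mb' => $peakMb,
        'within_budget' => $wallMs <= $wallMsBudget && $peakMb <= $peakMbBudget,
    ];
}

// Stand-in workload: a linear pass that packs 10,000 two-byte codes.
$report = toyMeasureBudget(static function (): void {
    $bytes = '';
    for ($gid = 0; $gid < 10000; $gid++) {
        $bytes .= pack('n', $gid);
    }
});

printf(
    "wall_ms=%.2f peak_mb=%.1f within_budget=%s\n",
    $report['wall_ms'],
    $report['peak_mb'],
    $report['within_budget'] ? 'yes' : 'no'
);
```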

A CJK font file is untrusted binary input. The parser rejects stream-wrapper paths and null bytes. CJK subsetting runs in an isolated subprocess with no inherited state (ADR-008). Validate the provenance of end-user-supplied faces. CJK text content is rendered, not interpreted.
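
If you accept font paths from end users, you can apply a similar pre-check before the path ever reaches the registry. This is a minimal sketch, not the parser's own validation: toyLooksLikePlainPath() and its rules are assumptions for illustration, and it complements provenance checks rather than replacing them.

```php
<?php
declare(strict_types=1);

/**
 * Return true only for paths that contain no null byte and do not name a
 * PHP stream wrapper (php://, phar://, compress.zlib://, data:, and so on).
 */
function toyLooksLikePlainPath(string $path): bool
{
    if ($path === '' || str_contains($path, "\0")) {
        return false;
    }
    // scheme:// at the start of the string indicates a stream wrapper.
    if (preg_match('#^[a-z][a-z0-9+.\-]*://#i', $path) === 1) {
        return false;
    }
    // The data: wrapper is also accepted without the double slash.
    if (preg_match('#^data:#i', $path) === 1) {
        return false;
    }

    return true;
}

$candidates = [
    '/fonts/NotoSansTC-Regular.ttf',
    'php://filter/resource=/etc/passwd',
    "NotoSansTC-Regular.ttf\0.png",
    'data:text/plain,abc',
];
foreach ($candidates as $candidate) {
    $label = toyLooksLikePlainPath($candidate) ? 'accept' : 'reject';
    echo $label . ': ' . addcslashes($candidate, "\0") . "\n";
}
```

A Windows drive path such as C:\fonts\a.ttf passes this check because the colon is not followed by two slashes.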

Statement | Spec | Clause | reference_id
For an Identity-H/Identity-V Type 0 font, the shown string is byte pairs indexing the CIDFont. | ISO 32000-2 | | iso32000_2_sec9#x1.x49.p90
The W2 array gives per-glyph vertical-writing metrics and applies only to CIDFonts used for vertical writing. | ISO 32000-2 | | iso32000_2_sec9#x1.x44.p23
The DW2 array gives the default vertical-writing metrics for a CIDFont. | ISO 32000-2 | | iso32000_2_sec9#x1.x44.p22

This recipe shows that the cmap-aware CJK encoding facade is reachable from userland (Phase 1). It does not claim vertical-writing output or PDF/UA-2 / PDF/A-4 conformance for the produced file. The writer-side /ToUnicode and vertical-metrics emission (Phases 3 and 4) are pending, and a checker would not pass this output today.

Not applicable.