Text: shaping seam, CJK, and run handling
At a glance
The text module defines the shaping boundary. It exposes a small interface that turns an 8-bit Unicode Transformation Format (UTF-8) run into positioned glyphs, selects a real OpenType backend when one is available, falls back deterministically when none is, and provides a registry for script-specific shapers.
Install
```sh
composer require nextpdf/core:^3
```

Conceptual overview
ShaperInterface connects the text-layout pipeline to an OpenType shaping engine. It stays deliberately small: one shape() method consumes a ShaperInput and returns a ShapingResult. That return type is the only output consumers see. Implementations must not leak shaping-engine internals, and the typed return enforces that boundary. ShapingResult carries the list of GlyphRun records, the echoed source text, the script and direction, and a shaperImpl tag that identifies the backend that produced the result.
Backend selection is explicit and reports capability without guessing. ShaperFactory runs one capability probe. If the host has a working HarfBuzz binding, create() returns the HarfBuzz-backed shaper. Otherwise, it returns NullShaper. NullShaper is a pass-through fallback. It emits one synthetic glyph per Unicode codepoint, with zero advances and zero offsets. It tags the result so observability can detect the fallback, and it leaves advance resolution to the font-metrics module. This path is documented degradation, not full shaping. Substitution, ligatures, mark positioning, and contextual forms require the real backend. wouldUseRealShaper() is a diagnostic predicate. Production code should branch on the result’s shaperImpl tag instead.
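The fallback behaviour described above can be illustrated with a minimal, self-contained sketch. It is not the NextPDF NullShaper: the function name `sketchNullShape`, the array shape, and the `'null'` tag value are all assumptions made for illustration. The sketch emits one synthetic glyph per codepoint with zero advances and zero offsets, records each glyph's byte offset as its cluster, and tags the result so a caller can detect the fallback.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical stand-in for a pass-through fallback shaper.
 * One glyph per UTF-8 sequence, zero advances, zero offsets,
 * cluster = byte offset of the sequence in the source run.
 *
 * @return array{shaperImpl: string, glyphs: list<array<string, int|string>>}
 */
function sketchNullShape(string $utf8): array
{
    $glyphs = [];
    $length = strlen($utf8);
    $offset = 0;

    while ($offset < $length) {
        $lead = ord($utf8[$offset]);
        // Sequence width from the lead byte; stray bytes count as width 1.
        $width = match (true) {
            $lead < 0x80 => 1,
            ($lead & 0xE0) === 0xC0 => 2,
            ($lead & 0xF0) === 0xE0 => 3,
            ($lead & 0xF8) === 0xF0 => 4,
            default => 1,
        };
        // Never read past the end of a truncated sequence.
        $width = min($width, $length - $offset);

        $glyphs[] = [
            'cluster'  => $offset,
            'text'     => substr($utf8, $offset, $width),
            'advanceX' => 0,
            'advanceY' => 0,
            'offsetX'  => 0,
            'offsetY'  => 0,
        ];
        $offset += $width;
    }

    // The tag value is an assumption; the real ShaperImpl values may differ.
    return ['shaperImpl' => 'null', 'glyphs' => $glyphs];
}

$result = sketchNullShape("a\u{4E2D}b");
// Branch on the tag: a fallback result carries no usable advances.
$needsMetricAdvances = $result['shaperImpl'] === 'null';
```

In this sketch the three-byte ideograph sits at byte offset 1, so the following glyph has cluster 4 rather than 2. A caller that sees the fallback tag would resolve advances from font metrics instead of trusting the zeros.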
Script-specific shaping is a service provider interface (SPI), not a bundled implementation. ScriptShaperRegistry is a PHP Standards Recommendation 11 (PSR-11)-style registry that resolves a MongolianShaperInterface or TibetanShaperInterface by International Organization for Standardization (ISO) 15924 script tag. The registry stores keys case-insensitively and relies on one source of truth for script-code admissibility. The registry and the script-shaper interfaces are a frozen contract, so an extension can register a Phase-12 provider without touching call sites. The engine ships the boundary. Consumers supply complex-script providers.
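One way to picture such a registry is a keyed map behind typed accessors. The sketch below is hypothetical throughout: `SketchMongolianShaper`, `SketchScriptRegistry`, and the `supports()` helper are illustrative names, not NextPDF types, and the real registry may store and validate keys differently. It shows case-insensitive storage by ISO 15924 alpha-4 tag and an accessor whose return type stays narrowed.

```php
<?php
declare(strict_types=1);

// Hypothetical stand-in for a script-shaper interface.
interface SketchMongolianShaper {}

final class SketchScriptRegistry
{
    /** @var array<string, object> providers keyed by lower-cased script tag */
    private array $providers = [];

    public function registerMongolian(SketchMongolianShaper $shaper): void
    {
        $this->providers[self::key('Mong')] = $shaper;
    }

    public function hasMongolian(): bool
    {
        return $this->supports('Mong');
    }

    /** Typed accessor: static analysis sees the narrowed return type. */
    public function getMongolian(): SketchMongolianShaper
    {
        $provider = $this->providers[self::key('Mong')] ?? null;
        if (!$provider instanceof SketchMongolianShaper) {
            throw new LogicException('No Mongolian shaper registered');
        }
        return $provider;
    }

    /** Illustrative helper, not part of the documented accessor shape. */
    public function supports(string $tag): bool
    {
        return isset($this->providers[self::key($tag)]);
    }

    private static function key(string $tag): string
    {
        // ISO 15924 alpha-4 codes are four ASCII letters, e.g. "Mong".
        if (preg_match('/^[A-Za-z]{4}$/', $tag) !== 1) {
            throw new InvalidArgumentException("Not an alpha-4 script tag: {$tag}");
        }
        return strtolower($tag);
    }
}

$registry = new SketchScriptRegistry();
$registry->registerMongolian(new class implements SketchMongolianShaper {});
```

Normalizing the key in one private method keeps a single place that decides which tags are admissible, which mirrors the single source of truth described above.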
Chinese, Japanese, and Korean (CJK) run handling sits on the typography encoding boundary. An embedded CJK TrueType face is emitted as a Type 0 font with an Identity-H CMap and a CIDFontType2 descendant, as covered by ISO 32000-2 §9.7.4 (retrieval-augmented generation (RAG) digest truncated by the license cap; recorded in _downgraded-claims-o3.md). When the TrueType program is embedded, the Type 2 CIDFont maps character identifiers to glyph indices through the CIDToGIDMap entry, as covered by ISO 32000-2 §9 (the digest pinned by the B1 contract page). The subsetter preserves original glyph numbering so a /CIDToGIDMap /Identity remains valid for the subset. CjkFontValidator checks whether a candidate font covers the Unicode blocks a script needs before that font is chosen.
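A short sketch may help show why preserved glyph numbering matters. With an Identity-H CMap, each two-byte code in a shown string is the character identifier (CID), and with an identity CIDToGIDMap each CID is the glyph index, so the string operand is simply the run's glyph indices as big-endian 16-bit values. The function name `sketchIdentityHString` is hypothetical and is not part of NextPDF.

```php
<?php
declare(strict_types=1);

/**
 * Encode glyph indices as a PDF hex string for a font that uses
 * Identity-H with an identity CID-to-GID mapping.
 *
 * @param list<int> $glyphIds
 */
function sketchIdentityHString(array $glyphIds): string
{
    $hex = '';
    foreach ($glyphIds as $gid) {
        // Two-byte codes can only address glyph indices 0..65535.
        if ($gid < 0 || $gid > 0xFFFF) {
            throw new RangeException("Glyph index out of 16-bit range: {$gid}");
        }
        $hex .= sprintf('%04X', $gid);
    }
    return '<' . $hex . '>';
}

$operand = sketchIdentityHString([1, 0x1234]);
```

If a subsetter renumbered glyphs, these codes would point at the wrong outlines unless an explicit mapping stream replaced the identity mapping.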
API surface
| Type | Kind | Key members | Stability | Since |
|---|---|---|---|---|
| ShaperInterface | interface | shape(ShaperInput): ShapingResult | stable | 3.2.0 |
| ShaperFactory | final class | default(), create(), wouldUseRealShaper() | stable | 3.2.0 |
| NullShaper | final readonly class | pass-through fallback shaper | stable | 3.2.0 |
| ShapingResult | final readonly class | $glyphRuns, $originalText, $script, $direction, $shaperImpl | stable | 3.2.0 |
| ScriptShaperRegistry | final class | registerMongolian(), getMongolian(), hasMongolian(), and the Tibetan equivalents | stable | 3.1.0 |
| CjkFontValidator | final class | validateCoverage(), detectScript(), isCjkCodepoint() | stable | 1.0.0 |

The register*, get*, and has* method shape of ScriptShaperRegistry and the script-shaper interfaces is a frozen contract. By design, ShapingResult is the only shaper output consumers can see.
Code sample — Quick start
```php
<?php
declare(strict_types=1);

require_once __DIR__ . '/../../vendor/autoload.php';

use NextPDF\Font\Shaper\ShaperFactory;
use NextPDF\Font\Shaper\ShaperImpl;

$factory = ShaperFactory::default();
$shaper = $factory->create();

// Diagnostic probe only. Production code should branch on each result's
// shaperImpl tag, not on the concrete class.
$wouldShape = $factory->wouldUseRealShaper()
    ? 'HarfBuzz backend available'
    : 'NullShaper fallback (degraded: no substitution or positioning)';

echo $wouldShape, "\n";
```

ShaperFactory::default() wires the production capability probe. create() memoizes the selected backend for the lifetime of the factory. Use wouldUseRealShaper() and the shaperImpl tag on each result to inspect capability.
Code sample — Production
```php
<?php
declare(strict_types=1);

require_once __DIR__ . '/../../vendor/autoload.php';

use NextPDF\Text\Shaping\MongolianShaperInterface;
use NextPDF\Text\Shaping\ScriptShaperRegistry;

final readonly class ComplexScriptBootstrap
{
    public function __construct(private ScriptShaperRegistry $registry) {}

    /**
     * Register a consumer-supplied Mongolian shaper provider at boot so
     * the layout pipeline can resolve it by ISO 15924 script tag.
     */
    public function register(MongolianShaperInterface $mongolian): void
    {
        $this->registry->registerMongolian($mongolian);
    }

    public function hasMongolian(): bool
    {
        return $this->registry->hasMongolian();
    }
}
```

The registry is the integration point for complex-script providers. The engine ships the boundary and the frozen accessor shape. Consumers supply the Mongolian and Tibetan implementations.
Edge cases & gotchas
- A NullShaper result has zero advances and zero offsets. Do not feed those positions directly into text layout. Resolve advances from the font-metrics module, and detect the fallback through the shaperImpl tag.
- Empty input produces an empty glyphRuns list, not an empty run. Consumer iteration code does not need a special case for a zero-length run.
- ScriptShaperRegistry does not implement Psr\Container\ContainerInterface directly, so typed accessors keep their narrowed return type under static analysis. Use getMongolian() and getTibetan(), not a generic get().
- Script tags are matched by canonical ISO 15924 alpha-4 value and stored case-insensitively. Pass Mong or Tibt. Casing does not affect lookup.
- CJK Extension B characters live in Unicode plane 2 and force a cmap Format 12 subtable in the subset. The encoding path handles this. Do not assume the basic multilingual plane covers all CJK text.
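The plane 2 point in the last item can be checked with a one-line test. The sketch below is illustrative only, and the function name `sketchNeedsCmapFormat12` is hypothetical. It reports whether a run contains any codepoint above U+FFFF, which a 16-bit cmap subtable cannot address.

```php
<?php
declare(strict_types=1);

/**
 * True when the run contains a supplementary-plane codepoint.
 * preg_match() returns false on malformed UTF-8, so such input
 * yields false here; a caller would validate encoding separately.
 */
function sketchNeedsCmapFormat12(string $utf8): bool
{
    return preg_match('/[\x{10000}-\x{10FFFF}]/u', $utf8) === 1;
}

$bmpOnly = sketchNeedsCmapFormat12("\u{4E2D}\u{6587}"); // BMP ideographs
$withExtB = sketchNeedsCmapFormat12("\u{20000}");       // first Extension B codepoint
```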
Performance
The capability probe runs once per ShaperFactory instance, and the backend is memoized, so repeated create() calls are free. NullShaper is linear in the input run’s codepoint count and performs no input/output (I/O). ScriptShaperRegistry resolution is a constant-time keyed lookup. CjkFontValidator samples codepoints at a stride instead of testing every one, which keeps coverage checks cheap even against a 20,000-glyph CJK font. The performance_budget of 1500 ms wall and 64 MB peak covers a typical run. In real shaping, the dominant cost is the OpenType backend. That cost is outside this module’s scope when the fallback is active.
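Stride sampling can be sketched as follows. This is an illustration under stated assumptions, not the CjkFontValidator implementation: the function name `sketchSampledCoverage`, the set representation, and the stride of 16 are all hypothetical. The sketch probes every n-th codepoint in a block and reports the fraction the font maps.

```php
<?php
declare(strict_types=1);

/**
 * Estimate block coverage by probing every $stride-th codepoint.
 *
 * @param array<int, true> $supported codepoints the font maps, as a set
 */
function sketchSampledCoverage(array $supported, int $first, int $last, int $stride): float
{
    if ($stride < 1 || $last < $first) {
        throw new InvalidArgumentException('Need stride >= 1 and first <= last');
    }

    $sampled = 0;
    $hits = 0;
    for ($cp = $first; $cp <= $last; $cp += $stride) {
        $sampled++;
        if (isset($supported[$cp])) {
            $hits++;
        }
    }

    // $sampled is at least 1 because $first itself is always probed.
    return $hits / $sampled;
}

// A toy font that maps only the first half of a 256-codepoint range.
$halfFont = array_fill_keys(range(0x4E00, 0x4E7F), true);
$coverage = sketchSampledCoverage($halfFont, 0x4E00, 0x4EFF, 16);
```

The trade-off is accuracy for cost: the number of probes shrinks by the stride factor, but a font with scattered gaps can be over- or under-estimated.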
Security notes
The shaper boundary consumes a UTF-8 string. NullShaper tolerates malformed UTF-8 by splitting on a best-effort basis instead of raising, because the documented fallback contract is already “no real shaping”. The caller is prepared for low-quality output. The byte-offset cluster contract uses byte-oriented length, which is correct for multi-byte input and avoids an off-by-codepoint cluster-mapping defect. When present, the real backend is a third-party native library. Treat its input as untrusted, and limit run length upstream. The script-shaper registry stores consumer-supplied providers. Those implementations sit inside the consumer’s trust boundary, not the engine’s.
Conformance
| Claim | Standard | Clause | Evidence |
|---|---|---|---|
| An embedded CJK TrueType face is emitted as a Type 0 font with an Identity-H CMap and a CIDFontType2 descendant. | ISO 32000-2 | §9.7.4 | Retrieval-augmented generation (RAG) digest truncated by license cap; prefix 7a5258772f508e3b, see _downgraded-claims-o3.md |
| An embedded Type 2 CIDFont maps character identifiers to glyph indices through CIDToGIDMap. | ISO 32000-2 | §9 | Digest-pinned; reused from the B1 contract page |

Both clauses are paraphrased. The second is digest-pinned (reused from the B1 contract page), and the first is corroborated by ADR-013 and the cmap-encoder developer overview. NextPDF does not reproduce normative text. The shaper backend is independent of Portable Document Format (PDF) conformance. The conformance claims here concern the CJK font-dictionary emission produced by the encoding boundary. ADR-013 and the cmap-encoder developer overview document that path in more detail.
Commercial context
An advanced text-preprocessing pipeline and extraction services build on the Core shaper boundary and run-handling value types. The Core text module ships the boundary, the fallback, and the script-shaper registry without a license. The missing conversion link is intentional.
See also
- Typography: registry, subsetting, CMap, encoding, BiDi. Covers the encoding boundary and the bidirectional-text engine.
- Font: value types, embedding, fallback. Covers the FontInfo value referenced by shaper input.
- Contracts / Typography. Covers the text-preprocessor contract upstream of shaping.