Fonts: the hard part
ISO 32000-2 §9 Evidence: Mixed evidence
At a glance
Fonts are where a PDF can look completely correct and still be quietly broken. A page can render the right glyphs while being impossible to search, impossible to copy as text, and non-conformant to an archival profile. All of that can happen at once, with nothing visible to warn you. This page covers the three things that have to go right (embedding, subsetting, and encoding) and what NextPDF does for each.
Why this matters
“It looks fine” is the most dangerous sentence in PDF work, and fonts are where it does the most damage. Three independent things must hold:
- Embedding — the font program travels in the file, so it renders the same on a machine that does not have the font installed.
- Subsetting — only the glyphs actually used are carried, so a 20 MB CJK font does not bloat every document.
- Encoding — there is a correct map from the character codes on the page back to Unicode, so the text can be searched, copied, indexed, and read by assistive technology.
Visual rendering only partially proves the first one. A document can show perfect glyphs and still fail the third entirely — the text is a picture of words, not words. That is the failure that passes every “looks fine” review and then loses a compliance audit or a discovery request.
The short version
- A font in a PDF is a dictionary plus, usually, an embedded font program stream.
- Subsetting rewrites that program to contain only the used glyphs. A subset font’s name gets a six-uppercase-letter tag and a plus sign (+) so readers treat it as distinct.
- Encoding is the separate problem of mapping character codes to Unicode. A /ToUnicode CMap is what makes text searchable and copyable, and it is independent of whether the glyphs look right.
- Right-looking text with no (or a wrong) /ToUnicode is the classic silent failure: perfect on screen, unsearchable in practice.
- NextPDF subsets TrueType fonts, preserves glyph identity for correct rendering, and emits a /ToUnicode CMap so extraction works. It can also enforce the PDF 2.0 embedding rule rather than only warn.
How NextPDF approaches it
Subsetting. FontSubsetter (src/Typography/FontSubsetter.php) parses
the original TrueType table directory and reads the cmap to map Unicode
codepoints to glyph IDs. It handles both BMP format 4 and full-Unicode format
12, which CJK needs. Then it does the step naive subsetters miss: it
resolves composite glyph dependencies by transitive closure. An
accented glyph built from a base letter plus a combining mark references
other glyphs as components. If those components are dropped, the glyph
renders wrong. The subsetter walks that graph until no new component
appears, with a cycle guard so a malformed font cannot loop forever.
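The closure step can be sketched in a few lines. This is a minimal illustration in Python, not the FontSubsetter code: the function name glyph_closure and the pre-parsed components table (composite glyph ID to the glyph IDs it references) are assumptions made for the example.

```python
from typing import Dict, Iterable, List, Set


def glyph_closure(used: Iterable[int], components: Dict[int, List[int]]) -> List[int]:
    """Return every glyph ID needed to draw `used`, components included.

    `components` is a hypothetical pre-parsed table: composite glyph ID ->
    the glyph IDs it references. Simple glyphs have no entry.
    """
    keep: Set[int] = set()
    pending = list(used)
    while pending:
        gid = pending.pop()
        if gid in keep:
            # Cycle guard: a glyph already collected is never expanded again,
            # so a font whose composites reference each other still terminates.
            continue
        keep.add(gid)
        pending.extend(components.get(gid, []))
    # Ascending order keeps the result independent of traversal order.
    return sorted(keep)


# 200 is an accented letter built from 36 (base) and 150 (mark);
# 150 and 151 reference each other to simulate a malformed font.
demo_components = {200: [36, 150], 150: [151], 151: [150]}
closure = glyph_closure([200], demo_components)
```

Only glyph 200 was requested, yet the closure also contains 36, 150, and 151. Dropping any of them would leave the accented glyph with a missing part.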
Two engineering choices in that file are worth naming. First, glyph IDs are
preserved, not remapped — unused slots are zero-filled in glyf/loca so
the content stream’s original glyph indices stay valid under
CIDToGIDMap /Identity. Remapping would be smaller, but it would require
rewriting every glyph reference. Preserving identity is correct by
construction. Second, the traversal is sorted (gid-ascending) so the subset
is byte-deterministic — the same font and the same used glyphs produce the
same subset bytes, which is what reproducible builds require. If subsetting
would save less than ~10% of the file, the original is returned unchanged. The
overhead is not worth a marginal gain.
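A rough sketch of the two choices follows, assuming the glyph outlines are already split into one byte string per glyph ID. The names subset_glyf and worth_subsetting are hypothetical, dropped glyphs are modelled as zero-length loca entries, and real glyf data also needs alignment padding, which is left out here.

```python
from typing import List, Set, Tuple


def subset_glyf(glyphs: List[bytes], keep: Set[int]) -> Tuple[bytes, List[int]]:
    """Rebuild glyf/loca so kept glyphs stay at their original glyph IDs.

    Dropped glyphs become empty slots (loca[i] == loca[i + 1]) instead of
    being removed, so the glyph count and every index stay the same.
    """
    glyf = bytearray()
    loca = [0]
    for gid, data in enumerate(glyphs):  # ascending gid: deterministic output
        if gid in keep:
            glyf += data
        loca.append(len(glyf))
    return bytes(glyf), loca


def worth_subsetting(original_size: int, subset_size: int, min_saving: float = 0.10) -> bool:
    """True when the subset saves at least `min_saving` of the original size."""
    return (original_size - subset_size) >= min_saving * original_size


demo_glyphs = [b"AA", b"BBBB", b"CC", b"DDDDDD"]
demo_glyf, demo_loca = subset_glyf(demo_glyphs, {0, 3})
```

Glyph 3 is still glyph 3 after subsetting, so a content stream that refers to it needs no rewriting. Calling the function twice with the same inputs yields identical bytes.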
Embedding. An explicit policy decides whether a font program is carried at
all — never guesswork. Pdf20FontEmbeddingPolicy
(src/Writer/Pdf20FontEmbeddingPolicy.php) has two modes. Under the PDF 2.0
profile, Strict rejects a non-embedded standard Type 1 (“Base14”) reference
with a typed exception — the conformance-correct behaviour. AllowBase14
preserves the historical advisory path. During a migration window, it emits the
minimal font descriptor the standard still requires and dispatches a warning
rather than throwing. The caller makes the choice explicitly on the document;
it is never inferred from the font.
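The shape of such a policy can be illustrated with a small Python sketch. Everything named here (EmbeddingPolicy, FontNotEmbeddedError, check_unembedded_base14) is hypothetical and only mirrors the two modes described above; it is not the Pdf20FontEmbeddingPolicy API.

```python
import warnings
from enum import Enum

# The fourteen standard Type 1 font names.
BASE14 = frozenset({
    "Times-Roman", "Times-Bold", "Times-Italic", "Times-BoldItalic",
    "Helvetica", "Helvetica-Bold", "Helvetica-Oblique", "Helvetica-BoldOblique",
    "Courier", "Courier-Bold", "Courier-Oblique", "Courier-BoldOblique",
    "Symbol", "ZapfDingbats",
})


class EmbeddingPolicy(Enum):
    STRICT = "strict"
    ALLOW_BASE14 = "allow_base14"


class FontNotEmbeddedError(ValueError):
    """Typed error raised when a non-embedded font is rejected."""


def check_unembedded_base14(base_font: str, policy: EmbeddingPolicy) -> str:
    """Decide what happens to a non-embedded standard-14 font reference."""
    if base_font not in BASE14:
        raise ValueError(f"{base_font} is not a standard 14 font name")
    if policy is EmbeddingPolicy.STRICT:
        raise FontNotEmbeddedError(f"{base_font} must be embedded under this profile")
    # Advisory path: warn and let the caller continue.
    warnings.warn(f"{base_font} is referenced without an embedded font program")
    return "advisory"
```

The policy is an argument the caller passes in, so the decision is visible at the call site instead of being derived from the font.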
Encoding. For composite (Type 0) fonts, EmbeddedTtfFontDictBuilder
(src/Writer/EmbeddedTtfFontDictBuilder.php) emits the CIDFontType2
descendant, the Type0 parent, and a /ToUnicode CMap stream so character
codes resolve back to Unicode. The /ToUnicode stream is legitimately absent
in one case only: when a self-describing predefined CJK CMap already gives
the reader the character-to-Unicode mapping. There the CMap is the
encoding, so the plain profile omits a redundant /ToUnicode stream to save
bytes. Outside that case, the /ToUnicode stream is what keeps text as text.
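To make the /ToUnicode stream concrete, here is a minimal Python sketch that writes the CMap text for a map of two-byte character codes to Unicode strings. The function name to_unicode_cmap is an assumption, and the sketch uses only bfchar entries; it is not the output of EmbeddedTtfFontDictBuilder.

```python
from typing import Dict

_HEADER = [
    "/CIDInit /ProcSet findresource begin",
    "12 dict begin",
    "begincmap",
    "/CIDSystemInfo << /Registry (Adobe) /Ordering (UCS) /Supplement 0 >> def",
    "/CMapName /Adobe-Identity-UCS def",
    "/CMapType 2 def",
    "1 begincodespacerange",
    "<0000> <FFFF>",
    "endcodespacerange",
]
_FOOTER = [
    "endcmap",
    "CMapName currentdict /CMap defineresource pop",
    "end",
    "end",
]


def to_unicode_cmap(code_to_text: Dict[int, str]) -> str:
    """Build ToUnicode CMap text from {two-byte code: Unicode string}."""
    entries = sorted(code_to_text.items())  # sorted for stable output
    lines = list(_HEADER)
    # bfchar sections are conventionally limited to 100 entries each.
    for start in range(0, len(entries), 100):
        chunk = entries[start:start + 100]
        lines.append(f"{len(chunk)} beginbfchar")
        for code, text in chunk:
            # Destination values are UTF-16BE, written as hex.
            dst = text.encode("utf-16-be").hex().upper()
            lines.append(f"<{code:04X}> <{dst}>")
        lines.append("endbfchar")
    lines.extend(_FOOTER)
    return "\n".join(lines) + "\n"


demo_cmap = to_unicode_cmap({0x0024: "A", 0x1F00: "\U0001F600"})
```

A character outside the BMP becomes a surrogate pair in the destination value, which is why the second entry is four bytes long.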
| Concern | What it guarantees | What it does not guarantee | Silent failure if wrong |
|---|---|---|---|
| Embedding | Same rendering without the font installed | That text is searchable | Substituted font; wrong metrics on another machine |
| Subsetting | Small file; only used glyphs | Anything about encoding | Missing composite components → broken accented glyphs |
| Encoding (/ToUnicode) | Searchable, copyable, accessible text | That glyphs render correctly | Perfect-looking page, unsearchable / garbled on copy |
The three font concerns are independent. Embedding and subsetting are about appearance and size; encoding is about meaning. A page can pass the first two and fail the third with nothing visible to show it.
What the evidence says
The subset-naming rule is normative and precise.
ISO 32000-2 §9.9.2 requires that a font
subset’s PostScript name — the BaseFont and the descriptor’s FontName —
begin with a tag of exactly six uppercase letters, then a plus sign,
then the original font’s PostScript name. It also requires that different
subsets of the same font in one file use different tags. That rule is what
lets a reader tell two subsets apart and merge documents correctly. Evidence: Standard-backed
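The naming rule is easy to check mechanically. The Python sketch below validates a subset name and derives a tag from a hash of the font name and glyph set. The hash derivation is an assumption chosen for illustration, since the standard only requires six uppercase letters and distinct tags per subset, and the function names are hypothetical.

```python
import hashlib
import re
from typing import Iterable

# Six uppercase letters, a plus sign, then the original PostScript name.
_SUBSET_NAME = re.compile(r"[A-Z]{6}\+.+")


def is_subset_name(base_font: str) -> bool:
    """True when `base_font` has the tag-plus-name shape of a font subset."""
    return _SUBSET_NAME.fullmatch(base_font) is not None


def subset_tag(font_name: str, glyph_ids: Iterable[int]) -> str:
    """Derive a six-letter tag from the font name and the used glyph IDs.

    Assumption: hashing is one way to get a repeatable tag. A writer must
    still confirm that two different subsets in one file did not collide.
    """
    ids = ",".join(str(g) for g in sorted(set(glyph_ids)))
    digest = hashlib.sha256(f"{font_name}:{ids}".encode("utf-8")).digest()
    value = int.from_bytes(digest[:8], "big")
    letters = []
    for _ in range(6):
        value, remainder = divmod(value, 26)
        letters.append(chr(ord("A") + remainder))
    return "".join(letters)


demo_tag = subset_tag("NotoSans", [36, 200])
demo_name = f"{demo_tag}+NotoSans"
```

Because the glyph IDs are sorted before hashing, the same font and the same glyph set always produce the same tag, whatever order the glyphs were encountered in.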
Encoding is a separate clause from rendering.
ISO 32000-2 §9.10.3 defines /ToUnicode
as a stream containing a CMap that maps character codes to Unicode values,
and the text-extraction procedure in
ISO 32000-2 §9.10.2 uses that CMap to
convert character codes to Unicode for searching and indexing. Nothing in
the glyph-painting machinery touches /ToUnicode — which is precisely why
text can look right and extract wrong.
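The extraction side can be sketched the same way. This simplified Python example reads only bfchar entries from a CMap and decodes a string of two-byte codes. The names parse_bfchar and extract_text are hypothetical, and a real extractor also handles bfrange sections and other code lengths.

```python
import re
from typing import Dict

_BFCHAR_BLOCK = re.compile(r"beginbfchar(.*?)endbfchar", re.S)
_PAIR = re.compile(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>")


def parse_bfchar(cmap_text: str) -> Dict[int, str]:
    """Collect {character code: Unicode string} from bfchar sections."""
    mapping: Dict[int, str] = {}
    for block in _BFCHAR_BLOCK.findall(cmap_text):
        for src, dst in _PAIR.findall(block):
            mapping[int(src, 16)] = bytes.fromhex(dst).decode("utf-16-be")
    return mapping


def extract_text(codes: bytes, mapping: Dict[int, str]) -> str:
    """Decode two-byte character codes; unmapped codes become U+FFFD."""
    out = []
    for i in range(0, len(codes), 2):
        code = int.from_bytes(codes[i:i + 2], "big")
        out.append(mapping.get(code, "\ufffd"))
    return "".join(out)


demo_cmap_text = "2 beginbfchar\n<0024> <0041>\n<0025> <0042>\nendbfchar\n"
shown_codes = bytes.fromhex("00240025")
with_map = extract_text(shown_codes, parse_bfchar(demo_cmap_text))
without_map = extract_text(shown_codes, {})
```

The same two codes paint the same two glyphs in both cases. Only the version with a mapping yields text a search can match.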
On embedding, the standard states that most font dictionaries carry a font
descriptor whose embedded font-file stream is optional but strongly
recommended. PDF 2.0 tightens this for the fourteen standard Type 1 fonts
specifically. NextPDF’s Strict policy is the conformance-correct reading
of that tightening. AllowBase14 is the explicit, opt-in
backwards-compatibility escape — the engine never silently downgrades.
| Edition | Availability |
|---|---|
| Core | Available. Subsetting, |
| Pro | Adds deeper conformance enforcement and reporting around font embedding at the profile level. |
| Enterprise | Adds the same conformance enforcement under the enterprise operational surface. |
Practical example
Here are the two halves of a correctly embedded, subsetted, searchable
composite font. The subset tag follows the standard’s six-letter rule; the
/ToUnicode reference keeps the text extractable.
```
% The Type 0 (composite) font dictionary
20 0 obj
<< /Type /Font
   /Subtype /Type0
   /BaseFont /ABCDEF+NotoSans      % six-letter subset tag + '+'
   /Encoding /Identity-H
   /DescendantFonts [21 0 R]
   /ToUnicode 23 0 R >>            % the map that makes text searchable
endobj

% The descendant CIDFontType2 (carries the subsetted program)
21 0 obj
<< /Type /Font
   /Subtype /CIDFontType2
   /BaseFont /ABCDEF+NotoSans
   /CIDToGIDMap /Identity          % glyph IDs preserved, not remapped
   /FontDescriptor 22 0 R >>
endobj
```

Object 20’s /ToUnicode 23 0 R is the difference between a searchable
document and a picture of one. Drop it (outside the predefined-CMap case), and
every glyph still paints perfectly, yet a search for any word on the page finds
nothing.
Common misconception
The trap, stated plainly: glyphs rendering correctly says nothing about
whether the text is text. Rendering follows the encoding-to-glyph path.
Search and copy follow the code-to-Unicode path (/ToUnicode). They are
different mechanisms reading different parts of the font dictionary. A document
can therefore have flawless visual output and an absent or wrong /ToUnicode.
The result is a page that looks authoritative and is functionally unsearchable
— the failure that survives every visual review, because by definition there
is nothing to see.
A related trap: assuming “the font is embedded, so we are fine for archival.” Embedding is necessary but not sufficient. A profile such as PDF/A also expects subsets named per the six-letter rule and correct encoding. Embedded-but-unsearchable still fails.
Limits and boundaries
NextPDF’s subsetter is specifically a TrueType subsetter. It requires
the essential TrueType tables and returns the original font unchanged when
they are missing or the gain is below the ~10% threshold. Subsetting and a
/ToUnicode CMap make text extractable, but they cannot rescue a source
font that lacks the information to map a glyph back to a meaningful
character. Where no Unicode value can be determined, no amount of CMap
emission invents one.
This page is about producing correct font structure in documents NextPDF writes. It is not a font-repair tool for arbitrary inbound PDFs. And emitting a conformant subset and encoding does not by itself certify a document against a full archival profile — that is a separate, broader check.
Mini-FAQ
Why the six-letter tag, and why not just the font name?
So a reader can tell two different subsets of the same font apart and merge documents without colliding their glyph sets. Different subsets, different tags, by rule.
When is it acceptable to have no /ToUnicode?
When a self-describing predefined CJK CMap already provides the
character-to-Unicode mapping. There the CMap is the encoding. A separate
/ToUnicode would be redundant. Outside that, its absence is a defect.
Does subsetting ever hurt?
Only if done wrong. Dropping composite-glyph components breaks accented glyphs. Remapping glyph IDs without rewriting references breaks rendering. NextPDF avoids both by resolving the component closure and preserving glyph identity.
Related docs
- Streams and filters (embedded font programs are filtered stream objects with their own decode contract).
- What a PDF actually is (the object model the font dictionaries and program streams live in).
- PDF 2.0: what changed (including the tightened font-embedding expectations of the 2.0 baseline).
Glossary
- Embedded font program: the actual font file (TrueType/CFF/Type 1) carried inside the PDF as a stream, so rendering does not depend on the reader’s installed fonts.
- Subsetting: rewriting a font program to contain only the glyphs the document uses, to reduce size.
- Subset tag: the mandatory six-uppercase-letter prefix plus + on a subset font’s name (for example, ABCDEF+NotoSans).
- /ToUnicode: a CMap stream mapping character codes to Unicode values; what makes PDF text searchable, copyable, and accessible.
- Composite glyph: a glyph built by referencing other glyphs as components; its components must be kept when subsetting.
- CIDToGIDMap /Identity: the mode where content-stream glyph indices are the font’s own glyph IDs unchanged; NextPDF preserves glyph identity to keep this valid.
- Base14: the fourteen standard Type 1 fonts; PDF 2.0 expects fonts to be embedded rather than referenced by name.