Fonts: the hard part

Evidence: Mixed evidence

At a glance

Fonts are where a PDF can look completely correct and still be quietly broken. A page can render the right glyphs while being impossible to search, impossible to copy as text, and non-conformant to an archival profile. All of that can happen at once, with nothing visible to warn you. This page is about the three things that have to go right — embedding, subsetting, encoding — and what NextPDF does for each.

Why this matters

“It looks fine” is the most dangerous sentence in PDF work, and fonts are where it does the most damage. Three independent things must hold:

Embedding — the font program travels in the file, so it renders the same on a machine that does not have the font installed.
Subsetting — only the glyphs actually used are carried, so a 20 MB CJK font does not bloat every document.
Encoding — there is a correct map from the character codes on the page back to Unicode, so the text can be searched, copied, indexed, and read by assistive technology.

Visual rendering only partially proves the first one: a page also looks right on any machine that happens to have the font installed, embedded or not. A document can show perfect glyphs and still fail the third entirely, because the text is a picture of words, not words. That is the failure that passes every “looks fine” review and then loses a compliance audit or a discovery request.

The short version

A font in a PDF is a dictionary plus, usually, an embedded font program stream.
Subsetting rewrites that program to contain only the used glyphs. A subset font’s name gets a six-uppercase-letter tag and a + so readers treat it as distinct.
Encoding is the separate problem of mapping character codes to Unicode. A /ToUnicode CMap is what makes text searchable and copyable — and it is independent of whether the glyphs look right.
Right-looking text with no (or a wrong) /ToUnicode is the classic silent failure: perfect on screen, unsearchable in practice.
NextPDF subsets TrueType fonts, preserves glyph identity for correct rendering, and emits a /ToUnicode CMap so extraction works — and can enforce the PDF 2.0 embedding rule rather than only warn.

How NextPDF approaches it

Subsetting. FontSubsetter (src/Typography/FontSubsetter.php) parses the original TrueType table directory and reads the cmap to map Unicode codepoints to glyph IDs. It handles both BMP format 4 and full-Unicode format 12, which CJK needs. Then it does the step naive subsetters miss: it resolves composite glyph dependencies by transitive closure. An accented glyph built from a base letter plus a combining mark references other glyphs as components. If those components are dropped, the glyph renders wrong. The subsetter walks that graph until no new component appears, with a cycle guard so a malformed font cannot loop forever.
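The closure step can be sketched in a few lines. This is a minimal illustration in Python, not NextPDF's PHP code: the function name and the `components` mapping (composite glyph ID to the glyph IDs it references) are hypothetical, and a real subsetter would read those references from the glyf table.

```python
from typing import Dict, Iterable, List, Set


def close_over_components(
    used: Iterable[int],
    components: Dict[int, List[int]],
) -> List[int]:
    """Return `used` plus every glyph reachable through composite references.

    `components` is a hypothetical input: composite glyph ID -> referenced IDs.
    """
    keep: Set[int] = set()
    pending = sorted(set(used))
    while pending:
        gid = pending.pop()
        if gid in keep:
            # Already visited. This check is the cycle guard: a malformed
            # font whose components reference each other cannot loop forever.
            continue
        keep.add(gid)
        pending.extend(components.get(gid, []))
    # Sorted output keeps later processing order independent of input order.
    return sorted(keep)
```

The loop ends when no unvisited component remains, which is the "until no new component appears" condition described above.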

Two engineering choices in that file are worth naming. First, glyph IDs are preserved, not remapped: unused slots are zero-filled in glyf/loca so the content stream’s original glyph indices stay valid under CIDToGIDMap /Identity. Remapping would be smaller, but it would require rewriting every glyph reference. Preserving identity is correct by construction. Second, the traversal is sorted (gid-ascending) so the subset is byte-deterministic: the same font and the same used glyphs produce the same subset bytes, which is what reproducible builds require.

There is also a size threshold. If subsetting would save less than ~10% of the file, the original is returned unchanged, because the overhead is not worth a marginal gain.
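The sketch below shows one way to keep glyph IDs stable while dropping outlines, plus the size threshold. It is an assumption-laden illustration: both function names are hypothetical, dropped glyphs are modelled as zero-length loca entries (a common technique that may differ in detail from the zero-filling NextPDF performs), and the 4-byte padding is an alignment assumption.

```python
from typing import List, Set, Tuple


def rebuild_glyf(glyphs: List[bytes], keep: Set[int]) -> Tuple[bytes, List[int]]:
    """Rebuild glyf data and loca offsets without renumbering any glyph.

    A dropped glyph keeps its slot but gets zero length, so glyph ID n in the
    content stream still refers to slot n in the subset.
    """
    data = bytearray()
    offsets = [0]
    for gid, outline in enumerate(glyphs):  # ascending gid: deterministic bytes
        if gid in keep:
            data += outline
            data += b"\x00" * (-len(outline) % 4)  # pad to a 4-byte boundary
        offsets.append(len(data))
    return bytes(data), offsets


def choose_font_bytes(original: bytes, subset: bytes, min_saving: float = 0.10) -> bytes:
    """Return the subset only if it saves at least `min_saving` of the size."""
    saved = len(original) - len(subset)
    return subset if saved >= min_saving * len(original) else original
```

Because the loop always runs in ascending glyph order, the same inputs give the same output bytes regardless of how the `keep` set was built.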

Embedding. An explicit policy, not guesswork, decides whether a font program is carried at all. Pdf20FontEmbeddingPolicy (src/Writer/Pdf20FontEmbeddingPolicy.php) has two modes. Under the PDF 2.0 profile, Strict rejects a non-embedded standard Type 1 (“Base14”) reference with a typed exception, which is the conformance-correct behaviour. AllowBase14 preserves the historical advisory path for a migration window: it emits the minimal font descriptor the standard still requires and dispatches a warning rather than throwing. The caller makes the choice explicitly on the document; it is never inferred from the font.
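The shape of that decision, reject or warn, chosen by the caller, can be sketched as follows. Every name here (`EmbeddingPolicy`, `FontNotEmbeddedError`, `check_font`) is hypothetical and written in Python for illustration. It does not reproduce the Pdf20FontEmbeddingPolicy API, and it does not model the minimal font descriptor that AllowBase14 emits.

```python
import warnings
from enum import Enum


class EmbeddingPolicy(Enum):
    STRICT = "strict"
    ALLOW_BASE14 = "allow_base14"


class FontNotEmbeddedError(Exception):
    """Raised when the policy requires an embedded font program."""


def check_font(name: str, embedded: bool, policy: EmbeddingPolicy) -> None:
    """Apply the caller's policy to one font reference."""
    if embedded:
        return
    if policy is EmbeddingPolicy.STRICT:
        # Conformance path: a typed exception, never a silent downgrade.
        raise FontNotEmbeddedError(f"{name} is referenced without a font program")
    # Advisory path: the document is still written, but the caller is told.
    warnings.warn(f"{name} is referenced without a font program", UserWarning)
```

The policy is a required argument rather than a default, which mirrors the point that the choice is made explicitly and never inferred from the font.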

Encoding. For composite (Type 0) fonts, EmbeddedTtfFontDictBuilder (src/Writer/EmbeddedTtfFontDictBuilder.php) emits the CIDFontType2 descendant, the Type0 parent, and a /ToUnicode CMap stream so character codes resolve back to Unicode. The /ToUnicode stream is legitimately absent in one case only: when a self-describing predefined CJK CMap already gives the reader the character-to-Unicode mapping. There the CMap is the encoding, so the plain profile omits a redundant /ToUnicode stream to save bytes. Outside that case, the /ToUnicode stream is what keeps text as text.
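To make the /ToUnicode stream concrete, here is a minimal Python sketch that builds the text of such a CMap for two-byte character codes. The function name and the input mapping are assumptions for illustration; the CMap layout follows the usual `beginbfchar` form, with entries split into blocks of at most 100.

```python
from typing import Dict


def build_to_unicode_cmap(code_to_unicode: Dict[int, str]) -> str:
    """Build a /ToUnicode CMap body mapping two-byte codes to Unicode text."""
    lines = [
        "/CIDInit /ProcSet findresource begin",
        "12 dict begin",
        "begincmap",
        "/CIDSystemInfo << /Registry (Adobe) /Ordering (UCS) /Supplement 0 >> def",
        "/CMapName /Adobe-Identity-UCS def",
        "/CMapType 2 def",
        "1 begincodespacerange",
        "<0000> <FFFF>",
        "endcodespacerange",
    ]
    items = sorted(code_to_unicode.items())  # sorted for deterministic output
    for start in range(0, len(items), 100):  # at most 100 entries per block
        chunk = items[start:start + 100]
        lines.append(f"{len(chunk)} beginbfchar")
        for code, text in chunk:
            # Destination values are UTF-16BE, so characters outside the BMP
            # become surrogate pairs.
            utf16 = text.encode("utf-16-be").hex().upper()
            lines.append(f"<{code:04X}> <{utf16}>")
        lines.append("endbfchar")
    lines += ["endcmap", "CMapName currentdict /CMap defineresource pop", "end", "end"]
    return "\n".join(lines) + "\n"
```

For example, `build_to_unicode_cmap({0x0024: "A"})` produces a CMap containing the line `<0024> <0041>`, which is what lets a reader turn code 0x0024 back into the letter A.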

Concern	What it guarantees	What it does not guarantee	Silent failure if wrong
Embedding	Same rendering without the font installed	That text is searchable	Substituted font; wrong metrics on another machine
Subsetting	Small file; only used glyphs	Anything about encoding	Missing composite components → broken accented glyphs
Encoding (`/ToUnicode`)	Searchable, copyable, accessible text	That glyphs render correctly	Perfect-looking page, unsearchable / garbled on copy

The three font concerns are independent. Embedding and subsetting are about appearance and size; encoding is about meaning. A page can pass the first two and fail the third with nothing visible to show it.

What the evidence says

The subset-naming rule is normative and precise. ISO 32000-2, §9.9.2 requires that a font subset’s PostScript name (the BaseFont and the descriptor’s FontName) begin with a tag of exactly six uppercase letters, then a plus sign, then the original font’s PostScript name. It also requires that different subsets of the same font in one file use different tags. That rule is what lets a reader tell two subsets apart and merge documents correctly.

Evidence: Standard-backed
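The naming rule is easy to check mechanically. The sketch below validates the tag format and derives a tag from a hash of the font name and glyph set. Deriving the tag from a hash is an assumption chosen here because it is deterministic; the standard does not prescribe how tags are picked, and all three function names are hypothetical.

```python
import hashlib
import re
from typing import Iterable

# Six uppercase letters, a plus sign, then the original PostScript name.
SUBSET_NAME = re.compile(r"[A-Z]{6}\+.+")


def subset_tag(font_name: str, used_gids: Iterable[int]) -> str:
    """Derive a six-uppercase-letter tag from the font name and glyph set."""
    key = font_name + ":" + ",".join(str(g) for g in sorted(set(used_gids)))
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return "".join(chr(ord("A") + b % 26) for b in digest[:6])


def subset_font_name(font_name: str, used_gids: Iterable[int]) -> str:
    """Return the tagged name, for example ABCDEF+NotoSans."""
    return f"{subset_tag(font_name, used_gids)}+{font_name}"


def is_subset_name(name: str) -> bool:
    """Check that a font name carries a well-formed subset tag."""
    return SUBSET_NAME.fullmatch(name) is not None
```

A hash makes collisions unlikely but not impossible, so a writer using this approach would still need to compare the tags within one file and pick a different tag on a clash to satisfy the different-tags requirement.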

Encoding is a separate clause from rendering. ISO 32000-2, §9.10.3 defines /ToUnicode as a stream containing a CMap that maps character codes to Unicode values, and the text-extraction procedure in §9.10.2 uses that CMap to convert character codes to Unicode for searching and indexing. Nothing in the glyph-painting machinery touches /ToUnicode, which is precisely why text can look right and extract wrong.
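A toy extractor shows the split. The Python sketch below models only the /ToUnicode path for two-byte codes; the function name is hypothetical, and real readers have further fallbacks (such as predefined CMaps) that are deliberately left out.

```python
from typing import Dict, Optional

REPLACEMENT = "\ufffd"  # U+FFFD, used when a code has no known Unicode value


def extract_text(codes: bytes, to_unicode: Optional[Dict[int, str]]) -> str:
    """Decode two-byte character codes through a /ToUnicode-style mapping."""
    if len(codes) % 2:
        raise ValueError("expected two-byte character codes")
    out = []
    for i in range(0, len(codes), 2):
        code = int.from_bytes(codes[i:i + 2], "big")
        if to_unicode is not None and code in to_unicode:
            out.append(to_unicode[code])
        else:
            # The glyph may still paint correctly; only the meaning is lost.
            out.append(REPLACEMENT)
    return "".join(out)
```

With the same string bytes, `extract_text(codes, mapping)` returns readable text while `extract_text(codes, None)` returns only replacement characters, even though a renderer would paint identical glyphs in both cases.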

On embedding, the standard states that most font dictionaries carry a font descriptor whose embedded font-file stream is optional but strongly recommended. PDF 2.0 tightens this for the fourteen standard Type 1 fonts specifically. NextPDF’s Strict policy is the conformance-correct reading of that tightening. AllowBase14 is the explicit, opt-in backwards-compatibility escape — the engine never silently downgrades.

Strict PDF 2.0 font-embedding enforcement — edition availability
Edition	Availability
Core	Available. Subsetting, `/ToUnicode` emission, and the explicit `Strict` / `AllowBase14` embedding policy are core engine behaviour.
Pro	Adds deeper conformance enforcement and reporting around font embedding at the profile level.
Enterprise	Adds the same conformance enforcement under the enterprise operational surface.

Practical example

Here are the two halves of a correctly embedded, subsetted, searchable composite font. The subset tag follows the standard’s six-letter rule; the /ToUnicode reference keeps the text extractable.

% The Type 0 (composite) font dictionary
20 0 obj
<< /Type /Font /Subtype /Type0
   /BaseFont /ABCDEF+NotoSans            % six-letter subset tag + '+'
   /Encoding /Identity-H
   /DescendantFonts [21 0 R]
   /ToUnicode 23 0 R >>                  % the map that makes text searchable
endobj

% The descendant CIDFontType2 (carries the subsetted program)
21 0 obj
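<< /Type /Font /Subtype /CIDFontType2
   /BaseFont /ABCDEF+NotoSans
   /CIDSystemInfo << /Registry (Adobe) /Ordering (Identity) /Supplement 0 >>  % required in every CIDFont
   /CIDToGIDMap /Identity                % glyph IDs preserved, not remapped
   /FontDescriptor 22 0 R >>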
endobj

Object 20’s /ToUnicode 23 0 R is the difference between a searchable document and a picture of one. Drop it (outside the predefined-CMap case), and every glyph still paints perfectly, yet a search for any word on the page finds nothing.

Common misconception

The trap, stated plainly: glyphs rendering correctly says nothing about whether the text is text. Rendering follows the encoding-to-glyph path. Search and copy follow the code-to-Unicode path (/ToUnicode). They are different mechanisms reading different parts of the font dictionary. A document can therefore have flawless visual output and an absent or wrong /ToUnicode. The result is a page that looks authoritative and is functionally unsearchable — the failure that survives every visual review, because by definition there is nothing to see.

A related trap: assuming “the font is embedded, so we are fine for archival.” Embedding is necessary but not sufficient. A profile such as PDF/A also expects subsets named per the six-letter rule and correct encoding. Embedded-but-unsearchable still fails.

Limits and boundaries

NextPDF’s subsetter is specifically a TrueType subsetter. It requires the essential TrueType tables and returns the original font unchanged when they are missing or the gain is below the ~10% threshold. Subsetting and a /ToUnicode CMap make text extractable, but they cannot rescue a source font that lacks the information to map a glyph back to a meaningful character. Where no Unicode value can be determined, no amount of CMap emission invents one.
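The "essential tables" check can be sketched by reading the sfnt table directory. This is an illustrative Python version, not the FontSubsetter code: the function names are hypothetical, and the `REQUIRED_TABLES` list is an assumption about which tables a glyf-based subsetter needs, since this page does not enumerate them.

```python
import struct
from typing import Dict, List, Tuple

# Assumption: tables this sketch treats as essential for TrueType subsetting.
REQUIRED_TABLES = ("cmap", "glyf", "head", "loca", "maxp")


def read_table_directory(font: bytes) -> Dict[str, Tuple[int, int]]:
    """Map each table tag to (offset, length) from an sfnt table directory."""
    if len(font) < 12:
        raise ValueError("truncated sfnt header")
    # Header: uint32 version, uint16 numTables, then three uint16 search hints.
    _version, num_tables = struct.unpack(">IH", font[:6])
    tables = {}
    for i in range(num_tables):
        start = 12 + 16 * i
        record = font[start:start + 16]
        if len(record) < 16:
            raise ValueError("truncated table directory")
        # Record: 4-byte tag, uint32 checksum, uint32 offset, uint32 length.
        tag, _checksum, offset, length = struct.unpack(">4sIII", record)
        tables[tag.decode("latin-1")] = (offset, length)
    return tables


def missing_tables(font: bytes) -> List[str]:
    """List the required tables that the font does not contain."""
    present = read_table_directory(font)
    return [tag for tag in REQUIRED_TABLES if tag not in present]
```

A subsetter following the behaviour described above would return the original font bytes unchanged whenever `missing_tables` is non-empty.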

This page is about producing correct font structure in documents NextPDF writes. It is not a font-repair tool for arbitrary inbound PDFs. And emitting a conformant subset and encoding does not by itself certify a document against a full archival profile — that is a separate, broader check.

Mini-FAQ

Why the six-letter tag — why not the font name? So a reader can tell two different subsets of the same font apart and merge documents without colliding their glyph sets. Different subsets, different tags, by rule.

When is it acceptable to have no /ToUnicode? When a self-describing predefined CJK CMap already provides the character-to-Unicode mapping. There the CMap is the encoding. A separate /ToUnicode would be redundant. Outside that, its absence is a defect.

Does subsetting ever hurt? Only if done wrong. Dropping composite-glyph components breaks accented glyphs. Remapping glyph IDs without rewriting references breaks rendering. NextPDF avoids both by resolving the component closure and preserving glyph identity.

Related pages

Streams and filters: embedded font programs are filtered stream objects with their own decode contract.
What a PDF actually is: the object model the font dictionaries and program streams live in.
PDF 2.0: what changed, including the tightened font-embedding expectations of the 2.0 baseline.

Glossary

Embedded font program — the actual font file (TrueType/CFF/Type 1) carried inside the PDF as a stream, so rendering does not depend on the reader’s installed fonts.
Subsetting — rewriting a font program to contain only the glyphs the document uses, to reduce size.
Subset tag — the mandatory six-uppercase-letter prefix plus + on a subset font’s name (for example, ABCDEF+NotoSans).
/ToUnicode — a CMap stream mapping character codes to Unicode values; what makes PDF text searchable, copyable, and accessible.
Composite glyph — a glyph built by referencing other glyphs as components; its components must be kept when subsetting.
CIDToGIDMap /Identity — the mode where content-stream glyph indices are the font’s own glyph IDs unchanged; NextPDF preserves glyph identity to keep this valid.
Base14 — the fourteen standard Type 1 fonts; PDF 2.0 expects fonts to be embedded rather than referenced by name.