Extract text content over NextPDF Connect (Pro)

At a glance

Use extract_text to pull text out of an existing PDF for indexing, analysis, or downstream processing. extract_text is a Pro-tier tool: the Pro tool provider registers new ExtractTextTool() under the protocol name extract_text, and this page re-verifies that binding. At boot, the server probes for the tool with class_exists() and registers it only when the Pro package is installed. You can request plain output, a page range, or page-segmented structured output.

Install

composer require nextpdf/server
composer require nextpdf/pro

Bind a transport. Before you rely on the tool, confirm it with diagnostic.capabilities.
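A minimal sketch of that pre-flight check, assuming the diagnostic.capabilities result has already been decoded into a dictionary with a tools list. The response shape and the has_tool helper are assumptions made for illustration, not the documented schema, so adjust the lookup to whatever your server actually returns.

```python
from typing import Any, Mapping


def has_tool(capabilities: Mapping[str, Any], name: str) -> bool:
    """Return True when `name` appears in the (assumed) `tools` list.

    Entries may be plain strings or objects with a `name` key. Both shapes
    are guesses about the response, not the documented schema.
    """
    for entry in capabilities.get("tools", []):
        if entry == name:
            return True
        if isinstance(entry, Mapping) and entry.get("name") == name:
            return True
    return False


# Hypothetical decoded responses from diagnostic.capabilities.
core_only = {"tools": ["parse_pdf"]}
with_pro = {"tools": ["parse_pdf", {"name": "extract_text", "tier": "pro"}]}

print(has_tool(core_only, "extract_text"))  # False
print(has_tool(with_pro, "extract_text"))   # True
```

Running the check once at client start-up lets a Core-only deployment fail fast instead of discovering the missing tool on the first extraction call.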

Conceptual overview

Extraction reads text-showing operators from the content stream in stream order (ISO 32000-2 §9.4). The output reflects the encoded reading order (ISO 32000-2 §9.10). A scanned PDF with no text layer returns little or no text. That reflects the source file, not a tool defect. format: "plain" returns one string. format: "structured" returns per-page objects with character counts. page_range limits the pages processed.

API surface

Tool	Tier	Role	Risk tier
`extract_text`	Pro	Extract text (plain / structured / range)	Safe
`parse_pdf`	Core (env-gated)	Low-level structure (page count, metadata)	Safe

Tool names are registry protocol names. The tool catalog is the catalog of record. The available tools depend on the installed tier.

Code sample — Quick start

Call extract_text with source (a server-readable path) and format: "plain" to get the whole text as one string.
Add page_range: "1-3" to extract a subset of pages.
Set format: "structured" to get page-segmented output.
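The three calls above could be built as follows. This is a sketch, not the server's reference client: the argument names source, format, and page_range come from this page, the JSON-RPC envelope follows the MCP tools/call convention, and the file path and the build_call helper are hypothetical. REST and gRPC would wrap the same arguments differently.

```python
import json
from typing import Any, Dict, Optional


def build_call(request_id: int, source: str, fmt: str = "plain",
               page_range: Optional[str] = None) -> Dict[str, Any]:
    """Build an MCP-style tools/call request for extract_text."""
    # `format` is always sent explicitly; this page does not state a default.
    arguments: Dict[str, Any] = {"source": source, "format": fmt}
    if page_range is not None:
        arguments["page_range"] = page_range
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": "extract_text", "arguments": arguments},
    }


SOURCE = "/srv/pdfs/report.pdf"  # hypothetical absolute, server-readable path

requests = [
    build_call(1, SOURCE),                       # plain text, whole document
    build_call(2, SOURCE, page_range="1-3"),     # plain text, pages 1 to 3
    build_call(3, SOURCE, fmt="structured"),     # page-segmented output
]

for request in requests:
    print(json.dumps(request))
```

Each printed line is one request body; how it is framed and sent depends on the transport you bound.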

Code sample — Production

Use parse_pdf (or a prior unbounded extract_text) to get the page count before you request a range. For retrieval-augmented generation (RAG) or indexing, prefer format: "structured" so that each page can be chunked independently. For an encrypted source, supply the password parameter. Character counts are counts of Unicode code points, not UTF-8 bytes.
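A client can check a range against the known page count before sending it, and can confirm the difference between code points and bytes locally. The sketch below assumes the "first-last" range syntax shown on this page is the only form in use; parse_page_range, check_range, and the page count of 12 are hypothetical.

```python
from typing import Tuple


def parse_page_range(page_range: str) -> Tuple[int, int]:
    """Parse a "first-last" range such as "1-3" into 1-based integers."""
    first_text, sep, last_text = page_range.partition("-")
    if not sep:
        raise ValueError(f"expected 'first-last', got {page_range!r}")
    first, last = int(first_text), int(last_text)
    if first < 1 or last < first:
        raise ValueError(f"invalid page range {page_range!r}")
    return first, last


def check_range(page_range: str, page_count: int) -> str:
    """Return page_range unchanged if it fits within page_count pages."""
    _, last = parse_page_range(page_range)
    if last > page_count:
        raise ValueError(
            f"range {page_range} exceeds the {page_count}-page document")
    return page_range


page_count = 12  # hypothetical value, e.g. read from a parse_pdf result
print(check_range("1-3", page_count))  # 1-3

# Code points versus UTF-8 bytes: U+00E9 is one code point but two bytes.
text = "caf\u00e9"
print(len(text), len(text.encode("utf-8")))  # 4 5
```

Validating on the client saves a round trip, although the server remains the authority on the actual page count.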

Edge cases & gotchas

Source missing. A bad path returns a file-not-found error. Use absolute paths the server can read.
Scanned PDF. Without a text layer, extraction returns empty or near-empty text. Run optical character recognition (OCR) on the source first.
Out-of-range page. A range that extends past the last page is rejected, and the error reports the actual page count.
Encrypted source. Supply the password parameter.
Pro absent. With Core only, extract_text is not registered. Probe with diagnostic.capabilities.
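For the scanned-PDF case, one way to decide which pages to send to OCR is to look at the per-page character counts in structured output. This is a heuristic sketch: the page and char_count field names are assumed (this page only says that structured output carries per-page objects with character counts), and the threshold of 20 characters is arbitrary.

```python
from typing import Any, List, Mapping, Sequence


def pages_needing_ocr(pages: Sequence[Mapping[str, Any]],
                      min_chars: int = 20) -> List[int]:
    """Return page numbers whose character count is below min_chars.

    `page` and `char_count` are assumed field names for illustration.
    """
    return [p["page"] for p in pages if p.get("char_count", 0) < min_chars]


# Hypothetical structured result (page text omitted for brevity).
sample = [
    {"page": 1, "char_count": 1840},
    {"page": 2, "char_count": 0},
    {"page": 3, "char_count": 1},
]

print(pages_needing_ocr(sample))  # [2, 3]
```

A page that is legitimately blank also falls under the threshold, so treat the result as a list of candidates rather than a verdict.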

Performance

Extraction scales with document size, and the budget allows large inputs. The profile is structural for any produced artefact because this tool returns text, not a PDF.

Security notes

Extracted text may contain sensitive content. Treat the result as confidential, and return it only over a trusted channel. The tool has no filesystem write. It reads the source path with the server’s privileges, so constrain which paths a caller may pass.
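One way to constrain caller-supplied paths is an allowlist check in the layer that sits in front of the tool. The sketch below is an assumption about how such a guard could look, not something the server is documented to provide; is_allowed_source and the /srv/pdfs root are hypothetical.

```python
import os
from typing import Iterable


def is_allowed_source(path: str, allowed_roots: Iterable[str]) -> bool:
    """Accept only absolute paths that resolve inside an allowed root."""
    if not os.path.isabs(path):
        return False
    # realpath collapses ".." segments and follows symlinks before comparing.
    real = os.path.realpath(path)
    for root in allowed_roots:
        real_root = os.path.realpath(root)
        try:
            if os.path.commonpath([real, real_root]) == real_root:
                return True
        except ValueError:
            # Raised on Windows when the paths are on different drives.
            continue
    return False


ROOTS = ["/srv/pdfs"]  # hypothetical directory the server may read

print(is_allowed_source("/srv/pdfs/2024/report.pdf", ROOTS))    # True
print(is_allowed_source("/srv/pdfs/../../etc/passwd", ROOTS))   # False
print(is_allowed_source("/srv/pdfs-private/a.pdf", ROOTS))      # False
print(is_allowed_source("report.pdf", ROOTS))                   # False
```

Comparing with commonpath rather than a string prefix avoids accepting sibling directories such as /srv/pdfs-private.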

Conformance

Statement	Spec	Clause	reference_id
Text is shown by text operators in stream order.	ISO 32000-2	§9.4
Extraction reflects the encoded reading order.	ISO 32000-2	§9.10

This recipe does not assert that extracted text preserves faithful logical reading order for an untagged document. The order is the encoded order.

Commercial context

extract_text is a Pro-tier tool, registered only when the Pro package resolves at server boot.

Transport availability

Transport	Available	Notes
MCP (stdio)	Yes (Pro)	Large text inflates the stdio frame.
REST	Yes (Pro)	Stream large results where supported.
gRPC	Yes (Pro)	Message-size limits apply to large text.
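When a result is too large for a frame or message limit, a client can split the document into consecutive page_range calls. This sketch assumes the "first-last" range syntax shown on this page; page_batches and the batch size are hypothetical, and the right batch size depends on your transport's limits and on how text-dense the document is.

```python
from typing import List


def page_batches(page_count: int, pages_per_call: int) -> List[str]:
    """Split pages 1..page_count into consecutive "first-last" ranges."""
    if page_count < 1 or pages_per_call < 1:
        raise ValueError("page_count and pages_per_call must be positive")
    return [
        f"{start}-{min(start + pages_per_call - 1, page_count)}"
        for start in range(1, page_count + 1, pages_per_call)
    ]


print(page_batches(7, 3))  # ['1-3', '4-6', '7-7']
```

Each returned string can be passed as page_range in a separate extract_text call, and the results concatenated in order.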

HITL risk tier

extract_text is Safe (read-only, no side effect) and never gates.

Confirmation gate JSON envelope

Read-only extraction never gates:

{ "allowed": true }