Extract text content over NextPDF Connect (Pro)
At a glance
Use extract_text to extract text from an existing PDF for indexing, analysis,
or downstream processing. The Pro tool provider registers
new ExtractTextTool() under the protocol name extract_text, and this page
re-verifies that binding. extract_text is a Pro-tier tool. At boot, the
server probes it with class_exists() and registers it only when the Pro
package is installed. You can request plain output, a page range, or
page-segmented structured output.
Install
Install the server package and the Pro package:

    composer require nextpdf/server
    composer require nextpdf/pro

Bind a transport. Before you rely on the tool, confirm it with
diagnostic.capabilities.
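
The pre-flight check can be sketched as follows. This is a minimal illustration, assuming diagnostic.capabilities returns a JSON object that lists registered tool names under a `tools` key. That response shape and the helper name `has_tool` are assumptions, not the documented contract, so adapt them to what your server actually returns.

```python
import json


def has_tool(capabilities_json: str, tool_name: str = "extract_text") -> bool:
    """Return True when tool_name appears in the (assumed) "tools" list."""
    payload = json.loads(capabilities_json)
    # "tools" is an assumed key; the real response may be shaped differently.
    tools = payload.get("tools", [])
    return tool_name in tools
```

With Core only, a check like this would return False for extract_text, which is the signal to install the Pro package before calling the tool.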
Conceptual overview
Extraction reads text-showing operators from the content stream in stream order
(ISO 32000-2 §9.4). The output reflects the encoded reading order
(ISO 32000-2 §9.10). A scanned PDF with no text layer returns little or no
text. That reflects the source file, not a tool defect. format: "plain"
returns one string. format: "structured" returns per-page objects with
character counts. page_range limits the pages processed.
API surface
| Tool | Tier | Role | Risk tier |
|---|---|---|---|
| extract_text | Pro | Extract text (plain / structured / range) | Safe |
| parse_pdf | Core (env-gated) | Low-level structure (page count, metadata) | Safe |
Tool names are registry protocol names. The tool catalog is the catalog of record. The available tools depend on the installed tier.
Code sample — Quick start
- extract_text with source (a server-readable path) and format: "plain".
- extract_text with page_range: "1-3" for a subset.
- extract_text with format: "structured" for page-segmented output.
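
The three quick-start calls can be sketched as MCP `tools/call` requests. The argument names (source, format, page_range) come from this page. The helper name `extract_text_request` and the path `/srv/docs/report.pdf` are hypothetical, and the JSON-RPC envelope applies to the MCP transport only.

```python
import json
from itertools import count

_request_ids = count(1)  # simple monotonically increasing JSON-RPC ids


def extract_text_request(source, fmt="plain", page_range=None):
    """Build an MCP tools/call request for extract_text (illustrative helper)."""
    arguments = {"source": source, "format": fmt}
    if page_range is not None:
        arguments["page_range"] = page_range
    return {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": "extract_text", "arguments": arguments},
    }


# Hypothetical server-readable path, used only for illustration.
plain = extract_text_request("/srv/docs/report.pdf")
subset = extract_text_request("/srv/docs/report.pdf", page_range="1-3")
structured = extract_text_request("/srv/docs/report.pdf", fmt="structured")

print(json.dumps(subset, indent=2))
```

Over REST or gRPC the same arguments would travel in that transport's own request shape.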
Code sample — Production
Use parse_pdf (or a prior unbounded extract_text) to get the page count
before you request a range. For retrieval-augmented generation (RAG) or
indexing, prefer format: "structured" so each page chunks independently. For
an encrypted source, supply the password parameter. Character counts are UTF-8
code-point counts, not bytes.
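
A client-side sketch of the two production steps above: check a range against a known page count before sending it, then turn structured output into per-page chunks. It assumes 1-based page numbers, that a single page may be written as "2", and that each structured page object carries `page` and `text` fields. Those field names and both helper names are assumptions for illustration.

```python
def validate_page_range(page_range: str, page_count: int) -> tuple[int, int]:
    """Parse "first-last" (or a single page) and check it against page_count."""
    first_text, dash, last_text = page_range.partition("-")
    first = int(first_text)
    last = int(last_text) if dash else first
    if not 1 <= first <= last <= page_count:
        raise ValueError(f"range {page_range!r} is outside 1-{page_count}")
    return first, last


def pages_to_chunks(pages: list[dict]) -> list[dict]:
    """Make one retrieval chunk per non-empty page (assumed page/text fields)."""
    return [
        {"id": f"page-{page['page']}", "text": page["text"]}
        for page in pages
        if page["text"].strip()
    ]


# Code points versus bytes: the two counts differ for non-ASCII text.
sample = "caf\u00e9"
code_points = len(sample)               # counts Unicode code points
utf8_bytes = len(sample.encode("utf-8"))  # counts encoded bytes
```

Validating locally avoids a round trip that the server would reject anyway, and per-page chunks keep page numbers available as citation anchors in an index.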
Edge cases & gotchas
- Source missing. A bad path returns a file-not-found error. Use absolute paths the server can read.
- Scanned PDF. Without a text layer, extraction returns empty or near-empty text. Run optical character recognition (OCR) on the source first.
- Out-of-range page. A range beyond the document is rejected with the actual page count.
- Encrypted source. Supply the password parameter.
- Pro absent. With Core only, extract_text is not registered. Probe with diagnostic.capabilities.
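
The scanned-PDF case can be detected after the fact with a simple heuristic. The sketch below flags pages whose extracted text is empty or near-empty so they can be routed to OCR. The `page` and `text` field names, the helper name, and the 20-character threshold are all assumptions, so tune them against your own documents.

```python
def pages_needing_ocr(pages: list[dict], min_chars: int = 20) -> list[int]:
    """Return page numbers whose stripped text is shorter than min_chars."""
    return [
        page["page"]
        for page in pages
        if len(page.get("text", "").strip()) < min_chars
    ]
```

A page with only a header or page number will also be flagged, so treat the result as a candidate list rather than a verdict.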
Performance
Extraction scales with document size, and the budget allows large inputs. The
profile is structural for any produced artefact because this tool returns
text, not a PDF.
Security notes
Extracted text may contain sensitive content. Treat the result as confidential, and return it only over a trusted channel. The tool has no filesystem write. It reads the source path with the server’s privileges, so constrain which paths a caller may pass.
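
One way to constrain caller-supplied paths is an allowlist check in front of the tool call. This is a sketch, not part of NextPDF: the root directory `/srv/pdf-inbox` and the helper name are hypothetical, and it requires Python 3.9 or later for `Path.is_relative_to`.

```python
from pathlib import Path

# Hypothetical directory that callers are allowed to read from.
ALLOWED_ROOT = Path("/srv/pdf-inbox")


def is_allowed_source(candidate: str, root: Path = ALLOWED_ROOT) -> bool:
    """Accept only .pdf paths that resolve to a location inside root."""
    # resolve() collapses ".." segments and follows symlinks before comparing.
    resolved = Path(candidate).resolve()
    return (
        resolved.suffix.lower() == ".pdf"
        and resolved.is_relative_to(root.resolve())
    )
```

Resolving before comparing is the important step: a plain string-prefix check would accept a path such as `/srv/pdf-inbox/../secrets/x.pdf`.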
Conformance
| Statement | Spec | Clause | reference_id |
|---|---|---|---|
| Text is shown by text operators in stream order. | ISO 32000-2 | §9.4 | |
| Extraction reflects the encoded reading order. | ISO 32000-2 | §9.10 | |
This recipe does not assert that extracted text preserves faithful logical reading order for an untagged document. The order is the encoded order.
Commercial context
extract_text is a Pro-tier tool, registered only when the Pro package
resolves at server boot.
Transport availability
| Transport | Available | Notes |
|---|---|---|
| MCP (stdio) | Yes (Pro) | Large text inflates the stdio frame. |
| REST | Yes (Pro) | Stream large results where supported. |
| gRPC | Yes (Pro) | Message-size limits apply to large text. |
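
When a result may exceed a transport's message size, a client or gateway can split the text on code-point boundaries under a byte budget. The sketch below is illustrative only. The 4 MiB default mirrors the receive limit that gRPC implementations commonly ship with; the limits of your own deployment are an assumption to verify.

```python
DEFAULT_LIMIT_BYTES = 4 * 1024 * 1024  # common gRPC default; verify for your setup


def split_utf8(text: str, max_bytes: int = DEFAULT_LIMIT_BYTES) -> list[str]:
    """Split text into pieces whose UTF-8 encoding is at most max_bytes each."""
    if max_bytes < 4:
        raise ValueError("max_bytes must be at least 4 to hold any code point")
    chunks, current, size = [], [], 0
    for ch in text:
        n = len(ch.encode("utf-8"))
        if size + n > max_bytes and current:
            # Adding this character would exceed the budget, so close the piece.
            chunks.append("".join(current))
            current, size = [], 0
        current.append(ch)
        size += n
    if current:
        chunks.append("".join(current))
    return chunks
```

Encoding one character at a time is slow for very large results. Requesting a page_range or format: "structured" and sending pages individually is usually the simpler way to stay under a limit.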
HITL risk tier
extract_text is Safe (read-only, no side effect) and never gates.
Confirmation gate JSON envelope
Read-only extraction never gates:
{ "allowed": true }