
Extract text content over NextPDF Connect (Pro)

Use extract_text to extract text from an existing PDF for indexing, analysis, or downstream processing. The Pro tool provider registers new ExtractTextTool() under the protocol name extract_text, and this page re-verifies that binding. extract_text is a Pro-tier tool. At boot, the server probes it with class_exists() and registers it only when the Pro package is installed. You can request plain output, a page range, or page-segmented structured output.

composer require nextpdf/server
composer require nextpdf/pro

Bind a transport. Before you rely on the tool, confirm it with diagnostic.capabilities.

Extraction reads text-showing operators from the content stream in stream order (ISO 32000-2 §9.4). The output reflects the encoded reading order (ISO 32000-2 §9.10). A scanned PDF with no text layer returns little or no text. That reflects the source file, not a tool defect. format: "plain" returns one string. format: "structured" returns per-page objects with character counts. page_range limits the pages processed.

| Tool | Tier | Role | Risk tier |
| --- | --- | --- | --- |
| extract_text | Pro | Extract text (plain / structured / range) | Safe |
| parse_pdf | Core (env-gated) | Low-level structure (page count, metadata) | Safe |

Tool names are registry protocol names. The tool catalog is the catalog of record. The available tools depend on the installed tier.

  1. extract_text with source (a server-readable path) and format: "plain".
  2. extract_text with page_range: "1-3" for a subset.
  3. extract_text with format: "structured" for page-segmented output.
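The three calls above can be sketched as argument payloads. This is a minimal client-side sketch in Python, not the server's API: the helper `build_extract_text_args`, the placeholder path, and the `{"tool": ..., "arguments": ...}` envelope are assumptions for illustration. Only the parameter names `source`, `format`, and `page_range` come from this page, and the actual envelope depends on the transport you bound.

```python
import json
from typing import Optional


def build_extract_text_args(source: str, fmt: str = "plain",
                            page_range: Optional[str] = None) -> dict:
    """Build the argument object for an extract_text call (hypothetical helper)."""
    if fmt not in ("plain", "structured"):
        raise ValueError(f"unsupported format: {fmt!r}")
    args = {"source": source, "format": fmt}
    if page_range is not None:
        args["page_range"] = page_range
    return args


# Placeholder path (assumption); use an absolute path the server can read.
SOURCE = "/srv/pdfs/report.pdf"

calls = [
    build_extract_text_args(SOURCE),                    # 1. plain, whole document
    build_extract_text_args(SOURCE, page_range="1-3"),  # 2. subset of pages
    build_extract_text_args(SOURCE, fmt="structured"),  # 3. page-segmented output
]

# The envelope below is illustrative only; each transport defines its own framing.
for args in calls:
    print(json.dumps({"tool": "extract_text", "arguments": args}))
```

The sketch always sends `format` explicitly because this page does not state a server-side default.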

Use parse_pdf (or a prior unbounded extract_text) to get the page count before you request a range. For retrieval-augmented generation (RAG) or indexing, prefer format: "structured" so that each page can be chunked independently. For an encrypted source, supply the password parameter. Character counts are counts of Unicode code points in the UTF-8 text, not byte counts.
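The difference between code-point counts and byte counts matters when you size buffers or compare a reported count against a payload length. A minimal illustration in Python, using a made-up sample string (not tool output):

```python
# Sample text written with escapes so the code points are unambiguous:
# "Café déjà vu €5" in precomposed (NFC) form.
text = "Caf\u00e9 d\u00e9j\u00e0 vu \u20ac5"

code_points = len(text)                  # Python str length counts code points
utf8_bytes = len(text.encode("utf-8"))   # size of the same text on the wire

print(code_points)  # 15
print(utf8_bytes)   # 20
```

The three accented letters take two bytes each and the euro sign takes three, so the byte length exceeds the code-point count by five.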

  • Source missing. A bad path returns a file-not-found error. Use absolute paths the server can read.
  • Scanned PDF. Without a text layer, extraction returns empty or near-empty text. Run optical character recognition (OCR) on the source first.
  • Out-of-range page. A range beyond the document is rejected with the actual page count.
  • Encrypted source. Supply the password parameter.
  • Pro absent. With Core only, extract_text is not registered. Probe with diagnostic.capabilities.
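Two of the failure modes above (a non-absolute path and an out-of-range page) can be caught on the client before the call is sent. The following is a rough pre-flight sketch in Python. The helper names are hypothetical, and the range grammar is an assumption: this page shows only the "first-last" form such as "1-3", and the sketch additionally treats a bare number as a single page. The page count is expected to come from parse_pdf or a prior unbounded extract_text.

```python
import os
from typing import Tuple


def parse_page_range(page_range: str) -> Tuple[int, int]:
    """Parse 'first-last' (e.g. '1-3') into 1-based inclusive bounds.

    Accepting a bare page number is an assumption of this sketch.
    """
    first_s, sep, last_s = page_range.partition("-")
    first = int(first_s)
    last = int(last_s) if sep else first
    if first < 1 or last < first:
        raise ValueError(f"invalid page range: {page_range!r}")
    return first, last


def preflight(source: str, page_range: str, page_count: int) -> None:
    """Raise ValueError for inputs this page says the tool would reject."""
    if not os.path.isabs(source):
        raise ValueError("source must be an absolute path the server can read")
    _, last = parse_page_range(page_range)
    if last > page_count:
        raise ValueError(
            f"page range {page_range!r} exceeds document length ({page_count} pages)"
        )


# Example: a 10-page document (page count obtained beforehand).
preflight("/srv/pdfs/report.pdf", "1-3", page_count=10)  # passes silently
```

A client-side check like this only saves a round trip. The server remains the authority on what it accepts.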

Extraction scales with document size, and the budget allows large inputs. The profile is structural for any produced artefact because this tool returns text, not a PDF.

Extracted text may contain sensitive content. Treat the result as confidential, and return it only over a trusted channel. The tool has no filesystem write. It reads the source path with the server’s privileges, so constrain which paths a caller may pass.
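One way to constrain caller-supplied paths is to resolve each path and accept it only when it falls under an approved directory. The sketch below is one possible approach in Python, placed in whatever layer sits in front of the tool. The function name and the example root are hypothetical, and nothing here describes a built-in NextPDF feature.

```python
import os


def is_allowed_source(source: str, allowed_root: str) -> bool:
    """Return True only if source resolves to a location inside allowed_root.

    realpath() collapses '..' segments and follows symlinks, so traversal
    attempts such as '<root>/../etc/passwd' resolve outside the root.
    """
    root = os.path.realpath(allowed_root)
    target = os.path.realpath(source)
    return os.path.commonpath([root, target]) == root


# Hypothetical root directory for PDFs that callers may read.
ALLOWED_ROOT = "/srv/pdfs"

for candidate in ("/srv/pdfs/report.pdf", "/srv/pdfs/../secrets/key.pdf"):
    print(candidate, is_allowed_source(candidate, ALLOWED_ROOT))
```

Comparing resolved paths rather than raw strings avoids the prefix pitfall where "/srv/pdfs-private" would pass a naive `startswith("/srv/pdfs")` check.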

| Statement | Spec | Clause | reference_id |
| --- | --- | --- | --- |
| Text is shown by text operators in stream order. | ISO 32000-2 | §9.4 | |
| Extraction reflects the encoded reading order. | ISO 32000-2 | §9.10 | |

This recipe does not assert that extracted text preserves faithful logical reading order for an untagged document. The order is the encoded order.

extract_text is a Pro-tier tool, registered only when the Pro package resolves at server boot.

| Transport | Available | Notes |
| --- | --- | --- |
| MCP (stdio) | Yes (Pro) | Large text inflates the stdio frame. |
| REST | Yes (Pro) | Stream large results where supported. |
| gRPC | Yes (Pro) | Message-size limits apply to large text. |

extract_text is Safe (read-only, no side effects), so it never gates:

{ "allowed": true }