Python API reference

The NextPDF Python Software Development Kit (SDK) exposes two clients, one shared Abstract Syntax Tree (AST) method namespace named ast, Pydantic models for every response, a nextpdf command-line interface (CLI), and a six-class exception hierarchy. Use this page as the reference for public application programming interface (API) symbols that work with Portable Document Format (PDF) documents.

Import public symbols from the top-level package:

    from nextpdf import (
        AsyncNextPDF,
        NextPDF,
        AstBuildTimeoutError,
        AstNoStructTreeError,
        NextPDFAPIError,
        NextPDFError,
        NextPDFLicenseError,
        QuotaExceededError,
    )

Every extraction method takes raw PDF bytes (bytes) as its first positional argument and returns a typed Pydantic model. Pass options as keyword-only arguments. The synchronous NextPDF.ast.* methods and asynchronous AsyncNextPDF.ast.* methods have identical signatures. Asynchronous methods are coroutines; call them with await.

The synchronous NextPDF client wraps the asynchronous client and runs each coroutine to completion. AsyncNextPDF is both an asynchronous client and an async context manager. Prefer the context-manager form so the underlying transport closes deterministically.

| Symbol | Parameters | Default behavior | Returns | Throws or fails with | Notes |
| --- | --- | --- | --- | --- | --- |
| `NextPDF(*, base_url, api_key, api_version='v1')` | Keyword-only base URL, API key, and optional API version. | Creates a remote-backed synchronous client. | `NextPDF` | `ValueError` when `base_url` or `api_key` is empty. | Runs async work synchronously; safe inside notebooks and a running event loop. |
| `AsyncNextPDF(*, base_url='', api_key='', api_version='v1', backend=None)` | Keyword-only base URL, API key, optional API version, and optional injected backend. | Creates a remote-backed asynchronous client when no backend is injected. | `AsyncNextPDF` | `ValueError` when `base_url` or `api_key` is empty and no backend is supplied. | Pass `backend=` to inject a custom or local backend in tests. |
| `AsyncNextPDF.__aenter__()` | None. | Enters the async context and returns the client. | `AsyncNextPDF` | None expected. | Use `async with AsyncNextPDF(...) as client:`. |
| `AsyncNextPDF.__aexit__(*_)` | Suppressed exception arguments. | Calls `close()` on context exit. | `None` | None expected. | Releases the transport even when the body raises. |
| `AsyncNextPDF.close()` | None. | Closes the owned remote backend and releases the connection pool. | `None` | None expected. | Idempotent; injected backends are not closed. |

Do not keep the API key in source code. Read base_url and api_key from the environment (NEXTPDF_BASE_URL, NEXTPDF_API_KEY) or a secret manager.

    import os

    from nextpdf import AsyncNextPDF


    async def extract(pdf_bytes: bytes) -> int:
        """Return the page count of a PDF using the async client as a context manager."""
        base_url = os.environ["NEXTPDF_BASE_URL"]
        api_key = os.environ["NEXTPDF_API_KEY"]
        async with AsyncNextPDF(base_url=base_url, api_key=api_key) as client:
            document = await client.ast.get_document_ast(pdf_bytes)
        return document.page_count

| Symbol | Parameters | Default behavior | Returns | Throws or fails with | Notes |
| --- | --- | --- | --- | --- | --- |
| `NextPDF.ast.get_document_ast()` | `pdf_data: bytes`; keyword `page_range_start`, `page_range_end`, `token_budget`. | Builds the full Semantic AST for every page. | `AstDocument` | `AstNoStructTreeError`, `AstBuildTimeoutError`, `NextPDFLicenseError`, `QuotaExceededError`. | Reduce the page range when a build times out. |
| `NextPDF.ast.extract_cited_text()` | `pdf_data: bytes`; keyword `page_index`, `headings_only`. | Extracts all text blocks with citation anchors. | `list[CitedTextBlock]` | `NextPDFAPIError`, `QuotaExceededError`. | Set `headings_only=True` to retrieve only heading nodes. |
| `NextPDF.ast.extract_cited_tables()` | `pdf_data: bytes`; keyword `page_range` (dict with `start` and `end`). | Extracts all tables with cell-level citation anchors. | `ExtractCitedTablesResponse` | `NextPDFAPIError`, `QuotaExceededError`. | Omit `page_range` to scan the whole document. |
| `NextPDF.ast.get_ast_node()` | `pdf_data: bytes`, `node_id: str`. | Retrieves one node by its identifier. | `GetAstNodeResponse` | `NextPDFError` when the node is not found. | `node_id` format is `ast:{hash6}:{pageIdx}:{seq}`. |
| `NextPDF.ast.search_ast_nodes()` | `pdf_data: bytes`; keyword `node_type`, `page_index`, `text_query`, `max_results=100`. | Returns shallow nodes that match the filters. | `SearchAstNodesResponse` | `NextPDFAPIError`. | `text_query` is a case-insensitive substring match. |
| `NextPDF.ast.get_ast_diff()` | `original_pdf_data: bytes`, `modified_pdf_data: bytes`. | Compares two documents by structure. | `GetAstDiffResponse` | `NextPDFAPIError`, `QuotaExceededError`. | Reports added, removed, and changed nodes. |

Asynchronous AST methods — AsyncNextPDF.ast.*

Each asynchronous method is a coroutine with the same parameters, defaults, return type, and failure modes as its synchronous counterpart. Call it with await inside an asyncio runtime.

| Symbol | Parameters | Default behavior | Returns | Throws or fails with | Notes |
| --- | --- | --- | --- | --- | --- |
| `AsyncNextPDF.ast.get_document_ast()` | `pdf_data: bytes`; keyword `page_range_start`, `page_range_end`, `token_budget`. | Builds the full Semantic AST for every page. | `AstDocument` | `AstNoStructTreeError`, `AstBuildTimeoutError`, `NextPDFLicenseError`, `QuotaExceededError`. | `await` the result. |
| `AsyncNextPDF.ast.extract_cited_text()` | `pdf_data: bytes`; keyword `page_index`, `headings_only`. | Extracts all text blocks with citation anchors. | `list[CitedTextBlock]` | `NextPDFAPIError`, `QuotaExceededError`. | `await` the result. |
| `AsyncNextPDF.ast.extract_cited_tables()` | `pdf_data: bytes`; keyword `page_range`. | Extracts all tables with cell-level citation anchors. | `ExtractCitedTablesResponse` | `NextPDFAPIError`, `QuotaExceededError`. | `await` the result. |
| `AsyncNextPDF.ast.get_ast_node()` | `pdf_data: bytes`, `node_id: str`. | Retrieves one node by its identifier. | `GetAstNodeResponse` | `NextPDFError` when the node is not found. | `await` the result. |
| `AsyncNextPDF.ast.search_ast_nodes()` | `pdf_data: bytes`; keyword `node_type`, `page_index`, `text_query`, `max_results=100`. | Returns shallow nodes that match the filters. | `SearchAstNodesResponse` | `NextPDFAPIError`. | `await` the result. |
| `AsyncNextPDF.ast.get_ast_diff()` | `original_pdf_data: bytes`, `modified_pdf_data: bytes`. | Compares two documents by structure. | `GetAstDiffResponse` | `NextPDFAPIError`, `QuotaExceededError`. | `await` the result. |
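
The advice to reduce the page range after a build timeout can be automated by halving the range and retrying each half. The sketch below is one way to do that, not something the SDK provides. `build_with_split` and its `build` callable are hypothetical, and the helper treats both bounds as inclusive, which is an assumption about page-range semantics that this page does not state.

```python
from typing import Awaitable, Callable, List, Tuple, Type


async def build_with_split(
    build: Callable[[int, int], Awaitable[object]],
    start: int,
    end: int,
    timeout_error: Type[Exception],
) -> List[Tuple[int, int, object]]:
    """Build pages start..end (inclusive), halving the range after each timeout."""
    try:
        return [(start, end, await build(start, end))]
    except timeout_error:
        if start >= end:
            # A single page still times out, so there is nothing left to split.
            raise
        middle = (start + end) // 2
        left = await build_with_split(build, start, middle, timeout_error)
        right = await build_with_split(build, middle + 1, end, timeout_error)
        return left + right
```

With a real client, `build` could wrap `client.ast.get_document_ast(pdf_bytes, page_range_start=..., page_range_end=...)` and `timeout_error` could be `AstBuildTimeoutError`. Check how the service interprets `page_range_end` before relying on the inclusive bounds used here.
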

Use the async client as a context manager to batch two extractions concurrently:

    import asyncio

    from nextpdf import AsyncNextPDF


    async def extract_pair(first: bytes, second: bytes) -> None:
        """Extract two PDFs concurrently with one shared async client."""
        # Placeholder settings; load real values from the environment or a secret manager.
        async with AsyncNextPDF(
            base_url="https://connect.example.com",
            api_key="set-from-secret",
        ) as client:
            text_blocks, tables = await asyncio.gather(
                client.ast.extract_cited_text(first),
                client.ast.extract_cited_tables(second),
            )
        print(f"text blocks: {len(text_blocks)}; tables: {tables.table_count}")
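
For batches larger than a pair, it may help to cap how many requests are in flight at once. The sketch below uses `asyncio.Semaphore` for that. `gather_limited` and its `limit` default are hypothetical choices, not SDK features, and `extract` stands for any coroutine function that takes PDF bytes.

```python
import asyncio
from typing import Awaitable, Callable, List, Sequence, TypeVar

T = TypeVar("T")


async def gather_limited(
    extract: Callable[[bytes], Awaitable[T]],
    documents: Sequence[bytes],
    limit: int = 4,
) -> List[T]:
    """Run extract over every document with at most `limit` calls in flight."""
    semaphore = asyncio.Semaphore(limit)

    async def run_one(pdf_bytes: bytes) -> T:
        # Each task waits here until one of the `limit` slots is free.
        async with semaphore:
            return await extract(pdf_bytes)

    # gather() returns results in the same order as the input documents.
    return await asyncio.gather(*(run_one(doc) for doc in documents))
```

Inside an `async with AsyncNextPDF(...) as client:` block, a call such as `await gather_limited(client.ast.extract_cited_text, pdfs)` would share one client across the whole batch.
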

Every response is a Pydantic model. Import model classes from nextpdf or nextpdf.models.ast.

| Symbol | Kind | Key fields | Notes |
| --- | --- | --- | --- |
| `AstDocument` | Document root | `schema_version`, `source_hash`, `page_count`, `root: AstNode`, `estimated_tokens` (property). | Returned by `get_document_ast()`. Accepts aliases `schemaVersion`, `sourceHash`, and `pageCount`. |
| `AstNode` | Tree node | `id`, `type: NodeType`, `page_index`, `bbox`, `text_content`, `attributes`, `children: list[AstNode]`, `pdf_object_number`, `mcid`. | Recursive node that carries the document tree. |
| `AstNodeMeta` | Response metadata | `etag`, `pages_processed`. | Frozen; attached to node and search responses. |
| `AstNodeShallow` | Search hit | `id`, `type: NodeType`, `page_index`, `bbox`, `text_content`, `attributes`, `children_count`. | Frozen; no deep children. |
| `BoundingBox` | Value object | `x`, `y`, `width`, `height` (each 0.0 to 1.0). | Normalized coordinates within a page. |
| `CitationAnchor` | Value object | `node_id`, `page_index`, `bbox: BoundingBox`, `confidence`, `content_hash`. | Provenance record for each block. |
| `CitedTextBlock` | Text block | `text`, `citation: CitationAnchor`, `node_type`, `chunk_index`, `depth`. | Each item in the `extract_cited_text()` list. |
| `CitedTableBlock` | Table block | `table_node_id`, `page_index`, `citation_anchor`, `row_count`, `col_count`, `rows`. | Frozen; one table. |
| `CitedTableCell` | Table cell | `row`, `col`, `row_span`, `col_span`, `text`, `bbox`, `confidence`. | Frozen; one cell. |
| `NodeType` | Enum | `document`, `section`, `heading`, `paragraph`, `list`, `table`, `figure`, and others. | String enum for node-type values. |
| `GetAstNodeResponse` | Response | `node: AstNode`, `meta: AstNodeMeta`. | Returned by `get_ast_node()`. |
| `SearchAstNodesResponse` | Response | `nodes: list[AstNodeShallow]`, `total_matches`, `truncated`, `meta`. | Returned by `search_ast_nodes()`. |
| `ExtractCitedTablesResponse` | Response | `tables: list[CitedTableBlock]`, `table_count`, `pages_processed`. | Returned by `extract_cited_tables()`. |
| `AstDiffEntry` | Diff item | `type` (`added`/`removed`/`changed`), `node_id`, `node_type`, `page_index`, `text_preview`. | One change in a diff. |
| `AstDiffSummary` | Diff totals | `added_node_count`, `removed_node_count`, `changed_node_count`. | Aggregate counts. |
| `GetAstDiffResponse` | Response | `original_page_count`, `modified_page_count`, `summary: AstDiffSummary`, `diff: list[AstDiffEntry]`, `pages_processed`. | Returned by `get_ast_diff()`. |
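
Because `AstNode` is recursive through `children`, a short generator is enough to visit the whole tree. The sketch below relies only on the `type`, `text_content`, and `children` fields listed in the table. `iter_nodes`, `heading_texts`, `_node`, and `sample_root` are hypothetical names, the stand-in nodes replace real `AstNode` instances, and the comparison assumes the heading member of `NodeType` equals the string `"heading"`.

```python
from types import SimpleNamespace
from typing import Any, Iterator, List


def iter_nodes(node: Any) -> Iterator[Any]:
    """Yield a node and then each of its descendants, depth first."""
    yield node
    for child in node.children:
        yield from iter_nodes(child)


def heading_texts(root: Any) -> List[str]:
    """Collect the non-empty text of every heading node under root."""
    return [
        node.text_content
        for node in iter_nodes(root)
        # Assumes the heading member of NodeType compares equal to "heading".
        if node.type == "heading" and node.text_content
    ]


def _node(node_type: str, text: str = "", children: tuple = ()) -> SimpleNamespace:
    """Build a stand-in node; real code would receive AstNode instances."""
    return SimpleNamespace(type=node_type, text_content=text, children=list(children))


# A tiny hand-built tree used only to exercise the helpers above.
sample_root = _node("document", children=(
    _node("heading", "Intro"),
    _node("section", children=(_node("heading", "Details"), _node("paragraph", "Body"))),
))
```

With a real response, `heading_texts(document.root)` would walk the tree returned by `get_document_ast()`.
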

Read the citation anchor from an extracted text block:

    from nextpdf import CitedTextBlock


    def describe(block: CitedTextBlock) -> str:
        """Render a text block with its page index and confidence."""
        anchor = block.citation
        return f"[page {anchor.page_index}, confidence {anchor.confidence:.2f}] {block.text[:80]}"
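
Since `BoundingBox` values are normalized to the page, drawing or cropping needs them scaled by the page size. The following is a minimal sketch that only multiplies by the page dimensions and makes no claim about where the origin sits. `Box` is a stand-in for `BoundingBox`, and `to_page_units` and the 612 by 792 page size are hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Box:
    """Stand-in with the same four fields as BoundingBox (each 0.0 to 1.0)."""

    x: float
    y: float
    width: float
    height: float


def to_page_units(box: Box, page_width: float, page_height: float) -> Tuple[float, float, float, float]:
    """Scale a normalized box to (x, y, width, height) in page units."""
    return (
        box.x * page_width,
        box.y * page_height,
        box.width * page_width,
        box.height * page_height,
    )


# Example: a box covering the middle half of the page width on a 612 x 792 page.
scaled = to_page_units(Box(x=0.25, y=0.5, width=0.5, height=0.25), 612, 792)
```

The same function accepts `block.citation.bbox` from an extracted text block, because it reads only the four documented fields.
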

The nextpdf command runs extraction from the terminal. Pass --base-url and --api-key, or set NEXTPDF_BASE_URL and NEXTPDF_API_KEY in the environment. Every command except version requires connection settings. A PDF_PATH value of - reads PDF bytes from standard input.

| Symbol | Parameters | Default behavior | Returns | Throws or fails with | Notes |
| --- | --- | --- | --- | --- | --- |
| `nextpdf extract text` | `PDF_PATH`; `--format {json,markdown,plain}`, `--page`, `--headings-only`. | Emits cited text blocks as JavaScript Object Notation (JSON). | Writes to standard output or an `--output` file. | Exit code 1 for any `NextPDFError`. | `--page` selects one 0-based page index. |
| `nextpdf extract tables` | `PDF_PATH`; `--format {json,csv}`, `--page-start`, `--page-end`. | Emits tables as JSON. | Writes to standard output or an `--output` file. | Exit code 1 for any `NextPDFError`. | `--format csv` writes one comma-separated values (CSV) block per table. |
| `nextpdf ast` | `PDF_PATH`; `--page-start`, `--page-end`, `--token-budget`. | Emits the full Semantic AST as JSON. | Writes to standard output or an `--output` file. | Exit code 1 for any `NextPDFError`. | Use `--token-budget` to bound the response size. |
| `nextpdf info` | `PDF_PATH`. | Emits document metadata: page count, schema version, source hash, estimated tokens, and root summary. | Writes JSON to standard output or an `--output` file. | Exit code 1 for any `NextPDFError`. | Lightweight inspection command. |
| `nextpdf version` | None. | Prints the installed SDK version. | Writes to standard output. | None expected. | Does not contact a server; needs no credentials. |
| `python -m nextpdf.mcp` | None (reads `NEXTPDF_BASE_URL`, `NEXTPDF_API_KEY`). | Runs the Model Context Protocol server over standard input/output. | Long-running server process. | `RuntimeError` when the environment variables are unset. | Requires the `nextpdf[mcp]` extra. |

Extract tables as comma-separated values (CSV) from a page range:

    nextpdf extract tables invoice.pdf --format csv --page-start 0 --page-end 2 --output tables.csv
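
When the CLI runs inside a Python pipeline, the documented exit code can be turned into an exception. The sketch below wraps `subprocess.run` and works for any command. `run_cli` is a hypothetical helper, not part of the SDK.

```python
import subprocess
from typing import Sequence


def run_cli(argv: Sequence[str]) -> str:
    """Run a command and return its standard output, raising on a non-zero exit."""
    completed = subprocess.run(list(argv), capture_output=True, text=True)
    if completed.returncode != 0:
        # The CLI is documented to print its error message to standard error.
        raise RuntimeError(
            f"command failed with exit code {completed.returncode}: "
            f"{completed.stderr.strip()}"
        )
    return completed.stdout
```

For example, `run_cli(["nextpdf", "extract", "tables", "invoice.pdf", "--format", "csv"])` would return the CSV text, or raise when the command exits with status 1.
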

The exception hierarchy lives in nextpdf.models.errors and is re-exported from nextpdf. Catch the most specific class your code can handle, then fall back to the base class. The server reports failure with Hypertext Transfer Protocol (HTTP) status semantics aligned with Request for Comments (RFC) 9110. Each exception carries the originating status_code and, when available, an error_code.

| Symbol | Base class | status_code | When it is raised | Notes |
| --- | --- | --- | --- | --- |
| `NextPDFError` | `Exception` | optional | Base class for every SDK error. | Carries an optional `status_code`. Catch it last as a fallback. |
| `NextPDFAPIError` | `NextPDFError` | required | The Connect endpoint returned an HTTP error. | Adds `error_code`. |
| `NextPDFLicenseError` | `NextPDFAPIError` | 402 | The server requires a higher-tier license for the feature. | `error_code` is `license/tier-required`. |
| `QuotaExceededError` | `NextPDFAPIError` | 429 | A rate limit or quota was exceeded. | Carries `retry_after`; honor it before retrying. |
| `AstNoStructTreeError` | `NextPDFAPIError` | 422 | The PDF is untagged, and heuristic fallback is not enabled. | Enable heuristic mode or supply a tagged PDF. |
| `AstBuildTimeoutError` | `NextPDFAPIError` | 504 | The AST build timed out on the server. | Reduce the page range and retry. |

    from nextpdf import (
        NextPDF,
        AstBuildTimeoutError,
        NextPDFAPIError,
        NextPDFError,
        QuotaExceededError,
    )


    def extract_text(client: NextPDF, pdf_bytes: bytes) -> int:
        """Extract cited text, handling the most specific failures first."""
        try:
            blocks = client.ast.extract_cited_text(pdf_bytes)
        except QuotaExceededError as error:
            raise RuntimeError(f"Quota exceeded (retry after {error.retry_after}s)") from error
        except AstBuildTimeoutError as error:
            raise RuntimeError("AST build timed out; reduce the page range") from error
        except NextPDFAPIError as error:
            raise RuntimeError(f"API error {error.status_code}: {error}") from error
        except NextPDFError as error:
            raise RuntimeError(f"SDK error: {error}") from error
        return len(blocks)

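
Instead of re-raising on a quota error, a caller can wait for `retry_after` and try again. The following is a minimal sketch of that loop, assuming `retry_after` is a number of seconds and may be missing. `call_with_quota_retry`, `max_attempts`, and `default_wait` are hypothetical names, and the exception type is passed in so the helper does not depend on the SDK.

```python
import time
from typing import Callable, Type, TypeVar

T = TypeVar("T")


def call_with_quota_retry(
    call: Callable[[], T],
    quota_error: Type[Exception],
    max_attempts: int = 3,
    default_wait: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `call`, waiting for the error's retry_after between quota failures."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except quota_error as error:
            if attempt == max_attempts:
                # Out of attempts: let the last quota error propagate.
                raise
            wait = getattr(error, "retry_after", None)
            # Fall back to default_wait when the error carries no hint.
            sleep(default_wait if wait is None else float(wait))
    raise RuntimeError("unreachable")
```

With the SDK, this could be called as `call_with_quota_retry(lambda: client.ast.extract_cited_text(pdf_bytes), QuotaExceededError)`. The injectable `sleep` keeps the helper easy to test without real delays.
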
  • The synchronous NextPDF client delegates every call to AsyncNextPDF. You can call it from a notebook or a thread that already runs an event loop, because it dispatches the coroutine to a worker thread when it detects a running loop.
  • Prefer the async context-manager form async with AsyncNextPDF(...) as client: so the connection pool closes deterministically. When you construct AsyncNextPDF directly, call close() yourself.
  • The bearer token is never logged or included in error messages, and Transport Layer Security (TLS) verification is enabled by default. Do not embed credentials in source; read them from the environment or a secret manager.
  • All models are Pydantic v2 classes; several response models are frozen (immutable). Treat extracted blocks as read-only values.
  • The CLI exits with status code 1 for any NextPDFError and prints the message to standard error. Wire that exit code into pipelines.
  • Python SDK developer guide — architecture, async batching, and failure handling.
  • Python CLI — terminal extraction and streaming for large files.
  • Python MCP server — expose extraction tools to artificial intelligence (AI) agents.
  • RFC 9110 (HTTP Semantics) and RFC 9457 (Problem Details for HTTP APIs) describe the status semantics and machine-readable error bodies the Connect endpoint returns. See the IETF RFC index for the authoritative text.