Python API reference
At a glance
The NextPDF Python Software Development Kit (SDK) exposes two clients, one shared Abstract Syntax Tree (AST) method namespace named ast, Pydantic models for every response, a nextpdf command-line interface (CLI), and a six-class exception hierarchy. Use this page as the reference for public application programming interface (API) symbols that work with Portable Document Format (PDF) documents.
Import public symbols from the top-level package:
```python
from nextpdf import (
    AsyncNextPDF,
    NextPDF,
    AstBuildTimeoutError,
    AstNoStructTreeError,
    NextPDFAPIError,
    NextPDFError,
    NextPDFLicenseError,
    QuotaExceededError,
)
```

Every extraction method takes raw PDF bytes (bytes) as its first positional argument and returns a typed Pydantic model. Pass options as keyword-only arguments. The synchronous NextPDF.ast.* methods and asynchronous AsyncNextPDF.ast.* methods have identical signatures. Asynchronous methods are coroutines; call them with await.
Client
The synchronous NextPDF client wraps the asynchronous client and runs each coroutine to completion. AsyncNextPDF is both an asynchronous client and an async context manager. Prefer the context-manager form so the underlying transport closes deterministically.
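A generic way to run a coroutine to completion from synchronous code can be sketched with the standard library alone. This is not the SDK's implementation, which is not shown on this page; `run_sync` and `answer` are hypothetical names used only for illustration. The sketch calls `asyncio.run()` when no event loop is running and otherwise hands the coroutine to a worker thread.

```python
import asyncio
from collections.abc import Coroutine
from concurrent.futures import ThreadPoolExecutor
from typing import Any, TypeVar

T = TypeVar("T")


def run_sync(coro: Coroutine[Any, Any, T]) -> T:
    """Run a coroutine to completion from synchronous code (illustrative only)."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # No event loop is running in this thread, so asyncio.run() is safe.
        return asyncio.run(coro)

    # A loop is already running here (for example in a notebook). Run the
    # coroutine on a fresh loop in a worker thread and block until it finishes.
    with ThreadPoolExecutor(max_workers=1) as pool:
        return pool.submit(asyncio.run, coro).result()


async def answer() -> int:
    """Hypothetical coroutine standing in for an SDK call."""
    await asyncio.sleep(0)
    return 42
```

Calling `run_sync(answer())` returns the coroutine's result in either situation. Note that the calling thread blocks while it waits, so inside a running loop this pattern pauses that loop until the worker finishes.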
Constructors
| Symbol | Parameters | Default behavior | Returns | Throws or fails with | Notes |
|---|---|---|---|---|---|
| NextPDF(*, base_url, api_key, api_version='v1') | Keyword-only base URL, API key, and optional API version. | Creates a remote-backed synchronous client. | NextPDF | ValueError when base_url or api_key is empty. | Runs async work synchronously; safe inside notebooks and a running event loop. |
| AsyncNextPDF(*, base_url='', api_key='', api_version='v1', backend=None) | Keyword-only base URL, API key, optional API version, and optional injected backend. | Creates a remote-backed asynchronous client when no backend is injected. | AsyncNextPDF | ValueError when base_url or api_key is empty and no backend is supplied. | Pass backend= to inject a custom or local backend in tests. |
| AsyncNextPDF.__aenter__() | None. | Enters the async context and returns the client. | AsyncNextPDF | None expected. | Use async with AsyncNextPDF(...) as client:. |
| AsyncNextPDF.__aexit__(*_) | Suppressed exception arguments. | Calls close() on context exit. | None | None expected. | Releases the transport even when the body raises. |
| AsyncNextPDF.close() | None. | Closes the owned remote backend and releases the connection pool. | None | None expected. | Idempotent; injected backends are not closed. |
Do not keep the API key in source code. Read base_url and api_key from the environment (NEXTPDF_BASE_URL, NEXTPDF_API_KEY) or a secret manager.
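One way to fail early with a readable message when either variable is missing is sketched below. The helper name `load_connection_settings` is hypothetical and not part of the SDK; only the two environment variable names come from this page.

```python
import os
from collections.abc import Mapping


def load_connection_settings(env: Mapping[str, str] = os.environ) -> dict[str, str]:
    """Return base_url and api_key read from an environment-like mapping."""
    # Keyword name expected by the client constructors -> environment variable.
    required = {"base_url": "NEXTPDF_BASE_URL", "api_key": "NEXTPDF_API_KEY"}
    settings = {name: env.get(variable, "").strip() for name, variable in required.items()}

    # Report every missing or blank variable at once instead of one at a time.
    missing = [required[name] for name, value in settings.items() if not value]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return settings
```

Because the returned keys match the constructor keywords documented above, the result could be unpacked directly, for example `AsyncNextPDF(**load_connection_settings())`.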
The following example reads both settings from the environment and uses the async client as a context manager:

```python
import os

from nextpdf import AsyncNextPDF


async def extract(pdf_bytes: bytes) -> int:
    """Return the page count of a PDF using the async client as a context manager."""
    base_url = os.environ["NEXTPDF_BASE_URL"]
    api_key = os.environ["NEXTPDF_API_KEY"]

    async with AsyncNextPDF(base_url=base_url, api_key=api_key) as client:
        document = await client.ast.get_document_ast(pdf_bytes)
        return document.page_count
```

Synchronous AST methods: NextPDF.ast.*
| Symbol | Parameters | Default behavior | Returns | Throws or fails with | Notes |
|---|---|---|---|---|---|
| NextPDF.ast.get_document_ast() | pdf_data: bytes; keyword page_range_start, page_range_end, token_budget. | Builds the full Semantic AST for every page. | AstDocument | AstNoStructTreeError, AstBuildTimeoutError, NextPDFLicenseError, QuotaExceededError. | Reduce the page range when a build times out. |
| NextPDF.ast.extract_cited_text() | pdf_data: bytes; keyword page_index, headings_only. | Extracts all text blocks with citation anchors. | list[CitedTextBlock] | NextPDFAPIError, QuotaExceededError. | Set headings_only=True to retrieve only heading nodes. |
| NextPDF.ast.extract_cited_tables() | pdf_data: bytes; keyword page_range (dict with start and end). | Extracts all tables with cell-level citation anchors. | ExtractCitedTablesResponse | NextPDFAPIError, QuotaExceededError. | Omit page_range to scan the whole document. |
| NextPDF.ast.get_ast_node() | pdf_data: bytes, node_id: str. | Retrieves one node by its identifier. | GetAstNodeResponse | NextPDFError when the node is not found. | node_id format is ast:{hash6}:{pageIdx}:{seq}. |
| NextPDF.ast.search_ast_nodes() | pdf_data: bytes; keyword node_type, page_index, text_query, max_results=100. | Returns shallow nodes that match the filters. | SearchAstNodesResponse | NextPDFAPIError. | text_query is a case-insensitive substring match. |
| NextPDF.ast.get_ast_diff() | original_pdf_data: bytes, modified_pdf_data: bytes. | Compares two documents by structure. | GetAstDiffResponse | NextPDFAPIError, QuotaExceededError. | Reports added, removed, and changed nodes. |
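The node identifier format `ast:{hash6}:{pageIdx}:{seq}` noted for `get_ast_node()` can be split into its parts before a lookup. The sketch below is an assumption-laden illustration: `NodeId` and `parse_node_id` are hypothetical names, the hash segment is treated as an opaque string, and the page index and sequence are assumed to be non-negative decimal integers, which this page does not state.

```python
from typing import NamedTuple


class NodeId(NamedTuple):
    """Parts of an identifier shaped like ast:{hash6}:{pageIdx}:{seq}."""

    hash_prefix: str
    page_index: int
    sequence: int


def parse_node_id(node_id: str) -> NodeId:
    """Split a node identifier into its parts, raising ValueError when malformed."""
    parts = node_id.split(":")
    if len(parts) != 4 or parts[0] != "ast":
        raise ValueError(f"Not an AST node identifier: {node_id!r}")

    _, hash_prefix, page_index, sequence = parts
    # Assumption: both numeric segments are plain decimal integers.
    if not hash_prefix or not page_index.isdecimal() or not sequence.isdecimal():
        raise ValueError(f"Malformed AST node identifier: {node_id!r}")
    return NodeId(hash_prefix, int(page_index), int(sequence))
```

Validating an identifier locally like this can avoid a round trip for input that could never match a node.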
Asynchronous AST methods: AsyncNextPDF.ast.*
Each asynchronous method is a coroutine with the same parameters, defaults, return type, and failure modes as its synchronous counterpart. Call it with await inside an asyncio runtime.
| Symbol | Parameters | Default behavior | Returns | Throws or fails with | Notes |
|---|---|---|---|---|---|
| AsyncNextPDF.ast.get_document_ast() | pdf_data: bytes; keyword page_range_start, page_range_end, token_budget. | Builds the full Semantic AST for every page. | AstDocument | AstNoStructTreeError, AstBuildTimeoutError, NextPDFLicenseError, QuotaExceededError. | await the result. |
| AsyncNextPDF.ast.extract_cited_text() | pdf_data: bytes; keyword page_index, headings_only. | Extracts all text blocks with citation anchors. | list[CitedTextBlock] | NextPDFAPIError, QuotaExceededError. | await the result. |
| AsyncNextPDF.ast.extract_cited_tables() | pdf_data: bytes; keyword page_range. | Extracts all tables with cell-level citation anchors. | ExtractCitedTablesResponse | NextPDFAPIError, QuotaExceededError. | await the result. |
| AsyncNextPDF.ast.get_ast_node() | pdf_data: bytes, node_id: str. | Retrieves one node by its identifier. | GetAstNodeResponse | NextPDFError when the node is not found. | await the result. |
| AsyncNextPDF.ast.search_ast_nodes() | pdf_data: bytes; keyword node_type, page_index, text_query, max_results=100. | Returns shallow nodes that match the filters. | SearchAstNodesResponse | NextPDFAPIError. | await the result. |
| AsyncNextPDF.ast.get_ast_diff() | original_pdf_data: bytes, modified_pdf_data: bytes. | Compares two documents by structure. | GetAstDiffResponse | NextPDFAPIError, QuotaExceededError. | await the result. |
Use the async client as a context manager to batch two extractions concurrently:
```python
import asyncio

from nextpdf import AsyncNextPDF


async def extract_pair(first: bytes, second: bytes) -> None:
    """Extract two PDFs concurrently with one shared async client."""
    async with AsyncNextPDF(
        base_url="https://connect.example.com",
        api_key="set-from-secret",
    ) as client:
        text_blocks, tables = await asyncio.gather(
            client.ast.extract_cited_text(first),
            client.ast.extract_cited_tables(second),
        )
    print(f"text blocks: {len(text_blocks)}; tables: {tables.table_count}")
```

Models
Every response is a Pydantic model. Import model classes from nextpdf or nextpdf.models.ast.
| Symbol | Kind | Key fields | Notes |
|---|---|---|---|
| AstDocument | Document root | schema_version, source_hash, page_count, root: AstNode, estimated_tokens (property). | Returned by get_document_ast(). Accepts aliases schemaVersion, sourceHash, and pageCount. |
| AstNode | Tree node | id, type: NodeType, page_index, bbox, text_content, attributes, children: list[AstNode], pdf_object_number, mcid. | Recursive node that carries the document tree. |
| AstNodeMeta | Response metadata | etag, pages_processed. | Frozen; attached to node and search responses. |
| AstNodeShallow | Search hit | id, type: NodeType, page_index, bbox, text_content, attributes, children_count. | Frozen; no deep children. |
| BoundingBox | Value object | x, y, width, height (each 0.0–1.0). | Normalized coordinates within a page. |
| CitationAnchor | Value object | node_id, page_index, bbox: BoundingBox, confidence, content_hash. | Provenance record for each block. |
| CitedTextBlock | Text block | text, citation: CitationAnchor, node_type, chunk_index, depth. | Each item in the extract_cited_text() list. |
| CitedTableBlock | Table block | table_node_id, page_index, citation_anchor, row_count, col_count, rows. | Frozen; one table. |
| CitedTableCell | Table cell | row, col, row_span, col_span, text, bbox, confidence. | Frozen; one cell. |
| NodeType | Enum | document, section, heading, paragraph, list, table, figure, and others. | String enum for node-type values. |
| GetAstNodeResponse | Response | node: AstNode, meta: AstNodeMeta. | Returned by get_ast_node(). |
| SearchAstNodesResponse | Response | nodes: list[AstNodeShallow], total_matches, truncated, meta. | Returned by search_ast_nodes(). |
| ExtractCitedTablesResponse | Response | tables: list[CitedTableBlock], table_count, pages_processed. | Returned by extract_cited_tables(). |
| AstDiffEntry | Diff item | type (added/removed/changed), node_id, node_type, page_index, text_preview. | One change in a diff. |
| AstDiffSummary | Diff totals | added_node_count, removed_node_count, changed_node_count. | Aggregate counts. |
| GetAstDiffResponse | Response | original_page_count, modified_page_count, summary: AstDiffSummary, diff: list[AstDiffEntry], pages_processed. | Returned by get_ast_diff(). |
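Because AstNode is recursive, most consumers end up walking the tree. The sketch below shows a depth-first walk that collects heading text. It uses a small dataclass, `Node`, as a hypothetical stand-in for AstNode so that it runs without the SDK; only the `type`, `text_content`, and `children` field names are taken from the table above, and comparing `type` to a plain string assumes the string-enum behaviour described for NodeType.

```python
from collections.abc import Iterator
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for AstNode with only the fields used here."""

    type: str
    text_content: str = ""
    children: list["Node"] = field(default_factory=list)


def walk(node: Node) -> Iterator[Node]:
    """Yield the node and all of its descendants in depth-first, pre-order."""
    yield node
    for child in node.children:
        yield from walk(child)


def heading_texts(root: Node) -> list[str]:
    """Collect the text of every heading node in document order."""
    return [node.text_content for node in walk(root) if node.type == "heading"]
```

With a real response, the same walk would presumably start from the `root` field of the AstDocument returned by `get_document_ast()`.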
Read the citation anchor from an extracted text block:
```python
from nextpdf import CitedTextBlock


def describe(block: CitedTextBlock) -> str:
    """Render a text block with its page index and confidence."""
    anchor = block.citation
    return f"[page {anchor.page_index}, confidence {anchor.confidence:.2f}] {block.text[:80]}"
```

CLI commands
The nextpdf command runs extraction from the terminal. Pass --base-url and --api-key, or set NEXTPDF_BASE_URL and NEXTPDF_API_KEY in the environment. Every command except version requires connection settings. A PDF_PATH value of - reads PDF bytes from standard input.
| Symbol | Parameters | Default behavior | Returns | Throws or fails with | Notes |
|---|---|---|---|---|---|
| nextpdf extract text | PDF_PATH; --format {json,markdown,plain}, --page, --headings-only. | Emits cited text blocks as JavaScript Object Notation (JSON). | Writes to standard output or an --output file. | Exit code 1 for any NextPDFError. | --page selects one 0-based page index. |
| nextpdf extract tables | PDF_PATH; --format {json,csv}, --page-start, --page-end. | Emits tables as JSON. | Writes to standard output or an --output file. | Exit code 1 for any NextPDFError. | --format csv writes one comma-separated values (CSV) block per table. |
| nextpdf ast | PDF_PATH; --page-start, --page-end, --token-budget. | Emits the full Semantic AST as JSON. | Writes to standard output or an --output file. | Exit code 1 for any NextPDFError. | Use --token-budget to bound the response size. |
| nextpdf info | PDF_PATH. | Emits document metadata: page count, schema version, source hash, estimated tokens, and root summary. | Writes JSON to standard output or an --output file. | Exit code 1 for any NextPDFError. | Lightweight inspection command. |
| nextpdf version | None. | Prints the installed SDK version. | Writes to standard output. | None expected. | Does not contact a server; needs no credentials. |
| python -m nextpdf.mcp | None (reads NEXTPDF_BASE_URL, NEXTPDF_API_KEY). | Runs the Model Context Protocol server over standard input/output. | Long-running server process. | RuntimeError when the environment variables are unset. | Requires the nextpdf[mcp] extra. |
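When a script drives the CLI, building the argument list in one place keeps flag spelling consistent. The sketch below assembles a `nextpdf extract tables` invocation from the flags listed in the table; the function name `extract_tables_command` is hypothetical, and the function only builds the list without running anything.

```python
from typing import Optional


def extract_tables_command(
    pdf_path: str,
    *,
    output_format: str = "json",
    page_start: Optional[int] = None,
    page_end: Optional[int] = None,
    output: Optional[str] = None,
) -> list[str]:
    """Build the argument list for a `nextpdf extract tables` invocation."""
    if output_format not in {"json", "csv"}:
        raise ValueError(f"Unsupported table format: {output_format!r}")

    command = ["nextpdf", "extract", "tables", pdf_path, "--format", output_format]
    # Compare against None so that a page index of 0 is still emitted.
    if page_start is not None:
        command += ["--page-start", str(page_start)]
    if page_end is not None:
        command += ["--page-end", str(page_end)]
    if output is not None:
        command += ["--output", output]
    return command
```

The resulting list can be passed to `subprocess.run()` without a shell, which avoids quoting problems with file names that contain spaces.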
Extract tables as comma-separated values (CSV) from a page range:
```shell
nextpdf extract tables invoice.pdf --format csv --page-start 0 --page-end 2 --output tables.csv
```

Exceptions
The exception hierarchy lives in nextpdf.models.errors and is re-exported from nextpdf. Catch the most specific class your code can handle, then fall back to the base class. The server reports failure with Hypertext Transfer Protocol (HTTP) status semantics aligned with Request for Comments (RFC) 9110. Each exception carries the originating status_code and, when available, an error_code.
| Symbol | Base class | status_code | When it is raised | Notes |
|---|---|---|---|---|
| NextPDFError | Exception | optional | Base class for every SDK error. | Carries an optional status_code. Catch it last as a fallback. |
| NextPDFAPIError | NextPDFError | required | The Connect endpoint returned an HTTP error. | Adds error_code. |
| NextPDFLicenseError | NextPDFAPIError | 402 | The server requires a higher-tier license for the feature. | error_code is license/tier-required. |
| QuotaExceededError | NextPDFAPIError | 429 | A rate limit or quota was exceeded. | Carries retry_after; honor it before retrying. |
| AstNoStructTreeError | NextPDFAPIError | 422 | The PDF is untagged, and heuristic fallback is not enabled. | Enable heuristic mode or supply a tagged PDF. |
| AstBuildTimeoutError | NextPDFAPIError | 504 | The AST build timed out on the server. | Reduce the page range and retry. |
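Honoring `retry_after` before retrying can be sketched as a small wrapper. To stay runnable without the SDK, the block defines its own `QuotaExceededError` as a hypothetical stand-in for the real class; `call_with_quota_retry` is also a hypothetical helper. The sketch assumes `retry_after` is a delay in seconds and may be absent, neither of which this page confirms.

```python
import time
from collections.abc import Callable
from typing import Optional, TypeVar

T = TypeVar("T")


class QuotaExceededError(Exception):
    """Hypothetical stand-in for nextpdf.QuotaExceededError."""

    def __init__(self, message: str, retry_after: Optional[float] = None) -> None:
        super().__init__(message)
        self.retry_after = retry_after


def call_with_quota_retry(
    operation: Callable[[], T],
    *,
    max_attempts: int = 3,
    default_delay: float = 1.0,
    sleep: Callable[[float], object] = time.sleep,
) -> T:
    """Call operation, waiting retry_after seconds between quota failures."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")

    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except QuotaExceededError as error:
            if attempt == max_attempts:
                raise  # Out of attempts: surface the last quota error.
            # Assumption: retry_after is in seconds; fall back when it is missing.
            delay = error.retry_after if error.retry_after is not None else default_delay
            sleep(delay)
    raise AssertionError("unreachable")
```

The `sleep` parameter is injectable so that tests can record the delays instead of actually waiting.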
Catch the most specific exception first, then fall back to the base classes:

```python
from nextpdf import (
    NextPDF,
    AstBuildTimeoutError,
    NextPDFAPIError,
    NextPDFError,
    QuotaExceededError,
)


def extract_text(client: NextPDF, pdf_bytes: bytes) -> int:
    """Extract cited text, handling the most specific failures first."""
    try:
        blocks = client.ast.extract_cited_text(pdf_bytes)
    except QuotaExceededError as error:
        raise RuntimeError(f"Quota exceeded (retry after {error.retry_after}s)") from error
    except AstBuildTimeoutError as error:
        raise RuntimeError("AST build timed out; reduce the page range") from error
    except NextPDFAPIError as error:
        raise RuntimeError(f"API error {error.status_code}: {error}") from error
    except NextPDFError as error:
        raise RuntimeError(f"SDK error: {error}") from error
    return len(blocks)
```

Development notes
- The synchronous `NextPDF` client delegates every call to `AsyncNextPDF`. You can call it from a notebook or a thread that already runs an event loop, because it dispatches the coroutine to a worker thread when it detects a running loop.
- Prefer the async context-manager form `async with AsyncNextPDF(...) as client:` so the connection pool closes deterministically. When you construct `AsyncNextPDF` directly, call `close()` yourself.
- The bearer token is never logged or included in error messages, and Transport Layer Security (TLS) verification is enabled by default. Do not embed credentials in source; read them from the environment or a secret manager.
- All models are Pydantic v2 classes; several response models are frozen (immutable). Treat extracted blocks as read-only values.
- The CLI exits with status code 1 for any `NextPDFError` and prints the message to standard error. Wire that exit code into pipelines.
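Wiring the CLI exit code into a Python-driven pipeline might look like the sketch below. `run_step` is a hypothetical helper, not part of the SDK. It runs any command, returns standard output on success, and raises with the captured standard error on a non-zero exit, which matches the exit-code behaviour described in the last note.

```python
import subprocess


def run_step(command: list[str]) -> str:
    """Run a command and return its stdout, raising when it exits non-zero."""
    completed = subprocess.run(command, capture_output=True, text=True, check=False)
    if completed.returncode != 0:
        # Include stderr so the pipeline log shows the message the tool printed.
        raise RuntimeError(
            f"{command[0]} exited with {completed.returncode}: {completed.stderr.strip()}"
        )
    return completed.stdout
```

A pipeline step could then call, for example, `run_step(["nextpdf", "info", "invoice.pdf"])` and let the raised error stop the run.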
See also
- Python SDK developer guide: architecture, async batching, and failure handling.
- Python CLI: terminal extraction and streaming for large files.
- Python MCP server: expose extraction tools to artificial intelligence (AI) agents.
- RFC 9110 (HTTP Semantics) and RFC 9457 (Problem Details for HTTP APIs) describe the status semantics and machine-readable error bodies the Connect endpoint returns. See the IETF RFC index for the authoritative text.