Python API reference

At a glance

The NextPDF Python Software Development Kit (SDK) exposes two clients, one shared Abstract Syntax Tree (AST) method namespace named ast, Pydantic models for every response, a nextpdf command-line interface (CLI), and a six-class exception hierarchy. Use this page as the reference for public application programming interface (API) symbols that work with Portable Document Format (PDF) documents.

Import public symbols from the top-level package:

from nextpdf import (
    AsyncNextPDF,
    NextPDF,
    AstBuildTimeoutError,
    AstNoStructTreeError,
    NextPDFAPIError,
    NextPDFError,
    NextPDFLicenseError,
    QuotaExceededError,
)

Every extraction method takes raw PDF bytes (bytes) as its first positional argument and returns a typed Pydantic model. Pass options as keyword-only arguments. The synchronous NextPDF.ast.* methods and asynchronous AsyncNextPDF.ast.* methods have identical signatures. Asynchronous methods are coroutines; call them with await.

Client

The synchronous NextPDF client wraps the asynchronous client and runs each coroutine to completion. AsyncNextPDF is both an asynchronous client and an async context manager. Prefer the context-manager form so the underlying transport closes deterministically.

Constructors

Symbol	Parameters	Default behavior	Returns	Throws or fails with	Notes
`NextPDF(*, base_url, api_key, api_version='v1')`	Keyword-only base URL, API key, and optional API version.	Creates a remote-backed synchronous client.	`NextPDF`	`ValueError` when `base_url` or `api_key` is empty.	Runs async work synchronously; safe to call from a notebook or from a thread that already runs an event loop.
`AsyncNextPDF(*, base_url='', api_key='', api_version='v1', backend=None)`	Keyword-only base URL, API key, optional API version, and optional injected backend.	Creates a remote-backed asynchronous client when no backend is injected.	`AsyncNextPDF`	`ValueError` when `base_url` or `api_key` is empty and no `backend` is supplied.	Pass `backend=` to inject a custom or local backend in tests.
`AsyncNextPDF.__aenter__()`	None.	Enters the async context and returns the client.	`AsyncNextPDF`	None expected.	Use `async with AsyncNextPDF(...) as client:`.
`AsyncNextPDF.__aexit__(*_)`	Exception type, value, and traceback; all ignored.	Calls `close()` on context exit.	`None`	None expected.	Releases the transport even when the body raises; the exception still propagates.
`AsyncNextPDF.close()`	None.	Closes the owned remote backend and releases the connection pool.	`None`	None expected.	Idempotent; injected backends are not closed.

Do not keep the API key in source code. Read base_url and api_key from the environment (NEXTPDF_BASE_URL, NEXTPDF_API_KEY) or a secret manager.

import os

from nextpdf import AsyncNextPDF


async def extract(pdf_bytes: bytes) -> int:
    """Return the page count of a PDF using the async client as a context manager."""
    base_url = os.environ["NEXTPDF_BASE_URL"]
    api_key = os.environ["NEXTPDF_API_KEY"]

    async with AsyncNextPDF(base_url=base_url, api_key=api_key) as client:
        document = await client.ast.get_document_ast(pdf_bytes)
        return document.page_count

Synchronous AST methods — `NextPDF.ast.*`

Symbol	Parameters	Default behavior	Returns	Throws or fails with	Notes
`NextPDF.ast.get_document_ast()`	`pdf_data: bytes`; keyword `page_range_start`, `page_range_end`, `token_budget`.	Builds the full Semantic AST for every page.	`AstDocument`	`AstNoStructTreeError`, `AstBuildTimeoutError`, `NextPDFLicenseError`, `QuotaExceededError`.	Reduce the page range when a build times out.
`NextPDF.ast.extract_cited_text()`	`pdf_data: bytes`; keyword `page_index`, `headings_only`.	Extracts all text blocks with citation anchors.	`list[CitedTextBlock]`	`NextPDFAPIError`, `QuotaExceededError`.	Set `headings_only=True` to retrieve only heading nodes.
`NextPDF.ast.extract_cited_tables()`	`pdf_data: bytes`; keyword `page_range` (`dict` with `start` and `end`).	Extracts all tables with cell-level citation anchors.	`ExtractCitedTablesResponse`	`NextPDFAPIError`, `QuotaExceededError`.	Omit `page_range` to scan the whole document.
`NextPDF.ast.get_ast_node()`	`pdf_data: bytes`, `node_id: str`.	Retrieves one node by its identifier.	`GetAstNodeResponse`	`NextPDFError` when the node is not found.	`node_id` format is `ast:{hash6}:{pageIdx}:{seq}`.
`NextPDF.ast.search_ast_nodes()`	`pdf_data: bytes`; keyword `node_type`, `page_index`, `text_query`, `max_results=100`.	Returns shallow nodes that match the filters.	`SearchAstNodesResponse`	`NextPDFAPIError`.	`text_query` is a case-insensitive substring match.
`NextPDF.ast.get_ast_diff()`	`original_pdf_data: bytes`, `modified_pdf_data: bytes`.	Compares two documents by structure.	`GetAstDiffResponse`	`NextPDFAPIError`, `QuotaExceededError`.	Reports added, removed, and changed nodes.
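
The node_id format noted for get_ast_node() can be split into its parts before a lookup, for example to group nodes by page. This is a minimal sketch, not an SDK helper. It assumes that hash6 is any six characters other than a colon and that pageIdx and seq are non-negative decimal integers; the source states only the overall pattern. parse_node_id and NodeIdParts are hypothetical names.

```python
import re
from typing import NamedTuple

# Assumed shape: "ast:" + six-character hash + ":" + page index + ":" + sequence.
_NODE_ID = re.compile(r"ast:(?P<hash6>[^:]{6}):(?P<page>\d+):(?P<seq>\d+)")


class NodeIdParts(NamedTuple):
    """Components of a node identifier (hypothetical helper type)."""

    hash6: str
    page_index: int
    seq: int


def parse_node_id(node_id: str) -> NodeIdParts:
    """Split a node identifier into its parts, or raise ValueError."""
    match = _NODE_ID.fullmatch(node_id)
    if match is None:
        raise ValueError(f"not a recognizable node id: {node_id!r}")
    return NodeIdParts(match["hash6"], int(match["page"]), int(match["seq"]))
```

Validating the identifier locally avoids a round trip for input that cannot match the documented pattern.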

Asynchronous AST methods — `AsyncNextPDF.ast.*`

Each asynchronous method is a coroutine with the same parameters, defaults, return type, and failure modes as its synchronous counterpart. Call it with await inside an asyncio runtime.

Symbol	Parameters	Default behavior	Returns	Throws or fails with	Notes
`AsyncNextPDF.ast.get_document_ast()`	`pdf_data: bytes`; keyword `page_range_start`, `page_range_end`, `token_budget`.	Builds the full Semantic AST for every page.	`AstDocument`	`AstNoStructTreeError`, `AstBuildTimeoutError`, `NextPDFLicenseError`, `QuotaExceededError`.	`await` the result.
`AsyncNextPDF.ast.extract_cited_text()`	`pdf_data: bytes`; keyword `page_index`, `headings_only`.	Extracts all text blocks with citation anchors.	`list[CitedTextBlock]`	`NextPDFAPIError`, `QuotaExceededError`.	`await` the result.
`AsyncNextPDF.ast.extract_cited_tables()`	`pdf_data: bytes`; keyword `page_range`.	Extracts all tables with cell-level citation anchors.	`ExtractCitedTablesResponse`	`NextPDFAPIError`, `QuotaExceededError`.	`await` the result.
`AsyncNextPDF.ast.get_ast_node()`	`pdf_data: bytes`, `node_id: str`.	Retrieves one node by its identifier.	`GetAstNodeResponse`	`NextPDFError` when the node is not found.	`await` the result.
`AsyncNextPDF.ast.search_ast_nodes()`	`pdf_data: bytes`; keyword `node_type`, `page_index`, `text_query`, `max_results=100`.	Returns shallow nodes that match the filters.	`SearchAstNodesResponse`	`NextPDFAPIError`.	`await` the result.
`AsyncNextPDF.ast.get_ast_diff()`	`original_pdf_data: bytes`, `modified_pdf_data: bytes`.	Compares two documents by structure.	`GetAstDiffResponse`	`NextPDFAPIError`, `QuotaExceededError`.	`await` the result.
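
The guidance to reduce the page range after a build timeout can be turned into fixed-size windows that together cover the document. This sketch assumes that page indices are 0-based and that the end index is inclusive; this page does not confirm either detail for page_range_start and page_range_end, so verify them before relying on the output. split_page_range is a hypothetical helper.

```python
from collections.abc import Iterator


def split_page_range(page_count: int, pages_per_call: int) -> Iterator[tuple[int, int]]:
    """Yield (start, end) page windows that together cover the document.

    Assumption: indices are 0-based and `end` is inclusive.
    """
    if page_count < 0:
        raise ValueError("page_count must not be negative")
    if pages_per_call < 1:
        raise ValueError("pages_per_call must be at least 1")
    for start in range(0, page_count, pages_per_call):
        # The last window may be shorter than pages_per_call.
        yield start, min(start + pages_per_call, page_count) - 1
```

Each pair can then be passed as page_range_start and page_range_end to get_document_ast(), with a smaller pages_per_call on the next attempt if a window still times out.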

Use the async client as a context manager to batch two extractions concurrently:

import asyncio
import os

from nextpdf import AsyncNextPDF


async def extract_pair(first: bytes, second: bytes) -> None:
    """Extract two PDFs concurrently with one shared async client."""
    # Read connection settings from the environment rather than hard-coding them.
    base_url = os.environ["NEXTPDF_BASE_URL"]
    api_key = os.environ["NEXTPDF_API_KEY"]

    async with AsyncNextPDF(base_url=base_url, api_key=api_key) as client:
        # gather() runs both extractions concurrently and returns results in order.
        text_blocks, tables = await asyncio.gather(
            client.ast.extract_cited_text(first),
            client.ast.extract_cited_tables(second),
        )
        print(f"text blocks: {len(text_blocks)}; tables: {tables.table_count}")

Models

Every response is a Pydantic model. Import model classes from nextpdf or nextpdf.models.ast.

Symbol	Kind	Key fields	Notes
`AstDocument`	Document root	`schema_version`, `source_hash`, `page_count`, `root: AstNode`, `estimated_tokens` (property).	Returned by `get_document_ast()`. Accepts aliases `schemaVersion`, `sourceHash`, and `pageCount`.
`AstNode`	Tree node	`id`, `type: NodeType`, `page_index`, `bbox`, `text_content`, `attributes`, `children: list[AstNode]`, `pdf_object_number`, `mcid`.	Recursive node that carries the document tree.
`AstNodeMeta`	Response metadata	`etag`, `pages_processed`.	Frozen; attached to node and search responses.
`AstNodeShallow`	Search hit	`id`, `type: NodeType`, `page_index`, `bbox`, `text_content`, `attributes`, `children_count`.	Frozen; no deep children.
`BoundingBox`	Value object	`x`, `y`, `width`, `height` (each `0.0`–`1.0`).	Normalized coordinates within a page.
`CitationAnchor`	Value object	`node_id`, `page_index`, `bbox: BoundingBox`, `confidence`, `content_hash`.	Provenance record for each block.
`CitedTextBlock`	Text block	`text`, `citation: CitationAnchor`, `node_type`, `chunk_index`, `depth`.	Each item in the `extract_cited_text()` list.
`CitedTableBlock`	Table block	`table_node_id`, `page_index`, `citation_anchor`, `row_count`, `col_count`, `rows`.	Frozen; one table.
`CitedTableCell`	Table cell	`row`, `col`, `row_span`, `col_span`, `text`, `bbox`, `confidence`.	Frozen; one cell.
`NodeType`	Enum	`document`, `section`, `heading`, `paragraph`, `list`, `table`, `figure`, and others.	String enum for node-type values.
`GetAstNodeResponse`	Response	`node: AstNode`, `meta: AstNodeMeta`.	Returned by `get_ast_node()`.
`SearchAstNodesResponse`	Response	`nodes: list[AstNodeShallow]`, `total_matches`, `truncated`, `meta`.	Returned by `search_ast_nodes()`.
`ExtractCitedTablesResponse`	Response	`tables: list[CitedTableBlock]`, `table_count`, `pages_processed`.	Returned by `extract_cited_tables()`.
`AstDiffEntry`	Diff item	`type` (`added`/`removed`/`changed`), `node_id`, `node_type`, `page_index`, `text_preview`.	One change in a diff.
`AstDiffSummary`	Diff totals	`added_node_count`, `removed_node_count`, `changed_node_count`.	Aggregate counts.
`GetAstDiffResponse`	Response	`original_page_count`, `modified_page_count`, `summary: AstDiffSummary`, `diff: list[AstDiffEntry]`, `pages_processed`.	Returned by `get_ast_diff()`.
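
Because AstNode is recursive, most consumers need a tree walk. The sketch below collects heading text in pre-order. It uses a hypothetical Node dataclass as a stand-in that mirrors only the type, text_content, and children fields from the table. It also assumes that children are stored in document order and that the node type compares equal to its string value, which holds for a string enum.

```python
from collections.abc import Iterator
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    """Hypothetical stand-in exposing the AstNode fields this sketch reads."""

    type: str
    text_content: Optional[str] = None
    children: list["Node"] = field(default_factory=list)


def iter_nodes(root: Node) -> Iterator[Node]:
    """Yield the root and every descendant in pre-order."""
    stack = [root]
    while stack:
        node = stack.pop()
        yield node
        # Push children reversed so the first child is visited first.
        stack.extend(reversed(node.children))


def heading_texts(root: Node) -> list[str]:
    """Return the text of every heading node, in traversal order."""
    return [n.text_content or "" for n in iter_nodes(root) if n.type == "heading"]
```

An explicit stack avoids Python's recursion limit on deeply nested documents.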

Read the citation anchor from an extracted text block:

from nextpdf import CitedTextBlock


def describe(block: CitedTextBlock) -> str:
    """Render a text block with its page index and confidence."""
    anchor = block.citation
    return f"[page {anchor.page_index}, confidence {anchor.confidence:.2f}] {block.text[:80]}"
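
The normalized BoundingBox values (0.0 to 1.0) need to be scaled by the page size before they can be drawn or compared with absolute coordinates. This is a minimal sketch using a hypothetical Box dataclass with the same four fields. This page does not state which page corner is the origin, so the function only scales the values and leaves axis orientation to the caller.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Hypothetical stand-in with the four BoundingBox fields."""

    x: float
    y: float
    width: float
    height: float


def to_page_units(
    box: Box, page_width: float, page_height: float
) -> tuple[float, float, float, float]:
    """Scale a normalized box to (x0, y0, x1, y1) in page units."""
    x0 = box.x * page_width
    y0 = box.y * page_height
    # The far corner is the near corner plus the scaled extent.
    return x0, y0, x0 + box.width * page_width, y0 + box.height * page_height
```

For a 600 by 800 unit page, Box(0.25, 0.5, 0.5, 0.25) maps to (150.0, 400.0, 450.0, 600.0).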

CLI commands

The nextpdf command runs extraction from the terminal. Pass --base-url and --api-key, or set NEXTPDF_BASE_URL and NEXTPDF_API_KEY in the environment. Every command except version requires connection settings. A PDF_PATH value of - reads PDF bytes from standard input.

Symbol	Parameters	Default behavior	Returns	Throws or fails with	Notes
`nextpdf extract text`	`PDF_PATH`; `--format {json,markdown,plain}`, `--page`, `--headings-only`.	Emits cited text blocks as JavaScript Object Notation (JSON).	Writes to standard output or an `--output` file.	Exit code 1 for any `NextPDFError`.	`--page` selects one 0-based page index.
`nextpdf extract tables`	`PDF_PATH`; `--format {json,csv}`, `--page-start`, `--page-end`.	Emits tables as JSON.	Writes to standard output or an `--output` file.	Exit code 1 for any `NextPDFError`.	`--format csv` writes one comma-separated values (CSV) block per table.
`nextpdf ast`	`PDF_PATH`; `--page-start`, `--page-end`, `--token-budget`.	Emits the full Semantic AST as JSON.	Writes to standard output or an `--output` file.	Exit code 1 for any `NextPDFError`.	Use `--token-budget` to bound the response size.
`nextpdf info`	`PDF_PATH`.	Emits document metadata: page count, schema version, source hash, estimated tokens, and root summary.	Writes JSON to standard output or an `--output` file.	Exit code 1 for any `NextPDFError`.	Lightweight inspection command.
`nextpdf version`	None.	Prints the installed SDK version.	Writes to standard output.	None expected.	Does not contact a server; needs no credentials.
`python -m nextpdf.mcp`	None (reads `NEXTPDF_BASE_URL`, `NEXTPDF_API_KEY`).	Runs the Model Context Protocol server over standard input/output.	Long-running server process.	`RuntimeError` when the environment variables are unset.	Requires the `nextpdf[mcp]` extra.

Extract tables as comma-separated values (CSV) from a page range:

nextpdf extract tables invoice.pdf --format csv --page-start 0 --page-end 2 --output tables.csv

Exceptions

The exception hierarchy lives in nextpdf.models.errors and is re-exported from nextpdf. Catch the most specific class your code can handle, then fall back to the base class. The server reports failure with Hypertext Transfer Protocol (HTTP) status semantics aligned with Request for Comments (RFC) 9110. Each API error carries the originating status_code and, when available, an error_code; on the base NextPDFError, status_code is optional.

Symbol	Base class	`status_code`	When it is raised	Notes
`NextPDFError`	`Exception`	optional	Base class for every SDK error.	Carries an optional `status_code`. Catch it last as a fallback.
`NextPDFAPIError`	`NextPDFError`	required	The Connect endpoint returned an HTTP error.	Adds `error_code`.
`NextPDFLicenseError`	`NextPDFAPIError`	`402`	The server requires a higher-tier license for the feature.	`error_code` is `license/tier-required`.
`QuotaExceededError`	`NextPDFAPIError`	`429`	A rate limit or quota was exceeded.	Carries `retry_after`; honor it before retrying.
`AstNoStructTreeError`	`NextPDFAPIError`	`422`	The PDF is untagged, and heuristic fallback is not enabled.	Enable heuristic mode or supply a tagged PDF.
`AstBuildTimeoutError`	`NextPDFAPIError`	`504`	The AST build timed out on the server.	Reduce the page range and retry.

Order the except clauses from most specific to least specific:

from nextpdf import (
    NextPDF,
    AstBuildTimeoutError,
    NextPDFAPIError,
    NextPDFError,
    QuotaExceededError,
)


def extract_text(client: NextPDF, pdf_bytes: bytes) -> int:
    """Extract cited text, handling the most specific failures first."""
    try:
        blocks = client.ast.extract_cited_text(pdf_bytes)
    except QuotaExceededError as error:
        raise RuntimeError(f"Quota exceeded (retry after {error.retry_after}s)") from error
    except AstBuildTimeoutError as error:
        raise RuntimeError("AST build timed out; reduce the page range") from error
    except NextPDFAPIError as error:
        raise RuntimeError(f"API error {error.status_code}: {error}") from error
    except NextPDFError as error:
        raise RuntimeError(f"SDK error: {error}") from error
    return len(blocks)

Development notes

The synchronous NextPDF client delegates every call to AsyncNextPDF. You can call it from a notebook or a thread that already runs an event loop, because it dispatches the coroutine to a worker thread when it detects a running loop.
Prefer the async context-manager form async with AsyncNextPDF(...) as client: so the connection pool closes deterministically. When you construct AsyncNextPDF directly, call close() yourself.
The bearer token is never logged or included in error messages, and Transport Layer Security (TLS) verification is enabled by default. Do not embed credentials in source; read them from the environment or a secret manager.
All models are Pydantic v2 classes; several response models are frozen (immutable). Treat extracted blocks as read-only values.
The CLI exits with status code 1 for any NextPDFError and prints the message to standard error. Wire that exit code into pipelines.
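
When a pipeline step is written in Python, the exit-code contract above can be enforced with a small wrapper around subprocess. This sketch relies only on what this page states: a nonzero exit code signals failure, and the message is on standard error. run_cli is a hypothetical helper and works with any command.

```python
import subprocess
from collections.abc import Sequence


def run_cli(command: Sequence[str]) -> str:
    """Run a command, return its standard output, and raise on a nonzero exit."""
    completed = subprocess.run(command, capture_output=True, text=True, check=False)
    if completed.returncode != 0:
        # Include the captured standard error so the failure reason is not lost.
        raise RuntimeError(
            f"{command[0]} exited with {completed.returncode}: {completed.stderr.strip()}"
        )
    return completed.stdout
```

A call such as run_cli(["nextpdf", "info", "report.pdf"]) would then either return the JSON text or stop the step with the CLI's error message; report.pdf is a placeholder path.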