Python SDK developer guide
At a glance
The NextPDF Python Software Development Kit (SDK) is a thin, typed client for a NextPDF Connect endpoint. Your application owns Portable Document Format (PDF) input validation, credential handling, and concurrency policy. The SDK owns request construction, transport, and response typing. Keep that boundary clear: read the PDF safely, choose a client, call the `ast` method you need, and handle the specific failure.
Use this guide when you build extraction services, asyncio batch jobs, artificial intelligence (AI) agent tools, or command-line workflows around the SDK. It assumes you have read the overview and quickstart, and that you have Python 3.10 or newer and a NextPDF Connect endpoint.
Architecture boundary
| Layer | Owned by | Responsibility | Do not put here |
|---|---|---|---|
| Input source | Application | Authorize the caller, validate the PDF source, and choose the extraction policy. | Endpoint Uniform Resource Locator (URL) or credential literals. |
| Client construction | Application | Read `base_url` and `api_key` from the environment or a secret manager. | Hard-coded secrets. |
| `NextPDF` / `AsyncNextPDF` | SDK | Build the request, call Connect, and return typed Pydantic models. | Domain logic or storage policy. |
| `ast` method namespace | SDK | Map a method call to a Connect endpoint and parse the response. | Retry or backoff policy beyond what you configure. |
| NextPDF Connect endpoint | Deployment | Run extraction and enforce authentication, quotas, and licensing. | Application authorization. |
The SDK never performs optical character recognition (OCR). If a PDF is scanned or image-only, run OCR before extraction. Treat that step as an application concern outside this boundary.
Runtime lifecycle
| Stage | Behavior | Developer action |
|---|---|---|
| Client construction | base_url and api_key are validated; either empty value raises ValueError. | Read both from the environment; never inline them. |
| Backend creation | A remote backend opens a pooled connection to Connect. | Reuse one client across calls instead of constructing per request. |
| Method call | The ast method serializes the request, sends PDF bytes, and parses the response into a Pydantic model. | Pass already-validated bytes. |
| Error mapping | The SDK maps a non-success Hypertext Transfer Protocol (HTTP) status to a specific exception subclass. | Catch the most specific class first. |
| Shutdown | AsyncNextPDF.close() releases the connection pool; the async context manager calls it for you. | Use async with or call close() in a finally block. |
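The shutdown row can be sketched without a live endpoint. The block below is only an illustration: `StubAsyncClient` is a hypothetical stand-in that mimics the two lifecycle behaviors the table describes (rejecting empty configuration, releasing resources in `close()`), and `run_with_cleanup` is an assumed helper name. Neither is part of the SDK.

```python
import asyncio
from collections.abc import Awaitable, Callable
from typing import TypeVar

T = TypeVar("T")


class StubAsyncClient:
    """Hypothetical stand-in for an async client; NOT the real SDK class."""

    def __init__(self, base_url: str, api_key: str) -> None:
        # Mirrors the documented behavior: an empty value raises ValueError.
        if not base_url or not api_key:
            raise ValueError("base_url and api_key must be non-empty")
        self.closed = False

    async def close(self) -> None:
        # Setting a flag twice is harmless, so repeated calls are safe.
        self.closed = True


async def run_with_cleanup(
    client: StubAsyncClient,
    work: Callable[[StubAsyncClient], Awaitable[T]],
) -> T:
    """Run `work` and always close the client, even when `work` raises."""
    try:
        return await work(client)
    finally:
        await client.close()
```

With the real client, `async with` gives the same guarantee in less code; the explicit `finally` form is mainly useful when the client's lifetime spans several functions.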
Recommended application structure
| Path | Purpose |
|---|---|
| `app/pdf/clients.py` | Build and cache a configured `NextPDF` or `AsyncNextPDF`. |
| `app/pdf/extraction.py` | Application wrapper around the `ast` method calls. |
| `app/pdf/validation.py` | PDF source validation, size limits, and content checks. |
| `tests/pdf/` | Extraction, failure-mode, and async-batching tests. |
Keep PDF validation separate from extraction. Pass only authorized, size-checked bytes into the extraction layer, and still rely on the endpoint for defense in depth.

    import os

    from nextpdf import NextPDF


    def build_client() -> NextPDF:
        """Construct a synchronous client from environment configuration.

        Raises:
            KeyError: When a required environment variable is missing.
        """
        base_url = os.environ["NEXTPDF_BASE_URL"]
        api_key = os.environ["NEXTPDF_API_KEY"]
        return NextPDF(base_url=base_url, api_key=api_key)

Synchronous client pattern
Use the synchronous `NextPDF` client for scripts, batch jobs, and notebooks. Validate input before you call the SDK, and handle the specific failures the call can raise.

    from pathlib import Path

    from nextpdf import (
        NextPDF,
        CitedTextBlock,
        NextPDFAPIError,
        NextPDFError,
        QuotaExceededError,
    )

    MAX_PDF_BYTES = 100 * 1024 * 1024  # Reject documents above 100 MiB for the in-memory path.


    def read_pdf(path: Path) -> bytes:
        """Read and validate a PDF from disk.

        Raises:
            ValueError: When the file is missing, empty, oversized, or not a PDF.
        """
        if not path.is_file():
            raise ValueError(f"Not a file: {path}")
        data = path.read_bytes()
        if not data:
            raise ValueError("PDF is empty")
        if len(data) > MAX_PDF_BYTES:
            raise ValueError("PDF exceeds the configured size limit; use the CLI streaming path")
        if not data.startswith(b"%PDF-"):
            raise ValueError("File does not look like a PDF")
        return data


    def extract_text(client: NextPDF, path: Path) -> list[CitedTextBlock]:
        """Extract cited text blocks, handling the most specific failures first."""
        pdf_bytes = read_pdf(path)
        try:
            return client.ast.extract_cited_text(pdf_bytes)
        except QuotaExceededError as error:
            raise RuntimeError(f"Quota exceeded; retry after {error.retry_after}s") from error
        except NextPDFAPIError as error:
            raise RuntimeError(f"API error {error.status_code}: {error}") from error
        except NextPDFError as error:
            raise RuntimeError(f"SDK error: {error}") from error

One result item has this shape:

    # `blocks` is the list returned by extract_text() above.
    block = blocks[0]
    print(block.text)                 # the extracted text
    print(block.citation.page_index)  # 0-based page index
    print(block.citation.confidence)  # 0.0 - 1.0

Async and batching pattern
Use the asynchronous `AsyncNextPDF` client inside asyncio runtimes such as FastAPI. Construct one client as an async context manager and share it across concurrent calls; do not open a client per document. Limit concurrency with a semaphore so you respect the endpoint's quota.

    import asyncio
    import os

    from nextpdf import (
        AsyncNextPDF,
        ExtractCitedTablesResponse,
        NextPDFError,
        QuotaExceededError,
    )


    async def extract_tables_batch(
        pdfs: list[bytes],
        *,
        max_concurrency: int = 4,
    ) -> list[ExtractCitedTablesResponse | None]:
        """Extract tables from many PDFs concurrently with one shared client.

        Returns one response per input PDF, or None where extraction failed.
        """
        base_url = os.environ["NEXTPDF_BASE_URL"]
        api_key = os.environ["NEXTPDF_API_KEY"]
        semaphore = asyncio.Semaphore(max_concurrency)

        async with AsyncNextPDF(base_url=base_url, api_key=api_key) as client:

            async def one(pdf_bytes: bytes) -> ExtractCitedTablesResponse | None:
                async with semaphore:
                    try:
                        return await client.ast.extract_cited_tables(pdf_bytes)
                    except QuotaExceededError as error:
                        # Surface the backpressure signal; do not silently drop it.
                        raise RuntimeError(f"Quota exceeded; retry after {error.retry_after}s") from error
                    except NextPDFError:
                        return None

            return await asyncio.gather(*(one(pdf) for pdf in pdfs))

Never write an empty `except` block. Act on the failure, convert it to a defined result, or re-raise it.
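The quota branch above surfaces `retry_after` instead of hiding it. One way a caller might honor that hint is sketched below. It assumes a hypothetical `QuotaSignal` exception standing in for `QuotaExceededError` so the snippet runs without the SDK; `call_with_retry` and all of its parameters are illustrative names, not SDK features.

```python
import asyncio
from collections.abc import Awaitable, Callable
from typing import Optional, TypeVar

T = TypeVar("T")


class QuotaSignal(Exception):
    """Hypothetical stand-in for a quota error that carries retry_after."""

    def __init__(self, retry_after: Optional[float] = None) -> None:
        super().__init__("quota exceeded")
        self.retry_after = retry_after


async def call_with_retry(
    call: Callable[[], Awaitable[T]],
    *,
    max_attempts: int = 4,
    base_delay: float = 0.5,
    max_delay: float = 30.0,
    sleep: Callable[[float], Awaitable[None]] = asyncio.sleep,
) -> T:
    """Retry `call` on QuotaSignal, preferring the server's retry_after hint."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return await call()
        except QuotaSignal as error:
            if attempt == max_attempts:
                raise  # Out of attempts: surface the backpressure signal.
            # Exponential backoff is the fallback when no hint is given.
            backoff = base_delay * 2 ** (attempt - 1)
            delay = error.retry_after if error.retry_after is not None else backoff
            await sleep(min(delay, max_delay))
    raise AssertionError("unreachable")  # The loop always returns or raises.
```

The `sleep` parameter exists so tests can inject a recorder and avoid real waiting. Only the quota signal is retried here; other failures propagate immediately.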
Extension points
| Extension point | Use it for | Constraint |
|---|---|---|
| `AsyncNextPDF(backend=...)` | Inject a custom or local backend in tests. | The backend must satisfy the `PdfBackend` protocol. |
| `api_version` argument | Pin a Connect application programming interface (API) version. | Defaults to `v1`; change only when the endpoint supports the target version. |
| Environment configuration | Supply `NEXTPDF_BASE_URL` and `NEXTPDF_API_KEY` to the command-line interface (CLI) and Model Context Protocol (MCP) server. | Treat the key as a secret scoped to the workload. |
| MCP server (`python -m nextpdf.mcp`) | Expose extraction tools to MCP-capable agents. | Requires the `nextpdf[mcp]` extra and a controlled endpoint. |
Development workflow
- Install the SDK with `pip install nextpdf`, or use `pip install nextpdf[mcp]` for the agent server.
- Read `NEXTPDF_BASE_URL` and `NEXTPDF_API_KEY` from the environment so no secret enters source control.
- Validate every PDF source for existence, size, and the `%PDF-` magic bytes before calling the SDK.
- Build one client per process and reuse it; for asyncio, hold it open with `async with`.
- Call the narrowest `ast` method for the task: `extract_cited_text()` for prose, `extract_cited_tables()` for tables, `get_document_ast()` only when you need the full tree.
- Catch the most specific exception you can act on, then fall back to `NextPDFError`.
- For documents over 100 MiB, use the CLI streaming path instead of materializing every block in memory.
- Run mypy in strict mode and add a failure-mode test for each exception you handle.
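The configuration step in the list above can be made to fail fast with a small loader. This is a minimal sketch, not SDK code: `ConnectSettings` and `load_settings` are hypothetical names, and only the two environment variable names come from this guide.

```python
import os
from collections.abc import Mapping
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ConnectSettings:
    """Hypothetical settings holder; not part of the SDK."""

    base_url: str
    # repr=False keeps the key out of any output that prints this object.
    api_key: str = field(repr=False)


def load_settings(env: Mapping[str, str] = os.environ) -> ConnectSettings:
    """Read Connect settings from the environment, failing fast when unset."""
    missing = [
        name
        for name in ("NEXTPDF_BASE_URL", "NEXTPDF_API_KEY")
        if not env.get(name, "").strip()
    ]
    if missing:
        # Report variable names only; never echo secret values.
        raise RuntimeError(
            f"Missing required environment variables: {', '.join(missing)}"
        )
    return ConnectSettings(
        base_url=env["NEXTPDF_BASE_URL"].strip(),
        api_key=env["NEXTPDF_API_KEY"].strip(),
    )
```

Accepting any mapping rather than reading `os.environ` directly lets tests pass a plain dictionary, and a single error that names every missing variable saves a round of redeploys.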
Failure handling
| Failure | Exception | Recommended response |
|---|---|---|
| Untagged PDF, heuristics off | AstNoStructTreeError (HTTP 422) | Turn on heuristic mode on the endpoint or supply a tagged PDF. |
| Server-side build timeout | AstBuildTimeoutError (HTTP 504) | Reduce the page range and retry. |
| License tier required | NextPDFLicenseError (HTTP 402) | Upgrade the server license or fall back to a permitted feature. |
| Rate limit or quota | QuotaExceededError (HTTP 429) | Wait for retry_after seconds, then retry with backoff. |
| Other HTTP error | NextPDFAPIError | Inspect status_code and error_code; log and surface a defined error. |
| Any SDK error | NextPDFError | Final fallback; never let it escape as an unhandled exception. |
The endpoint reports failures with HTTP status semantics aligned with Request for Comments (RFC) 9110 and machine-readable error bodies aligned with RFC 9457. Each exception preserves the originating status_code. Map those failures to your own error responses rather than leaking transport detail to callers.
Safe defaults
| Concern | Default | When to override |
|---|---|---|
| API version | v1. | Change only when the endpoint supports a newer version. |
| Transport Layer Security (TLS) verification | Enabled; no insecure switch is exposed. | Never disable for production traffic. |
| Credentials | Read from the environment; never inlined. | Use a secret manager in production. |
| In-memory size limit | Reject PDFs over 100 MiB on the client path. | Lower for multi-tenant services; use the CLI for larger files. |
| Concurrency | Bounded by a semaphore in async batches. | Tune to the endpoint’s quota, not to the host’s core count. |
| Logging | Log filename, size, status, and duration. | Never log PDF bytes or the API key. |
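The logging row can be illustrated with a small wrapper that records metadata only. This is a sketch under assumptions: `logged_extraction` and the logger name are hypothetical, and `extract` stands for whatever callable your extraction layer exposes.

```python
import logging
import time
from collections.abc import Callable
from typing import TypeVar

T = TypeVar("T")
logger = logging.getLogger("app.pdf.extraction")  # Hypothetical logger name.


def logged_extraction(
    filename: str,
    pdf_bytes: bytes,
    extract: Callable[[bytes], T],
) -> T:
    """Run `extract` and log metadata only: name, size, status, duration."""
    started = time.perf_counter()
    status = "error"
    try:
        result = extract(pdf_bytes)
        status = "ok"
        return result
    finally:
        duration_ms = (time.perf_counter() - started) * 1000
        # Log the byte count, never the bytes themselves or any credential.
        logger.info(
            "pdf_extraction filename=%s size_bytes=%d status=%s duration_ms=%.1f",
            filename,
            len(pdf_bytes),
            status,
            duration_ms,
        )
```

Because the log call sits in `finally`, a failed extraction still produces exactly one record with `status=error`, and the original exception propagates unchanged.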
Testing checklist
- Construction tests assert that an empty `base_url` or `api_key` raises `ValueError`.
- Validation tests cover missing, empty, oversized, and non-PDF inputs.
- Extraction tests assert the returned model types and a `CitationAnchor` on each block.
- Failure-mode tests cover `AstNoStructTreeError`, `AstBuildTimeoutError`, `NextPDFLicenseError`, `QuotaExceededError`, and `NextPDFAPIError`.
- Async tests assert the client runs as an `async with` context manager and that concurrency stays within the semaphore bound.
- Lifecycle tests assert that `close()` releases the transport and is idempotent.
- Inject a fake backend with `AsyncNextPDF(backend=...)` so tests run without a live endpoint.
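The concurrency item in the checklist above can be tested without a live endpoint. The sketch below is an assumption-heavy illustration: `RecordingBackend` is a hypothetical fake that does not implement the real `PdfBackend` protocol (this guide does not list its methods), and `bounded_gather` is an illustrative helper that mirrors the semaphore pattern from the async section.

```python
import asyncio


class RecordingBackend:
    """Hypothetical fake; it does NOT implement the real PdfBackend protocol."""

    def __init__(self) -> None:
        self.in_flight = 0
        self.peak = 0

    async def extract(self, pdf_bytes: bytes) -> int:
        self.in_flight += 1
        self.peak = max(self.peak, self.in_flight)
        try:
            await asyncio.sleep(0)  # Yield so other tasks get a chance to start.
            return len(pdf_bytes)
        finally:
            self.in_flight -= 1


async def bounded_gather(
    backend: RecordingBackend,
    pdfs: list[bytes],
    max_concurrency: int,
) -> list[int]:
    """Run one extraction per PDF with at most `max_concurrency` in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def one(pdf_bytes: bytes) -> int:
        async with semaphore:
            return await backend.extract(pdf_bytes)

    return list(await asyncio.gather(*(one(pdf) for pdf in pdfs)))


def test_concurrency_stays_within_bound() -> None:
    backend = RecordingBackend()
    pdfs = [b"%PDF-" + bytes(i) for i in range(10)]  # Lengths 5 through 14.
    sizes = asyncio.run(bounded_gather(backend, pdfs, max_concurrency=3))
    assert sizes == [5 + i for i in range(10)]  # gather preserves input order.
    assert 1 <= backend.peak <= 3
    assert backend.in_flight == 0
```

The fake counts calls in flight rather than measuring time, so the test stays deterministic and fast. Against the real client you would inject a backend that satisfies the actual protocol and keep the same assertions.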
See also
- Python API reference: every client method, model, and exception.
- Python CLI: streaming extraction for large documents.
- Python MCP server: extraction tools for AI agents.
- Python SDK overview: backend choices and limitations.