Python CLI
Use the nextpdf command to extract content from Portable Document Format (PDF) files in the terminal. Point it at a NextPDF Connect endpoint, pass a PDF, and receive structured output on standard output (stdout) or in a file: cited text, tables, the full semantic Abstract Syntax Tree (AST), or a metadata summary.
Command structure
The nextpdf command is a Click command group. Connection and session options (--base-url, --api-key, --log-level, --output/-o, and --strict) belong to the group, so place them before the subcommand. Put the subcommand and its options, such as --format or --page, after:

```text
nextpdf [GROUP OPTIONS] COMMAND [SUBCOMMAND] PDF_PATH [COMMAND OPTIONS]
```

If you put a group option after the subcommand, the command fails. For example, nextpdf info document.pdf --base-url ... reports Error: No such option: --base-url and exits with status 2, because Click is already parsing the info subcommand when it sees --base-url, and info does not define that option.
To avoid the ordering trap, supply credentials through environment variables (see Configure once per shell). The examples below show the explicit-flag form first so the correct order is clear.
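When a script assembles the command, the ordering rule above can be encoded once so it cannot be broken by accident. The following is a minimal sketch; build_argv is a hypothetical helper written for this page, not part of the SDK, and it only arranges the pieces in the order the synopsis describes.

```python
"""Assemble a nextpdf argument list with group options in the right place."""

from collections.abc import Sequence


def build_argv(
    command: Sequence[str],
    pdf_path: str,
    group_options: Sequence[str] = (),
    command_options: Sequence[str] = (),
) -> list[str]:
    """Return argv as: nextpdf, group options, command, path, command options.

    Hypothetical helper: it performs no validation of the option names.
    """
    return ["nextpdf", *group_options, *command, pdf_path, *command_options]


argv = build_argv(
    command=["extract", "tables"],
    pdf_path="invoice.pdf",
    group_options=["--output", "invoice.csv"],
    command_options=["--format", "csv"],
)
print(argv)
```

Passing the resulting list to subprocess.run without a shell keeps each argument intact, so paths with spaces need no extra quoting.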
Quick reference
Extract text as JavaScript Object Notation (JSON):

```bash
nextpdf --base-url http://localhost:8080 --api-key "$NEXTPDF_API_KEY" extract text document.pdf
```

Extract tables as comma-separated values (CSV):

```bash
nextpdf --base-url http://localhost:8080 --api-key "$NEXTPDF_API_KEY" extract tables invoice.pdf --format csv
```

Inspect document metadata and structure:

```bash
nextpdf --base-url http://localhost:8080 --api-key "$NEXTPDF_API_KEY" info document.pdf
```

Get the full semantic AST:

```bash
nextpdf --base-url http://localhost:8080 --api-key "$NEXTPDF_API_KEY" ast document.pdf
```

Print the installed software development kit (SDK) version without contacting a server:

```bash
nextpdf version
```

The version command is the one command that needs neither --base-url nor --api-key. Every other command contacts the server and requires both.
Each example reads the application programming interface (API) key from the NEXTPDF_API_KEY environment variable instead of embedding it on the command line. Treat the key as a secret. A literal key on the command line is visible in your shell history and in the process list (ps) to other users on the host.
Commands and options
Group options
Place these before the subcommand. Each connection option also reads from an environment variable, so you can omit the flag when the variable is set.
| Option | Environment variable | Default | Purpose |
|---|---|---|---|
| --base-url | NEXTPDF_BASE_URL | (required) | NextPDF Connect server uniform resource locator (URL). |
| --api-key | NEXTPDF_API_KEY | (required) | API key for bearer authentication. |
| --log-level | none | warning | Logging verbosity: debug, info, warning, or error. Logs go to standard error (stderr). |
| --output, -o | none | stdout | Write command output to a file instead of stdout. |
| --strict | none | off | Reserved for future use. The flag parses today but does not change behavior. |
| --help, -h | none | none | Show help and exit. |
The --base-url and --api-key values are required for every command except version. If either value is missing — no flag and no environment variable — the command prints an error and exits with status 1.
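A wrapper script that relies on the environment variables can check them before it starts a batch, rather than discovering the problem on the first exit status 1. This is a sketch under one assumption: it treats an empty value the same as an unset one, which the page does not confirm for the CLI itself. missing_connection_variables is a hypothetical name.

```python
"""Preflight check for the two connection variables nextpdf reads."""

import os
from collections.abc import Mapping

REQUIRED_VARIABLES = ("NEXTPDF_BASE_URL", "NEXTPDF_API_KEY")


def missing_connection_variables(env: Mapping[str, str]) -> list[str]:
    """Return the required variable names that are unset or empty in env.

    Assumption: an empty string counts as missing.
    """
    return [name for name in REQUIRED_VARIABLES if not env.get(name)]


missing = missing_connection_variables(os.environ)
if missing:
    print("set before running nextpdf: " + ", ".join(missing))
```

The check covers only the environment. A caller that passes --base-url and --api-key as flags satisfies the CLI even when both variables are unset.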
nextpdf extract text
Extract cited text blocks. Each block includes a citation anchor with a node identifier, page index, bounding box, and confidence score.

```text
nextpdf [GROUP OPTIONS] extract text PDF_PATH [--format FORMAT] [--page N] [--headings-only]
```

| Option | Values | Default | Purpose |
|---|---|---|---|
| --format | json, markdown, plain | json | Output format. |
| --page | integer | all pages | Extract only this 0-based page index. |
| --headings-only | flag | off | Extract only heading nodes. |
PDF_PATH is a file path, or - to read PDF bytes from standard input (stdin).
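Because --page takes a 0-based index, a page number read off a PDF viewer (which usually counts from 1) needs to be shifted by one. The sketch below shows that conversion; page_option is a hypothetical helper, not an SDK function.

```python
"""Convert a viewer-style page number into the --page option."""


def page_option(page_number: int) -> list[str]:
    """Return the --page option for a 1-based page number.

    The CLI expects a 0-based index, so page 1 becomes index 0.
    """
    if page_number < 1:
        raise ValueError("page numbers start at 1")
    return ["--page", str(page_number - 1)]


print(page_option(1))
```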
nextpdf extract tables
Extract every table with citation anchors and cell-level structure.

```text
nextpdf [GROUP OPTIONS] extract tables PDF_PATH [--format FORMAT] [--page-start N] [--page-end N]
```

| Option | Values | Default | Purpose |
|---|---|---|---|
| --format | json, csv | json | Output format. |
| --page-start | integer | first page | Start page index (0-based). |
| --page-end | integer | last page | End page index (0-based). |
PDF_PATH is a file path, or - to read from stdin.
nextpdf ast
Return the full semantic AST as JSON: a hierarchical tree of nodes, including headings, paragraphs, tables, lists, and figures, with bounding boxes and text content.

```text
nextpdf [GROUP OPTIONS] ast PDF_PATH [--page-start N] [--page-end N] [--token-budget N]
```

| Option | Values | Default | Purpose |
|---|---|---|---|
| --page-start | integer | first page | Start page index (0-based). |
| --page-end | integer | last page | End page index (0-based). |
| --token-budget | integer | unbounded | Approximate token limit for the returned AST. |
PDF_PATH is a file path, or - to read from stdin. The ast command produces one document tree; it does not compare two PDFs. For structural diffing, see Recipe: diff two PDF revisions.
nextpdf info
Print a compact JSON summary of one document: schema version, source hash, page count, estimated token count, root node type, and number of root children.

```text
nextpdf [GROUP OPTIONS] info PDF_PATH
```

PDF_PATH is a file path, or - to read from stdin.
nextpdf version
Print the installed SDK version, such as nextpdf 1.1.0, and exit. This command contacts no server and needs no credentials.

```bash
nextpdf version
```

Configure once per shell
Set the connection options once as environment variables, and omit the repeated flags. This form also avoids the option-ordering trap entirely, because the credentials never appear on the command line.

```bash
export NEXTPDF_BASE_URL=http://localhost:8080
export NEXTPDF_API_KEY=your-key
nextpdf extract text document.pdf
```

On Windows PowerShell:

```powershell
$env:NEXTPDF_BASE_URL = "http://localhost:8080"
$env:NEXTPDF_API_KEY = "your-key"
nextpdf extract text document.pdf
```

Prefer loading the key from a secret store or a .env file that you keep out of version control. Do not paste a production key into a shared terminal session or into a script that you commit.
Output formats
Select the output format per command with --format. The text and table commands support more than one format; ast and info always emit JSON.
| Command | Formats | Default |
|---|---|---|
| extract text | json, markdown, plain | json |
| extract tables | json, csv | json |
| ast | json | json |
| info | json | json |
Choose JSON when a downstream program needs page indexes, confidence scores, or node identifiers. Choose CSV when a spreadsheet or tabular pipeline consumes the tables. Choose plain or markdown text when a person or text-only tool reads the result.
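A script that accepts a format from its own caller can validate it against the table above before it invokes the CLI. This is an illustrative sketch: SUPPORTED_FORMATS and format_options are hypothetical names, and the mapping simply restates the table. It emits no --format option for ast and info, since their synopses list none.

```python
"""Validate an output format against the formats each command supports."""

SUPPORTED_FORMATS: dict[str, tuple[str, ...]] = {
    "extract text": ("json", "markdown", "plain"),
    "extract tables": ("json", "csv"),
    "ast": ("json",),
    "info": ("json",),
}


def format_options(command: str, output_format: str) -> list[str]:
    """Return the --format options for command, or [] when none apply."""
    allowed = SUPPORTED_FORMATS[command]
    if output_format not in allowed:
        raise ValueError(
            f"{command} supports {', '.join(allowed)}; got {output_format}"
        )
    # ast and info always emit JSON and list no --format option.
    if len(allowed) == 1:
        return []
    return ["--format", output_format]


print(format_options("extract tables", "csv"))
```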
Parsing JSON output
The text command emits a JSON array of cited blocks. Each block has text, a citation object (node_id, page_index, bbox, confidence), and an optional node_type. Send the output to a file with --output, or redirect stdout, then parse it.
This shell example uses jq to keep only headings on page 0:
```bash
nextpdf --output blocks.json extract text report.pdf --format json
jq '[.[] | select(.citation.page_index == 0 and .node_type == "heading") | .text]' blocks.json
```

The same data parses cleanly in Python. The command-line interface (CLI) writes a JSON array, so load it with the standard library and read the fields by name:

```python
"""Parse cited text blocks emitted by `nextpdf extract text --format json`."""

import json
from pathlib import Path
from typing import Any


def load_headings(blocks_path: Path) -> list[str]:
    """Return the text of every heading block on page 0.

    Args:
        blocks_path: Path to the JSON file written by `nextpdf extract text`.

    Returns:
        The text of each heading-type block whose citation is on page 0.
    """
    raw = blocks_path.read_text(encoding="utf-8")
    # Any, not object: the nested citation dictionary is indexed below.
    blocks: list[dict[str, Any]] = json.loads(raw)
    headings: list[str] = []
    for block in blocks:
        citation = block["citation"]
        # node_type is optional, so read it with .get().
        if block.get("node_type") == "heading" and citation["page_index"] == 0:
            headings.append(str(block["text"]))
    return headings


if __name__ == "__main__":
    for heading in load_headings(Path("blocks.json")):
        print(heading)
```

When you need validated, typed models instead of raw dictionaries, call the SDK directly instead of parsing CLI output. See the Python overview for the NextPDF client and its CitedTextBlock return type.
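The citation fields also support simple post-filtering. As a sketch, the function below groups block text by page and drops blocks under a confidence threshold. The sample records are invented for illustration: the node_id values, the four-number bbox shape, and a confidence scale of 0 to 1 are all assumptions, and text_by_page is a hypothetical helper.

```python
"""Group cited text by page, skipping low-confidence blocks."""

from collections import defaultdict
from typing import Any

# Invented sample data; the bbox shape and confidence scale are assumptions.
SAMPLE_BLOCKS: list[dict[str, Any]] = [
    {"text": "Overview", "node_type": "heading",
     "citation": {"node_id": "n1", "page_index": 0, "bbox": [0, 0, 1, 1], "confidence": 0.98}},
    {"text": "Smudged footer", "node_type": "paragraph",
     "citation": {"node_id": "n2", "page_index": 0, "bbox": [0, 0, 1, 1], "confidence": 0.42}},
    {"text": "Totals are listed below.",
     "citation": {"node_id": "n3", "page_index": 1, "bbox": [0, 0, 1, 1], "confidence": 0.91}},
]


def text_by_page(
    blocks: list[dict[str, Any]], min_confidence: float = 0.0
) -> dict[int, list[str]]:
    """Map each page index to the text of its blocks, in input order."""
    pages: dict[int, list[str]] = defaultdict(list)
    for block in blocks:
        citation = block["citation"]
        if citation["confidence"] >= min_confidence:
            pages[citation["page_index"]].append(block["text"])
    return dict(pages)


print(text_by_page(SAMPLE_BLOCKS, min_confidence=0.5))
```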
Parsing CSV output
With --format csv, the table command writes one CSV block per table. A comment row, # Table N (pM), precedes each table and names its 1-based table number and 0-based page index. A blank line separates consecutive tables. The CLI quotes and escapes cell values with Python’s csv module, so values that contain commas, quotes, or newlines round-trip safely.
```bash
nextpdf --output tables.csv extract tables statement.pdf --format csv
```

Because the file contains multiple CSV blocks, split on the comment rows before you parse each block as a standalone table:

```python
"""Split multi-table CSV output from `nextpdf extract tables --format csv`."""

import csv
import io
from pathlib import Path


def read_tables(csv_path: Path) -> list[list[list[str]]]:
    """Parse the multi-block CSV file into a list of tables.

    Each table is a list of rows; each row is a list of cell strings.
    The leading `# Table N (pM)` comment row is dropped from every table.

    Args:
        csv_path: Path to the file written by `nextpdf extract tables`.

    Returns:
        One parsed table per `# Table` block in the file.
    """
    text = csv_path.read_text(encoding="utf-8")
    tables: list[list[list[str]]] = []
    current: list[str] = []
    for line in text.splitlines(keepends=True):
        # A new comment row closes the block collected so far.
        if line.startswith("# Table ") and current:
            tables.append(_parse_block(current))
            current = []
        current.append(line)
    if current:
        tables.append(_parse_block(current))
    return tables


def _parse_block(lines: list[str]) -> list[list[str]]:
    """Parse one CSV block, discarding its leading comment row."""
    reader = csv.reader(io.StringIO("".join(lines)))
    rows = [row for row in reader if row]
    return rows[1:] if rows and rows[0] and rows[0][0].startswith("# Table ") else rows


if __name__ == "__main__":
    for index, table in enumerate(read_tables(Path("tables.csv")), start=1):
        print(f"table {index}: {len(table)} rows")
```

Exit codes and error detection
The CLI uses three exit codes. Check $? in shell scripts, or $LASTEXITCODE in PowerShell, to branch on success or failure. Read diagnostic messages from stderr, which stays separate from the data on stdout.
| Exit code | Meaning | Examples |
|---|---|---|
| 0 | Success. | A command completed; version printed. |
| 1 | Runtime error. The CLI prints Error: <message> to stderr. | Input file not found or not a regular file, empty stdin, missing or invalid --base-url/--api-key, any server-side error (license required, quota exceeded, untagged PDF, build timeout, or other API failure). |
| 2 | Usage error, reported by Click. | Unknown command or option (including a group option placed after the subcommand), a missing required argument such as PDF_PATH. |
Every server-side failure returns exit code 1 with a human-readable message on stderr. The SDK raises a typed exception that reflects the Hypertext Transfer Protocol (HTTP) status of the failure: NextPDFLicenseError (HTTP 402), QuotaExceededError (HTTP 429), AstNoStructTreeError (HTTP 422, untagged PDF), AstBuildTimeoutError (HTTP 504), or the base NextPDFAPIError. The CLI catches all of them under their shared NextPDFError base, prints the message, and exits 1. The CLI does not expose distinct exit codes per failure type. To distinguish, for example, a quota error from a license error in a script, inspect the message text on stderr or call the SDK directly (see the Python overview for the exception classes).
Use this scripting pattern to separate data from diagnostics:
```bash
#!/usr/bin/env bash
set -euo pipefail

# Credentials come from the environment, not the command line.
: "${NEXTPDF_BASE_URL:?set NEXTPDF_BASE_URL}"
: "${NEXTPDF_API_KEY:?set NEXTPDF_API_KEY}"

if nextpdf --output contract.ast.json ast contract.pdf; then
  echo "AST written to contract.ast.json"
else
  status=$?
  echo "nextpdf failed with exit code ${status}" >&2
  exit "${status}"
fi
```

With --output, the CLI writes data to the named file and prints only the confirmation line Written to <path> to stderr, so stdout stays empty. Without --output, the data goes to stdout, and you can redirect it.
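The same three-way split can be applied from Python when a script captures the return code and stderr of a CLI run. The sketch below maps the documented exit codes to labels and pulls the text after the Error: prefix. CliOutcome and classify are hypothetical names, and the message strings in the example are invented rather than actual CLI output.

```python
"""Classify a finished nextpdf run from its exit code and stderr text."""

from dataclasses import dataclass

EXIT_MEANINGS = {0: "success", 1: "runtime error", 2: "usage error"}
ERROR_PREFIX = "Error: "


@dataclass(frozen=True)
class CliOutcome:
    ok: bool
    kind: str
    message: str


def classify(returncode: int, stderr: str) -> CliOutcome:
    """Label the exit code and keep the last `Error: <message>` line, if any."""
    message = ""
    for line in stderr.splitlines():
        if line.startswith(ERROR_PREFIX):
            message = line[len(ERROR_PREFIX):]
    kind = EXIT_MEANINGS.get(returncode, "unknown")
    return CliOutcome(ok=returncode == 0, kind=kind, message=message)


# Invented stderr text, used only to exercise the function.
print(classify(1, "Error: quota exceeded\n"))
```

In practice the two inputs would come from subprocess.run called with capture_output=True and text=True, using the returncode and stderr attributes of the result.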
Recipes
Every recipe below uses only verified commands and flags. In each case, credentials come from the environment.
Recipe: extract invoice tables to CSV
Turn a folder of invoices into one CSV file per document for a spreadsheet or accounting pipeline:

```bash
#!/usr/bin/env bash
set -euo pipefail

: "${NEXTPDF_BASE_URL:?set NEXTPDF_BASE_URL}"
: "${NEXTPDF_API_KEY:?set NEXTPDF_API_KEY}"

mkdir -p out
for pdf in invoices/*.pdf; do
  name="$(basename "${pdf}" .pdf)"
  nextpdf --output "out/${name}.csv" extract tables "${pdf}" --format csv
done
```

Each out/<name>.csv contains one CSV block per detected table, with a # Table N (pM) header before each block. Parse the blocks with the CSV reader shown above.
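An accounting pipeline often needs to know which page each table came from, and that information sits in the comment row. The sketch below reads the table number and page index out of one header line. It assumes the row appears literally as # Table N (pM) with nothing else on the line; TableHeader and parse_table_header are hypothetical names.

```python
"""Read the table number and page index from a `# Table N (pM)` row."""

import re
from typing import NamedTuple, Optional

# Assumption: the comment row is unquoted and carries no extra fields.
HEADER_PATTERN = re.compile(r"^# Table (\d+) \(p(\d+)\)\s*$")


class TableHeader(NamedTuple):
    table_number: int  # 1-based, as documented
    page_index: int  # 0-based, as documented


def parse_table_header(line: str) -> Optional[TableHeader]:
    """Return the parsed header, or None when line is not a comment row."""
    match = HEADER_PATTERN.match(line)
    if match is None:
        return None
    return TableHeader(int(match.group(1)), int(match.group(2)))


print(parse_table_header("# Table 2 (p3)\n"))
```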
Recipe: build a document outline
Combine --headings-only with the markdown format to produce a quick outline you can read or paste into notes:

```bash
nextpdf --output outline.md extract text whitepaper.pdf --headings-only --format markdown
```

Recipe: diff two PDF revisions
The CLI ast command returns the tree for one document; it has no diff subcommand. Structural diffing lives in the SDK as client.ast.get_ast_diff(...) and in the Model Context Protocol (MCP) server as the nextpdf_diff tool. Run the diff through the SDK:

```python
"""Compare two PDF revisions structurally with the NextPDF SDK.

The API key is read from the environment, never hard-coded.
"""

import os
from pathlib import Path

from nextpdf import NextPDF


def diff_revisions(original: Path, modified: Path) -> None:
    """Print a structural diff summary between two PDF revisions.

    Args:
        original: Path to the earlier PDF revision.
        modified: Path to the later PDF revision.
    """
    base_url = os.environ["NEXTPDF_BASE_URL"]
    api_key = os.environ["NEXTPDF_API_KEY"]

    client = NextPDF(base_url=base_url, api_key=api_key)
    result = client.ast.get_ast_diff(
        original.read_bytes(),
        modified.read_bytes(),
    )

    summary = result.summary
    print(f"added: {summary.added_node_count}")
    print(f"removed: {summary.removed_node_count}")
    print(f"changed: {summary.changed_node_count}")
    for entry in result.diff:
        preview = entry.text_preview or ""
        print(f" {entry.type:<8} {entry.node_type:<12} p{entry.page_index} {preview}")


if __name__ == "__main__":
    diff_revisions(Path("contract-v1.pdf"), Path("contract-v2.pdf"))
```

To run the same diff from an artificial intelligence (AI) agent instead of a script, register the MCP server and call the nextpdf_diff tool. See the Python MCP server page.
Recipe: stream a PDF in from another tool
Read PDF bytes from stdin with - to chain nextpdf after a tool that emits a PDF on its own stdout:

```bash
curl --silent https://example.com/report.pdf | nextpdf info -
```

The - argument tells the command to read the document from stdin. If no bytes arrive, the command reports an error and exits 1.
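The stdin form also works when the PDF bytes already live in a Python process, for example after a download. A minimal sketch, assuming the connection variables are set in the environment: info_from_bytes is a hypothetical helper, and its runner parameter exists only so the call can be exercised without the CLI installed.

```python
"""Send in-memory PDF bytes to `nextpdf info -` over stdin."""

import subprocess
from collections.abc import Callable
from typing import Any


def info_from_bytes(
    pdf_bytes: bytes, runner: Callable[..., Any] = subprocess.run
) -> Any:
    """Run `nextpdf info -` with pdf_bytes on stdin and return the result.

    Empty input is rejected up front, because the CLI exits 1 on empty stdin.
    """
    if not pdf_bytes:
        raise ValueError("no PDF bytes to send")
    return runner(
        ["nextpdf", "info", "-"],
        input=pdf_bytes,
        capture_output=True,
        check=False,
    )
```

With the default runner, the returned object is a subprocess.CompletedProcess whose stdout holds the JSON summary as bytes and whose returncode follows the exit codes listed on this page.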
Performance notes
The CLI builds each response in memory and writes it once. Redirecting or piping output is straightforward, but output is not produced incrementally. A large AST or table set is fully buffered before the first byte reaches stdout or the --output file. Plan memory and latency for whole-document responses, not for a stream.
Each nextpdf invocation creates a fresh client and HTTP connection, so a loop over many files opens and closes a connection per file. The connection cost is usually small next to server-side extraction time, but it becomes real overhead at scale.
- Reuse one endpoint. Point every invocation at the same NextPDF Connect deployment so the server can reuse warmed caches and connection pools. Avoid spreading a batch across endpoints unless you are load-balancing on purpose.
- Bound the work per call. Use --page, --page-start/--page-end, or --token-budget to process only the pages you need. Smaller page ranges reduce both server time and response size; --token-budget caps the AST your agent has to read.
- Batch in one process for large jobs. For high-volume batches, prefer the Python SDK over repeated CLI calls. A single long-lived NextPDF or AsyncNextPDF client reuses one pooled HTTP connection across every document, which removes the per-process startup and connection setup that a CLI loop pays each time. The Python overview shows the client lifecycle, and AsyncNextPDF supports concurrent extraction across many PDFs.
- Keep logs out of the data path. Leave --log-level at its default for batch runs. Diagnostic logs go to stderr and do not corrupt stdout data, but raising the level to debug adds noise and minor overhead.
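One way to bound the work per call is to split a long document into fixed-size page windows and pass each window as --page-start and --page-end. The sketch below computes those windows from a page count, such as the one in the info summary. It assumes --page-end is inclusive, which this page does not state, so confirm that against the CLI before relying on it. page_ranges and range_options are hypothetical helpers.

```python
"""Split a document into 0-based page windows for bounded CLI calls."""


def page_ranges(page_count: int, pages_per_call: int) -> list[tuple[int, int]]:
    """Return (start, end) index pairs that cover page_count pages.

    Assumption: the end index is inclusive.
    """
    if page_count < 0 or pages_per_call < 1:
        raise ValueError("page_count must be >= 0 and pages_per_call >= 1")
    return [
        (start, min(start + pages_per_call, page_count) - 1)
        for start in range(0, page_count, pages_per_call)
    ]


def range_options(start: int, end: int) -> list[str]:
    """Return the command options for one page window."""
    return ["--page-start", str(start), "--page-end", str(end)]


for start, end in page_ranges(10, 4):
    print(range_options(start, end))
```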