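Python API Reference

Overview

The NextPDF Python software development kit (SDK) exposes two clients, a single ast method namespace shared by both, a set of typed Pydantic models used for every response, a nextpdf command-line interface (CLI), and a six-class exception hierarchy. This page documents each of those symbols.

Import the public interface from the top-level package: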

from nextpdf import (
    AsyncNextPDF,
    NextPDF,
    AstBuildTimeoutError,
    AstNoStructTreeError,
    NextPDFAPIError,
    NextPDFError,
    NextPDFLicenseError,
    QuotaExceededError,
)

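Every extraction method takes the raw PDF bytes as its first positional argument, followed by keyword-only options, and returns typed Pydantic models. The synchronous NextPDF.ast.* methods and the asynchronous AsyncNextPDF.ast.* methods have identical signatures. The asynchronous methods are coroutines that must be awaited.

Clients

The synchronous NextPDF client wraps the asynchronous client and runs each coroutine to completion. The asynchronous AsyncNextPDF client is also an asynchronous context manager. Prefer the context-manager form so that the underlying transport is closed deterministically.

Constructors

Symbol	Parameters	Default behavior	Returns	Raises or fails on	Notes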
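`NextPDF(*, base_url, api_key, api_version='v1')`	Keyword-only base URL, API key, and optional API version.	Constructs a synchronous client backed by the remote service.	`NextPDF`	`ValueError` when `base_url` or `api_key` is empty.	Its methods run the asynchronous work synchronously and are safe to call in notebooks and inside a running event loop.
`AsyncNextPDF(*, base_url='', api_key='', api_version='v1', backend=None)`	Keyword-only base URL, API key, optional API version, and optional injected backend.	Constructs an asynchronous client backed by the remote service when no backend is injected.	`AsyncNextPDF`	`ValueError` when `base_url` or `api_key` is empty and no `backend` is provided.	Pass `backend=` to inject a custom or local backend in tests.
`AsyncNextPDF.__aenter__()`	None.	Enters the asynchronous context and returns the client.	`AsyncNextPDF`	Nothing expected.	Use `async with AsyncNextPDF(...) as client:`.
`AsyncNextPDF.__aexit__(*_)`	Exception arguments (ignored).	Calls `close()` when the context exits.	`None`	Nothing expected.	Releases transport resources even if the body raises.
`AsyncNextPDF.close()`	None.	Closes the owned remote backend and releases the connection pool.	`None`	Nothing expected.	Idempotent; an injected backend is not closed.

Do not leave API keys in source code. Read base_url and api_key from environment variables (NEXTPDF_BASE_URL, NEXTPDF_API_KEY) or a secrets manager.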

import os

from nextpdf import AsyncNextPDF


async def extract(pdf_bytes: bytes) -> int:
    """Return the page count of a PDF using the async client as a context manager."""
    base_url = os.environ["NEXTPDF_BASE_URL"]
    api_key = os.environ["NEXTPDF_API_KEY"]

    async with AsyncNextPDF(base_url=base_url, api_key=api_key) as client:
        document = await client.ast.get_document_ast(pdf_bytes)
        return document.page_count

Synchronous AST methods: `NextPDF.ast.*`

Symbol	Parameters	Default behavior	Returns	Raises or fails on	Notes
`NextPDF.ast.get_document_ast()`	`pdf_data: bytes`; keywords `page_range_start`, `page_range_end`, `token_budget`.	Builds the full semantic AST for every page.	`AstDocument`	`AstNoStructTreeError`, `AstBuildTimeoutError`, `NextPDFLicenseError`, `QuotaExceededError`.	If the build times out, narrow the page range.
`NextPDF.ast.extract_cited_text()`	`pdf_data: bytes`; keywords `page_index`, `headings_only`.	Extracts every text block together with its citation anchor.	`list[CitedTextBlock]`	`NextPDFAPIError`, `QuotaExceededError`.	Set `headings_only=True` to get heading nodes only.
`NextPDF.ast.extract_cited_tables()`	`pdf_data: bytes`; keyword `page_range` (a `dict` with `start` and `end`).	Extracts every table with cell-level citation anchors.	`ExtractCitedTablesResponse`	`NextPDFAPIError`, `QuotaExceededError`.	Omit `page_range` to scan the whole document.
`NextPDF.ast.get_ast_node()`	`pdf_data: bytes`, `node_id: str`.	Fetches a single node by identifier.	`GetAstNodeResponse`	`NextPDFError` when the node is not found.	`node_id` has the format `ast:{hash6}:{pageIdx}:{seq}`.
`NextPDF.ast.search_ast_nodes()`	`pdf_data: bytes`; keywords `node_type`, `page_index`, `text_query`, `max_results=100`.	Returns shallow nodes that match the filters.	`SearchAstNodesResponse`	`NextPDFAPIError`.	`text_query` performs a case-insensitive substring match.
`NextPDF.ast.get_ast_diff()`	`original_pdf_data: bytes`, `modified_pdf_data: bytes`.	Compares two documents at the structural level.	`GetAstDiffResponse`	`NextPDFAPIError`, `QuotaExceededError`.	Reports added, removed, and changed nodes.
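
The `node_id` format noted in the table, `ast:{hash6}:{pageIdx}:{seq}`, can be split on the client side when you need the page index without fetching the node. The following is a minimal sketch, not part of the SDK: `ParsedNodeId` and `parse_node_id` are hypothetical names, and the sketch assumes that `pageIdx` and `seq` are non-negative decimal integers, which this reference does not state explicitly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ParsedNodeId:
    """Hypothetical helper type; it is not provided by the nextpdf package."""

    source_hash: str
    page_index: int
    sequence: int


def parse_node_id(node_id: str) -> ParsedNodeId:
    """Split an ``ast:{hash6}:{pageIdx}:{seq}`` identifier into its parts."""
    parts = node_id.split(":")
    if len(parts) != 4 or parts[0] != "ast":
        raise ValueError(f"not an AST node id: {node_id!r}")
    _, source_hash, page_index, sequence = parts
    # Assumption: pageIdx and seq are non-negative decimal integers.
    if not (page_index.isdecimal() and sequence.isdecimal()):
        raise ValueError(f"non-numeric page index or sequence in {node_id!r}")
    return ParsedNodeId(source_hash, int(page_index), int(sequence))
```

For example, `parse_node_id("ast:a1b2c3:4:17").page_index` evaluates to `4`.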

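Asynchronous AST methods: `AsyncNextPDF.ast.*`

Every asynchronous method is a coroutine whose parameters, defaults, return type, and failure modes match the corresponding synchronous method above. Call it with await inside an asyncio runtime.

Symbol	Parameters	Default behavior	Returns	Raises or fails on	Notes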
`AsyncNextPDF.ast.get_document_ast()`	`pdf_data: bytes`; keywords `page_range_start`, `page_range_end`, `token_budget`.	Builds the full semantic AST for every page.	`AstDocument`	`AstNoStructTreeError`, `AstBuildTimeoutError`, `NextPDFLicenseError`, `QuotaExceededError`.	`await` the call to get the result.
`AsyncNextPDF.ast.extract_cited_text()`	`pdf_data: bytes`; keywords `page_index`, `headings_only`.	Extracts every text block together with its citation anchor.	`list[CitedTextBlock]`	`NextPDFAPIError`, `QuotaExceededError`.	`await` the call to get the result.
`AsyncNextPDF.ast.extract_cited_tables()`	`pdf_data: bytes`; keyword `page_range`.	Extracts every table with cell-level citation anchors.	`ExtractCitedTablesResponse`	`NextPDFAPIError`, `QuotaExceededError`.	`await` the call to get the result.
`AsyncNextPDF.ast.get_ast_node()`	`pdf_data: bytes`, `node_id: str`.	Fetches a single node by identifier.	`GetAstNodeResponse`	`NextPDFError` when the node is not found.	`await` the call to get the result.
`AsyncNextPDF.ast.search_ast_nodes()`	`pdf_data: bytes`; keywords `node_type`, `page_index`, `text_query`, `max_results=100`.	Returns shallow nodes that match the filters.	`SearchAstNodesResponse`	`NextPDFAPIError`.	`await` the call to get the result.
`AsyncNextPDF.ast.get_ast_diff()`	`original_pdf_data: bytes`, `modified_pdf_data: bytes`.	Compares two documents at the structural level.	`GetAstDiffResponse`	`NextPDFAPIError`, `QuotaExceededError`.	`await` the call to get the result.
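
As an illustration of the search method and its response fields, here is a hedged sketch that looks up headings and reports whether the result list was cut off. `find_headings` is a hypothetical helper, the client is typed as `Any` so the sketch runs without the package installed, and passing the plain string `"heading"` for `node_type` is an assumption based on `NodeType` being a string enum.

```python
from typing import Any


async def find_headings(
    client: Any, pdf_bytes: bytes, query: str
) -> tuple[list[tuple[int, str]], bool]:
    """Return (page_index, text) pairs for matching headings and the truncated flag."""
    response = await client.ast.search_ast_nodes(
        pdf_bytes,
        node_type="heading",  # assumption: the enum's string value is accepted
        text_query=query,  # documented as a case-insensitive substring match
        max_results=100,
    )
    hits = [(node.page_index, node.text_content) for node in response.nodes]
    # `truncated` signals that more matches exist than were returned.
    return hits, response.truncated
```

If `truncated` comes back true, narrowing the search with `page_index` is one way to bring the result set under `max_results`.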

Use the asynchronous client as a context manager to run two extractions concurrently:

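import asyncio
import os

from nextpdf import AsyncNextPDF


async def extract_pair(first: bytes, second: bytes) -> None:
    """Extract two PDFs concurrently with one shared async client."""
    # Read connection settings from the environment rather than hardcoding them.
    base_url = os.environ["NEXTPDF_BASE_URL"]
    api_key = os.environ["NEXTPDF_API_KEY"]

    async with AsyncNextPDF(base_url=base_url, api_key=api_key) as client: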
        text_blocks, tables = await asyncio.gather(
            client.ast.extract_cited_text(first),
            client.ast.extract_cited_tables(second),
        )
        print(f"text blocks: {len(text_blocks)}; tables: {tables.table_count}")
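
When the batch is larger than two documents, an unbounded `asyncio.gather` can send every request at once and run into the quota. The sketch below caps the number of in-flight calls with `asyncio.Semaphore`. It is an illustration under assumptions rather than an SDK feature: `extract_many` and its `limit` parameter are hypothetical, and the client is typed as `Any` so the sketch runs without the package installed.

```python
import asyncio
from typing import Any


async def extract_many(client: Any, pdfs: list[bytes], *, limit: int = 4) -> list[int]:
    """Return the number of cited text blocks per PDF, in input order."""
    semaphore = asyncio.Semaphore(limit)

    async def extract_one(pdf_bytes: bytes) -> int:
        # At most `limit` extractions hold the semaphore at any time.
        async with semaphore:
            blocks = await client.ast.extract_cited_text(pdf_bytes)
            return len(blocks)

    # gather preserves the order of its arguments in the result list.
    return await asyncio.gather(*(extract_one(pdf) for pdf in pdfs))
```

A suitable `limit` depends on the quota attached to your API key, which this page does not specify.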

Models

Every response is a Pydantic model. You can import the model classes from nextpdf or nextpdf.models.ast.

Symbol	Kind	Main fields	Notes
`AstDocument`	Document root	`schema_version`, `source_hash`, `page_count`, `root: AstNode`, `estimated_tokens` (property).	Returned by `get_document_ast()`. Accepts the aliases `schemaVersion`, `sourceHash`, and `pageCount`.
`AstNode`	Tree node	`id`, `type: NodeType`, `page_index`, `bbox`, `text_content`, `attributes`, `children: list[AstNode]`, `pdf_object_number`, `mcid`.	Recursive node that carries the document tree.
`AstNodeMeta`	Response metadata	`etag`, `pages_processed`.	Frozen; attached to node and search responses.
`AstNodeShallow`	Search hit	`id`, `type: NodeType`, `page_index`, `bbox`, `text_content`, `attributes`, `children_count`.	Frozen; does not include nested child nodes.
`BoundingBox`	Value object	`x`, `y`, `width`, `height` (each `0.0`–`1.0`).	Normalized page coordinates.
`CitationAnchor`	Value object	`node_id`, `page_index`, `bbox: BoundingBox`, `confidence`, `content_hash`.	Provenance record attached to each block.
`CitedTextBlock`	Text block	`text`, `citation: CitationAnchor`, `node_type`, `chunk_index`, `depth`.	Each item in the list returned by `extract_cited_text()`.
`CitedTableBlock`	Table block	`table_node_id`, `page_index`, `citation_anchor`, `row_count`, `col_count`, `rows`.	Frozen; represents a single table.
`CitedTableCell`	Table cell	`row`, `col`, `row_span`, `col_span`, `text`, `bbox`, `confidence`.	Frozen; represents a single cell.
`NodeType`	Enum	`document`, `section`, `heading`, `paragraph`, `list`, `table`, `figure`, and other types.	String enum of node type values.
`GetAstNodeResponse`	Response	`node: AstNode`, `meta: AstNodeMeta`.	Returned by `get_ast_node()`.
`SearchAstNodesResponse`	Response	`nodes: list[AstNodeShallow]`, `total_matches`, `truncated`, `meta`.	Returned by `search_ast_nodes()`.
`ExtractCitedTablesResponse`	Response	`tables: list[CitedTableBlock]`, `table_count`, `pages_processed`.	Returned by `extract_cited_tables()`.
`AstDiffEntry`	Diff entry	`type` (`added`/`removed`/`changed`), `node_id`, `node_type`, `page_index`, `text_preview`.	One change in the diff.
`AstDiffSummary`	Diff totals	`added_node_count`, `removed_node_count`, `changed_node_count`.	Aggregate counts.
`GetAstDiffResponse`	Response	`original_page_count`, `modified_page_count`, `summary: AstDiffSummary`, `diff: list[AstDiffEntry]`, `pages_processed`.	Returned by `get_ast_diff()`.
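
Because `AstNode` is recursive through `children`, a depth-first walk is the usual way to consume an `AstDocument`. The sketch below builds an indented heading outline from `document.root`. `iter_nodes` and `outline` are hypothetical helpers, nodes are typed as `Any` so the sketch runs without the package installed, and comparing the node type against the string `"heading"` assumes the enum's value is that string.

```python
from collections.abc import Iterator
from typing import Any


def iter_nodes(node: Any, depth: int = 0) -> Iterator[tuple[int, Any]]:
    """Yield (depth, node) pairs depth-first, parents before children."""
    yield depth, node
    for child in node.children:
        yield from iter_nodes(child, depth + 1)


def outline(root: Any) -> list[str]:
    """Collect heading text from the tree, indented two spaces per level."""
    lines = []
    for depth, node in iter_nodes(root):
        # Accept either an enum member (use .value) or a plain string.
        node_type = getattr(node.type, "value", node.type)
        if node_type == "heading" and node.text_content:
            lines.append("  " * depth + node.text_content)
    return lines
```

A typical call would be `outline(document.root)` on the result of `get_document_ast()`.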

Read the citation anchor from an extracted text block:

from nextpdf import CitedTextBlock


def describe(block: CitedTextBlock) -> str:
    """Render a text block with its page index and confidence."""
    anchor = block.citation
    return f"[page {anchor.page_index}, confidence {anchor.confidence:.2f}] {block.text[:80]}"
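
The anchor's `bbox` holds normalized coordinates in the range `0.0`–`1.0`, so highlighting a citation on a rendered page requires scaling by the page size. The following sketch shows that arithmetic. `bbox_to_page_units` is a hypothetical helper, the page dimensions must come from your own renderer, and the sketch deliberately makes no assumption about where the coordinate origin sits, since this reference does not say.

```python
from typing import Any


def bbox_to_page_units(
    bbox: Any, page_width: float, page_height: float
) -> tuple[float, float, float, float]:
    """Scale a normalized box to page units as (x0, y0, x1, y1)."""
    for name in ("x", "y", "width", "height"):
        value = getattr(bbox, name)
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name}={value} is outside the normalized range")
    # Each axis scales independently by the page dimension along that axis.
    x0 = bbox.x * page_width
    y0 = bbox.y * page_height
    x1 = (bbox.x + bbox.width) * page_width
    y1 = (bbox.y + bbox.height) * page_height
    return (x0, y0, x1, y1)
```

For a US Letter page measured in points you would call `bbox_to_page_units(block.citation.bbox, 612.0, 792.0)`.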

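CLI commands

You can run the nextpdf command from a terminal to extract content. Pass --base-url and --api-key, or set NEXTPDF_BASE_URL and NEXTPDF_API_KEY in the environment. Every command except version requires connection settings. When PDF_PATH is -, the PDF bytes are read from standard input.

Symbol	Parameters	Default behavior	Returns	Raises or fails on	Notes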
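`nextpdf extract text`	`PDF_PATH`; `--format {json,markdown,plain}`, `--page`, `--headings-only`.	Outputs cited text blocks as JSON.	Writes to standard output or to the file given by `--output`.	Exits with code 1 on any `NextPDFError`.	`--page` selects a single 0-based page index.
`nextpdf extract tables`	`PDF_PATH`; `--format {json,csv}`, `--page-start`, `--page-end`.	Outputs tables as JSON.	Writes to standard output or to the file given by `--output`.	Exits with code 1 on any `NextPDFError`.	`--format csv` writes one CSV block per table.
`nextpdf ast`	`PDF_PATH`; `--page-start`, `--page-end`, `--token-budget`.	Outputs the full semantic AST as JSON.	Writes to standard output or to the file given by `--output`.	Exits with code 1 on any `NextPDFError`.	Use `--token-budget` to limit the response size.
`nextpdf info`	`PDF_PATH`.	Outputs document metadata: page count, schema version, source hash, estimated token count, and a summary of the root node.	Writes JSON to standard output or to the file given by `--output`.	Exits with code 1 on any `NextPDFError`.	A lightweight inspection command.
`nextpdf version`	None.	Prints the installed SDK version.	Writes to standard output.	Nothing expected.	Does not contact the server and needs no credentials.
`python -m nextpdf.mcp`	None (reads `NEXTPDF_BASE_URL`, `NEXTPDF_API_KEY`).	Runs a Model Context Protocol server over standard input/output.	A long-running server process.	`RuntimeError` when the environment variables are not set.	Requires the `nextpdf[mcp]` extra.

Extract tables from a page range and write them as comma-separated values (CSV):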

nextpdf extract tables invoice.pdf --format csv --page-start 0 --page-end 2 --output tables.csv
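
If you drive the CLI from a Python pipeline, the documented exit code is the signal to check. The sketch below builds the same command with the flags shown above and treats any non-zero exit status as failure. `build_tables_command` and `run_cli` are hypothetical helpers, and the sketch assumes that `nextpdf` is on `PATH` and that the connection settings are already present in the environment the child process inherits.

```python
import subprocess
import sys
from pathlib import Path


def build_tables_command(
    pdf_path: Path, output: Path, *, page_start: int, page_end: int
) -> list[str]:
    """Assemble the argv for `nextpdf extract tables` with CSV output."""
    return [
        "nextpdf", "extract", "tables", str(pdf_path),
        "--format", "csv",
        "--page-start", str(page_start),
        "--page-end", str(page_end),
        "--output", str(output),
    ]


def run_cli(command: list[str]) -> bool:
    """Run a command; return True on exit code 0, else echo stderr and return False."""
    completed = subprocess.run(command, capture_output=True, text=True, check=False)
    if completed.returncode != 0:
        # The CLI is documented to print its error message to standard error.
        print(completed.stderr, file=sys.stderr)
        return False
    return True
```

Passing the arguments as a list, rather than a shell string, avoids quoting problems with file names that contain spaces.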

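Exceptions

The exception hierarchy lives in nextpdf.models.errors and is re-exported from nextpdf. Catch the most specific class your code can handle first, then fall back to the base classes. The server signals failures with HTTP status semantics that conform to RFC 9110. Each exception carries the originating status_code (optional on the base NextPDFError) and, where applicable, an error_code.

Symbol	Base class	`status_code`	When raised	Notes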
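`NextPDFError`	`Exception`	Optional	Base class for all SDK errors.	Carries an optional `status_code`. Catch it last, as a fallback.
`NextPDFAPIError`	`NextPDFError`	Required	The Connect endpoint returned an HTTP error.	Also carries `error_code`.
`NextPDFLicenseError`	`NextPDFAPIError`	`402`	The feature requires a higher license tier on the server.	`error_code` is `license/tier-required`.
`QuotaExceededError`	`NextPDFAPIError`	`429`	The rate limit or quota was exceeded.	Carries `retry_after`; honor that value before retrying.
`AstNoStructTreeError`	`NextPDFAPIError`	`422`	The PDF is untagged and the heuristic fallback is not enabled.	Enable heuristic mode, or supply a tagged PDF.
`AstBuildTimeoutError`	`NextPDFAPIError`	`504`	The AST build timed out on the server.	Narrow the page range and retry.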

from nextpdf import (
    NextPDF,
    AstBuildTimeoutError,
    NextPDFAPIError,
    NextPDFError,
    QuotaExceededError,
)


def extract_text(client: NextPDF, pdf_bytes: bytes) -> int:
    """Extract cited text, handling the most specific failures first."""
    try:
        blocks = client.ast.extract_cited_text(pdf_bytes)
    except QuotaExceededError as error:
        raise RuntimeError(f"Quota exceeded (retry after {error.retry_after}s)") from error
    except AstBuildTimeoutError as error:
        raise RuntimeError("AST build timed out; reduce the page range") from error
    except NextPDFAPIError as error:
        raise RuntimeError(f"API error {error.status_code}: {error}") from error
    except NextPDFError as error:
        raise RuntimeError(f"SDK error: {error}") from error
    return len(blocks)
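
Instead of converting a quota failure into a `RuntimeError`, a caller may prefer to wait and retry, as the note on `QuotaExceededError` suggests. The sketch below is one way to do that under stated assumptions: `call_with_quota_retry` is a hypothetical helper, the exception class is passed in so the sketch runs without the package installed, and `retry_after` is assumed to be a number of seconds that may be absent.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_quota_retry(
    operation: Callable[[], T],
    quota_error: type[Exception],
    *,
    max_attempts: int = 3,
    default_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `operation`, sleeping for the error's retry_after between attempts."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except quota_error as error:
            if attempt == max_attempts:
                raise  # out of attempts: surface the original error
            # Assumption: retry_after is in seconds; fall back if it is missing.
            delay = getattr(error, "retry_after", None)
            sleep(float(delay) if delay is not None else default_delay)
    raise AssertionError("unreachable")
```

With the SDK installed, a call might look like `call_with_quota_retry(lambda: client.ast.extract_cited_text(pdf_bytes), QuotaExceededError)`.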

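Development notes

The synchronous NextPDF client delegates every call to AsyncNextPDF. When it detects a running loop it dispatches the coroutine to a worker thread, so it is safe to call from a notebook or from a thread that already runs an event loop.
Prefer the asynchronous context-manager form, async with AsyncNextPDF(...) as client:, so that the connection pool is closed deterministically. When you construct AsyncNextPDF directly, call close() yourself.
The bearer token is never logged or included in error messages, and Transport Layer Security (TLS) verification is enabled by default. Do not embed credentials in source code; read them from environment variables or a secrets manager.
All models are Pydantic v2 classes; several response models are frozen (immutable). Treat extracted blocks as read-only values.
On any NextPDFError, the CLI exits with code 1 and prints the message to standard error. Wire that exit code into your pipeline.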
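
For the case where the client is constructed directly rather than through `async with`, a `try`/`finally` block gives the same cleanup guarantee. This is a minimal sketch: `page_count_then_close` is a hypothetical helper, the client is typed as `Any` so the sketch runs without the package installed, and because this reference does not state whether `close()` is a coroutine, the sketch awaits its result only when it is awaitable.

```python
import inspect
from typing import Any


async def page_count_then_close(client: Any, pdf_bytes: bytes) -> int:
    """Use a directly constructed async client and always release it."""
    try:
        document = await client.ast.get_document_ast(pdf_bytes)
        return document.page_count
    finally:
        # Runs on success and on failure, mirroring what __aexit__ does.
        result = client.close()
        if inspect.isawaitable(result):
            await result
```

Since `close()` is documented as idempotent, calling it again later in a shutdown hook should be harmless.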

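See also

Python SDK Developer Guide: architecture, asynchronous batching, and failure handling.
Python CLI: terminal extraction and streaming of large files.
Python MCP Server: exposing the extraction tools to AI agents.
RFC 9110 (HTTP Semantics) and RFC 9457 (Problem Details for HTTP APIs) describe the status semantics and the machine-readable error bodies returned by the Connect endpoint. See the IETF RFC index for the authoritative text.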