Python API reference

At a glance

The NextPDF Python Software Development Kit (SDK) provides two clients, a shared Abstract Syntax Tree (AST) method namespace named ast, Pydantic models for every response, a command-line interface (CLI) named nextpdf, and a six-class exception hierarchy. Use this page as the reference for the public application programming interface (API) symbols that work with Portable Document Format (PDF) documents.

Import the public symbols from the top-level package.

from nextpdf import (
    AsyncNextPDF,
    NextPDF,
    AstBuildTimeoutError,
    AstNoStructTreeError,
    NextPDFAPIError,
    NextPDFError,
    NextPDFLicenseError,
    QuotaExceededError,
)

Every extraction method takes raw PDF bytes (bytes) as its first positional argument and returns a typed Pydantic model. Pass options as keyword-only arguments. The synchronous NextPDF.ast.* methods and the asynchronous AsyncNextPDF.ast.* methods share the same signatures. The asynchronous methods are coroutines, so call them with await.

Clients

The synchronous NextPDF client wraps the asynchronous client and runs each coroutine to completion. AsyncNextPDF is both the asynchronous client and an async context manager. Prefer the context manager form so that the underlying transport is reliably closed.

Constructors

Symbol	Parameters	Default behavior	Returns	Raises or fails with	Notes
`NextPDF(*, base_url, api_key, api_version='v1')`	Keyword-only base URL, API key, and optional API version	Creates a synchronous client that works against the remote service	`NextPDF`	`ValueError` when `base_url` or `api_key` is empty	Safely runs async work synchronously, including inside notebooks and running event loops
`AsyncNextPDF(*, base_url='', api_key='', api_version='v1', backend=None)`	Keyword-only base URL, API key, optional API version, and optional injected backend	Creates an asynchronous remote client when no backend is passed	`AsyncNextPDF`	`ValueError` when `base_url` or `api_key` is empty and no `backend` is given	Pass `backend=` to use a custom or local backend for testing
`AsyncNextPDF.__aenter__()`	None	Enters the async context and returns the client	`AsyncNextPDF`	None expected	Use `async with AsyncNextPDF(...) as client:`
`AsyncNextPDF.__aexit__(*_)`	Exception arguments (ignored)	Calls `close()` on leaving the context	`None`	None expected	Releases the transport even if the body raises
`AsyncNextPDF.close()`	None	Closes the remote backend that the client owns and releases the connection pool	`None`	None expected	Idempotent; does not close an injected backend

Do not keep the API key in source code. Read base_url and api_key from the environment (NEXTPDF_BASE_URL, NEXTPDF_API_KEY) or from a secret manager.

import os

from nextpdf import AsyncNextPDF


async def extract(pdf_bytes: bytes) -> int:
    """Return the page count of a PDF using the async client as a context manager."""
    base_url = os.environ["NEXTPDF_BASE_URL"]
    api_key = os.environ["NEXTPDF_API_KEY"]

    async with AsyncNextPDF(base_url=base_url, api_key=api_key) as client:
        document = await client.ast.get_document_ast(pdf_bytes)
        return document.page_count
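
Because both constructors raise ValueError on an empty base_url or api_key, it can help to validate the environment before building a client. The following is a minimal sketch under that assumption; load_connection_settings is a hypothetical helper and not part of the SDK, and it reads only the two variable names mentioned above.

```python
import os
from typing import Mapping, Tuple


def load_connection_settings(environ: Mapping[str, str] = os.environ) -> Tuple[str, str]:
    """Return (base_url, api_key), or raise if either variable is missing or blank.

    Hypothetical helper for illustration; it is not provided by the nextpdf package.
    """
    names = ("NEXTPDF_BASE_URL", "NEXTPDF_API_KEY")
    # Treat whitespace-only values the same as missing ones.
    values = [environ.get(name, "").strip() for name in names]
    missing = [name for name, value in zip(names, values) if not value]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return values[0], values[1]
```

Passing a plain dict instead of os.environ keeps the helper easy to exercise in tests without touching the real process environment.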

Synchronous AST methods: `NextPDF.ast.*`

Symbol	Parameters	Default behavior	Returns	Raises or fails with	Notes
`NextPDF.ast.get_document_ast()`	`pdf_data: bytes`; keywords `page_range_start`, `page_range_end`, `token_budget`	Builds the full Semantic AST for every page	`AstDocument`	`AstNoStructTreeError`, `AstBuildTimeoutError`, `NextPDFLicenseError`, `QuotaExceededError`	Narrow the page range when the build times out
`NextPDF.ast.extract_cited_text()`	`pdf_data: bytes`; keywords `page_index`, `headings_only`	Extracts every text block with its citation anchor	`list[CitedTextBlock]`	`NextPDFAPIError`, `QuotaExceededError`	Set `headings_only=True` to return heading nodes only
`NextPDF.ast.extract_cited_tables()`	`pdf_data: bytes`; keyword `page_range` (a `dict` with `start` and `end`)	Extracts every table with cell-level citation anchors	`ExtractCitedTablesResponse`	`NextPDFAPIError`, `QuotaExceededError`	Omit `page_range` to scan the whole document
`NextPDF.ast.get_ast_node()`	`pdf_data: bytes`, `node_id: str`	Fetches a single node by its identifier	`GetAstNodeResponse`	`NextPDFError` when the node is not found	`node_id` has the form `ast:{hash6}:{pageIdx}:{seq}`
`NextPDF.ast.search_ast_nodes()`	`pdf_data: bytes`; keywords `node_type`, `page_index`, `text_query`, `max_results=100`	Returns shallow nodes that match the filters	`SearchAstNodesResponse`	`NextPDFAPIError`	`text_query` matches substrings case-insensitively
`NextPDF.ast.get_ast_diff()`	`original_pdf_data: bytes`, `modified_pdf_data: bytes`	Compares two documents structurally	`GetAstDiffResponse`	`NextPDFAPIError`, `QuotaExceededError`	Reports added, removed, and changed nodes

Asynchronous AST methods: `AsyncNextPDF.ast.*`

Each asynchronous method is a coroutine with the same parameters, defaults, return type, and failure modes as its synchronous counterpart. Call it with await inside an asyncio runtime.

Symbol	Parameters	Default behavior	Returns	Raises or fails with	Notes
`AsyncNextPDF.ast.get_document_ast()`	`pdf_data: bytes`; keywords `page_range_start`, `page_range_end`, `token_budget`	Builds the full Semantic AST for every page	`AstDocument`	`AstNoStructTreeError`, `AstBuildTimeoutError`, `NextPDFLicenseError`, `QuotaExceededError`	`await` the result
`AsyncNextPDF.ast.extract_cited_text()`	`pdf_data: bytes`; keywords `page_index`, `headings_only`	Extracts every text block with its citation anchor	`list[CitedTextBlock]`	`NextPDFAPIError`, `QuotaExceededError`	`await` the result
`AsyncNextPDF.ast.extract_cited_tables()`	`pdf_data: bytes`; keyword `page_range`	Extracts every table with cell-level citation anchors	`ExtractCitedTablesResponse`	`NextPDFAPIError`, `QuotaExceededError`	`await` the result
`AsyncNextPDF.ast.get_ast_node()`	`pdf_data: bytes`, `node_id: str`	Fetches a single node by its identifier	`GetAstNodeResponse`	`NextPDFError` when the node is not found	`await` the result
`AsyncNextPDF.ast.search_ast_nodes()`	`pdf_data: bytes`; keywords `node_type`, `page_index`, `text_query`, `max_results=100`	Returns shallow nodes that match the filters	`SearchAstNodesResponse`	`NextPDFAPIError`	`await` the result
`AsyncNextPDF.ast.get_ast_diff()`	`original_pdf_data: bytes`, `modified_pdf_data: bytes`	Compares two documents structurally	`GetAstDiffResponse`	`NextPDFAPIError`, `QuotaExceededError`	`await` the result

Use the async client as a context manager to run two extractions concurrently.

import asyncio
import os

from nextpdf import AsyncNextPDF


async def extract_pair(first: bytes, second: bytes) -> None:
    """Extract two PDFs concurrently with one shared async client."""
    # Read connection settings from the environment rather than hard-coding them.
    base_url = os.environ["NEXTPDF_BASE_URL"]
    api_key = os.environ["NEXTPDF_API_KEY"]

    async with AsyncNextPDF(base_url=base_url, api_key=api_key) as client:
        # gather() runs both coroutines concurrently and keeps the argument order.
        text_blocks, tables = await asyncio.gather(
            client.ast.extract_cited_text(first),
            client.ast.extract_cited_tables(second),
        )
        print(f"text blocks: {len(text_blocks)}; tables: {tables.table_count}")
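
When the same pattern is applied to many documents, an unbounded gather() may run into the quota limits listed under the exceptions. One possible approach, sketched here with plain asyncio, is to cap the number of in-flight calls with a semaphore. gather_limited is a hypothetical helper, and the extract argument stands for any coroutine function such as client.ast.extract_cited_text; nothing here is part of the SDK.

```python
import asyncio
from typing import Awaitable, Callable, List, Sequence, TypeVar

T = TypeVar("T")


async def gather_limited(
    extract: Callable[[bytes], Awaitable[T]],
    documents: Sequence[bytes],
    *,
    limit: int = 4,
) -> List[T]:
    """Run extract() over all documents with at most `limit` calls in flight.

    Hypothetical helper; results come back in the same order as `documents`.
    """
    if limit < 1:
        raise ValueError("limit must be at least 1")
    semaphore = asyncio.Semaphore(limit)

    async def guarded(pdf_bytes: bytes) -> T:
        # The semaphore blocks here once `limit` extractions are already running.
        async with semaphore:
            return await extract(pdf_bytes)

    return list(await asyncio.gather(*(guarded(item) for item in documents)))
```

Inside an async with block, a call could look like await gather_limited(client.ast.extract_cited_text, pdfs, limit=2). A suitable limit depends on the server's rate limits, which this page does not specify.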

Models

Every response is a Pydantic model. Import the model classes from nextpdf or nextpdf.models.ast.

Symbol	Kind	Key fields	Notes
`AstDocument`	Document root	`schema_version`, `source_hash`, `page_count`, `root: AstNode`, `estimated_tokens` (property)	Returned by `get_document_ast()`; accepts the aliases `schemaVersion`, `sourceHash`, and `pageCount`
`AstNode`	Tree node	`id`, `type: NodeType`, `page_index`, `bbox`, `text_content`, `attributes`, `children: list[AstNode]`, `pdf_object_number`, `mcid`	Recursive node that holds the document tree
`AstNodeMeta`	Response metadata	`etag`, `pages_processed`	Frozen; attached to node and search responses
`AstNodeShallow`	Search match	`id`, `type: NodeType`, `page_index`, `bbox`, `text_content`, `attributes`, `children_count`	Frozen; carries no nested children
`BoundingBox`	Value object	`x`, `y`, `width`, `height` (each `0.0`–`1.0`)	Coordinates normalized to the page
`CitationAnchor`	Value object	`node_id`, `page_index`, `bbox: BoundingBox`, `confidence`, `content_hash`	Provenance record for each block
`CitedTextBlock`	Text block	`text`, `citation: CitationAnchor`, `node_type`, `chunk_index`, `depth`	One item in the list returned by `extract_cited_text()`
`CitedTableBlock`	Table block	`table_node_id`, `page_index`, `citation_anchor`, `row_count`, `col_count`, `rows`	Frozen; one table
`CitedTableCell`	Table cell	`row`, `col`, `row_span`, `col_span`, `text`, `bbox`, `confidence`	Frozen; one cell
`NodeType`	Enum	`document`, `section`, `heading`, `paragraph`, `list`, `table`, `figure`, and others	String enum of node type values
`GetAstNodeResponse`	Response	`node: AstNode`, `meta: AstNodeMeta`	Returned by `get_ast_node()`
`SearchAstNodesResponse`	Response	`nodes: list[AstNodeShallow]`, `total_matches`, `truncated`, `meta`	Returned by `search_ast_nodes()`
`ExtractCitedTablesResponse`	Response	`tables: list[CitedTableBlock]`, `table_count`, `pages_processed`	Returned by `extract_cited_tables()`
`AstDiffEntry`	Diff entry	`type` (`added`/`removed`/`changed`), `node_id`, `node_type`, `page_index`, `text_preview`	One change in the diff
`AstDiffSummary`	Diff totals	`added_node_count`, `removed_node_count`, `changed_node_count`	Aggregate counts
`GetAstDiffResponse`	Response	`original_page_count`, `modified_page_count`, `summary: AstDiffSummary`, `diff: list[AstDiffEntry]`, `pages_processed`	Returned by `get_ast_diff()`
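
Because AstNode is recursive through its children field, a full document tree can be traversed with an ordinary depth-first walk. The sketch below is duck-typed and relies only on the type and children attributes listed in the table; iter_nodes and count_node_types are hypothetical helpers, not SDK functions.

```python
from collections import Counter
from typing import Any, Iterator


def iter_nodes(root: Any) -> Iterator[Any]:
    """Yield root and all of its descendants in depth-first, document order."""
    stack = [root]
    while stack:
        node = stack.pop()
        yield node
        # Push children reversed so the first child is visited first.
        stack.extend(reversed(node.children))


def count_node_types(root: Any) -> Counter:
    """Count nodes per type; works for plain strings and string enums alike."""
    return Counter(str(getattr(node.type, "value", node.type)) for node in iter_nodes(root))
```

With a document returned by get_document_ast(), the call would presumably be count_node_types(document.root). An explicit stack avoids hitting Python's recursion limit on deeply nested trees.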

Read the citation anchor from an extracted text block.

from nextpdf import CitedTextBlock


def describe(block: CitedTextBlock) -> str:
    """Render a text block with its page index and confidence."""
    anchor = block.citation
    return f"[page {anchor.page_index}, confidence {anchor.confidence:.2f}] {block.text[:80]}"
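
Since BoundingBox values are normalized to the 0.0 to 1.0 range, drawing a highlight over a rendered page needs a scaling step. The following is an illustrative sketch only: to_page_units is a hypothetical helper, the page dimensions come from the caller, and the origin and axis direction of the coordinates are not stated on this page, so the function scales the values without flipping any axis.

```python
from typing import Any, Tuple


def to_page_units(box: Any, page_width: float, page_height: float) -> Tuple[float, float, float, float]:
    """Scale a normalized box (x, y, width, height) to absolute page units.

    Assumption: `box` exposes x, y, width, and height as floats in 0.0..1.0.
    No axis is flipped, because the coordinate origin is not documented here.
    """
    return (
        box.x * page_width,
        box.y * page_height,
        box.width * page_width,
        box.height * page_height,
    )
```

For a text block, the input would be block.citation.bbox, and the page size would come from whatever renderer displays the page.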

CLI commands

The nextpdf command extracts data from the terminal. Pass --base-url and --api-key, or set NEXTPDF_BASE_URL and NEXTPDF_API_KEY in the environment. Every command except version needs connection settings. A PDF_PATH of - reads the PDF bytes from standard input.

Symbol	Parameters	Default behavior	Returns	Raises or fails with	Notes
`nextpdf extract text`	`PDF_PATH`, `--format {json,markdown,plain}`, `--page`, `--headings-only`	Outputs the cited text blocks as JavaScript Object Notation (JSON)	Writes to standard output or to the file given by `--output`	Exit code 1 for any `NextPDFError`	`--page` selects a single zero-based page index
`nextpdf extract tables`	`PDF_PATH`, `--format {json,csv}`, `--page-start`, `--page-end`	Outputs tables as JSON	Writes to standard output or to the file given by `--output`	Exit code 1 for any `NextPDFError`	`--format csv` writes one comma-separated values (CSV) block per table
`nextpdf ast`	`PDF_PATH`, `--page-start`, `--page-end`, `--token-budget`	Outputs the full Semantic AST as JSON	Writes to standard output or to the file given by `--output`	Exit code 1 for any `NextPDFError`	Use `--token-budget` to cap the response size
`nextpdf info`	`PDF_PATH`	Outputs document metadata: page count, schema version, source hash, estimated tokens, and a root summary	Writes JSON to standard output or to the file given by `--output`	Exit code 1 for any `NextPDFError`	Lightweight inspection command
`nextpdf version`	None	Prints the installed SDK version	Writes to standard output	None expected	Does not contact the server; needs no credentials
`python -m nextpdf.mcp`	None (reads `NEXTPDF_BASE_URL`, `NEXTPDF_API_KEY`)	Runs a Model Context Protocol server over standard input/output	Long-running server process	`RuntimeError` when the environment variables are not set	Requires the `nextpdf[mcp]` extra

Extract the tables in a page range as comma-separated values (CSV).

nextpdf extract tables invoice.pdf --format csv --page-start 0 --page-end 2 --output tables.csv
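
A pipeline step written in Python might assemble the same command and act on the exit code. This is a sketch under stated assumptions: build_tables_command and run_checked are hypothetical helpers, only the flags documented above are used, and the nextpdf executable is assumed to be on PATH with its connection settings in the environment.

```python
import subprocess
from typing import List, Optional, Sequence


def build_tables_command(
    pdf_path: str,
    *,
    fmt: str = "json",
    page_start: Optional[int] = None,
    page_end: Optional[int] = None,
    output: Optional[str] = None,
) -> List[str]:
    """Build the argument list for `nextpdf extract tables` (hypothetical helper)."""
    command = ["nextpdf", "extract", "tables", pdf_path, "--format", fmt]
    if page_start is not None:
        command += ["--page-start", str(page_start)]
    if page_end is not None:
        command += ["--page-end", str(page_end)]
    if output is not None:
        command += ["--output", output]
    return command


def run_checked(command: Sequence[str]) -> str:
    """Run a command and return its stdout, raising on a non-zero exit code."""
    completed = subprocess.run(command, capture_output=True, text=True, check=False)
    if completed.returncode != 0:
        # The CLI prints its error message to standard error.
        message = completed.stderr.strip()
        raise RuntimeError(f"command failed with exit code {completed.returncode}: {message}")
    return completed.stdout
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.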

Exceptions

The exception hierarchy lives in nextpdf.models.errors and is re-exported from nextpdf. Catch the most specific class your code can handle first, then fall back to the base classes. The server reports failures with Hypertext Transfer Protocol (HTTP) status semantics consistent with Request for Comments (RFC) 9110. Each exception carries the originating status_code, and an error_code when one is available.

Symbol	Base class	`status_code`	Raised when	Notes
`NextPDFError`	`Exception`	Optional	Base class for every SDK error	Carries an optional `status_code`; catch it last as a fallback
`NextPDFAPIError`	`NextPDFError`	Required	The Connect endpoint returned an HTTP error	Adds `error_code`
`NextPDFLicenseError`	`NextPDFAPIError`	`402`	The server requires a higher license tier for this feature	`error_code` is `license/tier-required`
`QuotaExceededError`	`NextPDFAPIError`	`429`	A rate limit or quota was exceeded	Carries `retry_after`; honor it before retrying
`AstNoStructTreeError`	`NextPDFAPIError`	`422`	The PDF is untagged and the heuristic fallback is not enabled	Enable heuristic mode or supply a tagged PDF
`AstBuildTimeoutError`	`NextPDFAPIError`	`504`	The AST build timed out on the server	Narrow the page range and retry

from nextpdf import (
    NextPDF,
    AstBuildTimeoutError,
    NextPDFAPIError,
    NextPDFError,
    QuotaExceededError,
)


def extract_text(client: NextPDF, pdf_bytes: bytes) -> int:
    """Extract cited text, handling the most specific failures first."""
    try:
        blocks = client.ast.extract_cited_text(pdf_bytes)
    except QuotaExceededError as error:
        raise RuntimeError(f"Quota exceeded (retry after {error.retry_after}s)") from error
    except AstBuildTimeoutError as error:
        raise RuntimeError("AST build timed out; reduce the page range") from error
    except NextPDFAPIError as error:
        raise RuntimeError(f"API error {error.status_code}: {error}") from error
    except NextPDFError as error:
        raise RuntimeError(f"SDK error: {error}") from error
    return len(blocks)
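
Instead of re-raising on a quota error, a caller may prefer to wait for retry_after and try again. The sketch below shows one way to do that generically. call_with_retry is a hypothetical helper, the exception class is passed in (for example QuotaExceededError), retry_after is assumed to be a number of seconds as in the message above, and a fallback delay is used when the attribute is absent.

```python
import time
from typing import Callable, Type, TypeVar

T = TypeVar("T")


def call_with_retry(
    operation: Callable[[], T],
    retryable: Type[Exception],
    *,
    max_attempts: int = 3,
    fallback_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call operation(), waiting and retrying when `retryable` is raised.

    Hypothetical helper: honors the error's retry_after attribute when present.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except retryable as error:
            if attempt == max_attempts:
                raise  # Out of attempts: surface the original error.
            delay = getattr(error, "retry_after", None)
            sleep(float(delay) if delay is not None else fallback_delay)
    raise AssertionError("unreachable")
```

A call might look like call_with_retry(lambda: client.ast.extract_cited_text(pdf_bytes), QuotaExceededError). The injectable sleep parameter is there so tests can run without real delays.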

Development notes

The synchronous NextPDF client delegates every call to AsyncNextPDF. You can call it from a notebook or from a thread that is already running an event loop, because when it detects a running loop it hands the coroutine to a worker thread.
Prefer the async context manager form, async with AsyncNextPDF(...) as client:, so that the connection pool is reliably closed. When you construct AsyncNextPDF directly, call close() yourself.
The bearer token is never logged or included in error messages, and Transport Layer Security (TLS) verification is enabled by default. Do not embed credentials in source; read them from the environment or a secret manager.
All models are Pydantic v2 classes. Several response models are frozen (immutable), so treat extracted blocks as read-only values.
The CLI exits with status code 1 for any NextPDFError and prints the message to standard error. Wire that exit code into your pipelines.
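
The first note describes running a coroutine on a worker thread when an event loop is already active. The general technique can be sketched with the standard library as follows. This is an illustration of the pattern, not the SDK's actual implementation, and run_sync is a hypothetical name.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Coroutine, TypeVar

T = TypeVar("T")


def run_sync(coro: Coroutine[Any, Any, T]) -> T:
    """Run a coroutine to completion from synchronous code.

    Illustrative sketch only; the SDK's internal mechanism may differ.
    """
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # No loop is running in this thread, so asyncio.run() is safe here.
        return asyncio.run(coro)
    # A loop is already running (for example in a notebook). asyncio.run()
    # would fail in this thread, so run the coroutine on a worker thread
    # that gets its own fresh event loop.
    with ThreadPoolExecutor(max_workers=1) as pool:
        return pool.submit(asyncio.run, coro).result()
```

In this sketch the calling thread blocks until the worker finishes, so other tasks on an already running loop are paused for the duration of the call.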

See also

Python SDK developer guide: architecture, async integration, and failure handling
Python CLI: extraction from the terminal and streaming for large files
Python MCP server: exposes the extraction tools to artificial intelligence (AI) agents
RFC 9110 (HTTP Semantics) and RFC 9457 (Problem Details for HTTP APIs) describe the status semantics and machine-readable error bodies that the Connect endpoint returns. See the IETF RFC index for the authoritative text.