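Document ingestion pipeline

The first stage of the RAG pipeline converts PDF documents into a searchable vector index. The NextPDF Enterprise ingestion pipeline comprises structure-aware parsing, intelligent chunking, GPU embedding, and dual-track indexing (vector + BM25).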

Endpoints

`POST /v1/rag/ingest`

Ingests a single PDF document. The system automatically parses, chunks, embeds, and indexes it.

Request

POST /v1/rag/ingest
Authorization: Bearer {jwt_token}
X-Tenant-ID: acme-corp-001
Content-Type: multipart/form-data

--boundary
Content-Disposition: form-data; name="document"; filename="annual-report-2024.pdf"
Content-Type: application/pdf

{pdf binary data}
--boundary
Content-Disposition: form-data; name="options"
Content-Type: application/json

{
  "document_id": "annual-report-2024",
  "title": "Annual Report 2024",
  "tags": ["financial", "2024", "annual"],
  "chunking_strategy": "structure_aware",
  "chunk_max_tokens": 512,
  "chunk_overlap_tokens": 64,
  "extract_tables": true,
  "extract_headers": true,
  "language_hint": "en"
}

Response (202 Accepted)

{
  "ingest_job_id": "job_01HX8K2N3P4Q5R6S7T8U9V0W",
  "document_id": "annual-report-2024",
  "status": "queued",
  "estimated_chunks": 247,
  "created_at": "2025-01-15T09:30:00Z",
  "status_url": "/v1/rag/ingest/jobs/job_01HX8K2N3P4Q5R6S7T8U9V0W"
}
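Because the `options` part of the request is ordinary JSON, it can be assembled and sanity-checked on the client before upload. The following is a minimal sketch, not part of the NextPDF API: the helper name `build_ingest_options` and the rule that the overlap must be smaller than the chunk size are assumptions; only the field names come from the request shown above.

```python
import json


def build_ingest_options(document_id, title, tags,
                         chunk_max_tokens=512, chunk_overlap_tokens=64):
    """Build the JSON string for the `options` form part (hypothetical helper)."""
    # Assumed client-side check: an overlap as large as the chunk itself
    # would leave no room for new text in each chunk.
    if not 0 <= chunk_overlap_tokens < chunk_max_tokens:
        raise ValueError("chunk_overlap_tokens must be >= 0 and < chunk_max_tokens")
    options = {
        "document_id": document_id,
        "title": title,
        "tags": list(tags),
        "chunking_strategy": "structure_aware",
        "chunk_max_tokens": chunk_max_tokens,
        "chunk_overlap_tokens": chunk_overlap_tokens,
        "extract_tables": True,
        "extract_headers": True,
        "language_hint": "en",
    }
    return json.dumps(options)


payload = build_ingest_options("annual-report-2024", "Annual Report 2024",
                               ["financial", "2024", "annual"])
```

The resulting string would be sent as the `options` part alongside the `document` part of the multipart body.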

`POST /v1/rag/ingest-chunks`

Submits pre-chunked passages directly, bypassing NextPDF's automatic chunking (useful when you already have custom chunking logic):

Request

POST /v1/rag/ingest-chunks
Authorization: Bearer {jwt_token}
X-Tenant-ID: acme-corp-001
Content-Type: application/json

{
  "document_id": "contract-2025-001",
  "chunks": [
    {
      "chunk_id": "c001",
      "text": "This Service Agreement is entered into as of January 15, 2025...",
      "metadata": {
        "page_number": 1,
        "section": "Introduction",
        "document_type": "legal_contract",
        "headings": ["Service Agreement", "1. Definitions"]
      }
    },
    {
      "chunk_id": "c002",
      "text": "1.1 \"Service\" means the PDF generation and processing capabilities...",
      "metadata": {
        "page_number": 1,
        "section": "1. Definitions",
        "headings": ["1. Definitions"]
      }
    }
  ]
}
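If your own chunker already produces text segments, the request body can be built mechanically. A minimal sketch under stated assumptions: `to_chunk_payload` and the `c001`-style ID scheme are illustrative, and only the field names are taken from the request above.

```python
import json


def to_chunk_payload(document_id, segments):
    """Turn (text, metadata) pairs into an ingest-chunks request body."""
    chunks = []
    for index, (text, metadata) in enumerate(segments, start=1):
        chunks.append({
            "chunk_id": f"c{index:03d}",  # assumed ID scheme, mirrors the example IDs
            "text": text,
            "metadata": dict(metadata),
        })
    return {"document_id": document_id, "chunks": chunks}


body = to_chunk_payload("contract-2025-001", [
    ("First passage of the contract.", {"page_number": 1, "section": "Introduction"}),
    ("Second passage of the contract.", {"page_number": 1, "section": "1. Definitions"}),
])
request_json = json.dumps(body, ensure_ascii=False)
```

Passing `ensure_ascii=False` keeps non-ASCII passage text readable in the serialized body, which is convenient for non-English documents.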

Structure-aware chunking

The `structure_aware` chunking strategy uses NextPDF's document-structure parsing to split text at semantic boundaries (headings, paragraphs, sections):

flowchart TD
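    A[PDF document] --> B[Structure parsing]
    B --> C["Identify heading hierarchy H1/H2/H3"]
    C --> D[Identify paragraph boundaries]
    D --> E["Identify tables / images"]
    E --> F{"Paragraph length > max_tokens?"}
    F -->|Yes| G[Split at sentence boundaries]
    F -->|No| H[Keep paragraph intact]
    G --> I[Add overlap context]
    H --> I
    I --> J["Tag metadata (page number, heading path)"]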
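The decision flow above can be sketched in a few lines. This is an illustration of the general technique, not NextPDF's implementation: tokens are approximated by whitespace-separated words, sentence boundaries by a simple regular expression, and all function names are hypothetical.

```python
import re


def count_tokens(text):
    # Assumption: whitespace-separated words stand in for real tokens.
    return len(text.split())


def split_paragraph(paragraph, max_tokens):
    """Keep a short paragraph whole; split a long one at sentence boundaries."""
    if count_tokens(paragraph) <= max_tokens:
        return [paragraph]
    sentences = re.split(r"(?<=[.!?])\s+", paragraph)
    pieces, current = [], []
    for sentence in sentences:
        # Close the current piece when adding this sentence would exceed the limit.
        if current and count_tokens(" ".join(current + [sentence])) > max_tokens:
            pieces.append(" ".join(current))
            current = []
        current.append(sentence)  # a single over-long sentence is kept whole
    if current:
        pieces.append(" ".join(current))
    return pieces


def chunk_document(paragraphs, max_tokens=512, overlap_tokens=64):
    """paragraphs: list of (text, page_number, heading_path) tuples."""
    chunks, tail = [], []
    for text, page, headings in paragraphs:
        for piece in split_paragraph(text, max_tokens):
            chunks.append({
                # Prepend the last words of the previous piece as overlap context.
                "text": " ".join(tail + [piece]),
                "metadata": {"page_number": page, "headings": list(headings)},
            })
            tail = piece.split()[-overlap_tokens:] if overlap_tokens else []
    return chunks


demo = chunk_document(
    [("One two three. Four five six. Seven eight nine.", 1, ["Intro"]),
     ("Short one.", 2, ["Intro", "Details"])],
    max_tokens=6, overlap_tokens=2)
```

In this sketch the overlap is added on top of the piece, so a chunk can exceed `max_tokens` by up to `overlap_tokens`, and overlap is carried across paragraph boundaries. Whether a production chunker does either is a design choice this page does not specify.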

Chunking strategy options

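Strategy	Description	Best suited for
`structure_aware`	Splits at document-structure boundaries (recommended)	Documents with a clear section structure
`fixed_token`	Splits into chunks of a fixed token count (with overlap)	Scanned documents, unstructured PDFs
`sentence`	Splits at sentence boundaries	Documents dense with short paragraphs
`paragraph`	Splits at paragraph boundaries	Press releases, reports
`page`	Splits at page boundaries (coarsest granularity)	Documents whose pages stand alone (forms, slide decks)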
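For comparison, fixed-size chunking with overlap is a plain sliding window whose stride is the chunk size minus the overlap. The sketch below shows that general technique on an already-tokenized list; `fixed_token_chunks` is a hypothetical name and this is not necessarily how the `fixed_token` strategy is implemented internally.

```python
def fixed_token_chunks(tokens, max_tokens, overlap_tokens):
    """Slide a window of max_tokens over tokens, stepping by max_tokens - overlap_tokens."""
    if not 0 <= overlap_tokens < max_tokens:
        raise ValueError("overlap_tokens must be >= 0 and < max_tokens")
    stride = max_tokens - overlap_tokens
    chunks = []
    for start in range(0, len(tokens), stride):
        chunks.append(tokens[start:start + max_tokens])
        # Stop once the window has reached the end, to avoid a trailing
        # chunk that contains only already-covered overlap tokens.
        if start + max_tokens >= len(tokens):
            break
    return chunks


windows = fixed_token_chunks(list(range(10)), max_tokens=4, overlap_tokens=1)
```

With the request defaults shown earlier (`chunk_max_tokens` of 512 and `chunk_overlap_tokens` of 64), the same arithmetic gives a stride of 448 tokens.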

Tracking ingestion progress

`GET /v1/rag/ingest/jobs/{job_id}`

GET /v1/rag/ingest/jobs/job_01HX8K2N3P4Q5R6S7T8U9V0W
Authorization: Bearer {jwt_token}

{
  "ingest_job_id": "job_01HX8K2N3P4Q5R6S7T8U9V0W",
  "document_id": "annual-report-2024",
  "status": "embedding",
  "progress": {
    "total_chunks": 247,
    "parsed_chunks": 247,
    "embedded_chunks": 183,
    "indexed_chunks": 183,
    "percent_complete": 74
  },
  "started_at": "2025-01-15T09:30:01Z",
  "estimated_completion": "2025-01-15T09:30:08Z"
}

Status transitions

queued → parsing → chunking → embedding → indexing → completed
                                                    ↘ failed
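A client that does not use an SDK can follow these transitions by polling the job endpoint until a terminal status appears. The following is a minimal sketch: `wait_for_ingest` and its parameters are hypothetical, and `fetch_status` is a caller-supplied function assumed to return the parsed JSON body of `GET /v1/rag/ingest/jobs/{job_id}`.

```python
import time

TERMINAL_STATUSES = {"completed", "failed"}


def wait_for_ingest(fetch_status, timeout_seconds=60.0, poll_interval_seconds=0.5,
                    sleep=time.sleep, clock=time.monotonic):
    """Poll until the job reaches a terminal status or the timeout expires."""
    deadline = clock() + timeout_seconds
    while True:
        job = fetch_status()
        if job["status"] in TERMINAL_STATUSES:
            return job
        if clock() >= deadline:
            raise TimeoutError(f"job still {job['status']!r} after {timeout_seconds}s")
        sleep(poll_interval_seconds)


# Simulated responses so the sketch runs without a server.
responses = iter([{"status": "queued"}, {"status": "embedding"}, {"status": "completed"}])
result = wait_for_ingest(lambda: next(responses), sleep=lambda seconds: None)
```

Injecting `sleep` and `clock` keeps the loop testable without real delays. Callers should still inspect `result["status"]`, since `failed` is also a terminal status.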

PHP client

use NextPDF\Enterprise\AiRag\RagClient;
use NextPDF\Enterprise\AiRag\IngestOptions;
use NextPDF\Enterprise\AiRag\ChunkingStrategy;

$client = RagClient::fromEnvironment();

// Asynchronous ingestion
$job = $client->ingest(
    documentId: 'annual-report-2024',
    pdfBytes: file_get_contents('annual-report-2024.pdf'),
    options: IngestOptions::create()
        ->withTitle('Annual Report 2024')
        ->withTags(['financial', '2024'])
        ->withChunkingStrategy(ChunkingStrategy::StructureAware)
        ->withChunkMaxTokens(512)
        ->withChunkOverlapTokens(64)
        ->withTableExtraction(true),
);

// Wait for completion (with a timeout)
$completedJob = $client->waitForIngest(
    jobId: $job->ingestJobId(),
    timeoutSeconds: 60,
    pollIntervalMs: 500,
);

echo 'Ingestion complete: ' . $completedJob->progress()->indexedChunks() . ' chunks indexed';

PHP Compatibility

This example uses PHP 8.5 syntax. If your environment runs PHP 8.1 or 7.4, use NextPDF Backport for a backward-compatible build.

Document removal

DELETE /v1/rag/documents/{document_id}
Authorization: Bearer {jwt_token}
X-Tenant-ID: acme-corp-001

// Remove the document and all of its vectors (supports the GDPR right to erasure)
$client->removeDocument(documentId: 'annual-report-2024');

Performance specifications

Scenario	Metric
Single-document ingestion (50 pages)
Batch ingestion throughput (GPU)
Batch ingestion throughput (CPU)