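Document Ingestion Pipeline
The first stage of the RAG pipeline converts PDF documents into a searchable vector index. The NextPDF Enterprise ingestion pipeline consists of structure-aware parsing, smart chunking, GPU embedding, and dual indexing (vector + BM25).
Endpoints
POST /v1/rag/ingest
Ingests a single PDF document. The system automatically parses, chunks, embeds, and indexes it.
Request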
POST /v1/rag/ingest
Authorization: Bearer {jwt_token}
X-Tenant-ID: acme-corp-001
Content-Type: multipart/form-data; boundary=boundary

--boundary
Content-Disposition: form-data; name="document"; filename="annual-report-2024.pdf"
Content-Type: application/pdf

{pdf binary data}
--boundary
Content-Disposition: form-data; name="options"
Content-Type: application/json

{
  "document_id": "annual-report-2024",
  "title": "Annual Report 2024",
  "tags": ["financial", "2024", "annual"],
  "chunking_strategy": "structure_aware",
  "chunk_max_tokens": 512,
  "chunk_overlap_tokens": 64,
  "extract_tables": true,
  "extract_headers": true,
  "language_hint": "en"
}
--boundary--
Response (202 Accepted)
{
  "ingest_job_id": "job_01HX8K2N3P4Q5R6S7T8U9V0W",
  "document_id": "annual-report-2024",
  "status": "queued",
  "estimated_chunks": 247,
  "created_at": "2025-01-15T09:30:00Z",
  "status_url": "/v1/rag/ingest/jobs/job_01HX8K2N3P4Q5R6S7T8U9V0W"
}
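As a minimal sketch of how the two-part request above could be assembled by hand, the following Python function builds the multipart body with only the standard library. The function name `build_ingest_body` and its parameters are hypothetical and not part of any NextPDF client; sending the request (with the `Authorization` and `X-Tenant-ID` headers) is left to whichever HTTP library you use.

```python
import json
import uuid
from typing import Optional, Tuple


def build_ingest_body(pdf_bytes: bytes, filename: str, options: dict,
                      boundary: Optional[str] = None) -> Tuple[bytes, str]:
    """Return (body, content_type) for a multipart/form-data ingest request.

    Hypothetical helper: the field names "document" and "options" come from
    the request example above; everything else is illustrative.
    """
    boundary = boundary or uuid.uuid4().hex
    delimiter = b"--" + boundary.encode("ascii")
    disposition = (
        'Content-Disposition: form-data; name="document"; '
        'filename="{}"'.format(filename)
    )
    lines = [
        # Part 1: the PDF itself, under the "document" field.
        delimiter,
        disposition.encode("utf-8"),
        b"Content-Type: application/pdf",
        b"",
        pdf_bytes,
        # Part 2: ingestion options as JSON, under the "options" field.
        delimiter,
        b'Content-Disposition: form-data; name="options"',
        b"Content-Type: application/json",
        b"",
        json.dumps(options).encode("utf-8"),
        # The closing delimiter (with trailing "--") ends the body.
        delimiter + b"--",
        b"",
    ]
    content_type = "multipart/form-data; boundary={}".format(boundary)
    return b"\r\n".join(lines), content_type
```

The returned content type must be sent as the `Content-Type` header so the server can locate the part boundaries. A random boundary is generated by default to make a collision with the PDF bytes unlikely.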
POST /v1/rag/ingest-chunks
Submits pre-chunked passages directly, bypassing NextPDF's automatic chunking. Use this when you already have your own chunking logic:
Request
POST /v1/rag/ingest-chunks
Authorization: Bearer {jwt_token}
X-Tenant-ID: acme-corp-001
Content-Type: application/json

{
  "document_id": "contract-2025-001",
  "chunks": [
    {
      "chunk_id": "c001",
      "text": "This Service Agreement is entered into as of January 15, 2025...",
      "metadata": {
        "page_number": 1,
        "section": "Introduction",
        "document_type": "legal_contract",
        "headings": ["Service Agreement", "1. Definitions"]
      }
    },
    {
      "chunk_id": "c002",
      "text": "1.1 \"Service\" means the PDF generation and processing capabilities...",
      "metadata": {
        "page_number": 1,
        "section": "1. Definitions",
        "headings": ["1. Definitions"]
      }
    }
  ]
}
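If your own chunker already produces text with metadata, the request body above can be generated rather than written by hand. The sketch below is one possible way to do that in Python; the helper name `build_chunks_payload`, its input shape, and the sequential `c001`-style IDs are assumptions made for illustration, and the metadata is passed through unchanged.

```python
from typing import Dict, List


def build_chunks_payload(document_id: str, sections: List[Dict]) -> Dict:
    """Build an ingest-chunks request body from pre-chunked sections.

    Hypothetical input shape: each section is {"text": str, "metadata": dict}.
    """
    chunks: List[Dict] = []
    for section in sections:
        text = section["text"].strip()
        if not text:
            # Skip blank chunks instead of submitting empty text.
            continue
        chunks.append({
            # Sequential IDs such as c001, c002 (an illustrative scheme).
            "chunk_id": "c{:03d}".format(len(chunks) + 1),
            "text": text,
            "metadata": dict(section.get("metadata", {})),
        })
    return {"document_id": document_id, "chunks": chunks}
```

Serialize the result with `json.dumps` and send it with `Content-Type: application/json`.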
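Structure-Aware Chunking
The structure_aware chunking strategy uses NextPDF's document structure parsing to split at semantic boundaries (headings, paragraphs, sections):
flowchart TD
    A[PDF document] --> B[Structure parsing]
    B --> C[Identify heading hierarchy H1/H2/H3]
    C --> D[Identify paragraph boundaries]
    D --> E[Identify tables / images]
    E --> F{"Paragraph length > max_tokens?"}
    F -->|Yes| G[Split at sentence boundaries]
    F -->|No| H[Keep paragraph intact]
    G --> I[Add overlap context]
    H --> I
    I --> J["Tag metadata (page number, heading path)"]
Chunking Strategy Options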
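| Strategy | Description | Best for |
|---|---|---|
| structure_aware | Splits at document structure boundaries (recommended) | Documents with a clear section structure |
| fixed_token | Splits into fixed token counts (with overlap) | Scanned documents, unstructured PDFs |
| sentence | Splits at sentence boundaries | Documents with many short paragraphs |
| paragraph | Splits at paragraph boundaries | Press releases, reports |
| page | Splits at page boundaries (coarsest granularity) | Documents with self-contained pages (forms, slide decks) |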
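The structure-aware flow in the flowchart above (keep short paragraphs whole, split long ones at sentence boundaries, add overlap, tag metadata) can be approximated in a few lines of Python. This is a rough sketch, not NextPDF's implementation: it assumes paragraphs have already been extracted with their page number and heading path, counts tokens as whitespace-separated words, and uses a naive punctuation-based sentence splitter. All function names are hypothetical.

```python
import re
from typing import Dict, List


def count_tokens(text: str) -> int:
    # Assumption: whitespace-separated words stand in for tokenizer tokens.
    return len(text.split())


def split_sentences(paragraph: str) -> List[str]:
    # Naive boundary: split after ., !, or ? when followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", paragraph.strip()) if s]


def pack_sentences(sentences: List[str], max_tokens: int) -> List[str]:
    # Greedily group sentences so each piece stays within max_tokens.
    # A single sentence longer than max_tokens is kept whole.
    pieces: List[str] = []
    current: List[str] = []
    current_len = 0
    for sentence in sentences:
        n = count_tokens(sentence)
        if current and current_len + n > max_tokens:
            pieces.append(" ".join(current))
            current, current_len = [], 0
        current.append(sentence)
        current_len += n
    if current:
        pieces.append(" ".join(current))
    return pieces


def chunk_paragraphs(paragraphs: List[Dict], max_tokens: int = 512,
                     overlap_tokens: int = 64) -> List[Dict]:
    """paragraphs: [{"text", "page_number", "headings"}] in reading order."""
    chunks: List[Dict] = []
    previous_words: List[str] = []
    for para in paragraphs:
        text = para["text"].strip()
        if count_tokens(text) > max_tokens:
            pieces = pack_sentences(split_sentences(text), max_tokens)
        else:
            pieces = [text]
        for piece in pieces:
            # Prefix the tail of the previous piece as overlap context.
            overlap = previous_words[-overlap_tokens:] if overlap_tokens > 0 else []
            chunks.append({
                "text": " ".join(overlap + [piece]),
                "metadata": {
                    "page_number": para["page_number"],
                    "headings": list(para["headings"]),
                },
            })
            previous_words = piece.split()
    return chunks
```

In this sketch the overlap is added on top of the piece, so a chunk can hold up to `max_tokens + overlap_tokens` words. Whether the real pipeline counts overlap against the limit is not stated here.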
Ingestion Progress Tracking
GET /v1/rag/ingest/jobs/{job_id}
{
  "ingest_job_id": "job_01HX8K2N3P4Q5R6S7T8U9V0W",
  "document_id": "annual-report-2024",
  "status": "embedding",
  "progress": {
    "total_chunks": 247,
    "parsed_chunks": 247,
    "embedded_chunks": 183,
    "indexed_chunks": 183,
    "percent_complete": 74
  },
  "started_at": "2025-01-15T09:30:01Z",
  "estimated_completion": "2025-01-15T09:30:08Z"
}
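Because ingestion is asynchronous, a client typically polls the job endpoint above until the job finishes. The following Python sketch shows one way to structure that loop with a timeout. The HTTP call is injected as `fetch_job` so the example stays library-agnostic, and the terminal status names `completed` and `failed` are assumptions, since this section only shows `queued` and `embedding`.

```python
import time
from typing import Callable, Dict

# Assumption: terminal status names are not listed in this section;
# "completed" and "failed" are placeholders to adjust for the real API.
TERMINAL_STATUSES = {"completed", "failed"}


def wait_for_ingest(fetch_job: Callable[[str], Dict], job_id: str,
                    timeout_seconds: float = 60.0,
                    poll_interval_seconds: float = 0.5,
                    sleep: Callable[[float], None] = time.sleep,
                    clock: Callable[[], float] = time.monotonic) -> Dict:
    """Poll fetch_job(job_id) until a terminal status or the timeout.

    fetch_job is a hypothetical callable that performs
    GET /v1/rag/ingest/jobs/{job_id} and returns the decoded JSON body.
    """
    deadline = clock() + timeout_seconds
    while True:
        job = fetch_job(job_id)
        if job["status"] in TERMINAL_STATUSES:
            return job
        if clock() >= deadline:
            raise TimeoutError(
                "ingest job {} still '{}' after {}s".format(
                    job_id, job["status"], timeout_seconds
                )
            )
        sleep(poll_interval_seconds)
```

Injecting `sleep` and `clock` keeps the loop testable without real delays. In production the defaults (`time.sleep` and `time.monotonic`) apply.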
Status Transitions
PHP Client
use NextPDF\Enterprise\AiRag\RagClient;
use NextPDF\Enterprise\AiRag\IngestOptions;
use NextPDF\Enterprise\AiRag\ChunkingStrategy;

$client = RagClient::fromEnvironment();

// Read the PDF; file_get_contents() returns false on failure
$pdfBytes = file_get_contents('annual-report-2024.pdf');
if ($pdfBytes === false) {
    throw new \RuntimeException('Unable to read annual-report-2024.pdf');
}

// Asynchronous ingestion
$job = $client->ingest(
    documentId: 'annual-report-2024',
    pdfBytes: $pdfBytes,
    options: IngestOptions::create()
        ->withTitle('Annual Report 2024')
        ->withTags(['financial', '2024'])
        ->withChunkingStrategy(ChunkingStrategy::StructureAware)
        ->withChunkMaxTokens(512)
        ->withChunkOverlapTokens(64)
        ->withTableExtraction(true),
);

// Wait for completion (with a timeout)
$completedJob = $client->waitForIngest(
    jobId: $job->ingestJobId(),
    timeoutSeconds: 60,
    pollIntervalMs: 500,
);

echo 'Ingestion complete: ' . $completedJob->progress()->indexedChunks() . ' chunks indexed';
PHP Compatibility
This example uses PHP 8.5 syntax. If your environment runs PHP 8.1 or 7.4, use NextPDF Backport for a backward-compatible build.
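Document Removal
Performance Specifications
| Scenario | Metric |
|---|---|
| Single-document ingestion (50 pages) | |
| Batch ingestion throughput (GPU) | |
| Batch ingestion throughput (CPU) | |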