文字萃取與文件分割¶
將 PDF 文件的文字內容轉換為結構化資料,是搜尋索引、AI 分析、資料遷移與合規稽核的基礎操作。NextPDF Pro 的 DocumentSegmentationEngine 解析 PDF 內容串流,輸出帶有位置資訊、字型屬性與語意分組的結構化文字物件。
核心物件模型¶
DocumentSegmentationEngine
└── extract(pdf) → ExtractedDocument
├── getPages() → list<ExtractedPage>
│ ├── getNumber() → int
│ ├── getWidth() → float (單位:pt)
│ ├── getHeight() → float
│ └── getBlocks() → list<ExtractedTextBlock>
│ ├── getBoundingBox() → BoundingBox
│ ├── getBlockType() → BlockType (paragraph/heading/table/list)
│ ├── getReadingOrder() → int
│ └── getLines() → list<ExtractedText>
│ ├── getText() → string
│ ├── getFontName() → string
│ ├── getFontSize() → float
│ ├── isBold() → bool
│ ├── isItalic() → bool
│ ├── getColor() → RgbColor
│ └── getBoundingBox() → BoundingBox
└── getMetadata() → DocumentMetadata
快速開始¶
use NextPDF\Pro\Converter\DocumentSegmentationEngine;
use NextPDF\Pro\Converter\ExtractionOptions;
use NextPDF\Pro\Converter\ExtractedDocument;
$pdf = file_get_contents('/documents/annual-report.pdf');
$engine = new DocumentSegmentationEngine(
ExtractionOptions::create()
->withReadingOrderDetection(true) // 智慧閱讀順序重建
->withTableDetection(true) // 表格結構偵測
->withHeadingDetection(true) // 標題層級識別
->withHyphenationMerge(true) // 合併跨行的連字符斷行
);
/** @var ExtractedDocument $document */
$document = $engine->extract($pdf);
echo 'Pages: ' . $document->getPageCount();
echo 'Total words: ' . $document->getTotalWordCount();
echo 'Language detected: ' . $document->getDetectedLanguage();
// 輸出純文字(保留段落結構)
echo $document->toPlainText(preserveParagraphs: true);
ExtractedPage — 頁面物件¶
use NextPDF\Pro\Converter\ExtractedPage;
foreach ($document->getPages() as $page) {
/** @var ExtractedPage $page */
printf("=== Page %d (%.0f x %.0f pt) ===\n",
$page->getNumber(),
$page->getWidth(),
$page->getHeight(),
);
// 按閱讀順序取得文字區塊
foreach ($page->getBlocks() as $block) {
printf(
"Block [%s] @ (%.1f, %.1f) reading_order=%d\n",
$block->getBlockType()->value,
$block->getBoundingBox()->getX(),
$block->getBoundingBox()->getY(),
$block->getReadingOrder(),
);
}
// 取得頁面全部純文字(按閱讀順序)
echo $page->toPlainText();
}
ExtractedTextBlock — 文字區塊¶
ExtractedTextBlock 代表一個語意連貫的文字群組(段落、標題、表格儲存格等):
use NextPDF\Pro\Converter\BlockType;
use NextPDF\Pro\Converter\ExtractedTextBlock;
foreach ($page->getBlocks() as $block) {
/** @var ExtractedTextBlock $block */
match ($block->getBlockType()) {
BlockType::Heading => printf("H%d: %s\n", $block->getHeadingLevel(), $block->toPlainText()),
BlockType::Paragraph => printf("P: %s\n", substr($block->toPlainText(), 0, 80)),
BlockType::Table => processTable($block),
BlockType::ListItem => printf("• %s\n", $block->toPlainText()),
BlockType::Footer => null, // 頁尾通常略過
default => null,
};
}
function processTable(ExtractedTextBlock $block): void
{
if (!$block->hasTableData()) {
return;
}
/** @var list<list<string>> $rows */
$rows = $block->getTableRows();
foreach ($rows as $rowIndex => $row) {
echo implode(' | ', $row) . "\n";
}
}
ExtractedText — 文字行物件¶
最細粒度的文字物件,包含完整的字型與位置屬性:
use NextPDF\Pro\Converter\ExtractedText;
foreach ($block->getLines() as $line) {
/** @var ExtractedText $line */
printf(
"Text: %-50s | Font: %-20s | Size: %5.1f | Bold: %s | Color: #%s\n",
$line->getText(),
$line->getFontName(),
$line->getFontSize(),
$line->isBold() ? 'Y' : 'N',
$line->getColor()->toHex(),
);
// 字元級別座標(適用於精確定位需求)
foreach ($line->getCharacters() as $char) {
printf(
" Char: %s @ x=%.2f y=%.2f w=%.2f\n",
$char->getValue(),
$char->getX(),
$char->getY(),
$char->getWidth(),
);
}
}
文件分割(Segmentation)¶
DocumentSegmentationEngine 除文字萃取外,亦支援依語意邊界分割文件——適用於批次文件包(例如一份 PDF 含多張發票):
use NextPDF\Pro\Converter\DocumentSegmentationEngine;
use NextPDF\Pro\Converter\Segmentation\SegmentationStrategy;
use NextPDF\Pro\Converter\Segmentation\DocumentSegment;
$engine = new DocumentSegmentationEngine();
/** @var list<DocumentSegment> $segments */
$segments = $engine->segment(
pdfBytes: $batchPdf,
strategy: SegmentationStrategy::PageBreakAndHeading,
// 其他策略:
// SegmentationStrategy::PageBreak — 每頁為一個文件
// SegmentationStrategy::Heading1 — H1 標題為分割點
// SegmentationStrategy::BlankPage — 空白頁為分割點
// SegmentationStrategy::Custom — 自訂分割規則
);
foreach ($segments as $index => $segment) {
printf(
"Segment %d: pages %d–%d | title: %s\n",
$index + 1,
$segment->getStartPage(),
$segment->getEndPage(),
$segment->getDetectedTitle() ?? '(untitled)',
);
// 萃取為獨立 PDF
$segmentPdf = $segment->toPdf();
file_put_contents(
sprintf('/output/segment-%02d.pdf', $index + 1),
$segmentPdf,
);
}
輸出格式¶
// 純文字輸出
$plainText = $document->toPlainText(preserveParagraphs: true);
// Markdown 輸出(保留標題層級與表格)
$markdown = $document->toMarkdown();
// JSON 輸出(完整結構資料)
$json = $document->toJson(includePositions: true, pretty: true);
// HTML 輸出(保留位置與樣式提示)
$html = $document->toHtml(
includeStyles: true,
wrapInDocument: true,
);
// CSV 輸出(所有偵測到的表格)
$tables = $document->getTables();
foreach ($tables as $table) {
$table->toCsv(path: '/output/extracted-table.csv');
}
搜尋索引整合範例¶
use NextPDF\Pro\Converter\DocumentSegmentationEngine;
use NextPDF\Pro\Converter\ExtractionOptions;
// 為搜尋引擎準備索引資料
$engine = new DocumentSegmentationEngine(
ExtractionOptions::create()
->withReadingOrderDetection(true)
->withHeadingDetection(true)
);
$document = $engine->extract($pdf);
$indexPayload = [
'document_id' => $documentId,
'title' => $document->getMetadata()->getTitle(),
'author' => $document->getMetadata()->getAuthor(),
'pages' => $document->getPageCount(),
'content' => $document->toPlainText(),
'headings' => array_map(
fn($h) => $h->toPlainText(),
$document->getHeadings(),
),
'page_snippets' => array_map(
fn($p) => ['page' => $p->getNumber(), 'text' => substr($p->toPlainText(), 0, 500)],
$document->getPages(),
),
];
// 送入 Elasticsearch / OpenSearch / Meilisearch
$searchClient->index($indexPayload);
已知限制¶
| 限制 | 說明 | 替代方案 |
|---|---|---|
| 掃描式 PDF | 圖片型 PDF 無嵌入文字,無法萃取 | 需搭配 OCR 引擎(Tesseract / AWS Textract) |
| 加密 PDF | 需提供使用者密碼才能萃取 | 使用 ExtractionOptions::withPassword() |
| 複雜 RTL 排版 | 阿拉伯語、希伯來語閱讀順序可能不準確 | 設定 withForceRtl(true) |
| XFA 表單 | 動態 XFA 的文字不在標準內容串流中 | 改用 表單資料萃取 |
相關資源¶
- 進階表單處理 — XFA/XFDF 資料萃取
- PDF 差異比較 — 基於萃取結果的文字比對
- DocumentSegmentationEngine API 參考
- ExtractedDocument API 參考