跳轉到

文字萃取與文件分割

將 PDF 文件的文字內容轉換為結構化資料,是搜尋索引、AI 分析、資料遷移與合規稽核的基礎操作。NextPDF Pro 的 DocumentSegmentationEngine 解析 PDF 內容串流,輸出帶有位置資訊、字型屬性與語意分組的結構化文字物件。


核心物件模型

DocumentSegmentationEngine
└── extract(pdf) → ExtractedDocument
    ├── getPages() → list<ExtractedPage>
    │   ├── getNumber() → int
    │   ├── getWidth() → float  (單位:pt)
    │   ├── getHeight() → float
    │   └── getBlocks() → list<ExtractedTextBlock>
    │       ├── getBoundingBox() → BoundingBox
    │       ├── getBlockType() → BlockType (paragraph/heading/table/list)
    │       ├── getReadingOrder() → int
    │       └── getLines() → list<ExtractedText>
    │           ├── getText() → string
    │           ├── getFontName() → string
    │           ├── getFontSize() → float
    │           ├── isBold() → bool
    │           ├── isItalic() → bool
    │           ├── getColor() → RgbColor
    │           └── getBoundingBox() → BoundingBox
    └── getMetadata() → DocumentMetadata

快速開始

use NextPDF\Pro\Converter\DocumentSegmentationEngine;
use NextPDF\Pro\Converter\ExtractionOptions;
use NextPDF\Pro\Converter\ExtractedDocument;

$pdf = file_get_contents('/documents/annual-report.pdf');

$engine = new DocumentSegmentationEngine(
    ExtractionOptions::create()
        ->withReadingOrderDetection(true)    // 智慧閱讀順序重建
        ->withTableDetection(true)           // 表格結構偵測
        ->withHeadingDetection(true)         // 標題層級識別
        ->withHyphenationMerge(true)         // 合併跨行的連字符斷行
);

/** @var ExtractedDocument $document */
$document = $engine->extract($pdf);

echo 'Pages: ' . $document->getPageCount();
echo 'Total words: ' . $document->getTotalWordCount();
echo 'Language detected: ' . $document->getDetectedLanguage();

// 輸出純文字(保留段落結構)
echo $document->toPlainText(preserveParagraphs: true);

ExtractedPage — 頁面物件

use NextPDF\Pro\Converter\ExtractedPage;

foreach ($document->getPages() as $page) {
    /** @var ExtractedPage $page */
    printf("=== Page %d (%.0f x %.0f pt) ===\n",
        $page->getNumber(),
        $page->getWidth(),
        $page->getHeight(),
    );

    // 按閱讀順序取得文字區塊
    foreach ($page->getBlocks() as $block) {
        printf(
            "Block [%s] @ (%.1f, %.1f) reading_order=%d\n",
            $block->getBlockType()->value,
            $block->getBoundingBox()->getX(),
            $block->getBoundingBox()->getY(),
            $block->getReadingOrder(),
        );
    }

    // 取得頁面全部純文字(按閱讀順序)
    echo $page->toPlainText();
}

ExtractedTextBlock — 文字區塊

ExtractedTextBlock 代表一個語意連貫的文字群組(段落、標題、表格儲存格等):

use NextPDF\Pro\Converter\BlockType;
use NextPDF\Pro\Converter\ExtractedTextBlock;

foreach ($page->getBlocks() as $block) {
    /** @var ExtractedTextBlock $block */

    match ($block->getBlockType()) {
        BlockType::Heading  => printf("H%d: %s\n", $block->getHeadingLevel(), $block->toPlainText()),
        BlockType::Paragraph => printf("P: %s\n", substr($block->toPlainText(), 0, 80)),
        BlockType::Table    => processTable($block),
        BlockType::ListItem => printf("• %s\n", $block->toPlainText()),
        BlockType::Footer   => null, // 頁尾通常略過
        default             => null,
    };
}

function processTable(ExtractedTextBlock $block): void
{
    if (!$block->hasTableData()) {
        return;
    }

    /** @var list<list<string>> $rows */
    $rows = $block->getTableRows();

    foreach ($rows as $rowIndex => $row) {
        echo implode(' | ', $row) . "\n";
    }
}

ExtractedText — 文字行物件

最細粒度的文字物件,包含完整的字型與位置屬性:

use NextPDF\Pro\Converter\ExtractedText;

foreach ($block->getLines() as $line) {
    /** @var ExtractedText $line */

    printf(
        "Text: %-50s | Font: %-20s | Size: %5.1f | Bold: %s | Color: #%s\n",
        $line->getText(),
        $line->getFontName(),
        $line->getFontSize(),
        $line->isBold() ? 'Y' : 'N',
        $line->getColor()->toHex(),
    );

    // 字元級別座標(適用於精確定位需求)
    foreach ($line->getCharacters() as $char) {
        printf(
            "  Char: %s @ x=%.2f y=%.2f w=%.2f\n",
            $char->getValue(),
            $char->getX(),
            $char->getY(),
            $char->getWidth(),
        );
    }
}

文件分割(Segmentation)

DocumentSegmentationEngine 除文字萃取外,亦支援依語意邊界分割文件——適用於批次文件包(例如一份 PDF 含多張發票):

use NextPDF\Pro\Converter\DocumentSegmentationEngine;
use NextPDF\Pro\Converter\Segmentation\SegmentationStrategy;
use NextPDF\Pro\Converter\Segmentation\DocumentSegment;

$engine = new DocumentSegmentationEngine();

/** @var list<DocumentSegment> $segments */
$segments = $engine->segment(
    pdfBytes: $batchPdf,
    strategy: SegmentationStrategy::PageBreakAndHeading,
    // 其他策略:
    // SegmentationStrategy::PageBreak — 每頁為一個文件
    // SegmentationStrategy::Heading1  — H1 標題為分割點
    // SegmentationStrategy::BlankPage — 空白頁為分割點
    // SegmentationStrategy::Custom    — 自訂分割規則
);

foreach ($segments as $index => $segment) {
    printf(
        "Segment %d: pages %d–%d | title: %s\n",
        $index + 1,
        $segment->getStartPage(),
        $segment->getEndPage(),
        $segment->getDetectedTitle() ?? '(untitled)',
    );

    // 萃取為獨立 PDF
    $segmentPdf = $segment->toPdf();
    file_put_contents(
        sprintf('/output/segment-%02d.pdf', $index + 1),
        $segmentPdf,
    );
}

輸出格式

// 純文字輸出
$plainText = $document->toPlainText(preserveParagraphs: true);

// Markdown 輸出(保留標題層級與表格)
$markdown = $document->toMarkdown();

// JSON 輸出(完整結構資料)
$json = $document->toJson(includePositions: true, pretty: true);

// HTML 輸出(保留位置與樣式提示)
$html = $document->toHtml(
    includeStyles: true,
    wrapInDocument: true,
);

// CSV 輸出(所有偵測到的表格)
$tables = $document->getTables();
foreach ($tables as $table) {
    $table->toCsv(path: '/output/extracted-table.csv');
}

搜尋索引整合範例

use NextPDF\Pro\Converter\DocumentSegmentationEngine;
use NextPDF\Pro\Converter\ExtractionOptions;

// 為搜尋引擎準備索引資料
$engine = new DocumentSegmentationEngine(
    ExtractionOptions::create()
        ->withReadingOrderDetection(true)
        ->withHeadingDetection(true)
);

$document = $engine->extract($pdf);

$indexPayload = [
    'document_id' => $documentId,
    'title'       => $document->getMetadata()->getTitle(),
    'author'      => $document->getMetadata()->getAuthor(),
    'pages'       => $document->getPageCount(),
    'content'     => $document->toPlainText(),
    'headings'    => array_map(
        fn($h) => $h->toPlainText(),
        $document->getHeadings(),
    ),
    'page_snippets' => array_map(
        fn($p) => ['page' => $p->getNumber(), 'text' => substr($p->toPlainText(), 0, 500)],
        $document->getPages(),
    ),
];

// 送入 Elasticsearch / OpenSearch / Meilisearch
$searchClient->index($indexPayload);

已知限制

限制 說明 替代方案
掃描式 PDF 圖片型 PDF 無嵌入文字,無法萃取 需搭配 OCR 引擎(Tesseract / AWS Textract)
加密 PDF 需提供使用者密碼才能萃取 使用 ExtractionOptions::withPassword()
複雜 RTL 排版 阿拉伯語、希伯來語閱讀順序可能不準確 設定 withForceRtl(true)
XFA 表單 動態 XFA 的文字不在標準內容串流中 改用 表單資料萃取

相關資源