Ast: semantic document tree and serialization

The Ast module provides the engine’s semantic document abstract syntax tree (AST). It models a document as a typed node hierarchy: Document, Section, Heading, Paragraph, List, Table, Figure, Code, and FormField. The model records bounding boxes and citation anchors, and it serializes to versioned JavaScript Object Notation (JSON). The accessibility tagging layer uses this tree to produce a structure tree.

Stability: experimental. This is an internal model surface. Its classes do not carry version-frozen public application programming interface (API) guarantees. The node set and node attributes may change. The serialization schema is versioned independently (AstDocument::CURRENT_SCHEMA_VERSION = '1.0.0'). The serializer detects and rejects an incompatible schema, so persisted AST JSON keeps a stable contract even when the in-memory API changes.

composer require nextpdf/core:^3

Here, an AST represents a document’s logical structure. It is not a parser syntax tree for one input format. AstDocument is the container. It holds the root AstNode (which must be NodeType::Document), a schema version, a hash of the source Portable Document Format (PDF) file, and a page count. It rejects invalid construction, including an empty schema version, a page count below one, or the wrong root type.

AstNode is the recursive node. NodeType enumerates the semantic kinds. A node carries children, an optional BoundingBox, optional text content, and attributes validated by NodeAttributeSchema. The node API supports immutable derivation. withBboxAndText() returns a new node. deepClone() copies a subtree. NodeId is the value-object identity. CitationAnchor ties a node to a source location for traceability. AstNodeCollection is a Countable/IteratorAggregate set with ofType() filtering.
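
The difference between in-place mutation and immutable derivation can be sketched without the engine. SketchNode below is a hypothetical stand-in, not the NextPDF AstNode class. Its withText() method is a simplified, assumed analogue of withBboxAndText(), and the copy it makes is shallow, unlike what deepClone() is described as doing.

```php
<?php
declare(strict_types=1);

// Hypothetical stand-in for illustration only; not the NextPDF AstNode class.
final class SketchNode
{
    /** @var list<SketchNode> */
    private array $children = [];

    public function __construct(
        public readonly string $type,
        public readonly ?string $text = null,
    ) {}

    // Mutator: changes this node in place.
    public function addChild(SketchNode $child): void
    {
        $this->children[] = $child;
    }

    // Derivation: returns a modified copy and leaves $this untouched.
    // The copy is shallow: it shares the same child objects.
    public function withText(string $text): self
    {
        $copy = new self($this->type, $text);
        $copy->children = $this->children;
        return $copy;
    }

    public function childCount(): int
    {
        return count($this->children);
    }
}

$paragraph = new SketchNode('Paragraph');
$paragraph->addChild(new SketchNode('Text'));  // $paragraph itself now has one child

$derived = $paragraph->withText('Hello');      // new instance; $paragraph->text stays null
```

A caller that holds a reference to $paragraph sees the added child, but it never sees the text set on $derived. That asymmetry is the reason to check which kind of method is in use before sharing nodes.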

AstSerializer is the persistence boundary. serialize() writes an AstDocument to JSON. deserialize() reads it back. canDeserialize() and extractSchemaVersion() let you check compatibility before parsing, so a schema mismatch is a detected condition instead of a corrupt load. AstDocument::estimateTokenCount() helps size content for downstream token-bounded processing.
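
A version gate of this kind can be sketched in plain PHP. Everything below is an assumption made for illustration: the top-level "schema_version" key, the rule that versions are compatible when their major components match, and the sketch* function names. None of it is taken from AstSerializer.

```php
<?php
declare(strict_types=1);

// Assumed current version, mirroring the '1.0.0' value quoted for the schema.
const SKETCH_CURRENT_SCHEMA = '1.0.0';

// Assumption: the version lives in a top-level "schema_version" key.
function sketchExtractSchemaVersion(string $json): ?string
{
    try {
        $data = json_decode($json, true, 512, JSON_THROW_ON_ERROR);
    } catch (JsonException) {
        return null; // malformed JSON has no readable version
    }

    if (!is_array($data)) {
        return null;
    }

    $version = $data['schema_version'] ?? null;
    return is_string($version) ? $version : null;
}

// Assumption: versions are compatible when their major components match.
function sketchCanDeserialize(string $json): bool
{
    $found = sketchExtractSchemaVersion($json);
    if ($found === null) {
        return false;
    }

    return explode('.', $found)[0] === explode('.', SKETCH_CURRENT_SCHEMA)[0];
}
```

The point of the two-step shape is that a mismatch or a malformed payload becomes an ordinary false or null result that the caller can log, rather than a failure deep inside parsing.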

| Class | Key members | Role |
| --- | --- | --- |
| AstDocument | toJson(), nodeCount(), estimateTokenCount(), CURRENT_SCHEMA_VERSION | Root container; validates root type and schema |
| AstNode | addChild(), children(), childCount(), totalNodeCount(), withBboxAndText(), deepClone() | Recursive semantic node |
| NodeType (enum) | Document, Heading, Table, Figure, FormField, … | Semantic node kind |
| AstNodeCollection | add(), count(), isEmpty(), ofType(), toArray() | Iterable, type-filterable node set |
| AstSerializer | serialize(), deserialize(), canDeserialize(), extractSchemaVersion() | Versioned JSON persistence |
| BoundingBox | toArray(), equals() | Geometry value object (epsilon compare) |
| NodeId / CitationAnchor | toString(), equals(), toArray() | Node identity and source-traceability anchor |
| NodeAttributeSchema | attribute validation | Schema for node attributes |

Run composer docs:generate-api-php -- --module=Ast to generate the full PHPDoc table.

Build a small tree, then serialize it.

<?php
declare(strict_types=1);

require_once __DIR__ . '/../vendor/autoload.php';

use NextPDF\Ast\AstNode;
use NextPDF\Ast\AstSerializer;
use NextPDF\Ast\NodeType;

// Build a Document root with a Heading child and a Paragraph child.
$root = new AstNode(NodeType::Document);
$heading = new AstNode(NodeType::Heading);
$root->addChild($heading);
$root->addChild(new AstNode(NodeType::Paragraph));

echo "Nodes: {$root->totalNodeCount()}\n";

// serialize() takes an AstDocument, not a bare node. Wrap $root in an
// AstDocument first; its constructor arguments are not shown on this page.
$json = (new AstSerializer())->serialize(/* an AstDocument wrapping $root */);

Round-trip persisted AST defensively. Check schema compatibility before you deserialize untrusted JSON.

<?php
declare(strict_types=1);

require_once __DIR__ . '/../vendor/autoload.php';

use NextPDF\Ast\AstDocument;
use NextPDF\Ast\AstSerializer;
use Psr\Log\LoggerInterface;

final readonly class AstStore
{
    public function __construct(
        private AstSerializer $serializer,
        private LoggerInterface $logger,
    ) {}

    // Returns null when the JSON carries an incompatible schema version.
    public function load(string $json): ?AstDocument
    {
        // Check compatibility first so a mismatch is logged, not a failed load.
        if (!$this->serializer->canDeserialize($json)) {
            $this->logger->warning('AST JSON schema incompatible; rejected.', [
                'found_schema' => $this->serializer->extractSchemaVersion($json),
                'expected' => AstDocument::CURRENT_SCHEMA_VERSION,
            ]);
            return null;
        }

        return $this->serializer->deserialize($json);
    }
}

  • AstDocument requires the root node to be NodeType::Document. A tree with any other root throws at construction.
  • AstNode::withBboxAndText() and deepClone() return new instances and leave the original node untouched. addChild() mutates the node it is called on. Check which kind of method you are calling before you share a node between callers.
  • Always gate deserialize() with canDeserialize() for externally sourced JSON. A schema-version mismatch is a detectable, expected condition.
  • estimateTokenCount() is an estimate for sizing downstream processing, not an exact tokenizer count. Do not treat it as authoritative.
  • BoundingBox::equals() is an epsilon compare (default 0.001). Exact float equality is not the contract.
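
The epsilon comparison in the last point can be illustrated with a small, self-contained sketch. SketchBox is hypothetical and is not the NextPDF BoundingBox class. The field names, the strict less-than comparison, and the per-coordinate rule are assumptions; only the 0.001 default comes from the note above.

```php
<?php
declare(strict_types=1);

// Hypothetical value object for illustration; not the NextPDF BoundingBox class.
final class SketchBox
{
    public function __construct(
        public readonly float $x,
        public readonly float $y,
        public readonly float $width,
        public readonly float $height,
    ) {}

    // Two boxes are equal when every coordinate differs by less than $epsilon.
    public function equals(SketchBox $other, float $epsilon = 0.001): bool
    {
        return abs($this->x - $other->x) < $epsilon
            && abs($this->y - $other->y) < $epsilon
            && abs($this->width - $other->width) < $epsilon
            && abs($this->height - $other->height) < $epsilon;
    }
}

$a = new SketchBox(0.1 + 0.2, 10.0, 100.0, 20.0);
$b = new SketchBox(0.3, 10.0, 100.0, 20.0);

$exact = ($a->x === $b->x);  // false: 0.1 + 0.2 is not exactly 0.3 in binary floating point
$close = $a->equals($b);     // true: the difference is far below 0.001
```

Coordinates that pass through arithmetic or a JSON round trip can pick up rounding noise of this kind, which is why a tolerance is the more useful contract for geometry.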

Tree construction and traversal are O(n) in node count, and serialization is linear in tree size. Serialization is bitwise reproducible: the same tree always serializes to the same JSON bytes, which supports the schema’s role as a stable persistence contract. The default reference workload stays well inside the budget of 1500 ms wall-clock time and 64 MB peak memory.
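
One common way to obtain byte-identical JSON is to canonicalize key order before encoding. The sketch below shows that technique on plain arrays. It is an illustration only: the source does not say how AstSerializer achieves reproducibility, and the sketch* functions and the array shape are hypothetical.

```php
<?php
declare(strict_types=1);

// Sort associative keys recursively so insertion order cannot change the output.
// Lists keep their order, because child order is meaningful in a tree.
function sketchCanonicalize(array $value): array
{
    foreach ($value as $key => $item) {
        if (is_array($item)) {
            $value[$key] = sketchCanonicalize($item);
        }
    }

    if (!array_is_list($value)) {
        ksort($value);
    }

    return $value;
}

function sketchSerialize(array $tree): string
{
    return json_encode(
        sketchCanonicalize($tree),
        JSON_THROW_ON_ERROR | JSON_UNESCAPED_SLASHES,
    );
}

// Same logical tree, built with different key insertion orders.
$first  = ['type' => 'Document', 'children' => [['type' => 'Heading', 'text' => 'Intro']]];
$second = ['children' => [['text' => 'Intro', 'type' => 'Heading']], 'type' => 'Document'];

$sameBytes = sketchSerialize($first) === sketchSerialize($second);  // true
```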

AstSerializer::deserialize() parses JSON that may be persisted or transmitted. Validate compatibility with canDeserialize() first. Treat the deserialized tree’s text content and attributes as untrusted strings when they re-enter the application or are rendered. The module itself performs no input/output (I/O) and embeds no external data. See the engine threat model in /modules/core/security/.
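
As a minimal sketch of treating node text as untrusted, the function below escapes it at the point of output. It assumes an HTML rendering context, which the source does not specify, and sketchRenderText() is a hypothetical helper, not part of the module.

```php
<?php
declare(strict_types=1);

// Escape node text where it is written out, assuming the output is HTML.
function sketchRenderText(string $nodeText): string
{
    return '<p>'
        . htmlspecialchars($nodeText, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8')
        . '</p>';
}

// Text that arrived through deserialized AST JSON and was never validated.
$untrusted = '<script>alert("x")</script>';

$html = sketchRenderText($untrusted);
// $html is '<p>&lt;script&gt;alert(&quot;x&quot;)&lt;/script&gt;</p>'
```

Other sinks, such as SQL, shell commands, or log formats, need their own context-specific handling. The HTML case is only the most familiar one.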

This module asserts no PDF-specification normative claim. The semantic AST is an engine-internal abstraction. It does not implement a standardized document model whose clauses must be cited. Where the AST feeds accessibility tagging, the PDF/UA and tagged-PDF conformance of the output is documented and validated on /modules/core/accessibility/ and /modules/core/conformance/, not here.