Redact PII from a PDF over Connect
At a glance
This recipe removes detected personally identifiable information (PII) from
a document’s text layer with the redaction tools exposed by NextPDF Connect.
These tools are Enterprise-tier. ToolRegistry builds redact_pdf,
zone_redact_pdf, and deidentify_pdf by probing for the Enterprise
privacy classes (RedactionEngine + PiiDetector) with class_exists().
It registers each tool under the enterprise tier only when those classes
are autoloadable. On an open-source-only install, the tools are absent: the
call fails with an unknown-tool error instead of degrading silently. All
three tools declare destructiveHint: true. The edit rewrites the page
content and is not reversible from the edited document.
This page documents tool behavior over the Connect surface. A redaction workflow does not certify that a document is free of personal data after the call. Detection runs only on the extractable text layer, and the deployment remains responsible for verifying the result.
Install
```bash
composer require nextpdf/server
```
The redaction tools register only when you install the Enterprise privacy
module alongside the server. It ships in nextpdf/premium. Confirm that the
tool is present on the running deployment before you rely on it:
```bash
./vendor/bin/nextpdf-mcp <<'EOF'
{"jsonrpc":"2.0","id":1,"method":"initialize","params":{"protocolVersion":"2025-06-18","capabilities":{},"clientInfo":{"name":"c","version":"1.0.0"}}}
{"jsonrpc":"2.0","method":"notifications/initialized"}
{"jsonrpc":"2.0","id":2,"method":"tools/list"}
EOF
```
If redact_pdf is missing from the tools/list result, the Enterprise
privacy classes did not resolve on this install. See /connect/tool-catalog/
to learn how the registry computes the per-tier tool set at boot.
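The presence check can be scripted. The following is a minimal Python sketch, assuming the server's stdout from the command above has been captured as text with one JSON-RPC message per line, and that the tools/list response uses the standard MCP shape (`result.tools[].name`). The helper name `missing_redaction_tools` is illustrative and not part of NextPDF.

```python
import json

REDACTION_TOOLS = ("redact_pdf", "zone_redact_pdf", "deidentify_pdf")


def missing_redaction_tools(stdout_text, list_request_id=2):
    """Return the redaction tool names absent from a tools/list response.

    stdout_text is the captured server output, one JSON-RPC message per line.
    list_request_id is the id sent with the tools/list request (2 above).
    """
    for line in stdout_text.splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            message = json.loads(line)
        except json.JSONDecodeError:
            continue  # ignore any non-JSON noise on stdout
        if not isinstance(message, dict) or message.get("id") != list_request_id:
            continue
        tools = (message.get("result") or {}).get("tools", [])
        present = {tool.get("name") for tool in tools}
        return [name for name in REDACTION_TOOLS if name not in present]
    raise ValueError("no tools/list response found in the output")
```

An empty list means all three tools registered. Any returned name points to an install where the Enterprise privacy classes did not resolve.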
Conceptual overview
Three tools cover three redaction strategies. All are Enterprise-tier, and all carry the destructive hint:
- redact_pdf: detects and removes personal data from the document’s plain-text content with a built-in detector, then returns the edited content and a structured report.
- zone_redact_pdf: applies coordinate-based zone redactions to the plain-text content. Use it when you know the region by position rather than by pattern.
- deidentify_pdf: applies a systematic de-identification strategy (redact or suppress) across detected entities.
Removing content from a page content stream destructively edits that stream: the affected bytes are rewritten and are not recoverable from the edited document (ISO 32000-2 §14.11). By design, the report records the character count and position of each removal, never the removed text itself.
API surface
The Enterprise package that defines each tool also ships its exact request
and response schema. This page documents the Connect invocation contract,
not a fixed parameter list. The tool names verified against the running
registry are redact_pdf, zone_redact_pdf, and deidentify_pdf, all in
the document category with destructiveHint: true. The catalog of record
is /connect/tool-catalog/. This recipe does not restate a tool count,
because that value is a runtime property of the deployment.
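A client can confirm the destructive hint at runtime instead of trusting a static list. This sketch assumes the hint is exposed in the standard MCP location, the `annotations` object of each entry in the tools/list result. That placement is an assumption about this deployment, and the function name is hypothetical.

```python
def tools_missing_destructive_hint(tools, expected_names):
    """Return expected tool names that are absent or lack destructiveHint: true.

    tools is the list under result.tools from a tools/list response.
    """
    by_name = {tool.get("name"): tool for tool in tools}
    flagged = []
    for name in expected_names:
        tool = by_name.get(name)
        annotations = (tool or {}).get("annotations") or {}
        if tool is None or annotations.get("destructiveHint") is not True:
            flagged.append(name)
    return flagged
```

A non-empty result is a reason to stop and check the catalog of record before calling any of the tools.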
Code sample — Quick start
Detect and remove content with a Model Context Protocol (MCP) tools/call request.
The arguments below show the call shape. The authoritative argument schema
is the one that tools/list returns on your deployment:
```json
{
  "jsonrpc": "2.0",
  "id": 3,
  "method": "tools/call",
  "params": {
    "name": "redact_pdf",
    "arguments": {
      "source": "/var/lib/nextpdf/in/employee-directory.pdf"
    }
  }
}
```
A successful call returns a report. For each removal, an entry records the page, a category label, the original character count, and a bounding box, not the removed text.
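Because the report carries only metadata, it can be summarized without handling any personal data. The sketch below assumes a hypothetical report shape: a top-level `removals` list whose entries have `page` and `category` keys. The real field names come from the Enterprise package schema and may differ.

```python
from collections import Counter


def summarize_removals(report):
    """Count removals per (page, category) from a redaction report.

    The "removals", "page", and "category" keys are assumed names, used
    here only to illustrate the per-removal entries described above.
    """
    counts = Counter()
    for entry in report.get("removals", []):
        counts[(entry["page"], entry["category"])] += 1
    return dict(counts)
```

A summary like this gives a reviewer a quick view of where the edits landed before opening the document itself.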
Code sample — Production
Treat the redaction call as a destructive operation, and inspect the report before you release the document. Over a networked transport, handle a transport failure and a tool-level error as separate cases:
```bash
if ! curl -sS -X POST https://connect.example.com/v1/tools/redact_pdf \
  -H 'Authorization: Bearer '"$NEXTPDF_CONNECT_TOKEN" \
  -H 'Content-Type: application/json' \
  -d '{"source":"/var/lib/nextpdf/in/legal-discovery-batch.pdf"}' \
  -o /tmp/redaction-report.json -w '%{http_code}' > /tmp/redaction-status; then
  # curl exited non-zero: no HTTP response arrived (DNS, TLS, connection).
  echo "redact_pdf transport failure; do not release the document" >&2
  exit 2
fi
status="$(cat /tmp/redaction-status)"
if [ "$status" != "200" ]; then
  # 4xx/5xx is a normal HTTP outcome the caller inspects, not a transport
  # failure.
  echo "redact_pdf returned HTTP $status; inspect the body, do not release the document" >&2
  exit 1
fi
```
Release the edited document only after a human or downstream control has reviewed the report. Holding the release behind that review places the control where the automated edit introduces residual-data risk (IEC 31010:2019).
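The same separation of cases can be sketched in Python with only the standard library. The endpoint path mirrors the curl example. The function names and the three outcome labels are illustrative, not part of the Connect API.

```python
import json
import urllib.error
import urllib.request


def post_tool(base_url, tool, token, arguments, timeout=60):
    """POST arguments to the REST tool endpoint.

    Returns (http_status, body_text). http_status is None when no HTTP
    response arrived at all (DNS, TLS, refused connection, timeout).
    """
    request = urllib.request.Request(
        f"{base_url.rstrip('/')}/v1/tools/{tool}",
        data=json.dumps(arguments).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status, response.read().decode("utf-8")
    except urllib.error.HTTPError as err:
        # The server answered with 4xx/5xx: keep the status and the body.
        return err.code, err.read().decode("utf-8", "replace")
    except OSError as err:
        # URLError and socket timeouts are OSError subclasses: transport failure.
        return None, str(err)


def classify_outcome(http_status):
    """Map an HTTP status (or None) to the three cases a caller handles."""
    if http_status is None:
        return "transport_failure"
    if http_status == 200:
        return "report_ready"
    return "tool_error"
```

Even the `report_ready` case only means a report exists. The document still waits for the review described above.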
Edge cases & gotchas
- Scanned PDF with no text layer. Detection runs on the extractable text layer. An image-only page yields zero removals and is not an error. If the content is rasterized, run optical character recognition (OCR) on the document before redaction.
- Encrypted source. Supply the document password through the tool’s argument schema. Without it, the call fails instead of processing only part of the document.
- Tool absent. On an open-source-only install, the Enterprise privacy classes do not resolve and redact_pdf is not registered, so the call fails with an unknown-tool error. This is the intended boundary, not a degradation.
- Overlapping detections. When more than one detector matches the same region, the tool removes the region once and de-duplicates the report.
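Since an image-only page produces zero removals without raising an error, a caller may want to flag such pages before redaction. This sketch assumes the per-page text has already been extracted by some other step into a dict keyed by page number. That input shape and the function name are assumptions made for illustration.

```python
def pages_without_text(page_texts):
    """Return the page numbers whose extracted text is empty or whitespace-only.

    page_texts maps a page number to the text extracted from that page,
    or to None when extraction returned nothing.
    """
    return sorted(
        page for page, text in page_texts.items() if not (text or "").strip()
    )
```

Pages in the result are candidates for OCR. A redaction report with no entries for them says nothing about whether they contain personal data.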
Performance
The performance budget in front-matter is a documentation cap, not a service-level guarantee. Large documents are processed page by page. If a large document exceeds your timeout, re-run the call on a page-range subset instead of raising a global timeout.
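Splitting a document into page-range subsets is simple arithmetic. The helper below is a hypothetical sketch that only computes the ranges. How a range is passed to the tool is not documented on this page, so check the argument schema that tools/list returns.

```python
def page_ranges(total_pages, chunk_size):
    """Split pages 1..total_pages into inclusive (first, last) ranges."""
    if total_pages < 0 or chunk_size < 1:
        raise ValueError("total_pages must be >= 0 and chunk_size >= 1")
    return [
        (first, min(first + chunk_size - 1, total_pages))
        for first in range(1, total_pages + 1, chunk_size)
    ]
```

For example, a 10-page document with a chunk size of 4 yields three ranges, the last one shorter than the others.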
Security notes
Data Residency & PII Mitigations
The Connect host processes document text in-process. The report deliberately omits removed text and reports only counts and positions, so the report does not re-introduce the personal data it describes. Deployment-level data residency for the input and the edited output is the integrator’s responsibility, not a property of the tool.
Safe Telemetry & Log Scrubbing
Do not log the source document path or the report body at an externally shipped log level. Log only the tool name, the request id, and the pass/fail outcome.
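One way to enforce this on the client side is an allowlist applied before anything reaches the logger. This is a minimal sketch. The field names (`tool`, `request_id`, `outcome`) and the logger name are assumptions, not a NextPDF convention.

```python
import logging

logger = logging.getLogger("connect.client")  # hypothetical logger name

# Only these keys may leave the process in a log record.
ALLOWED_LOG_FIELDS = ("tool", "request_id", "outcome")


def scrub_event(event):
    """Keep only allowlisted fields; drop the source path, report, and the rest."""
    return {key: event[key] for key in ALLOWED_LOG_FIELDS if key in event}


def log_tool_call(event):
    """Log a scrubbed copy of the event and return what was logged."""
    safe = scrub_event(event)
    logger.info("tool call: %s", safe)
    return safe
```

An allowlist fails closed: a new field added to the event later is dropped by default instead of leaking into shipped logs.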
Threat model
A redaction that visually covers text but does not remove it leaves the data extractable. These tools rewrite the affected content stream instead of overlaying a rectangle; recovering removed bytes from the edited document is not possible (ISO 32000-2 §14.11). Residual risk remains when the detector misses content: a pattern outside its rules, or text present only as a rasterized image. The workflow mitigates that risk with the mandatory report review, not with a claim of completeness.
FIPS-mode behavior
Redaction performs no cryptographic operation and is unaffected by a Federal Information Processing Standards (FIPS)-mode policy on the host.
Conformance
| Claim | Clause | reference_id |
|---|---|---|
| Removing content rewrites the affected content stream | ISO 32000-2 §14.11 | |
| Redaction marks then removes; the removal is a content edit | ISO 32000-2 §14.11 | |
| Control placed at the point the automated edit introduces risk | IEC 31010:2019 | |
Support for the redaction tools does not certify that a processed document is free of personal data. An independent review makes that determination.
Commercial context
The redaction tools are Enterprise-tier. They register only when
nextpdf/premium is installed alongside the server. See the conversion
link in front-matter.
Connect specifics
Transport availability (MCP / REST / gRPC)
You invoke the tools the same way over every transport that drives the
shared tool executor: MCP tools/call, the REST tool endpoint, and the
gRPC service. The argument schema is transport-independent. It is the one
returned by tools/list (MCP) or the service descriptor (gRPC).
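To illustrate the transport-independent schema, the sketch below wraps one arguments object for two transports. The MCP envelope follows the tools/call shape shown in the quick start, and the REST path follows the production curl example. The function names are hypothetical, and gRPC is omitted because it depends on the generated service stubs.

```python
import json


def as_mcp_call(tool, arguments, request_id):
    """Wrap the arguments in a JSON-RPC tools/call request."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    }


def as_rest_request(base_url, tool, arguments):
    """Return the (url, json_body) pair for the REST tool endpoint."""
    return f"{base_url.rstrip('/')}/v1/tools/{tool}", json.dumps(arguments)
```

The arguments object is passed through unchanged in both cases. Only the envelope differs.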
HITL risk tier
All three tools declare destructiveHint: true. When an operator raises a
tool to the approval_required risk level through a configuration override,
the call is gated behind the ConfirmationGate. The override may only raise
risk, never lower it. See /connect/hitl-risk-tiers/.
Confirmation gate JSON envelope
When the tool is gated and invoked without a valid token, the gate returns a challenge envelope of this form:
```json
{ "allowed": false, "challenge": "<human-readable text>", "token": "confirm_<nonce>" }
```
The caller re-invokes the same tool with arguments._confirmation_token set
to the issued token. The token is single-use. It binds the tool name, a
nonce, and a 300-second TTL, but not the arguments.
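The challenge-and-retry step can be sketched as two small helpers. The envelope keys (`allowed`, `challenge`, `token`) and the `_confirmation_token` argument come from the description above. The helper names are illustrative, and where the envelope appears inside a transport response is left to the caller.

```python
def is_challenge(result):
    """Return True when a tool result is a confirmation-gate challenge envelope."""
    return (
        isinstance(result, dict)
        and result.get("allowed") is False
        and str(result.get("token", "")).startswith("confirm_")
    )


def with_confirmation(arguments, challenge):
    """Return a copy of the arguments carrying the issued confirmation token."""
    if not is_challenge(challenge):
        raise ValueError("not a confirmation-gate challenge envelope")
    confirmed = dict(arguments)  # leave the caller's arguments untouched
    confirmed["_confirmation_token"] = challenge["token"]
    return confirmed
```

A caller would typically show the `challenge` text to a human approver before re-invoking. Given the single-use token and the 300-second TTL, a rejected retry most likely needs a fresh challenge, not a reuse of the old token.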
See also
- /connect/tool-catalog/: how the registry computes the per-tier tool set.
- /connect/hitl-risk-tiers/: the four-level risk model and the gate.
- /cookbook/connect/extract-text-content/: preview the extractable text before redacting.
- /cookbook/connect/digital-signature/: sign the edited document for chain-of-custody.