Redact PII from a PDF over Connect

This recipe removes detected personally identifiable information (PII) from a document’s text layer with the redaction tools exposed by NextPDF Connect. These tools are Enterprise-tier. ToolRegistry builds redact_pdf, zone_redact_pdf, and deidentify_pdf by probing for the Enterprise privacy classes (RedactionEngine + PiiDetector) with class_exists(). It registers each tool under the enterprise tier only when those classes are autoloadable. On an open-source-only install, the tools are absent: the call fails with an unknown-tool error instead of degrading silently. All three tools declare destructiveHint: true. The edit rewrites the page content and is not reversible from the edited document.

This page documents tool behavior over the Connect surface. A redaction workflow does not certify that a document is free of personal data after the call. Detection runs only on the extractable text layer, and the deployment remains responsible for verifying the result.

Terminal window
composer require nextpdf/server

The redaction tools register only when you install the Enterprise privacy module alongside the server. It ships in nextpdf/premium. Confirm that the tool is present on the running deployment before you rely on it:

Terminal window
./vendor/bin/nextpdf-mcp <<'EOF'
{"jsonrpc":"2.0","id":1,"method":"initialize","params":{"protocolVersion":"2025-06-18","capabilities":{},"clientInfo":{"name":"c","version":"1.0.0"}}}
{"jsonrpc":"2.0","method":"notifications/initialized"}
{"jsonrpc":"2.0","id":2,"method":"tools/list"}
EOF

If redact_pdf is missing from the tools/list result, the Enterprise privacy classes did not resolve on this install. See /connect/tool-catalog/ to learn how the registry computes the per-tier tool set at boot.
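
The presence check above can be scripted. The following is a minimal sketch, assuming the server writes one JSON-RPC message per line to standard output (the usual MCP stdio framing) and that the tools/list result carries a `tools` array of objects with a `name` field. The helper name `missing_redaction_tools` is illustrative and not part of NextPDF.

```python
import json

# Tool names taken from this page; all three are Enterprise-tier.
REDACTION_TOOLS = ("redact_pdf", "zone_redact_pdf", "deidentify_pdf")


def missing_redaction_tools(stdout_text: str, request_id: int = 2) -> list[str]:
    """Return the redaction tools absent from the tools/list response.

    stdout_text is the captured server output, one JSON-RPC message per line.
    request_id is the id used for the tools/list request (2 in the example above).
    """
    for line in stdout_text.splitlines():
        line = line.strip()
        if not line:
            continue
        message = json.loads(line)
        if message.get("id") == request_id and "result" in message:
            tools = message["result"].get("tools", [])
            registered = {tool.get("name") for tool in tools}
            return [name for name in REDACTION_TOOLS if name not in registered]
    raise ValueError(f"no tools/list result with id {request_id} in the output")
```

An empty list means all three tools are registered. A non-empty list names the tools that the deployment does not expose.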

Three tools cover three redaction strategies. All are Enterprise-tier, and all carry the destructive hint:

  • redact_pdf — detects and removes personal data from the document’s plain-text content with a built-in detector, then returns the edited content and a structured report.
  • zone_redact_pdf — applies coordinate-based zone redactions to the plain-text content. Use it when you know the region by position rather than by pattern.
  • deidentify_pdf — applies a systematic de-identification strategy (redact or suppress) across detected entities.

Removing content from a page content stream destructively edits that stream: the affected bytes are rewritten and are not recoverable from the edited document (ISO 32000-2 §14.11). By design, the report records the character count and position of each removal, never the removed text itself.

The Enterprise package that defines each tool also ships its exact request and response schema. This page documents the Connect invocation contract, not a fixed parameter list. The tool names verified against the running registry are redact_pdf, zone_redact_pdf, and deidentify_pdf, all in the document category with destructiveHint: true. The catalog of record is /connect/tool-catalog/. This recipe does not restate a tool count, because that value is a runtime property of the deployment.

Detect and remove content with a Model Context Protocol (MCP) tools/call request. The arguments below show the call shape. The authoritative argument schema is the one that tools/list returns on your deployment:

{
  "jsonrpc": "2.0",
  "id": 3,
  "method": "tools/call",
  "params": {
    "name": "redact_pdf",
    "arguments": {
      "source": "/var/lib/nextpdf/in/employee-directory.pdf"
    }
  }
}

A successful call returns a report. For each removal, an entry records the page, a category label, the original character count, and a bounding box, not the removed text.
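
A reviewer usually wants a per-page and per-category overview before opening the document. The sketch below assumes a report shape that this page does not define: the `removals`, `page`, `category`, and `char_count` keys are hypothetical placeholders for whatever the Enterprise schema on your deployment returns.

```python
from collections import Counter


def summarize_removals(report: dict) -> dict:
    """Aggregate a redaction report for review.

    All key names here are hypothetical; adapt them to the schema that
    your deployment returns. The summary uses counts only, which matches
    the report's design of never carrying the removed text.
    """
    removals = report.get("removals", [])
    per_page = Counter(entry["page"] for entry in removals)
    per_category = Counter(entry["category"] for entry in removals)
    return {
        "total": len(removals),
        "per_page": dict(per_page),
        "per_category": dict(per_category),
        "characters_removed": sum(entry["char_count"] for entry in removals),
    }
```

A total of zero is worth flagging for review instead of treating it as a clean result, because an image-only document also produces no removals.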

Treat the redaction call as a destructive operation, and inspect the report before you release the document. Over a networked transport, handle a transport failure and a tool-level error as separate cases:

Terminal window
curl -sS -X POST https://connect.example.com/v1/tools/redact_pdf \
  -H 'Authorization: Bearer '"$NEXTPDF_CONNECT_TOKEN" \
  -H 'Content-Type: application/json' \
  -d '{"source":"/var/lib/nextpdf/in/legal-discovery-batch.pdf"}' \
  -o /tmp/redaction-report.json -w '%{http_code}' > /tmp/redaction-status
curl_exit=$?  # capture immediately; the next command overwrites $?
Terminal window
# Transport failure: curl itself exited non-zero (DNS, TLS, connection reset).
if [ "$curl_exit" -ne 0 ]; then
  echo "redact_pdf transport failure (curl exit $curl_exit); do not release the document" >&2
  exit 1
fi
status="$(cat /tmp/redaction-status)"
if [ "$status" != "200" ]; then
  # 4xx/5xx is a normal HTTP outcome the caller inspects, not a transport failure.
  echo "redact_pdf returned HTTP $status; inspect the body, do not release the document" >&2
  exit 1
fi
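
Over MCP, the same separation applies inside the JSON-RPC envelope. The sketch below assumes the standard MCP conventions: a protocol-level failure arrives as a JSON-RPC `error` object, and a tool-level failure arrives as a normal `result` with `isError` set to true. How NextPDF Connect reports an unregistered tool is not specified on this page, so treat the mapping as an assumption to verify on your deployment.

```python
def classify_tool_response(message: dict) -> str:
    """Classify one decoded JSON-RPC response to a tools/call request.

    Returns "protocol_error" for a JSON-RPC error object, "tool_error" when
    the tool ran and reported failure, and "ok" otherwise.
    """
    if "error" in message:
        return "protocol_error"
    result = message.get("result", {})
    if result.get("isError", False):
        return "tool_error"
    return "ok"
```

Only the "ok" outcome should move the document on to report review. The other two outcomes leave the source document as the only trustworthy copy.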

Release the edited document only after a human or downstream control has reviewed the report. Holding the release behind that review places the control where the automated edit introduces residual-data risk (IEC 31010:2019).

  • Scanned PDF with no text layer. Detection runs on the extractable text layer. An image-only page yields zero removals and is not an error. If the content is rasterized, run optical character recognition (OCR) on the document before redaction.
  • Encrypted source. Supply the document password through the tool’s argument schema. Without it, the call fails instead of processing only part of the document.
  • Tool absent. On an open-source-only install, the Enterprise privacy classes do not resolve and redact_pdf is not registered, so the call fails with an unknown-tool error. This is the intended boundary, not a degradation.
  • Overlapping detections. When more than one detector matches the same region, the tool removes the region once and de-duplicates the report.
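
To picture the overlapping-detections case, consider two detectors that flag intersecting character ranges. The sketch below merges half-open `(start, end)` spans so each region appears once. It is an illustration of the idea only, not the tool's implementation, and the span representation is an assumption.

```python
def merge_overlapping(spans: list[tuple[int, int]]) -> list[tuple[int, int]]:
    """Merge overlapping half-open (start, end) character spans.

    Spans that merely touch (one ends where the next starts) stay separate.
    """
    merged: list[tuple[int, int]] = []
    for start, end in sorted(spans):
        if merged and start < merged[-1][1]:
            # Overlaps the previous span: extend it instead of adding a new one.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged
```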

The performance budget in front-matter is a documentation cap, not a service-level guarantee. Large documents are processed page by page. Plan to re-run the call on a page-range subset instead of raising a global timeout.
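
Splitting a large document into page-range subsets can be sketched as follows. The helper only computes inclusive, 1-based ranges. Whether the tool accepts a page-range argument, and what that argument is called, is determined by the schema that tools/list returns on your deployment.

```python
def page_ranges(page_count: int, chunk_size: int) -> list[tuple[int, int]]:
    """Split pages 1..page_count into inclusive (first, last) ranges."""
    if page_count < 0:
        raise ValueError("page_count must not be negative")
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    return [
        (first, min(first + chunk_size - 1, page_count))
        for first in range(1, page_count + 1, chunk_size)
    ]
```

For a 10-page document and a chunk size of 4, this yields the ranges (1, 4), (5, 8), and (9, 10).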

The Connect host processes document text in-process. The report deliberately omits removed text and reports only counts and positions, so the report does not re-introduce the personal data it describes. Deployment-level data residency for the input and the edited output is the integrator’s responsibility, not a property of the tool.

Do not log the source document path or the report body at an externally shipped log level. Log only the tool name, the request id, and the pass/fail outcome.
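
A caller-side logging helper that follows this rule might look like the sketch below. The logger name and the line format are illustrative choices, not something NextPDF Connect prescribes. The function deliberately has no parameter for the source path or the report body, so neither can reach the log by accident.

```python
import logging

# Hypothetical logger name for the calling application.
logger = logging.getLogger("nextpdf.connect.client")


def log_outcome(tool_name: str, request_id: object, succeeded: bool) -> str:
    """Log the tool name, request id, and pass/fail outcome, and nothing else."""
    outcome = "pass" if succeeded else "fail"
    line = f"tool={tool_name} request_id={request_id} outcome={outcome}"
    logger.info(line)
    return line
```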

A redaction that visually covers text but does not remove it leaves the data extractable. These tools rewrite the affected content stream instead of overlaying a rectangle; recovering removed bytes from the edited document is not possible (ISO 32000-2 §14.11). Residual risk remains when the detector misses content: a pattern outside its rules, or text present only as a rasterized image. The workflow mitigates that risk with the mandatory report review, not with a claim of completeness.

Redaction performs no cryptographic operation and is unaffected by a Federal Information Processing Standards (FIPS)-mode policy on the host.

| Claim | Clause | reference_id |
| --- | --- | --- |
| Removing content rewrites the affected content stream | ISO 32000-2 §14.11 | |
| Redaction marks then removes; the removal is a content edit | ISO 32000-2 §14.11 | |
| Control placed at the point the automated edit introduces risk | IEC 31010:2019 | |

Support for the redaction tools does not certify that a processed document is free of personal data. An independent review makes that determination.

The redaction tools are Enterprise-tier. They register only when nextpdf/premium is installed alongside the server. See the conversion link in front-matter.

Transport availability (MCP / REST / gRPC)

You invoke the tools the same way over every transport that drives the shared tool executor: MCP tools/call, the REST tool endpoint, and the gRPC service. The argument schema is transport-independent. It is the one returned by tools/list (MCP) or the service descriptor (gRPC).

All three tools declare destructiveHint: true. When an operator raises a tool to the approval_required risk level through a configuration override, the call is gated behind the ConfirmationGate. The override may only raise risk, never lower it. See /connect/hitl-risk-tiers/.

When the tool is gated and invoked without a valid token, the gate returns a challenge envelope of this form:

{ "allowed": false, "challenge": "<human-readable text>", "token": "confirm_<nonce>" }

The caller re-invokes the same tool with arguments._confirmation_token set to the issued token. The token is single-use and binds the tool name, a nonce, and a 300-second TTL. It does not bind the arguments.
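
The challenge-and-retry exchange can be sketched on the client side as follows. The sketch assumes a hypothetical `call_tool(name, arguments)` wrapper that performs the transport call and returns the decoded response as a dictionary, with the challenge envelope already unwrapped. The `approve` callback stands in for the human or downstream control that decides whether to confirm.

```python
from typing import Callable


def is_challenge(response: dict) -> bool:
    """True when the response is a ConfirmationGate challenge envelope."""
    return response.get("allowed") is False and "token" in response


def call_with_confirmation(
    call_tool: Callable[[str, dict], dict],  # hypothetical transport wrapper
    name: str,
    arguments: dict,
    approve: Callable[[str], bool],  # receives the human-readable challenge
) -> dict:
    """Invoke a tool and answer a confirmation challenge at most once."""
    first = call_tool(name, arguments)
    if not is_challenge(first):
        return first
    if not approve(first.get("challenge", "")):
        raise PermissionError(f"confirmation for {name} was declined")
    # Re-send the same arguments with the issued token attached.
    confirmed = dict(arguments)
    confirmed["_confirmation_token"] = first["token"]
    return call_tool(name, confirmed)
```

Because the token does not bind the arguments, the sketch copies the original arguments unchanged into the second call, so the confirmed request is the one the approver saw. The retry should also complete within the 300-second TTL.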

  • /connect/tool-catalog/ — how the registry computes the per-tier tool set.
  • /connect/hitl-risk-tiers/ — the four-level risk model and the gate.
  • /cookbook/connect/extract-text-content/ — preview the extractable text before redacting.
  • /cookbook/connect/digital-signature/ — sign the edited document for chain-of-custody.