Redact PII from a PDF over Connect

At a glance

This recipe removes detected personally identifiable information (PII) from a document’s text layer with the redaction tools exposed by NextPDF Connect. These tools are Enterprise-tier. ToolRegistry builds redact_pdf, zone_redact_pdf, and deidentify_pdf by probing for the Enterprise privacy classes (RedactionEngine + PiiDetector) with class_exists(). It registers each tool under the enterprise tier only when those classes are autoloadable. On an open-source-only install, the tools are absent: the call fails with an unknown-tool error instead of degrading silently. All three tools declare destructiveHint: true. The edit rewrites the page content and is not reversible from the edited document.

This page documents tool behavior over the Connect surface. A redaction workflow does not certify that a document is free of personal data after the call. Detection runs only on the extractable text layer, and the deployment remains responsible for verifying the result.

Install

composer require nextpdf/server

The redaction tools register only when you install the Enterprise privacy module alongside the server. It ships in nextpdf/premium. Confirm that the tool is present on the running deployment before you rely on it:

./vendor/bin/nextpdf-mcp <<'EOF'
{"jsonrpc":"2.0","id":1,"method":"initialize","params":{"protocolVersion":"2025-06-18","capabilities":{},"clientInfo":{"name":"c","version":"1.0.0"}}}
{"jsonrpc":"2.0","method":"notifications/initialized"}
{"jsonrpc":"2.0","id":2,"method":"tools/list"}
EOF

If redact_pdf is missing from the tools/list result, the Enterprise privacy classes did not resolve on this install. See /connect/tool-catalog/ to learn how the registry computes the per-tier tool set at boot.
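
The presence check can also be automated. The sketch below is a minimal example that assumes you have captured the tools/list response line (the message with id 2) as a string; the helper name missing_redaction_tools is hypothetical and not part of NextPDF Connect. It relies only on the standard MCP result shape, in which result.tools is a list of objects that carry a name.

```python
import json

# Tool names taken from this page; the registry on your deployment is authoritative.
REDACTION_TOOLS = ("redact_pdf", "zone_redact_pdf", "deidentify_pdf")


def missing_redaction_tools(tools_list_response: str) -> list[str]:
    """Return the redaction tool names absent from a tools/list JSON-RPC response."""
    message = json.loads(tools_list_response)
    tools = message.get("result", {}).get("tools", [])
    present = {tool.get("name") for tool in tools}
    # Preserve the documented order so the output is stable.
    return [name for name in REDACTION_TOOLS if name not in present]
```

An empty list means all three tools are registered. A non-empty list names the tools that did not register, which on this page means the Enterprise privacy classes did not resolve.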

Conceptual overview

Three tools cover three redaction strategies. All are Enterprise-tier, and all carry the destructive hint:

redact_pdf — detects and removes personal data from the document’s plain-text content with a built-in detector, then returns the edited content and a structured report.
zone_redact_pdf — applies coordinate-based zone redactions to the plain-text content. Use it when you know the region by position rather than by pattern.
deidentify_pdf — applies a systematic de-identification strategy (redact or suppress) across detected entities.

Removing content from a page content stream destructively edits that stream: the affected bytes are rewritten and are not recoverable from the edited document (ISO 32000-2 §14.11). By design, the report records the character count and position of each removal, never the removed text itself.

API surface

The Enterprise package that defines each tool also ships its exact request and response schema. This page documents the Connect invocation contract, not a fixed parameter list. The tool names verified against the running registry are redact_pdf, zone_redact_pdf, and deidentify_pdf, all in the document category with destructiveHint: true. The catalog of record is /connect/tool-catalog/. This recipe does not restate a tool count, because that value is a runtime property of the deployment.

Code sample — Quick start

Detect and remove content with a Model Context Protocol (MCP) tools/call request. The arguments below show the call shape only. The authoritative argument schema is the one that tools/list returns on your deployment:

{
  "jsonrpc": "2.0",
  "id": 3,
  "method": "tools/call",
  "params": {
    "name": "redact_pdf",
    "arguments": {
      "source": "/var/lib/nextpdf/in/employee-directory.pdf"
    }
  }
}

A successful call returns a report. For each removal, an entry records the page, a category label, the original character count, and a bounding box, not the removed text.
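
A reviewer usually wants a summary before reading individual entries. The sketch below is illustrative only: the key names removals, page, category, and char_count are assumptions made for this example, because the real report schema ships with the Enterprise package and is not restated on this page. Adjust the keys to match the report your deployment returns.

```python
from collections import Counter


def summarize_removals(report: dict) -> dict:
    """Summarize a redaction report without touching any document text.

    Assumed (hypothetical) shape: {"removals": [{"page", "category",
    "char_count", "bbox"}, ...]}.
    """
    removals = report.get("removals", [])
    per_category = Counter(entry["category"] for entry in removals)
    return {
        "removal_count": len(removals),
        "per_category": dict(per_category),
        "pages": sorted({entry["page"] for entry in removals}),
        "total_chars": sum(entry["char_count"] for entry in removals),
    }
```

A removal_count of zero is worth flagging for review instead of treating it as a clean result, because an image-only document also yields zero removals.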

Code sample — Production

Treat the redaction call as a destructive operation, and inspect the report before you release the document. Over a networked transport, handle a transport failure and a tool-level error as separate cases:

curl -sS -X POST https://connect.example.com/v1/tools/redact_pdf \
  -H 'Authorization: Bearer '"$NEXTPDF_CONNECT_TOKEN" \
  -H 'Content-Type: application/json' \
  -d '{"source":"/var/lib/nextpdf/in/legal-discovery-batch.pdf"}' \
  -o /tmp/redaction-report.json -w '%{http_code}' > /tmp/redaction-status || {
  # curl exits non-zero when no HTTP response arrived (DNS, connection, TLS,
  # or timeout). This is the transport-failure case.
  echo "redact_pdf transport failure; no report was produced" >&2
  exit 2
}

status="$(cat /tmp/redaction-status)"
if [ "$status" != "200" ]; then
  # 4xx/5xx is a normal HTTP outcome the caller inspects, not a transport
  # failure. The response body is in /tmp/redaction-report.json.
  echo "redact_pdf returned HTTP $status; inspect the body, do not release the document" >&2
  exit 1
fi

Release the edited document only after a human or downstream control has reviewed the report. Holding the release behind that review places the control where the automated edit introduces residual-data risk (IEC 31010:2019).
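
The same separation of transport failures from tool-level errors can be sketched in a client library. The example below is a minimal sketch that assumes the REST endpoint and the source argument shown in the curl sample above; the function name call_redact_pdf and the injectable send parameter are hypothetical conveniences for testing, not part of any NextPDF client.

```python
import json
import urllib.error
import urllib.request


def call_redact_pdf(url, token, source, send=urllib.request.urlopen, timeout=30.0):
    """POST one redact_pdf call and classify the outcome.

    Returns (kind, http_status, body), where kind is "ok", "tool_error",
    or "transport_failure".
    """
    request = urllib.request.Request(
        url,
        data=json.dumps({"source": source}).encode("utf-8"),
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )
    try:
        with send(request, timeout=timeout) as response:
            status, body = response.status, response.read()
    except urllib.error.HTTPError as exc:
        # The server answered with 4xx/5xx: a tool-level outcome with a body.
        return ("tool_error", exc.code, exc.read())
    except (urllib.error.URLError, OSError) as exc:
        # No HTTP response at all: DNS, connection, TLS, or timeout failure.
        return ("transport_failure", None, str(exc).encode("utf-8"))
    kind = "ok" if status == 200 else "tool_error"
    return (kind, status, body)
```

Only the "ok" outcome produces a report to review. Both other outcomes leave the document unreleased; a transport failure may be retried, while a tool error calls for inspecting the body first.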

Edge cases & gotchas

Scanned PDF with no text layer. Detection runs on the extractable text layer. An image-only page yields zero removals and is not an error. If the content is rasterized, run optical character recognition (OCR) on the document before redaction.
Encrypted source. Supply the document password through the tool’s argument schema. Without it, the call fails instead of processing only part of the document.
Tool absent. On an open-source-only install, the Enterprise privacy classes do not resolve and redact_pdf is not registered, so the call fails with an unknown-tool error. This is the intended boundary, not a degradation.
Overlapping detections. When more than one detector matches the same region, the tool removes the region once and de-duplicates the report.
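
To make the overlapping-detections case concrete, the sketch below merges overlapping matches so each region is removed once. It is an illustration of the idea, not the tool's implementation: it models a region as a half-open character span (start, end), which is an assumption made for this example, and the function name merge_overlapping is hypothetical.

```python
def merge_overlapping(spans: list[tuple[int, int]]) -> list[tuple[int, int]]:
    """Merge overlapping or touching half-open spans into disjoint spans."""
    merged: list[tuple[int, int]] = []
    for start, end in sorted(spans):
        if merged and start <= merged[-1][1]:
            # The span overlaps or touches the previous one: extend it.
            prev_start, prev_end = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end))
        else:
            merged.append((start, end))
    return merged
```

For example, two detectors that match characters 10 to 20 and 15 to 25 produce one merged span from 10 to 25, so the report carries a single entry for that region.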

Performance

The performance budget in front-matter is a documentation cap, not a service-level guarantee. Large documents are processed page by page. Plan to re-run the call on a page-range subset instead of raising a global timeout.
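
Splitting a large document into page-range subsets can be sketched as follows. This is a minimal helper under stated assumptions: the function name page_ranges is hypothetical, pages are treated as 1-based and inclusive, and how a range is passed to the tool depends on the argument schema that tools/list returns on your deployment.

```python
def page_ranges(total_pages: int, pages_per_call: int) -> list[tuple[int, int]]:
    """Split 1..total_pages into inclusive (first, last) ranges."""
    if total_pages < 0 or pages_per_call < 1:
        raise ValueError("total_pages must be >= 0 and pages_per_call >= 1")
    return [
        (first, min(first + pages_per_call - 1, total_pages))
        for first in range(1, total_pages + 1, pages_per_call)
    ]
```

Each range can then be submitted as its own call, so a slow range is retried on its own without a global timeout increase.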

Security notes

Data Residency & PII Mitigations

The Connect host processes document text in-process. The report deliberately omits removed text and reports only counts and positions, so the report does not re-introduce the personal data it describes. Deployment-level data residency for the input and the edited output is the integrator’s responsibility, not a property of the tool.

Safe Telemetry & Log Scrubbing

Do not log the source document path or the report body at an externally shipped log level. Log only the tool name, the request id, and the pass/fail outcome.
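
One way to enforce this rule is an allow-list applied before any event reaches the logger. The sketch below is illustrative: the field names tool, request_id, and outcome, and the helper name scrub_event, are assumptions for this example and not a NextPDF logging API.

```python
import json

# Only these fields may leave the process in a log line.
ALLOWED_FIELDS = ("tool", "request_id", "outcome")


def scrub_event(event: dict) -> dict:
    """Keep only allow-listed fields; drop the source path, report, and the rest."""
    return {key: event[key] for key in ALLOWED_FIELDS if key in event}


def to_log_line(event: dict) -> str:
    """Serialize the scrubbed event as one stable JSON line."""
    return json.dumps(scrub_event(event), sort_keys=True)
```

An allow-list fails closed: a field added to the event later is dropped from the log line until someone adds it to ALLOWED_FIELDS deliberately.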

Threat model

A redaction that visually covers text but does not remove it leaves the data extractable. These tools rewrite the affected content stream instead of overlaying a rectangle; recovering removed bytes from the edited document is not possible (ISO 32000-2 §14.11). Residual risk remains when the detector misses content: a pattern outside its rules, or text present only as a rasterized image. The workflow mitigates that risk with the mandatory report review, not with a claim of completeness.

FIPS-mode behavior

Redaction performs no cryptographic operation and is unaffected by a Federal Information Processing Standards (FIPS)-mode policy on the host.

Conformance

Claim	Clause	reference_id
Removing content rewrites the affected content stream	ISO 32000-2 §14.11
Redaction marks then removes; the removal is a content edit	ISO 32000-2 §14.11
Control placed at the point the automated edit introduces risk	IEC 31010:2019

Support for the redaction tools does not certify that a processed document is free of personal data. An independent review makes that determination.

Commercial context

The redaction tools are Enterprise-tier. They register only when nextpdf/premium is installed alongside the server. See the conversion link in front-matter.

Connect specifics

Transport availability (MCP / REST / gRPC)

You invoke the tools the same way over every transport that drives the shared tool executor: MCP tools/call, the REST tool endpoint, and the gRPC service. The argument schema is transport-independent. It is the one returned by tools/list (MCP) or the service descriptor (gRPC).

HITL risk tier

All three tools declare destructiveHint: true. When an operator raises a tool to the approval_required risk level through a configuration override, the call is gated behind the ConfirmationGate. The override may only raise risk, never lower it. See /connect/hitl-risk-tiers/.

Confirmation gate JSON envelope

When the tool is gated and invoked without a valid token, the gate returns a challenge envelope of this form:

{ "allowed": false, "challenge": "<human-readable text>", "token": "confirm_<nonce>" }

The caller re-invokes the same tool with arguments._confirmation_token set to the issued token. The token binds the tool name, a nonce, and a 300-second TTL, but not the arguments, and it is single-use.
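
The challenge-and-confirm round trip can be sketched on the client side as follows. This is a minimal sketch, assuming the envelope fields shown above (allowed, challenge, token). The call_tool and approve callables are hypothetical stand-ins for your transport client and your human-approval step.

```python
from typing import Any, Callable


def is_challenge(result: Any) -> bool:
    """True when the result looks like the confirmation gate's challenge envelope."""
    return isinstance(result, dict) and result.get("allowed") is False and "token" in result


def call_with_confirmation(
    call_tool: Callable[[str, dict], Any],
    name: str,
    arguments: dict,
    approve: Callable[[str], bool],
) -> Any:
    """Invoke a tool and, if it is gated, re-invoke it once with the issued token."""
    first = call_tool(name, arguments)
    if not is_challenge(first):
        return first
    # Show the human-readable challenge to an operator before confirming.
    if not approve(first["challenge"]):
        raise PermissionError(f"confirmation declined for {name}")
    # Copy the arguments so the caller's dict is not mutated.
    confirmed = dict(arguments, _confirmation_token=first["token"])
    return call_tool(name, confirmed)
```

The sketch resends the original arguments unchanged with the token. Because the token is single-use and expires after 300 seconds, a second challenge after a slow approval should start a new round trip instead of reusing the old token.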
