Redactor, Local PII Scrubber

The friction-killer for Module 2. Abstracting client facts by hand is the step people skip, so they paste the raw file instead. This tool does it automatically: drop in a client note, get back an abstracted fact pattern that's safe to share plus a local key that stays on your machine.

Demo vs. take-home, read this first

There are two browser files here, and the difference matters:

redactor.html is the live demo hosted on the Lab site. It loads a fake sample and includes an "Ask Claude" step that sends the redacted sample to the cloud through the Lab's API key. It exists to show how the workflow feels. It is not for real client data — don't paste real PII into the hosted page.
redactor-local.html is the take-home tool. A member downloads it, saves it, and runs it fully offline (no server, no external scripts, no cloud step — scrub + re-attach only). This is the one that actually honors "the client's identity never leaves your machine." It's what real client work uses.

Three ways to run it, pick by who you are

File	Best for	How
`redactor.html`	A live demo to understand the workflow (fake data only)	Open it on the Lab site; click See it in action. The "Ask Claude" step is cloud-backed — demo only.
`redactor-local.html` ⭐	The take-home tool, real files, no install, no terminal	Download it, save to your machine, turn WiFi off, double-click. Paste or drop text, click Redact it. Runs 100% in-browser, no cloud step.
`redactor.py`	The validated pipeline, `.pdf`/`.docx`/batch/scripting	`python redactor.py file.pdf` (see below)

The two browser files share the same local-only scrub engine (regex + light NER). The offline HTML is the quick, no-install take-home to hand to any Lab member. The Python CLI is the validated pipeline: it runs on Microsoft Presidio + spaCy (the detection stack the Lab community settled on) and reads PDFs.

Detection engine: Microsoft Presidio + pymupdf (the validated stack)

Earlier the Python tool used a hand-rolled regex + optional spaCy script. It now uses Microsoft Presidio (presidio-analyzer + presidio-anonymizer) for detection and anonymization, the approach the AI Lab community validated instead of free-styling our own NER. On top of Presidio's built-in recognizers we add custom recognizers for tax-specific identifiers (EIN 12-3456789, bank routing/account numbers, and labeled ID / license / passport / policy numbers like ID card: 0234379SM) that Presidio doesn't ship by default. PDF text extraction uses pymupdf (fitz), since tax docs are mostly PDFs.
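
To make the custom-recognizer idea concrete, here is a minimal standard-library sketch of the kind of patterns involved. It is not the code in redactor.py: the regexes, the `ID_NUMBER` label, and the `find_tax_ids` helper are hypothetical, chosen only to match the two examples above (`12-3456789` and `ID card: 0234379SM`). In Presidio, patterns like these would typically be wrapped in a `PatternRecognizer`.

```python
import re

# Hypothetical patterns for illustration; not the ones redactor.py ships.
# EIN: two digits, a hyphen, seven digits (e.g. 12-3456789).
EIN_RE = re.compile(r"\b\d{2}-\d{7}\b")

# Labeled ID: a label word, an optional "card"/"number"/"no.",
# a ":" or "#" separator, then the identifier itself (captured).
LABELED_ID_RE = re.compile(
    r"\b(?:ID|license|passport|policy)(?:\s+(?:card|number|no\.?))?"
    r"\s*[:#]\s*([A-Z0-9-]{5,})",
    re.IGNORECASE,
)


def find_tax_ids(text):
    """Return (entity_type, value) pairs in order of appearance."""
    hits = [(m.start(), "EIN", m.group(0)) for m in EIN_RE.finditer(text)]
    hits += [
        (m.start(1), "ID_NUMBER", m.group(1))
        for m in LABELED_ID_RE.finditer(text)
    ]
    return [(kind, value) for _, kind, value in sorted(hits)]
```

Requiring a label before the ID value is what keeps a pattern this loose from firing on every alphanumeric string in a document.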

Everything still runs offline: Presidio and its spaCy model are local. The only network activity is the one-time install: the pip packages and the spaCy model download (see Setup).

Why it runs locally (this is the whole point)

A tool that removes PII has to touch the PII for a moment. If that happened in the cloud, the data would already be disclosed, defeating the purpose. So redactor.py makes zero network calls at runtime: Presidio and its spaCy model run locally. It reads a file on your computer and writes the scrubbed output back to your computer. Nothing is sent anywhere. The safe, abstracted text is then what you bring to an AI tool. (The only time anything touches the network is the one-time install: the pip packages and the spaCy model download.)

client doc (.pdf/.docx/.txt) → [redactor.py, Presidio, local, no network] → ┬→  *-abstracted.txt   (safe to share)
                                                                               └→  *-LOCALKEY.csv     (token→identity, KEEP LOCAL)

What it removes vs. keeps

Auto-removed (tokenized)	Kept (tax-determinative)
SSN / TIN, EIN	Dollar amounts & magnitudes
Email, phone, URLs	Percentages, ownership %
Account / long ID numbers	Dates, holding periods, sequence
Names, orgs & locations (Presidio NER)	Entity type, filing status
Street addresses / places (Presidio LOCATION)	State / jurisdiction names

Consistent tokens ([Person_1], [Entity_1], [SSN_1]) preserve every relationship the tax law cares about, so the research still fits your client. The same value always maps to the same token, and that map is saved to the local key file.
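
The "same value, same token" rule can be sketched in a few lines. This is an illustration under assumptions, not the tool's actual implementation: the `TokenMap` class, its method names, and the CSV column names (`token`, `label`, `value`) are hypothetical, and the real *-LOCALKEY.csv layout may differ.

```python
import csv
import io


class TokenMap:
    """Assign stable tokens such as [Person_1] to detected values."""

    def __init__(self):
        self._tokens = {}    # (label, value) -> token
        self._counters = {}  # label -> last number issued for that label

    def token_for(self, label, value):
        """Return the existing token for a value, or mint the next one."""
        key = (label, value)
        if key not in self._tokens:
            n = self._counters.get(label, 0) + 1
            self._counters[label] = n
            self._tokens[key] = f"[{label}_{n}]"
        return self._tokens[key]

    def to_csv(self):
        """Serialize the map as CSV text (column names are assumed)."""
        buf = io.StringIO()
        writer = csv.writer(buf, lineterminator="\n")
        writer.writerow(["token", "label", "value"])
        for (label, value), token in self._tokens.items():
            writer.writerow([token, label, value])
        return buf.getvalue()
```

Because a repeated value returns its earlier token, "Jane Doe" stays [Person_1] everywhere she appears, which is what keeps ownership and transaction relationships intact in the abstracted text.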

Setup (one time)

Requires Python 3 (you have Anaconda). From this folder:

pip install -r requirements.txt
python -m spacy download en_core_web_lg

Those two install commands are the only network steps; after them, the tool runs fully offline. If disk is tight, en_core_web_sm also works; install it instead and pass --model en_core_web_sm.

Usage

# scrub a PDF, Word doc, or text file
python redactor.py return_1040.pdf
python redactor.py client_notes.docx
python redactor.py notes.txt --outdir ./scrubbed

# scrub pasted text directly
python redactor.py --text "Jane Doe, SSN 123-45-6789, 60% owner of Acme LLC..."

# tune sensitivity (higher = fewer false positives, lower = catches more)
python redactor.py notes.txt --min-score 0.5

Outputs land next to your file (or in --outdir):

<name>-abstracted.txt, paste this into your AI tool.
<name>-LOCALKEY.csv, keep with the client file; never share. Re-attach identities here when you write up the memo.

Flags

Flag	What it does
`--text "..."`	Scrub literal text instead of a file
`--outdir DIR`	Where to write outputs (default: current folder)
`--min-score N`	Presidio confidence threshold, 0–1 (default `0.35`). Raise for precision, lower for recall
`--model NAME`	spaCy model (default `en_core_web_lg`; use `en_core_web_sm` if that's what you installed)
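
As a rough picture of what a confidence threshold like `--min-score` does, the sketch below filters a list of detections by score. The `Detection` class and `apply_min_score` function are hypothetical stand-ins for the analyzer's results, and treating the threshold as inclusive is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    """Hypothetical stand-in for one analyzer result."""
    entity_type: str
    start: int
    end: int
    score: float


def apply_min_score(detections, min_score=0.35):
    """Keep detections at or above the threshold (inclusive is assumed)."""
    return [d for d in detections if d.score >= min_score]
```

Raising the threshold drops low-confidence hits, so fewer ordinary words get tokenized but more real names can slip through. Lowering it does the reverse.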

Changed from the old version: the --aggressive flag is gone. Presidio's NER replaces the old heuristic name/address pass, so names, orgs, and locations are now detected by default; tune recall with --min-score instead. EIN and bank routing/account numbers are caught by custom recognizers (account/routing detection leans on nearby words like "account" or "routing" to avoid tokenizing every long number).
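
The context-word idea for account and routing numbers can be sketched with plain regex. This is an illustration only: the word list, the 8 to 17 digit range, and the 30-character look-back window are assumptions, not the values the custom recognizers actually use.

```python
import re

CONTEXT_WORDS = ("account", "acct", "routing")  # assumed word list
LONG_NUMBER_RE = re.compile(r"\b\d{8,17}\b")    # assumed length range
WINDOW = 30  # characters to look back for a context word (assumed)


def find_bank_numbers(text):
    """Return long digit runs that have a context word shortly before them."""
    hits = []
    for m in LONG_NUMBER_RE.finditer(text):
        before = text[max(0, m.start() - WINDOW):m.start()].lower()
        if any(word in before for word in CONTEXT_WORDS):
            hits.append(m.group(0))
    return hits
```

A long number with no nearby label, such as an invoice number, is left alone, which avoids tokenizing figures the research may still need.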

Reliability, be honest about the limits

High confidence (reliably removed): SSN/ITIN, EIN, email, phone, URLs, labeled bank account / routing numbers, and labeled ID / license / passport / policy numbers. These are structured patterns that Presidio and our custom recognizers catch well.
Best-effort (review required): names, organizations, and addresses come from Presidio's NER. It's much stronger than the old heuristic, but no NER is perfect, so verify the output.
Image-only / scanned PDFs: pymupdf extracts text, not pictures. A scanned PDF with no text layer yields little or nothing to scrub; OCR it or export to text first.
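
One way to notice the scanned-PDF case before trusting an empty result is a simple sparsity check on the extracted text. The sketch below is a heuristic under assumptions: `looks_scanned` and its 20-characters-per-page cutoff are hypothetical, and `page_texts` stands for the per-page strings an extractor returns (with pymupdf, for example, the output of `page.get_text()` for each page).

```python
def looks_scanned(page_texts, min_chars_per_page=20):
    """Return True when extracted text is too sparse to be a real text layer."""
    if not page_texts:
        return True
    total = sum(len(t.strip()) for t in page_texts)
    return total / len(page_texts) < min_chars_per_page
```

A True result means nearly empty output reflects a missing text layer, not a clean document.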

This is a safety floor, not a guarantee. Always eyeball the -abstracted.txt before pasting it anywhere. It dramatically reduces the chance of leaking PII; it does not replace your review.

Where it fits

Scrub → abstracted text + local key (this tool)
Bring the abstracted text to the tax-research skill
Re-attach the real identities, then document the result in the Research Memo template.

Round-trip (re-attach) is a feature of the browser tools only. Both redactor-local.html and the redactor.html demo have a "Bring the answer back" step that swaps tokens back to real names using your key. The Python CLI is scrub-only (one-way) and has no re-attach command; if you use the CLI, do the re-attachment by find-and-replace from the *-LOCALKEY.csv it produces.
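
For CLI users, the manual find-and-replace can be scripted. The sketch below assumes the key file has `token` and `value` columns. Those column names are a guess, so check the header row of your own *-LOCALKEY.csv and adjust the arguments. The `load_key` and `reattach` helpers are hypothetical and are not part of redactor.py.

```python
import csv
import io
import re


def load_key(csv_text, token_col="token", value_col="value"):
    """Parse key-file text into a {token: real value} dict."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row[token_col]: row[value_col] for row in reader}


def reattach(text, key):
    """Swap every known token in text back to its real value."""
    if not key:
        return text
    # One pass over the text, longest tokens first, so a restored value
    # is never scanned again for tokens.
    tokens = sorted(key, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(t) for t in tokens))
    return pattern.sub(lambda m: key[m.group(0)], text)
```

In practice you would read the key with something like `Path("notes-LOCALKEY.csv").read_text(encoding="utf-8")` and run `reattach` on the AI tool's answer, all on your own machine.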

See also the Fact Sheet template for the manual version of the same idea.