Redactor: Local PII Scrubber
The friction-killer for Module 2. Abstracting client facts by hand is the step people skip, so they paste the raw file instead. This tool does it automatically: drop in a client note, get back an abstracted fact pattern that's safe to share plus a local key that stays on your machine.
Demo vs. take-home: read this first
There are two browser files here, and the difference matters:
- redactor.html is the live demo hosted on the Lab site. It loads a fake sample and includes an "Ask Claude" step that sends the redacted sample to the cloud through the Lab's API key. It exists to show how the workflow feels. It is not for real client data; don't paste real PII into the hosted page.
- redactor-local.html is the take-home tool. A member downloads it, saves it, and runs it fully offline (no server, no external scripts, no cloud step; scrub + re-attach only). This is the one that actually honors "the client's identity never leaves your machine." It's what real client work uses.
Three ways to run it: pick by who you are
| File | Best for | How |
|---|---|---|
| redactor.html | A live demo to understand the workflow (fake data only) | Open it on the Lab site; click See it in action. The "Ask Claude" step is cloud-backed, so it is demo only. |
| redactor-local.html ⭐ | The take-home tool: real files, no install, no terminal | Download it, save to your machine, turn WiFi off, double-click. Paste or drop text, click Redact it. Runs 100% in-browser, no cloud step. |
| redactor.py | The validated pipeline: .pdf/.docx/batch/scripting | python redactor.py file.pdf (see below) |
The two browser files share the same local-only scrub engine (regex + light NER). The offline HTML is the quick, no-install take-home to hand to any Lab member. The Python CLI is the validated pipeline: it runs on Microsoft Presidio + spaCy (the detection stack the Lab community settled on) and reads PDFs.
Detection engine: Microsoft Presidio + pymupdf (the validated stack)
Earlier the Python tool used a hand-rolled regex + optional spaCy script. It now uses
Microsoft Presidio (presidio-analyzer +
presidio-anonymizer) for detection and anonymization, the approach the AI Lab community
validated instead of free-styling our own NER. On top of Presidio's built-in recognizers we add
custom recognizers for tax-specific identifiers (EIN 12-3456789, bank routing/account
numbers, and labeled ID / license / passport / policy numbers like ID card: 0234379SM) that
Presidio doesn't ship by default. PDF text extraction uses pymupdf (fitz), since tax docs
are mostly PDFs.
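The idea behind the custom recognizers can be pictured with a small standard-library sketch. This is not the tool's Presidio code: the two patterns, the context-word list, and the 20-character look-behind window are all assumptions chosen for illustration, and the function names are hypothetical.

```python
import re

# Assumed pattern: an EIN written as two digits, a hyphen, seven digits.
EIN_RE = re.compile(r"\b\d{2}-\d{7}\b")
# Assumed pattern: any run of 8 to 17 digits is a candidate bank number.
LONG_NUMBER_RE = re.compile(r"\b\d{8,17}\b")
# Hypothetical context words; a candidate only counts if one appears nearby.
CONTEXT_WORDS = ("account", "acct", "routing")
WINDOW = 20  # characters of preceding text to inspect (assumption)


def find_eins(text):
    """Return (start, end, value) for every EIN-shaped match."""
    return [(m.start(), m.end(), m.group()) for m in EIN_RE.finditer(text)]


def find_bank_numbers(text):
    """Return long digit runs that have a context word shortly before them."""
    hits = []
    for m in LONG_NUMBER_RE.finditer(text):
        before = text[max(0, m.start() - WINDOW):m.start()].lower()
        if any(word in before for word in CONTEXT_WORDS):
            hits.append((m.start(), m.end(), m.group()))
    return hits


sample = "EIN 12-3456789. Routing number 021000021. Invoice 9988776655 was paid."
print(find_eins(sample))
print(find_bank_numbers(sample))
```

In this sample the routing number is flagged because "Routing" sits just before it, while the invoice number is left alone, which is the trade-off the context words are meant to buy.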
Everything still runs offline: Presidio and its spaCy model are local. The only network step is the one-time model download at install (see Setup).
Why it runs locally (this is the whole point)
A tool that removes PII has to touch the PII for a moment. If that happened in the cloud, the data
would already be disclosed, defeating the purpose. So redactor.py makes zero network calls at
runtime: Presidio and its spaCy model run locally. It reads a file on your computer and writes the
scrubbed output back to your computer. Nothing is sent anywhere. Then the safe, abstracted text is
what you bring to an AI tool. (The only time anything touches the network is the one-time
spacy download at install.)
client doc (.pdf/.docx/.txt) → [redactor.py, Presidio, local, no network] → ┬→ *-abstracted.txt (safe to share)
└→ *-LOCALKEY.csv (token→identity, KEEP LOCAL)
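One way to spot-check a "no network at runtime" claim for any Python code is to make socket creation fail while that code runs. The sketch below is an illustration, not part of redactor.py; `no_network` is a hypothetical helper, and it only intercepts sockets created through Python's own socket module (it would not see traffic from a subprocess or a native library).

```python
import socket
from contextlib import contextmanager


@contextmanager
def no_network():
    """Make any attempt to create a socket fail while the block runs."""
    real_socket = socket.socket

    def blocked(*args, **kwargs):
        raise RuntimeError("network access attempted")

    socket.socket = blocked
    try:
        yield
    finally:
        # Always restore the real class, even if the block raised.
        socket.socket = real_socket


# Demonstration: creating a socket inside the guard raises.
with no_network():
    try:
        socket.socket()
    except RuntimeError as err:
        print("blocked:", err)
```

Running a scrub inside such a guard and seeing it finish without a RuntimeError is evidence, though not proof, that the code path stayed local.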
What it removes vs. keeps
| Auto-removed (tokenized) | Kept (tax-determinative) |
|---|---|
| SSN / TIN, EIN | Dollar amounts & magnitudes |
| Email, phone, URLs | Percentages, ownership % |
| Account / long ID numbers | Dates, holding periods, sequence |
| Names, orgs & locations (Presidio NER) | Entity type, filing status |
| Street addresses / places (Presidio LOCATION) | State / jurisdiction names |
Consistent tokens ([Person_1], [Entity_1], [SSN_1]) preserve every relationship the tax law
cares about, so the research still fits your client. The same value always maps to the same token,
and that map is saved to the local key file.
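The "same value, same token" rule can be sketched in a few lines. This is a minimal illustration rather than the tool's actual code: the `TokenMap` class and the `token` / `identity` column names are assumptions.

```python
import csv
import io


class TokenMap:
    """Assigns one stable token per distinct (kind, value) pair."""

    def __init__(self):
        self._tokens = {}  # (kind, value) -> token
        self._counts = {}  # kind -> number of tokens issued so far

    def token_for(self, kind, value):
        key = (kind, value)
        if key not in self._tokens:
            self._counts[kind] = self._counts.get(kind, 0) + 1
            self._tokens[key] = f"[{kind}_{self._counts[kind]}]"
        return self._tokens[key]

    def to_csv(self):
        """Serialize the token -> identity map (column names are assumed)."""
        buf = io.StringIO()
        writer = csv.writer(buf, lineterminator="\n")
        writer.writerow(["token", "identity"])
        for (_, value), token in self._tokens.items():
            writer.writerow([token, value])
        return buf.getvalue()


tokens = TokenMap()
print(tokens.token_for("Person", "Jane Doe"))  # first person seen
print(tokens.token_for("Person", "John Roe"))  # a different person
print(tokens.token_for("Person", "Jane Doe"))  # same value, same token again
print(tokens.to_csv())
```

Because the counter is kept per kind, a second mention of the same person reuses the earlier token, so statements like "[Person_1] owns 60% of [Entity_1]" stay internally consistent across the whole document.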
Setup (one time)
Requires Python 3 (you have Anaconda). From this folder:
pip install -r requirements.txt
python -m spacy download en_core_web_lg
Installation is the only step that touches the network (pip fetches the packages, then spaCy fetches
its model); after it, the tool runs fully offline. If disk is tight, en_core_web_sm also works;
install it instead and pass --model en_core_web_sm.
Usage
# scrub a PDF, Word doc, or text file
python redactor.py return_1040.pdf
python redactor.py client_notes.docx
python redactor.py notes.txt --outdir ./scrubbed
# scrub pasted text directly
python redactor.py --text "Jane Doe, SSN 123-45-6789, 60% owner of Acme LLC..."
# tune sensitivity (higher = fewer false positives, lower = catches more)
python redactor.py notes.txt --min-score 0.5
Outputs land next to your file (or in --outdir):
- <name>-abstracted.txt: paste this into your AI tool.
- <name>-LOCALKEY.csv: keep with the client file; never share. Re-attach identities here when you write up the memo.
Flags
| Flag | What it does |
|---|---|
| --text "..." | Scrub literal text instead of a file |
| --outdir DIR | Where to write outputs (default: current folder) |
| --min-score N | Presidio confidence threshold, 0–1 (default 0.35). Raise for precision, lower for recall |
| --model NAME | spaCy model (default en_core_web_lg; use en_core_web_sm if that's what you installed) |
Changed from the old version: the --aggressive flag is gone. Presidio's NER replaces the old heuristic name/address pass, so names, orgs, and locations are now detected by default; tune recall with --min-score instead. EIN and bank routing/account numbers are caught by custom recognizers (account/routing detection leans on nearby words like "account" or "routing" to avoid tokenizing every long number).
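What the --min-score threshold does to a set of detections can be sketched as a simple filter. The `Detection` tuple, the `apply_min_score` helper, and the example scores are invented for illustration; only the default of 0.35 comes from the flag table above.

```python
from typing import NamedTuple


class Detection(NamedTuple):
    entity_type: str
    start: int
    end: int
    score: float


def apply_min_score(detections, min_score=0.35):
    """Keep only detections whose confidence is at least min_score."""
    return [d for d in detections if d.score >= min_score]


# Made-up detections: a structured ID scores high, NER guesses score lower.
found = [
    Detection("US_SSN", 10, 21, 0.85),
    Detection("PERSON", 0, 8, 0.40),
    Detection("LOCATION", 30, 36, 0.30),
]
print([d.entity_type for d in apply_min_score(found)])       # default threshold
print([d.entity_type for d in apply_min_score(found, 0.5)])  # stricter threshold
```

Raising the threshold drops the weaker NER guesses first, which is why a higher value trades recall for precision.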
Reliability: be honest about the limits
- High confidence (reliably removed): SSN/ITIN, EIN, email, phone, URLs, labeled bank account / routing numbers, and labeled ID / license / passport / policy numbers. These are structured patterns that Presidio + our custom recognizers catch well.
- Best-effort (review required): names, organizations, and addresses come from Presidio's NER. It's much stronger than the old heuristic, but no NER is perfect, so verify the output.
- Image-only / scanned PDFs: pymupdf extracts text, not pictures. A scanned PDF with no text layer yields little or nothing to scrub; OCR it or export to text first.
This is a safety floor, not a guarantee. Always eyeball the -abstracted.txt before pasting it anywhere. It dramatically reduces the chance of leaking PII; it does not replace your review.
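The scanned-PDF limit lends itself to a quick pre-check: if extraction returns almost no text, the file probably has no text layer. The sketch below is an assumption-laden illustration, not something redactor.py is known to do; `looks_scanned` and the 20-characters-per-page cutoff are hypothetical.

```python
def looks_scanned(page_texts, min_chars_per_page=20):
    """Return True if the pages average fewer than min_chars_per_page characters."""
    if not page_texts:
        return True  # nothing extracted at all
    total = sum(len(text.strip()) for text in page_texts)
    return total / len(page_texts) < min_chars_per_page


# With pymupdf, the per-page text could be collected along these lines:
#     import fitz
#     page_texts = [page.get_text() for page in fitz.open("return_1040.pdf")]
print(looks_scanned(["", " \n"]))
print(looks_scanned(["Jane Doe, SSN 123-45-6789, 60% owner of Acme LLC."]))
```

A True result is a cue to OCR the file or export it to text before scrubbing, rather than trusting a near-empty output.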
Where it fits
- Scrub → abstracted text + local key (this tool)
- Bring the abstracted text to the tax-research skill
- Re-attach the real identities, then document the result in the Research Memo template.
Round-trip (re-attach) is a feature of the browser tools only. Both redactor-local.html and the redactor.html demo have a "Bring the answer back" step that swaps tokens back to real names using your key. The Python CLI is scrub-only (one-way): it has no re-attach command. If you use the CLI, do the re-attachment by find-and-replace from the *-LOCALKEY.csv it produces.
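For CLI users, the find-and-replace step can be scripted. This is a minimal sketch under a stated assumption: that the key file has two columns named `token` and `identity`. Check the header of your own *-LOCALKEY.csv and adjust the names; `reattach` is a hypothetical helper, not a command the tool ships.

```python
import csv
import io


def reattach(text, key_csv):
    """Swap every token in text back to the identity recorded in the key."""
    rows = csv.DictReader(io.StringIO(key_csv))
    mapping = {row["token"]: row["identity"] for row in rows}
    # Longest tokens first, a cheap guard in case one token contains another.
    for token in sorted(mapping, key=len, reverse=True):
        text = text.replace(token, mapping[token])
    return text


key = "token,identity\n[Person_1],Jane Doe\n[Entity_1],Acme LLC\n"
answer = "[Person_1] holds 60% of [Entity_1]; [Person_1] materially participates."
print(reattach(answer, key))
```

To run it on real files, read the key with `open(path, newline="", encoding="utf-8").read()` and keep both the key and the re-attached memo on your machine.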
See also the Fact Sheet template for the manual version of the same idea.