Module 5, Data Cleanup & Extraction

⚡ The 30-second version. Use it for: extracting and cleaning messy data, bank statements, PDFs, ledgers, exports. The method: let AI do the tedious 80% (extract, categorize, reshape), then verify every number ties to the source before it posts. Start here: the 15-minute starter at the bottom.

This is turning messy source data into structured, usable data: extracting from bank statements, PDFs, and handwritten ledgers; categorizing transactions; cleaning and reshaping spreadsheets; OCR; data conversion. Some of the most tedious work in the firm, and AI is genuinely fast at it. It's also where a quiet error does the most damage, because the output is numbers that go on the books.

The one-sentence method: AI turns messy into structured fast, but every extracted number is wrong until it ties to the source. Trust the structure; verify the figures.

This module has two hard edges the others don't share as sharply:

Extraction errors are silent. AI will transpose a digit, drop a row, or confidently misread a handwritten 7 as a 1, and the output looks perfect. In accounting, a wrong number is the cardinal sin.
You usually can't anonymize the source. Unlike tax research, you can't abstract a bank statement you're trying to digitize, you need the real numbers. So this module leans on the firm-approved-tool path, not the anonymize path.

What "good" looks like

The win is "AI does the tedious 80%, extract, categorize, reshape, and I verify, not retype." Four jobs AI does well:

Extract, pull structured data from PDFs, statements, and exports (and, with more care, images / handwriting).
Categorize & classify, code transactions, map accounts, label line items.
Clean & standardize, fix formats, dedupe, normalize dates/names, reshape a sheet.
Spreadsheet help, build formulas, pivots, and VBA.

What AI is not doing: guaranteeing the numbers. That's the verification step, and it's non-negotiable.

The two-layer approach (method vs. real data)

Same idea as Module 2, applied to data, it lets you use any tool for the hard-thinking part while keeping real client data in a safe lane:

Layer 1, Build the method on dummy/sample data (any enabled tool). Work out the extraction format, the categorization rules, the formula, or the reshape logic using a few fake or genericized rows. No real client data needed to design the process.
Layer 2, Run the real data through a firm-approved tool. Apply the proven method to the actual client file in a firm-approved/controlled tool, never public/personal AI, because the source is full of PII/TRI you can't strip.

Design once on safe data; run the sensitive data only where it's controlled.

The verification protocol (the core, do this every time)

Extracted data is unverified until it ties to the source. Before anything posts to the books:

Foot & cross-foot, do the extracted totals add up, and match the source totals?
[ ] Reconcile to a control, tie to the statement ending balance, the GL total, or the document's stated total. A control that doesn't tie means an error upstream.
[ ] Completeness, row counts match? Nothing dropped or duplicated? (AI silently drops rows.)
Spot-check, sample several rows against the source, including any that look "too clean."
Handwriting/image = extra scrutiny, vision extraction has a higher error rate; verify more, not less.

If it doesn't tie, don't post it. This is the data-world equivalent of Module 2's "no unverified citation."

Prompt library

1. Extract structured data from a statement/export (use a firm-approved tool for real data)

Extract this [bank statement / invoice / export] into a clean table with columns [date, description,
amount...]. Preserve every row. Do not infer or fill in missing values, mark them [BLANK]. After
the table, give me the row count and the total of the amount column so I can foot it to the source.

2. Categorize / code transactions (build the rules on dummy data first)

Here are transactions: [paste]. Categorize each into [my chart of accounts / categories: ...].
Where a transaction is ambiguous, flag it [REVIEW] rather than guessing. Return a table with my
category and a confidence flag; don't change any amounts or descriptions.

3. Clean & reshape a messy spreadsheet

This data is messy: [paste]. Standardize [dates to YYYY-MM-DD / names / formats], remove exact
duplicates (show me what you removed), and reshape it to columns [...]. Don't alter any values, only formatting and structure. List every change category you made.

4. Build a formula / pivot (no client data needed)

In Excel, I have columns [describe]. Write a formula that [goal]. Explain what it does in one line
so I can verify it's doing what I think.

5. The verification step

Here's the source total: [X] and the extracted data: [paste]. Foot the extracted amount column,
compare to the source total, and tell me the difference and which rows to check if it doesn't tie.

Tool picks

Claude (Lab default), strong at parsing pasted/exported text, categorizing, and building formulas; vision can read statements/images (verify more). Use a firm-approved/Team tier for real client data.
Dedicated extraction / bank-feed / OCR tools, purpose-built ingestion (and the "automated bank feeds → classify → financials" pipeline several of you asked for) is often more reliable and auditable than general AI for high-volume work.
QBO / Xero rules & the Claude Excel add-in, in-platform categorization and in-sheet cleanup where your data already lives.
Build it (Module 4), a recurring cleanup is a great rung-1 skill; the redactor tool is an example of a small local data utility.

Guardrails

The source is loaded with PII/TRI, and you usually can't strip it. Bank statements and ledgers carry account numbers, names, and balances you need to keep. So real-data extraction belongs in a firm-approved tool, with the §7216 / FTC Safeguards-WISP / AICPA Confidentiality Rule analysis applied (see Regulatory Foundation and the AI Acceptable-Use Policy). Design the method on dummy data anywhere; run the real data only where it's controlled. The redactor tool can still strip stray identifiers (names, account #s) from notes and descriptions where you don't need them.
Verify every figure, extracted numbers are unverified until they tie. Foot, reconcile, count, and spot-check before anything posts. AI errors here are silent and look correct.
Never let it invent or "fix" values. Use [BLANK] / [REVIEW] flags so gaps and ambiguities surface instead of getting silently filled. You decide the treatment.
You are the reviewer of record. Categorization is accounting judgment; extraction is data you attest to by posting it. AI accelerates; you verify and own it (SSTS §1.4; SSARS/GAAS for attest).

Your 15-minute starter

Grab a small, low-stakes export or a dummy/sample statement (or genericize a few rows).
Run Prompt 1 to extract it into a clean table with a row count and a column total.
Foot it to the source with Prompt 5 (or by hand). Notice whether it ties, and if it doesn't, you just watched the verification step earn its place.
Once you trust the process on safe data, run real data only in a firm-approved tool, every time ending with the tie-out.

Win condition: you stop retyping data and start verifying it, and you never post an AI-extracted figure that hasn't tied to its source.

That's the set, all five modules

You've now got the full top-five library: Client Communication · Tax Research · Writing & Deliverables · Workflow Automation · Data Cleanup & Extraction.

Every module rests on the same spine: anonymize or use a firm-approved tool; verify before you rely; the license and the signature are yours, not the AI's. Start at Guardrails and Regulatory Foundation, and use the templates, skills, and tools to put it to work.