← Back to the Library
Library · Guide

Module 5, Data Cleanup & Extraction

⚡ The 30-second version. Use it for: extracting and cleaning messy data, bank statements, PDFs, ledgers, exports. The method: let AI do the tedious 80% (extract, categorize, reshape), then verify every number ties to the source before it posts. Start here: the 15-minute starter at the bottom.

This is turning messy source data into structured, usable data: extracting from bank statements, PDFs, and handwritten ledgers; categorizing transactions; cleaning and reshaping spreadsheets; OCR; data conversion. Some of the most tedious work in the firm, and AI is genuinely fast at it. It's also where a quiet error does the most damage, because the output is numbers that go on the books.

The one-sentence method: AI turns messy into structured fast, but every extracted number is wrong until it ties to the source. Trust the structure; verify the figures.

This module has two hard edges the others don't share as sharply:

  1. Extraction errors are silent. AI will transpose a digit, drop a row, or confidently misread a handwritten 7 as a 1, and the output looks perfect. In accounting, a wrong number is the cardinal sin.

  2. You usually can't anonymize the source. Unlike tax research, you can't abstract a bank statement you're trying to digitize, you need the real numbers. So this module leans on the firm-approved-tool path, not the anonymize path.


What "good" looks like

The win is "AI does the tedious 80%, extract, categorize, reshape, and I verify, not retype." Four jobs AI does well:

  1. Extract, pull structured data from PDFs, statements, and exports (and, with more care, images / handwriting).

  2. Categorize & classify, code transactions, map accounts, label line items.

  3. Clean & standardize, fix formats, dedupe, normalize dates/names, reshape a sheet.
  4. Spreadsheet help, build formulas, pivots, and VBA.

What AI is not doing: guaranteeing the numbers. That's the verification step, and it's non-negotiable.


The two-layer approach (method vs. real data)

Same idea as Module 2, applied to data, it lets you use any tool for the hard-thinking part while keeping real client data in a safe lane:

  1. Layer 1, Build the method on dummy/sample data (any enabled tool). Work out the extraction format, the categorization rules, the formula, or the reshape logic using a few fake or genericized rows. No real client data needed to design the process.

  2. Layer 2, Run the real data through a firm-approved tool. Apply the proven method to the actual client file in a firm-approved/controlled tool, never public/personal AI, because the source is full of PII/TRI you can't strip.

Design once on safe data; run the sensitive data only where it's controlled.


The verification protocol (the core, do this every time)

Extracted data is unverified until it ties to the source. Before anything posts to the books:

If it doesn't tie, don't post it. This is the data-world equivalent of Module 2's "no unverified citation."


Prompt library

1. Extract structured data from a statement/export (use a firm-approved tool for real data)

Extract this [bank statement / invoice / export] into a clean table with columns [date, description,
amount...]. Preserve every row. Do not infer or fill in missing values, mark them [BLANK]. After
the table, give me the row count and the total of the amount column so I can foot it to the source.

2. Categorize / code transactions (build the rules on dummy data first)

Here are transactions: [paste]. Categorize each into [my chart of accounts / categories: ...].
Where a transaction is ambiguous, flag it [REVIEW] rather than guessing. Return a table with my
category and a confidence flag; don't change any amounts or descriptions.

3. Clean & reshape a messy spreadsheet

This data is messy: [paste]. Standardize [dates to YYYY-MM-DD / names / formats], remove exact
duplicates (show me what you removed), and reshape it to columns [...]. Don't alter any values, only formatting and structure. List every change category you made.

4. Build a formula / pivot (no client data needed)

In Excel, I have columns [describe]. Write a formula that [goal]. Explain what it does in one line
so I can verify it's doing what I think.

5. The verification step

Here's the source total: [X] and the extracted data: [paste]. Foot the extracted amount column,
compare to the source total, and tell me the difference and which rows to check if it doesn't tie.

Tool picks


Guardrails


Your 15-minute starter

  1. Grab a small, low-stakes export or a dummy/sample statement (or genericize a few rows).
  2. Run Prompt 1 to extract it into a clean table with a row count and a column total.
  3. Foot it to the source with Prompt 5 (or by hand). Notice whether it ties, and if it doesn't, you just watched the verification step earn its place.

  4. Once you trust the process on safe data, run real data only in a firm-approved tool, every time ending with the tie-out.

Win condition: you stop retyping data and start verifying it, and you never post an AI-extracted figure that hasn't tied to its source.


That's the set, all five modules

You've now got the full top-five library: Client Communication · Tax Research · Writing & Deliverables · Workflow Automation · Data Cleanup & Extraction.

Every module rests on the same spine: anonymize or use a firm-approved tool; verify before you rely; the license and the signature are yours, not the AI's. Start at Guardrails and Regulatory Foundation, and use the templates, skills, and tools to put it to work.

The AI Lab for Accountants · An educational resource, not legal or tax advice.