
Open Lab, Skills & Architecture: Code Computes, AI Drafts, You Decide

⚡ For builders. The engineering principle the Lab's most accurate builders converged on, independently, is a three-way split: deterministic code does the mechanics, the LLM does the legwork (drafts, classifies, flags), and the licensed professional makes every judgment call. Firms running it report near-zero errors on real work. This module shows how to build that way, and covers the write-safety pattern that makes AI-to-ledger automation safe.

🧪 See it in action: the Bookkeeping Pipeline lab makes this whole architecture runnable, with extract (AI drafts), categorize (AI proposes, you decide), and trial balance (deterministic) stages, all on 12 months of synthetic statements.


The core principle: three layers, not two

It's tempting to say "code does the mechanics, the LLM does the judgment." That's wrong, and it's dangerous: professional judgment is yours, by license and by every standard in Guardrails. The LLM doesn't make the call; it does the legwork that feeds your call. So split the work three ways:

Code computes. AI drafts. You decide.

- Mechanics → deterministic scripts: anything with a single correct answer (arithmetic, lookups, formatting, moving data without dropping a row).
- Legwork → the LLM: classify, draft, summarize, flag anomalies, surface the rule that might apply. It proposes; it never decides.
- Judgment → the licensed professional: review the proposals, weigh them, make the call, sign. Non-delegable.

The failure mode isn't only asking one model to do both mechanics and reasoning; it's letting the model's output stand as the decision. It never should. (This is exactly why the staging-table pattern below works: AI proposes, you decide, code executes.)

This is why members reviewing 1040s with a library of focused skills get consistent results: the skills surface what a reviewer should look at; the reviewer still makes the judgment. "Just ask the model to review this return" both produces a different answer every time and quietly puts the model in the chair that belongs to you.

What a "skill" is, and why a library beats a mega-prompt

A skill is a saved, reusable instruction set scoped to one job (see the tax-research skill for a worked example). The builders getting real mileage aren't writing one giant prompt; they're assembling many small skills, each doing one thing well (one firm runs ~27; another reviews 1040s with ~9). Small skills are:

(Reality check from the room: skill saving and versioning are still largely manual today, so keep your skills in a known place and version them like code.)

Anatomy of a production skill

A throwaway prompt and a production skill look different. A skill you'll trust on real work spells out six things, so it behaves the same way every run:

  1. Scope: the one job, and explicitly what it does not do.
  2. Inputs: exactly what it needs, and what to do when something's missing (flag, don't guess).
  3. Steps: the procedure, in order.
  4. Guardrails: the hard rules (no invented figures or cites; flag ambiguity as [REVIEW]; never decide a position).
  5. Output format: a fixed shape you can check or feed to the next step.
  6. Examples: one or two worked input→output pairs. This single addition does more for consistency than any amount of prose.

The tax-research skill is a worked example of all six.
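One way to keep the six parts from silently going missing is to store each skill as structured data rather than free text. The sketch below is an illustration only: the `Skill` class, its field names, and the sample skill content are assumptions made for this example, not the Lab's format or the tax-research skill itself.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Skill:
    """One saved instruction set. Field names are illustrative, not a standard."""
    name: str
    scope: str                 # the one job, and what it does not do
    inputs: list[str]          # what it needs; missing inputs get flagged, not guessed
    steps: list[str]           # the procedure, in order
    guardrails: list[str]      # the hard rules
    output_format: str         # fixed shape the next step can check
    examples: list[tuple[str, str]] = field(default_factory=list)  # (input, output)

    def missing_parts(self) -> list[str]:
        """Return the names of any of the six parts that were left empty."""
        parts = {
            "scope": self.scope,
            "inputs": self.inputs,
            "steps": self.steps,
            "guardrails": self.guardrails,
            "output_format": self.output_format,
            "examples": self.examples,
        }
        return [part for part, value in parts.items() if not value]


# Hypothetical skill content, written only to exercise the structure.
flag_missing_1099 = Skill(
    name="flag-missing-1099",
    scope="Flag income that may lack a 1099. Does not compute tax or decide treatment.",
    inputs=["prior-year 1099 list", "current-year 1099 list"],
    steps=["Compare payers year over year", "List payers present last year but absent now"],
    guardrails=["No invented figures or cites", "Mark ambiguity [REVIEW]", "Never decide a position"],
    output_format="One line per flag: PAYER | REASON | [REVIEW]",
    examples=[("2023: Acme, Bolt; 2024: Acme", "Bolt | present in 2023, absent in 2024 | [REVIEW]")],
)

draft = Skill(
    name="untested-draft", scope="Summarize a notice.", inputs=["notice text"],
    steps=["Summarize"], guardrails=["No invented figures"], output_format="Three bullets",
)

print(flag_missing_1099.missing_parts())  # []
print(draft.missing_parts())              # ['examples']
```

A check like `missing_parts()` is cheap to run before a skill is allowed into the shared library, which is one way to separate throwaway prompts from production skills.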

The architecture, concretely

A reliable AI workflow usually looks like this:

1. INGEST       (script)  pull and normalize the data: OCR, parse, dedupe, foot the totals
2. DRAFT & FLAG (LLM)     classify, flag anomalies, draft, surface the rule that may apply (proposes only)
3. COMPUTE      (script)  any math, thresholds, rollups; never the LLM
4. JUDGE & SIGN (human)   the professional weighs the flags, makes the calls, signs
5. WRITE        (script)  only after the human decides; see the staging-table pattern below

The decision lives in step 4, with you, never in step 2. The LLM hands you a better-prepared desk; it doesn't clear it.
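The five steps can be sketched as five small functions. This is a minimal illustration under stated assumptions: all function names are hypothetical, the sample transactions are made up, and `draft_and_flag` is a keyword stub standing in for a real model call so the example runs without any API.

```python
from decimal import Decimal


def ingest(raw_rows):
    """Step 1 (script): normalize amounts and drop exact duplicates."""
    seen, rows = set(), []
    for date, memo, amount in raw_rows:
        key = (date, memo.strip().lower(), Decimal(amount))
        if key not in seen:
            seen.add(key)
            rows.append({"date": date, "memo": memo.strip(), "amount": Decimal(amount)})
    return rows


def draft_and_flag(rows):
    """Step 2 (LLM stand-in): propose a category per row; never final.
    Hypothetical stub: a real build would call a model and parse its output."""
    proposals = []
    for i, row in enumerate(rows):
        guess = "Software" if "saas" in row["memo"].lower() else "REVIEW"
        proposals.append({"row": i, "proposed_category": guess})
    return proposals


def compute(rows):
    """Step 3 (script): exact arithmetic, done in code."""
    return sum((r["amount"] for r in rows), Decimal("0"))


def judge(proposals, decisions):
    """Step 4 (human): keep only rows the reviewer explicitly decided.
    `decisions` maps row index -> the category the professional chose."""
    return [{"row": p["row"], "category": decisions[p["row"]]}
            for p in proposals if p["row"] in decisions]


def write(rows, decided):
    """Step 5 (script): apply only what the human decided."""
    return [{**rows[d["row"]], "category": d["category"]} for d in decided]


raw = [
    ("2024-01-05", "Acme SaaS ", "49.00"),
    ("2024-01-05", "acme saas", "49.00"),        # duplicate after normalizing
    ("2024-01-09", "Cash withdrawal", "200.00"),
]
rows = ingest(raw)
proposals = draft_and_flag(rows)
total = compute(rows)
decisions = {0: "Software"}                      # reviewer has not decided row 1 yet
ledger = write(rows, judge(proposals, decisions))

print(len(rows), total, len(ledger))             # 2 249.00 1
```

Note that the undecided row never reaches `write`: a proposal with no human decision is simply not posted, which mirrors the rule that the decision lives in step 4.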

Worked example: a 1040 review skill set. Scripts extract and normalize the return data and foot the schedules (steps 1, 3); LLM skills flag "this looks like a missing 1099," "Schedule C home-office math doesn't tie," "consider a QBI issue" (step 2); the preparer decides what each flag means and what to do (step 4). The LLM never computes the tax and never decides a position; it spots what a reviewer should look at. The reviewer still reviews. That's what gets the error rate near zero and keeps the judgment where it legally belongs.
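Footing is a good example of work that belongs in a script. A minimal sketch, assuming amounts arrive as strings and using Python's `decimal` module so cents add exactly; the `foot` helper and the figures are made up for illustration.

```python
from decimal import Decimal


def foot(line_items, reported_total):
    """Return (computed_total, difference). A zero difference means the schedule ties."""
    computed = sum((Decimal(x) for x in line_items), Decimal("0"))
    return computed, computed - Decimal(reported_total)


computed, diff = foot(["1200.10", "310.20", "89.70"], "1600.00")
print(computed, diff == 0)          # 1600.00 True

# Why not plain floats: binary floating point cannot represent most cents exactly.
print(0.1 + 0.2 == 0.3)                                        # False
print(Decimal("0.1") + Decimal("0.2") == Decimal("0.3"))       # True
```

A tie-out like this either passes or fails, so its result can be handed to the reviewer as a fact rather than as a model's impression.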

The write-safety pattern (don't let AI touch the books directly)

This is the most important pattern for anyone connecting AI to a ledger or any other system of record, and the Lab's best-in-class answer to write-hallucination:

The staging table. The LLM never writes to the live ledger. It writes a proposed change to a staging table (or a draft journal entry, or a review queue). A deterministic script validates and executes it only after a human approves. The model proposes; code disposes; a person decides.

This gives you AI's speed on the proposal with zero risk of a hallucinated write hitting a client's books. If you build one safety pattern, build this one.
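The staging-table pattern can be sketched with Python's built-in `sqlite3` module. Everything specific here is an assumption for illustration: the table and column names, the status values, the batch label, and the choice to store amounts as integer cents. A real ledger system will have its own schema and permissions model.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE ledger (
    id INTEGER PRIMARY KEY, account TEXT NOT NULL,
    debit_cents INTEGER NOT NULL, credit_cents INTEGER NOT NULL
);
CREATE TABLE staging (
    id INTEGER PRIMARY KEY, batch TEXT NOT NULL, account TEXT NOT NULL,
    debit_cents INTEGER NOT NULL, credit_cents INTEGER NOT NULL,
    status TEXT NOT NULL DEFAULT 'proposed'
        CHECK (status IN ('proposed', 'approved', 'rejected', 'posted'))
);
""")


def propose(batch, lines):
    """The only write the AI side gets: rows in staging, never in ledger."""
    conn.executemany(
        "INSERT INTO staging (batch, account, debit_cents, credit_cents) VALUES (?, ?, ?, ?)",
        [(batch, account, debit, credit) for account, debit, credit in lines],
    )
    conn.commit()


def approve(batch):
    """Records the human's decision on a proposed batch."""
    conn.execute(
        "UPDATE staging SET status = 'approved' WHERE batch = ? AND status = 'proposed'",
        (batch,),
    )
    conn.commit()


def post(batch):
    """Deterministic step: validate approved rows, then copy them to the ledger."""
    rows = conn.execute(
        "SELECT account, debit_cents, credit_cents FROM staging "
        "WHERE batch = ? AND status = 'approved'",
        (batch,),
    ).fetchall()
    if not rows:
        return 0                      # nothing approved, nothing written
    if sum(r[1] for r in rows) != sum(r[2] for r in rows):
        raise ValueError("entry does not balance; refusing to post")
    with conn:                        # one transaction: commit on success, roll back on error
        conn.executemany(
            "INSERT INTO ledger (account, debit_cents, credit_cents) VALUES (?, ?, ?)", rows
        )
        conn.execute(
            "UPDATE staging SET status = 'posted' WHERE batch = ? AND status = 'approved'",
            (batch,),
        )
    return len(rows)


propose("JE-1", [("Office Supplies", 12500, 0), ("Cash", 0, 12500)])
posted_before = post("JE-1")          # 0: proposed but not yet approved
approve("JE-1")
posted_after = post("JE-1")           # 2: approved, balanced, now in the ledger
ledger_count = conn.execute("SELECT COUNT(*) FROM ledger").fetchone()[0]
print(posted_before, posted_after, ledger_count)   # 0 2 2
```

In a real deployment the separation would also be enforced by permissions, so the credentials the model runs under cannot write to the ledger table at all; the sketch only shows the flow.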

Testing & managing a skill library (hobby vs. production)

The difference between someone's 3 fun skills and a firm's 27 reliable ones is evaluation and version control. Non-determinism means "it worked when I tried it" is not evidence; you need to know it works repeatably.
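"Works repeatably" can be measured by running the same skill several times on a fixed set of test cases and recording how often it matches the expected answer. The harness below is a sketch under assumptions: `evaluate` and `flaky_skill` are hypothetical names, the test cases are made up, and the stub cycles through canned answers to imitate a model that sometimes disagrees with itself.

```python
import itertools
from collections import Counter


def evaluate(skill, cases, runs=5):
    """Run `skill` on every case `runs` times; report pass rate and stability."""
    report = []
    for text, expected in cases:
        outputs = Counter(skill(text) for _ in range(runs))
        report.append({
            "input": text,
            "expected": expected,
            "pass_rate": outputs[expected] / runs,   # share of runs that matched
            "stable": len(outputs) == 1,             # same answer on every run?
        })
    return report


# Hypothetical stand-in for an LLM-backed skill. It answers clear cases the
# same way every time and wavers on an ambiguous one.
_wavering = itertools.cycle(["Meals", "Meals", "Travel"])


def flaky_skill(text):
    if "uber" in text.lower():
        return "Travel"
    return next(_wavering)


cases = [
    ("UBER TRIP 1234", "Travel"),
    ("Client lunch at airport", "Meals"),
]
report = evaluate(flaky_skill, cases, runs=6)
for line in report:
    print(line["input"], round(line["pass_rate"], 2), line["stable"])
```

Keeping the test cases in the same versioned folder as the skill means every edit to the skill can be re-run against the same cases before it goes back into production use.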

When to reach for a script vs. the LLM (rules of thumb)

Task                                        Use
Arithmetic, footing, thresholds, rollups    Script (computes)
Exact reformatting, file moves, dedup       Script (computes)
Pulling/normalizing data from a source      Script (computes)
Classifying, coding, flagging anomalies     LLM (proposes)
Drafting language, summarizing, explaining  LLM (proposes)
Choosing the rule/treatment/position        LLM surfaces options → you decide
Writing to a ledger / system of record      Script, only after you decide (staging table)

Guardrails for skill-based systems

Your starter

Take one task you currently do with a long do-everything prompt. Split it three ways: which steps have a single correct answer (→ script), which are reasoning legwork the AI can draft (→ LLM), and which are the calls only you should make (→ you). Build just the LLM legwork step as a small skill, do the mechanics in a spreadsheet or script you control, and keep the decisions yours. Notice how much more consistent the result gets, and how the judgment never left your desk.


Open Lab track · pairs with Build vs. Buy and Workpaper automation.

The AI Lab for Accountants · An educational resource, not legal or tax advice.