Open Lab, Skills & Architecture: Code Computes, AI Drafts, You Decide
⚡ For builders. The engineering principle the Lab's most accurate builders converged on, independently, is a three-way split: deterministic code does the mechanics, the LLM does the legwork (drafts, classifies, flags), and the licensed professional makes every judgment call. Firms running it report near-zero errors on real work. This module is how to build that way, plus the write-safety pattern that makes AI-to-ledger automation safe.
🧪 See it in action: the Bookkeeping Pipeline lab makes this whole architecture runnable, with extract (AI drafts), categorize (AI proposes, you decide), and trial balance (deterministic) steps, all on 12 months of synthetic statements.
The core principle: three layers, not two
It's tempting to say "code does the mechanics, the LLM does the judgment." That's wrong, and it's dangerous: professional judgment is yours, by license and by every standard in Guardrails. The LLM doesn't make the call; it does the legwork that feeds your call. So split the work three ways:
Code computes. AI drafts. You decide.
- Mechanics → deterministic scripts: anything with a single correct answer (arithmetic, lookups, formatting, moving data without dropping a row).
- Legwork → the LLM: classify, draft, summarize, flag anomalies, surface the rule that might apply. It proposes; it never decides.
- Judgment → the licensed professional: review the proposals, weigh them, make the call, sign. Non-delegable.
The failure mode isn't only asking one model to do both mechanics and reasoning; it's letting the model's output stand as the decision. It never should. (This is exactly why the staging-table pattern below works: AI proposes, you decide, code executes.)
This is why members reviewing 1040s with a library of focused skills get consistent results: the skills surface what a reviewer should look at; the reviewer still makes the judgment. "Just ask the model to review this return" both produces a different answer every time and quietly puts the model in the chair that belongs to you.
What a "skill" is, and why a library beats a mega-prompt
A skill is a saved, reusable instruction set scoped to one job (see the
tax-research skill for a worked example). The builders getting real
mileage aren't writing one giant prompt; they're assembling many small skills, each doing one
thing well (one firm runs ~27; another reviews 1040s with ~9). Small skills are:
- Testable: you can tell whether this step is right.
- Composable: chain them; reuse across clients.
- Debuggable: when output is wrong, you know which skill to fix.
(Reality check from the room: skill saving/versioning is still largely manual today, so keep your skills in a known place and version them like code.)
Anatomy of a production skill
A throwaway prompt and a production skill look different. A skill you'll trust on real work spells out six things, so it behaves the same way every run:
- Scope: the one job, and explicitly what it does not do.
- Inputs: exactly what it needs, and what to do when something's missing (flag, don't guess).
- Steps: the procedure, in order.
- Guardrails: the hard rules (no invented figures/cites; flag ambiguity [REVIEW]; never decide a position).
- Output format: a fixed shape you can check or feed to the next step.
- Examples: one or two worked input→output pairs. This single addition does more for consistency than any amount of prose.
The tax-research skill is a worked example of all six.
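As a rough illustration of the six parts above, a skill can be held as structured data and rendered to text the same way every run. This is a minimal sketch, not the Lab's format: the `Skill` class, its field names, and the sample 1099 skill are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Skill:
    """Hypothetical container for the six parts of a production skill."""
    scope: str
    inputs: list[str]
    steps: list[str]
    guardrails: list[str]
    output_format: str
    examples: list[tuple[str, str]] = field(default_factory=list)

    def missing_parts(self) -> list[str]:
        """Names of any of the six parts that were left empty."""
        parts = {
            "scope": self.scope,
            "inputs": self.inputs,
            "steps": self.steps,
            "guardrails": self.guardrails,
            "output_format": self.output_format,
            "examples": self.examples,
        }
        return [name for name, value in parts.items() if not value]

    def render(self) -> str:
        """Assemble the parts into one instruction text, always in the same order."""
        def bullets(items):
            return "\n".join(f"- {item}" for item in items)

        shots = "\n".join(f"Input: {i}\nOutput: {o}" for i, o in self.examples)
        return "\n\n".join([
            f"Scope:\n{self.scope}",
            f"Inputs:\n{bullets(self.inputs)}",
            f"Steps:\n{bullets(self.steps)}",
            f"Guardrails:\n{bullets(self.guardrails)}",
            f"Output format:\n{self.output_format}",
            f"Examples:\n{shots}",
        ])


# Illustrative skill only; the wording is made up for this sketch.
draft = Skill(
    scope="Flag possible missing 1099s on a 1040. Does not compute tax or decide a position.",
    inputs=["Prior-year payer list", "Current-year 1099s received"],
    steps=["Compare payer lists", "List payers with no current-year form"],
    guardrails=["No invented figures or cites", "Flag ambiguity [REVIEW]", "Never decide a position"],
    output_format="One line per flag: payer | reason | [REVIEW] if unsure",
)
print(draft.missing_parts())  # ['examples']
```

A check like `missing_parts()` is one cheap way to keep a half-written skill out of production use; here it catches that no worked examples were supplied.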
The architecture, concretely
A reliable AI workflow usually looks like this:
1. INGEST (script): pull/normalize the data, OCR, parse, dedupe, foot the totals
2. DRAFT & FLAG (LLM): classify, flag anomalies, draft, surface the rule that may apply (proposes only)
3. COMPUTE (script): any math, thresholds, rollups; never the LLM
4. JUDGE & SIGN (human): the professional weighs the flags, makes the calls, signs
5. WRITE (script): only after the human decides; see the staging-table pattern below
The decision lives in step 4, with you, never in step 2. The LLM hands you a better-prepared desk; it doesn't clear it.
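The five steps can be sketched as plain functions. This is a hedged illustration under stated assumptions: `stub_llm` and `stub_reviewer` are hypothetical stand-ins for a real model call and a real person, the function names are invented for the sketch, and the exact-duplicate dedupe is simplistic (a real pipeline would more likely key on a transaction id).

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class Txn:
    memo: str
    amount: Decimal


def ingest(rows):
    """Step 1 (script): normalize and drop exact duplicate rows."""
    seen, txns = set(), []
    for memo, amount in rows:
        txn = Txn(memo.strip(), Decimal(amount))
        if txn not in seen:
            seen.add(txn)
            txns.append(txn)
    return txns


def draft_and_flag(txns, llm):
    """Step 2 (LLM): attach a proposed category to each transaction."""
    return [(txn, llm(txn.memo)) for txn in txns]


def compute_total(txns):
    """Step 3 (script): exact decimal arithmetic, never the LLM."""
    return sum((txn.amount for txn in txns), Decimal("0"))


def judge(proposals, decide):
    """Step 4 (human): `decide` stands in for the professional's review."""
    return [(txn, decide(txn, proposed)) for txn, proposed in proposals]


def write(decisions):
    """Step 5 (script): emit only rows the human actually decided."""
    return [f"{t.memo}|{cat}|{t.amount}" for t, cat in decisions if cat is not None]


def stub_llm(memo):
    # Hypothetical model call: proposes a category or asks for review.
    return "Supplies" if "office" in memo.lower() else "REVIEW"


def stub_reviewer(txn, proposed):
    # Hypothetical human: accepts clear proposals, holds back anything flagged.
    return None if proposed == "REVIEW" else proposed


rows = [("Office Depot ", "42.10"), ("Office Depot", "42.10"), ("Uber", "18.00")]
txns = ingest(rows)
posted = write(judge(draft_and_flag(txns, stub_llm), stub_reviewer))
print(compute_total(txns))  # 60.10
print(posted)               # ['Office Depot|Supplies|42.10']
```

Note that the model's output only ever flows into `judge`; nothing reaches `write` without passing through the reviewer function.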
Worked example: a 1040 review skill set. Scripts extract and normalize the return data and foot the schedules (steps 1, 3); LLM skills flag "this looks like a missing 1099," "Schedule C home-office math doesn't tie," "consider a QBI issue" (step 2); the preparer decides what each flag means and what to do (step 4). The LLM never computes the tax and never decides a position; it spots what a reviewer should look at. The reviewer still reviews. That's what gets the error rate near zero and keeps the judgment where it legally belongs.
The write-safety pattern (don't let AI touch the books directly)
This is the most important pattern for anyone connecting AI to a ledger or any other system of record, and the Lab's best-in-class answer to write-hallucination:
The staging table. The LLM never writes to the live ledger. It writes a proposed change to a staging table (or a draft JE, a review queue). A deterministic script validates and executes it only after a human approves. The model proposes; code disposes; a person decides.
This gives you AI's speed on the proposal without a hallucinated write ever reaching a client's books unreviewed. If you build one safety pattern, build this one.
Testing & managing a skill library (hobby vs. production)
The difference between someone's 3 fun skills and a firm's 27 reliable ones is evaluation and version control. Non-determinism means "it worked when I tried it" is not evidence; you need to know it works repeatably.
- Golden test cases. For each skill, keep a few fixed input→expected-output pairs (anonymized). Before you change a skill, or trust it on a new client type, run them and confirm the output still holds. This is your regression test against the model drifting or a prompt edit breaking something.
- Version skills like code. A skill is source. Keep them in one place (a folder, ideally under git), date/version them, and write down what changed and why. "Which version reviewed that return?" should have an answer.
- Pin the model where it matters. The same skill can behave differently across models/versions. For anything load-bearing, note which model it was validated on, and re-test when you upgrade.
- Compose, don't bloat. When a skill starts doing three jobs, split it. Orchestrate small skills in sequence rather than growing one into a mega-prompt, so you keep the testability and debuggability.
- Token note: small, scoped skills usually cost fewer tokens than one do-everything prompt, and they let you right-size the model per step (heavy reasoning where needed, light elsewhere).
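The golden-case, versioning, and model-pinning points above can be combined in one small harness. This is a sketch under assumptions: `stub_skill` stands in for a real model call, the model name is a made-up placeholder, and exact string matching only makes sense when the skill's output format is fixed.

```python
import hashlib


def skill_version(skill_text: str) -> str:
    """Short content hash, so 'which version reviewed that return?' has an answer."""
    return hashlib.sha256(skill_text.encode("utf-8")).hexdigest()[:12]


def run_golden(skill_fn, cases, runs=3):
    """Run every golden case several times; return a list of mismatches."""
    failures = []
    for case in cases:
        for attempt in range(1, runs + 1):
            got = skill_fn(case["input"])
            if got != case["expected"]:
                failures.append({
                    "input": case["input"],
                    "attempt": attempt,
                    "expected": case["expected"],
                    "got": got,
                })
    return failures


# Anonymized, fixed input -> expected-output pairs (illustrative data only).
GOLDEN = [
    {"input": "STAPLES #1123 OFFICE SUPPLY", "expected": "Supplies"},
    {"input": "UNKNOWN VENDOR 8841", "expected": "REVIEW"},
]


def stub_skill(memo):
    # Hypothetical stand-in for calling the model with the skill's text.
    return "Supplies" if "SUPPLY" in memo else "REVIEW"


report = {
    "skill_version": skill_version("categorize-transactions skill text goes here"),
    "model": "example-model-v1",  # placeholder: record the model you validated on
    "failures": run_golden(stub_skill, GOLDEN),
}
print(report["failures"])  # []
```

Running each case more than once is the point: a skill that passes once and fails on the second attempt shows up as a failure here instead of as a surprise on client work.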
When to reach for a script vs. the LLM (rules of thumb)
| Task | Use |
|---|---|
| Arithmetic, footing, thresholds, rollups | Script (computes) |
| Exact reformatting, file moves, dedup | Script (computes) |
| Pulling/normalizing data from a source | Script (computes) |
| Classifying, coding, flagging anomalies | LLM (proposes) |
| Drafting language, summarizing, explaining | LLM (proposes) |
| Choosing the rule/treatment/position | LLM surfaces options → you decide |
| Writing to a ledger / system of record | Script, only after you decide (staging table) |
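The first row of the table is the easiest to illustrate. A footing check is a few lines of deterministic code, and it returns the same answer on every run. This is a minimal sketch; the function names and the sample amounts are made up for illustration.

```python
from decimal import Decimal


def foot(lines):
    """Sum line amounts exactly; amounts arrive as strings to avoid float error."""
    return sum((Decimal(amount) for amount in lines), Decimal("0"))


def ties(lines, reported_total):
    """True when the schedule foots to the reported total, to the cent."""
    return foot(lines) == Decimal(reported_total)


lines = ["0.10", "0.20", "1200.00"]
print(foot(lines))             # 1200.30
print(ties(lines, "1200.30"))  # True
print(0.1 + 0.2 == 0.3)        # False: why binary floats are avoided here
```

Whether a mismatch matters, and what to do about it, is still a question for the reviewer; the script only reports that the numbers do or do not tie.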
Guardrails for skill-based systems
- Non-determinism is a professional-reliability issue. The same input can yield different LLM output. Pin the mechanics in scripts; keep humans on the judgments; don't promise a return is "reviewed" because a skill ran.
- Verify authority regardless of the skill. A grounded skill still isn't primary authority; every tax cite gets checked (Module 2).
- Reviewer of record. Skills accelerate the reviewer; they don't replace the signature.
- The compliance stack still applies to data your skills touch: scrub or approved-tool, WISP, §7216 (Guardrails).
Your starter
Take one task you currently do with a long do-everything prompt. Split it three ways: which steps have a single correct answer (→ script), which are reasoning legwork the AI can draft (→ LLM), and which are the calls only you should make (→ you). Build just the LLM legwork step as a small skill, do the mechanics in a spreadsheet or script you control, and keep the decisions yours. Notice how much more consistent the result gets, and how the judgment never left your desk.
Open Lab track · pairs with Build vs. Buy and Workpaper automation.