AI Fundamentals for Finance

How LLMs actually behave — probabilistic outputs, hallucinations, context, intent, agents — and the one rule that keeps AI usable when numbers must be correct, not approximately correct.

Most people meet AI through a chat box. You type a question, you get an answer, and it feels like a calculator that talks. It is not. Working with AI in accounting — where numbers must be correct, not approximately correct — requires a different mental model.

This page covers the minimum you need: how Large Language Models behave, what to feed them, how to ask, how agents work, and the one rule that keeps AI usable for finance.

LLM Nature

A Large Language Model does not look up answers. It predicts the most likely next word, piece by piece, with randomness baked in. Different models trade off speed for depth, but all share this nature.

The consequence is unintuitive: the same prompt can produce different answers. Ask an LLM to compute a tax three times and you may get three slightly different numbers. The fourth try might be wrong by a wider margin.

LLMs are probabilistic, not deterministic. Treat their direct output as a draft, never as a verdict.
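To make the sampling behavior concrete, here is a minimal sketch in Python. It is a toy, not how any particular model is implemented: the token list and probabilities in NEXT_TOKEN_PROBS are invented for illustration, and sample_token and most_likely_token are hypothetical helper names.

```python
import random

# Hypothetical next-token probabilities after the prompt "VAT owed is ".
# Real models score a very large vocabulary; these numbers are invented.
NEXT_TOKEN_PROBS = {
    "1,240.00": 0.55,
    "1,204.00": 0.25,
    "1,420.00": 0.15,
    "12,400.00": 0.05,
}


def sample_token(probs, rng):
    """Draw one token at random, weighted by its probability."""
    tokens = list(probs)
    weights = [probs[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


def most_likely_token(probs):
    """Greedy pick: always return the highest-probability token."""
    return max(probs, key=probs.get)


rng = random.Random()  # unseeded on purpose, so each run may differ
answers = [sample_token(NEXT_TOKEN_PROBS, rng) for _ in range(5)]
print(answers)  # may contain different "totals" for the same prompt
print(most_likely_token(NEXT_TOKEN_PROBS))
```

Even the greedy pick is only the most probable continuation, not a looked-up fact, so removing the randomness would not turn the output into a verdict.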

Hallucinations

When an LLM does not know something, it does not stop. It guesses — fluently, confidently, and often wrongly. It will invent account names that don’t exist, cite tax rules that were never written, and reference invoices it has never seen. This is called hallucination, and it is not a bug to be patched away. It is a direct consequence of how the model works.

The lesson is simple: never trust raw LLM output for facts. Verify, ground, or — better — route the work through something deterministic.
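One cheap form of verification is to check every name the model mentions against data you already trust. The sketch below assumes a small in-memory chart of accounts; CHART_OF_ACCOUNTS, the sample names, and unknown_accounts are all hypothetical, and in practice the chart would come from the real books.

```python
# Hypothetical chart of accounts; in practice, load it from the real books.
CHART_OF_ACCOUNTS = {"Cash", "Accounts Receivable", "Sales", "VAT Payable"}


def unknown_accounts(mentioned, chart):
    """Return the mentioned account names that are not in the chart, sorted."""
    return sorted(set(mentioned) - set(chart))


# Account names taken from an imagined LLM answer.
llm_mentioned = ["Cash", "Sales", "Deferred VAT Reserve"]

invented = unknown_accounts(llm_mentioned, CHART_OF_ACCOUNTS)
print(invented)  # ['Deferred VAT Reserve']
```

A non-empty result does not say what the right account is. It only flags that the answer refers to something the books do not contain.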

Context

Hallucinations get worse when the model has nothing to ground itself on. That is where context comes in.

An LLM only knows what is in its context window right now: your current prompt, the files you attached, the recent conversation. It does not know your books. It does not remember last week. Each session starts blank.

You build context by handing it relevant pieces — a chart of accounts, a transaction list, a policy document, a project’s AGENTS.md, an installed skill. Persistent context, such as skills or project files, saves you from pasting the same information every time.

But context has a sweet spot. Too little, and the model invents. Too much, and it loses focus, mixes unrelated pieces, and slows down. The key is curation — give the model exactly what it needs to answer the question in front of it, nothing more.

Hallucination risk falls as context improves, but it never reaches zero: better context reduces the risk; it does not make the model deterministic.

Intent

Intent means telling AI what done looks like, not listing every step.

A useful prompt names four things: outcome, reason, source of truth, and success criteria.

Old habit: step-by-step

“Open my book, filter transactions tagged #sales for Jan–Mar, sum the VAT column, convert to EUR, give me the total.”

You are scripting the work. The model can still misread a step, skip one, or invent around it.

Better habit: intent

“I need the VAT I owe for Q1 2025, in EUR, ready to file. Use my Bkper book as the source of truth.”

You describe the outcome. The agent can decide which transactions to pull, which tag to trust, which math to run, and whether to answer directly or write a small script.

Success criteria

Pair intent with a concrete check: an expected total or range, a report shape that matches last quarter’s, a reconciliation that should come out to zero, or a specific account whose closing balance you know. Without that, the model has no way to know when it is done — and neither do you.
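Success criteria like these can be written down as small checks. The following Python sketch assumes amounts are handled as Decimal values; the function names and figures are hypothetical, chosen only to illustrate an expected range and a reconciliation that should come out to zero.

```python
from decimal import Decimal


def within_expected_range(total, low, high):
    """True when the computed total falls inside the range you expected."""
    return low <= total <= high


def reconciles(book_balance, statement_balance):
    """True when the difference between the two balances is exactly zero."""
    return book_balance - statement_balance == Decimal("0")


# Invented figures for illustration.
vat_total = Decimal("1240.00")
print(within_expected_range(vat_total, Decimal("1100.00"), Decimal("1400.00")))
print(reconciles(Decimal("5320.15"), Decimal("5320.15")))
print(reconciles(Decimal("5320.15"), Decimal("5320.10")))
```

Decimal is used instead of float so that a difference of zero means exactly zero rather than something close to it.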

Agents

An agent is an LLM running in a loop with tools. At each step the model proposes an action, runs a tool — a CLI command, a script, an API call — and observes the result. That observation feeds the next step, which may be progress, a correction, a retry, or a different approach. The loop keeps turning until the success criteria are met.

This is the shape behind the Bkper CLI Agent and other tool-using AI assistants. The success criteria are what close the loop: without them, a probabilistic engine running freely produces drift, not progress. And a loop is only as trustworthy as the tools inside it.
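The loop can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the implementation of the Bkper CLI Agent or of any real assistant: propose is a hard-coded stand-in for the model, sum_tag is a hypothetical tool over invented transactions, and max_steps is an added safeguard so the loop cannot run forever.

```python
from decimal import Decimal

# Invented transactions: (tag, amount).
TRANSACTIONS = [
    ("#sales", Decimal("100.00")),
    ("#sales", Decimal("200.00")),
    ("#rent", Decimal("50.00")),
]


def sum_tag(tag):
    """Deterministic tool: total of all amounts carrying the given tag."""
    return sum((amt for t, amt in TRANSACTIONS if t == tag), Decimal("0"))


TOOLS = {"sum_tag": sum_tag}


def propose(observations):
    """Stand-in for the model: the first guess has a typo, the second fixes it."""
    if not observations:
        return ("sum_tag", "#sale")
    return ("sum_tag", "#sales")


def run_agent(propose, tools, is_done, max_steps=5):
    """Propose an action, run the tool, observe, repeat until is_done holds."""
    observations = []
    for step in range(max_steps):
        tool_name, argument = propose(observations)
        result = tools[tool_name](argument)
        observations.append(result)
        if is_done(result):
            return result, step + 1
    raise RuntimeError("success criteria not met within max_steps")


# Success criterion: a total known in advance from another source.
result, steps = run_agent(propose, TOOLS, lambda r: r == Decimal("300.00"))
print(result, steps)  # 300.00 2
```

The first step returns zero because of the mistyped tag, the check fails, and the second step corrects it. Without the is_done check the loop would have no reason to reject the first result.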

AI in Accounting

AI fundamentals apply across finance. The accounting layer is where they get strict — because accounting numbers don’t have a tolerance band.

Accounting cannot be 99% right. A balance sheet that is mostly correct is wrong. A tax filing that is approximately accurate is a problem. And no technique — better prompts, richer context, smarter agents — makes an LLM’s output guaranteed correct. Errors will happen, and inside an agent loop they compound silently between checks.

So the rule is not "make the AI correct." Nothing makes the AI correct. The rule is:

Never let unverified LLM output be the final word on a number.

The practical question is how to keep verification cheap. That is what code is for.

When an LLM writes a script that computes the answer, you stop verifying outputs and start verifying the script. You read it once, test it, and trust it as long as it doesn’t change. From then on the same inputs give the same outputs, auditable line by line. Verification becomes a one-time cost instead of a per-result cost.
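As an illustration, a script of the kind an LLM might write for a VAT total could look like the Python sketch below. The 20% rate, the #sales tag, the half-up rounding rule, and the sample amounts are all assumptions made for the example, not tax guidance and not a Bkper API.

```python
from decimal import Decimal, ROUND_HALF_UP

VAT_RATE = Decimal("0.20")  # assumed rate, for illustration only


def vat_owed(transactions, rate=VAT_RATE):
    """Sum net #sales amounts and apply the rate, rounded to cents (half up)."""
    net = sum((amt for tag, amt in transactions if tag == "#sales"), Decimal("0"))
    return (net * rate).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


# Invented Q1 transactions: (tag, net amount).
q1 = [
    ("#sales", Decimal("1000.00")),
    ("#sales", Decimal("250.50")),
    ("#rent", Decimal("400.00")),
]

print(vat_owed(q1))  # 250.10
print(vat_owed(q1) == vat_owed(q1))  # True: same inputs, same output
```

Everything a reviewer needs to check is visible in a few lines: which transactions are included, which rate is applied, and how rounding works. Once those are confirmed, every later run on the same inputs gives the same number.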

That shifts the rule into a practical split:

Deterministic work — tax calculations, reports, reconciliations, financial statements, balance computations — has a single correct answer that must be reproducible. Have the LLM write code or call a deterministic tool, then verify the code, not each output. The work becomes repeatable, auditable, and reusable.
Non-deterministic work — spotting suspicious transactions, surfacing business insights, bootstrapping a chart of accounts, summarizing a period — has no single correct answer. Direct LLM output is acceptable here, but only as a draft for a human to review and decide on.

In both cases the human stays in the loop. AI doesn’t remove the reviewer; it changes what arrives for review. With code carrying the deterministic load, the human is checking artifacts a human can actually check — a script, a report engine, an app — instead of re-checking every number the model emits.

Further watching

“Never Trust An LLM” by Matt Pocock — a developer-oriented explanation of why LLM output must be verified instead of trusted directly.

What’s next

Docs for AI — get Bkper docs and context into AI tools.
CLI vs MCP — choose how an AI assistant should use Bkper.
Coding Agents — build Bkper integrations with grounded coding agents.