Ger Murphy · 8 min read

AI document extraction vs OCR: line items are the real test

OCR reads characters. AI extraction understands structure. What the difference means when you're posting hundreds of supplier invoices each week.

If you've ever exported an OCR result for a supplier invoice and tried to post it directly into your accounts package, you already know the gap. The total is right. The supplier name is right. But the line items are a wall of text, the VAT breakdown is in the wrong column, and the page number from page 2 has ended up inside the unit price field. So you re-key the whole thing.

This post is about why that gap exists, what changes when you swap traditional OCR for AI-based extraction, and where each approach actually fits in a working bookkeeping process.

What OCR is — and what it isn't

OCR (Optical Character Recognition) is character recognition. You give it a scan or a PDF, it gives you back the text. Modern OCR engines are very good at this — accuracy on a clean, machine-printed invoice has been north of 99% per character for years.

But character recognition is not the same as document understanding. OCR will tell you that the page contains the string Total €1,247.50 VAT @ 23% €233.13 Net €1,014.37. It will not tell you which of those three numbers is the gross, which is the VAT, and which is the net. It also will not tell you whether the VAT figure is itself part of the gross or stated separately.

For continuous prose — a contract, a letter, a bank statement narrative — that's fine. The text is the content. For an invoice, where the layout is the meaning, raw text is only the start of the job.

Why pure OCR struggles on real invoices

The invoices that come through a typical Irish or UK practice are not clean reference documents. They are:

  • Photographs from a phone, often slightly rotated, sometimes with a finger in the corner.
  • Scans through a mobile app with hard JPEG compression, banding, and uneven lighting.
  • PDFs exported from line-of-business systems with the text positioned by absolute coordinates rather than sensible reading order.
  • Multi-page statements where the totals are on page 1 and the line items are on pages 2–4, or vice versa.
  • Multi-currency — a euro invoice with a sterling amount in brackets, or a dollar quote with the converted euro figure stamped on the bottom.
  • Bilingual — Irish/English on Revenue-related invoices, German/English on EU acquisitions.

A pure OCR pipeline that returns a flat string doesn't help you pick the right number out of any of those. You still need a second layer — usually a stack of regular expressions, keyword matching, and bounding-box heuristics — to convert "the page contains these characters" into "the gross total is €1,247.50". That second layer is where most invoice automation has historically lived, and it is fragile. A new template from a familiar supplier will quietly break it. A VAT line that has moved position will route the wrong figure into the VAT field. The first you hear about it is when the VAT3 doesn't reconcile.
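That regex-and-keywords layer is easy to sketch, and the sketch makes the fragility visible. The patterns below are illustrative, not taken from any particular product:

```python
import re

# A typical template-era "second layer": keyword + regex heuristics
# that turn OCR's flat string into named fields.
AMOUNT = r"€?\s*([\d.,]+)"

PATTERNS = {
    "gross": re.compile(rf"Total\s+{AMOUNT}"),
    "vat":   re.compile(rf"VAT\s*@\s*23%\s+{AMOUNT}"),
    "net":   re.compile(rf"Net\s+{AMOUNT}"),
}

def parse(ocr_text: str) -> dict:
    """Pull named fields out of flat OCR text by keyword matching."""
    out = {}
    for field, pattern in PATTERNS.items():
        m = pattern.search(ocr_text)
        out[field] = m.group(1) if m else None
    return out

# Works on the layout the patterns were written for...
print(parse("Total €1,247.50 VAT @ 23% €233.13 Net €1,014.37"))
# ...and quietly returns nothing when a familiar supplier
# relabels "Total" as "Amount Payable" and "VAT" as "Tax".
print(parse("Amount Payable €1,247.50 Tax (23%) €233.13 Net €1,014.37"))
```

The second call fails without an error: the gross and VAT fields simply come back empty because two labels changed, which is exactly the silent-breakage mode described above.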

What AI-based extraction does differently

AI-based extraction — using vision-capable language models — collapses those two layers into one. Instead of "read the characters, then guess the structure", the model is given the document directly and asked for the structured fields you care about.

The practical difference is that the model has been trained on enough invoices, statements, and receipts to understand that:

  • A label like Total Due, Amount Payable, Balance, or À régler (French for "amount due") is the gross.
  • A figure preceded by @ 23% or VAT 23% is the tax, not the net.
  • A column with descriptions, quantities, unit prices, and line totals is a line-item table — even if the ruling lines between rows are missing or the column headers are on a different page.
  • A photograph rotated 90° still contains the same invoice; the orientation is incidental.
  • Two values that look similar (1,247.50 and 1.247,50) are the same amount expressed in different decimal conventions.

None of that requires the document to match a specific template. The model handles a new supplier the same way a human does — by reading what's on the page and assigning meaning to it.
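The decimal-convention point is worth making concrete. A small normaliser along these lines (illustrative, and assuming amounts are written with two decimal places) shows why 1,247.50 and 1.247,50 must resolve to the same value before anything is posted:

```python
from decimal import Decimal

def normalise_amount(raw: str) -> Decimal:
    """Resolve European (1.247,50) and Anglophone (1,247.50)
    conventions to the same Decimal value.
    Assumes two decimal places; a bare "1,247" with no second
    separator is genuinely ambiguous and not handled here."""
    s = raw.strip().lstrip("€£$").strip()
    # Whichever separator appears last is the decimal point.
    last_dot, last_comma = s.rfind("."), s.rfind(",")
    if last_comma > last_dot:
        s = s.replace(".", "").replace(",", ".")  # European style
    else:
        s = s.replace(",", "")                    # Anglophone style
    return Decimal(s)

assert normalise_amount("1,247.50") == normalise_amount("1.247,50")
```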

Where the difference actually shows up

The headline accuracy numbers ("99% on totals" / "97% on supplier name") are not where the difference is felt. Both approaches do well on those fields when the document is clean. The difference shows up in three places:

1. Line-item tables. A pure OCR pipeline will give you the line items as a stream of words and hope that whitespace tells you where the columns are. That works for a clean software-generated PDF and falls apart on anything else. AI-based extraction returns the line items as actual rows — description, quantity, unit price, VAT rate, line total — without you having to write a parser per supplier.

2. Multi-page documents. Invoice on page 1, terms and conditions on page 2, second invoice on page 3 — a single PDF, three documents. OCR sees one big string. AI-based extraction can be asked to return each invoice as a separate object, with the right pages assigned to each. Same goes for bank statements: the model can return one transaction per row across 30 pages without you stitching them together.

3. Edge cases. Credit notes posted as negative invoices. Discounts applied at the line level vs. the document level. VAT reverse-charge wording. Foreign-currency invoices with a stamped IE conversion. The long tail of "weird stuff a human would see and interpret correctly" is where templated OCR pipelines accumulate quiet errors and where AI-based extraction holds up much better.
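To make point 1 above concrete: "actual rows" means something like the following shape, with a per-row sanity check. The field names and the example item are illustrative, not any tool's actual schema:

```python
from dataclasses import dataclass
from decimal import Decimal

@dataclass
class LineItem:
    description: str
    quantity: Decimal
    unit_price: Decimal
    vat_rate: Decimal   # e.g. Decimal("23") for 23%
    line_total: Decimal

    def is_consistent(self, tolerance: Decimal = Decimal("0.01")) -> bool:
        """Quantity × unit price should reproduce the printed line
        total, within a cent to allow for supplier-side rounding."""
        expected = (self.quantity * self.unit_price).quantize(Decimal("0.01"))
        return abs(expected - self.line_total) <= tolerance

item = LineItem("A4 paper, 5 reams", Decimal("5"), Decimal("4.99"),
                Decimal("23"), Decimal("24.95"))
assert item.is_consistent()
```

A check like this is only possible because the extraction returned discrete fields per row; a whitespace-split OCR stream gives you nothing to validate against.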

Where OCR is still the right tool

This is not an "OCR is dead" argument. There are jobs where pure OCR is still the right answer:

  • High-volume, single-template documents — utility bills from one provider, lab forms from one lab. If the layout is fixed and the volume is high enough to justify a custom parser, OCR + a template will be faster and cheaper than running every page through a model.
  • Searchability — making a PDF text-searchable so users can grep through an archive. You don't need structure for that. You need text.
  • Compliance copies — converting paper to a text-extractable digital format for retention. Again, the goal is the text, not the meaning.
  • Pre-processing — on documents where the visual layout is degraded, high-quality OCR text can usefully be fed into an AI extraction step as a cross-check.

What changes is the calculus on multi-template, mixed-supplier work — exactly the workload of a typical bookkeeping practice — where the variety of layouts means template-per-supplier OCR doesn't pay off.

The hybrid reality

In practice, most production document pipelines run both. OCR for what OCR is good at — searchable text, fast first-pass character recognition, a fallback when image quality is poor. AI-based extraction for what it's good at — structured fields, tables, multi-document handling, rate detection, edge cases.
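In outline, that division of labour can be sketched as follows. Here `run_ocr` and `extract_with_model` are stand-ins for whichever OCR engine and vision-capable model a pipeline actually uses, stubbed so the routing is runnable:

```python
def run_ocr(document: bytes) -> tuple[str, float]:
    # Stand-in: a real OCR engine returns text plus a confidence score.
    return "Total €1,247.50", 0.55

def extract_with_model(document: bytes) -> dict:
    # Stand-in: a real vision-capable model returns structured fields.
    return {"gross": "1247.50", "supplier": "Example Ltd"}

def process(document: bytes) -> dict:
    """Run both layers: OCR output is kept for searchability,
    structured fields always come from the model."""
    text, confidence = run_ocr(document)
    result = {"searchable_text": text, "ocr_confidence": confidence}
    result.update(extract_with_model(document))
    return result

print(process(b"fake-pdf-bytes"))
```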

The user-facing outcome is what matters: did the line items, totals, VAT codes, and supplier details land in the accounts package with the right values, and did the human reviewing the extraction need to fix one field or twenty?

What to look for if you're evaluating tools

If you're comparing invoice automation tools and the marketing all looks the same, here are the questions that actually separate them:

  1. Show me a line-item table from a multi-line invoice. Are descriptions, quantities, unit prices, VAT rates, and line totals all returned as discrete fields, or is the whole table dumped into one description field?
  2. What happens with a phone photograph of a crumpled receipt? A slightly rotated, badly lit photograph is the realistic worst case in a typical practice. If the tool requires a clean PDF, it's not going to survive contact with reality.
  3. What does it do with a multi-invoice PDF? Some practices receive a monthly statement that contains six separate invoices. Does the tool split them, or does it return one giant blob?
  4. How does it handle VAT? Specifically, does it pull the VAT rate per line (23%, 13.5%, 9%, 0%), or just the VAT amount? Rates are what matter for export to QuickBooks, Xero, or Sage — and the Revenue VAT rates database is the only authoritative source for which rate applies to which supply. (See our post on Irish VAT mapping for why the rate, not the amount, is the load-bearing field.)
  5. What's the review experience? Extraction is never going to be 100%. The question is how fast you can scan the result, accept the right fields, and correct the wrong ones. A tool with a great extraction rate and a slow review UI loses to one with a slightly worse extraction rate and a fast review UI.
  6. Where does the data live? For Irish and UK practices subject to GDPR, EU data residency is not optional. Ask, get the answer in writing.
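On question 4, the gap between extracting a rate and extracting an amount shows up at export time. A mapping along these lines only works if the extraction step returned the rate per line; the tax-code names below are placeholders, not real QuickBooks, Xero, or Sage codes, which vary per package and per company file:

```python
from decimal import Decimal

# Irish VAT rates mentioned above, mapped to hypothetical
# accounts-package tax codes. Check your own chart of tax codes
# for the real names.
RATE_TO_CODE = {
    Decimal("23"):   "VAT-STD",
    Decimal("13.5"): "VAT-RED",
    Decimal("9"):    "VAT-RED2",
    Decimal("0"):    "VAT-ZERO",
}

def tax_code_for(rate: Decimal) -> str:
    try:
        return RATE_TO_CODE[rate]
    except KeyError:
        # Unknown rate: flag for human review rather than guessing.
        raise ValueError(f"Unmapped VAT rate: {rate}%")

assert tax_code_for(Decimal("13.5")) == "VAT-RED"
```

A tool that only returns the VAT amount leaves you reverse-engineering the rate from amount and net, which breaks on rounding and on mixed-rate invoices.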

What this means in practice

If you're posting fewer than a hundred invoices a month from a fixed set of suppliers, a templated OCR setup will probably work well enough. If you're posting hundreds a month across dozens of suppliers — or if you're doing it for multiple clients and the variety multiplies — the friction of maintaining templates is the thing that quietly burns time.

KrinoDoc was built around the second case. Documents go in, structured fields come out, you review, you export to QuickBooks Online, Xero, or Sage with the codes already mapped. The bit that used to require a person reading the page and typing the numbers is the bit we've moved off the human's plate.

There is still a human in the loop. Extraction is not magic, and any tool that claims 100% accuracy on real-world documents is selling you something. But the loop is much shorter than it used to be — and the gap between what OCR gives you and what your accounts package actually needs is the gap that AI-based extraction has finally started to close.
