Smartscan lets you upload an invoice or a receipt and it will automatically find the information in the document that you need. This includes things such as the total amount, order date, and more. Let’s see a small example of the information that Smartscan can find for you:

When Smartscan scans an invoice, it pulls out information such as invoice number, order date, and the total VAT.

This saves people many hours of manually typing information into accounting systems, which translates directly into cost savings for companies and helps them become paperless.

Why is invoice scanning difficult to solve?

We humans excel at finding structure, and we benefit from already understanding language and knowing what to look for. This can make invoice and receipt scanning seem easy from a human perspective. There are several reasons why the task is much harder for a computer:

Language: A computer has no prior knowledge about language (grammar, semantics, numbers, and so on).
Multilinguality: We consider it impressive when a human understands a handful of languages, but Smartscan has to understand documents in more than 10 languages.
Layout: Humans use the layout of a document to quickly identify relevant information. A computer has no prior notion of what layout is and how it is used to relate words to each other in a document. On top of this, there are no industry standards for what an invoice has to look like.
Implicit information: Some information cannot be found explicitly in a document. An example of this is the billing date, which is frequently noted as being 30 days from the order date. The VAT amount is another example, as it is often only implied by the VAT rules of the supplier’s country.
Speed: Smartscan has to be able to do all of this up to several hundred thousand times a day.

Smartscan has to learn all of these concepts (and more) from scratch. Teaching a computer how to do this is like teaching a toddler to do the same.

The Sesame Street revolution

Smartscan works by finding all words in a document through optical character recognition (OCR) and identifying which of the words are relevant to the user. This means that we can benefit greatly from innovations in natural language processing (NLP) by using NLP techniques to identify which words are relevant.

Our journey begins in NLP where a revolution has taken place in recent years. The revolution was kick-started by a series of models named after characters from Sesame Street like ELMo, Big Bird, and RoBERTa, but it was BERT that fast-tracked the revolution.

The Sesame Street characters, from the left: Grover, Bert, Abby Cadabby, Ernie, Kami, Elmo, Big Bird, the Cookie Monster.

While being a full-time muppet, BERT also enjoys his role as a large language model based on Transformers. Transformers deserve their own post, and we will cover them in the next one. For now, we will not worry about what exactly a Transformer is. Instead, we will simply think of it as a black box that encodes a vector representation of a sentence:

Transformer as a black box that encodes a vector representation of a sentence.

We actually get three of the letters in BERT from this simple understanding of Transformers: BERT Encodes vector Representations of sentences using Transformers.

BERT fast-tracked the revolution by showing how transfer learning can be greatly improved. Transfer learning is about how one machine learning model can transfer what it has learned to another model.

The first step is to train a general-purpose model on a huge text corpus (pretraining). Later, the pretrained model can be trained to do more specific tasks using labelled data (fine-tuning). The process looks like this:

BERT workflow: Pretrain a Transformer on a huge dataset and fine-tune it to do specific tasks.

A task could for example be filtering email spam, finding names in a sentence, translation, writing a summary of an article, and more. We will be talking a lot more about how we fine-tune in a later post. In this post, we will focus on the pretraining step, how it works, and what benefits we gain from it.

You might be asking why we want to pretrain in the first place. After all, it is much simpler to skip the pretraining step and go directly to training models for the tasks that we care about (supervised learning). We can understand how pretraining helps by looking at some of the challenges of supervised learning:

Supervised learning requires labelled data, which is time-consuming to acquire. In the previous examples, we would have to involve humans. For a spam filter, people would have to provide a label of “spam” or “not spam” for every email we want to train on. In the translation example, we would have to ask professional translators to provide correct translations for a large text corpus.

Labelled text data is hard to come by, which restricts the size of labelled datasets. Unlabelled text data, on the other hand, is vastly easier to obtain at a large scale.

We know that language contains a lot of knowledge and structure (e.g. grammar). One would expect that language-based tasks would benefit from not having to learn these concepts from scratch for every task. The restriction on the size of labelled datasets also increases the difficulty of learning these concepts using supervised learning alone.

Pretraining addresses these challenges by enriching a model with language understanding and only using unlabelled data. We should mention though that while the NLP community has some understanding of why BERT pretraining is beneficial, we do not yet fully understand why it is so.

How to Pretrain Your Transformer

Pretraining is not an idea that was introduced with BERT. Most pretraining algorithms prior to BERT worked in a unidirectional manner. In practice, unidirectionality means that the model is only allowed to use the left context of a sentence to predict which word is most likely to follow. Consider this example where “sheep” is the most likely word to follow:

A unidirectional model predicts “sheep” to be the next word in the sentence.
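As a minimal sketch of what unidirectional training data looks like, the snippet below builds (left context, next word) pairs from a sentence. The sentence and the function name `next_word_examples` are made up for illustration and are not taken from the figure.

```python
def next_word_examples(sentence: str) -> list[tuple[list[str], str]]:
    """Build (left context, next word) training pairs from one sentence."""
    words = sentence.split()
    # At position i the model may only look at words[:i] (the left context).
    return [(words[:i], words[i]) for i in range(1, len(words))]


# Hypothetical example sentence, chosen only for illustration.
for context, target in next_word_examples("the farmer counts his sheep"):
    print(" ".join(context), "->", target)
```

Each pair only ever exposes the words to the left of the target, which is exactly the restriction that the bidirectional approach removes.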

In fact, OpenAI’s famous GPT-3 uses a unidirectional approach with a Transformer decoder as the model.

The authors behind BERT argued that the unidirectional approach limited the model’s capacity in several tasks where the entire context is needed. This is why they opted for a bidirectional (the B in BERT) approach.

BERT adopts the bidirectional approach through a masked language model (MLM). An MLM takes a sequence as input, masks about 15% of the items at random, and attempts to predict what the masked items are. Let’s see how this works for the sentence in the unidirectional example:

Masked language model.

BERT masks “know” and “tells” and gives the Transformer the sentence with the two masked words. The Transformer then attempts to figure out which words are likely to be where the masks are. We see that BERT uses the masked words as labels to train against.

Incidentally, this shows why MLM is a form of self-supervised learning: MLM creates its own labels from an unlabelled dataset (the self part) and uses supervised learning to train the model.

MLM gives an efficient way of pretraining a bidirectional language model without using any labelled data! There is a problem, though: masking only happens during pretraining, which creates a mismatch between pretraining and fine-tuning.

BERT makes a few changes and additions to minimise the mismatch, such that the words chosen to be masked are…

masked 80% of the time.
replaced by another randomly selected word 10% of the time.
left unchanged 10% of the time.

This forces the model to learn how likely the chosen words are to belong in the sentence (regardless of whether they are masked). This requires a fairly high level of language understanding to predict confidently.
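The masking rule above can be sketched in a few lines of Python. This is a simplified illustration rather than BERT’s actual implementation: it works on whole words instead of subword tokens, and the function name `mask_for_mlm`, the tiny vocabulary, and the example sentence are our own assumptions.

```python
import random

MASK = "[MASK]"


def mask_for_mlm(tokens, vocab, rng, select_prob=0.15):
    """Return (model_inputs, labels) for one masked-language-model example.

    labels[i] holds the original token at chosen positions and None elsewhere,
    so the loss is only computed on the chosen positions.
    """
    inputs, labels = list(tokens), [None] * len(tokens)
    for i, token in enumerate(tokens):
        if rng.random() >= select_prob:
            continue  # not chosen: keep the token, no label
        labels[i] = token
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = MASK  # 80%: replace with the mask token
        elif roll < 0.9:
            inputs[i] = rng.choice(vocab)  # 10%: replace with a random word
        # remaining 10%: leave the token unchanged
    return inputs, labels


rng = random.Random(0)
vocab = ["invoice", "total", "date", "amount", "customer", "supplier"]
tokens = "the total amount is due on the order date".split()
inputs, labels = mask_for_mlm(tokens, vocab, rng)
print(inputs)
print(labels)
```

Note that the label is recorded for every chosen word, including the ones that are left unchanged, so the model cannot tell from the input alone which positions it will be graded on.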

(BERT actually also uses another objective called Next Sentence Prediction in its pretraining, but we will not talk about that here because it is irrelevant to Smartscan Premium. We still encourage you to learn more. There are many excellent resources about BERT and the research paper is freely available.)

We have now seen how to pretrain with the MLM, but we have not talked about the scale of pretraining. BERT came in two variants where the difference is the size of the Transformer: Base with 110 million parameters and Large with 340 million parameters.
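To get a feeling for where these parameter counts come from, here is a rough back-of-the-envelope sketch. The layer counts and hidden sizes (12 layers of width 768 for Base, 24 layers of width 1,024 for Large) come from the BERT paper, and the vocabulary size of 30,522 is that of the released English uncased model; none of these numbers are stated in this post, so treat them as assumptions of the sketch.

```python
def bert_parameter_count(layers, hidden, vocab=30_522, max_positions=512, segments=2):
    """Approximate parameter count of a BERT-style encoder."""
    ffn = 4 * hidden  # width of the feed-forward layer
    layer_norm = 2 * hidden  # scale and bias

    embeddings = (vocab + max_positions + segments) * hidden + layer_norm
    attention = 4 * (hidden * hidden + hidden)  # query, key, value, output
    feed_forward = (hidden * ffn + ffn) + (ffn * hidden + hidden)
    per_layer = attention + layer_norm + feed_forward + layer_norm
    pooler = hidden * hidden + hidden  # dense layer on the first token

    return embeddings + layers * per_layer + pooler


print(f"Base:  {bert_parameter_count(12, 768):,}")
print(f"Large: {bert_parameter_count(24, 1024):,}")
```

Under these assumptions the counts land close to the rounded figures of 110 million and 340 million, and most of the parameters sit in the stacked Transformer layers rather than in the embedding tables.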

The authors found Large to be more accurate than Base in most cases, which agrees with other research showing that models become more accurate as they become larger. Recent years have seen a parameter explosion with models that scale to trillions of parameters.

A crucial aspect of this scaling, however, is that the dataset size has to increase along with the model size in order to reap the benefits of scale. Indeed, BERT used a huge pretraining dataset consisting of more than 3 billion words—more than the entire English Wikipedia at the time (2018).

The large model and dataset sizes make pretraining a costly affair, but it pays off as we will see towards the end of this post.

Wait, what about documents?

We now have a pretty good idea about what BERT does and why it is a good idea to pretrain. You may have noticed though that BERT doesn’t actually work on documents like invoices and receipts out of the box. That is because it was only designed for pure text documents and not for documents where the layout is an important factor.

We can understand this problem from how BERT creates the input vectors to the Transformer for an invoice. There is quite a bit of information in the following figure, but do not worry: we will walk through it bit by bit.

BERT makes vector inputs using the words and the word order in the invoice.

We start with the invoice on the left and let the OCR find the words and assign an order to the words. The words are “Customer Supplier Total amount” in that order in this example. The words are turned into vectors through an embedding table in the middle that maps from words to vectors. The same goes for the word order.

The final input is the addition of the word and order vectors. The problem is now that the input vectors to the Transformer only contain information about the words as if they occurred in a linear order.
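The addition of word and order vectors can be sketched as follows. The embedding tables here are filled with random numbers as a stand-in for learned values, and the vector size, table sizes, and function names are assumptions made for illustration.

```python
import random

rng = random.Random(0)
DIM = 4  # tiny vector size, for illustration only


def random_table(rows):
    """Stand-in for a learned embedding table: one random vector per row."""
    return [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(rows)]


vocab = {"Customer": 0, "Supplier": 1, "Total": 2, "amount": 3}
word_table = random_table(len(vocab))
order_table = random_table(16)  # supports sequences of up to 16 words


def bert_style_inputs(words):
    """Add each word's vector to the vector for its position in the sequence."""
    return [
        [w + p for w, p in zip(word_table[vocab[word]], order_table[position])]
        for position, word in enumerate(words)
    ]


inputs = bert_style_inputs(["Customer", "Supplier", "Total", "amount"])
print(len(inputs), len(inputs[0]))  # 4 vectors of size 4
```

Nothing in these vectors says where on the page a word was found: only the word itself and its index in the sequence are encoded.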

However, that is not the way they are presented in the invoice. “Customer” is in the top left corner, “Supplier” is in the top right corner, and “Total amount” is in the bottom right corner. This is a simplified example, but real invoices will typically contain much more complicated layouts that combine tables, boxes, and more. We need a way to make the Transformer aware of the layout.

We do this by taking inspiration from LayoutLM (Layout Language Model), which extends BERT to visually rich documents such as invoices and receipts.

The key finding of LayoutLM is that BERT can transition from a language model to a layout-aware language model. That is done by imbuing the inputs to the Transformer with the 2D position of each word. Let’s see how that is done:

LayoutLM makes vector inputs using the words, word order, and 2D position in the invoice.

LayoutLM largely does the same as BERT, but also adds layout embeddings based on the x- and y-coordinates of the words. Now we have input vectors that contain all the information about the words and their positions.
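A minimal sketch of the layout-aware input is shown below. It is a simplification under stated assumptions: each word gets a single (x, y) point scaled to integers between 0 and 1,000, whereas LayoutLM itself embeds the corners of each word’s bounding box. The tables are random stand-ins for learned embeddings, and the coordinates are hypothetical.

```python
import random

rng = random.Random(0)
DIM = 4  # tiny vector size, for illustration only
GRID = 1000  # assumed: coordinates are scaled to integers in [0, GRID]


def random_table(rows):
    """Stand-in for a learned embedding table: one random vector per row."""
    return [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(rows)]


vocab = {"Customer": 0, "Supplier": 1, "Total": 2, "amount": 3}
word_table = random_table(len(vocab))
order_table = random_table(16)
x_table = random_table(GRID + 1)  # one vector per horizontal position
y_table = random_table(GRID + 1)  # one vector per vertical position


def layout_aware_inputs(ocr_words):
    """Sum word, order, x and y vectors for each (text, x, y) OCR word."""
    inputs = []
    for position, (text, x, y) in enumerate(ocr_words):
        parts = (word_table[vocab[text]], order_table[position], x_table[x], y_table[y])
        inputs.append([sum(values) for values in zip(*parts)])
    return inputs


# Hypothetical coordinates: top left, top right, and bottom right of the page.
ocr_words = [("Customer", 50, 40), ("Supplier", 800, 40),
             ("Total", 780, 950), ("amount", 850, 950)]
inputs = layout_aware_inputs(ocr_words)
print(len(inputs), len(inputs[0]))  # 4 vectors of size 4
```

Two words on the same line now share the same y vector, which gives the Transformer a direct signal that they belong together.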

The next step is to define how to pretrain LayoutLM. It turns out that the MLM can easily be extended to include layout. It works the same way as it does for BERT, but we never mask the layout embeddings. This way, the MLM still has to predict the words that were chosen for masking, but it remains aware of their positions in the document.
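The idea of masking the text while keeping the position can be sketched like this. To keep the example deterministic, the positions to mask are passed in explicitly and every chosen word is simply replaced by the mask token; the function name and the coordinates are hypothetical.

```python
MASK = "[MASK]"


def mask_words_keep_layout(ocr_words, chosen):
    """Mask the text of the chosen words but keep every word's coordinates."""
    inputs, labels = [], []
    for index, (text, x, y) in enumerate(ocr_words):
        if index in chosen:
            inputs.append((MASK, x, y))  # text hidden, position still visible
            labels.append(text)  # the model must predict this word
        else:
            inputs.append((text, x, y))
            labels.append(None)  # no loss on this position
    return inputs, labels


ocr_words = [("Customer", 50, 40), ("Supplier", 800, 40),
             ("Total", 780, 950), ("amount", 850, 950)]
inputs, labels = mask_words_keep_layout(ocr_words, chosen={2})
print(inputs[2])  # ('[MASK]', 780, 950)
print(labels)     # [None, None, 'Total', None]
```

The model sees that something sits in the bottom right corner next to “amount” and has to work out that the missing word is “Total”.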

The idea is that this should allow the model to learn the importance of the word’s position in the document in addition to gaining language understanding.

We could actually have used more information in an invoice or receipt to potentially improve Smartscan even more. Specifically, we could have used visual features such as colour, font size, formatting, and so on.

The authors of LayoutLM found that such features can improve the model in some cases, but the improvements are not always consistent. We decided to leave visual features for a future version of Smartscan for this reason.

We now know how to pretrain a layout-aware language model. Our model contains over 170 million parameters in total and we pretrained it on more than 23 million invoices and receipts! This is a substantial amount that required 4 days of pretraining on a machine with 16 A100 GPUs. Let’s now look at the benefits we gain from doing this.

Benefits of Pretraining

We evaluate the effect of pretraining by comparing the fine-tuned model, Premium, with Smartscan Standard. It would take too long to explain exactly how Standard works.

Instead, we will simply summarise that Standard does not use pretraining and uses a recurrent neural network (specifically an LSTM) instead of a Transformer.

We will not go into the details here. However, we have tested that the majority of the improvements in Premium stem from the pretraining and not just from using a Transformer instead of an LSTM.

We will focus on two fields that users are frequently interested in: Order Date and Total Including VAT. The results are consistent, or better, across all fields, but we only show these two for brevity. We choose to focus on these two fields because Standard is already quite good on them.

This is because we have many labelled examples of Order Date and Total Including VAT. The difference between Standard and Premium is even greater when compared to fields with only a few labelled examples. Let’s return to that point a little later below. We fine-tune two models to learn how to find Order Date and Total Including VAT.

Let’s start by measuring the accuracy of the models through their error rate. The error rate is the percentage of documents where the field is predicted incorrectly (lower is better).
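As a small sketch of this definition, the function below computes the error rate for one field. The function name and the example values are made up for illustration and are not Smartscan results.

```python
def error_rate(predictions, truths):
    """Percentage of documents where the predicted field value is wrong."""
    if not truths or len(predictions) != len(truths):
        raise ValueError("need one prediction per document, and at least one document")
    errors = sum(p != t for p, t in zip(predictions, truths))
    return 100.0 * errors / len(truths)


# Made-up values for the Order Date field on four documents.
predicted = ["2021-03-01", "2021-03-05", "2021-04-11", "2021-05-20"]
actual = ["2021-03-01", "2021-03-05", "2021-04-12", "2021-05-20"]
print(error_rate(predicted, actual))  # 25.0
```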

Error Rates of Standard and Premium.

Premium halves the error rate of Standard. We can already see that pretraining has paid off, as Premium is twice as good as Standard. There is more, though: we are also interested in how Standard and Premium compare on documents that are different from what they would typically see in a dataset.

This could, for example, be documents in a language that is underrepresented in the training set, or documents with atypical layouts. We devised a dataset of such documents and measured the change in error rate compared to the previous figure. Let’s have a look at the numbers:

Change in error rate for atypical documents.

Standard behaves the way that we would expect a machine learning model to behave for atypical documents: its error rate increases. Premium behaves quite astonishingly, though.

Its error rate hardly changes for Total Including VAT and even decreases slightly for Order Date. We think that this is because the pretrained model has already seen a vast number of different documents, making the fine-tuned models able to generalise very well.

We see that Premium obtains very high accuracy and generalises to atypical documents. These are great results. We are also interested in how many labelled examples are required to learn to recognise a field. This is known as sample efficiency:

Sample efficiency of Standard and Premium.

It turns out that Premium is highly sample efficient. We typically consider a field to have minimum acceptable accuracy when it reaches an error rate of around 5% to 10%. Premium reaches this level with between 100 and 1,000 labelled examples.

Standard, however, needs between 50,000 and 100,000 labelled examples to obtain it. Premium is around 100 times more sample efficient.

The ramifications of this are very important to our business case as it allows Smartscan to add more fields at a very fast rate. It is a tedious and time-consuming process to have humans label between 50,000 and 100,000 documents. However, 100 to 1,000 documents can be labelled within a week or two.

The last point about sample efficiency brings us to the final benefit of pretraining. We have focused on using the pretrained model to find specific fields in invoices and receipts. The pretrained model is not only made for this specific purpose though.

In fact, it is a general model of documents that can be fine-tuned to a large variety of tasks: Maybe we want to summarise an invoice in the future? Recognise lines and map them to specific accounts? Identify fraudulent activity? The high sample efficiency of the pretrained model allows us to quickly reach high accuracy on many tasks related to invoices and receipts.

Wrap-up

That was quite a bit. Thank you for reading if you have made it this far! You now know what pretraining is, how we use it for invoices and receipts, and what benefits we get from it.

Stay tuned for our next post where we will be talking about Transformers and how we use them in Premium. Please get in contact with us by visiting our Smartscan page if you have questions or are interested in Smartscan for your company.

Learn more about Smartscan and the team behind

christoffer.ohrstrom@visma.com

Published by

Christoffer Øhrstrøm
