Let us first understand what Smartscan does and why it is a difficult problem to solve.
What is Smartscan?
Smartscan lets you upload an invoice or a receipt and it will automatically find the information in the document that you need. This includes things such as the total amount, order date, and more. Let’s see a small example of the information that Smartscan can find for you:
This saves people many hours of manually typing information into accounting systems which directly translates into cost savings for companies and helps them to become paperless.
Why is invoice scanning difficult to solve?
We humans excel at finding structure, and we benefit from already understanding language and knowing what to look for. This can make invoice and receipt scanning seem easy from a human perspective. There are several reasons why the task is much harder for a computer:
- Language: A computer has no prior knowledge about language (grammar, semantics, numbers, and so on).
- Multilinguality: We consider it impressive when a human understands a handful of languages, but Smartscan has to understand documents in more than 10 languages.
- Layout: Humans use the layout of a document to quickly identify relevant information. A computer has no prior notion of what layout is and how it is used to relate words to each other in a document. On top of this, there are no industry standards for what an invoice has to look like.
- Implicit information: Some information cannot be found explicitly in a document. An example of this is the billing date which is frequently noted as being 30 days from the order date. The VAT amount is another example as it often occurs implicitly based on the VAT rules of the country of the supplier.
- Speed: Smartscan has to be able to do this up to several hundred thousand times a day.
Smartscan has to learn all of these concepts (and more) from scratch. Teaching a computer how to do this is like teaching a toddler to do the same.
The Sesame Street revolution
Smartscan works by finding all words in a document through optical character recognition (OCR) and identifying which of the words are relevant to the user. This means that we can benefit greatly from innovations in natural language processing (NLP) by using NLP techniques to identify which words are relevant.
Our journey begins in NLP where a revolution has taken place in recent years. The revolution was kick-started by a series of models named after characters from Sesame Street like ELMo, Big Bird, and RoBERTa, but it was BERT that fast-tracked the revolution.
While being a full-time muppet, BERT also enjoys his role as a large language model based on Transformers. Transformers deserve a post of their own, and we will cover them in the next one. For now, we will not worry about what exactly a Transformer is. Instead, we will simply think of it as a black box that encodes a vector representation of a sentence:
Transformer as a black box that encodes a vector representation of a sentence.
We actually get three of the letters in BERT from this simple understanding of Transformers: BERT Encodes vector Representations of sentences using Transformers.
BERT fast-tracked the revolution by showing how transfer learning can be greatly improved. Transfer learning is about how one machine learning model can transfer what it has learned to another model.
The first step is to train a general-purpose model on a huge text corpus (pretraining). Later, the pretrained model can be trained to do more specific tasks using labelled data (fine-tuning). The process looks like this:
BERT workflow: Pretrain a Transformer on a huge dataset and fine-tune it to do specific tasks.
A task could for example be filtering email spam, finding names in a sentence, translation, writing a summary of an article, and more. We will be talking a lot more about how we fine-tune in a later post. In this post, we will focus on the pretraining step, how it works, and what benefits we gain from it.
You might be asking why we want to pretrain in the first place. After all, it is much simpler to skip the pretraining step and go directly to training models for the tasks that we care about (supervised learning). We can understand how pretraining helps by understanding some of the challenges of supervised learning:
Supervised learning requires labelled data, which is time-consuming to acquire because humans have to produce it. In the spam example, people would have to label every training email as “spam” or “not spam”. In the translation example, professional translators would have to provide correct translations for a large text corpus.
Labelled text data is hard to come by which restricts the size of labelled datasets. Unlabelled text data, on the other hand, is vastly easier to obtain at a large scale.
We know that language contains a lot of knowledge and structure (e.g. grammar). One would expect that language-based tasks would benefit from not having to learn these concepts from scratch for every task. The restriction on the size of labelled datasets also increases the difficulty of learning these concepts using supervised learning alone.
Pretraining addresses these challenges by enriching a model with language understanding and only using unlabelled data. We should mention though that while the NLP community has some understanding of why BERT pretraining is beneficial, we do not yet fully understand why it is so.
How to Pretrain Your Transformer
Pretraining is not an idea that was introduced with BERT. Most pretraining algorithms prior to BERT worked in a unidirectional manner. In practice, unidirectionality means that the model is only allowed to use the left context of a sentence to predict which word is most likely to follow. Consider this example where “sheep” is the most likely word to follow:
A unidirectional model predicts “sheep” to be the next word in the sentence.
In fact, OpenAI’s famous GPT-3 uses a unidirectional approach with a Transformer decoder as the model.
The authors behind BERT argued that the unidirectional approach limited the model’s capacity in several tasks where the entire context is needed. This is why they opted for a bidirectional (the B in BERT) approach.
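The difference between the two approaches comes down to which words the model may look at when predicting a given position. The snippet below is a minimal sketch of that idea only. The sentence is a made-up placeholder (not the one from the figure), and the function names `left_context` and `full_context` are hypothetical.

```python
def left_context(tokens, position):
    """Words a unidirectional model may use to predict tokens[position]."""
    return tokens[:position]


def full_context(tokens, position):
    """Words a bidirectional model may use: everything except the target itself."""
    return tokens[:position] + tokens[position + 1:]


# Hypothetical example sentence, chosen only for illustration.
tokens = ["the", "farmer", "counts", "his", "sheep", "every", "night"]
target = tokens.index("sheep")

print(left_context(tokens, target))  # ['the', 'farmer', 'counts', 'his']
print(full_context(tokens, target))  # ['the', 'farmer', 'counts', 'his', 'every', 'night']
```

The bidirectional model also sees “every night”, which is exactly the extra context that the unidirectional model is denied.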
BERT adopts the bidirectional approach through a masked language model (MLM). An MLM takes a sequence as input, but masks about 15% of the items randomly and attempts to predict what the masked items are. Let’s see how this works for the sentence in the unidirectional example:
Masked language model.
BERT masks “know” and “tells” and gives the Transformer the sentence with the two masked words. The Transformer then attempts to figure out which words are likely to be where the masks are. We see that BERT uses the masked words as labels to train against.
Incidentally, this shows why MLM is a form of self-supervised learning: MLM creates its own labels from an unlabelled dataset (the self part) and then uses supervised learning to train the model.
MLM gives an efficient way of pretraining a bidirectional language model without using any labelled data! There is a catch, though: masks only appear during pretraining, never during fine-tuning, which creates a mismatch between the two stages.
BERT makes a few changes and additions to minimise the mismatch, such that the words chosen to be masked are…
- masked 80% of the time.
- replaced by another randomly selected word 10% of the time.
- left unchanged 10% of the time.
This forces the model to learn how likely the chosen words are to belong in the sentence (regardless of whether they are masked). Predicting that confidently requires a fairly high level of language understanding.
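The masking rules above can be sketched in a few lines of Python. This is a simplified illustration rather than BERT's actual implementation. It assumes each word is selected independently with probability 15%, and the names `mask_tokens`, `select_prob`, and `MASK` are our own.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, vocab, select_prob=0.15, rng=None):
    """Return (corrupted_tokens, labels) following the 80/10/10 rule.

    labels[i] holds the original word for selected positions and None
    elsewhere, so a loss would only be computed on the selected positions.
    """
    rng = rng or random.Random()
    corrupted, labels = [], []
    for token in tokens:
        if rng.random() >= select_prob:
            # Not selected: keep the word and give it no label.
            corrupted.append(token)
            labels.append(None)
            continue
        labels.append(token)  # the original word is the training label
        roll = rng.random()
        if roll < 0.8:
            corrupted.append(MASK)               # masked 80% of the time
        elif roll < 0.9:
            corrupted.append(rng.choice(vocab))  # random word 10% of the time
        else:
            corrupted.append(token)              # left unchanged 10% of the time
    return corrupted, labels
```

Note that the label is recorded for every selected word, including the ones that are left unchanged. The model therefore cannot tell from the input alone which positions it will be asked about.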
(BERT actually also uses another objective called Next Sentence Prediction in its pretraining, but we will not talk about that here because it is irrelevant to Smartscan Premium. We still encourage you to learn more. There are many excellent resources about BERT and the research paper is freely available.)
We have now seen how to pretrain with the MLM, but we have not talked about the scale of pretraining. BERT came in two variants where the difference is the size of the Transformer: Base with 110 million parameters and Large with 340 million parameters.
The authors found Large to be more accurate than Base in most cases which agrees with other research about how models become more accurate as they become larger. Recent years have seen a parameter explosion with models that scale to trillions of parameters.
A crucial aspect of this scaling, however, is that the dataset size has to increase along with the model size in order to reap the benefits of scale. Indeed, BERT used a huge pretraining dataset consisting of more than 3 billion words—more than the entire English Wikipedia at the time (2018).
The large model and dataset sizes make pretraining a costly affair, but it pays off as we will see towards the end of this post.
Wait, what about documents?
We now have a pretty good idea about what BERT does and why it is a good idea to pretrain. You may have noticed though that BERT doesn’t actually work on documents like invoices and receipts out of the box. That is because it was only designed for pure text documents and not for documents where the layout is an important factor.
We can understand this problem from how BERT creates the input vectors to the Transformer for an invoice. There is quite a bit of information in the following figure, but do not worry about it, we will walk through it bit by bit:
BERT makes vector inputs using the words and the word order in the invoice.
We start with the invoice on the left and let the OCR find the words and assign an order to the words. The words are “Customer Supplier Total amount” in that order in this example. The words are turned into vectors through an embedding table in the middle that maps from words to vectors. The same goes for the word order.
The final input is the addition of the word and order vectors. The problem is now that the input vectors to the Transformer only contain information about the words as if they occurred in a linear order.
However, that is not the way they are presented in the invoice. “Customer” is in the top left corner, “Supplier” is in the top right corner, and “Total amount” is in the bottom right corner. This is a simplified example, but real invoices will typically contain much more complicated layouts that combine tables, boxes, and more. We need a way to make the Transformer aware of the layout.
We do this by taking inspiration from LayoutLM (Layout Language Model) that extends BERT to visually rich documents, such as invoices/receipts.
The key finding of LayoutLM is that BERT can transition from a language model to a layout-aware language model. That is done by imbuing the inputs to the Transformer with the 2D position of each word. Let’s see how that is done:
LayoutLM makes vector inputs using the words, word order, and 2D position in the invoice.
LayoutLM largely does the same as BERT, but also adds layout embeddings based on the x- and y-coordinates of the words. Now we have input vectors that contain all the information about the words and their positions.
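To make the difference between the two kinds of input concrete, here is a toy sketch using plain Python lists as vectors. Everything in it is an assumption made for illustration: the embedding tables hold random numbers instead of learned values, each word gets a single (x, y) point on a small integer grid, and the helper names are hypothetical. The published LayoutLM embeds the corners of each word's bounding box, which we leave out to keep the example short.

```python
import random

DIM = 4    # toy vector size
GRID = 10  # toy resolution: coordinates are integers in [0, GRID)


def make_table(keys, rng):
    """Toy embedding table: one random vector per key."""
    return {key: [rng.uniform(-1, 1) for _ in range(DIM)] for key in keys}


def add_vectors(*vectors):
    """Element-wise sum of equally sized vectors."""
    return [sum(values) for values in zip(*vectors)]


rng = random.Random(0)
# Hypothetical OCR output: (word, x, y) in reading order, origin at the top left.
ocr = [("Customer", 0, 0), ("Supplier", 9, 0), ("Total", 8, 9), ("amount", 9, 9)]

word_table = make_table(sorted({word for word, _, _ in ocr}), rng)
order_table = make_table(range(len(ocr)), rng)
x_table = make_table(range(GRID), rng)
y_table = make_table(range(GRID), rng)


def bert_input(index, word):
    """BERT-style input: word vector plus reading-order vector."""
    return add_vectors(word_table[word], order_table[index])


def layout_input(index, word, x, y):
    """LayoutLM-style input: the BERT-style input plus x and y vectors."""
    return add_vectors(bert_input(index, word), x_table[x], y_table[y])


bert_inputs = [bert_input(i, word) for i, (word, _, _) in enumerate(ocr)]
layout_inputs = [layout_input(i, word, x, y) for i, (word, x, y) in enumerate(ocr)]
```

If “Supplier” moved from the top right corner to the bottom left, `bert_inputs` would stay exactly the same, while the corresponding vector in `layout_inputs` would change.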
The next step is to define how to pretrain LayoutLM. It turns out that the MLM can easily be extended to include layout. It works the same way as it does for BERT, but we never mask the layout embeddings. This way, the MLM still has to predict the words that were chosen for masking, but it always knows their positions in the document.
The idea is that this should allow the model to learn the importance of the word’s position in the document in addition to gaining language understanding.
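As a rough sketch of what “never masking the layout” means, the snippet below hides a chosen word while leaving its coordinates intact. The `(word, x, y)` tuples and the function name `mask_words_keep_layout` are hypothetical, and the 80/10/10 replacement rule is left out for brevity.

```python
MASK = "[MASK]"


def mask_words_keep_layout(ocr, selected):
    """Mask the words at the selected indices but leave every (x, y) untouched.

    Returns the corrupted sequence and the labels the model should predict.
    """
    corrupted = [
        (MASK if index in selected else word, x, y)
        for index, (word, x, y) in enumerate(ocr)
    ]
    labels = {index: ocr[index][0] for index in selected}
    return corrupted, labels


# Hypothetical OCR output: (word, x, y) on a toy coordinate grid.
ocr = [("Customer", 0, 0), ("Supplier", 9, 0), ("Total", 8, 9), ("amount", 9, 9)]
corrupted, labels = mask_words_keep_layout(ocr, selected={2})
print(corrupted[2])  # ('[MASK]', 8, 9)
print(labels)        # {2: 'Total'}
```

The model is told that something sits in the bottom right corner next to “amount” and has to work out that it is “Total”.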
We could actually have used more information in an invoice or receipt to potentially improve Smartscan even more. Specifically, we could have used visual features such as colour, font size, formatting, and so on.
The authors of LayoutLM found that such features can improve the model in some cases, but the improvements are not always consistent. We decided to leave visual features for a future version of Smartscan for this reason.
We now know how to pretrain a layout-aware language model. Our model contains over 170 million parameters in total and we pretrained it on more than 23 million invoices and receipts! This is a substantial amount that required 4 days of pretraining on a machine with 16 A100 GPUs. Let’s now look at the benefits we gain from doing this.
Benefits of Pretraining
We evaluate the effect of pretraining by comparing the fine-tuned model (Premium) against Standard Smartscan. It would take too long to explain exactly how Standard works.
Instead, we will simply summarise that Standard does not use pretraining and uses a recurrent neural network (specifically an LSTM) instead of a Transformer.
We will not go into the details here. However, we have tested that the majority of the improvements in Premium stem from the pretraining and not just from using a Transformer instead of an LSTM.
We will focus on two fields that users are frequently interested in: Order Date and Total Including VAT. The results are consistent, or better, across all fields, but we only show these two for brevity. We choose to focus on these two fields because Standard is already quite good on them.
This is because we have many labelled examples of Order Date and Total Including VAT. The difference between Standard and Premium is even greater when compared to fields with only a few labelled examples. Let’s return to that point a little later below. We fine-tune two models to learn how to find Order Date and Total Including VAT.
Let’s start by comparing the accuracy of the models in terms of error rate. The error rate is the percentage of documents where the field is predicted incorrectly (lower is better).
Error Rates of Standard and Premium.
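For reference, the error rate as defined above can be computed as follows. This is a minimal sketch that assumes a prediction counts as correct only if it matches the labelled value exactly. The dates are invented for the example and are not taken from our evaluation data.

```python
def error_rate(predictions, truths):
    """Percentage of documents where the predicted field value is wrong."""
    if len(predictions) != len(truths):
        raise ValueError("predictions and truths must have the same length")
    if not truths:
        raise ValueError("need at least one document")
    errors = sum(1 for p, t in zip(predictions, truths) if p != t)
    return 100.0 * errors / len(truths)


# Hypothetical Order Date predictions on four documents; the third is wrong.
truths = ["2021-03-01", "2021-03-05", "2021-04-11", "2021-05-20"]
predictions = ["2021-03-01", "2021-03-05", "2021-04-12", "2021-05-20"]
print(error_rate(predictions, truths))  # 25.0
```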
Premium halves the error rate of Standard, so we can already see that pretraining has paid off. There is more though: We are also interested in how Standard and Premium compare on documents that are different from what they would typically see in a dataset.
This could, for example, be documents in a language that is underrepresented in the training set, or documents with atypical layouts. We devised a dataset of such documents and measured the change in error rate compared to the previous figure. Let’s have a look at the numbers:
Change in error rate for atypical documents.
Standard behaves the way that we would expect a machine learning model to behave for atypical documents: Its error rate increases. Premium behaves quite astonishingly though.
Its error rate hardly changes for Total Including VAT and even decreases slightly for Order Date. We think that this is because the pretrained model has already seen a vast number of different documents, which makes the fine-tuned models able to generalise very well.
We see that Premium obtains very high accuracy and generalises to atypical documents. These are great results. We are also interested in how many labelled examples are required to learn to recognise a field. This is known as sample efficiency:
Sample efficiency of Standard and Premium.
It turns out that Premium is highly sample efficient. We typically consider a field to have minimum acceptable accuracy when it reaches an error rate of around 5% to 10%. Premium reaches this level with between 100 and 1,000 labelled examples.
Standard, however, needs between 50,000 and 100,000 labelled examples to obtain it. Premium is around 100 times more sample efficient.
The ramifications of this are very important to our business case as it allows Smartscan to add more fields at a very fast rate. It is a tedious and time-consuming process to have humans label between 50,000 and 100,000 documents. However, 100 to 1,000 documents can be labelled within a week or two.
The last point about sample efficiency brings us to the final benefit of pretraining. We have focused on using the pretrained model to find specific fields in invoices and receipts. The pretrained model is not only made for this specific purpose though.
In fact, it is a general model of documents that can be fine-tuned to a large variety of tasks: Maybe we want to summarise an invoice in the future? Recognise lines and map them to specific accounts? Identify fraudulent activity? The high sample efficiency of the pretrained model allows us to quickly reach high accuracy on many tasks related to invoices and receipts.
That was quite a bit. Thank you for reading if you have made it this far! You now know what pretraining is, how we use it for invoices and receipts, and what benefits we get from it.
Stay tuned for our next post where we will be talking about Transformers and how we use them in Premium. Please get in contact with us by visiting our Smartscan page if you have questions or are interested in Smartscan for your company.