Article

Machine learning of Smartscan part 2: Transformers

19/9/2021

min read

The recently released Smartscan Premium by our Visma Machine Learning Assets team sets a new standard for automated document information extraction. In this blog post series the team explains the transformer architecture of the model that sits at the core of this vastly improved product.

Article

Machine learning of Smartscan part 2: Transformers

September 19, 2021

Article

Machine learning of Smartscan part 2: Transformers

episode

September 19, 2021

Listen on Spotify

Listen on Apple Podcasts

Listen on Amazon Music

Listen on SoundCloud

Watch on Youtube

Article

Machine learning of Smartscan part 2: Transformers

September 19, 2021

Duration:

Machine learning of Smartscan part 2: Transformers

This is the second blog post in a series of three where we detail the research that forms the basis of Smartscan Premium. In this article, we will dive deeper into the details of the architecture, focusing on the problem of handling long input sequences.

Read the first blog post in this series about the problem Smartscan tries to solve, the background of transformers, and key machine learning concepts, here.

Be warned: We will go into technical terms and math during the blogpost. Don’t worry—we try to give intuitive explanations wherever possible. We mark particularly technical sections with this “nerd alert”: 🚨. These sections provide additional details for the curious and technically inclined reader.

The Transformer Architecture

Let us start by understanding the architecture of a transformer. We will first take a look at how text is made into vectors. Then, we will dive into the architecture of the encoder stack of the transformer where these vectors are transformed by the model. Finally, we will see how the transformed vectors from the encoder stack are turned back into text.

From Text to Vectors

A machine learning model is essentially a mathematical function. So, how exactly do we turn a piece of text into something numerical that can be passed through such a function? Let us take a moment to understand this. The steps are the following:

Tokenizing and encoding
Embedding the encoded tokens

Let’s see how these steps would apply to a well-known sentence: “Visma: Make progress happen!”.

Tokenizing and Encoding

A tokenizer segments the sentence into smaller pieces (tokens). The possibly simplest tokenization strategies are to split by spaces:

visma:

make

progress

happen!

or into characters:

…

The desirable properties of a good tokenizer for natural language processing (NLP) are that it achieves a good tradeoff between the following:

1) the vocabulary size needed to properly represent the training corpus

2) the resulting tokenized sequence lengths

3) how well each token manages to separate or gather linguistic information

Returning to the examples above, you will see that this tradeoff is drastically different between the two. Splitting by spaces will result in manageable sequence lengths but will require a huge vocabulary (just think of combinations resulting from punctuation; “happen!” – “happen.” – “happen,”).

Also, this splitting wouldn’t manage to separate important language features—punctuation in a sentence matters, but splitting by spaces bundles that up with words. Contrarily splitting into characters will result in a small vocabulary. 256 characters would probably give great coverage but the resulting sequences could hardly get any longer. Furthermore, each character carries almost no information by itself.

One of the most popular tokenizers in NLP is the Word Piece tokenizer, and this is also what we use in Smartscan Premium. The word piece tokenization of the example sentence looks like this:

vis

##ma

make

progress

happen

Here, you can see that common words like “make”, “progress”, and “happen” become tokens of their own and so do punctuation characters. You can also see that the name “Visma” gets split into more common pieces (sorry, not every company gets its own token). This type of tokenizer relies on a trained vocabulary whereas the previous examples do not.

Finally, each token is mapped to token ids (positions in the vocabulary). The encoded example sentence then becomes:

25292

2863

1024

2191

5082

4148

999

An important takeaway from the above is that any given sequence of “words” will ultimately be mapped to a sequence of tokens. This token sequence will usually be longer than the word sequence. This will especially be the case when the word sequence consists of a lot of uncommon words that won’t be found in the vocabulary of the tokenizer (think invoice language: org. numbers, names, addresses, amounts and so on).

Embedding the Encoded Tokens

The encoded tokens are now embedded into high dimensional vectors. We do this essentially as a table lookup. The so-called token embeddings are held in a large matrix of “weights” with shape

[n_vocab , d] where n_vocabis the size of the vocabulary and d is the hidden state size. Both are hyper-parameters (model settings).

A mathematical representation of the entire sentence is then built simply by gathering rows from this matrix using the encoded token ids as indices, as shown in the figure below. The n_vocab*d weights are trainable model parameters; they will be learned from data during training of the model.

The token vectors are gathered from the token embeddings using the token ids as indices.

The matrix with embedded token ids will have the shape [n_tokens , d] where n_tokens refers to the sequence length of the tokenized text (up to 1496 in Smartscan Premium, but 7 in our example).

If you read part 1 of this blog post series, you will know that additional per token information such as layout information (bounding box coordinates and size of the text) is also embedded. This is done in a completely analogous way as described above. All of these embedded representations are of the same shape and are simply added element-wise to form our first hidden state.

This hidden state should be seen as a matrix representation of the input sequence, which will be passed through and manipulated by the layers in the encoder stack of the transformer model. The hidden state will roughly speaking have the same shape at every point throughout the model.

The Encoder Stack

Now that we know how text is vectorized, let’s have a look at the architecture of the encoder stack of the transformer, which is shown in the following figure:

The encoder stack consists of 12 identical encoder layers.

As you can see in the left part of the figure, the encoder stack consists of a number of encoder layers. In Smartscan Premium we use a stack of 12 layers. All these layers are identical in terms of architecture, but each has its own set of trainable model parameters.

The embedded sequence (hidden state 0) described above enters the stack in the bottom and passes through each layer, being manipulated along the way. Out comes the encoded sequence, which should be understood as a matrix representation of the model’s understanding of the layout and language of the input text.

In the right part of the figure, the architecture of a single encoder layer is shown. The most important thing to realise from this is that self-attention is an integral part of the encoder layers that the encoder stack is composed of.

This layer serves the important purpose of “mixing” the information of all the tokens in the sequence to form an understanding of the context of the entire sequence. Earlier sequence model architectures used so-called recurrent neural networks that process elements of the sequence sequentially in a strictly left-to-right or bidirectional fashion.

This had significant drawbacks in terms of difficulty in benefiting from long-range context in the sequence and poor parallelisation/GPU acceleration potential. In contrast, the self-attention based transformer architecture allows for random sequence access and is extremely parallelisable.

🚨 Each encoder layer more specifically consists of two sub-layers. The first is a multi-headed dot-product self-attention layer. The second is a position-wise (applied to each element in the sequence, not across elements) feed-forward neural network.

The red arrows indicate so-called residual- or skip-connections. They help mitigate the ‘vanishing gradient’ problem during training by providing a more direct pathway between the model’s output and the deeper layers.

Each sub-layer also applies layer-normalisation with the purpose of ensuring consistent scaling of the sub-layer output, i.e. ensuring that the layer outputs are kept at small magnitudes in an average sense. This helps speed up convergence during training.

From Representation to Text

Now that we understand how text is vectorised and then transformed by the encoder stack, let’s finally see how the transformed sequence can be turned back into text again.

Recall from part 1 of this blog post series that pre-training is carried out using the masked language model (MLM) objective. Here, a fraction of tokens in the input sequence are randomly replaced or masked. The model is asked to predict the proper tokens at those positions.

We, therefore, need a way of converting the encoded sequence representation output from the encoder stack back into text. We do this by applying a neural network with n_vocab outputs to each position in the encoded sequence.

Through the use of a special function called softmax, the relative magnitudes of these outputs can be interpreted as a probability distribution over the vocabulary. This way, it lets us pick the most likely word to output.

We call the collection of objective specific model layers a model head. The MLM model head will only be kept during pre-training. Later, when we want to fine-tune the model, a model head that is suitable for the fine-tuning objective is used instead.

Transformers and Long Sequences

Traditional transformer architectures suffer from a particular and significant limitation that arises from the use of self-attention. The problem is that the compute and memory requirements for self-attention layers scale quadratically with the sequence length.

Why is this the case? Intuitively this is relatively easy to understand. As we mentioned earlier, the self-attention layers have the important task of mixing the information from all the token-vectors in the sequence.

Practically, we do this by letting the layer look at (attend to) every single token-vector of its input when calculating an output token-vector. The sequence length is the same everywhere in the encoder stack. This means that every self-attention layer needs to consider n_tokens input token-vectors for every of the n_tokens output token-vectors. Therefore, every self-attention layer needs to “look” n_tokens² times.

This is really bad news for problems where we need to treat long sequences. Let’s consider a (very) simple example: Assume that a model prediction at a sequence length of 100 takes 0.1s (no base cost for simplicity, so 0.0s at 0 sequence length). In the figure below, linear and quadratic scaling behaviours are compared for this situation. As you can see, when increasing the sequence length, prediction time quickly explodes and becomes unreasonably long with quadratic scaling.

For the interested reader, we describe self-attention and the scaling issue in much greater detail in the following section. Feel free to skip it if math is not your thing:

Comparison of quadratic and linear scaling behaviours.

Dot-Product Self-Attention and Quadratic Scaling

🚨 As mentioned above, the sequence length scaling issue in traditional transformer architectures arises from the dot-product self-attention mechanism of the encoder layers. The purpose of this attention mechanism is essentially to allow the model to compare every token in the input sequence with all other tokens in order to understand the context.

Let’s have a look at how this mechanism works. To keep it simple, we will only consider single-headed self-attention even though multi-headed self-attention is typically used in transformers. The move to multi-headed self-attention is only a minor complication that is conceptually easy to understand, but it will complicate the explanations unnecessarily. Furthermore, considering single-headed self-attention is entirely sufficient for understanding the scaling issue.

Consider the following equation that constitutes (scaled) dot-product self-attention:

Math formula: Attention of Q, K, and V equals softmax open paren Q multiplied by K transposed over square root of d close paren multiplied by V

Let’s break that apart a bit. Q (query), K (key), and V (value) can in principle be based on different inputs. But in the simpler case of self-attention used in the encoder stack, they are all linear projections of the hidden state, H, discussed previously. Q, K, and V are projected into a d dimensional space like this:

Q=HW_Q

K=HW_K

V=HW_V

(2)

Recall that the shape of the hidden state is [n_tokens , d]. The projection matrices W_Q, W_K, and W_V have the shape [d , d]. Therefore Q, K, and V all have the shape [n_tokens , d]. The three projection matrices W_Q, W_K, and W_V are the trainable parameters of the layer.

What will the term QK^T in (1) look like? Let q₀, q₁, … be the rows of Q, k₀, k₁, … the rows of K, and v₀, v₁, … the rows of V. Then q_i can be seen as a projection of the hidden state for token i in the sequence. Similarly, k_i and v_i also relate to the hidden state for token i. Then we can see that the i’th row of QK^T term will look like this:

Math formula: Row i of Q times K transposed equals open bracket dot product of qi and k0 comma dot product of qi and k1 comma dot dot dot comma dot product of qi and k ntokens close bracket

Row i of the term can, therefore, be seen as a “comparison” of token i with every other token in the sequence. The projection matrices W_Q and W_K of trainable weights define the kind of “comparison” to be made.

The softmax operation is a nonlinear but monotonous function that normalises each row of the argument (matrix) to sum to 1.

Math formula: A equals softmax of Q times K transposed over square root of d

The elements of the matrix seen above are called the attention weights. Row a_i of A is a vector that sums to 1 and determines which rows of V to pay the most attention to when considering token i.

The scaling issue in dot-product self-attention is that the term QK^T has to be explicitly calculated and held in memory during training and inference. As mentioned previously, Q and K both have the shape [n_tokens , d].

Therefore, the matrix product QK^T has the shape [n_tokens , n_tokens]. The complete memory and time complexities of dot-product self-attention are O(n_tokens² +n_tokens d) and O(n_tokens²d] respectively. The memory and compute resources needed to run a traditional transformer architecture scales quadratically with sequence length.

Why all this Matters to Smartscan

The BERT model that Smartscan Premium is based on by default accepts sequences of up to 512 tokens. This is completely sufficient for use-cases such as e.g. translation. However, in Smartscan we want the model to consider the entirety of (possibly multi-page) invoices.

The previous LSTM based Smartscan version, now known as Smartscan Standard, accepts sequence lengths of 1000 words (split on spaces). We established that as being sufficient for most documents.

As we saw in the section on tokenization, tokenizing with Word Piece will certainly increase the sequence lengths required for the task. The default of 512 tokens must therefore be expected to be entirely insufficient for the problem Smartscan aims to solve.

Below we show the distribution of the token index, at which a certain piece of invoice information is found (total including vat/Swedish bankgiro creditor id) over a sample of documents:

Distributions of the index at which two types of invoice information is found.

We can see that for a large fraction of the documents, only considering 512 tokens would make it impossible to find the total including vat. The Swedish bankgiro creditor id is an OCR line type of information.

This type of information is commonly found close to the bottom of an invoice. We can see that on the right figure in the form of a distinct bump in the upper tail of the distribution. We can also see it from the fact that the entire distribution is shifted upwards compared to the distribution for total including vat.

Clearly, it would be beneficial with a model that can consider at least 1000 tokens, and preferably even more.

Inferencing with 1000+ sequence lengths with a traditional transformer architecture is possible but very slow. Smartscan is used in near-real-time use cases, e.g. for interactive use in travel expense phone apps, and that puts strict requirements on prediction time.

Furthermore, the training set size matters. Smartscan Premium is trained on 20+ million documents on expensive GPU instances. In the future, the training set size is certain to increase, so every training efficiency improvement counts. Even with the efficient architecture, we settled on for Smartscan Premium, a single pre-training alone costs 20k+ DKK in compute.

🚨Efficient Transformer Architectures

The solution to the scalability problem discussed above is to replace the traditional self-attention mechanism with a more efficient implementation. We will now briefly describe two of the more well-known possibilities. For the interested reader, a full survey of efficient architectures can be found in ‘Efficient Transformers: A Survey’.

Longformer

The Longformer uses a fixed-pattern sparsity-based type of attention mechanism. Consider the figures below. The figures illustrate the scope of self-attention in the following way: Each row and column refers to a token in the sequence. A coloured cell in e.g. row 10, column 15 indicates that token 10 can attend to (or see) token 15. In other words, the figures illustrate the non-zero content of the attention matrix A discussed earlier:

Non-zero values of the attention matrix for different types of fixed pattern self-attention

We can see that for traditional self-attention shown in (a) every token can attend to every other token. In contrast, the Longformer uses combinations of sliding window (b) and dilated sliding window (c) patterns of varying widths and dilations throughout the encoder stack and across attention heads. In order to enable global-context aware classification models, global+sliding window (d) attention is used in task-specific model-heads.

The key advantages of the longformer are that it scales linearly in the sequence length and that reuse of weights from pretrained traditional transformers is possible (for weight initialisation). One major disadvantage is that the dilated sliding window (c) type of attention-sparsity requires the use of a custom purpose developed CUDA kernel for efficient use. Another disadvantage is that window sizes and dilations per encoder layer become additional (possibly task-specific) hyper-parameters.

Linformer

In the Linformer paper it is proven that the following problematic matrix (discussed earlier) is inherently of low rank:

Math formula: softmax of Q times K transposed over square root of d

The idea of the Linformer layer is, therefore, to substitute this term with a low-rank approximation that can be computed in linear time and memory wrt. sequence length.

This is concretely done by linearly projecting down from [n_tokens , d] to [r , d] for some r<<n_tokens . We then calculate the dot-product self-attention in this space and then project back up to [n_tokens , d]. The self-attention is calculated as:

Math formula: Attention of Q, K and V equals softmax open paren Q times open paren E times K close paren transposed over square root of d close paren times F times V

Here, E and F are projection matrices of shape [r , n_tokens] which are additional trainable parameters of the layer. This way, we replace the problematic [n_tokens , n_tokens] shaped QK^T term with the much smaller [n_tokens , r]. This is shaped Q(EK)^T term (provided we select r<<n_tokens) yielding O(n_tokensr) time and memory complexities.

The key advantages of the Linformer layer are that it scales linearly in the sequence length both with respect to time and memory and that implementation is trivial. One disadvantage is that it introduces an additional hyper-parameter, r, per self-attention layer.

This is a small disadvantage however since it is shown in the paper that the value of r should just be “big enough”. The much bigger disadvantage is that the E and F projection matrices add additional trainable model parameters.

This means that models based on this layer cannot easily benefit from already pretrained traditional transformer models, but must instead be pretrained from scratch.

Another disadvantage is that, since projection happens on the sequence length axis, it becomes non-trivial to implement causal masking, which we expect to need in the near future of Smartscan Premium.

The Performer Layer

Good alternatives to the traditional attention implementation exist. They each come with their own set of advantages and disadvantages. For Smartscan Premium, we have settled on the so-called Performer layer.

The Performer layer was introduced in ‘Rethinking Attention with Performers’ on September 30th 2020. At the time the code was only published in JAX, so we had to develop our own TensorFlow implementation of the layer. The details of the Performer layer are given in a section for the adventurous reader below. However, all you need to understand is the advantages and disadvantages associated with using this layer.

The Performer layer introduces a few additional hyper-parameters (model settings). This can be seen as a disadvantage. However, it turns out that proper values for these hyper-parameters are really easy to select, so that’s not an issue in practice.

The benefits of using the Performer layer are really great though. The main benefit is that it achieves linear scaling with respect to the sequence length. Another key advantage is that there is a one-to-one correspondence between the trainable parameters of a Performer layer and a conventional self-attention layer.

This means that the Performer layer can be used as a true drop-in replacement for conventional attention layers. No tricks are required. This allows us to benefit greatly from freely available pre-trained transformers.

🚨 Performer Details for the Nerds

The Performer layer is an implementation that relies on a kernel-based approximation of the attention mechanism. The approach is called ‘Fast Attention Via positive Orthogonal Random features’ (FAVOR+). The traditional attention mechanism in (1) can be rewritten like this:

Math formula: Attention of Q, K and V equals softmax open paren Q times K transposed over square root of d close paren times V equals D inverse times A times V

where:

Math formula: A equals exponential of Q times K transposed over square root of d

and:

Math formula: D equals diagonal matrix of A times 1 n tokens

Here, 1n_tokens is a column vector of length n_tokens with all 1’s. All that happened here is just that the definition of the softmax function is explicitly written in matrix notation. The main idea of the performer layer is this: Assume that the matrix A can be expressed in terms of a kernel, Φ, such that

Math formula: A i comma j equals Phi of qi transposed comma kj transposed where Phi of x and y equals expectation of the dot product of phi of x transposed and phi of y

Here, E denotes the statistical expectation and φ(…) is an appropriate random feature map. For Q‘ and K’ calculated by row-wise application of φ to Q and K, respectively, (3) can then be written as

Math formula: Attention of Q, K and V equals D hat inverse times open paren Q prime times open paren K prime transposed times V close paren close paren where D hat equals diagonal matrix of Q prime times open paren K prime transposed times 1 n tokens close paren

This achieves an important change in the order of computation. In terms of shapes, the inner matrix product K’^TV now looks like [r,n_tokens] x [n_tokens , d]⟶[r , d] . The following left multiplication with Q‘ then becomes [n_tokens , r] x [r , d]⟶[n_tokens , d]. Here, r is the dimensionality of the random feature map φ.

The memory and time complexities, therefore, reduce to O(n_tokensr+n_tokensd+rd) and O(n_tokensrd), respectively. The above constitutes the FA part of FAVOR+.

‘Rethinking Attention with Performers’ goes on to introduce a number of random feature maps, φ, that allows for approximation of the softmax kernel for self-attention. To give an example, the so-called regularised orthogonal hyperbolic softmax-kernel, that we have settled on in Smartscan Premium is defined like this:

Math formula: h of x equals exponential of minus open paren cartesian norm of x squared over two close paren f1 of u equals exponential of u f2 of u equals exponential of minus u D equals uniform distribution over the d minus one dimensional sphere with radius square root of d phi of x equals h of x over square root of open paren r over 2 close paren times open paren f1 of the dot product of w1 transposed and x comma dot dot dot, f1 of the dot product of w r over two transposed and x, f2 of the dot product of w1 transposed and x comma dot dot dot f2 of the dot product of w r over two transposed and x close paren

where w₁, …, w_r/2 are drawn iid. from D. This constitutes the R+ part of FAVOR+. The choice of the functions h, f₁, and f₂ ensures that the random feature map is strictly positive, hence the + in FAVOR+. This is shown by ‘Rethinking Attention with Performers’ to be important for stability of the approximation.

The O part of FAVOR+ comes from orthogonalisation of the random vectors w₁, …, w_r/2. ‘Rethinking Attention with Performers’ shows that using orthogonal random vectors reduces the mean-squared error of the approximation.

Without going into further details we can mention that in addition to the advantages discussed earlier, causal masking is possible for the Performer layer.

Conclusion

You have reached the end; thank you for reading. In case you skipped some of it or if some parts were a bit hard to follow, here is the TL;DR.

Smartscan Premium needs to consider all the text in potentially very long text sequences from invoices in order to make good predictions. Tokenization of this text results in even longer sequences.

Long input sequences pose a problem for traditional transformer architectures since these rely on dot-product self-attention that scales quadratically with sequence length. To alleviate this problem, Smartscan Premium’s transformer stack is instead based on the Performer self-attention layer, which scales linearly with sequence length. This allows us to efficiently train Smartscan Premium, and to provide near-real-time predictions for our consumers.

Stay tuned for the last blog post in this series, which will explain the mechanics of fine-tuning and how we achieve great prediction efficiency for our multi-field Smartscan Premium model.

Learn more about Smartscan and the team behind

About the episode

Published by

Esben Toke Christensen

Voice of Visma

We're sitting down with leaders and colleagues from around Visma to share their stories, industry knowledge, and valuable career lessons. With the Voice of Visma podcast, we’re bringing our people and culture closer to you. Welcome!

More about the podcast

esben.christensen@visma.com

Published by

Diana Kigyossy

diana.kigyossy@visma.com

Diana is the Employer Branding Lead for Visma Group. She joined Visma in May 2022.

Published by

Benedikte Lyngbø

benedikte.lyngbo@visma.com

Benedikte Lyngbø is a Student Assistant Graduate Recruitment for Visma Group. She joined Visma in 2023.

Published by

Holli Hatherly

holli.hatherly@visma.com

Holli Hatherly joined Visma in 2024 and currently works as Marketing Network Manager for Visma Group.

Published by

Berit Braut

berit.braut@visma.com

Berit Braut is the Director of Commercial Growth and CX at Visma Group. She joined Visma in 2009 as a Management Trainee and believes the magic happens where technology meets customer expectations and business goals.

Published by

Mandy Burger

mandy.burger@visma.com

Mandy Burger is an Internal Communication Specialist for Visma Group. She joined Visma in 2022.

Published by

Victoria Bondarchuk

victoria.bondarchuk@visma.com

Victoria Bondarchuk worked as the Director of Product Discovery in Visma from 2017 to 2023.

Published by

Jet Bouwman

jet.bouwman@visma.com

Jet Bouwman is the Head of Employer Branding & Sponsorship Activation at Visma Group. She joined Visma in 2021.

Published by

Brian Ye

brian.ye@visma.com

Brian Ye is the Director of Product Analytics at Visma. He joined Visma in 2017 as a Management Trainee and has a Master of Science in Computer Science from the Royal Institute of Technology in Stockholm, Sweden.

Published by

Lilja Helgadottir

lilja.helgadottir@visma.com

Lilja is the Commercial Lead at Machine Learning Assets. She joined Visma in 2020 as a Management Trainee and has a Master of Science in Industrial Engineering and Management from the Technical University of Denmark.

Published by

Carina Ramsøy

carina.ramsoy@visma.com

Carina Ramsøy worked as a Marketing Manager at Visma from 2019 to 2022.

Published by

Kejsi Gjordeni

kejsi.gjordeni@visma.com

Kejsi Gjordeni worked as a Strategy and Business Manager at Visma from 2019 to 2023.

Published by

Martin Olstad

martin.olstad@visma.com

Martin Olstad worked as an Optimisation Consultant and AI Project Manager at Visma from 2020 to 2022.

Published by

Claus Dahl

claus.dahl@visma.com

Claus Dahl is the Director of Machine Learning Assets at Visma. A mathematician by training, Claus has 20 years of software experience — about half of that doing mathematics on computers. He has been with Visma since 2016.

Published by

Vlad Boldura

vlad.boldura@visma.com

Vlad Boldura worked as a Security Service Owner at Visma from 2021 to 2023.

‍

Published by

Kaja Falck-Ytter

kaja.falck-ytter@visma.com

Kaja Falck-Ytter worked as a Content Producer and Communications Specialist at Visma from 2015 to 2022.

Published by

Linn Kjærland

linn.kjarland@visma.com

Linn Kjærland worked as a Content Producer and Commercial Advisor at Visma from 2018 to 2023.

Published by

Ida Nicoline Glosimot

ida.nicoline.glosimot@visma.com

Ida Nicoline Glosimot worked as a Copywriter at Visma from 2019 to 2023.

Published by

Stina Wahlsten

stina.wahlsten@visma.com

Stina Wahlsten worked as a Sustainability Manager and Ops Lead at Visma from 2022 to 2024.

Published by

Pieter Bijl

pieter.bijl@visma.com

Pieter Bijl is an AI Developer at Visma Resolve. He joined Visma in 2022 as a Management Trainee.

Published by

Emily Northway

Emily.Northway@visma.com

Emily Northway worked as a Marketing Specialist at Visma from 2015 to 2018.

Published by

Rie Jørgensen

rie.jorgensen@visma.com

Rie Jørgensen joined Visma in 2014 and has held various roles in marketing across Visma's business units. She now works as Marketing Director at Visma Group.

Published by

Henriette Veiby

henriette.veiby@visma.com

Henriette Veiby works as a Product Owner at Visma Enterprise AS. She joined Visma in 2019 and has a Bachelor in Business Administration and a Master in Information Systems.

Published by

Magnus August Lien

magnus.lien@visma.com

Magnus August Lien is a Product Director at Visma Resolve. He joined Visma in 2019 and has a specialisation within machine learning and mathematical optimisation.

Published by

Philip Scarampi

philip.scarampi@visma.com

Philip Scarampi is a Sustainability Manager at Visma Group, with a background in communication and content marketing. He joined Visma in 2019.

Published by

Ari-Pekka Salovaara

ari-pekka.salovaara@visma.com

Ari-Pekka Salovaara is the Segment Director of Small Businesses at Visma. This segment provides cloud accounting and payroll software for over 2 million small businesses and accounting offices in 20 countries.

Published by

Helle Hobbelhagen

helle.hobbelhagen@visma.com

Helle Hobbelhagen is a Security Marketing Coordinator for Visma Group. She joined Visma in 2020 and specialises in security awareness and culture building.

Published by

Alexander Lystad

alexander.lystad@visma.com

T. Alexander Lystad is the Chief Technology Officer at Visma. He's been with Visma since 2012, implementing and extracting value from Continuous Delivery, DevOps, and cloud technology.

Published by

Ida Strande Markman

ida.strandemarkman@visma.com

Ida Strande Markman is a People Operations Manager at Visma Group, with a passion for diversity and inclusion in the workplace. She joined Visma in 2019.

Published by

Amanda Lundius Mörck

amanda.lundius.morck@visma.com

Amanda Lundius is the Director of User Experience at Visma. She holds a B.Sc. in Interaction Design and believes that good UX design is responsible and mindful, meeting the goals of both people and businesses.

Published by

Isabel Arkvik

isabel.arkvik@visma.com

Isabel Arkvik is the Head of Security Awareness and Training at Visma Group. She joined Visma in 2019.

Published by

Gina Ross Eriksen

gina.eriksen@visma.com

Gina Ross Eriksen is a Content Specialist at Visma Group.

Published by

Juliette Prado

juliette.prado@visma.com

Juliette Prado was a Senior Content Writer for Visma Group from 2023 to 2025.

‍

Published by

Anne-Grethe Thomle Karlsen

anne-grethe.thomle.karlsen@visma.com

Anne-Grethe Thomle Karlsen joined Visma in 2001 and most recently held the role of Director of Content & Sponsorship at Visma Group.

Published by

Esben Toke Christensen

Leading with agentic AI: Why every business needs a bold AI ambition now

Visma's CTO, T. Alexander Lystad, emphasises the transformative power of agentic AI in reshaping business strategies and urges leaders to adopt ambitious AI goals to maintain competitiveness.

The Transformer Architecture

From Text to Vectors

Tokenizing and Encoding

Embedding the Encoded Tokens

The Encoder Stack

From Representation to Text

Transformers and Long Sequences

Dot-Product Self-Attention and Quadratic Scaling

Why all this Matters to Smartscan

🚨Efficient Transformer Architectures

Longformer

Linformer

The Performer Layer

🚨 Performer Details for the Nerds

Conclusion

About the episode

Voice of Visma

Related content

Leading with agentic AI: Why every business needs a bold AI ambition now

AI for leaders: A practical guide

Visma’s Machine Learning Assets Team make innovation in NLP

Voice of Visma Ep 13: Building partnerships beyond software with Daniel Ognøy Kaspersen

Voice of Visma Ep 12: AI in the accounting sphere with Joris Joppe

Voice of Visma Ep 01: An optimistic look at the future of AI with Jacob Nyman

Voice of Visma Ep 01: Show notes

The state of AI in 2025

AI explained: What is generative AI?

How AI optimises performance for Team Visma | Lease a Bike

AV1: A robot specifically developed for the classroom and school life

Optimised kindergarten admissions with AI

What is Robotic Process Automation (RPA)?

Machine learning of Smartscan part 2: Transformers

The AI ​​dictionary: Demystification of AI concepts

What’s the deal with AI copilots?

How AI is transforming Visma’s software development

Making Visma more efficient and sustainable with the help of robots

Optimising small businesses across Europe: 5 countries, 5 Visma products

6 ways to optimise customer support content for AI

An inside look at AI in the public sector

5 elements to include in your AI strategy

Working with AI is working with people

Tackling society’s healthcare challenges through innovation

4 Visma products using AI assistants and chatbots to elevate customer support

Democratising Machine Learning through Hugging Face

Garbage in, Gospel Out – The Unrealistic Expectations of AI

Introducing Machine Learning as a Service – a scalable, automatic way of delivering AI to customers

How Visma implemented a chatbot that nearly doubled the goal

How Visma makes invoicing tasks a thing of the past

Planning work schedules gets even easier with automated shift design

Machine learning of Smartscan Part 1: Bert awakens

Can a robot be sustainable?

The importance of ethics in Artificial Intelligence

Which AI product ideas are worth exploring?

Designing conversational AI from a UX perspective

Meet our AI Director: Jacob Nyman

Meet our AI talents: Martin Sommerseth

Voice of Visma Ep 04: How do you make people care about security? with Joakim Tauren

Voice of Visma Ep 04: Show notes

Strengthening cybersecurity across Europe with the NIS2 Directive

7 essential tips for maintaining good cyber hygiene

Partnering with ethical hackers to keep our software secure

What is ethical hacking?

How the Visma Security Program strengthens cybersecurity in the public sector

Establishing a cybersecurity programme: A step-by-step guide

Understanding the psychological barriers to 2FA and how we can overcome them

Cyber security: Can we trust cloud computing services?

3 ways employees pose a security threat for the corporation

Inspiring women to choose a career in tech and cyber security

What is ransomware and how does it work?

Visma’s contribution to European Cyber Security Month

European Cyber Security Month 2022

What is an IT Security Policy?

How to fix security issues in code faster

7 essential cybersecurity tips for summer travel

How on earth do cyber attacks cost almost €7 trillion a year?

What is Data Protection Day?

Meet our tech talents: Lars Holtar & Ioana Piroska

Cloud security: Frequently asked questions

What is social engineering and how to prevent such attacks?

Password manager – what is it and what are the benefits?

The AI dictionary: Demystification of AI concepts