
Transformer-and-Attention

By Naman Agrawal & Priyanka Cornelius

This blog is aimed at explaining the Transformer and its Attention mechanism in a lucid and intuitive manner.

First things first:

To get the most out of this post, it is recommended that you are comfortable with the following concepts:

Training a typical neural network involves the following steps:

  1. Input an example from the dataset.
  2. The network takes that example and applies some complex computations to it using randomly initialised variables (called weights and biases).
  3. A predicted result is produced.
  4. Comparing that result to the expected value gives us an error.
  5. Propagating the error back through the same path adjusts the variables.
  6. Steps 1–5 are repeated until we are confident that our variables are well-defined.
  7. A prediction is made by applying these variables to a new, unseen input.

Of course, that is quite a naive explanation of a neural network, but at least it gives a good overview and might be useful for someone completely new to the field.
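
To make steps 1–7 concrete, here is a minimal sketch of that loop for a tiny feedforward network in plain NumPy. The layer sizes, the toy data, and the choice of output activation and error signal are all illustrative assumptions, not part of the original explanation.

```python
import numpy as np

# Toy data: 4 examples with 3 features and binary targets (purely illustrative)
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))
y = np.array([[0.0], [1.0], [1.0], [0.0]])

# Step 2: randomly initialised variables (weights and biases)
W1, b1 = rng.normal(scale=0.5, size=(3, 5)), np.zeros(5)
W2, b2 = rng.normal(scale=0.5, size=(5, 1)), np.zeros(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

lr = 0.1
for step in range(1000):                     # Step 6: repeat steps 1-5
    h = np.tanh(X @ W1 + b1)                 # Steps 1-2: forward computation
    pred = sigmoid(h @ W2 + b2)              # Step 3: predicted result
    d_out = pred - y                         # Step 4: error signal at the output
    # Step 5: propagate the error back and adjust the variables
    grad_W2, grad_b2 = h.T @ d_out, d_out.sum(axis=0)
    d_h = (d_out @ W2.T) * (1 - h ** 2)
    grad_W1, grad_b1 = X.T @ d_h, d_h.sum(axis=0)
    W1 -= lr * grad_W1; b1 -= lr * grad_b1
    W2 -= lr * grad_W2; b2 -= lr * grad_b2

# Step 7: prediction on a new, unseen input
x_new = rng.normal(size=(1, 3))
print(sigmoid(np.tanh(x_new @ W1 + b1) @ W2 + b2))
```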

Recurrent neural networks work similarly but, in order to get a clear understanding of the difference, we will go through the simplest model using the task of predicting the next word in a sequence based on the previous ones.

First, we need to train the network using a large dataset. For this purpose, we can choose any large text (“War and Peace” by Leo Tolstoy is a good choice). When done training, we can input the sentence “Napoleon was the Emperor of…” and expect a reasonable prediction based on the knowledge from the book.

So, how do we start? As explained above, we input one example at a time and produce one result, both of which are single words. The difference from a feedforward network is that we also need to be informed about the previous inputs before evaluating the result. So you can view RNNs as multiple feedforward neural networks passing information from one to the other.
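
That picture can be sketched directly: one Elman-style recurrent step is a feedforward layer that also takes the previous hidden state as input. The vocabulary size, hidden size, and example word ids below are made-up values for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, hidden_size = 10, 8          # illustrative sizes

# The same parameters are reused at every time step
W_xh = rng.normal(scale=0.1, size=(vocab_size, hidden_size))
W_hh = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
b_h = np.zeros(hidden_size)

def rnn_step(x_onehot, h_prev):
    """One step: combine the current input with information from previous steps."""
    return np.tanh(x_onehot @ W_xh + h_prev @ W_hh + b_h)

h = np.zeros(hidden_size)                # empty state before the first word
for word_id in [3, 1, 7]:                # e.g. the word ids of "Napoleon was the"
    x = np.eye(vocab_size)[word_id]      # one-hot encoding of the current word
    h = rnn_step(x, h)                   # the state carries information forward
```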

RNNs can be used as language models for predicting future elements of a sequence given prior elements of the sequence. However, we are still missing the components necessary for building translation models since we can only operate on a single sequence, while translation operates on two sequences – the input sequence and the translated sequence.

Sequence to sequence models build on top of language models by adding an encoder step and a decoder step. In the encoder step, a model converts an input sequence (such as an English sentence) into a fixed representation. In the decoder step, a language model is trained on both the output sequence (such as the translated sentence) as well as the fixed representation from the encoder. Since the decoder model sees an encoded representation of the input sequence as well as the translation sequence, it can make more intelligent predictions about future words based on the current word. For example, in a standard language model, we might see the word “crane” and not be sure if the next word should be about the bird or heavy machinery. However, if we also pass an encoder context, the decoder might realize that the input sequence was about construction, not flying animals. Given the context, the decoder can choose the appropriate next word and provide more accurate translations.

Long Short-Term Memory networks – usually just called “LSTMs” – are a special kind of RNN, capable of learning long-term dependencies. They were introduced by Hochreiter & Schmidhuber (1997), and were refined and popularized by many people in following work. They work tremendously well on a large variety of problems, and are now widely used.

LSTMs are explicitly designed to avoid the long-term dependency problem. Remembering information for long periods of time is practically their default behavior, not something they struggle to learn!

All recurrent neural networks have the form of a chain of repeating modules of neural network. In standard RNNs, this repeating module will have a very simple structure, such as a single tanh layer.

LSTMs also have this chain like structure, but the repeating module has a different structure. Instead of having a single neural network layer, there are four, interacting in a very special way.
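
For reference, the four interacting layers are the forget, input and output gates plus the candidate cell update. A minimal NumPy sketch of a single LSTM step, with a hypothetical `params` dictionary holding the learned matrices, looks like this:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, params):
    """One LSTM step: four layers (three gates and a candidate) interact to
    update the cell state c and the hidden state h."""
    z = np.concatenate([h_prev, x])                       # previous state + current input
    f = sigmoid(params["W_f"] @ z + params["b_f"])        # forget gate: what to drop from memory
    i = sigmoid(params["W_i"] @ z + params["b_i"])        # input gate: what to write to memory
    o = sigmoid(params["W_o"] @ z + params["b_o"])        # output gate: what to expose
    c_tilde = np.tanh(params["W_c"] @ z + params["b_c"])  # candidate new memory content
    c = f * c_prev + i * c_tilde                          # keep part of the old memory, add new
    h = o * np.tanh(c)                                    # hidden state read from the memory
    return h, c
```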

When I’m translating a sentence, I pay special attention to the word I’m presently translating. When I’m transcribing an audio recording, I listen carefully to the segment I’m actively writing down. And if you ask me to describe the room I’m sitting in, I’ll glance around at the objects I’m describing as I do so.

Neural networks can achieve this same behavior using attention, focusing on part of a subset of the information they’re given. For example, an RNN can attend over the output of another RNN. At every time step, it focuses on different positions in the other RNN.

We’d like attention to be differentiable, so that we can learn where to focus. To do this, we use the same trick Neural Turing Machines use: we focus everywhere, just to different extents.

The attention distribution is usually generated with content-based attention. The attending RNN generates a query describing what it wants to focus on. Each item is dot-producted with the query to produce a score, describing how well it matches the query. The scores are fed into a softmax to create the attention distribution.
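
As a concrete sketch (NumPy, with made-up dimensions), content-based attention is just a dot product of the query with each item followed by a softmax:

```python
import numpy as np

def softmax(scores):
    e = np.exp(scores - scores.max())        # subtract the max for numerical stability
    return e / e.sum()

def content_based_attention(query, items):
    """query: (d,) vector; items: (n, d) matrix of positions to attend over."""
    scores = items @ query                   # how well each item matches the query
    weights = softmax(scores)                # the attention distribution
    return weights @ items, weights          # focus everywhere, to different extents

rng = np.random.default_rng(0)
context, weights = content_based_attention(rng.normal(size=4), rng.normal(size=(6, 4)))
```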

One use of attention between RNNs is translation. A traditional sequence-to-sequence model has to boil the entire input down into a single vector and then expand it back out. Attention avoids this by allowing the RNN processing the input to pass along information about each word it sees, and then for the RNN generating the output to focus on words as they become relevant.

Let's start with the basics of a Transformer:

Why do we need a Transformer?

Adapted from Harvard NLP.

Solutions to sequence modeling and transduction problems (e.g. language modeling, machine translation) have been dominated by RNNs (especially gated RNNs such as the LSTM) and CNNs arranged as an encoder and a decoder, often additionally employing an attention mechanism.

When an RNN (or CNN) takes a sequence as input, it handles the sentence word by word. When such sequences are too long, the model is prone to forgetting the content of distant positions in the sequence or mixing it up with the content of following positions. This is where the LSTM comes into the picture. Moreover, this sequentiality is an obstacle to parallelizing the computation.

The goal of reducing sequential computation forms the foundation of the Extended Neural GPU, ByteNet and ConvS2S, all of which use convolutional neural networks as the basic building block, computing hidden representations in parallel for all input and output positions. In these models, the number of operations required to relate signals from two arbitrary input or output positions grows with the distance between positions, linearly for ConvS2S and logarithmically for ByteNet. This makes it more difficult to learn dependencies between distant positions.

As an alternative to convolutions, the Transformer proposes to encode each position and apply the attention mechanism to relate any two distant words directly. Because this does not depend on stepping through the sequence, it can be parallelized, accelerating training. The Transformer uses a multi-head attention mechanism that allows it to model dependencies regardless of their distance in the input or output sentence, reducing the number of sequential operations needed to relate two symbols from the input/output sequences to a constant.

Why the name Transformer?

The Transformer architecture is aimed at the problem of sequence transduction (by Alex Graves), meaning any task where input sequences are transformed into output sequences. This includes speech recognition, text-to-speech transformation, machine translation, protein secondary structure prediction, Turing machines etc. Basically the goal is to design a single framework to handle as many sequences as possible.

What does a Transformer do?

  • The Transformer builds on the sequence-to-sequence model for Statistical Machine Translation (SMT) introduced in Cho et al., 2014, which uses two RNNs: an encoder to process the input and a decoder to generate the output.

  • In general, the Transformer's encoder maps the input sequence to a continuous representation z, which in turn is used by the decoder to generate the output, one symbol at a time.

  • The final state of the encoder is a fixed-size vector z that must encode the entire source sentence, including its meaning. This final state is therefore called a sentence embedding.

  • The encoder-decoder model is designed to be auto-regressive at each step, i.e. it uses the previously generated symbols as extra input while generating the next symbol. Thus, (xi, yi−1) → yi.

Neural Encoder-Decoder Model

Adapted from Graham Neubig's CMU tutorial.

The Encoder-Decoder model aims at tackling the statistical machine translation problem of modeling the probability P(E|F) of the output E given the input F. The name “encoder-decoder” comes from the idea that the first neural network running over F “encodes” its information as a vector of real-valued numbers (the hidden state), then the second neural network used to predict E “decodes” this information into the target sentence.

If the encoder is expressed as RNN(f)(·), the decoder is expressed as RNN(e)(·), and we have a softmax that takes RNN(e)’s hidden state at time step t and turns it into a probability, then our model is expressed as follows:
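
In Neubig's notation, the five equations are (approximately, as reconstructed from the description below):

```latex
\begin{aligned}
m_t^{(f)} &= M^{(f)}_{\cdot,\, f_t} \\
h_t^{(f)} &= \mathrm{RNN}^{(f)}\big(m_t^{(f)},\, h_{t-1}^{(f)}\big) \\
m_t^{(e)} &= M^{(e)}_{\cdot,\, e_{t-1}} \\
h_t^{(e)} &= \mathrm{RNN}^{(e)}\big(m_t^{(e)},\, h_{t-1}^{(e)}\big) \\
p_t^{(e)} &= \mathrm{softmax}\big(W_{hs}\, h_t^{(e)} + b_s\big)
\end{aligned}
```

Here M(f) and M(e) are the source and target word-embedding matrices, and Whs, bs are the parameters of the output softmax layer (these symbol names follow Neubig's tutorial and are an assumption on our part).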

In the first two lines, we look up the embedding mt(f) and calculate the encoder hidden state ht(f) for the tth word in the source sequence F. We start with an empty vector h0(f) = 0, and by h|F|(f), the encoder has seen all the words in the source sentence. Thus, this hidden state should theoretically be able to encode all of the information in the source sentence.

In the decoder phase, we predict the probability of word et at each time step. First, we similarly look up mt(e), but this time use the previous word et-1, as we must condition the probability of et on the previous word, not on itself. Then, we run the decoder to calculate ht(e). This is very similar to the encoder step, with the important difference that h0(e) is set to the final state of the encoder h(f)|F|, allowing us to condition on F. Finally, we calculate the probability pt(e) by using a softmax on the hidden state ht(e). While this model is quite simple (only 5 lines of equations), it gives us a straightforward and powerful way to model P(E|F).

To get a deeper insight into the Transformer in a more illustrated format, we read The Illustrated Transformer by Jay Alammar; however, we were left with a few unanswered questions after reading it.

In this blog we will attempt to answer those questions.

A deeper look at Attention!

The basic idea behind attention is that it tells us how much we are “focusing” on a particular source word at a particular time step. Without attention, the encoder-decoder can only access information about the first encoded word in the source by passing it along over |F| time steps. The attention mechanism instead allows the source encoding to be accessed (in a weighted manner) through the context vector.

If H(f), a matrix of vectors encoding each word in the input sentence F, is the output of the encoder, we calculate an attention vector αt that can be used to combine together the columns of H into a context vector ct.

ct = H(f)αt.

Attention between encoder and decoder is crucial in NMT. Attention is a function that maps a 2-element input (a query and a set of key-value pairs) to an output. The output of this mapping is a weighted sum of the values, where the weight for each value measures how much its key interacts with (or answers) the query. While attention has been the focus of much research, the novelty of the Transformer's attention is that it is multi-head self-attention.

Basic Idea: (Bahdanau et al. 2015)

  • Encode each word in the sentence into a vector
  • When decoding, perform a linear combination of these vectors, weighted by “attention weights”
  • Use this combination in picking the next word

After reading the above explanation, we had two major concerns:

1) How are the attention weights obtained?

2) What are Key, Query and Value?

Calculating Attention weights

As before, the decoder’s hidden state ht(e) is a fixed-length continuous vector representing the previous target words e1, …, et−1, initialized as h0(e) = h(f)|F|+1. This is used to calculate a context vector ct that summarizes the source attentional context used in choosing the target word et, and is initialized as c0 = 0.

First, we update the hidden state to ht(e) based on the word representation and context vectors from the previous target time step:

ht(e) = enc([embed(et−1); ct−1],ht−1(e)).

Based on this ht(e), we calculate an attention score at, with each element equal to

at,j = attn score(hj(f),ht(e)).

attn score(·) can be an arbitrary function that takes two vectors as input and outputs a score about how much we should focus on this particular input word encoding hj(f) at the time step ht(e). We then normalize this into the actual attention vector itself by taking a softmax over the scores:

αt = softmax(at).

This attention vector is then used to weight the encoded representation H(f) to create a context vector ct for the current time step.
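
A sketch of this score → softmax → context pipeline in NumPy (assuming the encoder states are stacked as rows of H(f), and taking the score function as an argument):

```python
import numpy as np

def softmax(scores):
    e = np.exp(scores - scores.max())
    return e / e.sum()

def attention_context(H_f, h_t_e, attn_score):
    """H_f: (|F|, d) encoder states (one row per source word);
    h_t_e: (d,) decoder state; attn_score: function (h_j_f, h_t_e) -> scalar."""
    a_t = np.array([attn_score(h_j, h_t_e) for h_j in H_f])  # scores a_{t,j}
    alpha_t = softmax(a_t)                                    # attention vector alpha_t
    c_t = alpha_t @ H_f                                       # context vector c_t = H(f) alpha_t
    return c_t, alpha_t
```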

The following are three different attention score functions:

1) Dot product:

This is the simplest of the functions, as it simply calculates the similarity between ht(e) and hj(f) as measured by the dot product:

attn score(hj(f),ht(e)) := hj(f)ᵀht(e).

2) Bilinear functions:

This function helps relax the restriction that the source and target embeddings must be in the same space by performing a linear transform parameterized by Wa before taking the dot product:

attn score(hj(f),ht(e)) := hj(f)ᵀWaht(e).

3) Multi-layer perceptrons:

This was the method employed by Bahdanau et al. 2015 in their original implementation of attention:

attn score(hj(f),ht(e)) := wa2tanh(Wa1[ht(e) ; hj(f)]),

where Wa1 and wa2 are the weight matrix and vector of the first and second layers of the MLP respectively.
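
The three score functions, sketched in NumPy (the matrices Wa, Wa1 and the vector wa2 would normally be learned parameters; here they are simply passed in as arguments):

```python
import numpy as np

def dot_product_score(h_j_f, h_t_e):
    # Requires the source and target states to live in the same space
    return h_j_f @ h_t_e

def bilinear_score(h_j_f, h_t_e, W_a):
    # W_a linearly transforms the target state before the dot product
    return h_j_f @ W_a @ h_t_e

def mlp_score(h_j_f, h_t_e, W_a1, w_a2):
    # Bahdanau-style multi-layer perceptron over the concatenated states
    return w_a2 @ np.tanh(W_a1 @ np.concatenate([h_t_e, h_j_f]))
```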

Key, Query and Value

In terms of the encoder-decoder, the query is usually the hidden state of the decoder, whereas the keys (and their associated values) come from the hidden states of the encoder. The dot product of the query with each key produces a score, which is normalized into a weight representing how much attention that key gets; the output is then calculated as the weighted sum of the values.

It is assumed that queries and keys are of dk dimension and values are of dv dimension. Those dimensions are imposed by the linear projection discussed in the multi-head attention section. The input is represented by three matrices: queries’ matrix Q, keys’ matrix K and values’ matrix V.

The compatibility function (see the attention primer above) is considered in two variants, additive (Bahdanau et al. 2015) and multiplicative (dot-product), which have similar theoretical complexity.
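
Putting Q, K and V together, the multiplicative (dot-product) variant used by the Transformer can be sketched as follows; the scaling by √dk comes from the Transformer paper, and the matrix shapes in the usage example are illustrative:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Q: (n_q, d_k) queries, K: (n_k, d_k) keys, V: (n_k, d_v) values."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # compatibility of every query with every key
    weights = softmax(scores, axis=-1)   # one attention distribution per query
    return weights @ V                   # weighted sum of the values: (n_q, d_v)

rng = np.random.default_rng(0)
out = scaled_dot_product_attention(rng.normal(size=(2, 8)),   # 2 queries, d_k = 8
                                   rng.normal(size=(5, 8)),   # 5 keys
                                   rng.normal(size=(5, 4)))   # 5 values, d_v = 4
```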
