Transformers: Attention Is All You Need

Categories: Attention, Paper, Transformer
Author: richsi
Published: December 30, 2024

“Attention Is All You Need” by Ashish Vaswani et al., 2017.

Importing packages
import torch
import numpy as np

Introduction

The transformer is an attention-based network architecture that learns context and meaning by tracking relationships in sequential data, such as the words in a sentence. Sequence modeling and transduction problems such as machine translation were previously solved with recurrent neural networks (RNNs) and long short-term memory networks (LSTMs). However, the inherently sequential nature of RNNs prevents their training from being parallelized. Transformers also avoid the vanishing and exploding gradient problems that affect RNNs.

Building Blocks

Positional Encoding

Positional encoding, also known as sinusoidal positional encoding, is added to both the input and output embeddings. Positional encodings are fixed rather than learned, so they only need to be computed once, whether for training or inference. Below is the formula:

\[ \begin{aligned} PE_{(pos, 2i)} = \sin(pos / 10000^{2i/d_{model}}) \\ PE_{(pos, 2i+1)} = \cos(pos / 10000^{2i/d_{model}}) \end{aligned} \]

The shape of the positional encoding matches the shape of the input embeddings. \(pos\) represents the position of the token in the sequence, \(i\) represents the current dimension index, and \(d_{model}\) represents the total number of dimensions in the embedding vector.
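As an illustrative sketch (the function name positional_encoding and the seq_len of 10 below are my own choices, not from the paper), the encoding can be computed with NumPy as follows, assuming \(d_{model}\) is even:

def positional_encoding(seq_len, d_model):
  """Compute sinusoidal positional encodings of shape (seq_len, d_model)."""
  pos = np.arange(seq_len)[:, np.newaxis]        # token positions, shape (seq_len, 1)
  i = np.arange(d_model // 2)[np.newaxis, :]     # dimension pair indices, shape (1, d_model/2)
  angle_rates = 1 / np.power(10000, (2 * i) / d_model)
  pe = np.zeros((seq_len, d_model))
  pe[:, 0::2] = np.sin(pos * angle_rates)        # even dimensions use sine
  pe[:, 1::2] = np.cos(pos * angle_rates)        # odd dimensions use cosine
  return pe

pe = positional_encoding(seq_len=10, d_model=512)
print(pe.shape)  # (10, 512)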

Scaled Dot-Product Attention

\[ \text{Attention}(Q, K, V) = \text{softmax}(\frac{QK^{T}}{\sqrt{d_{k}}})V \]

Scaled Dot-Product Attention takes in three inputs: query \(\bf{Q}\), key \(\bf{K}\), and value \(\bf{V}\), where \(X\) is the input and \(W^Q\), \(W^K\), \(W^V\) are learnable weight matrices specific to queries, keys, and values.

\[ \begin{aligned} Q = XW^Q\\ K = XW^K\\ V = XW^V\\ \end{aligned} \]

  • \(\bf{X}\) represents the input embeddings to the transformer layer. For the first layer, \(X\) = word embedding + positional encoding. In subsequent layers, \(X\) is the output of the previous transformer layer. A word embedding is a vector that represents the semantic meaning of a word (or token). A positional encoding is a vector that encodes information about a token’s position in the sequence by mapping that position to the latent space.

  • \(\bf{Q}\) represents the current token’s representation to “query” information from other tokens. It’s derived from the combined word and positional information.

  • \(\bf{K}\) represents every token in the sequence and helps determine the relevance of each token to the query token. Like \(Q\), it is based on the same input embeddings (word + positional) but transformed differently via \(W^K\).

  • \(\bf{V}\) represents the information associated with each token. It is the actual data that gets weighted and aggregated based on attention scores.

The dimensions of \(Q\), \(K\), and \(V\) are determined by the model’s latent space dimensionality (\(d_{model}\)) and the number of attention heads (\(h\)). In this example, we initialize the latent space dimension to be 512 and the number of attention heads to be 8.

def scaled_dot_product_attention(Q, K, V):
  """
  Args:
  Q, K, V: Query, Key, and Value matrices (numpy arrays).
      Shapes: (N, d_k) for Q and K, (N, d_v) for V.

  Returns:
  output: Weighted sum of values after applying attention, shape (N, d_v)
  attention_weights: shape (N, N)
  """
  d_k = Q.shape[1] # Dimensionality of keys/queries
  scores = np.dot(Q, K.T) / np.sqrt(d_k) # Shape: (N,d_k) @ (d_k,N) -> (N, N)
  scores = scores - np.max(scores, axis=1, keepdims=True) # Subtract row max for numerical stability
  attention_weights = np.exp(scores) / np.sum(np.exp(scores), axis=1, keepdims=True) # Row-wise softmax
  output = np.dot(attention_weights, V) # Shape: (N,N) @ (N,d_v) -> (N,d_v)
  return output, attention_weights
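
As a quick usage sketch, the random matrices below stand in for the learned projections \(W^Q\), \(W^K\), \(W^V\); with \(d_{model} = 512\) and \(h = 8\), each head works in a \(d_k = 64\) dimensional subspace:

N, d_model, h = 10, 512, 8
d_k = d_model // h                              # 64 dimensions per head
X = np.random.randn(N, d_model)                 # input embeddings (word + positional)
W_Q = np.random.randn(d_model, d_k)             # random stand-ins for learned weights
W_K = np.random.randn(d_model, d_k)
W_V = np.random.randn(d_model, d_k)
Q, K, V = X @ W_Q, X @ W_K, X @ W_V             # project inputs into query/key/value spaces
output, attention_weights = scaled_dot_product_attention(Q, K, V)
print(output.shape, attention_weights.shape)    # (10, 64) (10, 10)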

Note: In the encoder and decoder’s self-attention layers, the input is passed to all three parameters \(Q\), \(K\), and \(V\). In the decoder’s encoder-decoder attention layer, the output of the final encoder in the stack is passed to the \(K\) and \(V\) parameters. The output of the previous decoder self-attention layer is passed to the \(Q\) parameter.
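To illustrate that wiring, the snippet below reuses the stand-in projection matrices from the example above; encoder_output and decoder_hidden are hypothetical placeholders for the final encoder output and the output of the decoder’s previous sublayer:

encoder_output = np.random.randn(N, d_model)  # output of the final encoder in the stack
decoder_hidden = np.random.randn(N, d_model)  # output of the decoder's self-attention sublayer

# Encoder-decoder attention: queries come from the decoder,
# keys and values come from the encoder output
cross_out, cross_weights = scaled_dot_product_attention(
  decoder_hidden @ W_Q, encoder_output @ W_K, encoder_output @ W_V)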

Multi-Head Attention

\[ \text{MultiHead}(Q, K, V) = \text{Concat}(\text{Attention}_1, \text{Attention}_2, ..., \text{Attention}_h)W^O \]

Multi-head attention extends scaled dot-product attention by running multiple attention heads in parallel. Each head operates in a different subspace, allowing each to learn a different kind of relationship; for example, one head might focus on syntax while another focuses on semantics. The steps are outlined below, followed by a short code sketch.

Steps:

  1. Linear Projections: Project \(Q, K, V\) into \(h\) lower-dimensional subspaces of size \(d_{model}/h\), one for each head.

  2. Parallel Attention: Perform scaled dot-product attention independently for each head.

  3. Concatenation: Combine the outputs of all heads into a single matrix.

  4. Final Linear Transformation: Project the concatenated matrix back to \(d_{model}\) using \(W^O\).
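
A minimal NumPy sketch of these four steps, building on scaled_dot_product_attention above; the function name and the random weight initialization are illustrative assumptions, since real projection weights are learned:

def multi_head_attention(X, d_model=512, h=8):
  """Run h attention heads in parallel on input X of shape (N, d_model)."""
  d_k = d_model // h
  heads = []
  for _ in range(h):
    # 1. Linear projections into a separate subspace per head (random stand-ins for learned weights)
    W_Q = np.random.randn(d_model, d_k)
    W_K = np.random.randn(d_model, d_k)
    W_V = np.random.randn(d_model, d_k)
    # 2. Scaled dot-product attention performed independently for each head
    head_out, _ = scaled_dot_product_attention(X @ W_Q, X @ W_K, X @ W_V)
    heads.append(head_out)
  # 3. Concatenate the h head outputs into a single (N, d_model) matrix
  concat = np.concatenate(heads, axis=1)
  # 4. Final linear transformation back to d_model via W_O
  W_O = np.random.randn(d_model, d_model)
  return concat @ W_O

out = multi_head_attention(np.random.randn(10, 512))
print(out.shape)  # (10, 512)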

Encoder

The encoder stack consists of a series of identical encoder layers. Each encoder layer contains a self-attention sublayer, a feed-forward sublayer, and two layer normalization layers. Residual connections wrap the self-attention and feed-forward sublayers.
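
As a hedged PyTorch sketch of this structure (the class name EncoderLayer, the feed-forward width d_ff = 2048, and the use of nn.MultiheadAttention are illustrative choices):

import torch.nn as nn

class EncoderLayer(nn.Module):
  def __init__(self, d_model=512, h=8, d_ff=2048):
    super().__init__()
    self.self_attn = nn.MultiheadAttention(d_model, h, batch_first=True)
    self.norm1 = nn.LayerNorm(d_model)
    self.norm2 = nn.LayerNorm(d_model)
    self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))

  def forward(self, x):
    # Residual connection around self-attention, followed by layer normalization
    attn_out, _ = self.self_attn(x, x, x)
    x = self.norm1(x + attn_out)
    # Residual connection around the feed-forward sublayer, followed by layer normalization
    return self.norm2(x + self.ffn(x))

layer = EncoderLayer()
x = torch.randn(2, 10, 512)   # (batch, sequence length, d_model)
print(layer(x).shape)         # torch.Size([2, 10, 512])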

Decoder

Applications and Insights

  • Where and how is this concept applied in the real world?
  • What problems does it solve, and what value does it provide?
  • What are some examples or use cases where this has been impactful?
  • What are the broader implications or insights gained from this work?
  • How does this contribute to advancements in the field or industry?

Conclusion

  • What are the key takeaways or lessons from this guide?
  • Why is this concept significant in the broader context of the field?
  • What questions remain unanswered or open for further exploration?
  • What resources or next steps can help deepen understanding?
  • How can this knowledge be applied or expanded upon in practice?