
Retentive Network – notes

Useful links

Retentive Network: A Successor to Transformer for Large Language Models
Y. Sun, L. Dong, S. Huang, S. Ma, Y. Xia, J. Xue, J. Wang, F. Wei
arXiv:2307.08621 [cs.CL] (2023)

Official implementation on GitHub (link)

PyTorch implementation of RetNet by Jamie Stirling (link)

xPos paper (link)

Reading code

Photo by Nicole Wolf

Useful links

llama2.c – A. Karpathy’s GitHub page (link)

Introducing Llama 2
– Meta AI (link)

What Is a Transformer Model? – NVIDIA blog (link)

Rotary Embeddings: A Relative Revolution – EleutherAI (link)

Forward-Forward algorithm

Photo by Tolga Ulkan

2022 closed with Hinton's latest effort, The Forward-Forward Algorithm: Some Preliminary Investigations. It is not my intention to stir up a controversy about Hinton, but to this day it still escapes me what his real contribution to neural networks is. The last time I covered an article by Hinton was on Capsule Networks (what happened to them?) a few years ago.

There are some issues with backpropagation: first, even though neural networks are loosely modeled on real neuronal functioning, backpropagation has no biological counterpart; second, everything one plugs into a neural network (as a black box) has to be modeled as a differentiable module to work with backpropagation.

Main idea

Hinton’s last paper introduces the Forward-Forward (FF) learning method with the following key features:

(a) FF replaces the forward and backward passes of backpropagation with two forward passes: one operates on positive data and the other on negative data;

(b) each layer has its own objective function, that is, a measure of goodness for positive and negative data;

(c) FF computes the gradients locally using a local objective function, so there is no need to backpropagate the errors.

Looking at a piece of the implementation code for the layer train method, the input is literally split into positive and negative samples to operate on.

Learning with a simple layer-wise goodness function

The sum of the squared activities in a layer can be used as the “goodness”, but there are many other possibilities, including minus the sum of the squared activities. Specifically, we aim to correctly classify input vectors as positive data or negative data when the probability that an input vector is positive is given by the following (θ is a threshold term and σ denotes the logistic function):

\displaystyle p(\mathsf{positive}) =  \sigma\left( \sum_j y_j^2 - \theta \right)\,.

A single hidden layer can be learned using the following criterion: the sum of squared activities of the hidden units has to be high for positive data (above the threshold value θ) and low for negative data.

A necessary observation: since it would be trivial to distinguish positive from negative data by simply using the length of the activity vector in the first hidden layer as an input to the second hidden layer (no need to learn new features), FF normalizes the length of the hidden vector before using it as input to the next layer. Briefly, the activity vector in the first hidden layer has a length and an orientation: the length is used to define the goodness for that layer, and only the orientation is passed to the next layer.
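As an illustration only (not Hinton's code), here is a minimal PyTorch sketch of a single FF layer putting together the goodness measure, the local objective and the length normalization just described; names like `FFLayer` and the choices of ReLU and Adam are my own assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FFLayer(nn.Module):
    """One fully-connected layer trained with its own local FF objective."""
    def __init__(self, d_in, d_out, threshold=2.0, lr=0.03):
        super().__init__()
        self.linear = nn.Linear(d_in, d_out)
        self.threshold = threshold              # the theta in p(positive)
        self.opt = torch.optim.Adam(self.parameters(), lr=lr)

    def forward(self, x):
        # normalize the input length: only the orientation of the previous
        # layer's activity vector is passed on
        x = x / (x.norm(dim=1, keepdim=True) + 1e-8)
        return F.relu(self.linear(x))

    def train_step(self, x_pos, x_neg):
        y_pos, y_neg = self.forward(x_pos), self.forward(x_neg)
        # goodness = sum of squared activities
        g_pos = y_pos.pow(2).sum(dim=1)
        g_neg = y_neg.pow(2).sum(dim=1)
        # push goodness above the threshold for positive data, below for negative data
        loss = F.softplus(torch.cat([self.threshold - g_pos,
                                     g_neg - self.threshold])).mean()
        self.opt.zero_grad()
        loss.backward()          # gradients stay local to this layer
        self.opt.step()
        # detach so no error is backpropagated to earlier layers
        return y_pos.detach(), y_neg.detach()
```

Stacking several such layers and calling `train_step` layer by layer gives the two forward passes without any global backward pass.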

A supervised example

To implement supervised learning with FF, one way is to include the class labels in the input (see figure below).

An image with the correct label constitutes positive data and an image with an incorrect label constitutes negative data. The only difference between positive and negative data is the label, so FF should ignore all image features that do not correlate with the label.

After training on the MNIST dataset with FF, it is possible to classify a test digit by running the net with a particular label as part of the input and accumulating the goodnesses of all but the first hidden layer. This has to be done for each label separately; the label with the highest accumulated goodness is then chosen. The paper reports that, during training, a forward pass from a neutral label was used to pick hard negative labels.

With MNIST, after training all the layers, to make a prediction for a test image x we evaluate the pair (x, y) for every label y in {0, 1, …, 9} and choose the label that maximizes the network's accumulated goodness.
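A minimal sketch of this label-scoring procedure (the `overlay_label` helper, which writes a candidate label into the input, is hypothetical):

```python
import torch

def predict_digit(layers, x, overlay_label, num_classes=10):
    """Score each candidate label by the goodness accumulated over all but the
    first hidden layer; overlay_label is a hypothetical helper that embeds the
    candidate label into the input batch x."""
    scores = []
    for label in range(num_classes):
        h = overlay_label(x, label)
        goodness = torch.zeros(x.shape[0])
        for k, layer in enumerate(layers):
            h = layer(h)
            if k > 0:                      # skip the first hidden layer, as in the paper
                goodness = goodness + h.pow(2).sum(dim=1)
        scores.append(goodness)
    return torch.stack(scores, dim=1).argmax(dim=1)
```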

Performance

Hinton’s paper reports a brief comparison between FF and backpropagation on CIFAR-10. The test performance of FF is slightly worse than backpropagation. There is also an interesting page about the analysis of performance versus backpropagation.

Code implementations

I’d like to mention two GitHub repositories, one from Nebuly-ai and the other from Mohammad Pezeshki. Both are PyTorch implementations.

Useful Links

The Forward-Forward Algorithm: Some Preliminary Investigations
G. Hinton
arXiv:2212.13345 [cs.LG], 2022.

Code from Nebuly-ai.

Code from M. Pezeshki.

Detailed Backpropagation Algorithm (link).

Interesting performance analysis page.

Notes on the GUIE competition

Brief post on the 1st place solution of the Google Universal Image Embedding competition on Kaggle


The Google Universal Image Embedding (GUIE) competition is, as reported on the competition description page, the first competition on image representations that should work across many object types. Image representations are a key element of computer vision models. In the past, embedding learning techniques were applied to each domain separately, rather than developing generic embedding models applicable to all domains combined.

Representations are very useful. As a simple example, it is well known that autoencoders find representations of images. These representations are usually much smaller than the images from which they originate, and one can easily work on them (for example, comparing them) without going back to the original images.

Some of the types of images evaluated in this competition: apparel & accessories, packaged goods, landmarks, furniture & home decor, storefronts, dishes, artwork, toys, memes, illustrations, and cars. The competition requires contestants to develop a model able to generate a 64-dimensional embedding for each image. The back-end server then retrieves images of the same instance via a k-nearest-neighbor search (k = 5).

The competition ended in October 2022. In this post we examine the 1st place solution by Qinghua Cui and Shihao Shao, reporting some of Shao's comments on the strategies and development that led to the winning model.

First attempts

The competition is a bit atypical in that no dataset is provided. From the discussion it emerges that larger datasets result in better scores, since weights pre-trained on ImageNet-22K perform better than ImageNet-1K ones. So the first strategy was searching for weights pre-trained on very large datasets. A good starting point was CLIP, whose code can be found here.

Cui and Shao adopted the weights of a ViT-H model pre-trained on LAION-2B, a subset of LAION-5B, as their baseline. They added a linear projection layer to squeeze the embedding down to 64 dimensions, together with an ArcFace head. A Dropout layer with a drop rate of 0.2 was inserted between the last and the second-to-last linear layers. SGD with momentum was chosen as the optimizer, with an L2 weight decay of 1.5e-4.
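This is not the winners' code, just a rough PyTorch sketch of the kind of head described above (names like `EmbeddingHead` and `ArcMarginHead` are mine), assuming a `backbone` callable, e.g. an OpenCLIP ViT-H visual tower, that returns a feature vector of size `feat_dim`:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EmbeddingHead(nn.Module):
    """Projects backbone features to a 64-d embedding, roughly as described above."""
    def __init__(self, backbone, feat_dim, emb_dim=64, p_drop=0.2):
        super().__init__()
        self.backbone = backbone                  # pre-trained visual backbone
        self.pre = nn.Linear(feat_dim, feat_dim)  # second-to-last linear layer
        self.drop = nn.Dropout(p_drop)            # dropout between the two linear layers
        self.proj = nn.Linear(feat_dim, emb_dim)  # last layer: squeeze to 64-d

    def forward(self, x):
        f = self.backbone(x)
        return F.normalize(self.proj(self.drop(self.pre(f))), dim=1)

class ArcMarginHead(nn.Module):
    """ArcFace-style margin head, used only at training time."""
    def __init__(self, emb_dim, num_classes, s=30.0, m=0.3):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(num_classes, emb_dim))
        self.s, self.m = s, m

    def forward(self, emb, labels):
        cos = F.linear(F.normalize(emb), F.normalize(self.weight))
        theta = torch.acos(cos.clamp(-1 + 1e-7, 1 - 1e-7))
        target = torch.cos(theta + self.m)          # add the angular margin to the true class
        logits = torch.where(F.one_hot(labels, cos.size(1)).bool(), target, cos)
        return self.s * logits
```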

Dimensionality reduction algorithms such as random projections, PCA and t-SNE did not work.

Training

One of the first issues participants were concerned about was the strict competition rule banning datasets without a commercial license. Later, the rules were updated: licensing for the winning model was no longer required, only the source code used to generate it had to be licensed. All publicly available datasets were fine for model training as long as they were publicly disclosed on the forum.

Some attempts followed the scheme “choose datasets first, then decide the model and training details”. The winners tested various datasets such as Products-10k, Shopee, MET Artwork Dataset, Alibaba goods, H&M Personalized Fashion, GPR1200, GLDv2-Full, and the Consumer-to-shop Clothes Retrieval Benchmark part of DeepFashion. Datasets were added to the training list iteratively instead of training on every dataset from the very beginning. This pushed the score above 0.610.

The winners decided not to follow the usual LP-FT recipe (linear probing, then full fine-tuning). Instead, they trained the last 2 fully-connected layers to full convergence for 6 epochs, then froze them and trained only the backbone for 3 epochs. We present some of the reasons for that decision below.

They noted that the weights of the last layer changed rapidly when training on all the layers. Furthermore, the central embedding of each class changed rapidly, and so did the Euclidean distances between classes. Hence, they decided to

(a) freeze the final FC layer while training the rest (the backbone);

(b) add dropout to the fully-connected layer, a well-known trick against over-fitting that does not always work but did well in this case.

Products-10k gave the largest improvement, so it was used for fine-tuning, respecting the “first fc, then backbone” order and reaching a score of 0.671.
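In PyTorch, the “first fc, then backbone” schedule could be sketched roughly as follows (assuming the `EmbeddingHead` layout from the previous sketch and a generic `train_fn` training loop; this is not the winners' code):

```python
import torch.nn as nn

def set_trainable(module: nn.Module, flag: bool) -> None:
    """Enable or disable gradient updates for all parameters of a module."""
    for p in module.parameters():
        p.requires_grad = flag

def first_fc_then_backbone(model, train_fn):
    """'First fc, then backbone' schedule, assuming the EmbeddingHead sketch above
    (attributes .backbone, .pre, .proj) and a generic train_fn(model, epochs)."""
    # Stage 1: train only the last two fully-connected layers until convergence.
    set_trainable(model.backbone, False)
    set_trainable(model.pre, True)
    set_trainable(model.proj, True)
    train_fn(model, epochs=6)
    # Stage 2: freeze the head and fine-tune only the backbone.
    set_trainable(model.backbone, True)
    set_trainable(model.pre, False)
    set_trainable(model.proj, False)
    train_fn(model, epochs=3)
```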

Ensemble strategy

Another odd fact was that model ensembling (by averaging outputs) did not give better results, as also noted by other participants.

Having trained two models on the same datasets, the outputs can still differ greatly due to the noise of random mini-batch selection and of data augmentation. This, presumably, is caused by the oscillating results of the final FC layers. So ensembling, as is, does not work in most cases.

However, ensembling should work when the final FC layers of the different models produce “similar” results. Therefore, the winners tried to apply ensembling while keeping the final two FC layers frozen.

Finally, Shao wrote:

“I need to redo EVERY THING MENTIONED ABOVE to the new laion-2b VIT-H model thanks to this weight:(, except several changes: 1) drop model ensemble, VIT-H is really a huge guy 2) train on all the datasets at the same time, drop products-10k, leave products-10k as the final fine-tuning datasets.”

Overlapping patches can help Vision Transformer models when splitting the image into patches: the last trick was to use a 4-pixel overlap at 290 × 290 resolution.

The final results were 0.732 on the public leaderboard and 0.728 on the private leaderboard.

Useful links

GUIE competition overview page on Kaggle.

1st Place Solution in Google Universal Images Embedding paper.

1st place solution comments by S. Shao.

1st place solution Github repository.

OpenCLIP repo.

ArcFace paper.

Laion-5B dataset page.

Active Dendrites

Avoiding catastrophic forgetting

Photo by Henry Be

The following content is mainly about the article Avoiding Catastrophe: Active Dendrites Enable Multi-Task Learning in Dynamic Environments by A. Iyer et al. (December 2021). It is a pleasant paper mixing biology, neuroscience and mathematical modeling; I hope you find it interesting.

Catastrophic forgetting

Standard Artificial Neural Networks (ANNs), based on the (inaccurate) point neuron model [Lapique, 1907] and the backpropagation algorithm, often fail dramatically at multi-task learning. Unlike single-task machine learning, learning multiple distinct tasks introduces new complications. When using gradient-based methods (such as backpropagation), a noteworthy issue is that error gradients and accumulated knowledge from different tasks can interfere with one another: weight updates that reduce the error on one task may lead to suboptimal or ruinous performance on another. This common problem is known as catastrophic forgetting.

The same is true for continual learning, which concerns the ability to acquire new knowledge over time while retaining relevant information from the past. A typical scenario involves training a network on a set of distinct tasks presented in a strict sequence of training phases. As a basic example, consider two different learning tasks: (1) classify dog types and (2) identify Aramaic alphabet letters.

In essence, learning amounts to starting from an initial weight configuration and then moving through weight space to a place where the error on the task being learned is small.

The figure above provides intuitive support for what happens. Consider the sequential learning of the two aforementioned tasks 1 and 2. Starting from the initial weight configuration (yellow dot), after learning to classify dogs we reach a certain minimum region (a). Then we learn to identify letters and the weight configuration is modified to reach a minimum region (b). The network has thus completely forgotten which weight configuration was appropriate for the first task.

Biological neurons and Active Dendrites

The point neuron model postulates that all synapses have a linear impact on the cell. This simple assumption laid the foundations of Rosenblatt's original Perceptron [Rosenblatt, 1958] and continues to form the basis of current deep learning networks.

This artificial neuron has relatively few synapses and no dendrites. Learning occurs by changing the strength, or “weight”, of the synapses, each represented by a scalar that can take positive or negative values. A weighted sum of the point neuron's inputs is computed, and a non-linear function f then determines the neuron's output value. It is now well known that the point neuron assumption is an oversimplified model of biological computation.

Pyramidal neurons (see figure below) are the most common type of neurons in the neocortex. Biological neurons have thousands of synapses arranged along dendrites. Biological synapses are partly stochastic, and therefore are low precision. Learning in a biological neuron mostly involves the formation of new synapses and the removal of unused synapses.

In real neurons, proximal synapses (those close to the cell body) have a linear impact on the neuron, but most synapses occur on distal dendritic segments (away from the cell body). These distal segments are known as active dendrites and process synapses in a non-linear fashion. When the input to an active dendritic segment reaches a threshold, the segment initiates a dendritic spike that travels to the cell body and can depolarize the neuron for an extended period of time, even half a second. During this period the neuron is closer to its firing threshold and any new input is more likely to make it fire. Hence these dendrites, unlike proximal segments, have a modulatory and long-lasting impact on the neuron's activity. An active dendritic segment receives input signals from cells in different layers or in the form of top-down feedback.

Sparse Representations

Neural circuits in the neocortex are highly sparse. Studies reveal that relatively few neurons spike in response to a sensory stimulus. Neural connectivity is also sparse: pyramidal neurons are sparsely connected to each other and receive relatively few signals from neighboring neurons.

This is not the case in neural network modeling, where connections are mostly dense. Sparse neural representations are introduced using vectors where most of the entries are zero. Studies show that sparse representations are more resistant to noise than dense ones, and pattern recognition becomes less sensitive to slight perturbations of the input.

Active Dendrites Neuron

The authors propose a new neuron model. Mimicking what happens in pyramidal neurons, the active dendrites neuron receives two sources of input, in analogy with proximal and distal inputs. The feedforward input is treated exactly as in a point neuron, while multiple dendritic segments process a context vector and their output modulates the feedforward activation. In other words, the magnitude of the response to a given stimulus is highly context-dependent. The image below shows five dendrites processing the context (the weights involved are represented by small discs) and the feedforward input.

Given input x, weights w and bias b, the feedforward signal is, as usual, computed as

\hat{t} = \mathbf{w}^\top \mathbf{x} + b \,.

Note that the weights here do not form a 2-d matrix but a vector containing just the values associated with this particular neuron (needless to say, we are describing the functioning of a single artificial neuron). On the other hand, each dendrite j computes

\mathbf{u}_j^\top \mathbf{c}

where u_j are the weights of the j-th dendrite and c is a context vector (for example, the context vector may encode task ID information). We will not delve too deeply into how such a context vector is computed but, in short, the context vector:

1) is computed using prototype representations for different classes;

2) if the system receives task information during training, then the prototype vector for a certain task is computed by taking the element-wise mean over all the training samples across all features;

3) if the system receives no task information during training, then a statistical clustering approach is used: if the new batch of samples is similar to earlier training samples, they are assigned to an existing prototype; if not, the new batch of samples is assumed to correspond to a new task, and a novel prototype is instantiated.

The figure above illustrates the prototype method. Yellow points represent samples for task A, beige for task B.
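A small sketch of the prototype idea under the assumptions above (function names are mine, not from the paper's code):

```python
import torch

def build_prototype(task_samples: torch.Tensor) -> torch.Tensor:
    """Prototype for one task: the element-wise mean over all its training samples.
    task_samples has shape (num_samples, num_features)."""
    return task_samples.mean(dim=0)

def pick_context(x: torch.Tensor, prototypes: torch.Tensor) -> torch.Tensor:
    """At inference time, use as context vector c the prototype closest
    (in Euclidean distance) to the sample x of shape (num_features,)."""
    dists = torch.cdist(x.unsqueeze(0), prototypes)   # shape (1, num_tasks)
    return prototypes[dists.argmin()]
```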

Returning to our neuron model, the segment with the strongest response to the context is selected:

\displaystyle d = \max_j \mathbf{u}_j^\top \mathbf{c}\, .

The active dendrites contextual contribution modulates the feedforward activation in the following manner:

\displaystyle y= f(\hat{t},d) = \hat{t} \cdot \sigma(d)\,.

In the expression above, y is the resulting activation and σ is the sigmoid function, which maps a real number into the range [0, 1]. Weak or negative responses to the context vector keep σ(d) small and thus significantly reduce the resulting activation.

Modeling sparsity

To add sparsity to active dendrites neuron architectures, the authors apply the kWTA (k-Winners-Take-All) function, which mimics biological inhibitory networks and is defined as follows:

\displaystyle \mathrm{kWTA}(y_i) = \begin{cases} y_i & \textsf{if}\; y_i\; \textsf{is one of the top}\, k \, \textsf{activations over all} \, i\\ 0 & \textsf{otherwise}\end{cases}

where i indexes neurons in the same layer. Sparsity is ensured by selecting the top k activations and setting all others to zero.
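Putting the previous equations together, a layer of active dendrites neurons with kWTA could be sketched in PyTorch as follows (a simplified illustration of the formulas above, not the authors' implementation):

```python
import torch
import torch.nn as nn

class ActiveDendritesLayer(nn.Module):
    """A layer of active-dendrites neurons with kWTA, following the equations above."""
    def __init__(self, d_in, d_out, d_context, num_segments, k):
        super().__init__()
        self.linear = nn.Linear(d_in, d_out)               # feedforward weights w and bias b
        # dendritic segment weights u_j, one set per neuron: (d_out, segments, d_context)
        self.segments = nn.Parameter(0.01 * torch.randn(d_out, num_segments, d_context))
        self.k = k                                          # number of "winner" neurons kept

    def forward(self, x, c):
        t_hat = self.linear(x)                              # (batch, d_out)
        # dendritic responses u_j^T c for every neuron and every segment
        resp = torch.einsum('nsd,bd->bns', self.segments, c)   # (batch, d_out, segments)
        d = resp.max(dim=-1).values                         # strongest segment per neuron
        y = t_hat * torch.sigmoid(d)                        # modulated activation
        # kWTA: keep the top-k activations in the layer, zero out the rest
        kth = y.topk(self.k, dim=1).values[:, -1:]          # k-th largest value per sample
        return torch.where(y >= kth, y, torch.zeros_like(y))
```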

Active Dendrites Network Architecture

The figure below shows a network of active dendrites neurons. All neurons in each hidden layer are active dendrites neurons, and the network is trained by backpropagation.

The neurons selected by the kWTA function are the only ones with nonzero activations (hence nonzero gradients), and they are the only ones updated during the backward pass of backpropagation.

A very small sparse subset of the full network is actually updated for each input, because for each “winner” neuron only the dendritic segment chosen by the max operator is updated (the other segments are not modified).

What do we expect from this model? Different dendritic inputs are expected to activate different subnetworks. If this happens, the backpropagation algorithm only modifies the connections of the neurons in each subnetwork, leaving the rest of the connections in the whole network untouched (see figure below).

Tests carried out on the permuted MNIST dataset give empirical evidence that the network does indeed invoke separate subsets of neurons to learn different tasks. As for the results, the authors claim that, in the multi-task RL setting, a 3-layer active dendrites network can achieve an average accuracy of about 88% when learning 10 Meta-World environment tasks together, while, in the continual learning setting, an almost identical network can achieve greater than 90% accuracy when learning 100 permuted MNIST tasks in sequence.

Useful links

A. Iyer, K. Grewal, A. Velu, L. O. Souza, J. Forest, S. Ahmad
Avoiding Catastrophe: Active Dendrites Enable Multi-Task Learning in Dynamic Environments
arXiv:2201.00042v1 [cs.NE], 2021.

L. Lapique’s 1907 paper (translated, 2007).

Why Neural Networks Forget, and Lessons from the Brain [link].

J. Snell, K. Swersky, R. S. Zemel
Prototypical networks for few-shot learning
arXiv:1703.05175v2 [cs.LG], 2017.

Permuted MNIST [link].

T. Yu, D. Quillen, Z. He, R. Julian, A. Narayan, H. Shively, A. Bellathur, K. Hausman, C. Finn, S. Levine
Meta-World: A Benchmark and Evaluation for Multi-Task and Meta Reinforcement Learning
arXiv:1910.10897v2 [cs.LG], 2019 (v2 revised 2021).

CoAtNets

A class of state-of-the-art computer vision models

Photo by Monisha Selvakumar

This post refers mainly to the paper CoAtNet: Marrying Convolution and Attention for All Data Sizes by Z. Dai et al. (2021).

CoAtNet models (pronounced “coat” net) for computer vision emerge as a combination of the Convolutional and Transformer (a Self-Attention based model) architectures. Experiments show that CoAtNets achieve state-of-the-art performance across various datasets like ImageNet and JFT-3B.

Convolution and Self-Attention

Convolutional neural networks (CNNs) use the convolution operation as follows (check here for a simple intro to the convolution operation). Let x be a given input, think of an image or, more generally, a feature representation, whose dimensions are r × c × d, where r and c are the image (or representation) rows and columns and d is the number of channels. Let \mathcal{L}(i) be a local image patch around pixel xᵢ, where i denotes the coordinates (α, β). Then the convolution output yᵢ is

\displaystyle y_i = \sum_{j \in \mathcal{L}(i)} w_{i-j} \odot x_j

where i − j = (α − m, β − n) and w_{i−j} is a weight (a convolution kernel entry). The index j = (m, n) varies over the patch \mathcal{L}(i). Note that xᵢ can also be considered a 1 × 1 × d tensor, so the product may involve multiple channels. Below, an example with \mathcal{L}(i) = \mathcal{L}(3, 3), a local patch of the representation x.

CNNs employ weight sharing: the kernel matrix is reused to generate the output for all pixel positions (α, β). Weight sharing enforces translation equivariance

convolve(translate(x)) = translate(convolve(x))

and this is a desirable property because if your CNN detects a particular element in an image, it will find that element again when the image is shifted.

For self-attention, consider a 1 × 1 × d “pixel” xᵢ and a region \mathcal{G} whose center is, for simplicity, xᵢ. This is similar to a local image patch, but the letter \mathcal{G} tells us that this region can even be global. The single-headed attention output yᵢ is

\displaystyle y_i = \sum_{j \in \mathcal{G}} \textsf{softmax}_\mathcal{G} \left( q_i^\top k_j \right) v_j

where the queries qᵢ = Qxᵢ, keys kⱼ = Kxⱼ and values vⱼ = Vxⱼ are described here. The matrices Q, K and V are learned. The softmax is applied to the quantities computed from the pixels in the neighborhood \mathcal{G} of xᵢ; the notation j ∈ \mathcal{G} indicates that the sum runs over all indices j corresponding to elements (pixels) of \mathcal{G}. This computation is repeated for every pixel xᵢ to obtain the outputs yᵢ. In practice, multiple attention heads are used to learn distinct representations of the input. Below, an image showing what has just been described.

The dashed lines represent learned transformations, the rest are matrix operations. 
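For concreteness, here is a minimal PyTorch sketch of this single-headed attention over the pixels of a feature map (no positional term and no scaling, to stay close to the formula above; the class name is my own):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SingleHeadPixelAttention(nn.Module):
    """Single-headed self-attention over all pixels of a feature map (global G)."""
    def __init__(self, d):
        super().__init__()
        self.q = nn.Linear(d, d, bias=False)   # learned matrix Q
        self.k = nn.Linear(d, d, bias=False)   # learned matrix K
        self.v = nn.Linear(d, d, bias=False)   # learned matrix V

    def forward(self, x):
        # x: (batch, rows, cols, d) -> flatten the spatial grid into a sequence of pixels
        b, r, c, d = x.shape
        x = x.reshape(b, r * c, d)
        attn = F.softmax(self.q(x) @ self.k(x).transpose(1, 2), dim=-1)  # (b, rc, rc)
        y = attn @ self.v(x)
        return y.reshape(b, r, c, d)
```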

In the current setting, no positional information is encoded in the attention, which limits the expressiveness of vision models. Positional information can be injected through the well-known positional embeddings based on sinusoidal functions. However, many experiments suggest using relative positional embeddings for better results. Relative attention is defined as follows. Consider the relative distance of the pixel with coordinates i = (α, β) to each position j = (m, n) in \mathcal{G}, so that each position determines two offsets: a row offset m − α and a column offset n − β (see figure below). In the figure, the relative distances are computed with respect to, for example, pixel (0, 0), and their format is row offset (yellow), column offset (gray).

The row and column offsets are associated with embeddings r(m − α) and r(n − β) respectively, each with dimension equal to half the output dimension d_out. Concatenating these vectors into a single vector, the expression for relative attention is

\displaystyle y_i = \sum_{j \in \mathcal{G}} \textsf{softmax}_\mathcal{G} \left( q_i^\top k_j + q_i^\top r_{j-i} \right)v_j\,.

So we have two components in the argument of the softmax: the logit expressing the similarity between the query and an element of \mathcal{G}, and the relative distance of that element from the query. Note that by adding relative position information, self-attention also enjoys translation equivariance, similarly to convolution.

Merging desirable properties

It is worthwhile to compare the relative strengths and weaknesses of convolution and self-attention before asking how to best combine them.

Translation Equivariance. We saw earlier that this is a property satisfied by convolution.

Input-adaptive Weighting. In convolution, kernel entries are static and do not depend on the particular input. Instead, the attention weights (the softmax terms) depend dynamically on the representation of the input.

Global Receptive Field. One of the most crucial differences between self-attention and convolution concerns the size of the receptive field. A larger receptive field, despite the high computational cost involved, provides more contextual information which could lead to higher model capacity.

An ideal model would combine the three previous properties. Taking these properties into account, the authors use the following attention mechanism for their model

\displaystyle y_i = \sum_{j \in \mathcal{G}} \frac{\exp(x_i^\top x_j + w_{i-j})}{\sum_{k \in \mathcal{G}} \exp(x_i^\top x_k + w_{i-k})}\;x_j

which is a kind of relative attention

\displaystyle\sum_{j \in \mathcal{G}} \textsf{softmax}_\mathcal{G} \left( x_i^\top x_j + w_{i-j} \right)x_j

where learned weights take the place of the relative-distance embeddings. Here \mathcal{G} indicates the global spatial space and, for each j, the weight w_{i-j} is a scalar (one per relative offset, so on the order of the size of \mathcal{G} of them).
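A sketch of this simplified relative attention in PyTorch, with one learned scalar per relative offset (an illustration under my own assumptions, not the paper's code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleRelativeAttention(nn.Module):
    """softmax(x_i^T x_j + w_{i-j}) x_j over a fixed rows x cols grid of pixels."""
    def __init__(self, rows, cols):
        super().__init__()
        # one scalar weight for every possible (row, col) relative offset
        self.rel = nn.Parameter(torch.zeros(2 * rows - 1, 2 * cols - 1))
        idx_r, idx_c = torch.arange(rows), torch.arange(cols)
        # precompute, for every pair of positions, the index of its relative offset
        self.register_buffer('dr', idx_r[:, None] - idx_r[None, :] + rows - 1)  # (rows, rows)
        self.register_buffer('dc', idx_c[:, None] - idx_c[None, :] + cols - 1)  # (cols, cols)

    def forward(self, x):
        # x: (batch, rows, cols, d); x itself plays the role of queries, keys and values
        b, r, c, d = x.shape
        x = x.reshape(b, r * c, d)
        logits = x @ x.transpose(1, 2)                                   # x_i^T x_j, (b, rc, rc)
        # gather w_{i-j}: the scalar for the relative offset of every pixel pair
        w = self.rel[self.dr[:, None, :, None], self.dc[None, :, None, :]]  # (r, c, r, c)
        logits = logits + w.reshape(r * c, r * c)
        y = F.softmax(logits, dim=-1) @ x
        return y.reshape(b, r, c, d)

# usage: attn = SimpleRelativeAttention(16, 16); y = attn(torch.randn(2, 16, 16, 64))
```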

CoAtNet model

In the case of global attention the complexity is quadratic in the spatial size, so it is not always feasible to use self-attention in vision tasks. Applying the previously defined attention directly to raw images would be excessively slow due to the (usually) large number of pixels involved. Hence the authors state three main options:

(A) perform some down-sampling to reduce the spatial size and employ the global relative attention after the feature map reaches manageable level;

(B) enforce local attention, which restricts the global receptive field \mathcal{G} in attention to a local field \mathcal{L} just like in convolution;

(C) replace the quadratic softmax attention with certain linear attention variant which only has a linear complexity w.r.t. the spatial size.

Some experiments suggest excluding options (B) and (C) and focusing on (A). There are many ways to reduce the image size, leading to different architectures. The model we show uses, as a first stage S0, a simple 2-layer convolutional stem. This is followed by stage S1, employing MBConv blocks with squeeze-excitation (SE), since the spatial size is still too large for global attention. From S2 through S4 it is possible to use either the MBConv or the Transformer block, provided that convolution stages appear before Transformer stages. This leads to 4 different settings: CCCC, CCCT, CCTT and CTTT, where C denotes Convolution and T denotes Transformer. Experiments reveal that the proper configuration is CCTT.

For both the MBConv (yellow) and the Transformer (white) blocks, transformations are of the kind

x ← x + Module(Norm(x))

where Module is MBConv, Self-Attention or FFN (FeedForward Network) and Norm corresponds to BatchNorm for MBConv and LayerNorm for Self-Attention and FFN. As the activation function, the Gaussian Error Linear Unit (GELU) is used in both MBConv and Transformer blocks.
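This pre-norm residual pattern is easy to express generically; a minimal sketch (naming is my own):

```python
import torch.nn as nn

class PreNormBlock(nn.Module):
    """Generic x <- x + Module(Norm(x)) transformation used in both block types."""
    def __init__(self, norm: nn.Module, module: nn.Module):
        super().__init__()
        self.norm = norm      # BatchNorm for MBConv, LayerNorm for Self-Attention / FFN
        self.module = module  # MBConv, Self-Attention or FFN

    def forward(self, x):
        return x + self.module(self.norm(x))
```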

Within each stage from S1 to S4, down-sampling is performed independently for both the residual branch and the identity branch.

In the Transformer block, standard max pooling with stride 2 is directly applied to the input states of both branches of the self-attention module, and a channel projection (for example, a 1 × 1 convolution) is applied to the identity branch to enlarge the hidden size. Hence, the down-sampling module can be represented as

x ← x + Proj(Pool(x)) + Attention(Pool(Norm(x))).

For the MBConv block, unlike the standard MBConv block, the residual-branch down-sampling is obtained by applying a stride-2 convolution to the normalized inputs (the standard MBConv uses stride 2 in the depth-wise convolution part). We can express the module as follows:

x ← Proj(Pool(x)) + Conv(DepthConv(Conv(Norm(x), stride=2))).

In depth-wise convolution, the convolution is applied to a single channel at a time, that is, each channel of the input convolves with a dedicated kernel, so the filters/kernels have size k × k × 1 (see figure below).
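In PyTorch, for example, a depth-wise convolution is obtained by setting `groups` equal to the number of channels:

```python
import torch
import torch.nn as nn

# Depth-wise convolution: each input channel is convolved with its own k x k x 1 kernel.
channels, k = 64, 3
depthwise = nn.Conv2d(channels, channels, kernel_size=k,
                      padding=k // 2, groups=channels)

x = torch.randn(1, channels, 32, 32)
print(depthwise(x).shape)   # torch.Size([1, 64, 32, 32])
```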

Results

The original CoAtNet paper reports several strong results. It is worth noting that, as of March 2022, the state of the art in image classification on ImageNet is held by a CoAtNet model (CoAtNet-7, top-1 accuracy 90.88%, 2440M parameters; see here for more).

Useful links

CoAtNet: Marrying Convolution and Attention for All Data Sizes
Z. Dai, H. Liu, Q. V. Le, M. Tan
arXiv:2106.04803v2 [cs.CV] (2021).

Stand-Alone Self-Attention in Vision Models
P. Ramachandran, N. Parmar, A. Vaswani, I. Bello, A. Levskaya, J. Shlens
arXiv:1906.05909v1 [cs.CV] (2019).

2D Convolution (link).

Multi-Head Attention (link).

Code implementations (PyTorch and TensorFlow) (link).