Neural Networks as Decision Trees

More or less obvious transpositions

Photo by Larisa Birta

This post is inspired by a recent article (which we will not cover) claiming that neural networks are decision trees. It is certainly not the only article to address the topic. Paying too much attention to articles of this kind — neural networks are decision trees, compositions of splines, kernel machines — one may end up believing that neural networks are equivalent to any ML construct one chooses to name…

ReLU activations naturally determine tree structures

The following simple argument is from the paper Towards Interpretable ANNs: An Exact Transformation to Multi-Class Multivariate Decision Trees by Nguyen, Kasmarik and Abbass. Consider a feed-forward neural network whose hidden layers use the ReLU activation. Fix a hidden layer, say the k-th; we regard it as fixed so that the index k can be omitted. The index j refers to nodes in this layer, the index i to nodes in the preceding layer (the (k−1)-th). Denote with zj the value of hidden node j in layer k before the activation:

\displaystyle z_j = \sum_{i=1}^{I} w_{ij} \, H_{i} + b_j \,.

The H values are the activations coming from the preceding layer (the inputs to the k-th layer) and bj is a bias term. The post-activation value hj either coincides with zj or does not (in the latter case the ReLU returns 0). The possibilities are depicted in the following figure.

Due to the nature of the ReLU activation, the output of a node after activation is either 0 or exactly the pre-activation value of that node (that is, hj = zj). It is then easy to see that each hidden layer of the neural network can be transformed into a binary decision tree: the decision at each tree stage is made by the activation of the corresponding node in the hidden layer, based on whether or not the value before the activation function is greater than 0.
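To make the correspondence concrete, here is a minimal sketch (with a made-up toy layer, not taken from the paper) that extracts, for a given input, the binary pattern of ReLU activations of a hidden layer; each entry of the pattern answers one "is zj > 0?" question, i.e. one decision node of the equivalent tree.

import numpy as np

rng = np.random.default_rng(0)

# A toy hidden layer: weights W (3 inputs -> 4 hidden units) and bias b.
W = rng.normal(size=(4, 3))
b = rng.normal(size=4)

x = rng.normal(size=3)          # an input vector
z = W @ x + b                   # pre-activations z_j
h = np.maximum(z, 0.0)          # ReLU: h_j is either 0 or z_j

# The binary decision pattern: one yes/no split per hidden unit.
pattern = (z > 0).astype(int)
print("pre-activations:", z)
print("activations    :", h)
print("decision path  :", pattern)   # e.g. [1 0 1 0] identifies a branch of the tree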

As for explainability, it is clear that the size of the tree grows exponentially as the network size grows; it’s like going from one black box to another.

C-Net

There is a method for generating multivariate decision trees (MDTs) from neural networks. We present the original C-Net architecture (there is a newer version which we will not cover). The procedure is the following. After the neural network is trained, new data is fed in and the outputs of the last hidden layer are computed. In other words, from a set of training and test data, denoted with <Xt, Yt> and <XT, YT> respectively, we can compute the mapping between the last hidden layer and the output, denoted as <Ht, Yt> and <HT, YT>. We retain these two sets, representing the relationship between the last hidden layer and the output layer, for the next stage, in which they are used to train a Quinlan C5 univariate decision tree (UDT) whose algorithm uses an entropy-based information gain ratio as its branch-splitting criterion. A decision tree can then be represented by a set of polyhedra expressed in the form of linear constraints. These constraints have the form Hj(Xt) op Cj, where op is one of the binary operators {≤, <, =, >, ≥} and Cj is the numeric threshold of the constraint on input Hj. To obtain a multivariate form of the expression, a back-projection from the output of the neural network to its input is needed.

The full algorithm is given in the original paper (see the links below).
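As an illustration of the first stage only, here is a hedged sketch: an untrained toy network stands in for a trained one, and scikit-learn's DecisionTreeClassifier (a CART-style tree) replaces the Quinlan C5 learner; both substitutions are assumptions made for brevity, not the paper's exact setup.

import torch
from torch.nn import Sequential, Linear, ReLU
from sklearn.tree import DecisionTreeClassifier

# Hypothetical network: everything up to the last hidden layer...
backbone = Sequential(Linear(4, 16), ReLU(), Linear(16, 8), ReLU())
head = Linear(8, 3)                 # ...plus the output layer (3 classes)

# Toy data standing in for <Xt, Yt>.
X_t = torch.rand(200, 4)
y_t = torch.randint(0, 3, (200,))

with torch.no_grad():
    H_t = backbone(X_t)             # last-hidden-layer outputs, i.e. <Ht, Yt>

# Next stage: fit a univariate decision tree on the hidden representation.
udt = DecisionTreeClassifier(max_depth=3)
udt.fit(H_t.numpy(), y_t.numpy())
print("tree depth:", udt.get_depth())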

Useful links

Neural Networks are Decision Trees
C. Aytekin
arXiv:2210.05189 [cs.LG], 2022.

Towards Interpretable ANNs: An Exact Transformation to Multi-Class Multivariate Decision Trees
D. T. Nguyen, K. E. Kasmarik, H. A. Abbass
arXiv:2003.04675 [cs.LG], 2020.

C-Net: A Method for Generating Non-deterministic and Dynamic Multivariate Decision Trees
H. A. Abbass, M. Towsey, G. D. Finn
Knowledge and Information Systems, Volume 3, pp. 184–197, 2001 (link).

Rectifier (ReLU activation) – Wikipedia entry.

The illusion of learning (link).

Explainable AI – Wikipedia entry.

Forward-Forward algorithm

Photo by Tolga Ulkan

2022 went out with Hinton’s latest effort — The Forward-Forward Algorithm: Some Preliminary Investigations. It is not my intention to stir up controversy about Hinton, but to this day it still escapes me what his real contribution to neural networks is. The last time I covered an article by Hinton was for Capsule Networks (whatever happened to them?) a few years ago.

There are some issues with backpropagation: first, even though neural networks are loosely modeled on real neuronal functioning, backpropagation has no biological counterpart; second, everything one puts into a neural network (as a black box) has to be modeled as a differentiable module to work with backpropagation.

Main idea

Hinton’s last paper introduces the Forward-Forward (FF) learning method with the following key features:

(a) FF replaces the forward and backward passes of backpropagation with two forward passes: one operates on positive (real) data and the other on negative data;

(b) each layer has its own objective function, that is, a measure of goodness for positive and negative data;

(c) FF computes the gradients locally using a local objective function, so there is no need to backpropagate the errors.

Looking at a piece of the implementation code for the layer train method, the input is literally split into a positive batch and a negative batch to operate on.

Learning with a simple layer-wise goodness function

The sum of the squared activities in a layer can be used as the “goodness” but there are many other possibilities, including minus the sum of the squared activities. Specifically, we look to correctly classify input vectors as positive data or negative data when the probability that an input vector is positive is given by the following (θ is a threshold term and σ denotes the logistic function):

\displaystyle p(\mathsf{positive}) =  \sigma\left( \sum_j y_j^2 - \theta \right)\,.

A single hidden layer can be learned using the following criterion: the sum of squared activities of the hidden units has to be high for positive data (above the threshold value θ) and low for negative data.

A necessary observation: since it would be trivial to distinguish positive from negative data simply by using the length of the activity vector in the first hidden layer as an input to the second hidden layer (no new features would need to be learned), FF normalizes the length of the hidden vector before using it as input to the next layer. Briefly, the activity vector in the first hidden layer has a length and an orientation: the length is used to define the goodness for that layer, and only the orientation is passed to the next layer.
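A minimal sketch of a single FF layer, loosely modeled on the public PyTorch implementations mentioned below: the layer normalizes its input, computes goodness as the sum of squared activities, and optimizes a local logistic loss that pushes goodness above θ for positive data and below θ for negative data. Layer sizes, θ and the learning rate are illustrative assumptions, not values from the paper.

import torch
from torch import nn

class FFLayer(nn.Module):
    def __init__(self, d_in, d_out, theta=2.0, lr=0.03):
        super().__init__()
        self.linear = nn.Linear(d_in, d_out)
        self.relu = nn.ReLU()
        self.theta = theta
        self.opt = torch.optim.Adam(self.parameters(), lr=lr)

    def forward(self, x):
        # Pass only the orientation of the input to this layer.
        x = x / (x.norm(dim=1, keepdim=True) + 1e-4)
        return self.relu(self.linear(x))

    def train_step(self, x_pos, x_neg):
        g_pos = self.forward(x_pos).pow(2).sum(dim=1)   # goodness on positive data
        g_neg = self.forward(x_neg).pow(2).sum(dim=1)   # goodness on negative data
        # Local objective: push g_pos above theta and g_neg below theta.
        loss = torch.log1p(torch.exp(torch.cat([self.theta - g_pos,
                                                g_neg - self.theta]))).mean()
        self.opt.zero_grad()
        loss.backward()          # gradients stay local to this layer
        self.opt.step()
        return loss.item()

layer = FFLayer(784, 500)
x_pos, x_neg = torch.rand(32, 784), torch.rand(32, 784)
print(layer.train_step(x_pos, x_neg))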

A supervised example

To implement supervised learning with FF, one way is to include the class labels in the input (see figure below).

An image with the correct label constitutes positive data and an image with an incorrect label constitutes negative data. The only difference between positive and negative data is the label, so FF should ignore all image features that do not correlate with the label.

After training on the MNIST dataset with FF, a test digit can be classified by running the net with a particular label as part of the input and accumulating the goodnesses of all but the first hidden layer. This has to be done for each label separately; the label with the highest accumulated goodness is then chosen. The paper reports that, during training, a forward pass from a neutral label was used in order to pick hard negative labels.

With MNIST, after training all the layers, to make a prediction for a test image x we evaluate the pair (x, y) for every label y in {0, 1, …, 9} and choose the label that maximizes the accumulated goodness.
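A hedged sketch of this test-time procedure, assuming FF layers like the one sketched earlier and a hypothetical overlay helper that writes a one-hot label into the first pixels of the image (as the public implementations do); names and shapes are illustrative.

import torch

def overlay(x, y, n_classes=10):
    # Hypothetical helper: embed the label y as a one-hot code in the first pixels.
    x = x.clone()
    x[:, :n_classes] = 0.0
    x[range(x.shape[0]), y] = x.max()
    return x

def predict(layers, x, n_classes=10):
    goodness_per_label = []
    for label in range(n_classes):
        h = overlay(x, torch.full((x.shape[0],), label))
        goodness = []
        for i, layer in enumerate(layers):
            h = layer(h)
            if i > 0:                                 # skip the first hidden layer
                goodness.append(h.pow(2).sum(dim=1))
        goodness_per_label.append(sum(goodness))
    return torch.stack(goodness_per_label, dim=1).argmax(dim=1)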

Performance

Hinton’s paper reports a brief comparison between FF and backpropagation on CIFAR-10. The test performance of FF is slightly worse than backpropagation. There is also an interesting page about the analysis of performance versus backpropagation.

Code implementations

I’d like to mention two GitHub repositories, one from Nebuly-ai and the other from Mohammad Pezeshki. Both are PyTorch implementations.

Useful links

The Forward-Forward Algorithm: Some Preliminary Investigations
G. Hinton
arXiv:2212.13345 [cs.LG], 2022.

Code from Nebuly-ai.

Code from M. Pezeshki.

Detailed Backpropagation Algorithm (link).

Interesting performance analysis page.

Dropout tales

Insights into a popular regularization technique

Photo by Olga Tutunaru

Dropout is an effective regularization technique used to reduce overfitting in neural networks. It works like this: given a feedforward neural network, at training time remove some neurons from each non-output layer, each with a certain probability. For example, if the probability is 0.5 (it can vary across layers), you flip a coin to decide whether a given neuron stays in or goes out. The picture below shows the original network (called the base or parent network) on the left and the network after applying dropout on the right.

As you may have noticed, at training time the network on the right (picture above) is a simpler network (fewer units), prone to express a simpler model.

At test time no units are dropped, so that the full network is used to make predictions. The picture below shows what happens at training time and during test.

At training time, a unit (neuron) is present with a certain probability p and is connected to units in the next layer with weights w. At test time, the unit is always present and weights are multiplied by p. This is because we would like the outputs of units during test time to be equivalent to their expected outputs at training time.

In fact, dropout retains a unit with probability p and removes a unit (the output of a unit is set to 0) with probability 1 − p. This means that if the output of a unit prior to dropout was x, then after dropout the expected output would be E[output] = px + (1 − p) · 0 = px. Therefore, to ensure that the outputs have the same expectation at test time as they did during training, we have to multiply weights by p at test time.

However, this implementation of dropout is undesirable because it requires scaling of neuron outputs at test time. This is bad for test-time performance, so it is preferable to use inverted dropout, where the scaling occurs at training time instead of testing time. In inverted dropout, the output of any retained unit is divided by p before the value is propagated to the next layer. In this case

\displaystyle \text{E}[\text{output}] = p \cdot \frac{x}{p} + (1 - p) \cdot 0 = x\,,

avoiding output scaling at test time.
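A minimal NumPy sketch of inverted dropout applied to a layer's activations (here p is the retain probability, as in the discussion above):

import numpy as np

def inverted_dropout(x, p, training=True, rng=np.random.default_rng()):
    """Inverted dropout: scale the retained activations by 1/p at training time."""
    if not training:
        return x                       # nothing to do at test time
    mask = rng.random(x.shape) < p     # keep each unit with probability p
    return x * mask / p

x = np.ones((2, 5))
print(inverted_dropout(x, p=0.5))                   # surviving entries become 2.0
print(inverted_dropout(x, p=0.5, training=False))   # unchanged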

Dropout as a regularizer

Because units can go away at random, each neuron may miss an important input (or several) from the previous layer, so it cannot rely on any single input. The neuron has to spread out the weights across its incoming connections, causing the weights to shrink. This shrinking lowers the squared norm of the weights. Hence dropout is, in some respects, similar to L2 regularization. This explanation can be found in this video lecture by Andrew Ng.

The fact that some form of L2 regularization was hiding behind the dropout technique was already discussed in a 2013 article. One of the study’s findings is that dropout can be seen as an attempt to apply an L2 penalty after normalizing the feature vector by a quantity depending on the diagonal of an estimate of the Fisher information matrix.

The picture above compares two L2 regularizers (take a look at this page if you need a quick recap on regularization). The solid ellipses are level surfaces of the likelihood and the dashed curves are level surfaces of the regularizer. The top panel shows a classic spherical L2 regularizer. Let I be the Fisher information matrix. If I were a multiple of the identity matrix, these level surfaces would be perfectly spherical. In dropout, the level surfaces are non-spherical (bottom panel) due to the normalization of the features by diag(I)⁻¹ᐟ²: the L2 penalty is applied after scaling (the features have been balanced out).

Dropout as a bagging algorithm

There is an obvious link between this intuitive view of regularization and the size/complexity of the network (see picture below). Smaller networks correspond to rigid and simple models. To avoid overfitting, it is sometimes useful to exploit a method that helps reduce complexity, returning a better performing model. Intuitively, fewer neurons (units) in action correspond to simpler models.

As noted earlier, at training time the network (after applying dropout) is a simpler network (fewer units), prone to express a simpler model, possibly reducing overfitting. The network is trained to produce accurate predictions on unseen data even in unfriendly conditions where some neurons are missing.

Recall that to learn with bagging, we define t different learners (ensemble models), construct t different datasets by sampling from the training set with replacement, and then train model i on dataset i. The bagging meta-algorithm is depicted below: (1) create multiple data sets Dᵢ through sampling with replacement; (2) employ multiple learners Lᵢ in parallel; (3) combine all learners using an averaging or majority-vote strategy.

Dropout aims to approximate this process, but with an exponentially large number of neural networks. Dropout trains the ensemble consisting of (possibly all) subnetworks that can be formed by removing non-output units from a given base network (see figure below). A base network with N non-output units can generate up to 2ᴺ thinned subnetworks.

When training with dropout, we use minibatches and each time we load an example into a minibatch, we randomly sample a different binary mask (0 out, 1 in) applying to all of the input and hidden units in the network.

There is a significant difference between bagging and dropout. Bagging models are all independent. Dropout models, instead, share parameters: each model inherits a different subset of parameters from the parent neural network. This parameter sharing makes it possible to represent an exponential number of models with a tractable amount of memory. Moreover, dropout training differs from bagging in that each model is trained for only one step.

In bagging, the prediction of the ensemble is the arithmetic mean of all the individual predictions. In the case of dropout, at test time it is not feasible to explicitly average the predictions from exponentially many thinned models. However, there is a simple approximate averaging method that works well in practice. There is no theoretical guarantee (at the moment) for the accuracy of this approximation, but empirically it performs very well. The idea is to use a single neural net at test time without dropout, obtained by adjusting the weights as shown before, i.e. the outgoing weights of a retained unit are multiplied by p at test time. We already observed that this ensures that, for any hidden unit, the actual output at test time equals the expected output at training time. Thanks to this scaling, a large number of networks with shared weights can be combined into a single neural network to be used at test time.

Dropout in practice

Dropout is implemented in PyTorch through the nn.Dropout class. nn.Dropout randomly zeroes some of the elements of the input tensor with probability p using samples from a Bernoulli distribution. Note that here p is the probability to drop the unit; this is different from our previous usage (so far we have denoted with p the probability to retain a unit).

Below, a minimal example showing how Dropout sets several entries of the matrix x to zero (with p = 0.75, about 3 units out of 4 are dropped).

import torch
from torch.nn import Dropout

x = torch.full((3, 5), 1.0)    # a 3x5 matrix of ones
print(x)
dropout = Dropout(p=0.75)      # here p is the probability of zeroing an entry
y = dropout(x)                 # surviving entries are rescaled by 1/(1-p) = 4
print(y)

The TensorFlow analogue is tf.keras.layers.Dropout. Below, a small neural network example with nn.Dropout modules interspersed between Linear layers.

import torch
from torch.nn import Sequential, Linear, ReLU, Dropout

model = Sequential(Linear(10, 100), ReLU(),
                   Dropout(),              # default p = 0.5
                   Linear(100, 50), ReLU(),
                   Dropout(),
                   Linear(50, 2))
t = torch.rand(10)                          # a single (unbatched) input
print(model(t))

If the neural network is defined as a class, it is possible to specify nn.Dropout occurrences in the forward method.
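For example, here is a minimal sketch of such a class (an arbitrary toy architecture, chosen only for illustration):

import torch
from torch import nn

class SmallNet(nn.Module):
    def __init__(self, p=0.5):
        super().__init__()
        self.fc1 = nn.Linear(10, 100)
        self.fc2 = nn.Linear(100, 2)
        self.dropout = nn.Dropout(p=p)     # p is the drop probability

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = self.dropout(x)                # active in train mode, disabled in eval mode
        return self.fc2(x)

net = SmallNet()
net.eval()                                 # switch dropout off for inference
print(net(torch.rand(4, 10)))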

What are the best values for p? There is no single value that works in all situations; the key is to repeat experiments until a satisfactory value is reached. As initial values to be refined later, some sources suggest including an input unit with probability 0.8 (p = 0.2) and a hidden unit with probability 0.5 (p = 0.5).

Useful links

Dropout: A Simple Way to Prevent Neural Networks from Overfitting
N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, R. Salakhutdinov
Journal of Machine Learning Research, 15, pp. 1929–1958, 2014 [link].

Fundamentals of Deep Learning
N. Buduma, N. Locascio
pp. 36–37, O’Reilly Media, Inc., 2017.

Dropout Training as Adaptive Regularization
S. Wager, S. Wang, P. Liang
arXiv:1307.1493 [stat.ML], 2013.

Deep Learning
I. Goodfellow, Y. Bengio, A. Courville
Chapter 7, pp. 224–270, MIT Press, 2016 [link].

Dropout — PyTorch docs page.

How does dropout work during testing in neural networks? [Prylipko]

Neural Interpreters

Sparse attention mechanisms and several analogies with programming codes


Photo by Shubham Dhage

For this post we refer to the paper “Dynamic Inference with Neural Interpreters” by Rahaman et al. (2021).

Overview

A neural interpreter is a collection of modules resembling a program: it is a set of scripts, which are made up of functions, which in turn are made up of lines of code. Essentially, this is an attention-based network in which inputs are routed through a sequence of functions in a way that is learned end-to-end.

Convolutional networks reuse computational units, like filters, laterally (once the depth is fixed), while recurrent neural networks only reuse computational units (RNN cells) vertically, i.e., in depth. Such rigidity in the way networks reuse their units is believed to be one of the reasons for poor generalization. The neural interpreter aims to be an architecture made of independent and composable pieces, capable of relaxing this rigidity in computation reuse.

Input and Output

Assume that the input set contains vector embeddings of image patches or entire images. These elements are vectors of a certain dimension din. The input set additionally includes one or more learned vectors, called CLS tokens, for which the corresponding outputs interface with their respective classifiers. The output is another set of vectors whose dimension is dout (with the same cardinality as the input set).

Fig. 1. Neural Interpreter

Scripts

A neural interpreter is a stack of nₛ scripts mapping one set of vectors X = {x₁, x₂, …} to another Y = {y₁, y₂, …} with the same number of elements:

\mathbf{Y} = \mathsf{Neural \; Interpreter}(\mathbf{X}) = \left[ \mathsf{Script}_{n_s} \, \circ \, \cdots \, \circ\, \mathsf{Script}_1 \right](\mathbf{X})

Fig. 2. A neural interpreter is a stack of scripts

Increasing the number of scripts nₛ will increase the depth of the architecture. A script has four components:

  1. a type inference module;
  2. a type matching mechanism;
  3. a set of functions;
  4. an interpreter.

We will soon describe these four components.

Functions

Each script contains functions. Functions are vector-valued instructions to other components in the script. Formally, a function fᵤ is a pair (s, c) where s is called the signature and c is called the code (u is used as an index). The signature is a normalized vector of dimension dtype and indicates to the type matching mechanism (see below) which inputs are to be routed to fᵤ (note the analogy with programming). The vector c, a learned parameter for each function, is the code that tells the function what to do (further details in a moment). Each fᵤ has its own code, which stays the same across inputs.

Fig. 3. Functions inside a script

For example, f₁, f₂ and f₃ all share their global parameters but each has its own code. Samples can jump flexibly from one function to another. The way each sample is routed through the network is determined on a per-sample basis: every example has its own independent path through the network, and the routing itself is entirely learned.

Not every example is routed to every function, so let’s see how an example ends up in a function’s scope.

Type Matching and Inference

Before getting to the functions, a sort of higher-level attention is performed on the set elements. Type matching is responsible for routing the information elements through functions. This is a three step procedure.

a) At the beginning, an input set element x is processed through an MLP module (called type inference module) to obtain a type vector t whose dimension is dtype. This vector lies in the same unit hypersphere 𝓣 containing the signature vectors s.

b) Consider a function fᵤ. Define a distance based on the cosine similarity between the type vector tᵢ and the signature sᵤ, that is d𝓣(sᵤ, tᵢ) = 1 − sᵤ · tᵢ.

c) Then a sort of softmax with normalization is performed, returning a coefficient Cᵤᵢ. However, Cᵤᵢ is set to 0 if the distance between sᵤ and tᵢ is larger than τ, a value called the truncation parameter. This introduces sparsity in the model. Fixing u and i, Cᵤᵢ is the compatibility between function fᵤ and set element xᵢ: xᵢ can be processed by fᵤ only if Cᵤᵢ is sufficiently large, and if Cᵤᵢ = 0 then fᵤ cannot access xᵢ (a small sketch of this matching step is given after Fig. 4).

Fig. 4. Type matching and inference
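A hedged sketch of the three matching steps, with made-up dimensions, a simple MLP as the type inference module, and a softmax-with-truncation whose exact normalization follows the spirit rather than the letter of the paper (τ and the temperature σ are illustrative):

import torch
from torch import nn
import torch.nn.functional as F

d_in, d_type, n_funcs, n_elems = 64, 16, 3, 5
tau, sigma = 0.6, 0.1                        # truncation and temperature (made up)

type_inference = nn.Sequential(nn.Linear(d_in, d_type), nn.GELU(),
                               nn.Linear(d_type, d_type))
signatures = F.normalize(torch.randn(n_funcs, d_type), dim=-1)    # s_u

x = torch.randn(n_elems, d_in)               # set elements x_i
t = F.normalize(type_inference(x), dim=-1)   # type vectors t_i on the hypersphere

dist = 1.0 - signatures @ t.T                # d(s_u, t_i), shape (n_funcs, n_elems)
C = torch.softmax(-dist / sigma, dim=0)      # compatibilities across functions
C = torch.where(dist <= tau, C, torch.zeros_like(C))   # truncation -> sparsity
print(C)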

Modulated Linear layers and modulated MLPs

The following constructs are needed to define an attention mechanism later on. They should be interpreted as programmable modules (the program is determined by the code c). Modulated linear layers act like linear layers, with the only difference that the linear transformation is applied not to x but to

x´ = x ⊗ LayerNorm(W𝒸 c)

where W𝒸 is a learnable matrix that constitutes a set of parameters shared among all functions in the same script (the symbol ⊗ denotes entry-wise product). In short

\mathbf{y} = \mathsf{ModLin}(\mathbf{x}; \mathbf{c} ) = \mathbf{W}\mathbf{x}^\prime + \mathbf{b}

where W is a weight matrix and b is a bias term. Having defined the modulated linear layer, one may stack L of them (sharing the same code c), interspersed with GELU activation functions, to get the modulated MLP:

\begin{aligned} \mathbf{y} &= \mathsf{ModMLP}(\mathbf{x}; \mathbf{c} )\\ &= ( \mathsf{ModLin}_L(\bullet; \mathbf{c} ) \, \circ \, \mathsf{Activation} \, \circ \, \cdots \, \circ \, \mathsf{ModLin}_1(\bullet; \mathbf{c} ) )(\mathbf{x})\,. \end{aligned}
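A hedged PyTorch sketch of ModLin and a two-layer ModMLP as defined above; dimensions are arbitrary, and the sharing of W𝒸 across the functions of a script is only hinted at here by passing different code vectors to the same module.

import torch
from torch import nn

class ModLin(nn.Module):
    """Linear layer modulated by a code vector c (a sketch, not the authors' code)."""
    def __init__(self, d_in, d_out, d_code):
        super().__init__()
        self.Wc = nn.Linear(d_code, d_in, bias=False)   # maps the code to a modulation
        self.norm = nn.LayerNorm(d_in)
        self.lin = nn.Linear(d_in, d_out)

    def forward(self, x, c):
        x_mod = x * self.norm(self.Wc(c))   # x' = x ⊗ LayerNorm(Wc c)
        return self.lin(x_mod)              # W x' + b

class ModMLP(nn.Module):
    def __init__(self, d, d_code):
        super().__init__()
        self.l1, self.l2 = ModLin(d, d, d_code), ModLin(d, d, d_code)
        self.act = nn.GELU()

    def forward(self, x, c):
        return self.l2(self.act(self.l1(x, c)), c)

x, c = torch.randn(5, 32), torch.randn(32)   # five set elements and one function code
print(ModMLP(32, 32)(x, c).shape)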

ModAttn

A conditional multi-head attention mechanism is used (conditioned, that is, on the code vector cᵤ of function fᵤ). Queries, keys and values are computed with ModLin layers (instead of plain linear layers) for each head h:

\begin{aligned} \mathbf{k}_{uhi} &= \mathsf{ModLin}_{\textsf{key}}^h(\mathbf{x}_i \,; \, \mathbf{c}_u )\\ \mathbf{q}_{uhi} &= \mathsf{ModLin}_{\textsf{query}}^h(\mathbf{x}_i \,; \, \mathbf{c}_u )\\ \mathbf{v}_{uhi} &= \mathsf{ModLin}_{\textsf{value}}^h(\mathbf{x}_i \,; \, \mathbf{c}_u ) \,. \end{aligned}

Then consider again the compatibility coefficients {Cᵤᵢ}; these quantities serve as modulators when evaluating the self-attention weights. The self-attention weights are given by the normalizing expression

\displaystyle W_{uhij} = \frac{\tilde{W}_{uhij}}{\epsilon + \sum_j \tilde{W}_{uhij}}

where ε avoids division by near-zero terms and

\displaystyle \tilde{W}_{uhij} = C_{ui}C_{uj}\left[ \mathsf{softmax}_j \left(\frac{ \mathbf{q}_{uhi} \cdot \mathbf{k}_{uhj} }{\sqrt{d_\textsf{key}}}\right) \right] \,.

For example, fix fᵤ and the head h. Then we have

\displaystyle \tilde{W}_{ij} = C_{i}C_{j}\left[ \mathsf{softmax}_j \left(\frac{ \mathbf{q}_{i} \cdot \mathbf{k}_{j} }{\sqrt{d_\textsf{key}}}\right) \right]

and, after normalization, Wᵢⱼ is the attention weight between elements xᵢ and xⱼ. Intuitively, information about xᵢ and xⱼ is mixed by fᵤ at head h only if Wᵤₕᵢⱼ is not 0; the weight vanishes in two cases: 1) at least one of the compatibility factors is zero (that is, fᵤ cannot access xᵢ or xⱼ), or 2) the self-attention weight (the softmax part) is close to zero. Finally, the following linear combination is computed

\displaystyle \tilde{\mathbf{y}}_{uhi} = \sum_j W_{uhij}\mathbf{v}_{uhj}

and the final output is

\displaystyle \tilde{\mathbf{y}}_{ui} = \mathsf{ModLin}(\tilde{\mathbf{y}}_{ui;h}\,;\,\mathbf{c}_u)

where the semicolon separating h from ui indicates that the results of various heads are folded (as usual in multi-head attention) into one single object.

Line of Code

A line of code layer is a ModAttn layer followed by a ModMLP layer (see figure below, on the right). Both layers share the same condition vector, and there are weighted residual connections between them.

Fig. 5. Lines of code

A line of code (LOC) is a line of code layer applied in parallel streams, one per function, as shown in Fig. 5 (right). The inputs of a LOC, say {xᵤᵢ}, carry an extra index u, meaning that each is a specific input to the function fᵤ. If a function fᵤ cannot access xᵤᵢ, then fᵤ acts on xᵤᵢ as the identity function. For example, focus on a particular function fᵤ and on its specific inputs {xᵤᵢ} as i varies. Then

\mathbf{a}_{ui} = \mathbf{x}_{ui} +C_{ui} \tilde{\mathbf{a}}_{ui}

where ãᵤᵢ is the output of the attention layer (ModAttn); then

\mathbf{y}_{ui} = \mathbf{a}_{ui} +C_{ui} \tilde{\mathbf{b}}_{ui}

where \mathbf{\tilde{b}}_{ui} is the output of the MLP module (essentially, a ModMLP layer). Note that if fᵤ cannot access xᵤᵢ (that is, Cᵤᵢ = 0), then the output yᵤᵢ is just xᵤᵢ.

Interpreter

The interpreter layer is a stack of LOCs sharing the same function codes. The interpreter broadcasts a given set element to multiple parallel computational streams, one for each function. Let the number of stacked LOCs be nₗ. Let X = {x₁, x₂, …} and Cᵤ = {Cᵤ₁, Cᵤ₂, …}; then

\mathbf{y}_{i} = \mathbf{x}_{i} + C_{1i}\, \mathcal{L}(\mathbf{X}, \mathbf{c}_1, \mathbf{C}_1) + C_{2i}\, \mathcal{L}(\mathbf{X}, \mathbf{c}_2, \mathbf{C}_2) + \cdots

where

\mathcal{L} = \underbrace{\mathsf{LOC}_{n_l}\,\circ\, \cdots\,\circ\,\mathsf{LOC}_1}_{n_l \, \textsf{times}} \,.

Essentially, the output is a weighted sum whose coefficients are the compatibilities of the elements with the respective functions. Given a set of inputs and an instruction (the function code), the role of the interpreter is to execute that instruction and compute the output.

Increasing the number of LOCs nₗ increases the architecture depth and also the number of parameters.

Function Iteration

We have already seen that the overall model is a stack of multiple scripts. A script can be expressed as a recurrent application of Function Iteration (FnIter)

\{ \mathbf{y}_1, \mathbf{y}_2, \dots\} =( \underbrace{\mathsf{FnIter}\,\circ\, \cdots\,\circ\,\mathsf{FnIter}}_{n_i \, \textsf{times}})( \{ \mathbf{x}_1, \mathbf{x}_2, \dots\})

where FnIter is defined as the composition of the type matching mechanism and the interpreter.

The number of function iterations nᵢ can be increased without increasing the number of parameters, so FnIter enables unit sharing in depth.

Experiments

Some experiments have been conducted on tasks such as learning fuzzy boolean expressions, multi-task image classification and abstract reasoning. However, we do not delve further into these matters, as only time will tell whether this recent architecture pays off.

Useful links

Original article on Neural Interpreters.

Nice discussion with authors (video).

Sharpness-Aware Minimization

This post deals with a recent optimization method for training neural networks, described in the paper Sharpness-Aware Minimization for Efficiently Improving Generalization by P. Foret et al. (December 2020). Honestly, the first time I read the paper’s details I thought the procedure described there (or something similar) must already have been explored many years before by plenty of people… I was even surprised to read that it works in some contexts.

Is loss value not enough?

Modern models are trained through optimization methods that rely only on the training loss. These models can easily memorize the training data and are prone to overfitting: they have more parameters than needed, and this large number of parameters provides no guarantee of proper generalization to the test set.

Sharpness-Aware Minimization (SAM) is a procedure that aims to improve model generalization by simultaneously minimizing loss value and loss sharpness (the pictures below provide an intuitive support for the notion of “sharpness” for a loss landscape).

Fig. 1. Sharp vs wide (low curvature) minimum
Fig. 2. Sharp minimum (left) vs wide minimum (right) for a ResNet trained with SGD (source)

SAM seeks parameters lying in neighborhoods having uniformly low loss value (and not just parameters having low loss value). When SAM procedure is used to update weights, the sharpness of the loss landscape is taken into account.

Empirical studies suggest that SAM improves model generalization ability across a range of widely studied computer vision tasks on datasets like CIFAR{10-100} and ImageNet.

A learning setup

A little premise before going into details. Let

S = \left\{(x_1,y_1), \dots, (x_n, y_n) \right\}

be the training set, whose examples are drawn i.i.d. from a distribution D. We seek to learn a model that generalizes well (roughly, a model that performs well on the test set). We consider a family of models whose parameters are w ∈ W (w is a d-dimensional vector) and a loss function l acting on each single datapoint. A loss function is, typically, a function expressing the discrepancy between the model prediction and the actual observation (label). We define the training set loss as

\displaystyle L_S(w) = \frac{1}{n}\sum_{i=1}^n l(w,x_i,y_i),

that is, the mean of per-data-point errors over S, and the population loss

L_D(w) =\mathbb{E}_{(x,y)\sim D}\, l(w,x,y)

as the mean per-data-point loss over the whole distribution D.

What is the goal of model training? Having observed only S, find model parameters w such that the population loss LD(w) is low. In practice, the training loss LS(w) is used as an estimate of the population loss LD(w), and the model parameters w are selected by solving minw LS(w) with some optimization procedure such as Stochastic Gradient Descent (SGD) or Adam.

For modern models, LS(w) is typically a non-convex function of the parameters w. A problem is that this function has multiple local — and even global — minima in which it assumes similar values while exhibiting significantly different generalization performance (that is, the population loss assumes significantly different values).

What makes SAM different is the focus on minima neighborhoods. Rather than seeking out parameter values w that simply have low training loss LS(w), the SAM procedure seeks out parameter values whose neighborhoods have both low loss and low curvature.

Sharpness

For ρ > 0, the paper derives a bound of the following form, holding with high probability over the draw of the training set:

\displaystyle L_D(w) \leq \underset{\| \epsilon \|_2 \leq \rho}{\max} L_S(w+\epsilon) + h \left(\frac{\|w\|_2^2}{\rho^2} \right)

where h is a strictly increasing function. Adding and subtracting LS(w), the right-hand side can be rewritten as

\displaystyle \left[ \underset{\| \epsilon \|_2 \leq \rho}{\max} L_S(w+\epsilon) - L_S(w) \right] + L_S(w) + h \left(\frac{\|w\|_2^2}{\rho^2} \right).

The term enclosed by square brackets is the sharpness. Note that the more the loss grows around w (steep landscape), the larger is the sharpness. Sharpness measures how quickly the training loss can be increased by moving from w to a nearby parameter value w + ϵ.

Minimization

The function h is replaced by a simpler constant λ (which is not strictly increasing, however…), making the last term a standard L2 regularization term. At this point, the parameter values are chosen by solving the following minimization problem

\displaystyle \underset{w}{\min}\; L_S^{\mathsf{SAM}}(w) + \lambda \|w\|_2^2\,,

where

\displaystyle L_S^{\mathsf{SAM}}(w) = \underset{\| \epsilon \|_p \leq \rho}{\max} L_S(w+\epsilon)

with ρ ≥ 0 a hyperparameter and p in [1, ∞] (a slight generalization, though p = 2 is empirically the best choice).

In order to minimize L_S^{\mathsf{SAM}}(w), an efficient approximation of its gradient is needed. The first step is to take the first-order Taylor expansion of LS(w + ϵ) with respect to ϵ around 0 and substitute it into the L_S^{\mathsf{SAM}} expression. Taking the maximizing argument:

\displaystyle \begin{aligned} \epsilon^*(w) &= \arg\underset{\| \epsilon \|_p \leq \rho}{\max} L_S(w+\epsilon) \\ &\approx \arg\underset{\| \epsilon \|_p \leq \rho}{\max} \left(L_S(w) + \epsilon^\top \nabla_w L_S(w) \right) \\ &= \arg\underset{\| \epsilon \|_p \leq \rho}{\max} \epsilon^\top \nabla_w L_S(w). \end{aligned}

The last expression is just the argmax of the dot product of the vectors ϵ and ∇w LS(w), and the argument that maximizes it is given by a classical dual-norm result (check this dual norm result; the optimal value is denoted with y). An easy intro to dual norms can be found here. Let’s temporarily denote ∇w LS(w) with g. The argument that solves the preceding approximation is

\displaystyle \hat{\epsilon}(w) = \rho\, \mathrm{sign}(g) \frac{ |g|^{q-1} }{ \left(\| g\|_q^q\right)^{1/p}}

where 1/p + 1/q = 1. Since \hat{\epsilon}(w) is the maximizing argument, we can write

\displaystyle \begin{aligned} \nabla_w\,L_S^{\mathsf{SAM}}(w) &\approx \nabla_w\,L_S(w+ \hat{\epsilon}(w)) \\ &= \frac{\mathrm{d}\,(w + \hat{\epsilon}(w)) }{ \mathrm{d}\, w} \, \nabla_w\,L_S(w)|_{ w+ \hat{\epsilon}(w) } \\ &= \nabla_w\,L_S(w)|_{ w+ \hat{\epsilon}(w) }\, +\, \frac{\mathrm{d}\, \hat{\epsilon}(w) }{ \mathrm{d}\, w} \, \nabla_w\,L_S(w)|_{ w+ \hat{\epsilon}(w) }. \end{aligned}

Modern frameworks can easily compute the preceding approximation. However, to speed up the computation, the second-order terms can be dropped, obtaining

\displaystyle \nabla_w\,L_S^{\mathsf{SAM}}(w) \approx \nabla_w\,L_S(w)|_{ w+ \hat{\epsilon}(w) }.

Algorithm

Input: training set S, loss function l, batch size b, step size \eta , neighborhood size \rho .
Output: model trained with SAM.

Fig. 3. SAM parameter update
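A hedged PyTorch sketch of one SAM update with p = 2, following the two-step rule above: compute ε̂ at the current weights, evaluate the gradient at w + ε̂, then take a plain SGD step. The toy model, ρ and η are illustrative choices, not the paper's settings.

import torch
from torch import nn

model = nn.Linear(10, 2)                    # toy model
loss_fn = nn.CrossEntropyLoss()
rho, eta = 0.05, 0.1                        # neighborhood size and step size

x, y = torch.randn(32, 10), torch.randint(0, 2, (32,))

# 1) gradient g of the training loss at w
loss = loss_fn(model(x), y)
grads = torch.autograd.grad(loss, list(model.parameters()))

# 2) epsilon_hat = rho * g / ||g||_2 (the p = 2 case); perturb w in place
grad_norm = torch.sqrt(sum((g ** 2).sum() for g in grads))
eps = [rho * g / (grad_norm + 1e-12) for g in grads]
with torch.no_grad():
    for p_, e in zip(model.parameters(), eps):
        p_.add_(e)

# 3) gradient of L_S at w + epsilon_hat, used as the SAM gradient
loss_perturbed = loss_fn(model(x), y)
sam_grads = torch.autograd.grad(loss_perturbed, list(model.parameters()))

# 4) undo the perturbation and take an SGD step with the SAM gradient
with torch.no_grad():
    for p_, e, g in zip(model.parameters(), eps, sam_grads):
        p_.sub_(e)
        p_.sub_(eta * g)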

The JAX code from the authors’ paper and additional info can be found here; another implementation in PyTorch is available here.

Useful links

Original article (2020).

Dual norms result.

Introductory video about dual norms.

Taylor expansion (Wikipedia article).

Code repository from the authors’ paper.

PyTorch implementation available here.

Colab notebook with TensorFlow implementation.

The Kullback-Leibler divergence

In this post we will just spend a few words on a well-known measure of how dissimilar a given distribution is from another reference distribution. First we will give a definition for such a measure and then we will provide some intuitive meaning together with some useful coding snippets.

Definition

Let’s begin with the discrete case. Let P and Q be two probability distributions defined on the same probability space \mathcal{X}. A first attempt might be to consider the average of the difference between the distributions; the following definition is indeed quite close, just a little different. The Kullback-Leibler divergence (also called relative entropy) KL(P‖Q) is defined as the average of the difference between the logarithms of the probabilities P(x) and Q(x):

\mathrm{KL}(P\Vert Q) \, \stackrel{\mathsf{def}}{=} \, \mathbb{E} \big[ \log P(x)  - \log Q(x) \big]\,.

The expectation is taken using the probabilities P (often written as x \sim P). The definition of expectation leads to the expression

\displaystyle \mathrm{KL}(P\Vert Q) = \sum_{x \in \mathcal{X}} P(x) \log\left(\frac{P(x)}{Q(x)}\right).

In the case of continuous distributions we write

\displaystyle \mathrm{KL}(P\Vert Q) = \int_{-\infty}^\infty p(x) \log\left(\frac{p(x)}{q(x)}\right) \,\mathrm{d}x

where p(x) and q(x) are P and Q respective densities.

KL divergence is often called a “distance”, but it is not a distance in the mathematical sense (a metric): KL divergence is not symmetric. This means that KL(P‖Q) is generally different from KL(Q‖P).

If Q(x) is 0 for some x, the KL divergence is not defined unless P(x) = 0 as well. What if P(x) is 0 somewhere? In that case the corresponding term is taken to be zero, since a log(a) tends to 0 as a approaches 0.
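A small sketch of a manual discrete KL computation applying the 0·log(0) = 0 convention (assuming Q(x) > 0 wherever P(x) > 0); the numbers are those of the quick example further below.

import numpy as np

def kl_divergence(p, q):
    """KL(P || Q) for discrete distributions, with 0*log(0) treated as 0."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0                       # terms with P(x) = 0 contribute nothing
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

print(kl_divergence([9/25, 12/25, 4/25], [1/3, 1/3, 1/3]))   # ~0.0853
print(kl_divergence([1/3, 1/3, 1/3], [9/25, 12/25, 4/25]))   # ~0.097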

Motivations behind the definition

A first intuition comes from the fact that if {pᵢ} and {qᵢ} are two probability mass functions, that is, two countable or finite sequences of nonnegative numbers that sum to one, then

\displaystyle  \sum_{i} p_i \log \left(\frac{p_i}{q_i}\right) \geq 0

with equality if and only if pi = qi for all i. The fact that the divergence of one probability distribution with respect to another is nonnegative and zero only when the two distributions are the same suggests the interpretation of KL divergence as a “distance” between two distributions, that is, a measure of how different the two distributions are.

A second intuition about the fact that KL divergence actually expresses some kind of distance between two distributions comes from the expression

\begin{aligned} \displaystyle \mathrm{KL}(P\Vert Q) &= \int_{-\infty}^\infty p(x) \left( \log p(x) - \log q(x) \right) \, \mathrm{d}x \\& = \int_{-\infty}^\infty p(x) D(x)\, \mathrm{d}x \end{aligned}

where it is immediate to recognize that the difference between logarithms D(x) is a term expressing the gap between the two distributions. If the average gap is small, then the two distributions are “similar” or “close”.

Fig. 1. Two continuous distribution densities p and q and their respective logarithmic transformations log(p) and log(q)

Connection with cross entropy

The KL divergence KL(P‖Q) is equal to

\begin{aligned} \displaystyle \mathrm{KL}(P\Vert Q) &= - \sum_x P(x)  \log Q(x) + \sum_x P(x) \log P(x) \\& = H(P,Q) - H(P) \end{aligned}

where H(P, Q) is the cross entropy of P and Q and H(P) is the entropy of P. As we said, KL(P‖Q) can be thought of as a measure of how far the distribution Q is from the distribution P. But cross entropy is itself such a measure… the difference is that cross entropy attains a — generally nonzero — minimum when P = Q, namely H(P, P) = H(P); so in the KL divergence we subtract the entropy term H(P) to make the minimum value 0. This is coherent with the property that the distance of an object from itself should be zero.

Quick example

Let P and Q be the following distributions (each possible outcome x is in \mathcal{X} = {0, 1, 2}):

x                     0        1        2
Distribution P(x)   9/25    12/25     4/25
Distribution Q(x)    1/3      1/3      1/3

Fig. 2. The distributions P and Q

The following picture shows both P (amber) and Q (gray).

Fig. 3. P and Q overlapping

Next picture shows the logarithm of distributions with the difference D at x = 2.

Fig. 4. log(P) and log(Q) with difference D

Let’s calculate KL(P‖Q).

\begin{aligned} \displaystyle \mathrm{KL}(P\Vert Q) &= \sum_x P(x)  \log \left( \frac{P(x)}{Q(x)} \right) \\&= 9/25 \log\left(\frac{9/25}{1/3}\right) + 12/25 \log\left(\frac{12/25}{1/3} \right) + 4/25 \log\left(\frac{4/25}{1/3} \right) \\& \approx 0.0853\,. \end{aligned}

Interchanging the arguments, we find that KL(Q‖P) is approximately 0.0974, which differs from the previous value.

Evaluate KL divergence with Python

Import the entropy function

from scipy.stats import entropy

and then compute KL(P‖Q) from the example above in just one line.

entropy([9/25, 12/25, 4/25], qk=[1/3, 1/3, 1/3])
0.0852996013183706

Below, a simple Python coding example reproducing figures 1–4. Note that the two continuous density curves are multiplied by a magnifying coefficient for scaling purposes.

import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import norm, skewnorm

p = [9/25, 12/25, 4/25]
q = [1./3, 1./3, 1./3]
xx = ['0', '1', '2']

logq = np.log(q)
logp = np.log(p)

# Fig. 3: the two discrete distributions, overlapped
plt.bar(xx, q, color='beige')
plt.bar(xx, p, alpha=.6, color='yellowgreen')
plt.show()

# Fig. 4: logarithms of the two distributions
plt.bar(xx, logq, color='beige')
plt.bar(xx, logp, alpha=.6, color='yellowgreen')
plt.show()

# Fig. 1: two continuous densities (magnified by 10) with their logarithms,
# shading the gap D between the log curves
x = np.arange(-3, 2.5, .001)
plt.plot(x, 10*skewnorm.pdf(x, -1.2), color='black')
plt.plot(x, 10*norm.pdf(x, scale=1.1), color='yellowgreen')
log1 = np.log(skewnorm.pdf(x, -1.2))
log2 = np.log(norm.pdf(x, scale=1.1))
plt.plot(x, log1, color='black')
plt.plot(x, log2, color='yellowgreen')
plt.fill_between(x, log1, log2,
                 where=log1>=log2, facecolor='darkgrey',
                 interpolate=True)
plt.fill_between(x, log1, log2,
                 where=log1<log2, facecolor='lightgreen',
                 interpolate=True)
plt.show()

Feel free to email me for comments, questions, suggestions or if you just want to leave a message.

Multinomial Logistic Regression

In this post we will show how to quickly build a simple model for digit classification using TensorFlow 2 on the MNIST dataset. We just need base Python, NumPy, Matplotlib and a recent version of TensorFlow (2.X).

The model

Our aim is to train a simple classifier with SGD (or other optimizers like Adam) using TensorFlow.

The idea is to apply a very simple transformation to the input, obtaining a vector that, suitably adjusted, expresses information about which class the input belongs to.

Let W be a matrix and b a bias vector (both trainable). Starting from the (flattened) input x, the linear transformation Wx+b produces a logits vector, which is then squashed with the softmax function to obtain a probability distribution. The i-th entry of this distribution vector represents the probability that the input belongs to the i-th class (see figure below).

For each class (each digit in the case of MNIST dataset) we need to calculate a logit (using a linear function)

z_k = w_k \cdot x + b_k \quad (k=0,\dots,9)

and transform logits to valid probabilities p_k with softmax

\displaystyle p_k = \frac{e^{z_k}}{\sum_{i=0}^9 e^{z_i} } \quad (k=0,\dots,9)\,.

For our model, x is a flattened vector coming from a digit image and w_k is a row of the weight matrix. The model just described is known by a variety of names, including Multinomial Logistic Regression and Softmax Regression.
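A tiny NumPy sketch of the two formulas above, with random weights standing in for the trained W and b:

import numpy as np

rng = np.random.default_rng(0)
x = rng.random(784)                 # a flattened 28x28 image
W = rng.normal(size=(10, 784))      # one weight row w_k per class
b = rng.normal(size=10)

z = W @ x + b                       # logits z_k = w_k . x + b_k
p = np.exp(z - z.max())             # softmax (shifted for numerical stability)
p = p / p.sum()
print(p, p.sum())                   # a valid probability distribution summing to 1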

We will use cross-entropy loss to train our multi-class classifier. In particular, since we have labels representing digit classes that are integers (and not one-hot vectors), TensorFlow has a nice loss function that fits this case: SparseCategoricalCrossentropy.

The code

We start by importing NumPy, Matplotlib and TensorFlow.

import numpy as np

from matplotlib import pyplot as plt
%matplotlib inline

import tensorflow as tf
print("We're using TF", tf.__version__)

The MNIST dataset consists of 28×28 images of digits from 0 to 9 (60000 training images and 10000 test images, as loaded below). We will train a classifier on this data.

from tensorflow.keras.datasets import mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()

Time for some dataset visualization.

print("x_train [shape %s] sample patch:\n" % (str(x_train.shape)), 
      x_train[1, 15:20, 5:10])
print("A closeup of a sample patch:")
plt.imshow(x_train[1, 15:20, 5:10], cmap="Greys")
plt.show()

print("And the whole sample:")
plt.imshow(x_train[0], cmap="Greys")
plt.show()

print("y_train [shape %s] 10 samples:\n" % (str(y_train.shape)),
      y_train[:10])

Normalize image values from [0,255] to [0,1].

x_train, x_test = x_train / 255., x_test / 255.

Here’s our (very simple) model, Keras-style built.

model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(10, activation='softmax')])

model.summary()

Training over 10 epochs we get an accuracy ~93%.

model.compile(optimizer='adam',
    loss=tf.keras.losses.SparseCategoricalCrossentropy(),
    metrics=['accuracy'])

model.fit(x_train, y_train, epochs=10)

Visualizing results

We compare some predicted digits with the actual digits.

predictions = model.predict(x_test)
predictions = np.argmax(predictions, axis=1)
print(predictions[:10])
print(y_test[:10])
[7 2 1 0 4 1 4 9 6 9]
[7 2 1 0 4 1 4 9 5 9]

Just one mismatch on the first 10 examples.

n_to_show = 8
indices = np.random.choice(range(len(x_test)), n_to_show)

fig = plt.figure(figsize=(15, 3))
fig.subplots_adjust(hspace=0.4, wspace=0.4)

for i, idx in enumerate(indices):
    img = x_test[idx]
    ax = fig.add_subplot(1, n_to_show, i+1)
    ax.axis('off')
    ax.text(0.5, -0.4, 
            'predicted = ' + str(predictions[idx]),
            fontsize=10, 
            ha='center',
            transform=ax.transAxes)
    ax.text(0.5, -0.7, 
            'actual = ' + str(y_test[idx]),
            fontsize=10, 
            ha='center', 
            transform=ax.transAxes)
    ax.imshow(img, cmap='binary')

[Jupyter Notebook]

Feel free to email me for comments, questions, suggestions or if you just want to leave a message.