LLM training can be much cheaper than people generally thought
JetMoE is a recent Large Language Model (LLM) that reportedly outperforms LLaMA2-7B from Meta AI and was trained for 2 weeks on a 96×H100 GPU cluster, spending only ~$80,000…
But how much does it cost to train an LLM?
Training costs
A first oddity is that the JetMoE article does not explicitly mention any training costs for comparison with other models (except its own). Also, according to this page, the Llama2-7B model requires less than $85,000 to train – if that were the case, what would be the big economic benefit of JetMoE? Where did the JetMoE team get their figure for Llama2-7B's training cost, and why didn't they publish it for direct comparison?
Anyway, the JetMoE article reports training costs as GPU hours (specifically, Nvidia H100 GPU hours): JetMoE's training took 30,000 H100 GPU hours. For comparison, Microsoft's "optimized version of the Llama 2 model" reports the figures in the table below, expressed in A100 GPU hours (the H100 generally outperforms the previous-generation A100)…
Meta's largest LLaMA model, as of March 2023, used 2,048 Nvidia A100 GPUs to train on 1.4 trillion tokens (750 words is about 1,000 tokens), taking about 21 days: the cost was over $2.4 million. Analysts and technologists estimate that training a large language model such as OpenAI's GPT-3 could cost more than $4 million. You can find these numbers here.
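As a rough sanity check, you can convert GPU hours into dollars using assumed hourly cloud rates. The rates in the sketch below are my own assumptions (typical on-demand prices), not figures reported by the JetMoE or LLaMA papers:

```python
# Back-of-the-envelope training-cost estimates from GPU hours.
# Hourly rates are assumptions (typical cloud on-demand prices).

H100_HOURLY_RATE = 2.70   # assumed $/hour for one H100
A100_HOURLY_RATE = 2.20   # assumed $/hour for one A100

# JetMoE: ~30,000 H100 GPU hours
jetmoe_cost = 30_000 * H100_HOURLY_RATE
print(f"JetMoE estimate: ${jetmoe_cost:,.0f}")        # ~ $81,000

# LLaMA (largest model, March 2023): 2,048 A100s for ~21 days
llama_gpu_hours = 2_048 * 21 * 24
llama_cost = llama_gpu_hours * A100_HOURLY_RATE
print(f"LLaMA GPU hours: {llama_gpu_hours:,}")        # 1,032,192
print(f"LLaMA estimate: ${llama_cost:,.0f}")          # ~ $2.3 million
```

With these assumed rates, the numbers land in the same ballpark as the figures quoted above (~$80,000 for JetMoE, over $2 million for LLaMA), which is what makes the GPU-hour comparison meaningful.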
Training GPT-4 is estimated to have cost over $100 million (here).
Key Messages
This is taken directly from JetMoE's page.
- JetMoE-8B is trained with less than $ 0.1 million cost but outperforms LLaMA2-7B from Meta AI, who has multi-billion-dollar training resources. LLM training can be much cheaper than people generally thought.
- JetMoE-8B is very open and academia-friendly because:
- It only uses public datasets for training, and the code is open-sourced. No proprietary resource is needed.
- It can be finetuned with very limited compute budget (e.g., consumer-grade GPU) that most labs can afford.
- JetMoE-8B only has 2.2B active parameters during inference, which drastically lowers the computational cost. Compared to a model with similar inference computation, like Gemma-2B, JetMoE-8B achieves constantly better performance.
How JetMoE works
The JetMoE architecture is illustrated in the following figure.
The JetMoE architecture takes advantage of sparse activation in both the attention and feed-forward layers, significantly reducing training and inference costs.
Let x be the input vector and consider a learnable matrix Wr that controls the routing. Let s be the routing output:
s = Wr x .
The Sparse Mixture of Experts (SMoE) output y is represented by a relation of the type
y = g1 · f1(x) + g2 · f2(x) + · · · + gn · fn(x) .
It's just a weighted combination of n experts, represented by the functions fi with i = 1, 2, . . . , n (these are normally 2-layer MLPs or, in the case of Mixture of Attention, constructs of the type illustrated below). The "weights" gi are obtained by keeping only the top k logits of s (and taking their softmax), setting the rest to 0.
In essence, s is a vector whose larger components have a greater influence on the above combination defining the output y. The usefulness of this approach lies in the fact that if gi = 0 for several indices i, none of the corresponding fi(x) need to be evaluated, thus reducing computation cost during training and inference. The mechanism of a single attention expert is illustrated in the following figure.
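Here is a minimal PyTorch sketch of this top-k gating idea, assuming simple 2-layer MLP experts; the layer sizes, number of experts and k are illustrative and do not reproduce JetMoE's actual configuration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Minimal sketch of top-k Sparse Mixture-of-Experts routing.
# Sizes, n_experts and k are illustrative, not JetMoE's real settings.
class SparseMoE(nn.Module):
    def __init__(self, d_model=512, d_hidden=1024, n_experts=8, k=2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts, bias=False)   # W_r
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):                         # x: (batch, d_model)
        s = self.router(x)                        # routing logits: s = W_r x
        top_s, top_idx = s.topk(self.k, dim=-1)   # keep only the top-k logits
        g = F.softmax(top_s, dim=-1)              # softmax over the selected logits
        y = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = top_idx[:, slot] == e      # tokens routed to expert e in this slot
                if mask.any():                    # experts with g_i = 0 are never evaluated
                    y[mask] += g[mask, slot].unsqueeze(-1) * expert(x[mask])
        return y

# Example: y = SparseMoE()(torch.randn(4, 512))
```

Note how only the selected experts are ever called, which is exactly where the training and inference savings come from.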
Matrices Wk and Wv are shared across experts to improve training and inference efficiency, whereas matrices Wq and Wo (in orange) vary from one expert to another. ae is obtained by applying standard multi-head attention with RoPE to k, v and qe.
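The sketch below shows how a single attention expert of this kind could look, with Wk and Wv passed in as shared modules and Wq, Wo owned by the expert. RoPE is omitted for brevity and all dimensions are illustrative, so this is not JetMoE's actual implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Sketch of one attention expert with shared key/value projections.
# RoPE is omitted; head counts and sizes are illustrative only.
class AttentionExpert(nn.Module):
    def __init__(self, d_model=512, n_heads=8, w_k=None, w_v=None):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        # W_k and W_v are shared across experts (passed in); W_q and W_o are per-expert
        self.w_k = w_k if w_k is not None else nn.Linear(d_model, d_model, bias=False)
        self.w_v = w_v if w_v is not None else nn.Linear(d_model, d_model, bias=False)
        self.w_q = nn.Linear(d_model, d_model, bias=False)
        self.w_o = nn.Linear(d_model, d_model, bias=False)

    def _split(self, t):                          # (B, T, D) -> (B, H, T, d_head)
        B, T, _ = t.shape
        return t.view(B, T, self.n_heads, self.d_head).transpose(1, 2)

    def forward(self, x):                         # x: (B, T, D)
        q = self._split(self.w_q(x))              # per-expert query q_e
        k = self._split(self.w_k(x))              # shared keys
        v = self._split(self.w_v(x))              # shared values
        a = F.scaled_dot_product_attention(q, k, v, is_causal=True)   # a_e
        a = a.transpose(1, 2).reshape(x.shape)    # merge heads back to (B, T, D)
        return self.w_o(a)                        # per-expert output projection

# Example: experts sharing one pair of k/v projections
# shared_k = nn.Linear(512, 512, bias=False)
# shared_v = nn.Linear(512, 512, bias=False)
# experts = [AttentionExpert(w_k=shared_k, w_v=shared_v) for _ in range(4)]
```

Sharing Wk and Wv means the key/value projections are computed once regardless of how many attention experts are selected, which keeps the per-token cost low.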
A little coding
A very concise and quick PyTorch test Jupyter notebook for JetMoE can be found here (warning: you’ll need a lot of GPU memory). Alternatively, you can test the model directly using the Online Demo on Lepton AI (link).
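If you prefer to try the model through the transformers API directly, a minimal sketch looks like the following; it assumes the checkpoint is published on the Hugging Face Hub as jetmoe/jetmoe-8b (check JetMoE's page for the exact repo id) and, again, that you have enough GPU memory for the 8B model:

```python
# Minimal sketch for trying JetMoE-8B with Hugging Face transformers.
# The repo id "jetmoe/jetmoe-8b" is assumed; verify it on JetMoE's page.
# device_map="auto" requires the `accelerate` package.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "jetmoe/jetmoe-8b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

prompt = "The cheapest way to train an LLM is"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```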
Useful links
JetMoE article
H100 vs A100 performance comparison
Microsoft Llama-2-Onnx model details (link)
Article on training costs here
GPT-4 over $100 million training (here)
JetMoE’s page
Video – Rotary Positional Embedding (RoPE)
JetMoE Jupyter notebook
Online Demo on Lepton AI (link)
Also available on Substack