LLM training can be much cheaper than people generally thought
JetMoE is a recent Large Language Model (LLM) that reportedly outperforms LLaMA2-7B from Meta AI and was trained for 2 weeks on a 96×H100 GPU cluster, spending only ~$80,000…
But how much does it cost to train an LLM?
Training costs
A first oddity is that the JetMoE article does not explicitly mention any training costs for comparison with other models (except its own). Also, according to this page, the Llama2-7B model requires less than $85,000 to train – if that were the case, what would be the big economic benefit of JetMoE? Where did the JetMoE team get their figure for Llama2-7B's training cost, and why didn't they publish it for direct comparison?
Anyway, the JetMoE article reports training costs as GPU hours (specifically, Nvidia H100 GPU hours): JetMoE's training took 30,000 H100 GPU hours. For comparison, Microsoft's "optimized version of the Llama 2 model" reports the figures in the table below, expressed in A100 GPU hours (the H100 generally outperforms the previous-generation A100)…
Meta's largest LLaMA model, as of March 2023, used 2,048 Nvidia A100 GPUs to train on 1.4 trillion tokens (750 words is about 1,000 tokens), taking about 21 days: the cost was over $2.4 million. Analysts and technologists estimate that training a large language model such as OpenAI's GPT-3 could cost more than $4 million. You can find these numbers here.
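As a rough sanity check, you can convert GPU hours into dollars using assumed hourly cloud rates. The rates in the sketch below are my own assumptions (typical on-demand prices), not figures reported by the JetMoE or LLaMA papers:

```python
# Back-of-the-envelope training-cost estimates from GPU hours.
# Hourly rates are assumptions (typical cloud on-demand prices).

H100_HOURLY_RATE = 2.70   # assumed $/hour for one H100
A100_HOURLY_RATE = 2.20   # assumed $/hour for one A100

# JetMoE: ~30,000 H100 GPU hours
jetmoe_cost = 30_000 * H100_HOURLY_RATE
print(f"JetMoE estimate: ${jetmoe_cost:,.0f}")        # ~ $81,000

# LLaMA (largest model, March 2023): 2,048 A100s for ~21 days
llama_gpu_hours = 2_048 * 21 * 24
llama_cost = llama_gpu_hours * A100_HOURLY_RATE
print(f"LLaMA GPU hours: {llama_gpu_hours:,}")        # 1,032,192
print(f"LLaMA estimate: ${llama_cost:,.0f}")          # ~ $2.3 million
```

With these assumed rates, the numbers land in the same ballpark as the figures quoted above (~$80,000 for JetMoE, over $2 million for LLaMA), which is what makes the GPU-hour comparison meaningful.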
Training GPT-4 is estimated to have cost over $100 million (here).
Key Messages
This is taken directly from JetMoE's page.
- JetMoE-8B is trained with less than $ 0.1 million cost but outperforms LLaMA2-7B from Meta AI, who has multi-billion-dollar training resources. LLM training can be much cheaper than people generally thought.
- JetMoE-8B is very open and academia-friendly because:
- It only uses public datasets for training, and the code is open-sourced. No proprietary resource is needed.
- It can be finetuned with very limited compute budget (e.g., consumer-grade GPU) that most labs can afford.
- JetMoE-8B only has 2.2B active parameters during inference, which drastically lowers the computational cost. Compared to a model with similar inference computation, like Gemma-2B, JetMoE-8B achieves constantly better performance.
How JetMoE works
The JetMoE architecture is illustrated in the following figure.
The JetMoE architecture takes advantage of sparse activation in both the attention and feed-forward layers, significantly reducing training and inference costs.
Let x be the input vector and consider a learnable matrix Wr that controls the routing. Let s be the routing output:
s = Wr x .
The Sparse Mixture of Experts (SMoE) output y is represented by a relation of the type
y = g1 · f1(x) + g2 · f2(x) + · · · + gn · fn(x) .
It's just a weighted combination of n experts, represented by the functions fi with i = 1, 2, . . . , n (these are normally 2-layer MLPs or, in the case of Mixture of Attention, constructs of the type illustrated below). The "weights" gi are obtained by keeping only the top k logits of s (and taking their softmax), setting the rest to 0.
In essence, s is a vector whose larger components have a greater influence on the above combination defining the output y. The usefulness of this approach lies in the fact that if gi = 0 for several indices i, none of the corresponding fi(x) need to be evaluated, thus reducing computation cost during training and inference. The mechanism of a single attention expert is illustrated in the following figure.
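Here is a minimal PyTorch sketch of this top-k gating idea, assuming simple 2-layer MLP experts; the layer sizes, number of experts and k are illustrative and do not reproduce JetMoE's actual configuration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Minimal sketch of top-k Sparse Mixture-of-Experts routing.
# Sizes, n_experts and k are illustrative, not JetMoE's real settings.
class SparseMoE(nn.Module):
    def __init__(self, d_model=512, d_hidden=1024, n_experts=8, k=2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts, bias=False)   # W_r
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):                         # x: (batch, d_model)
        s = self.router(x)                        # routing logits: s = W_r x
        top_s, top_idx = s.topk(self.k, dim=-1)   # keep only the top-k logits
        g = F.softmax(top_s, dim=-1)              # softmax over the selected logits
        y = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = top_idx[:, slot] == e      # tokens routed to expert e in this slot
                if mask.any():                    # experts with g_i = 0 are never evaluated
                    y[mask] += g[mask, slot].unsqueeze(-1) * expert(x[mask])
        return y

# Example: y = SparseMoE()(torch.randn(4, 512))
```

Note how only the selected experts are ever called, which is exactly where the training and inference savings come from.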
Matrices Wk and Wv are shared across experts to improve training and inference efficiency, whereas matrices Wq and Wo (in orange) vary from one expert to another. ae is obtained by applying standard multi-head attention with RoPE to k, v and qe.
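The sketch below shows how a single attention expert of this kind could look, with Wk and Wv passed in as shared modules and Wq, Wo owned by the expert. RoPE is omitted for brevity and all dimensions are illustrative, so this is not JetMoE's actual implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Sketch of one attention expert with shared key/value projections.
# RoPE is omitted; head counts and sizes are illustrative only.
class AttentionExpert(nn.Module):
    def __init__(self, d_model=512, n_heads=8, w_k=None, w_v=None):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        # W_k and W_v are shared across experts (passed in); W_q and W_o are per-expert
        self.w_k = w_k if w_k is not None else nn.Linear(d_model, d_model, bias=False)
        self.w_v = w_v if w_v is not None else nn.Linear(d_model, d_model, bias=False)
        self.w_q = nn.Linear(d_model, d_model, bias=False)
        self.w_o = nn.Linear(d_model, d_model, bias=False)

    def _split(self, t):                          # (B, T, D) -> (B, H, T, d_head)
        B, T, _ = t.shape
        return t.view(B, T, self.n_heads, self.d_head).transpose(1, 2)

    def forward(self, x):                         # x: (B, T, D)
        q = self._split(self.w_q(x))              # per-expert query q_e
        k = self._split(self.w_k(x))              # shared keys
        v = self._split(self.w_v(x))              # shared values
        a = F.scaled_dot_product_attention(q, k, v, is_causal=True)   # a_e
        a = a.transpose(1, 2).reshape(x.shape)    # merge heads back to (B, T, D)
        return self.w_o(a)                        # per-expert output projection

# Example: experts sharing one pair of k/v projections
# shared_k = nn.Linear(512, 512, bias=False)
# shared_v = nn.Linear(512, 512, bias=False)
# experts = [AttentionExpert(w_k=shared_k, w_v=shared_v) for _ in range(4)]
```

Sharing Wk and Wv means the key/value projections are computed once regardless of how many attention experts are selected, which keeps the per-token cost low.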
A little coding
A very concise and quick PyTorch test Jupyter notebook for JetMoE can be found here (warning: you’ll need a lot of GPU memory). Alternatively, you can test the model directly using the Online Demo on Lepton AI (link).
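If you prefer to try the model through the transformers API directly, a minimal sketch looks like the following; it assumes the checkpoint is published on the Hugging Face Hub as jetmoe/jetmoe-8b (check JetMoE's page for the exact repo id) and, again, that you have enough GPU memory for the 8B model:

```python
# Minimal sketch for trying JetMoE-8B with Hugging Face transformers.
# The repo id "jetmoe/jetmoe-8b" is assumed; verify it on JetMoE's page.
# device_map="auto" requires the `accelerate` package.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "jetmoe/jetmoe-8b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

prompt = "The cheapest way to train an LLM is"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```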
Useful links
JetMoE article
H100 vs A100 performance comparison
Microsoft Llama-2-Onnx model details (link)
Article on training costs here
GPT-4 over $100 million training (here)
JetMoE’s page
Video – Rotary Positional Embedding (RoPE)
JetMoE Jupyter notebook
Online Demo on Lepton AI (link)
Also available on Substack