What is the Chinchilla Point?
The Chinchilla Point, or Chinchilla Optimal, is a benchmark in the efficient training of large language models (LLMs). It defines a strategic balance between the model’s size (parameters) and the training dataset size (tokens).
Maintaining this balance was proposed as a way to optimise the performance of LLMs without disproportionately increasing computational resource usage (compute) during training.
The Chinchilla Point benchmark was originally described by researchers at DeepMind, Google’s artificial intelligence research laboratory, in a 2022 paper titled “Training Compute-Optimal Large Language Models”.
Chinchilla Scaling Laws
The Chinchilla Scaling Laws were originally outlined in DeepMind’s research in “Training Compute-Optimal Large Language Models”. In this paper, they aimed to answer the question:
Given a fixed FLOPs budget, how should one trade-off model size and the number of training tokens?
Here, FLOPs stands for floating-point operations, a measure used to quantify the computational work involved in training or running a neural network.
This term refers to the total number of floating-point operations that would be performed during the complete training of a model, providing a standard measure of computational workload and resource usage.
For our purposes, think of FLOPs as the amount of compute available to train the model.
Since the compute budget is finite—as is the energy required to power this compute—strategic scaling of model size and training data is critical to optimising resource use.
In the paper, the researchers estimated the optimal training FLOPs and training tokens for various model sizes. The estimates were based on comprehensive empirical data, involving the training of models ranging from 70 million to over 16 billion parameters on 5 to 500 billion tokens.
Below is a modified version of a table included in the paper, with a column showing the tokens per parameter.
Parameters | FLOPs | Tokens | Tokens per Parameter |
---|---|---|---|
400 Million | 1.92e+19 | 8.0 Billion | 20 |
1 Billion | 1.21e+20 | 20.2 Billion | 20.2 |
10 Billion | 1.23e+22 | 205.1 Billion | 20.51 |
67 Billion | 5.76e+23 | 1.5 Trillion | 22.39 |
175 Billion | 3.85e+24 | 3.7 Trillion | 21.14 |
280 Billion | 9.90e+24 | 5.9 Trillion | 21.07 |
520 Billion | 3.43e+25 | 11.0 Trillion | 21.15 |
1 Trillion | 1.27e+26 | 21.2 Trillion | 21.2 |
10 Trillion | 1.30e+28 | 216.2 Trillion | 21.62 |
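To get a feel for these numbers, below is a minimal Python sketch that reproduces the table’s ratios. It assumes the widely used rough approximation that training compute is C ≈ 6 × N × D FLOPs, where N is the parameter count and D is the number of training tokens; this approximation is an assumption on my part rather than a figure quoted from the paper.

```python
# A minimal sketch reproducing the ratios in the table above, assuming the
# common rough approximation that training compute C ≈ 6 * N * D FLOPs,
# where N = parameters and D = training tokens (not an exact figure from the paper).

rows = [
    # (parameters, training tokens), taken from the table above
    (400e6, 8.0e9),
    (1e9, 20.2e9),
    (10e9, 205.1e9),
    (67e9, 1.5e12),
    (175e9, 3.7e12),
    (280e9, 5.9e12),
    (520e9, 11.0e12),
    (1e12, 21.2e12),
    (10e12, 216.2e12),
]

for n_params, n_tokens in rows:
    flops = 6 * n_params * n_tokens          # approximate training compute
    tokens_per_param = n_tokens / n_params   # the ~20:1 ratio discussed above
    print(f"{n_params / 1e9:>8.1f}B params | {flops:.2e} FLOPs | {tokens_per_param:.2f} tokens/param")
```

Running this reproduces the FLOPs column to within the rounding used in the table, and shows the tokens-per-parameter ratio hovering around 20 to 22 across the whole range.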
The researchers called this the “compute-optimal frontier”, and highlighted:
“… for compute-optimal training, the model size and the number of training tokens should be scaled equally: for every doubling of model size the number of training tokens should also be doubled.”
We can observe from the table above that the ratio of tokens to parameters sits at roughly 20 to 22 for models between 400 million and 10 trillion parameters, which underpins the researchers’ comment that:
While there is significant uncertainty extrapolating out many orders of magnitude, our analysis clearly suggests that given the training compute budget for many current LLMs, smaller models should have been trained on more tokens to achieve the most performant model.
This sets the stage for the origin of the terms “Chinchilla Optimal” and “Chinchilla Point”.
Origin of the Term “Chinchilla Point”
After the researchers at DeepMind had developed an estimate for the optimal training FLOPs and training tokens for various model sizes, they wanted to put this to the test.
They had an existing model called Gopher, which was not optimised on this “compute-optimal frontier”.
Gopher is a 280-billion-parameter model trained on 300 billion training tokens. Based on the scaling laws the researchers had uncovered, this model was vastly oversized from a tokens-per-parameter perspective (roughly 1 token per parameter, versus the ~20 suggested by the compute-optimal frontier):
Based on our estimated compute-optimal frontier, we predict that for the compute budget used to train Gopher, an optimal model should be 4 times smaller, while being trained on 4 times more tokens.
To test this hypothesis, the researchers created a new, more compute-optimal model, training a 70-billion-parameter model on 1.4 trillion tokens. They called this new model: Chinchilla.
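As a rough sanity check, the same C ≈ 6 × N × D approximation used above (an assumption here, not the paper’s own FLOP accounting) shows the two models sitting at a similar compute budget despite Chinchilla being 4 times smaller:

```python
# Rough compute comparison of Gopher vs Chinchilla using C ≈ 6 * N * D.
# This is an approximation, not the paper's own FLOP accounting.

def approx_train_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training compute in FLOPs."""
    return 6 * n_params * n_tokens

gopher_flops = approx_train_flops(280e9, 300e9)        # 280B params, 300B tokens
chinchilla_flops = approx_train_flops(70e9, 1.4e12)    # 70B params, 1.4T tokens

print(f"Gopher:     ~{gopher_flops:.2e} FLOPs, {300e9 / 280e9:.1f} tokens per parameter")
print(f"Chinchilla: ~{chinchilla_flops:.2e} FLOPs, {1.4e12 / 70e9:.1f} tokens per parameter")
```

Both come out in the same ballpark (roughly 5.0e+23 and 5.9e+23 FLOPs), but Chinchilla spends that budget at roughly 20 tokens per parameter rather than roughly one.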
The name contrasts the new, smaller but more effectively scaled model with the larger Gopher model.
This birthed the phrase “Chinchilla Point”, denoting the compute sweet spot in the ratio of parameters to tokens for optimal training of large language models.
Llama 3 and the Chinchilla Point
On April 18, 2024, Meta released Llama 3, their most powerful open-source large language model to date.
This was a landmark release, with comparable performance to the latest proprietary large language models, such as GPT-4 by OpenAI and Claude 3 by Anthropic, while also being significantly cheaper to run than Meta’s previous release, Llama 2.
What drove much of the discussion about this release in the AI community were comments regarding the training of Llama 3, made by Meta in their release post and by Mark Zuckerberg in subsequent podcast interviews.
In Meta’s Llama release post, they reported that:
We made several new observations on scaling behavior during the development of Llama 3.
For example, while the Chinchilla-optimal amount of training compute for an 8B parameter model corresponds to ~200B tokens, we found that model performance continues to improve even after the model is trained on two orders of magnitude more data.
Both our 8B and 70B parameter models continued to improve log-linearly after we trained them on up to 15T tokens.
Larger models can match the performance of these smaller models with less training compute, but smaller models are generally preferred because they are much more efficient during inference.
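To put that over-training in perspective, here is a quick back-of-the-envelope calculation using the figures Meta quotes above (~200B Chinchilla-optimal tokens for an 8B-parameter model versus the ~15T tokens actually used):

```python
# Back-of-the-envelope: how far past the Chinchilla-optimal point Llama 3 8B was trained.
# Both figures are taken from the Meta quote above.

chinchilla_optimal_tokens = 200e9   # ~200B tokens for an 8B-parameter model
llama3_tokens = 15e12               # ~15T tokens actually used

print(f"Over-training factor: ~{llama3_tokens / chinchilla_optimal_tokens:.0f}x")
print(f"Tokens per parameter: ~{llama3_tokens / 8e9:.0f}")
```

That works out to roughly 75 times the Chinchilla-optimal token count, or close to 1,900 tokens per parameter, consistent with Meta’s “two orders of magnitude” observation.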
In discussing the training of Llama 3 on Dwarkesh Patel’s podcast (great podcast, worth a listen), the founder and CEO of Meta, Mark Zuckerberg, made the following comment:
Transcribed:
One of the interesting things about it, that we saw even with the 70 billion [parameter model], is we thought it would get more saturated.
It’s like we trained it on around 15 trillion tokens. I guess our prediction going in was that it was going to asymptote more. But even by the end, it was still learning.
It’s like we probably could have fed it more tokens and it would have gotten somewhat better.
But I mean, at some point, you’re running a company, you need to do these meta [ha] reasoning questions of like, “All right, do I want to spend our GPUs on training this 70 billion model further? Do we want to get on with it so we can start testing hypotheses for Llama 4?”
So we needed to make that call. And I think we got to a reasonable balance for this version of the 70 billion [parameter model]. There’ll be others in the future where the 70 billion multimodal one that’ll come over the next period.
But yeah, I mean, that was fascinating that the architectures at this point can just take so much data.
This is a significant shake-up to the previously observed relationship between training data volume and model performance as described by the Chinchilla Point.
The findings from Llama 3 suggest that the optimal training point may vary more with specific model architectures and training goals than previously understood.
This calls for a more flexible approach to applying the Chinchilla Scaling Laws, particularly as AI developers aim to optimise both performance during training and efficiency during inference.
The broader implications for the AI research community could be substantial. Researchers and tech companies looking to train future LLMs might need to reconsider the efficiency benchmarks established by the Chinchilla Point in light of this new information.
This could lead to a new paradigm in training large language models where the “compute-optimal” balance is adjusted based on empirical results from advanced models like Llama 3.
Is the Chinchilla Point All That Matters?
The Chinchilla Point is an attempt to find the optimal ratio between the model’s size (parameters) and the training dataset size (tokens).
However, as noted by the researchers at DeepMind, model size and token count are not the only choices to make when selecting a language model and a procedure to train it.
Other noteworthy factors include:
- The learning rate
- The learning rate schedule
- The batch size
- The optimiser
- The width-to-depth ratio
It’s possible that the benefits Meta observed training Llama 3 past the Chinchilla Point stem from differences in these factors.
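Purely as an illustration of where these knobs sit in practice, here is a hypothetical training configuration. Every name and value below is invented for illustration and is not taken from DeepMind’s or Meta’s actual setups.

```python
# Hypothetical training configuration, purely illustrative.
# None of these names or values come from DeepMind's or Meta's actual training runs.

training_config = {
    "n_params": 70e9,               # model size (parameters)
    "n_tokens": 1.4e12,             # training dataset size (tokens)
    "learning_rate": 1e-4,          # peak learning rate
    "lr_schedule": "cosine",        # learning rate schedule
    "batch_size_tokens": 4e6,       # tokens processed per optimisation step
    "optimizer": "adamw",           # choice of optimiser
    "width_to_depth_ratio": 128,    # e.g. hidden width divided by layer count
}
```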
The Chinchilla Scaling Laws also bake in the assumption that training data (tokens) is an interchangeable commodity. But given that many papers have observed better models when using “higher quality” data, it is possible that the benefits Meta observed from training past the Chinchilla Point stem from differences in the underlying training data.
Note that this difference could cut either way: lower-quality data might benefit from further “post-Chinchilla” training, or higher-quality data might (where “quality” here is relative to the original data DeepMind used when deriving the Chinchilla Scaling Laws).
How Many Training Tokens Was Chinchilla Trained On?
The Chinchilla model, developed by DeepMind, was trained using approximately 1.4 trillion tokens with 70 billion parameters. This training setup was part of an initiative to discover the most compute-optimal configurations for large language models.
By using such a significant number of tokens (compared to their previous model, Gopher), the researchers aimed to balance model size with training data, enhancing performance given a fixed amount of computational resources.
DeepMind’s strategy was to align the number of parameters with a suitable volume of training tokens. This approach for Chinchilla was designed to optimise the model’s ability to learn and generalise across various linguistic tasks, reinforcing its efficiency.
This shift to a more data-intensive training regimen reflects a broader move away from larger, less efficiently trained models. By adhering to these newly established scaling laws, DeepMind set a new benchmark for future AI training methodologies, emphasising the balance between resource utilisation and performance.
That said, as discussed above, the latest empirical research from Meta’s Llama 3 has revealed that large language models might continue to benefit from increased training data beyond the previously established benchmarks, challenging the assumptions underpinning the Chinchilla Scaling Laws.