Scaling Laws Overview

The loss of next-token prediction is predictable and smooth.
We only need to know two variables to estimate the loss: the total number of model parameters (N) and the number of text tokens used to train the model (D)

As we train larger models on more data, we continue to see predictable improvements in performance

Scaling laws also help us understand the relationship between model quality and model size, training data, and computational resources, which has been used to guide LLM pre-training

Additionally, the risk of overfitting is closely related to the ratio of model size to dataset size. OpenAI’s paper recommends an equation to avoid overfitting: roughly D ≳ (5 × 10^3) · N^0.74
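
A minimal sketch of what that heuristic implies in practice, assuming the D ≳ 5e3 · N^0.74 relation above; the model sizes in the loop are arbitrary examples, not values from the paper:

```python
# Back-of-the-envelope check of the overfitting heuristic reported in the
# OpenAI scaling-laws paper: D should be at least ~5e3 * N**0.74
# (D in tokens, N in non-embedding parameters).
def min_tokens_to_avoid_overfitting(n_params: float) -> float:
    return 5e3 * n_params ** 0.74

for n in (1.5e9, 7e9, 70e9):  # illustrative model sizes
    print(f"N = {n:.1e} -> want D >= {min_tokens_to_avoid_overfitting(n):.2e} tokens")
```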

Empirical Experiments

Given a compute budget (measured in FLOPs), what is the optimal combination of model size and training-data size? This is the core question.

Two significant findings

The OpenAI scaling-laws paper measures compute in PF-days (petaFLOP-days), where peta denotes a factor of 1e15: 1 PF-day = 10^15 FLOP/s × 24 h × 3600 s/h ≈ 8.64 × 10^19 FLOPs. [1]
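
A one-line conversion from raw FLOP counts into this unit; the 3.14e23 figure in the example is a commonly cited estimate of GPT-3’s training compute, used only for illustration:

```python
# Convert a raw FLOP count into PF-days, the compute unit used in the paper.
PF_DAY_FLOPS = 1e15 * 24 * 3600  # 1 PF-day ≈ 8.64e19 FLOPs

def flops_to_pf_days(flops: float) -> float:
    return flops / PF_DAY_FLOPS

# ~3.14e23 FLOPs is a commonly cited estimate of GPT-3's training compute.
print(flops_to_pf_days(3.14e23))  # on the order of a few thousand PF-days
```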

In OpenAI’s paper, several experiments were conducted.
The basic setup is

  • Model size (N): 768 to 1.5 billion non-embedding parameters
  • Dataset size (D): 22 million to 23 billion tokens
  • Model shape: depth, width, number of attention heads, feed-forward dimension
  • Context length: 1024
  • Test cross-entropy loss (L): measures performance
  • Compute (C): computational resources used for training

These experiments are about five years old, and much has changed since then: models and datasets are far larger, and context lengths are longer (8K during training for Llama 3). But together with the laws the paper reveals, these experiments still provide an insightful perspective on LLM pre-training

Two conclusions were drawn:

  • The impact of scale is more significant than model architecture (transformer details)
    • Scale here refers to parameters (N), dataset size (D), and computational resources (C)
  • There is a power-law relationship between model performance and each of the scaling factors (when they are not constrained by one another)

(Figure: the scaling-law equations summarized in the OpenAI paper [1])

The power-law trend is illustrated by the first four equations in the figure.
When we scale up one of these factors, we can expect a corresponding and predictable improvement in the model’s performance, following a power-law trend (see the sketch below)
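
A small sketch of the single-factor power laws, using the approximate constants reported in the OpenAI paper (α_N ≈ 0.076, N_c ≈ 8.8e13 non-embedding parameters; α_D ≈ 0.095, D_c ≈ 5.4e13 tokens); the N and D values passed in are arbitrary:

```python
# Single-factor power laws from the OpenAI scaling-laws paper:
#   L(N) = (N_c / N)**alpha_N   when data and compute are not bottlenecks
#   L(D) = (D_c / D)**alpha_D   when the model is large enough
ALPHA_N, N_C = 0.076, 8.8e13   # N in non-embedding parameters
ALPHA_D, D_C = 0.095, 5.4e13   # D in tokens

def loss_from_params(n: float) -> float:
    return (N_C / n) ** ALPHA_N

def loss_from_tokens(d: float) -> float:
    return (D_C / d) ** ALPHA_D

# Doubling N (or D) multiplies the loss by a fixed factor 2**(-alpha),
# which is what "predictable power-law improvement" means in practice.
print(loss_from_params(1.5e9), loss_from_params(3.0e9))
print(loss_from_tokens(2.3e10), loss_from_tokens(4.6e10))
```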

Sample-Efficient LLMs

Experiments found that, for a given number of tokens processed, larger models achieve lower loss than smaller models (larger models are more sample-efficient: they learn more from each sample)
(Figure: learning curves showing larger models reaching lower loss for the same number of tokens, from the OpenAI paper [1])

But in DeepMind’s paper, the dataset should be scaled up along with the model: the same performance can be achieved with fewer parameters by training on more tokens
(Figures from DeepMind’s compute-optimal training paper [2])

The models in DeepMind’s experiments are much larger than OpenAI’s; OpenAI’s experiments may not have been at a large enough scale, which could explain the differing conclusions.

(Figure: IsoLoss contours and the efficient frontier, from DeepMind’s paper [2])

DeepMind’s paper offers a new way of interpreting LLM scaling laws.
Plotting IsoLoss contours, we can find the point that uses the fewest FLOPs on each curve; these points trace out the efficient frontier (the blue line)
It means that, given a compute budget, we can find the optimal model size and predict the loss the model would achieve (see the sketch below)
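
A numerical sketch of this idea, assuming the common C ≈ 6·N·D approximation for training FLOPs and the parametric loss fit reported in the DeepMind paper (E ≈ 1.69, A ≈ 406.4, B ≈ 410.7, α ≈ 0.34, β ≈ 0.28); the sweep bounds and budget are arbitrary choices, and the exact optimum depends on which fit is used:

```python
# Sweep model sizes under a fixed FLOP budget (C ≈ 6*N*D), score each split
# with the parametric loss L(N, D) = E + A/N**alpha + B/D**beta, and keep the
# point with the lowest predicted loss -- a numerical "efficient frontier" point.
E, A, B, ALPHA, BETA = 1.69, 406.4, 410.7, 0.34, 0.28  # approximate fitted values

def predicted_loss(n_params: float, n_tokens: float) -> float:
    return E + A / n_params ** ALPHA + B / n_tokens ** BETA

def compute_optimal(flop_budget: float) -> tuple[float, float, float]:
    best = None
    n = 1e8
    while n < 1e13:                    # geometric sweep over candidate model sizes
        d = flop_budget / (6 * n)      # tokens affordable at this model size
        candidate = (predicted_loss(n, d), n, d)
        best = candidate if best is None or candidate[0] < best[0] else best
        n *= 1.05
    return best

loss, n, d = compute_optimal(5.76e23)  # budget roughly at Chinchilla scale
print(f"optimal N ≈ {n:.2e}, D ≈ {d:.2e}, predicted loss ≈ {loss:.3f}")
```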

For compute-optimal training, DeepMind suggests using roughly 20 training tokens for every model parameter
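
A minimal sketch of how that rule of thumb splits a budget, again assuming C ≈ 6·N·D; chinchilla_split is a name chosen here for illustration:

```python
import math

# Split a FLOP budget using the ~20-tokens-per-parameter rule of thumb.
# With C ≈ 6*N*D and D ≈ 20*N we get C ≈ 120*N**2, so N ≈ sqrt(C/120).
def chinchilla_split(flop_budget: float) -> tuple[float, float]:
    n_params = math.sqrt(flop_budget / 120)
    n_tokens = 20 * n_params
    return n_params, n_tokens

n, d = chinchilla_split(5.76e23)  # roughly the Chinchilla training budget
print(f"N ≈ {n:.2e} parameters, D ≈ {d:.2e} tokens")  # about 7e10 params, 1.4e12 tokens
```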

Scaling law equations

OpenAI’s scaling law (from [1]):

L(N, D) = [ (N_c / N)^(α_N / α_D) + D_c / D ]^(α_D)

DeepMind’s scaling law (from [2]):

L(N, D) = E + A / N^α + B / D^β
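
The same two laws written as functions, with the approximate fitted constants reported in the respective papers; treat the outputs as rough estimates rather than exact predictions:

```python
def openai_loss(n_params: float, n_tokens: float) -> float:
    # L(N, D) = [(N_c/N)**(alpha_N/alpha_D) + D_c/D]**alpha_D
    alpha_n, alpha_d = 0.076, 0.095
    n_c, d_c = 8.8e13, 5.4e13
    return ((n_c / n_params) ** (alpha_n / alpha_d) + d_c / n_tokens) ** alpha_d

def deepmind_loss(n_params: float, n_tokens: float) -> float:
    # L(N, D) = E + A/N**alpha + B/D**beta
    e, a, b, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28
    return e + a / n_params ** alpha + b / n_tokens ** beta

# Example: a 70B-parameter model trained on 1.4T tokens (Chinchilla-like scale).
print(openai_loss(70e9, 1.4e12), deepmind_loss(70e9, 1.4e12))
```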

Deeper

Still fuzzy: exactly how to compute and predict with these laws is not yet clear to me. I don’t know enough yet and need to come back to this.

Footnotes

  1. Kaplan et al., “Scaling Laws for Neural Language Models”, arXiv:2001.08361

  2. Hoffmann et al., “Training Compute-Optimal Large Language Models”, arXiv:2203.15556