"How many tokens do I need?" has eight different right answers, depending on whether you mean pre-training, continued pre-training, SFT, DPO, RLHF, distillation, RAG indexing, or evaluation. The Chinchilla scaling law was the first quantitative answer in 2022; three years on, the industry has both validated and substantially exceeded it.
Key takeaways
- Modern pre-training uses 100–500 tokens per parameter, not the Chinchilla 20 — inference compute dominates total cost.
- Continued pre-training sweet spot: 1–10 B in-domain tokens with a 1:4 in-domain to general mix.
- SFT sweet spot: 1 000–10 000 curated examples. Above 100 000, diminishing returns.
- DPO and preference optimization: 10 000–50 000 pairs for style/safety tuning.
- Eval set: 200–500 examples, never leaked into training. The discipline matters more than the size.
In this article
- The Chinchilla baseline — and why most modern models exceed it
- Continued pre-training — moving the needle on a specific domain
- SFT — examples, not tokens
- DPO and preference optimization — preference pairs
- Evaluation data — the smallest, the most important
- RAG corpus — quality of chunks, not quantity of tokens
- Estimating cost before you commit
- The diminishing-returns line
- Quick reference — token volumes by stage
- Bottom line
- Frequently asked questions
“How many tokens do I need to train my model?” has eight different right answers, depending on whether you mean pre-training, continued pre-training, SFT, DPO, RLHF, distillation, RAG indexing, or evaluation. The Chinchilla scaling law gave the field its first quantitative answer in 2022. Three years later, the industry has both validated and substantially exceeded it. This guide walks through the actual data-size guidance for each training stage in 2026, the diminishing-returns thresholds, and the trade-offs that determine whether you should aim for “compute-optimal” or “inference-optimal” data volumes.
The Chinchilla baseline — and why most modern models exceed it
Inference compute wins
Llama 3 trained its 8 B model on 15 T tokens (1 900 tokens/parameter, nearly 100× Chinchilla’s 20). Why? Inference compute dominates total cost over a model’s lifetime. Over-training trades training cost for cheaper inference forever — a winning trade for any deployed model.
The Chinchilla paper from DeepMind (Hoffmann et al., 2022) established that, for a fixed compute budget, you should train a smaller model on more tokens rather than a larger model on fewer tokens. The headline finding: roughly 20 tokens per parameter is compute-optimal for pre-training. A 7 B model wants ~140 B tokens; a 70 B model wants ~1.4 T tokens.
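As a sanity check, the rule reduces to two lines of arithmetic. A minimal sketch using the standard C ≈ 6·N·D FLOP approximation (the 20 tokens/parameter ratio is Chinchilla’s headline figure, not an exact law):

```python
# Chinchilla rule of thumb: compute-optimal token budget D ≈ 20 × N,
# with total training compute C ≈ 6 × N × D FLOPs.
def chinchilla_optimal(n_params: float, tokens_per_param: float = 20.0):
    d_tokens = tokens_per_param * n_params
    c_flops = 6 * n_params * d_tokens
    return d_tokens, c_flops

for n in (7e9, 70e9):
    d, c = chinchilla_optimal(n)
    print(f"{n / 1e9:.0f}B params -> {d / 1e9:.0f}B tokens, {c:.2e} training FLOPs")
# 7B  ->  140B tokens, 5.88e+21 FLOPs
# 70B -> 1400B tokens, 5.88e+23 FLOPs
```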
Three years later, every major open-weights release has gone well beyond Chinchilla-optimal:
- Llama 3 (8 B and 70 B): 15 T tokens. That is ~1 900 tokens/parameter for the 8 B and ~210 for the 70 B.
- DeepSeek-V3 (671 B mixture-of-experts, 37 B active): 14.8 T tokens.
- Qwen 2.5 (7 B): 18 T tokens — over 2 500 tokens per parameter.
This is not a refutation of Chinchilla; it is a different optimization. Chinchilla optimizes training compute. Modern teams optimize inference compute: a model that is “over-trained” (more tokens per parameter than Chinchilla-optimal) achieves the same quality at smaller parameter count, which means cheaper inference for the entire lifetime of the model. Inference compute dwarfs training compute over a model’s deployed life, so the trade-off favours over-training. If you are pre-training from scratch in 2026, aim for 100–500 tokens per parameter, not 20.
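A rough FLOP count makes the trade-off concrete: training costs ≈ 6·N·D FLOPs while serving costs ≈ 2·N FLOPs per generated token, so lifetime inference overtakes training once tokens served exceed 3 × D, regardless of model size. A sketch with a hypothetical serving volume:

```python
# Training ≈ 6·N·D FLOPs; inference ≈ 2·N FLOPs per generated token.
# Inference compute overtakes training once tokens_served > 3 × D.
D = 15e12                 # training tokens (Llama-3-scale over-training)
breakeven = 3 * D         # 45T tokens served, independent of model size
daily_tokens = 100e9      # hypothetical serving volume: 100B tokens/day
print(f"Inference overtakes training after {breakeven / daily_tokens:.0f} days")
# -> 450 days of serving at this volume
```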
Continued pre-training — moving the needle on a specific domain
Continued pre-training resumes the base model’s autoregressive language-modelling objective on an in-domain corpus. The data-size thresholds are well established in the literature:
- Below ~500 M tokens of in-domain data: measurable improvement on in-domain perplexity, often at the cost of out-of-domain capability. Marginal benefit over SFT for most use cases.
- 1–10 B tokens in-domain: the sweet spot for vertical-language LLMs (medical, legal, finance, code). Enough to shift token frequencies and acquire domain vocabulary without catastrophically forgetting general competencies.
- 10–100 B tokens in-domain: approaches the volume of a small dedicated pre-training. Genuinely changes the model’s distribution. Reserve for cases where a vertical foundation model is the deliverable.
One practical rule: maintain a 1:4 in-domain to general-corpus ratio during continued pre-training. A pure in-domain run almost always degrades general fluency more than the in-domain gain justifies.
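A minimal sketch of that 1:4 mix as a document-sampling loop (the iterator names are hypothetical; production pipelines usually set dataset weights at the shard level instead):

```python
import random
from itertools import cycle, islice

def mixed_stream(in_domain_docs, general_docs, in_domain_frac=0.2, seed=0):
    """Yield documents at roughly 1 in-domain for every 4 general (a 1:4 mix)."""
    rng = random.Random(seed)
    while True:
        pool = in_domain_docs if rng.random() < in_domain_frac else general_docs
        yield next(pool)

# Toy usage: cycle two corpora and draw a mixed batch of 10 documents.
stream = mixed_stream(cycle(["legal doc"]), cycle(["web doc"]))
print(list(islice(stream, 10)))   # ~2 in-domain docs per 10 drawn, on average
```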
Need 1–10 B in-domain tokens of French regulatory text?
Our corpus ships 460 M tokens of finance, regulation, and economic French text in tiered packages — the right volume to swap into a continued pre-training mix.
SFT — examples, not tokens
For supervised fine-tuning, the metric is examples (instruction–response pairs), not tokens. The empirical thresholds, validated across hundreds of public fine-tunes:
- 50–500 examples: teaches the model a specific format or persona. Useful for tightly scoped tasks (always answer in JSON, always cite sources, refuse a specific class of request).
- 1 000–10 000 examples: the LIMA range. With careful curation, this band produces high-quality general-purpose assistants. Most public open-source SFT recipes sit here.
- 10 000–100 000 examples: diminishing returns set in. Each additional 10 K examples often adds 0.5–1 % on benchmarks, sometimes less. Worth it only if the additional examples cover gaps identified in evaluation.
- Above 100 000 examples: needed for multi-task SFT covering a broad capability surface. Standard for the post-training stage of public foundation models. Rarely needed for vertical fine-tunes.
Quality dominates quantity throughout this regime. A team that ships 2 000 expert-reviewed SFT examples will outperform a team that ships 50 000 LLM-generated examples on the same task. The composition of an SFT dataset is covered in detail in our companion article on SFT dataset formats.
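For concreteness, here is what a single curated record might look like in the widely used JSONL “messages” layout (the content and file name are illustrative; field conventions vary by framework, as the companion article details):

```python
import json

# One SFT record in the common chat-messages format (illustrative content).
example = {
    "messages": [
        {"role": "system",
         "content": "Answer in JSON with keys 'answer' and 'source'."},
        {"role": "user",
         "content": "What is the first AMF major-holding notification threshold?"},
        {"role": "assistant",
         "content": '{"answer": "5 % of capital or voting rights", '
                    '"source": "Code de commerce, art. L. 233-7"}'},
    ]
}
with open("sft_train.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(example, ensure_ascii=False) + "\n")
```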
DPO and preference optimization — preference pairs
Direct Preference Optimization and its variants (ORPO, KTO, SimPO) train on pairs of preferred and rejected responses to the same prompt. Data-size guidance:
- 5 000–20 000 preference pairs: sufficient to tune style, formatting, and safety behaviour after SFT. The post-SFT alignment step for most open-weights releases.
- 50 000–200 000 preference pairs: needed when DPO is the primary alignment vehicle (no separate SFT or weak SFT). The volume used by frontier alignment pipelines.
Preference data quality is harder to measure than SFT quality because it is binary (preferred vs rejected) and depends on human-judgment consistency. Use the agreement rate between independent annotators as a quality proxy — below 75 %, the data is too noisy to train on.
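The agreement proxy is a one-liner over doubly-annotated pairs. A sketch computing raw percent agreement (Cohen’s kappa is the usual chance-corrected upgrade):

```python
def agreement_rate(labels_a, labels_b):
    """Raw percent agreement between two annotators over the same pairs."""
    assert len(labels_a) == len(labels_b) and labels_a
    return sum(a == b for a, b in zip(labels_a, labels_b)) / len(labels_a)

# Which response each annotator preferred for the same five prompts:
ann1 = ["A", "A", "B", "A", "B"]
ann2 = ["A", "B", "B", "A", "B"]
rate = agreement_rate(ann1, ann2)          # 0.8
print(f"{rate:.0%} -> {'train on it' if rate >= 0.75 else 'too noisy'}")
```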
Evaluation data — the smallest, the most important
An eval set is small in absolute terms but disproportionate in impact: it gates every decision about training.
- 20–50 examples: enough to detect catastrophic regressions during development. Run after every training change. Must never overlap with the training data.
- 200–500 examples: the production eval set. Stratified by task, source, and difficulty. Used for go/no-go decisions and tracked over time as a “regression budget”.
- 1 000–5 000 examples: for benchmark releases or when shipping a model to a customer who will run their own evaluation. Documented with clear scope, scoring rubric, and known limitations.
One discipline that matters more than size: never let an eval example leak into the training data. It is the fastest way to inflate apparent quality, and the fastest way to lose customer trust once the inflation is discovered.
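One cheap enforcement mechanism is an n-gram overlap check that runs before every training job. A minimal sketch (exact 8-gram hashing catches verbatim leaks; production pipelines add fuzzy matching on top):

```python
import hashlib
import re

def ngram_hashes(text, n=8):
    """Hashes of every word n-gram of lightly normalized text."""
    words = re.findall(r"\w+", text.lower())
    return {hashlib.md5(" ".join(words[i:i + n]).encode()).hexdigest()
            for i in range(len(words) - n + 1)}

def find_leaks(eval_texts, train_texts, n=8):
    """Return eval items that share any word n-gram with the training set."""
    train_grams = set()
    for t in train_texts:
        train_grams |= ngram_hashes(t, n)
    return [e for e in eval_texts if ngram_hashes(e, n) & train_grams]

# Gate the run: a non-empty result means contamination, fix before training.
# leaked = find_leaks(eval_set, training_corpus)
```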
RAG corpus — quality of chunks, not quantity of tokens
For retrieval-augmented generation, the question of “how much data” reduces to “how much queryable, well-chunked, well-embedded text”. The thresholds where things change:
- Under 10 000 documents: a simple vector index (FAISS locally, or a managed store such as Qdrant or Pinecone) on a single machine. Embedding generation takes hours. The retrieval bottleneck is recall, not latency.
- 10 000 – 1 million documents: hybrid search (BM25 + vectors) becomes worthwhile. Re-ranker on top of retriever starts to materially improve precision.
- Above 1 million documents: the index becomes the dominant operational cost. Hierarchical retrieval (cluster-first, then chunk-rerank) replaces flat search. Embedding model choice matters more than vector store choice.
Doubling the corpus rarely doubles answer quality. Doubling the chunk-quality (cleaner extractions, smarter chunking, deduplication) often does. RAG performance is overwhelmingly bottlenecked by chunk quality and re-ranker quality, not raw token count.
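Deduplication is the cheapest of those chunk-quality levers. A sketch of exact dedup on normalized chunks (near-duplicate detection, e.g. MinHash, is the usual next step and is covered in our deduplication guide):

```python
import hashlib
import re

def dedup_chunks(chunks):
    """Keep the first occurrence of each chunk, comparing normalized text."""
    seen, kept = set(), []
    for chunk in chunks:
        normalized = re.sub(r"\s+", " ", chunk.lower()).strip()
        key = hashlib.sha1(normalized.encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            kept.append(chunk)
    return kept

docs = ["Article 5 applies.", "article 5   applies.", "Article 6 applies."]
print(dedup_chunks(docs))   # ['Article 5 applies.', 'Article 6 applies.']
```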
Estimating cost before you commit
A back-of-envelope formula that gets you within 30 % of the real GPU-hour cost for a pre-training or continued-pre-training run:
GPU-hours ≈ (6 × N_params × N_tokens) / (GPU_TFLOPS × 1e12 × 3600 × utilization), with utilization = 0.4–0.55 on modern stacks
Plugging in a 7 B continued pre-training on 5 B tokens, with an H100 at ~495 TFLOPS dense bf16 (the oft-quoted 989 figure assumes 2:4 sparsity) and 50 % utilization: about 240 GPU-hours, or 10 days on a single H100, or 30 hours on an 8× H100 node. At cloud rates (~3 USD/H100-hour in mid-2026), that is roughly 720 USD — a useful sanity check before reserving a cluster.
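The same estimate as a reusable function, reproducing the numbers above (a sketch; utilization is the big unknown, so treat it as a range rather than a constant):

```python
def gpu_hours(n_params, n_tokens, gpu_tflops=495, utilization=0.5):
    """Training GPU-hours from the C ≈ 6·N·D approximation (±30 %)."""
    flops = 6 * n_params * n_tokens
    return flops / (gpu_tflops * 1e12 * utilization * 3600)

h = gpu_hours(7e9, 5e9)   # 7B model, 5B tokens, H100 dense bf16
print(f"{h:.0f} GPU-hours | {h / 24:.0f} days on 1 GPU | "
      f"{h / 8:.0f} h on an 8-GPU node | ~${h * 3:.0f} at $3/GPU-hour")
# -> 236 GPU-hours | 10 days | 29 h | ~$707
```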
For SFT and DPO, the formula is similar but the relevant quantity is example-tokens × epochs, and the utilization is lower (more I/O bound). A 5 K-example SFT on a 7 B model with QLoRA fits in 1–3 hours on a single 24 GB GPU.
The diminishing-returns line
If adding 10 % more data yields under 0.3 % improvement on your headline metric, stop scaling data and start scaling something else — annotation quality, prompt design, model architecture.
At every training stage there is a volume above which adding more data costs more than it gains. The signals that you have crossed it:
- Eval loss plateaus over the last 20–30 % of training tokens.
- Eval task accuracy stops improving while training loss continues to drop — the gap between training and eval grows.
- Adding 10 % more data yields under 0.3 % improvement on the headline metric.
- Specific failure modes (the ones your customers report) do not budge with more data of the same distribution.
When any two of these fire, stop scaling data and start scaling something else — annotation quality, prompt design, model architecture, or simply ship the version you have and revisit after deployment data accumulates.
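The third signal is easy to automate between runs (a sketch; the 10 %/0.3 % thresholds are this article’s rule of thumb, read here as absolute points on the headline metric):

```python
def keep_scaling_data(metric_before, metric_after, min_gain=0.003):
    """Diminishing-returns rule: after ~10% more data, demand >= 0.3 points
    of absolute gain on the headline metric; otherwise redirect the budget."""
    return (metric_after - metric_before) >= min_gain

# Headline accuracy moved 0.712 -> 0.714 after a 10% data increase:
print(keep_scaling_data(0.712, 0.714))   # False -> stop scaling data
```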
Quick reference — token volumes by stage
| Stage | Minimum | Sweet spot | Diminishing returns |
|---|---|---|---|
| From-scratch pre-training | 10× params (tokens) | 100–500× params | ~1 000× params |
| Continued pre-training | ~500 M tokens | 1–10 B tokens | ~100 B tokens |
| SFT | ~100 examples | 1 000–10 000 | ~100 000 |
| DPO / preference | ~2 000 pairs | 10 000–50 000 | ~200 000 |
| Eval set | 20–50 examples | 200–500 | ~5 000 |
| RAG corpus | any size | quality > quantity | indexing cost dominates |
Bottom line
The correct answer to “how many tokens do you need” is “what training stage, for what model, with what quality bar”. Pre-training in 2026 wants 100–500 tokens per parameter, not the Chinchilla-optimal 20. Continued pre-training wants 1–10 B in-domain tokens with a general-corpus rehearsal mix. SFT wants 1 000–10 000 carefully curated examples. DPO wants 10 000–50 000 preference pairs. Evals want 200–500 examples and a discipline of never leaking them into training.
If you are planning a vertical fine-tune and want to compare the actual token volumes that drove our French regulatory and financial text corpus, see our companion guides on SFT datasets and dataset formats — the storage and curation decisions that determine whether your token count translates into usable training signal.
See also: the best public LLM datasets in 2026 and LLM corpus deduplication techniques.
Frequently asked questions
Is the 20-tokens-per-parameter Chinchilla rule still valid?
Compute-optimal — yes. But modern teams optimize inference compute, not training compute. Over-training (100–500 tokens/parameter) gives a smaller model with the same quality and cheaper inference for the model’s entire deployed life.
How many SFT examples do I really need?
1 000–10 000 carefully curated. The 2023 LIMA paper plus three years of replication studies confirm that quality dominates quantity in this range. Above 100 000, returns diminish hard.
How big should my evaluation set be?
200–500 stratified examples is the production sweet spot. Smaller for development sanity, larger for benchmark release. Discipline matters more than size — never let eval leak into training.
How much RAG data do I need?
Quality of chunks dominates quantity of tokens. Below 10 K documents a flat vector index is fine; 10 K–1 M needs hybrid search and a reranker; above 1 M, index cost dominates and you need hierarchical retrieval.
How do I know when to stop adding data?
Four signals: eval loss plateaus, eval gap grows, 10 % more data yields under 0.3 % gain, and specific user-reported failures don’t move. When two of these fire, switch focus.
Need a tokenized French corpus for continued pre-training?
We deliver 460 M tokens of finance, regulation, and economic French text as Parquet shards — versioned, provenance-tracked, AI Act art. 10 ready.