
Best public datasets for training generative AI models in 2026.

The public training-data landscape changed more in 2024-2025 than in the previous five years combined. FineWeb, FineWeb-2, RedPajama-V2, Common Corpus, Dolma, SlimPajama, Pleias, Cosmopedia. Each has a different license, a different cleaning posture, and a different fit for what you are training. This is a working map for 2026.

Key takeaways

  • FineWeb-2 (15 trillion tokens, 1,000+ languages) has replaced FineWeb-Edu as the strongest open generalist baseline. Licensed ODC-BY; most production use still needs an additional filtering pass.
  • Pleias Common Corpus (~2 trillion tokens, 8 European languages including French) is the only major release built entirely on public-domain and openly licensed text. Slower to scale but the cleanest compliance posture.
  • RedPajama-V2 (30 trillion tokens raw, ~5 trillion deduplicated) is the largest open multilingual web crawl. ODC-BY license, with deduplication and quality classification left to the user.
  • For French specifically, fineweb-2-fr is the best free starting point (~200B tokens). Vertical FR corpora (finance, legal, regulatory) are a complement, not a replacement, when you target a specific buyer.
  • Licensing posture matters more than raw token count. A trillion ODC-BY tokens you can ship to enterprise customers beats five trillion tokens of unclear-license web crawl that legal will never sign off on.

How to read this list

“Best” depends on what you train. A pretraining run for a 7B model is not the same problem as a 1B SFT run on French finance text. The criteria that matter, in order: license clarity, language coverage matching your target, deduplication posture, and the cleaning level the team behind the dataset already shipped. Token count is the last filter, not the first.

All counts below are approximate as of early 2026. Hugging Face card pages update; verify the exact size and license before you commit. We mark each dataset on three axes: license (permissive vs research-only), language (English-first, multilingual, French-specific), and posture (raw vs filtered).

Generalist English-first crawls

Dataset | Tokens | License | Posture
FineWeb-Edu (HuggingFaceFW) | ~1.3T | ODC-BY | Filtered (edu classifier)
FineWeb (original) | ~15T | ODC-BY | Lightly filtered
Dolma (AI2) | 3T | ODC-BY | Filtered, multi-domain
SlimPajama-627B | ~627B | ODC-BY | Dedup of RedPajama-V1
The Pile (legacy) | ~825B | MIT | 2020 vintage, English-heavy

FineWeb-Edu remains the strongest single-shot English baseline at the 1B–7B scale. The edu classifier (trained on Llama-3 annotations) drops 80–90 % of the CommonCrawl source and keeps what scores as “educational” on a 0–5 rubric. The downside is a narrower content profile: code, dialogue, and niche technical text are underrepresented. Mix with code and instruction datasets if you want generalist behavior.
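
A minimal sketch of raising that bar further before mixing, assuming the released rows still carry the classifier's score field and that a sample-10BT config exists as on the current dataset card:

```python
# Hedged sketch: stream a FineWeb-Edu sample and keep only top-scoring documents.
# Assumes a `score` column (0-5 classifier rubric) and a "sample-10BT" config;
# verify both on the dataset card before relying on this.
from datasets import load_dataset

ds = load_dataset("HuggingFaceFW/fineweb-edu", name="sample-10BT",
                  split="train", streaming=True)

# FineWeb-Edu already keeps score >= 3; this keeps only the top of the rubric.
strict = ds.filter(lambda row: row.get("score", 0) >= 4)

for i, row in enumerate(strict):
    print(round(row["score"], 2), row["text"][:60].replace("\n", " "))
    if i >= 4:
        break
```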

Generalist multilingual crawls

Dataset | Tokens | Languages | License
FineWeb-2 | ~15T | 1,000+ | ODC-BY
RedPajama-V2 | 30T raw / 5T dedup | 5 main + tail | ODC-BY
mC4 (multilingual C4) | ~6T | 108 | ODC-BY
CulturaX | ~6.3T | 167 | ODC-BY (with caveats)
OSCAR-2301 | ~2T | 150+ | CC0-1.0

FineWeb-2 is the 2025 standout: 1,000+ languages, deduplicated per language, ships with minhash signatures so you can re-dedup against your own corpus. The French subset is around 200B tokens after dedup, which is roughly what a small French LLM lab needs as a base. RedPajama-V2 is larger but ships rawer — you do the classifier pass.
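
A minimal sketch of pulling only the French slice, assuming the FineWeb-2 config for French is named fra_Latn (ISO 639-3 plus script) as on the current card:

```python
# Hedged sketch: stream the French subset of FineWeb-2 without downloading 15T tokens.
# The config name "fra_Latn" is an assumption from the card's naming scheme;
# verify it (and the ODC-BY terms) before committing to a run.
from datasets import load_dataset

fr = load_dataset("HuggingFaceFW/fineweb-2", name="fra_Latn",
                  split="train", streaming=True)

# Rough budgeting on a small sample: whitespace words * ~1.3-1.5 approximates
# subword tokens for French text.
words = sum(len(row["text"].split()) for row in fr.take(1_000))
print(f"~{words / 1_000:.0f} words per document on this 1,000-doc sample")
```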

LICENSING TRAP

Some “openly licensed” multilingual crawls inherit unclear upstream licensing from individual sites. CulturaX is the cleanest of the multi-license bundles, but read the card carefully: a few sources are research-only. If you target enterprise customers, default to ODC-BY or CC0 sources.

Public-domain and openly licensed corpora

Pleias Common Corpus, released late 2024 and expanded through 2025, is the largest training corpus built entirely on text in the public domain or under explicit open licenses. ~2 trillion tokens across 8 European languages, with French and English the largest subsets. It is slower to grow than CommonCrawl-based corpora because acquisition is gated on actual license verification, but it is the cleanest compliance posture you can ship to enterprise.

Other openly licensed building blocks worth knowing: Wikipedia (CC-BY-SA), arXiv metadata and abstracts (mixed, mostly CC0 abstracts), USPTO patents (US-PD), Project Gutenberg (US-PD, limited modern content), and the Library of Congress digitized collections (US-PD).

A trillion ODC-BY tokens you can ship beats five trillion tokens of unclear-license web crawl that legal will never sign off on.

Code corpora

Dataset | Tokens | License | Notes
The Stack v2 (BigCode) | ~900B | Permissive only | Opt-out enforced
StarCoder Training Data | ~250B | Permissive only | Filtered subset of the Stack
CodeParrot | ~50B | Permissive only | Python-focused
OpenCoder pretrain | ~960B | Permissive only | Filtered + classifier

The Stack v2 is the canonical code corpus for permissive-license training. The opt-out list is enforced — BigCode honors author requests to remove repositories from subsequent versions. If you fine-tune on code, the Stack v2 plus a filtered Python or language-specific slice is the standard starting point.

Instruction and chat corpora

For SFT and instruction tuning, the relevant 2026 corpora are different from pretraining datasets. Tulu-3 (Allen AI), Magpie-Pro, Cosmopedia v2, OpenHermes-2.5, and the Llama-Nemotron post-training data are the active baselines; a short loading sketch follows the list.

  • Tulu-3 (Allen AI): 940K instructions, mixed open licenses. Includes RLHF preference pairs and DPO data. Best single-shot SFT dataset published openly in 2025.
  • Cosmopedia v2 (Hugging Face): 30B+ synthetic tokens, generated by Mixtral and Llama-3. Strong for general knowledge SFT, weaker on technical depth.
  • Magpie series: 1M+ instructions generated by prompting aligned Llama-3 models to produce both the instructions and the responses. Apache-2.0. Useful for diversity, less for specialized domains.
  • OpenHermes-2.5: 1M+ conversations, Apache-2.0. The reference for chat tuning open models.
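
A minimal loading sketch for Tulu-3, assuming the dataset id allenai/tulu-3-sft-mixture and a chat-style messages column; confirm both on the card before building an SFT pipeline around it.

```python
# Hedged sketch: peek at the Tulu-3 SFT mixture before adding it to a mix.
# The dataset id and the `messages` schema are assumptions to verify on the card.
from datasets import load_dataset

tulu = load_dataset("allenai/tulu-3-sft-mixture", split="train", streaming=True)

for row in tulu.take(3):
    msgs = row["messages"]  # expected: list of {"role": ..., "content": ...}
    print(f"{len(msgs)} turns | opens with: {msgs[0]['content'][:80]!r}")
```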

Vertical doesn’t mean less. It means more useful.

Generalist corpora are necessary; they are rarely sufficient. For French finance, regulatory, and economic LLMs, a vertical corpus with audit trail and pseudonymization moves the needle where generalist data does not.

Vertical-specific French corpora

For French finance, legal, and regulatory training, the public landscape is much thinner than for generalist text. fineweb-2-fr gives you the breadth. For depth, the public options are:

  • DILA bulk archives (Direction de l’information légale et administrative): JORF, LEGI, CASS, JADE, CONSTIT, KALI, CIRC, CNIL. Open data, Licence Ouverte 2.0. Total around 2.7M documents and 2 billion tokens across French law, jurisprudence, and regulator decisions.
  • EUR-Lex French translations: ~100K regulations and directives in French through 2022. The post-2022 layer (DORA, MiCA, AI Act, CSRD) requires extraction via the Cellar API or direct scraping of EUR-Lex with session cookies; the bulk archive does not cover it yet.
  • BOFiP (tax doctrine): ~8K documents covering French tax administration. Open data, Licence Ouverte 2.0.
  • French open-data on data.gouv.fr: a long tail of sectoral datasets, most useful when joined with the DILA archives via citation graphs.

Choosing a starting mix

A pragmatic 2026 mix for a 1B–7B French-capable generalist model: 60 % FineWeb-2 (English + French + Spanish), 15 % Pleias Common Corpus, 10 % The Stack v2 (code), 10 % vertical French (finance, regulatory, legal — your call), 5 % instruction data (Tulu-3, Magpie). Adjust the mix toward your target use case. The vertical 10 % is the block that decides whether your model is useful for French regulated industries.
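
A minimal sketch of weighting that mix with Hugging Face interleave_datasets, shown for three of the five sources; the dataset ids, config names, and the vertical JSONL path are illustrative assumptions, not a fixed recipe.

```python
# Hedged sketch: probability-weighted interleaving of streamed sources.
# Dataset ids, config names, and the vertical JSONL path are assumptions;
# swap in whatever your legal review actually cleared.
from datasets import load_dataset, interleave_datasets

sources = {
    "fineweb2": load_dataset("HuggingFaceFW/fineweb-2", name="fra_Latn",
                             split="train", streaming=True),
    "pleias":   load_dataset("PleIAs/common_corpus", split="train", streaming=True),
    "vertical": load_dataset("json", data_files="vertical_fr_finance.jsonl",
                             split="train", streaming=True),
}

# Keep schemas identical so interleaving works; assumes each source has a `text` column.
sources = {k: v.select_columns(["text"]) for k, v in sources.items()}

# 60/15/10 weights renormalized over the three sources shown here; code and
# instruction data would be added the same way to reach the full 60/15/10/10/5 split.
weights = {"fineweb2": 0.60, "pleias": 0.15, "vertical": 0.10}
total = sum(weights.values())

mix = interleave_datasets(
    list(sources.values()),
    probabilities=[weights[k] / total for k in sources],
    seed=42,
    stopping_strategy="all_exhausted",
)
```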

For SFT on top of an existing pretrained model, the mix collapses: 40 % Tulu-3, 30 % vertical, 20 % Cosmopedia, 10 % code instructions. Pretrained models already saw the generalist text — SFT is where you push the specialization that justifies the model.

Need French finance and regulatory text on top of FineWeb-2?

We ship a vertical French corpus on finance, regulatory, and economic text — pseudonymized, audit-trailed, and signed. The 10 % of your mix that decides domain capability.

Frequently asked questions

Which public dataset has the cleanest license for commercial training?

FineWeb-Edu and Pleias Common Corpus are the two cleanest large options as of early 2026. FineWeb-Edu is ODC-BY (more permissive); Pleias is built on public-domain and open-license source material (cleanest compliance). Both are large enough that scale is not the bottleneck. Pretraining-grade datasets in the 5T+ range typically require more careful license review.

How many tokens do I actually need?

Chinchilla-optimal for a 7B model is around 140B tokens. For a 70B model, around 1.4T. Going beyond Chinchilla-optimal still helps but with diminishing returns — Llama-3 8B was trained on 15T tokens, well past optimal, and gained measurable capability. The ceiling is your data quality, not your token count.
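
The 20-tokens-per-parameter rule of thumb, as a back-of-envelope check:

```python
# Chinchilla back-of-envelope: ~20 training tokens per parameter.
def chinchilla_tokens(n_params: float, tokens_per_param: float = 20.0) -> float:
    return n_params * tokens_per_param

for size in (1e9, 7e9, 70e9):
    print(f"{size / 1e9:>4.0f}B params -> ~{chinchilla_tokens(size) / 1e9:,.0f}B tokens")
# 1B -> ~20B, 7B -> ~140B, 70B -> ~1,400B
```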

Should I deduplicate across datasets?

Yes. Cross-corpus duplication is real: Wikipedia text shows up in CommonCrawl, parliamentary text shows up in legal databases, and the same Reuters articles get republished across regional news sites. MinHash LSH with a Jaccard threshold of 0.7 to 0.8 is the standard pass. Expect a 20–40 % drop on a typical multi-source mix.
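
A minimal single-machine sketch of that pass with the datasketch library; production pipelines shard the index across machines, but the logic is the same.

```python
# Hedged sketch: MinHash LSH near-duplicate filtering (pip install datasketch).
# Character shingles + a 0.8 Jaccard threshold; tune shingle size and threshold
# per corpus, and shard the index for anything beyond a few million documents.
from datasketch import MinHash, MinHashLSH

def minhash(text: str, num_perm: int = 128, shingle: int = 5) -> MinHash:
    m = MinHash(num_perm=num_perm)
    for i in range(max(len(text) - shingle + 1, 1)):
        m.update(text[i:i + shingle].encode("utf-8"))
    return m

lsh = MinHashLSH(threshold=0.8, num_perm=128)

def add_if_unique(doc_id: str, text: str) -> bool:
    m = minhash(text)
    if lsh.query(m):          # a near-duplicate is already indexed
        return False
    lsh.insert(doc_id, m)
    return True

# One pass over the merged corpora: first copy wins, later near-duplicates are dropped.
print(add_if_unique("wiki-001", "The quick brown fox jumps over the lazy dog."))  # True
print(add_if_unique("cc-042",   "The quick brown fox jumps over the lazy dog!"))  # False
```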

Is it safe to mix research-only and commercial datasets?

Not if you train a single model you intend to ship. The model inherits the most restrictive license in the training mix. Some research-only datasets are derived from permissive sources — in those cases you can usually reconstruct an equivalent corpus from the source. Always assume the strict reading.

What about synthetic data?

Synthetic data (Cosmopedia, Magpie, Llama-Nemotron) is now standard for instruction tuning and increasingly used in pretraining. The legal question is whether the generator model’s license permits derivative datasets — most permissive open models do, but check. Quality-wise, synthetic data closes a measurable gap on instruction following but does not yet replace real human-authored text on long-form reasoning and technical depth.


