Editorial · French Corpus LLM · regulatory & generative AI

Category: LLM Training Data

Building, scaling, and governing training datasets for large language models — pretraining corpora, supervised fine-tuning sets, instruction tuning data, evaluation benchmarks, and the engineering behind them.

  • Instruction tuning vs SFT vs RLHF — choosing the right dataset shape.

    Instruction tuning, supervised fine-tuning, and RLHF/DPO all live under the same vague label of “post-training” in 2026. They are not interchangeable. Each requires a differently shaped dataset, captures different signal, and answers different deployment questions.

    Key takeaways

    • SFT is the umbrella. Instruction tuning is a flavor of SFT with explicit task instructions. RLHF and DPO are different — they require preference pairs, not just demonstrations.
    • Instruction tuning datasets are (instruction, optional context, response) triples. Quality beats quantity past ~10K examples — LIMA showed strong results at 1K well-curated examples.
    • RLHF and DPO need (prompt, preferred response, rejected response) triples. Building these is 5–10× the per-example cost of SFT examples, but the alignment signal is sharper.
    • Domain-specific SFT on a permissive corpus is the highest-ROI post-training step for vertical applications. 5,000–30,000 well-curated examples typically outperform a 1M-example generalist SFT mix on domain tasks.
    • Eval frames the dataset shape. If your eval is “does the model follow this instruction,” build instruction data. If it is “does the model prefer the safer answer,” build preference data. Mismatched eval-vs-data is the most common failure mode.

    Terminology — what each method actually is

    Supervised fine-tuning (SFT) is the umbrella term: you continue training a pretrained model on labeled examples of (input, desired output). Instruction tuning is a flavor of SFT where the inputs are explicit task instructions. RLHF (reinforcement learning from human feedback) trains the model to prefer responses humans prefer, using a reward model. DPO (direct preference optimization) achieves a similar effect without a separate reward model, by directly optimizing the preference loss.

    The typical 2026 production post-training stack on a frontier-adjacent open model is: SFT pass (general instruction following) → SFT pass (domain specialization) → DPO or KTO pass (alignment refinement) → optional RLHF pass for tail behaviors. Each step uses a differently shaped dataset.

    SFT data shape — examples and counts

    The minimum shape is (prompt, completion). Examples are stored as JSONL with one record per row, or as a Hugging Face Dataset. For chat-template models, prompts include the system message and conversation history; for completion-style models, prompts are the raw input string.

    Field         | Type          | Example
    messages      | list of dicts | [{role: system, content: …}, {role: user, content: …}]
    completion    | string        | Assistant’s response text
    source        | string        | tulu-3 / magpie / cosmopedia / custom
    quality_score | float         | 0.0–1.0 (optional)
    domain        | string        | general / code / math / french-legal
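
    A complete record in this shape, written as one JSONL line (the values are illustrative):

    {"messages": [{"role": "system", "content": "You are a French regulatory assistant."}, {"role": "user", "content": "Summarise the scope of DORA art. 5."}], "completion": "DORA Article 5 requires financial entities to...", "source": "custom", "quality_score": 0.92, "domain": "french-legal"}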

    Counts: general SFT mixes are 100K–1M examples (Tulu-3 is 940K, OpenHermes-2.5 is ~1M). Domain-specific SFT works at 5K–50K examples. Beyond ~100K examples on a single domain, returns diminish quickly.

    Instruction tuning data shape — examples and counts

    Instruction tuning is SFT with explicit task framing. Each example reads as “do X with input Y, here is the expected output Z.” The classic Alpaca format uses three fields: instruction, input, output. The input field is optional — many tasks have only an instruction (e.g., “write a poem about autumn”).

    ALPACA VS CHATML

    The Alpaca format is convenient for academic work but production deployments use ChatML or the model’s native chat template (Llama, Mistral, Qwen all differ). Convert Alpaca-style examples to the target chat template at training time, not at dataset creation time. Keep the dataset in the most general format.
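
    A minimal conversion sketch, assuming Alpaca-style dicts and a Hugging Face tokenizer (the model name and the prompt-joining convention are illustrative, not fixed):

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

    def alpaca_to_messages(example):
        # Fold the optional input field into the user turn.
        user = example["instruction"]
        if example.get("input"):
            user += "\n\n" + example["input"]
        return [{"role": "user", "content": user},
                {"role": "assistant", "content": example["output"]}]

    def to_training_text(example):
        # Apply the target model's own template at training time.
        return tokenizer.apply_chat_template(
            alpaca_to_messages(example), tokenize=False, add_generation_prompt=False)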

    The 2024–2025 instruction-tuning corpora — Tulu-3, OpenHermes-2.5, Magpie — all ship in ChatML or close variants. Hugging Face datasets typically include both the raw format and the chat-template-applied format for convenience. Diversity of instruction types matters more than depth — 50K diverse instructions beat 200K instructions that all look like “summarize X.”

    RLHF and DPO data shape — preference pairs

    Preference data has a different shape: each row is (prompt, chosen_response, rejected_response). The model learns which response a human (or a model acting as judge) preferred. Datasets: Anthropic HH-RLHF, OpenAI WebGPT preferences, UltraFeedback, Allen AI Tulu DPO sets.

    Field          | Type           | Notes
    prompt         | string or chat | Same as SFT
    chosen         | string         | The preferred response
    rejected       | string         | The dispreferred response
    chosen_score   | float          | Optional reward-model score
    rejected_score | float          | Optional reward-model score

    Building preference pairs costs 5–10× what building SFT examples costs, because every prompt needs two candidate responses and a judgment. In practice, 2026 teams build preference data semi-synthetically: generate K=4–8 candidates per prompt with the SFT model, score them with a strong judge model (GPT-4o, Claude, Llama-Nemotron-70B), then keep the top-1 / bottom-1 pair. UltraFeedback and the Magpie-DPO series follow exactly this pattern.
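
    A sketch of that loop, where generate_candidates and judge_score stand in for your SFT model and judge model (both hypothetical helpers, not library calls; candidates are response strings):

    def build_preference_pair(prompt, generate_candidates, judge_score, k=8, margin=0.5):
        # Sample K candidates from the SFT model and score each with the judge.
        scored = sorted((judge_score(prompt, c), c)
                        for c in generate_candidates(prompt, n=k))
        (lo, rejected), (hi, chosen) = scored[0], scored[-1]
        if hi - lo < margin:
            return None  # scores too close to call — skip ambiguous pairs
        return {"prompt": prompt, "chosen": chosen, "rejected": rejected,
                "chosen_score": hi, "rejected_score": lo}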

    If you cannot articulate why one response is better than another in your domain, you are not ready to do RLHF. SFT first; refine when the preferences are clearly statable.

    Quality vs quantity — the LIMA finding

    LIMA (Less Is More for Alignment, Zhou et al. 2023) showed that a 65B model SFT-tuned on 1,000 carefully curated examples matched the same base model tuned on roughly 50K unfiltered instruction examples. The finding has replicated repeatedly since: past ~10,000 high-quality examples in a target domain, additional volume produces diminishing returns or actively harms quality when low-quality examples dilute the mix.

    The corollary: invest in curation. For domain-specific SFT, 5,000 examples that domain experts have reviewed beat 50,000 examples a moderately capable judge model accepted. The bottleneck is reviewer time, not generation cost. The reviewers also need to be actual domain experts — generic annotators miss the technical errors that matter.

    A French finance + regulatory corpus, primed for SFT

    Use our vertical corpus as the foundation for domain SFT, instruction tuning, or preference data construction. Pseudonymized, audit-trailed, ready to drop into Hugging Face Datasets.

    Synthetic data — when it helps and when it hurts

    Synthetic data — examples generated by a strong model — is now standard for instruction tuning and increasingly used in preference data. The main 2025 entries: Cosmopedia v2 (textbooks generated by Mixtral), Llama-Nemotron post-training data (Llama-3 generated), Magpie (extracted Llama-3 outputs), Tulu-3 synthetic complement, OpenCoder synthetic instructions.

    When synthetic helps: instruction following, format compliance, long-tail edge cases where real examples are rare. When synthetic hurts: deep domain expertise (synthetic regulatory analysis tends to hallucinate citations), nuanced preference judgments, and tasks where the synthetic generator does not actually understand the task. The tell-tale: if the generator model itself cannot do the task reliably, its synthetic training data will encode its own errors.

    • Mix real and synthetic. 70 % synthetic / 30 % real is a common production ratio for instruction tuning.
    • Cite the generator. Synthetic data inherits the generator’s biases and license. Document which model generated which subset.
    • Filter by judge agreement. Use a different model to filter the generator’s output. Reduces the bias drift from training on a single model’s preferences.

    Eval shapes the dataset shape

    The most common post-training failure: shipping a dataset shape that does not match the eval. If your eval is GSM8K (math reasoning), your dataset should heavily include math problems with step-by-step solutions. If your eval is a customer-call summary task, your dataset should heavily include conversation-to-summary pairs. Generic instruction mixes lift generic benchmarks; they do not lift specific evals.

    Vertical evaluation, where the eval is “does the model correctly answer this regulatory question with a valid citation,” needs SFT data shaped exactly like the eval task. A retrieval-augmented eval needs retrieval-augmented SFT. A citation-required eval needs citation-bearing SFT.

    A practical post-training mix for vertical LLMs

    A 2026 reference recipe for a French regulatory assistant built on a 7–14B pretrained open model:

    • SFT pass 1 — general instruction following. Tulu-3 + Cosmopedia v2 subset, ~200K examples, 3 epochs.
    • SFT pass 2 — domain specialization. 8K–15K vertical examples from your French regulatory corpus, hand-curated, 1–2 epochs. This is the block that decides whether the model can answer regulatory questions usefully.
    • DPO pass — alignment refinement. 5K–10K preference pairs, generated semi-synthetically from the SFT model and judged by GPT-4o-class or Llama-Nemotron-70B. Targets faithfulness, citation use, and refusal behavior.
    • Optional safety SFT. 2K examples covering known regulator-side expectations (privacy, scope, refusal patterns). Hand-written.

    Total: ~30K–230K labeled examples depending on whether you count the generalist first pass. The vertical contribution is 13K–25K examples — small in absolute terms, decisive in domain performance.

    A French regulatory corpus for domain-specific SFT

    Pseudonymized, audit-trailed, AI Act Article 10 documented. Use it as the foundation for your SFT pass 2 — the block that decides whether the model can answer regulatory questions.

    Frequently asked questions

    Should I do SFT before DPO, or together?

    Sequentially. SFT establishes the model’s instruction-following baseline. DPO refines the model’s preferences on top of that baseline. Doing them together — or doing DPO without a prior SFT — gives unstable training dynamics because the preference loss is conditioned on the model’s current distribution.

    How do I know if my SFT dataset is good enough?

    Three checks. (1) Inter-annotator agreement on a 100-example sample — if two domain experts disagree on what the correct output should be, the dataset is ambiguous. (2) A held-out eval that mirrors your target task — small SFT runs on subsets should show monotonic improvement with more data. (3) Failure-mode analysis after a small training run — the failure cases tell you what additional examples to add.

    Can I use the same data for both SFT and DPO?

    Not directly. SFT data is (prompt, response) pairs; DPO data is (prompt, chosen, rejected) triples. You can derive a DPO set from an SFT set by generating multiple candidates per prompt and scoring them, but the SFT and DPO sets serve different purposes.

    Is RLHF still relevant in 2026 or has DPO won?

    DPO is the production default for most alignment work in 2026 — simpler, no reward model to train and maintain. RLHF retains an edge on hard alignment problems where the reward function is non-trivial (safety, multi-turn coherence, long-form faithfulness). Anthropic and DeepMind still publish active RLHF research. For most teams shipping a fine-tuned model in 2026, DPO is the right starting point.

    What about KTO and IPO?

    Variants of DPO with different loss formulations. KTO (Kahneman-Tversky Optimization) needs only single-response labels, not pairs, which is cheaper. IPO (Identity Preference Optimization) addresses a bias in the original DPO loss. Both are worth experimenting with after you have a working DPO pipeline. None changes the dataset shape requirements significantly.


    Keep reading

    Read next

    SFT datasets — format, structure, instruction-tuning best practices

    Field-by-field deep dive into the SFT data format, with the common pitfalls in chat template handling and turn-level masking.

    Read next

    How to train an LLM on your own data — a practical 2026 guide

    End-to-end fine-tuning playbook — hardware, dataset prep, framework choice, evaluation.

    Read next

    Best public datasets for training generative AI models in 2026

    The pretraining and post-training datasets you can actually train on without legal review — license, language, size, posture.

  • RAG datasets — what makes a corpus retrieval-friendly in 2026.

    Most teams shopping for retrieval-augmented generation focus on the retriever — vector DB, reranker, embedding model. The data is treated as input. That is backwards. RAG quality is decided in your corpus structure: how you chunk, what you index, what metadata you keep, and whether the model can trust what it retrieves.

    Key takeaways

    • A retrieval-friendly corpus ships with per-chunk metadata, not just per-document. Source URL, publication date, jurisdiction, document type — at the chunk level, queryable as filters.
    • Pre-computed embeddings move RAG from a research prototype to a 10-minute deployment. E5-multilingual-large at 1024 dimensions is the 2026 default for multilingual production. Ship them with the corpus.
    • Chunk size 512 tokens with 64-token overlap is the production default. Smaller chunks improve precision but kill cross-paragraph context. Domain-specific tuning is real — legal text wants bigger chunks than news.
    • Provenance per chunk is the AI Act Article 10 footprint of RAG. When your retriever surfaces a chunk, the citation chain has to lead back to a canonical source. Skip this and your enterprise customer’s audit fails.
    • Hybrid retrieval (dense + sparse + reranker) beats pure dense by 5–15 % on regulatory text. Even with a perfect embedding model, sparse signal (exact match on a CELEX number, an article reference) catches what semantics can’t.

    What makes a corpus “retrieval-friendly”

    Two corpora with the same raw content can produce very different RAG quality. The difference is in five preparation choices: chunking strategy, metadata schema, embedding model, provenance density, and source-aware filtering. A retrieval-friendly corpus is one where those choices are made deliberately and shipped alongside the text.

    The opposite — a “pretraining-friendly” corpus — optimizes for raw token throughput, light cleaning, broad license, no chunking, no embeddings. Pleias Common Corpus, FineWeb-2, RedPajama are pretraining-shaped. They will work in a RAG pipeline, but you do all the preparation work yourself. A vertical corpus shipped for RAG already has that work done.

    Chunking — size, overlap, boundaries

    Chunk size 512 tokens with 64-token overlap is the production default for E5- and BGE-class embedding models. Smaller chunks (128–256 tokens) improve precision on specific-fact retrieval but lose cross-paragraph context — a model that retrieves a single sentence often cannot reason about the surrounding argument. Larger chunks (1024+ tokens) are good for long-context models but reduce recall granularity.

    Boundary detection matters more than size. Splitting mid-sentence is the most common mistake. The right boundary order is: section heading > paragraph > sentence > token. For legal text specifically, also respect article boundaries — never split an article across two chunks if you can avoid it, because the article number is the citation key.

    Domain              | Recommended chunk size | Overlap | Rationale
    News / web          | 256–512                | 32–64   | Short, self-contained paragraphs
    Legal / regulatory  | 512–768                | 64–128  | Articles are the natural unit
    Scientific papers   | 512–1024               | 128     | Cross-section reasoning
    Code / API docs     | 256–512                | 64      | Function-level boundaries
    Conversational logs | 128–256                | 32      | Turn-level retrieval

    ANCHOR ON STRUCTURE

    If your source documents have a table of contents or visible structure (HTML headings, markdown sections, PDF outline), use it. Splitting on existing structure gives chunks that humans understand as units. Random splitting at 512 tokens generates chunks that look fine to the retriever but make no sense in a citation.
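
    A sketch of structure-first packing, assuming markdown-style headings and a Hugging Face tokenizer. Paragraphs are never split, so a chunk only overshoots the token budget when a single paragraph does:

    import re

    def chunk_by_structure(text, tokenizer, max_tokens=512):
        # Split on headings first, then pack whole paragraphs into chunks,
        # so no chunk starts or ends mid-sentence.
        chunks = []
        for section in re.split(r"\n(?=#{1,6} )", text):
            buf, used = [], 0
            for para in section.split("\n\n"):
                n = len(tokenizer.encode(para))
                if buf and used + n > max_tokens:
                    chunks.append("\n\n".join(buf))
                    buf, used = [], 0  # carry a tail paragraph here for overlap if wanted
                buf.append(para)
                used += n
            if buf:
                chunks.append("\n\n".join(buf))
        return chunks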

    Metadata per chunk — what to keep, what to drop

    The minimum metadata schema per chunk: chunk ID, parent document ID, source URL, publication date, chunk position within document, chunk text length. Anything you might want to filter on at query time should be a metadata field, not a tag inside the text.

    Domain-specific metadata that earns its keep:

    • Legal corpora: jurisdiction, court level, article reference, CELEX number, date of decision, citation list (which other documents this chunk cites).
    • Financial corpora: entity tickers mentioned, currency, time period discussed, document type (10-K, regulator decision, research note), regulator if any.
    • Multilingual corpora: language, language confidence, translation fidelity score if the chunk is translated content.
    • Regulatory corpora: regulator name, regulatory topic (e.g., AML, GDPR, prudential), authority of the document (binding, soft law, guidance).

    A retriever that cannot filter by jurisdiction or date is useless on regulatory questions. Metadata is the dial that turns RAG from research demo to production tool.
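
    A minimal per-chunk record in this spirit (field names and values illustrative, not a fixed schema):

    {"chunk_id": "legi-2024-00042#3", "doc_id": "legi-2024-00042",
     "source_url": "https://www.legifrance.gouv.fr/...", "published": "2024-03-18",
     "position": 3, "n_chars": 1840, "language": "fr",
     "jurisdiction": "FR", "doc_type": "regulator_decision", "text": "..."}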

    Pre-computed embeddings — model and dimensions

    Three reasons to pre-compute and ship embeddings with the corpus: (1) speed to first query — your customer indexes Parquet, not text; (2) consistency — everyone using the corpus uses the same embedding, so retrieval results are reproducible; (3) cost — embedding 2 million documents once on the corpus owner’s GPU is cheaper than every customer doing it themselves.

    Model choice as of 2026:

    Model                          | Dim  | Languages | Posture
    E5-multilingual-large-instruct | 1024 | 100+      | Production default 2026
    BGE-multilingual-gemma2        | 1024 | 100+      | Strong multilingual
    mGTE-large                     | 1024 | 70+       | Strong on legal/long-doc
    multilingual-e5-base           | 768  | 100+      | Lighter, faster
    CamemBERT v2 + projection      | 768  | FR only   | FR-specific fine-tune

    Storage: float32 at 1024 dim is 4 KB per chunk; INT8-quantized is 1 KB. For a 2-million-chunk corpus, that is 8 GB float32 vs 2 GB INT8 — both manageable. INT8 quantization loses around 0.5–1.5 % retrieval quality, which is usually worth it.
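
    A sketch of the pre-compute step with sentence-transformers, using the documented E5 passage prefix and a simple symmetric INT8 scheme (one common quantization choice, not the only one):

    import numpy as np
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("intfloat/multilingual-e5-large")

    def embed_chunks(texts):
        # E5 models expect a "passage: " prefix on indexed text.
        vecs = model.encode(["passage: " + t for t in texts],
                            normalize_embeddings=True, batch_size=64)
        return np.asarray(vecs, dtype=np.float32)  # shape (n, 1024)

    def quantize_int8(vecs):
        # Normalized vectors lie in [-1, 1]; 4 KB/chunk becomes 1 KB/chunk.
        return (vecs * 127).round().astype(np.int8)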

    A French finance + regulatory corpus with E5-multilingual embeddings

    We ship the corpus with pre-computed embeddings, per-chunk metadata, jurisdiction and citation graph. Plug into Pinecone, Weaviate, or pgvector in 10 minutes.

    Hybrid retrieval — why dense alone is not enough

    Pure dense retrieval (cosine on embeddings) is the 2022 default. In 2026 it is no longer enough for regulatory or legal retrieval. The reason: semantic models smooth over exact identifiers. A query for “Article L. 612-15” or CELEX “32023R1114” (MiCA) is a precise lookup, not a semantic one. Dense retrieval lands somewhere close but not exact; sparse retrieval (BM25) lands on it.

    Standard hybrid: BM25 for sparse, E5-multilingual or BGE for dense, combine via reciprocal rank fusion or a learned reranker. Then a cross-encoder reranker (BGE-reranker or Cohere Rerank) on the top 25 candidates. On internal evals across French regulatory text, this beat pure dense by 5–15 % nDCG@10. The cost is operational complexity, not training cost — both retrievers run cheap once the index is built.
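
    Reciprocal rank fusion is small enough to sketch in full; dense_ranked and sparse_ranked are assumed to be document-ID lists already sorted by each retriever:

    def rrf_fuse(dense_ranked, sparse_ranked, k=60, top_n=25):
        # score(d) = sum over result lists of 1 / (k + rank of d in that list)
        scores = {}
        for ranked in (dense_ranked, sparse_ranked):
            for rank, doc_id in enumerate(ranked, start=1):
                scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
        fused = sorted(scores, key=scores.get, reverse=True)
        return fused[:top_n]  # hand these to the cross-encoder reranker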

    Provenance and citation — the audit angle

    When your retriever surfaces a chunk and the model uses it, the citation chain should lead back to a canonical source. For an enterprise compliance use case, that means: the chunk has a stable URL or document ID, the document has an audit trail (when extracted, from where, with what license), and the model’s answer can include the citation in the output.

    This is also the AI Act Article 10 footprint of RAG. A generative AI provider deploying a regulated-domain assistant has to demonstrate where the retrieved content came from. A corpus that ships per-chunk provenance JSON-LD makes that demonstration trivial — you read the audit field directly. A corpus that does not requires you to rebuild the trail yourself.

    Eval — measuring retrieval quality on your corpus

    Two evals to run on any RAG corpus before deployment: a synthetic-query eval and a real-query eval. Synthetic: for each document, generate 1–3 questions a Llama or Qwen-class model would plausibly answer using only that document; run those questions as queries against the index; measure recall@K. Real: pull 100 actual user questions from your support tickets or sales calls; have a domain expert score the retrieved chunks for relevance.
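
    The synthetic-query eval reduces to a few lines once you have (question, source document) pairs; search here is a stand-in for your retriever and is assumed to return a list of document IDs:

    def recall_at_k(pairs, search, k=10):
        # pairs: (question, id of the document the question was generated from)
        hits = sum(gold_id in search(question, top_k=k)
                   for question, gold_id in pairs)
        return hits / len(pairs)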

    Benchmarks: BEIR is the canonical retrieval benchmark in 2026, but its domains are broad — for vertical RAG you need a domain-specific eval such as MIRAGE (medical) or LegalBench (legal); MIRACL covers general multilingual retrieval. For French regulatory specifically, no public eval exists yet; build your own from 100–500 expert-curated query-document pairs.

    Common failure patterns in 2026 RAG deployments

    • Splitting on tokens, not structure. Chunks land mid-sentence, the retriever surfaces incomplete information, the model generates plausible nonsense.
    • Missing date metadata. A query about “the current regulation” retrieves a 2018 version because the chunk has no date filter. Avoidable if every chunk carries publication date and the retriever respects it.
    • Embedding the wrong text. Embedding the chunk including boilerplate, footer, or page number degrades retrieval. Clean before you embed.
    • Single embedding model. Multilingual corpus, monolingual embedding. Recall on the non-English subset drops by 30 %+. Use a multilingual model from the start.
    • No reranker. Top-K candidates from dense retrieval are noisy. A cross-encoder reranker on top 25–50 is the single highest-ROI addition to a 2026 RAG pipeline.

    A vertical French corpus ready for RAG

    Per-chunk metadata (jurisdiction, date, citation graph), E5-multilingual embeddings, audit-trail per document. Plug-and-play for finance, regulatory, and economic RAG.

    Frequently asked questions

    Do I need a vector database or is a flat file enough?

    For under 1 million chunks, a flat file with FAISS or HNSWlib is fine and removes a service dependency. For larger corpora or strict latency SLAs, a managed vector DB (Pinecone, Weaviate, Qdrant Cloud) saves you operational work. pgvector inside PostgreSQL is the right answer if you already run Postgres and the corpus fits.

    How often should I re-embed?

    Re-embed when the embedding model changes or when the corpus refresh adds significant new content (10 %+). Incrementally embedding only the new content is the standard pattern. A full re-embed every 6–12 months keeps the corpus aligned with model improvements without massive cost.

    Can I use OpenAI embeddings instead of open-source ones?

    Yes, but you create vendor lock-in and a network dependency. OpenAI text-embedding-3-large is currently top-tier on English and good on multilingual but a closed-weight API. For production RAG on a corpus you ship to customers, open-weight (E5, BGE, mGTE) is the more durable choice.

    Does fine-tuning the embedding model help?

    On domain-specific corpora, yes. Fine-tuning with Sentence-Transformers (SimCSE-style or plain contrastive loss) on (query, relevant_chunk) pairs from your domain typically improves domain retrieval by 5–15 % with a few thousand pairs. The pairs cost more than the compute.

    What about long-context models — do they replace RAG?

    They reduce some classes of RAG use cases but do not replace RAG for large corpora. A 200K-context model can read an entire book, but it cannot read a 2-million-document corpus. RAG is the necessary architecture for corpus-scale knowledge. Long-context models complement RAG by accepting more retrieved chunks per query, which lets you operate at lower retrieval precision.


    Keep reading

    Read next

    How to train an LLM on your own data — a practical 2026 guide

    End-to-end fine-tuning playbook for a small team — hardware, dataset prep, evaluation.

    Read next

    Choosing a dataset format — Parquet vs JSONL vs Arrow

    Format choices that decide how fast you can re-embed and re-index when the model changes.

    Read next

    Training data size for LLMs — how many tokens do you actually need in 2026?

    Tokens for pretraining and chunks for RAG live in different mental models. Both matter.

  • Best public datasets for training generative AI models in 2026.

    The public training-data landscape changed more in 2024–2025 than in the previous five years combined. FineWeb, FineWeb-2, RedPajama-V2, Common Corpus, Dolma, SlimPajama, Pleias, Cosmopedia. Each has a different license, a different cleaning posture, and a different fit for what you are training. This is a working map for 2026.

    Key takeaways

    • FineWeb-2 (15 trillion tokens, 1,000+ languages) replaced FineWeb-Edu as the strongest open generalist baseline. ODC-BY, requires an additional filtering pass for most production use.
    • Pleias Common Corpus (~2 trillion tokens, 8 European languages including French) is the only major release built entirely on public-domain and openly licensed text. Slower to scale but the cleanest compliance posture.
    • RedPajama-V2 (30 trillion tokens raw, ~5 trillion deduplicated) is the largest open multilingual web crawl. ODC-BY license, with the work of dedup + classification left to the user.
    • For French specifically, fineweb-2-fr is the best free starting point (~200B tokens). Vertical FR corpora (finance, legal, regulatory) are a complement, not a replacement, when you target a specific buyer.
    • Licensing posture matters more than raw token count. A trillion ODC-BY tokens you can ship to enterprise customers beats five trillion of unclear-license web crawl that nobody on legal will sign off.

    How to read this list

    “Best” depends on what you train. A pretraining run for a 7B model is not the same problem as a 1B SFT run on French finance text. The criteria that matter, in order: license clarity, language coverage matching your target, deduplication posture, and the cleaning level the team behind the dataset already shipped. Token count is the last filter, not the first.

    All counts below are approximate as of early 2026. Hugging Face card pages update; verify the exact size and license before you commit. We mark each dataset on three axes: license (permissive vs research-only), language (English-first, multilingual, French-specific), and posture (raw vs filtered).

    Generalist English-first crawls

    Dataset                     | Tokens | License | Posture
    FineWeb-Edu (HuggingFaceFW) | ~1.3T  | ODC-BY  | Filtered (edu classifier)
    FineWeb (original)          | ~15T   | ODC-BY  | Lightly filtered
    Dolma (AI2)                 | 3T     | ODC-BY  | Filtered, multi-domain
    SlimPajama-627B             | ~627B  | ODC-BY  | Dedup of RedPajama-V1
    The Pile (legacy)           | ~825B  | MIT     | 2020 vintage, English-heavy

    FineWeb-Edu remains the strongest single-shot English baseline at the 1B–7B scale. The edu classifier filter (Llama-3 based) drops 80–90 % of CommonCrawl and keeps what scored as “educational” in a 0–5 rubric. The downside is a narrower content profile — code, dialogue, edge-domain technical text are underrepresented. Mix with code and instruction datasets if you want generalist behavior.

    Generalist multilingual crawls

    Dataset               | Tokens             | Languages     | License
    FineWeb-2             | ~15T               | 1,000+        | ODC-BY
    RedPajama-V2          | 30T raw / 5T dedup | 5 main + tail | ODC-BY
    mC4 (multilingual C4) | ~6T                | 108           | ODC-BY
    CulturaX              | ~6.3T              | 167           | ODC-BY (with caveats)
    OSCAR-2301            | ~2T                | 150+          | CC0-1.0

    FineWeb-2 is the 2025 standout: 1,000+ languages, deduplicated per language, ships with MinHash signatures so you can re-dedup against your own corpus. The French subset is around 200B tokens after dedup, which is roughly what a small French LLM lab needs as a base. RedPajama-V2 is larger but ships rawer — you do the classifier pass.

    LICENSING TRAP

    Some “openly licensed” multilingual crawls inherit unclear upstream licenses from individual sites. CulturaX is the cleanest of the multi-license bundles but read the card carefully — a few sources are research-only. If you target enterprise customers, default to ODC-BY or CC0 sources.

    Public-domain and openly licensed corpora

    Pleias Common Corpus, released late 2024 and expanded through 2025, is the largest training corpus built entirely on text in the public domain or under explicit open licenses. ~2 trillion tokens across 8 European languages, with French and English the largest subsets. It is slower to grow than CommonCrawl-based corpora because acquisition is gated on actual license verification, but it is the cleanest compliance posture you can ship to enterprise.

    Other openly licensed building blocks worth knowing: Wikipedia (CC-BY-SA), arXiv metadata + abstracts (mixed, mostly CC0 abstracts), USPTO patents (US-PD), Project Gutenberg (US-PD, limited modern content), the Library of Congress digitization (US-PD).

    A trillion ODC-BY tokens you can ship beats five trillion of unclear-license web crawl that nobody on legal will sign off.

    Code corpora

    Dataset                 | Tokens | License         | Notes
    The Stack v2 (BigCode)  | ~900B  | Permissive only | Opt-out enforced
    StarCoder Training Data | ~250B  | Permissive only | Filtered subset of Stack
    CodeParrot              | ~50B   | Permissive only | Python-focused
    OpenCoder pretrain      | ~960B  | Permissive only | Filtered + classifier

    The Stack v2 is the canonical code corpus for permissive-license training. The opt-out list is enforced — BigCode honors author requests to remove repositories from subsequent versions. If you fine-tune on code, the Stack v2 plus a filtered Python or language-specific slice is the standard starting point.

    Instruction and chat corpora

    For SFT and instruction tuning, the relevant 2026 corpora are different from pretraining datasets. Tulu-3 (Allen AI), Magpie-Pro, Cosmopedia v2, OpenHermes-2.5, and the Llama-Nemotron post-training data are the active baselines.

    • Tulu-3 (Allen AI): 940K instructions, mixed open licenses. Includes RLHF preference pairs and DPO data. Best single-shot SFT dataset published openly in 2025.
    • Cosmopedia v2 (Hugging Face): 30B+ synthetic tokens, generated by Mixtral and Llama-3. Strong for general knowledge SFT, weaker on technical depth.
    • Magpie series: 1M+ instructions generated by extracting Llama-3’s own outputs. Apache-2.0. Useful for diversity, less for specialized domains.
    • OpenHermes-2.5: 1M+ conversations, Apache-2.0. The reference for chat tuning open models.

    Vertical doesn’t mean less. It means more useful.

    Generalist corpora are necessary; they are rarely sufficient. For French finance, regulatory, and economic LLMs, a vertical corpus with audit trail and pseudonymization moves the needle where generalist data does not.

    Vertical-specific French corpora

    For French finance, legal, and regulatory training, the public landscape is much thinner than for generalist text. fineweb-2-fr gives you the breadth. For depth, the public options are:

    • DILA bulk archives (Direction de l’information légale et administrative): JORF, LEGI, CASS, JADE, CONSTIT, KALI, CIRC, CNIL. Open data, Licence Ouverte 2.0. Total around 2.7M documents and 2 billion tokens across French law, jurisprudence, and regulator decisions.
    • EUR-Lex French translations: ~100K regulations and directives in French through 2022. The post-2022 layer (DORA, MiCA, AI Act, CSRD) requires extraction via the Cellar API or directly from EUR-Lex with cookies — the bulk archive does not cover it yet.
    • BOFiP (tax doctrine): ~8K documents covering French tax administration. Open data, Licence Ouverte 2.0.
    • French open-data on data.gouv.fr: a long tail of sectoral datasets, most useful when joined with the DILA archives via citation graphs.

    Choosing a starting mix

    A pragmatic 2026 mix for a 1B–7B French-capable generalist model: 60 % FineWeb-2 (English + French + Spanish), 15 % Pleias Common Corpus, 10 % The Stack v2 (code), 10 % vertical French (finance, regulatory, legal — your call), 5 % instruction data (Tulu-3, Magpie). Adjust the mix toward your target use case. The vertical 10 % is the block that decides whether your model is useful for French regulated industries.

    For SFT on top of an existing pretrained model, the mix collapses: 40 % Tulu-3, 30 % vertical, 20 % Cosmopedia, 10 % code instructions. Pretrained models already saw the generalist text — SFT is where you push the specialization that justifies the model.
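
    The SFT-stage mix is straightforward to express with Hugging Face datasets; the dataset paths below are placeholders except the Tulu-3 mixture:

    from datasets import load_dataset, interleave_datasets

    mix = interleave_datasets(
        [load_dataset("allenai/tulu-3-sft-mixture", split="train"),
         load_dataset("your-org/vertical-fr-sft", split="train"),     # placeholder
         load_dataset("your-org/cosmopedia-subset", split="train"),   # placeholder
         load_dataset("your-org/code-instructions", split="train")],  # placeholder
        probabilities=[0.4, 0.3, 0.2, 0.1], seed=42)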

    Need French finance and regulatory text on top of FineWeb-2?

    We ship a vertical French corpus on finance, regulatory, and economic text — pseudonymized, audit-trailed, and signed. The 10 % of your mix that decides domain capability.

    Frequently asked questions

    Which public dataset has the cleanest license?

    FineWeb-Edu and Pleias Common Corpus are the two cleanest large options as of early 2026. FineWeb-Edu is ODC-BY (more permissive); Pleias is built on public-domain and open-license source material (cleanest compliance). Both are large enough that scale is not the bottleneck. Pretraining-grade datasets in the 5T+ range typically require more careful license review.

    How many tokens do I actually need?

    Chinchilla-optimal for a 7B model is around 140B tokens. For a 70B model, around 1.4T. Going beyond Chinchilla-optimal still helps but with diminishing returns — Llama-3 8B was trained on 15T tokens, well past optimal, and gained measurable capability. The ceiling is your data quality, not your token count.

    Should I deduplicate across datasets?

    Yes. Cross-corpus duplication is real — Wikipedia text shows up in CommonCrawl, parliamentary text shows up in legal databases, the same Reuters articles get republished across regional news sites. MinHash LSH with a Jaccard threshold of 0.7 to 0.8 is the standard pass. Expect 20–40 % drop on a typical multi-source mix.
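
    A dedup sketch with the datasketch library, shingling on lowercased whitespace tokens (threshold per the range above):

    from datasketch import MinHash, MinHashLSH

    def minhash(text, num_perm=128):
        m = MinHash(num_perm=num_perm)
        for token in set(text.lower().split()):
            m.update(token.encode("utf-8"))
        return m

    def dedup(docs, threshold=0.8):
        # docs: {doc_id: text}. Keep a doc only if nothing similar is already kept.
        lsh = MinHashLSH(threshold=threshold, num_perm=128)
        kept = []
        for doc_id, text in docs.items():
            m = minhash(text)
            if not lsh.query(m):  # no near-duplicate indexed yet
                lsh.insert(doc_id, m)
                kept.append(doc_id)
        return kept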

    Is it safe to mix research-only and commercial datasets?

    Not if you train a single model you intend to ship. The model inherits the most restrictive license in the training mix. Some research-only datasets are derived from permissive sources — in those cases you can usually reconstruct an equivalent corpus from the source. Always assume the strict reading.

    What about synthetic data?

    Synthetic data (Cosmopedia, Magpie, Llama-Nemotron) is now standard for instruction tuning and increasingly used in pretraining. The legal question is whether the generator model’s license permits derivative datasets — most permissive open models do, but check. Quality-wise, synthetic data closes a measurable gap on instruction following but does not yet replace real human-authored text on long-form reasoning and technical depth.


    Keep reading

    Read next

    How to train an LLM on your own data — a practical 2026 guide

    End-to-end: hardware, dataset prep, fine-tuning frameworks, evaluation. The practical playbook for a small team.

    Read next

    SFT datasets — format, structure, instruction-tuning best practices

    Instruction-tuning data shapes, format choices, and how to design an SFT set that actually moves your model.

    Read next

    Training data size for LLMs — how many tokens do you actually need in 2026?

    Chinchilla, beyond-Chinchilla, and the quality-vs-quantity trade-off in the post-scaling era.

  • SFT datasets — format, structure, and instruction-tuning best practices.

    A supervised fine-tuning dataset looks deceptively simple — inputs and target outputs. The difficulty is hidden in the format you choose, the template you apply, and the loss mask you compute. Get any wrong and your training run completes cleanly while teaching the model something different from what you intended.

    Key takeaways

    • Chat format (messages array) is the 2026 default — adopt unless you have a specific reason not to.
    • Always use tokenizer.apply_chat_template() — never hardcode the chat string.
    • Loss-mask the prompt; train only on the assistant tokens.
    • 1 000 curated examples outperform 50 000 unfiltered ones. LIMA is consensus, not opinion.
    • Version the dataset (hash + version id in the model card) — Article 10 expects it.

    A supervised fine-tuning dataset looks deceptively simple: a list of inputs and the responses you want the model to produce. The difficulty is hidden in three places — the format you choose, the template you apply, and the loss mask you compute. Get any of those wrong and the training run will complete cleanly while teaching the model something different from what you intended. This guide covers what an SFT dataset actually is, the three formats in use in 2026, the chat templates you cannot ignore, and the quality bar that distinguishes a usable dataset from a dataset that quietly poisons your fine-tune.

    What an SFT dataset is, and is not

    An SFT dataset is a curated collection of input–output examples that demonstrate the behaviour you want the model to learn. It is not a knowledge base, not a search index, and not raw text. Three properties separate it from anything else:

    • Each example has a clear target output. There is no ambiguity about what the model should produce.
    • Each example carries an implicit format contract. The model learns the prefix structure (system prompt, user role, special tokens) as much as it learns the content.
    • The dataset has a defined scope. A mixed dataset of summarization plus code generation plus tool calling teaches each task less well than three focused datasets, unless balanced with deliberate care.

    If your goal is to inject knowledge that the base model lacks, SFT is the wrong instrument — use retrieval-augmented generation or continued pre-training instead. SFT teaches behaviour and format, not facts.

    The three SFT dataset formats in 2026

    Three formats dominate, with different fit depending on whether you are doing single-turn instruction following, multi-turn assistant behaviour, or pre-templated training.

    Format A — Chat (messages array)

    The dominant format in 2026. Each example is a list of messages with a role field (system, user, assistant, occasionally tool) and a content field. JSONL with one example per line.

    {"messages": [
      {"role": "system", "content": "You are a financial compliance assistant."},
      {"role": "user", "content": "Summarise the key changes in DORA art. 5."},
      {"role": "assistant", "content": "DORA Article 5 introduces..."}
    ]}

    Best fit when the target deployment is a chat or assistant interface. Naturally supports multi-turn examples. Most modern training frameworks (TRL’s SFTTrainer, Axolotl, Unsloth) accept this format directly.

    Format B — Instruct (prompt/response pairs)

    The legacy Stanford Alpaca format. Three fields: instruction (what to do), input (optional context), output (the target response).

    {"instruction": "Classify the regulatory framework.",
     "input": "MiCA regulation 2023/1114",
     "output": "MiCA is the EU Markets in Crypto-Assets regulation..."}

    Useful for single-turn task-specific training, especially when the dataset originated as a classification or extraction set. Easier to construct from CSV-style sources but loses information when the target deployment is multi-turn. Convert to chat format if you plan to ship a conversational interface.

    Format C — Pre-templated text

    A single text field per example, already formatted with the model’s chat template and special tokens. The training framework does no further processing.

    {"text": "<|im_start|>systemnYou are a compliance assistant.<|im_end|>n<|im_start|>usernSummarise DORA art. 5.<|im_end|>n<|im_start|>assistantnDORA Article 5...<|im_end|>"}

    Use when you need full control over how the prompt is tokenized — for instance, when training on a custom template or when reproducing an exact published recipe. Avoid for general work: pre-templating couples the dataset to a specific tokenizer version, making the dataset non-portable.

    Chat templates: the silent regression source

    Common failure mode

    A Qwen-templated dataset trained on a Llama base model is a guaranteed accuracy regression. Always regenerate the templated text when you switch base models — do not reuse cached templates.

    Every modern base model ships with a chat template that defines how messages are serialized into a single string before tokenization. Llama 3, Qwen 2.5, Mistral, Gemma 2, and the open-weights cohort each use a different template, and using the wrong one will silently degrade performance.

    • Llama 3: <|begin_of_text|>, <|start_header_id|>role<|end_header_id|>, <|eot_id|>.
    • Qwen 2.5 / Qwen 3: ChatML — <|im_start|>role and <|im_end|>.
    • Mistral: [INST] ... [/INST] with optional system prefix.
    • Gemma 2: <start_of_turn>role and <end_of_turn>.

    Two practical rules:

    1. Always use tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=False) instead of hard-coding strings. The function is owned by the model authors and updates if the template evolves; a sketch follows this list.
    2. If you switch base models mid-project, regenerate the templated text rather than reusing the cached version. A Qwen-templated dataset trained on a Llama base model is a guaranteed accuracy regression.
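
    A sketch of rule 1 (the model name is an example — swap in whichever base you train):

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")

    def render(example):
        # Serialize with the base model's own template; re-run this for
        # every example whenever the base model changes.
        return tokenizer.apply_chat_template(
            example["messages"], tokenize=False, add_generation_prompt=False)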

    Loss masking — train on the answer, not the question

    Building an SFT dataset for regulated industries?

    Our French finance and regulatory corpus ships with instruction-response pairs already filtered, deduplicated, and AI Act art. 10 documented.

    In SFT, you want the model to learn to generate the assistant’s response, not to memorize the prompt. The mechanism is loss masking: during training, the loss is computed only on the tokens belonging to the assistant turn(s). System and user tokens are masked out.

    TRL’s SFTTrainer handles this automatically when using the chat format and supplying a completion_only_loss=True flag (or by passing a response template marker). When in doubt, sanity-check by printing one batch’s loss mask and confirming that only the assistant tokens are unmasked. Teams that skip this check sometimes end up training their model to also predict the user’s next question — a subtle failure mode that does not appear in eval loss but does appear in production.
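
    A sanity-check sketch, assuming the usual Hugging Face convention that masked positions carry the label -100:

    def check_loss_mask(batch, tokenizer):
        # Decode only the tokens that contribute to the loss.
        # The printout should read as assistant text only.
        labels = batch["labels"][0]
        trained = [t for t, l in zip(batch["input_ids"][0], labels) if l != -100]
        print(tokenizer.decode(trained))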

    Quality bar — what to remove, what to keep

    1 000 curated SFT examples beat 50 000 unfiltered ones. The 2023 LIMA paper is consensus, not opinion.

    The 2023 LIMA paper established that 1 000 carefully filtered SFT examples can outperform 50 000 unfiltered ones. Three years of replication studies have confirmed the principle. The filters that matter:

    • Length filter. Drop examples shorter than a sane minimum (10 tokens of assistant response) and longer than your training sequence length. Both extremes corrupt training.
    • Refusal filter. Drop or rewrite “I cannot help with that” responses inherited from upstream datasets, unless refusal behaviour is exactly what you want to fine-tune. Most fine-tunes inherit refusals accidentally, and the refusal behaviour then generalizes beyond the examples that carried it.
    • Duplicate filter. Exact and near-duplicate examples bias the model. MinHash LSH at the example level catches obvious cases; embedding-cosine clustering at sentence level catches paraphrase duplicates.
    • Quality filter. Either human-reviewed at small scale (under 5 000 examples) or LLM-as-judge scored at larger scale, with the bottom 10–30 % dropped. Score on accuracy, faithfulness to the input, and format compliance.
    • Topical balance. If your dataset is heavily skewed toward one task or one style (a common artefact of LLM-generated data), the model overfits to that style. Either rebalance by sub-sampling or use task-mixing during training.

    Where SFT data comes from

    Four sources, with different trade-offs:

    • Open-source datasets. Tulu 3, Open-Orca, UltraChat, Dolma-derived, and the curated lists maintained by the community. Free, large, but generic — usually a starting layer, not a finishing layer.
    • Synthetic data from a larger model. Generate questions and responses with a stronger model (GPT-4-class, Claude, larger open-weights). Cheap, scalable, but inherits biases and refusals of the generator model. License terms vary — read carefully if you plan to ship commercially.
    • Human-curated, in-domain. Domain experts write and review examples that reflect actual user queries. Slowest, most expensive, and the best quality. The default for vertical fine-tunes that need to perform in production.
    • Logs from a deployed system. Real user queries paired with reviewed model outputs (or corrected outputs). Highest signal because it matches production distribution. Requires consent, privacy review, and a redaction pipeline.

    A workable mix for a vertical fine-tune is 30 % open-source generic, 30 % synthetic in-domain, 30 % human-curated in-domain, 10 % production logs once available. Re-balance once you have real eval scores.

    Storage format and versioning

    JSONL is the lingua franca for SFT datasets. It is line-oriented, streamable, diff-friendly, and supported by every training library. For datasets above a few hundred megabytes, Parquet with a defined schema offers better compression and faster random access, at the cost of less convenient inspection. The choice between them depends on size and the rest of your pipeline; we cover it in detail in our article on dataset formats.

    Whatever format you choose, version the dataset. An SFT dataset with no version identifier is a model card defect. The minimum versioning record:

    • A SHA-256 of the content (concatenated example hashes, sorted deterministically) — sketched after this list.
    • A short identifier (semver or date-based) included in the model card.
    • A pinned reference to the generation or curation pipeline (git commit or container digest).
    • A change log noting additions, removals, and quality-filter version bumps.
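
    The content hash is mechanical; a minimal sketch of the first item:

    import hashlib, json

    def dataset_hash(examples):
        # Hash each example canonically, sort for order-independence,
        # then hash the concatenation. Any add/remove/edit changes the id.
        digests = sorted(
            hashlib.sha256(json.dumps(ex, sort_keys=True, ensure_ascii=False)
                           .encode("utf-8")).hexdigest()
            for ex in examples)
        return hashlib.sha256("".join(digests).encode("ascii")).hexdigest()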

    Putting it together — a one-page SFT dataset checklist

    • Format chosen: chat / instruct / pre-templated. Reason documented.
    • Chat template applied with the model’s official tokenizer.
    • Loss mask verified on a sample batch.
    • Filters applied: length, refusal, duplicate, quality, balance — each with measured drop rate.
    • Train/validation split: stratified by source, 5–10 % held out.
    • Provenance per example: source, license, generation method, capture date.
    • Dataset hash + version identifier recorded in the model card.
    • Retraction procedure: how a specific example or a class of examples is removed and the dataset re-hashed.

    A team that crosses every item off this list ships a fine-tune that works in production and that survives the documentation audit a regulated customer will eventually request. A team that skips half of them ships a fine-tune that demos well on the team’s laptop and then quietly fails the moment the data distribution shifts.

    For a fuller view of how an SFT dataset connects to the rest of the fine-tuning workflow, see our companion guide on training an LLM on your own data. For the regulatory layer that determines whether your fine-tuned model can be deployed in finance, healthcare, or law enforcement under the EU AI Act, see our article on AI Act Article 10 documentation.

    See also: SFT vs DPO vs RLHF dataset shapes and the French legal NLP landscape.

    Frequently asked questions

    Chat format or instruct format?

    Chat (messages array) for any product that ships as an assistant. Instruct (prompt/response) only for single-turn task-specific work where multi-turn doesn’t apply. Pre-templated text only when you need full tokenizer control.

    How do I avoid loss-masking bugs?

    Print one batch’s loss mask early in the run and verify only assistant tokens are unmasked. TRL’s SFTTrainer handles this automatically with the chat format and completion_only_loss=True.

    How many examples do I need?

    1 000–10 000 carefully filtered is the sweet spot. Above 100 000 diminishing returns set in — each new 10 K adds 0.5–1 % on benchmarks at most.

    Is synthetic SFT data acceptable?

    Yes, mixed with human-curated data. Synthetic is cheap and scalable but inherits the generator model’s biases and refusals. Read license terms carefully if you ship commercially.

    How do I version an SFT dataset?

    SHA-256 of concatenated example hashes (sorted deterministically), plus a short version id and a pinned reference to the curation pipeline. Bumps every time examples are added, removed, or re-filtered.

    Need an SFT-ready corpus for finance or regulation?

    Our French regulatory and financial text corpus is delivered as Parquet shards with the structure SFT pipelines expect — pre-filtered, AI Act-documented, tier-licensed.


    Keep reading

    Read next

    How to train an LLM on your own data

    Where SFT fits relative to RAG and continued pre-training.

    Read next

    Choosing a dataset format

    JSONL, Parquet, or Arrow for your SFT files.

    Read next

    Training data size for LLMs

    How many examples are enough for each training stage.

  • How to train an LLM on your own data — a practical 2026 guide.

    "Train an LLM on your own data" can mean six different things, and choosing the wrong one is the most expensive mistake teams make before writing a single line of code. This guide walks through the decision tree, data prep, the 2026 stack, and the governance layer that determines whether your fine-tune ships in a regulated industry.

    Key takeaways

    • Start with RAG; move to SFT only when RAG plateaus on task accuracy.
    • QLoRA on a 4-bit base model is the 2026 default — 70B fits on a single H100.
    • Data curation (dedup, quality filter, topical filter) matters more than hyperparameters.
    • 1 000 high-quality SFT examples beat 100 000 noisy ones — LIMA replicated three times over.
    • Ship with a provenance manifest — required for AI Act Article 10 compliance in the EU.

    “Train an LLM on your own data” can mean six different things, and choosing the wrong one is the most expensive mistake teams make before writing a single line of code. This guide walks through the decision tree, the data preparation steps that actually move the needle, the 2026 fine-tuning stack, and the governance layer that increasingly determines whether your fine-tuned model can be deployed in a regulated industry at all.

    Step 1 — Decide what “training on your own data” actually means

    Start with RAG. Move to SFT when RAG plateaus. Consider continued pre-training only when measurable domain-specific gaps remain.

    Four approaches sit under that phrase, with order-of-magnitude differences in cost, control, and lock-in:

    • Retrieval-Augmented Generation (RAG). The model is unchanged; your data is indexed in a vector store and injected into the prompt at query time. Cheapest, fastest to ship, easy to update. Best fit when answers must reflect documents that change weekly and when traceability to the source document is mandatory.
    • Supervised Fine-Tuning (SFT) with LoRA/QLoRA. A small set of adapter weights is trained on instruction–response pairs. Affordable (single GPU, hours rather than weeks), preserves the base model, and gives meaningful gains on domain-specific tasks. The default 2026 choice for most projects.
    • Continued pre-training. Resume the base model’s pre-training on a large in-domain corpus (10–100 B tokens). Useful when the base model has weak vocabulary or weak fluency in your domain. Expensive (multi-node, days to weeks of GPU time), and rarely the right first step.
    • From-scratch pre-training. Building a foundation model from the ground up. Justified only when no open-source base meets your latency, licensing, or sovereignty constraints. Budget for it in tens of thousands of GPU-hours and in legal review hours for the training corpus.

    Decision rule: start with RAG. Move to SFT when RAG plateaus on task accuracy. Consider continued pre-training only when you have measurable, domain-specific tokenization or fluency gaps that SFT cannot fix.

    Step 2 — Curate the data before you format it

    Quality > quantity

    For SFT specifically, 1 000 high-quality instruction-response pairs beat 100 000 low-quality ones. The 2023 LIMA paper plus three years of replication studies have made this consensus, not opinion.

    The most underrated step. Most “fine-tuned LLM” projects fail not at the training run, but at data quality. Three filters to apply before any formatting:

    • Deduplication. Exact and near-duplicate examples inflate training time and bias the model toward overrepresented patterns. Use MinHash LSH or SimHash to detect near-duplicates at the document and paragraph level. Expect 10 % to 40 % reduction in raw corpora.
    • Quality scoring. Length filters, language-ID filters, perplexity filters from a small reference model, and an LLM-as-judge pass on a representative sample. The aim is to remove machine-generated boilerplate, broken extractions, and out-of-scope content.
    • Topical filtering. If you are building a vertical model, keep only documents that match the target domain. A 100 K-document corpus where 80 % are on-topic outperforms a 500 K-document corpus where 30 % are on-topic — every time.

    For SFT specifically, the working rule is: 1 000 high-quality instruction–response pairs beat 100 000 low-quality ones. The 2023 LIMA paper and three years of replication studies have made this consensus, not opinion.

    Step 3 — Format the data for your training method

    The format that wins depends on the training stage:

    • For RAG: chunked text with embeddings. Chunk size 200–800 tokens, with 10–20 % overlap. Store source URL, chunk index, and an immutable document hash next to each chunk — the hash becomes your provenance anchor.
    • For SFT: JSONL with one example per line. Schema typically {"messages": [{"role": "system", ...}, {"role": "user", ...}, {"role": "assistant", ...}]}. Apply the base model’s chat template (Llama 3, Qwen 2.5, Mistral all have published templates) to convert messages to tokenized strings.
    • For continued pre-training: Parquet shards of raw text with a stable schema. Each row carries the text, the source identifier, the license, and a hash. Tokenize on the fly during training, do not pre-tokenize to disk — tokenizer changes invalidate the cache.

    If you are unsure which file format to use for storage and downstream pipelines, the trade-offs between Parquet, JSONL, and Arrow are covered in our companion article on dataset formats.

    Step 4 — Choose the parameter-efficient method that fits your budget

    Full fine-tuning of a 7B model requires roughly 70–110 GB of GPU memory. Most teams cannot or should not provision that. Three parameter-efficient alternatives dominate in 2026:

    • LoRA — Low-Rank Adaptation. Trains rank-8 to rank-64 update matrices on selected layers. Memory footprint drops by 3–5×; quality is within 1–2 % of full fine-tuning on most benchmarks. The default.
    • QLoRA — LoRA on a 4-bit quantized base model. Memory drops by 10–15× compared to full fine-tuning. A 7B model fine-tunes on a single 24 GB consumer GPU; a 13B fits on 32 GB; a 70B fits on a single H100. Slightly slower per step than LoRA, but most teams accept the trade-off.
    • DoRA, NEFTune, ReFT — newer variants that add small accuracy gains in specific configurations. Worth benchmarking once your baseline LoRA pipeline works, not before.

    Rule of thumb: start with QLoRA, rank 16, on the attention projection layers. Tune from there only if your eval set shows specific weaknesses.
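
    The rule of thumb translates into a few lines of peft + bitsandbytes configuration; the rank, alpha, and Llama-style module names below are the starting point described above, not a universal recipe:

    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig
    from peft import LoraConfig, get_peft_model

    bnb = BitsAndBytesConfig(
        load_in_4bit=True,                    # QLoRA: 4-bit NF4 base weights
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16)

    model = AutoModelForCausalLM.from_pretrained(
        "meta-llama/Llama-3.1-8B", quantization_config=bnb, device_map="auto")

    lora = LoraConfig(
        r=16, lora_alpha=32, lora_dropout=0.05,
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections
        task_type="CAUSAL_LM")
    model = get_peft_model(model, lora)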

    Step 5 — The 2026 fine-tuning stack

    Building a fine-tune for finance, regulation, or compliance?

    We publish French regulatory and financial training corpora — pre-filtered, deduplicated, AI Act art. 10 ready, with per-document provenance.

    The toolchain has consolidated significantly. A minimal, production-grade SFT pipeline in May 2026 uses:

    • Base layer: Python 3.11+, PyTorch 2.5+, CUDA 12.x.
    • Hugging Face ecosystem: transformers for model loading, datasets for streaming, peft for LoRA adapters, trl for SFT trainers and DPO, accelerate for multi-GPU.
    • Throughput optimization: Unsloth provides patched kernels that reduce memory by 30–50 % and accelerate training by 2–3× on supported architectures. Axolotl offers a higher-level YAML config wrapper if you prefer config over code.
    • Quantization: bitsandbytes for 4-bit and 8-bit, plus the native bf16 and fp16 paths.
    • Inference and packaging: vLLM for serving, llama.cpp or MLX for on-device deployment, GGUF or ONNX for portable model artefacts.

    If you are starting from zero, Unsloth’s documented notebooks are the fastest path to a working baseline. If you need fine-grained control or are running on non-NVIDIA hardware (AMD, Intel Gaudi, Apple Silicon), drop down to transformers + peft + trl directly.

    Step 6 — Training run essentials

    Six hyperparameters and decisions that disproportionately affect outcomes, consolidated in the sketch after this list:

    1. Learning rate. For LoRA: 1e-4 to 3e-4. For full fine-tuning: 1e-5 to 5e-5. Use a linear warm-up over 3–10 % of steps, then cosine decay.
    2. Batch size. Effective batch size (per-device × gradient accumulation × num devices) of 32–128 is a workable starting range for SFT. Too small and gradients get noisy; too large and the updates average away the per-example signal a small curated dataset carries.
    3. Epochs. 1 to 3 epochs for SFT on a curated dataset of a few thousand examples. More epochs typically degrade base model behaviour through catastrophic forgetting.
    4. Sequence length. Set to the longest reasonable example, not the longest possible one. Padding inflates memory and slows training. Sort examples by length and use packing (multiple short examples per training sequence) when supported.
    5. Validation set. Hold out 5–10 % of examples, stratified by source and topic. Compute eval loss every 50–200 steps. Stop training when eval loss plateaus or rises.
    6. Checkpointing. Save every N steps and at the end of each epoch. Keep the last three checkpoints. The cheapest insurance against an aborted run.
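
    Most of these decisions map directly onto trl's SFTConfig. A minimal sketch, assuming model, train_ds, and val_ds already exist from the earlier steps; every value is a starting point, not a tuned recipe:

```python
from trl import SFTConfig, SFTTrainer

args = SFTConfig(
    output_dir="out/sft-run",
    learning_rate=2e-4,              # LoRA range: 1e-4 to 3e-4
    warmup_ratio=0.05,               # linear warm-up over ~5% of steps
    lr_scheduler_type="cosine",      # then cosine decay
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,   # effective batch = 4 x 8 x num devices
    num_train_epochs=2,              # 1-3 epochs on a curated set
    eval_strategy="steps",           # "evaluation_strategy" on older transformers
    eval_steps=100,                  # eval loss every 50-200 steps
    save_steps=500,
    save_total_limit=3,              # keep the last three checkpoints
    bf16=True,
)

# `model`, `train_ds`, `val_ds` come from the earlier steps (assumed).
trainer = SFTTrainer(
    model=model, args=args, train_dataset=train_ds, eval_dataset=val_ds
)
trainer.train()
```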

    Step 7 — Evaluate beyond loss

    Training loss tells you the model learned something. It does not tell you whether the model is better than the base model on tasks you care about. Three evaluation layers in increasing order of cost:

    • Automatic benchmarks. A small task-specific test set with exact-match or F1 scoring, plus a general benchmark (MMLU-style, language-specific) to detect regression on out-of-domain capabilities. Cheap, fast, but blind to fluency and style. A minimal harness for this layer is sketched after the list.
    • LLM-as-judge. A larger model (or the same model with a structured rubric) scores outputs on a held-out test set across 3–5 axes (accuracy, helpfulness, faithfulness, format, safety). Useful but not reliable enough to ship without human spot-check.
    • Human evaluation. 100–300 examples reviewed by domain experts, comparing base vs. fine-tuned outputs side-by-side. Expensive, slow, and the only signal that genuinely correlates with downstream user satisfaction. Reserve for go/no-go decisions.
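
    For the first layer, a minimal exact-match harness over a held-out JSONL test set; gen_base and gen_ft stand in for whatever inference callables you wrap around the two models, and the "prompt"/"answer" field names are illustrative:

```python
import json

def exact_match(pred: str, gold: str) -> bool:
    """Case- and whitespace-insensitive exact match."""
    return pred.strip().lower() == gold.strip().lower()

def score(path: str, generate) -> float:
    """Exact-match accuracy over a held-out JSONL test set."""
    with open(path, encoding="utf-8") as f:
        rows = [json.loads(line) for line in f]
    hits = sum(exact_match(generate(r["prompt"]), r["answer"]) for r in rows)
    return hits / len(rows)

# `gen_base` / `gen_ft` wrap inference for the base and fine-tuned models.
task_base = score("test.jsonl", gen_base)
task_ft = score("test.jsonl", gen_ft)
print(f"task exact-match: base={task_base:.3f}  fine-tuned={task_ft:.3f}")
```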

    Track all three over time. A fine-tune that gains five points on the task benchmark but loses ten on the general benchmark is usually a net regression — your fine-tune has acquired domain skill at the expense of generality.

    Step 8 — Deploy with the governance layer in place

    Fine-tuned models deployed in 2026 in the European Union for high-risk use cases (finance, healthcare, employment, law enforcement, critical infrastructure) fall under the EU AI Act’s Article 10 obligations. The same goes for the training data that produced them. Concretely, ship with:

    • Training data provenance manifest. A per-document record of source URL, capture date, license, processing chain, and content hash. Stored next to the model artefact, not in a separate spreadsheet. A sketch of one manifest row follows this list.
    • Versioning. Each training run produces a model card with the dataset hash, base model identifier, hyperparameters, and evaluation scores. Re-training without re-versioning is a compliance defect.
    • Retraction procedure. A documented path for removing a specific document (or a class of documents — e.g. a data subject’s personal data) and either re-training or accepting a measured drift. GDPR Article 17 makes this concrete; AI Act Article 10 makes it auditable.
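
    A sketch of one manifest row, written as JSONL next to the model artefact; the field names are illustrative, and the SHA-256 content hash is the provenance anchor mentioned in Step 3:

```python
import datetime
import hashlib
import json

def manifest_record(doc: dict) -> dict:
    """One provenance row per training document; field names are illustrative."""
    return {
        "source_url": doc["url"],
        "capture_date": doc["captured_at"],          # ISO-8601 string
        "license": doc["license"],                   # e.g. "etalab-2.0"
        "processing_chain": ["extract", "dedup", "lang-filter"],
        "content_hash": hashlib.sha256(doc["text"].encode("utf-8")).hexdigest(),
        "written_at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
    }

# `filtered_corpus` is the post-curation document list from Step 2 (assumed).
with open("model_dir/provenance.jsonl", "w", encoding="utf-8") as f:
    for doc in filtered_corpus:
        f.write(json.dumps(manifest_record(doc), ensure_ascii=False) + "\n")
```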

    Teams that bolt this layer on after deployment spend two to four months retrofitting. Teams that add it during data prep spend two to four days. The pattern transfers directly across domains, including the regulated-text corpora we build for our own customers.

    When fine-tuning is not the answer

    Three diagnostic questions before you start a training run:

    1. Has a clean RAG baseline been measured on the same evaluation set? If not, build it first. Fine-tuning to fix a problem RAG would have solved is the most common waste.
    2. Does the base model already know your domain vocabulary? Run a short qualitative probe — ask it 20 domain questions (a minimal probe loop is sketched after this list). If responses are coherent but bland, SFT will help. If responses are hallucinatory or wrong on basics, you may need continued pre-training or a different base model.
    3. Is the data you would train on the right shape? For SFT you need instructions and responses, not documents. Converting unstructured documents into instruction–response pairs is a sub-project in itself.
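
    A minimal probe along the lines of question 2, using the transformers pipeline; the model name and the questions are placeholders for your own base model and domain:

```python
from transformers import pipeline

# Qualitative vocabulary probe: ~20 domain questions against the base model.
probe = pipeline(
    "text-generation",
    model="mistralai/Mistral-7B-Instruct-v0.3",  # base model is an assumption
)

questions = [
    "What does CRR Article 92 require?",
    "Explain the difference between Tier 1 and Tier 2 capital.",
    # ... ~18 more questions covering your domain's core vocabulary
]
for q in questions:
    out = probe(q, max_new_tokens=200, do_sample=False)[0]["generated_text"]
    print("Q:", q, "\nA:", out[:400], "\n---")
```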

    Bottom line

    The 2026 path to a useful domain-specific LLM is rarely from-scratch training. It is a curated dataset, a QLoRA fine-tune on a sensible open base model, an evaluation harness that catches regressions early, and a governance layer that lets the result be deployed in places that matter. The bottleneck is almost always the data, not the model.

    If the data you need is French-language financial, regulatory, or economic text — codes, doctrine, EU regulations, prudential positions — we build, version, and license that corpus with the governance layer described above already in place. The methodology we use is the same methodology described in our writing on training datasets and AI Act compliance.

    See also: the best public LLM datasets in 2026, what makes a corpus retrieval-friendly, and SFT vs DPO vs RLHF dataset shapes.

    Frequently asked questions

    Should I fine-tune or use RAG?

    Use RAG when answers must reflect documents that change frequently and when traceability to source is mandatory. Use SFT when you need to teach a specific output format, persona, or task behaviour. The best production systems usually combine both.

    How much data do I need to fine-tune?

    For SFT, 1,000–10,000 carefully curated instruction–response pairs is the sweet spot. For continued pre-training, 1–10B in-domain tokens. See our training data size article for stage-by-stage volumes.

    Can I fine-tune on a single GPU?

    Yes, with QLoRA. A 7B fits on 24 GB; 13B on 32 GB; 70B on a single H100. Throughput is lower than multi-GPU but the budget gap is two orders of magnitude.

    Is my fine-tuned model subject to the EU AI Act?

    If it is placed on the EU market, or its outputs are used in the EU, for any Annex III high-risk use case (finance, healthcare, employment, biometrics, law enforcement, justice, critical infrastructure) — yes. Article 10 obligations apply from 2 August 2026.

    What is the cheapest evaluation strategy?

    Hold out 200–500 examples scored on a task-specific F1 metric, plus a small general benchmark to detect regression. Run after every training change. Reserve human eval for go/no-go decisions.

    Need a vertical training corpus for finance or regulation?

    We build, version, and license French regulatory, financial, and economic text corpora — AI Act art. 10 ready, per-document provenance, tiered licensing for sample / standard / premium / enterprise.


    Keep reading

    • SFT datasets — format and best practices. The three SFT formats, chat templates, loss masking, and the quality bar.
    • Training data size for LLMs. Concrete token volumes for pre-training, continued pre-training, SFT, DPO, evaluation, and RAG.
    • EU AI Act Article 10. What training data documentation actually requires under the AI Act.