A supervised fine-tuning dataset looks deceptively simple — inputs and target outputs. The difficulty is hidden in the format you choose, the template you apply, and the loss mask you compute. Get any wrong and your training run completes cleanly while teaching the model something different from what you intended.
Key takeaways
- Chat format (messages array) is the 2026 default — adopt unless you have a specific reason not to.
- Always use tokenizer.apply_chat_template() — never hardcode the chat string.
- Loss-mask the prompt; train only on the assistant tokens.
- 1 000 curated examples outperform 50 000 unfiltered ones. LIMA is consensus, not opinion.
- Version the dataset (hash + version id in the model card) — Article 10 expects it.
In this article
- What an SFT dataset is, and is not
- The three SFT dataset formats in 2026
- Chat templates: the silent regression source
- Loss masking — train on the answer, not the question
- Quality bar — what to remove, what to keep
- Where SFT data comes from
- Storage format and versioning
- Putting it together — a one-page SFT dataset checklist
- Frequently asked questions
This guide covers what an SFT dataset actually is, the three formats in use in 2026, the chat templates you cannot ignore, and the quality bar that distinguishes a usable dataset from one that quietly poisons your fine-tune.
What an SFT dataset is, and is not
An SFT dataset is a curated collection of input–output examples that demonstrate the behaviour you want the model to learn. It is not a knowledge base, not a search index, and not raw text. Three properties separate it from anything else:
- Each example has a clear target output. There is no ambiguity about what the model should produce.
- Each example carries an implicit format contract. The model learns the prefix structure (system prompt, user role, special tokens) as much as it learns the content.
- The dataset has a defined scope. A mixed dataset of summarization plus code generation plus tool calling teaches each task less well than three focused datasets, unless balanced with deliberate care.
If your goal is to inject knowledge that the base model lacks, SFT is the wrong instrument — use retrieval-augmented generation or continued pre-training instead. SFT teaches behaviour and format, not facts.
The three SFT dataset formats in 2026
Three formats dominate, with different fit depending on whether you are doing single-turn instruction following, multi-turn assistant behaviour, or pre-templated training.
Format A — Chat (messages array)
The dominant format in 2026. Each example is a list of messages with a role field (system, user, assistant, occasionally tool) and a content field. JSONL with one example per line.
{"messages": [
{"role": "system", "content": "You are a financial compliance assistant."},
{"role": "user", "content": "Summarise the key changes in DORA art. 5."},
{"role": "assistant", "content": "DORA Article 5 introduces..."}
]}
Best fit when the target deployment is a chat or assistant interface. Naturally supports multi-turn examples. Most modern training frameworks (TRL’s SFTTrainer, Axolotl, Unsloth) accept this format directly.
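As a sketch of how directly this format is consumed (the file name, model id, and output directory below are placeholders, and completion_only_loss assumes a recent TRL release), a minimal run looks like:

from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# TRL recognises the "messages" column and applies the model's own
# chat template during preprocessing.
dataset = load_dataset("json", data_files="sft_train.jsonl", split="train")
trainer = SFTTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    train_dataset=dataset,
    args=SFTConfig(output_dir="sft-out", completion_only_loss=True),
)
trainer.train()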
Format B — Instruct (prompt/response pairs)
The legacy Stanford Alpaca format. Three fields: instruction (what to do), input (optional context), output (the target response).
{"instruction": "Classify the regulatory framework.",
"input": "MiCA regulation 2023/1114",
"output": "MiCA is the EU Markets in Crypto-Assets regulation..."}
Useful for single-turn task-specific training, especially when the dataset originated as a classification or extraction set. Easier to construct from CSV-style sources but loses information when the target deployment is multi-turn. Convert to chat format if you plan to ship a conversational interface.
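A minimal conversion sketch, assuming the standard Alpaca field names:

def alpaca_to_chat(example: dict) -> dict:
    # Fold the optional input field into the user turn rather than
    # inventing a system prompt the source never had.
    user = example["instruction"]
    if example.get("input"):
        user += "\n\n" + example["input"]
    return {"messages": [
        {"role": "user", "content": user},
        {"role": "assistant", "content": example["output"]},
    ]}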
Format C — Pre-templated text
A single text field per example, already formatted with the model’s chat template and special tokens. The training framework does no further processing.
{"text": "<|im_start|>systemnYou are a compliance assistant.<|im_end|>n<|im_start|>usernSummarise DORA art. 5.<|im_end|>n<|im_start|>assistantnDORA Article 5...<|im_end|>"}
Use when you need full control over how the prompt is tokenized — for instance, when training on a custom template or when reproducing an exact published recipe. Avoid for general work: pre-templating couples the dataset to a specific tokenizer version, making the dataset non-portable.
Chat templates: the silent regression source
Common failure mode
A Qwen-templated dataset trained on a Llama base model is a guaranteed accuracy regression. Always regenerate the templated text when you switch base models — do not reuse cached templates.
Every modern base model ships with a chat template that defines how messages are serialized into a single string before tokenization. Llama 3, Qwen 2.5, Mistral, Gemma 2, and the open-weights cohort each use a different template, and using the wrong one will silently degrade performance.
- Llama 3: <|begin_of_text|>, <|start_header_id|>role<|end_header_id|>, <|eot_id|>.
- Qwen 2.5 / Qwen 3: ChatML — <|im_start|>role and <|im_end|>.
- Mistral: [INST] ... [/INST] with optional system prefix.
- Gemma 2: <start_of_turn>role and <end_of_turn>.
Two practical rules:
- Always use tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=False) instead of hard-coding strings. The function is owned by the model authors and updates if the template evolves (see the sketch after this list).
- If you switch base models mid-project, regenerate the templated text rather than reusing the cached version. A Qwen-templated dataset trained on a Llama base model is a guaranteed accuracy regression.
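Applied to the chat example from Format A (the checkpoint name below is a placeholder), the call is a one-liner, and its output is exactly the string Format C would store in its text field:

from transformers import AutoTokenizer

messages = [
    {"role": "system", "content": "You are a financial compliance assistant."},
    {"role": "user", "content": "Summarise the key changes in DORA art. 5."},
    {"role": "assistant", "content": "DORA Article 5 introduces..."},
]

# Swapping the checkpoint regenerates the string with that model's
# special tokens; nothing about the template is hard-coded here.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
print(tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=False))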
Loss masking — train on the answer, not the question
In SFT, you want the model to learn to generate the assistant’s response, not to memorize the prompt. The mechanism is loss masking: during training, the loss is computed only on the tokens belonging to the assistant turn(s). System and user tokens are masked out.
TRL’s SFTTrainer handles this automatically for the chat format when completion_only_loss=True is set in SFTConfig (or when a response-template marker is passed to the completion-only data collator). When in doubt, sanity-check by printing one batch’s loss mask and confirming that only the assistant tokens are unmasked, as in the sketch below. Teams that skip this check sometimes end up training their model to also predict the user’s next question — a subtle failure mode that does not appear in eval loss but does appear in production.
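A sketch of that check, assuming a constructed TRL trainer and its tokenizer are in scope:

# Positions labelled -100 are excluded from the loss; whatever
# decodes here is what the model is actually trained to emit.
batch = next(iter(trainer.get_train_dataloader()))
labels = batch["labels"][0]
input_ids = batch["input_ids"][0]
print(tokenizer.decode(input_ids[labels != -100]))
# Expected: only the assistant turn(s). If user text appears,
# the mask is wrong.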
Quality bar — what to remove, what to keep
1 000 curated SFT examples beat 50 000 unfiltered ones. The 2023 LIMA paper is consensus, not opinion.
The 2023 LIMA paper established that 1 000 carefully filtered SFT examples can outperform 50 000 unfiltered ones. Three years of replication studies have confirmed the principle. The filters that matter:
- Length filter. Drop examples shorter than a sane minimum (10 tokens of assistant response) and longer than your training sequence length. Both extremes corrupt training.
- Refusal filter. Drop or rewrite “I cannot help with that” responses inherited from upstream datasets, unless refusal behaviour is exactly what you want to fine-tune. Most fine-tunes inherit refusals accidentally, and the refusal style then generalizes well beyond the cases it was meant to cover.
- Duplicate filter. Exact and near-duplicate examples skew the model toward whatever they repeat. MinHash LSH at the example level catches obvious cases (see the sketch after this list); embedding-cosine clustering at sentence level catches paraphrase duplicates.
- Quality filter. Either human-reviewed at small scale (under 5 000 examples) or LLM-as-judge scored at larger scale, with the bottom 10–30 % dropped. Score on accuracy, faithfulness to the input, and format compliance.
- Topical balance. If your dataset is heavily skewed toward one task or one style (a common artefact of LLM-generated data), the model overfits to that style. Either rebalance by sub-sampling or use task-mixing during training.
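For the duplicate filter above, a minimal near-duplicate pass with the datasketch library might look like this (the threshold, shingle width, and responses list are illustrative, not tuned values):

from datasketch import MinHash, MinHashLSH

def minhash(text: str, num_perm: int = 128) -> MinHash:
    # Character 5-grams are a crude but serviceable shingle choice.
    m = MinHash(num_perm=num_perm)
    for shingle in {text[i:i + 5] for i in range(max(1, len(text) - 4))}:
        m.update(shingle.encode("utf-8"))
    return m

lsh = MinHashLSH(threshold=0.8, num_perm=128)
kept = []
for i, response in enumerate(responses):  # responses: assistant outputs
    sig = minhash(response)
    if not lsh.query(sig):  # no near-duplicate already kept
        lsh.insert(str(i), sig)
        kept.append(response)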
Where SFT data comes from
Four sources, with different trade-offs:
- Open-source datasets. Tulu 3, Open-Orca, UltraChat, Dolma-derived, and the curated lists maintained by the community. Free, large, but generic — usually a starting layer, not a finishing layer.
- Synthetic data from a larger model. Generate questions and responses with a stronger model (GPT-4-class, Claude, larger open-weights). Cheap, scalable, but inherits biases and refusals of the generator model. License terms vary — read carefully if you plan to ship commercially.
- Human-curated, in-domain. Domain experts write and review examples that reflect actual user queries. Slowest, most expensive, and the best quality. The default for vertical fine-tunes that need to perform in production.
- Logs from a deployed system. Real user queries paired with reviewed model outputs (or corrected outputs). Highest signal because it matches production distribution. Requires consent, privacy review, and a redaction pipeline.
A workable mix for a vertical fine-tune is 30 % open-source generic, 30 % synthetic in-domain, 30 % human-curated in-domain, 10 % production logs once available. Re-balance once you have real eval scores.
Storage format and versioning
JSONL is the lingua franca for SFT datasets. It is line-oriented, streamable, diff-friendly, and supported by every training library. For datasets above a few hundred megabytes, Parquet with a defined schema offers better compression and faster random access, at the cost of less convenient inspection. The choice between them depends on size and the rest of your pipeline; we cover it in detail in our article on dataset formats.
Whatever format you choose, version the dataset. An SFT dataset with no version identifier is a model card defect. The minimum versioning record:
- A SHA-256 of the content (concatenated example hashes, sorted deterministically; see the sketch after this list).
- A short identifier (semver or date-based) included in the model card.
- A pinned reference to the generation or curation pipeline (git commit or container digest).
- A change log noting additions, removals, and quality-filter version bumps.
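A sketch of the content hash from the first item, assuming one JSON example per line:

import hashlib
import json

def dataset_hash(path: str) -> str:
    # Hash each example's canonical JSON, sort the digests so line
    # order cannot change the result, then hash the concatenation.
    digests = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            canonical = json.dumps(json.loads(line), sort_keys=True)
            digests.append(hashlib.sha256(canonical.encode("utf-8")).hexdigest())
    return hashlib.sha256("".join(sorted(digests)).encode("utf-8")).hexdigest()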
Putting it together — a one-page SFT dataset checklist
- Format chosen: chat / instruct / pre-templated. Reason documented.
- Chat template applied with the model’s official tokenizer.
- Loss mask verified on a sample batch.
- Filters applied: length, refusal, duplicate, quality, balance — each with measured drop rate.
- Train/validation split: stratified by source, 5–10 % held out.
- Provenance per example: source, license, generation method, capture date.
- Dataset hash + version identifier recorded in the model card.
- Retraction procedure: how a specific example or a class of examples is removed and the dataset re-hashed.
A team that crosses every item off this list ships a fine-tune that works in production and that survives the documentation audit a regulated customer will eventually request. A team that skips half of them ships a fine-tune that demos well on the team’s laptop and then quietly fails the moment the data distribution shifts.
For a fuller view of how an SFT dataset connects to the rest of the fine-tuning workflow, see our companion guide on training an LLM on your own data. For the regulatory layer that determines whether your fine-tuned model can be deployed in finance, healthcare, or law enforcement under the EU AI Act, see our article on AI Act Article 10 documentation.
See also: SFT vs DPO vs RLHF dataset shapes and the French legal NLP landscape.
Frequently asked questions
Chat format or instruct format?
Chat (messages array) for any product that ships as an assistant. Instruct (prompt/response) only for single-turn task-specific work where multi-turn doesn’t apply. Pre-templated text only when you need full tokenizer control.
How do I avoid loss-masking bugs?
Print one batch’s loss mask early in the run and verify only assistant tokens are unmasked. TRL’s SFTTrainer handles this automatically with the chat format and completion_only_loss=True.
How many examples do I need?
1 000–10 000 carefully filtered is the sweet spot. Above 100 000 diminishing returns set in — each new 10 K adds 0.5–1 % on benchmarks at most.
Is synthetic SFT data acceptable?
Yes, mixed with human-curated data. Synthetic is cheap and scalable but inherits the generator model’s biases and refusals. Read license terms carefully if you ship commercially.
How do I version an SFT dataset?
SHA-256 of concatenated example hashes (sorted deterministically), plus a short version id and a pinned reference to the curation pipeline. Bump the version every time examples are added, removed, or re-filtered.