Instruction tuning, supervised fine-tuning, and RLHF/DPO all live under the same vague label of “post-training” in 2026. They are not interchangeable. Each requires a differently shaped dataset, captures different signal, and answers different deployment questions.
Key takeaways
- SFT is the umbrella. Instruction tuning is a flavor of SFT with explicit task instructions. RLHF and DPO are different — they require preference pairs, not just demonstrations.
- Instruction tuning datasets are (instruction, optional context, response) triples. Quality beats quantity past ~10K examples — LIMA showed strong results at 1K well-curated examples.
- RLHF and DPO need (prompt, preferred response, rejected response) triples. Building these is 5–10× the per-example cost of SFT examples, but the alignment signal is sharper.
- Domain-specific SFT on a permissive corpus is the highest-ROI post-training step for vertical applications. 5,000–30,000 well-curated examples typically outperform a 1M-example generalist SFT mix on domain tasks.
- Eval frames the dataset shape. If your eval is “does the model follow this instruction,” build instruction data. If it is “does the model prefer the safer answer,” build preference data. Mismatched eval-vs-data is the most common failure mode.
In this article
- Terminology — what each method actually is
- SFT data shape — examples and counts
- Instruction tuning data shape — examples and counts
- RLHF and DPO data shape — preference pairs
- Quality vs quantity — the LIMA finding
- Synthetic data — when it helps and when it hurts
- Eval shapes the dataset shape
- A practical post-training mix for vertical LLMs
- Frequently asked questions
Terminology — what each method actually is
Supervised fine-tuning (SFT) is the umbrella term: you continue training a pretrained model on labeled examples of (input, desired output). Instruction tuning is a flavor of SFT where the inputs are explicit task instructions. RLHF (reinforcement learning from human feedback) trains the model to prefer responses humans prefer, using a reward model. DPO (direct preference optimization) achieves a similar effect without a separate reward model, by directly optimizing the preference loss.
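To make the DPO point concrete, here is a minimal sketch of the published DPO objective in PyTorch (an illustration, not any specific library's implementation). The inputs are assumed to be per-example log-probabilities of each full response, summed over tokens, under the policy and a frozen reference copy of it.

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Implicit "rewards": how much more likely the policy makes each response
    # than the frozen reference model does
    chosen_reward = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_reward = beta * (policy_rejected_logps - ref_rejected_logps)
    # Push the policy to rank the chosen response above the rejected one
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()
```

No reward model appears anywhere: the preference pairs and the reference model carry the whole signal, which is why the dataset shape matters so much.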
In 2026, the typical production post-training stack on a frontier-adjacent open model is: SFT pass (general instruction following) → SFT pass (domain specialization) → DPO or KTO pass (alignment refinement) → optional RLHF pass for tail behaviors. Each step uses a differently shaped dataset.
SFT data shape — examples and counts
The minimum shape is (prompt, completion). Examples are stored as JSONL with one record per row, or as a Hugging Face Dataset. For chat-template models, prompts include the system message and conversation history; for completion-style models, prompts are the raw input string.
| Field | Type | Example |
|---|---|---|
| messages | list of dicts | [{role: system, content: …}, {role: user, content: …}] |
| completion | string | Assistant’s response text |
| source | string | tulu-3 / magpie / cosmopedia / custom |
| quality_score | float | 0.0–1.0 (optional) |
| domain | string | general / code / math / french-legal |
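A minimal sketch of one such record on disk and how to load it with Hugging Face Datasets; the file name and field values are illustrative, following the table above.

```python
# train.jsonl: one JSON record per line, fields as in the table above, e.g.
# {"messages": [{"role": "system", "content": "You are a helpful assistant."},
#               {"role": "user", "content": "Summarise the clause below. ..."}],
#  "completion": "The clause caps the supplier's liability at ...",
#  "source": "custom", "quality_score": 0.92, "domain": "french-legal"}

from datasets import load_dataset

sft_ds = load_dataset("json", data_files="train.jsonl", split="train")
print(sft_ds[0]["messages"][0]["role"])  # "system"
```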
Counts: general SFT mixes are 100K–1M examples (Tulu-3 is 940K, OpenHermes-2.5 is ~1M). Domain-specific SFT works at 5K–50K examples. Beyond ~100K examples on a single domain, returns diminish quickly.
Instruction tuning data shape — examples and counts
Instruction tuning is SFT with explicit task framing. Each example reads as “do X with input Y, here is the expected output Z.” The classic Alpaca format uses three fields: instruction, input, output. The input field is optional — many tasks have only an instruction (e.g., “write a poem about autumn”).
ALPACA VS CHATML
The Alpaca format is convenient for academic work but production deployments use ChatML or the model’s native chat template (Llama, Mistral, Qwen all differ). Convert Alpaca-style examples to the target chat template at training time, not at dataset creation time. Keep the dataset in the most general format.
The 2024–2025 instruction-tuning corpora — Tulu-3, OpenHermes-2.5, Magpie — all ship in ChatML or close variants. Hugging Face datasets typically include both the raw format and the chat-template-applied format for convenience. Diversity of instruction types matters more than depth — 50K diverse instructions beat 200K instructions that all look like “summarize X.”
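A sketch of the convert-at-training-time advice from the callout above, assuming an Alpaca-style record and a model that ships a chat template (the model name is just an example):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")

row = {
    "instruction": "Translate to French.",
    "input": "The board approved the merger.",
    "output": "Le conseil a approuvé la fusion.",
}

# Keep the stored dataset in the general format; render the model-specific
# chat template only at training time
user_turn = row["instruction"] + ("\n\n" + row["input"] if row["input"] else "")
text = tok.apply_chat_template(
    [{"role": "user", "content": user_turn},
     {"role": "assistant", "content": row["output"]}],
    tokenize=False,
)
```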
RLHF and DPO data shape — preference pairs
Preference data has a different shape: each row is (prompt, chosen_response, rejected_response). The model learns which response a human (or a model acting as judge) preferred. Datasets: Anthropic HH-RLHF, OpenAI WebGPT preferences, UltraFeedback, Allen AI Tulu DPO sets.
| Field | Type | Notes |
|---|---|---|
| prompt | string or chat | Same as SFT |
| chosen | string | The preferred response |
| rejected | string | The dispreferred response |
| chosen_score | float | Optional reward-model score |
| rejected_score | float | Optional reward-model score |
Building preference pairs costs 5–10× what building SFT examples costs, because every prompt needs multiple candidate responses and a judgment. In practice, teams in 2026 build preference data semi-synthetically: generate K=4–8 candidates per prompt with the SFT model, score them with a strong judge model (GPT-4o, Claude, Llama-Nemotron-70B), then keep the top-1 / bottom-1 pair. UltraFeedback and the Magpie-DPO series follow exactly this pattern.
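A sketch of that semi-synthetic loop; `generate_candidates` and `judge_score` are hypothetical stand-ins for your SFT model's sampling call and your judge model's scoring call, not real library functions.

```python
def build_preference_pairs(prompts, generate_candidates, judge_score, k=8):
    """Sample k candidates per prompt, score with a judge, keep top-1 / bottom-1."""
    rows = []
    for prompt in prompts:
        candidates = generate_candidates(prompt, n=k)        # k sampled responses
        ranked = sorted(candidates, key=lambda c: judge_score(prompt, c))
        rows.append({
            "prompt": prompt,
            "chosen": ranked[-1],    # highest judge score
            "rejected": ranked[0],   # lowest judge score
        })
    return rows
```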
If you cannot articulate why one response is better than another in your domain, you are not ready to do RLHF. SFT first; refine when the preferences are clearly statable.
Quality vs quantity — the LIMA finding
LIMA (Less Is More for Alignment, Zhou et al. 2023) showed that a 65B LLaMA model SFT-tuned on just 1,000 carefully curated examples held its own against much larger instruction-tuned baselines, including the same base model tuned on Alpaca's 52K examples. Replications since point the same way: past ~10,000 high-quality examples in a target domain, additional volume produces diminishing returns, or actively harms quality when low-quality examples dilute the mix.
The corollary: invest in curation. For domain-specific SFT, 5,000 examples that domain experts have reviewed beat 50,000 examples a moderately capable judge model accepted. The bottleneck is reviewer time, not generation cost. The reviewers also need to be actual domain experts — generic annotators miss the technical errors that matter.
A French finance + regulatory corpus, primed for SFT
Use our vertical corpus as the foundation for domain SFT, instruction tuning, or preference data construction. Pseudonymized, audit-trailed, ready to drop into Hugging Face Datasets.
Synthetic data — when it helps and when it hurts
Synthetic data — examples generated by a strong model — is now standard for instruction tuning and increasingly used in preference data. The main 2025 entries: Cosmopedia v2 (textbooks generated by Mixtral), Llama-Nemotron post-training data (Llama-3 generated), Magpie (extracted Llama-3 outputs), Tulu-3 synthetic complement, OpenCoder synthetic instructions.
When synthetic helps: instruction following, format compliance, long-tail edge cases where real examples are rare. When synthetic hurts: deep domain expertise (synthetic regulatory analysis tends to hallucinate citations), nuanced preference judgments, and tasks where the synthetic generator does not actually understand the task. The tell-tale: if the generator model itself cannot do the task reliably, its synthetic training data will encode its own errors.
- Mix real and synthetic. 70% synthetic / 30% real is a common production ratio for instruction tuning.
- Cite the generator. Synthetic data inherits the generator's biases and license. Document which model generated which subset.
- Filter by judge agreement. Use a model other than the generator to filter its output; this reduces the bias drift that comes from training on a single model's preferences (see the sketch after this list).
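A sketch of the mixing and filtering bullets above with Hugging Face `datasets`; the file names, the 70/30 ratio, and the precomputed judge-score fields are assumptions, not a standard schema.

```python
from datasets import interleave_datasets, load_dataset

real = load_dataset("json", data_files="real_sft.jsonl", split="train")
synthetic = load_dataset("json", data_files="synthetic_sft.jsonl", split="train")

# Keep a synthetic row only when two different judge models agree it is good
# (assumes each row carries precomputed judge_a_score / judge_b_score fields)
def judges_agree(row):
    return (row["judge_a_score"] > 0.7
            and abs(row["judge_a_score"] - row["judge_b_score"]) < 0.2)

synthetic = synthetic.filter(judges_agree)
synthetic = synthetic.remove_columns(["judge_a_score", "judge_b_score"])

# Sample 30% real / 70% synthetic when building the training stream
mix = interleave_datasets([real, synthetic], probabilities=[0.3, 0.7], seed=42)
```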
Eval shapes the dataset shape
The most common post-training failure: shipping a dataset shape that does not match the eval. If your eval is GSM8K (math reasoning), your dataset should heavily include math problems with step-by-step solutions. If your eval is a customer-call summary task, your dataset should heavily include conversation-to-summary pairs. Generic instruction mixes lift generic benchmarks; they do not lift specific evals.
Vertical evaluation, where the eval is “does the model correctly answer this regulatory question with a valid citation,” needs SFT data shaped exactly like the eval task. A retrieval-augmented eval needs retrieval-augmented SFT. A citation-required eval needs citation-bearing SFT.
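For instance, a citation-required, retrieval-augmented eval calls for SFT records shaped like this; the field names follow the SFT table earlier and the content is a placeholder illustration.

```python
citation_sft_row = {
    "messages": [
        {"role": "system",
         "content": "Answer using only the excerpts provided and cite them as [n]."},
        {"role": "user",
         "content": "Excerpts:\n[1] <retrieved article text>\n[2] <retrieved guidance text>\n\n"
                    "Question: Does this product require prior notification to the regulator?"},
    ],
    "completion": "Yes. [1] requires prior notification for this product category "
                  "(source: [1]).",
}
```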
A practical post-training mix for vertical LLMs
A 2026 reference recipe for a French regulatory assistant built on a 7–14B pretrained open model:
- SFT pass 1 — general instruction following. Tulu-3 + Cosmopedia v2 subset, ~200K examples, 3 epochs.
- SFT pass 2 — domain specialization. 8K–15K vertical examples from your French regulatory corpus, hand-curated, 1–2 epochs. This is the block that decides whether the model can answer regulatory questions usefully.
- DPO pass — alignment refinement. 5K–10K preference pairs, generated semi-synthetically from the SFT model and judged by GPT-4o-class or Llama-Nemotron-70B. Targets faithfulness, citation use, and refusal behavior.
- Optional safety SFT. 2K examples covering known regulator-side expectations (privacy, scope, refusal patterns). Hand-written.
Total: ~30K–230K labeled examples depending on whether you count the generalist first pass. The vertical contribution is 13K–25K examples — small in absolute terms, decisive in domain performance.
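A compressed sketch of passes 2 and 3 from the recipe above using the TRL library; exact argument names differ across TRL versions, and the model and file names are placeholders, so treat it as a shape rather than a drop-in script.

```python
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer, DPOConfig, DPOTrainer

BASE = "your-org/your-7b-after-sft-pass-1"   # placeholder: model after the generalist pass

# SFT pass 2: domain specialization on the hand-curated vertical set
# (depending on your TRL version you may need to pass a tokenizer explicitly)
domain_ds = load_dataset("json", data_files="vertical_sft.jsonl", split="train")
sft = SFTTrainer(
    model=BASE,
    train_dataset=domain_ds,
    args=SFTConfig(output_dir="sft-pass-2", num_train_epochs=2),
)
sft.train()
sft.save_model("sft-pass-2")

# DPO pass: preference pairs built from the SFT model's own generations
pref_ds = load_dataset("json", data_files="vertical_dpo.jsonl", split="train")
dpo = DPOTrainer(
    model="sft-pass-2",
    train_dataset=pref_ds,                   # columns expected: prompt, chosen, rejected
    args=DPOConfig(output_dir="dpo-pass", beta=0.1, num_train_epochs=1),
)
dpo.train()
```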
A French regulatory corpus for domain-specific SFT
Pseudonymized, audit-trailed, AI Act Article 10 documented. Use it as the foundation for your SFT pass 2 — the block that decides whether the model can answer regulatory questions.
Frequently asked questions
Should I do SFT before DPO, or together?
Sequentially. SFT establishes the model’s instruction-following baseline. DPO refines the model’s preferences on top of that baseline. Doing them together — or doing DPO without a prior SFT — gives unstable training dynamics because the preference loss is conditioned on the model’s current distribution.
How do I know if my SFT dataset is good enough?
Three checks. (1) Inter-annotator agreement on a 100-example sample — if two domain experts disagree on what the correct output should be, the dataset is ambiguous. (2) A held-out eval that mirrors your target task — small SFT runs on subsets should show monotonic improvement with more data. (3) Failure-mode analysis after a small training run — the failure cases tell you what additional examples to add.
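For check (1), a quick way to quantify agreement on the sampled examples, assuming each expert assigns a categorical verdict per example; scikit-learn's `cohen_kappa_score` gives the chance-corrected version.

```python
from sklearn.metrics import cohen_kappa_score

# One verdict per sampled example, from each of two domain experts
expert_a = ["correct", "wrong", "correct", "partial", "correct"]
expert_b = ["correct", "wrong", "partial", "partial", "correct"]

raw_agreement = sum(a == b for a, b in zip(expert_a, expert_b)) / len(expert_a)
kappa = cohen_kappa_score(expert_a, expert_b)
print(f"raw agreement {raw_agreement:.2f}, Cohen's kappa {kappa:.2f}")
```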
Can I use the same data for both SFT and DPO?
Not directly. SFT data is (prompt, response) pairs; DPO data is (prompt, chosen, rejected) triples. You can derive a DPO set from an SFT set by generating multiple candidates per prompt and scoring them, but the SFT and DPO sets serve different purposes.
Is RLHF still relevant in 2026 or has DPO won?
DPO is the production default for most alignment work in 2026 — simpler, no reward model to train and maintain. RLHF retains an edge on hard alignment problems where the reward function is non-trivial (safety, multi-turn coherence, long-form faithfulness). Anthropic and DeepMind still publish active RLHF research. For most teams shipping a fine-tuned model in 2026, DPO is the right starting point.
What about KTO and IPO?
Variants of DPO with different loss formulations. KTO (Kahneman-Tversky Optimization) needs only single-response labels, not pairs, which makes data collection cheaper. IPO (Identity Preference Optimization) addresses a bias in the original DPO loss. Both are worth experimenting with after you have a working DPO pipeline. IPO keeps the DPO pair shape; KTO relaxes it to single labeled responses.
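For reference, a KTO-style record drops the pair and keeps a single labeled response; the column names here follow TRL's unpaired preference format, and the content is illustrative.

```python
# One KTO record: a single response plus a binary desirability label,
# instead of a (chosen, rejected) pair
kto_row = {
    "prompt": "Summarize the obligations introduced by this clause.",
    "completion": "The clause requires quarterly reporting to the regulator ...",
    "label": True,   # True = desirable response, False = undesirable
}
```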
Keep reading
- SFT datasets — format, structure, instruction-tuning best practices. Field-by-field deep dive into the SFT data format, with the common pitfalls in chat template handling and turn-level masking.
- How to train an LLM on your own data — a practical 2026 guide. End-to-end fine-tuning playbook — hardware, dataset prep, framework choice, evaluation.
- Best public datasets for training generative AI models in 2026. The pretraining and post-training datasets you can actually train on without legal review — license, language, size, posture.