Instruction tuning, supervised fine-tuning, and RLHF/DPO all live under the same vague label of “post-training” in 2026. They are not interchangeable. Each requires a differently shaped dataset, captures different signal, and answers different deployment questions.
Key takeaways
- SFT is the umbrella. Instruction tuning is a flavor of SFT with explicit task instructions. RLHF and DPO are different — they require preference pairs, not just demonstrations.
- Instruction tuning datasets are (instruction, optional context, response) triples. Quality beats quantity past ~10K examples — LIMA showed strong results at 1K well-curated examples.
- RLHF and DPO need (prompt, preferred response, rejected response) triples. Building these is 5–10× the per-example cost of SFT examples, but the alignment signal is sharper.
- Domain-specific SFT on a permissive corpus is the highest-ROI post-training step for vertical applications. 5,000–30,000 well-curated examples typically outperform a 1M-example generalist SFT mix on domain tasks.
- Eval frames the dataset shape. If your eval is “does the model follow this instruction,” build instruction data. If it is “does the model prefer the safer answer,” build preference data. Mismatched eval-vs-data is the most common failure mode.
In this article
- Terminology — what each method actually is
- SFT data shape — examples and counts
- Instruction tuning data shape — examples and counts
- RLHF and DPO data shape — preference pairs
- Quality vs quantity — the LIMA finding
- Synthetic data — when it helps and when it hurts
- Eval shapes the dataset shape
- A practical post-training mix for vertical LLMs
- Frequently asked questions
Terminology — what each method actually is
Supervised fine-tuning (SFT) is the umbrella term: you continue training a pretrained model on labeled examples of (input, desired output). Instruction tuning is a flavor of SFT where the inputs are explicit task instructions. RLHF (reinforcement learning from human feedback) trains the model to prefer responses humans prefer, using a reward model. DPO (direct preference optimization) achieves a similar effect without a separate reward model, by directly optimizing the preference loss.
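To make the DPO point concrete, here is a minimal sketch of the published DPO objective in PyTorch (an illustration, not any specific library's implementation). The inputs are assumed to be per-example log-probabilities of each full response, summed over tokens, under the policy and a frozen reference copy of it.

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Implicit "rewards": how much more likely the policy makes each response
    # than the frozen reference model does
    chosen_reward = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_reward = beta * (policy_rejected_logps - ref_rejected_logps)
    # Push the policy to rank the chosen response above the rejected one
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()
```

No reward model appears anywhere: the preference pairs and the reference model carry the whole signal, which is why the dataset shape matters so much.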
In 2026, the typical production post-training stack on a frontier-adjacent open model is: SFT pass (general instruction following) → SFT pass (domain specialization) → DPO or KTO pass (alignment refinement) → optional RLHF pass for tail behaviors. Each step uses a differently shaped dataset.
SFT data shape — examples and counts
The minimum shape is (prompt, completion). Examples are stored as JSONL with one record per row, or as a Hugging Face Dataset. For chat-template models, prompts include the system message and conversation history; for completion-style models, prompts are the raw input string.
| Field | Type | Example |
|---|---|---|
| messages | list of dicts | [{role: system, content: …}, {role: user, content: …}] |
| completion | string | Assistant’s response text |
| source | string | tulu-3 / magpie / cosmopedia / custom |
| quality_score | float | 0.0–1.0 (optional) |
| domain | string | general / code / math / french-legal |
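A minimal sketch of one such record on disk and how to load it with Hugging Face Datasets; the file name and field values are illustrative, following the table above.

```python
# train.jsonl: one JSON record per line, fields as in the table above, e.g.
# {"messages": [{"role": "system", "content": "You are a helpful assistant."},
#               {"role": "user", "content": "Summarise the clause below. ..."}],
#  "completion": "The clause caps the supplier's liability at ...",
#  "source": "custom", "quality_score": 0.92, "domain": "french-legal"}

from datasets import load_dataset

sft_ds = load_dataset("json", data_files="train.jsonl", split="train")
print(sft_ds[0]["messages"][0]["role"])  # "system"
```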
Counts: general SFT mixes are 100K–1M examples (Tulu-3 is 940K, OpenHermes-2.5 is ~1M). Domain-specific SFT works at 5K–50K examples. Beyond ~100K examples on a single domain, returns diminish quickly.
Instruction tuning data shape — examples and counts
Instruction tuning is SFT with explicit task framing. Each example reads as “do X with input Y, here is the expected output Z.” The classic Alpaca format uses three fields: instruction, input, output. The input field is optional — many tasks have only an instruction (e.g., “write a poem about autumn”).
ALPACA VS CHATML
The Alpaca format is convenient for academic work but production deployments use ChatML or the model’s native chat template (Llama, Mistral, Qwen all differ). Convert Alpaca-style examples to the target chat template at training time, not at dataset creation time. Keep the dataset in the most general format.
The 2024–2025 instruction-tuning corpora — Tulu-3, OpenHermes-2.5, Magpie — all ship in ChatML or close variants. Hugging Face datasets typically include both the raw format and the chat-template-applied format for convenience. Diversity of instruction types matters more than depth — 50K diverse instructions beat 200K instructions that all look like “summarize X.”
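A sketch of the convert-at-training-time advice from the callout above, assuming an Alpaca-style record and a model that ships a chat template (the model name is just an example):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")

row = {
    "instruction": "Translate to French.",
    "input": "The board approved the merger.",
    "output": "Le conseil a approuvé la fusion.",
}

# Keep the stored dataset in the general format; render the model-specific
# chat template only at training time
user_turn = row["instruction"] + ("\n\n" + row["input"] if row["input"] else "")
text = tok.apply_chat_template(
    [{"role": "user", "content": user_turn},
     {"role": "assistant", "content": row["output"]}],
    tokenize=False,
)
```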
RLHF and DPO data shape — preference pairs
Preference data has a different shape: each row is (prompt, chosen_response, rejected_response). The model learns which response a human (or a model acting as judge) preferred. Datasets: Anthropic HH-RLHF, OpenAI WebGPT preferences, UltraFeedback, Allen AI Tulu DPO sets.
| Field | Type | Notes |
|---|---|---|
| prompt | string or chat | Same as SFT |
| chosen | string | The preferred response |
| rejected | string | The dispreferred response |
| chosen_score | float | Optional reward-model score |
| rejected_score | float | Optional reward-model score |
Building preference pairs costs 5–10× what building SFT examples costs, because every prompt needs multiple candidate responses and a judgment. In practice, teams in 2026 build preference data semi-synthetically: generate K=4–8 candidates per prompt with the SFT model, score them with a strong judge model (GPT-4o, Claude, Llama-Nemotron-70B), then keep the top-1 / bottom-1 pair. UltraFeedback and the Magpie-DPO series follow exactly this pattern.
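A sketch of that semi-synthetic loop; `generate_candidates` and `judge_score` are hypothetical stand-ins for your SFT model's sampling call and your judge model's scoring call, not real library functions.

```python
def build_preference_pairs(prompts, generate_candidates, judge_score, k=8):
    """Sample k candidates per prompt, score with a judge, keep top-1 / bottom-1."""
    rows = []
    for prompt in prompts:
        candidates = generate_candidates(prompt, n=k)        # k sampled responses
        ranked = sorted(candidates, key=lambda c: judge_score(prompt, c))
        rows.append({
            "prompt": prompt,
            "chosen": ranked[-1],    # highest judge score
            "rejected": ranked[0],   # lowest judge score
        })
    return rows
```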
If you cannot articulate why one response is better than another in your domain, you are not ready to do RLHF. SFT first; refine when the preferences are clearly statable.
Quality vs quantity — the LIMA finding
LIMA (Less Is More for Alignment, Zhou et al. 2023) showed that a 65B LLaMA model SFT-tuned on just 1,000 carefully curated examples held its own against much larger instruction-tuned baselines, including the same base model tuned on Alpaca's 52K examples. Replications since point the same way: past ~10,000 high-quality examples in a target domain, additional volume produces diminishing returns, or actively harms quality when low-quality examples dilute the mix.
The corollary: invest in curation. For domain-specific SFT, 5,000 examples that domain experts have reviewed beat 50,000 examples a moderately capable judge model accepted. The bottleneck is reviewer time, not generation cost. The reviewers also need to be actual domain experts — generic annotators miss the technical errors that matter.
A French finance + regulatory corpus, primed for SFT
Use our vertical corpus as the foundation for domain SFT, instruction tuning, or preference data construction. Pseudonymized, audit-trailed, ready to drop into Hugging Face Datasets.
Synthetic data — when it helps and when it hurts
Synthetic data — examples generated by a strong model — is now standard for instruction tuning and increasingly used in preference data. The main 2025 entries: Cosmopedia v2 (textbooks generated by Mixtral), Llama-Nemotron post-training data (Llama-3 generated), Magpie (extracted Llama-3 outputs), Tulu-3 synthetic complement, OpenCoder synthetic instructions.
When synthetic helps: instruction following, format compliance, long-tail edge cases where real examples are rare. When synthetic hurts: deep domain expertise (synthetic regulatory analysis tends to hallucinate citations), nuanced preference judgments, and tasks where the synthetic generator does not actually understand the task. The tell-tale: if the generator model itself cannot do the task reliably, its synthetic training data will encode its own errors.
- Mix real and synthetic. 70% synthetic / 30% real is a common production ratio for instruction tuning.
- Cite the generator. Synthetic data inherits the generator's biases and license. Document which model generated which subset.
- Filter by judge agreement. Use a model other than the generator to filter its output; this reduces the bias drift that comes from training on a single model's preferences (see the sketch after this list).
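A sketch of the mixing and filtering bullets above with Hugging Face `datasets`; the file names, the 70/30 ratio, and the precomputed judge-score fields are assumptions, not a standard schema.

```python
from datasets import interleave_datasets, load_dataset

real = load_dataset("json", data_files="real_sft.jsonl", split="train")
synthetic = load_dataset("json", data_files="synthetic_sft.jsonl", split="train")

# Keep a synthetic row only when two different judge models agree it is good
# (assumes each row carries precomputed judge_a_score / judge_b_score fields)
def judges_agree(row):
    return (row["judge_a_score"] > 0.7
            and abs(row["judge_a_score"] - row["judge_b_score"]) < 0.2)

synthetic = synthetic.filter(judges_agree)
synthetic = synthetic.remove_columns(["judge_a_score", "judge_b_score"])

# Sample 30% real / 70% synthetic when building the training stream
mix = interleave_datasets([real, synthetic], probabilities=[0.3, 0.7], seed=42)
```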
Eval shapes the dataset shape
The most common post-training failure: shipping a dataset shape that does not match the eval. If your eval is GSM8K (math reasoning), your dataset should heavily include math problems with step-by-step solutions. If your eval is a customer-call summary task, your dataset should heavily include conversation-to-summary pairs. Generic instruction mixes lift generic benchmarks; they do not lift specific evals.
Vertical evaluation, where the eval is “does the model correctly answer this regulatory question with a valid citation,” needs SFT data shaped exactly like the eval task. A retrieval-augmented eval needs retrieval-augmented SFT. A citation-required eval needs citation-bearing SFT.
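For instance, a citation-required, retrieval-augmented eval calls for SFT records shaped like this; the field names follow the SFT table earlier and the content is a placeholder illustration.

```python
citation_sft_row = {
    "messages": [
        {"role": "system",
         "content": "Answer using only the excerpts provided and cite them as [n]."},
        {"role": "user",
         "content": "Excerpts:\n[1] <retrieved article text>\n[2] <retrieved guidance text>\n\n"
                    "Question: Does this product require prior notification to the regulator?"},
    ],
    "completion": "Yes. [1] requires prior notification for this product category "
                  "(source: [1]).",
}
```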
A practical post-training mix for vertical LLMs
A 2026 reference recipe for a French regulatory assistant built on a 7–14B pretrained open model:
- SFT pass 1 — general instruction following. Tulu-3 + Cosmopedia v2 subset, ~200K examples, 3 epochs.
- SFT pass 2 — domain specialization. 8K–15K vertical examples from your French regulatory corpus, hand-curated, 1–2 epochs. This is the block that decides whether the model can answer regulatory questions usefully.
- DPO pass — alignment refinement. 5K–10K preference pairs, generated semi-synthetically from the SFT model and judged by GPT-4o-class or Llama-Nemotron-70B. Targets faithfulness, citation use, and refusal behavior.
- Optional safety SFT. 2K examples covering known regulator-side expectations (privacy, scope, refusal patterns). Hand-written.
Total: ~30K–230K labeled examples depending on whether you count the generalist first pass. The vertical contribution is 13K–25K examples — small in absolute terms, decisive in domain performance.
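A compressed sketch of passes 2 and 3 from the recipe above using the TRL library; exact argument names differ across TRL versions, and the model and file names are placeholders, so treat it as a shape rather than a drop-in script.

```python
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer, DPOConfig, DPOTrainer

BASE = "your-org/your-7b-after-sft-pass-1"   # placeholder: model after the generalist pass

# SFT pass 2: domain specialization on the hand-curated vertical set
# (depending on your TRL version you may need to pass a tokenizer explicitly)
domain_ds = load_dataset("json", data_files="vertical_sft.jsonl", split="train")
sft = SFTTrainer(
    model=BASE,
    train_dataset=domain_ds,
    args=SFTConfig(output_dir="sft-pass-2", num_train_epochs=2),
)
sft.train()
sft.save_model("sft-pass-2")

# DPO pass: preference pairs built from the SFT model's own generations
pref_ds = load_dataset("json", data_files="vertical_dpo.jsonl", split="train")
dpo = DPOTrainer(
    model="sft-pass-2",
    train_dataset=pref_ds,                   # columns expected: prompt, chosen, rejected
    args=DPOConfig(output_dir="dpo-pass", beta=0.1, num_train_epochs=1),
)
dpo.train()
```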
A French regulatory corpus for domain-specific SFT
Pseudonymized, audit-trailed, AI Act Article 10 documented. Use it as the foundation for your SFT pass 2 — the block that decides whether the model can answer regulatory questions.
Frequently asked questions
Should I do SFT before DPO, or together?
Sequentially. SFT establishes the model’s instruction-following baseline. DPO refines the model’s preferences on top of that baseline. Doing them together — or doing DPO without a prior SFT — gives unstable training dynamics because the preference loss is conditioned on the model’s current distribution.
How do I know if my SFT dataset is good enough?
Three checks. (1) Inter-annotator agreement on a 100-example sample — if two domain experts disagree on what the correct output should be, the dataset is ambiguous. (2) A held-out eval that mirrors your target task — small SFT runs on subsets should show monotonic improvement with more data. (3) Failure-mode analysis after a small training run — the failure cases tell you what additional examples to add.
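For check (1), a quick way to quantify agreement on the sampled examples, assuming each expert assigns a categorical verdict per example; scikit-learn's `cohen_kappa_score` gives the chance-corrected version.

```python
from sklearn.metrics import cohen_kappa_score

# One verdict per sampled example, from each of two domain experts
expert_a = ["correct", "wrong", "correct", "partial", "correct"]
expert_b = ["correct", "wrong", "partial", "partial", "correct"]

raw_agreement = sum(a == b for a, b in zip(expert_a, expert_b)) / len(expert_a)
kappa = cohen_kappa_score(expert_a, expert_b)
print(f"raw agreement {raw_agreement:.2f}, Cohen's kappa {kappa:.2f}")
```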
Can I use the same data for both SFT and DPO?
Not directly. SFT data is (prompt, response) pairs; DPO data is (prompt, chosen, rejected) triples. You can derive a DPO set from an SFT set by generating multiple candidates per prompt and scoring them, but the SFT and DPO sets serve different purposes.
Is RLHF still relevant in 2026 or has DPO won?
DPO is the production default for most alignment work in 2026 — simpler, no reward model to train and maintain. RLHF retains an edge on hard alignment problems where the reward function is non-trivial (safety, multi-turn coherence, long-form faithfulness). Anthropic and DeepMind still publish active RLHF research. For most teams shipping a fine-tuned model in 2026, DPO is the right starting point.
What about KTO and IPO?
Variants of DPO with different loss formulations. KTO (Kahneman-Tversky Optimization) needs only single-response labels, not pairs, which makes data collection cheaper. IPO (Identity Preference Optimization) addresses a bias in the original DPO loss. Both are worth experimenting with after you have a working DPO pipeline. IPO keeps the DPO pair shape; KTO relaxes it to single labeled responses.
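For reference, a KTO-style record drops the pair and keeps a single labeled response; the column names here follow TRL's unpaired preference format, and the content is illustrative.

```python
# One KTO record: a single response plus a binary desirability label,
# instead of a (chosen, rejected) pair
kto_row = {
    "prompt": "Summarize the obligations introduced by this clause.",
    "completion": "The clause requires quarterly reporting to the regulator ...",
    "label": True,   # True = desirable response, False = undesirable
}
```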
Keep reading
- SFT datasets — format, structure, instruction-tuning best practices. Field-by-field deep dive into the SFT data format, with the common pitfalls in chat template handling and turn-level masking.
- How to train an LLM on your own data — a practical 2026 guide. End-to-end fine-tuning playbook — hardware, dataset prep, framework choice, evaluation.
- Best public datasets for training generative AI models in 2026. The pretraining and post-training datasets you can actually train on without legal review — license, language, size, posture.