"Train an LLM on your own data" can mean six different things, and choosing the wrong one is the most expensive mistake teams make before writing a single line of code. This guide walks through the decision tree, data prep, the 2026 stack, and the governance layer that determines whether your fine-tune ships in a regulated industry.
Key takeaways
- Start with RAG; move to SFT only when RAG plateaus on task accuracy.
- QLoRA on a 4-bit base model is the 2026 default — 70B fits on a single H100.
- Data curation (dedup, quality filter, topical filter) matters more than hyperparameters.
- 1 000 high-quality SFT examples beat 100 000 noisy ones — LIMA replicated three times over.
- Ship with a provenance manifest — required for AI Act Article 10 compliance in the EU.
In this article
- Step 1 — Decide what “training on your own data” actually means
- Step 2 — Curate the data before you format it
- Step 3 — Format the data for your training method
- Step 4 — Choose the parameter-efficient method that fits your budget
- Step 5 — The 2026 fine-tuning stack
- Step 6 — Training run essentials
- Step 7 — Evaluate beyond loss
- Step 8 — Deploy with the governance layer in place
- When fine-tuning is not the answer
- Bottom line
- Frequently asked questions
Step 1 — Decide what “training on your own data” actually means
Start with RAG. Move to SFT when RAG plateaus. Consider continued pre-training only when measurable domain-specific gaps remain.
Four approaches sit under that phrase, with order-of-magnitude differences in cost, control, and lock-in:
- Retrieval-Augmented Generation (RAG). The model is unchanged; your data is indexed in a vector store and injected into the prompt at query time. Cheapest, fastest to ship, easy to update. Best fit when answers must reflect documents that change weekly and when traceability to the source document is mandatory.
- Supervised Fine-Tuning (SFT) with LoRA/QLoRA. A small set of adapter weights is trained on instruction–response pairs. Affordable (single GPU, hours rather than weeks), preserves the base model, and gives meaningful gains on domain-specific tasks. The default 2026 choice for most projects.
- Continued pre-training. Resume the base model’s pre-training on a large in-domain corpus (10–100 B tokens). Useful when the base model has weak vocabulary or weak fluency in your domain. Expensive (multi-node, days to weeks of GPU time), and rarely the right first step.
- From-scratch pre-training. Building a foundation model from the ground up. Justified only when no open-source base meets your latency, licensing, or sovereignty constraints. Budget for it in tens of thousands of GPU-hours and in legal review hours for the training corpus.
Decision rule: start with RAG. Move to SFT when RAG plateaus on task accuracy. Consider continued pre-training only when you have measurable, domain-specific tokenization or fluency gaps that SFT cannot fix.
Step 2 — Curate the data before you format it
Quality > quantity
For SFT specifically, 1 000 high-quality instruction–response pairs beat 100 000 low-quality ones. The 2023 LIMA paper plus three years of replication studies have turned this from opinion into consensus.
The most underrated step. Most “fine-tuned LLM” projects fail not at the training run, but at data quality. Three filters to apply before any formatting:
- Deduplication. Exact and near-duplicate examples inflate training time and bias the model toward overrepresented patterns. Use MinHash LSH or SimHash to detect near-duplicates at the document and paragraph level; expect a 10 % to 40 % reduction in raw corpora. A minimal sketch follows this list.
- Quality scoring. Length filters, language-ID filters, perplexity filters from a small reference model, and an LLM-as-judge pass on a representative sample. The aim is to remove machine-generated boilerplate, broken extractions, and out-of-scope content.
- Topical filtering. If you are building a vertical model, keep only documents that match the target domain. A 100 K-document corpus where 80 % are on-topic outperforms a 500 K-document corpus where 30 % are on-topic — every time.
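To make the deduplication filter concrete, here is a minimal near-duplicate pass, assuming the datasketch library and a corpus held as (doc_id, text) pairs. The character 5-gram shingling and the 0.8 Jaccard threshold are illustrative starting points, not a fixed recipe; tune the threshold on a labelled sample of your own corpus.

```python
# Minimal near-duplicate filter with MinHash LSH (datasketch library).
from datasketch import MinHash, MinHashLSH

def minhash_of(text: str, num_perm: int = 128) -> MinHash:
    m = MinHash(num_perm=num_perm)
    # Character 5-gram shingles; an assumption, word shingles also work.
    for shingle in {text[i:i + 5] for i in range(max(len(text) - 4, 1))}:
        m.update(shingle.encode("utf-8"))
    return m

# Hypothetical two-document corpus; replace with your own (doc_id, text) stream.
corpus = [
    ("doc-1", "The supervisor may impose prudential requirements on the institution."),
    ("doc-2", "The supervisor may impose prudential requirements on the institution!"),
]

lsh = MinHashLSH(threshold=0.8, num_perm=128)  # Jaccard >= 0.8 counts as near-duplicate
kept = []
for doc_id, text in corpus:
    m = minhash_of(text)
    if lsh.query(m):       # a similar document was already kept; drop this one
        continue
    lsh.insert(doc_id, m)
    kept.append(doc_id)

print(kept)  # expect only "doc-1" to survive
```

Running the same pass at paragraph level catches boilerplate repeated across otherwise distinct documents.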
Step 3 — Format the data for your training method
The format that wins depends on the training stage:
- For RAG: chunked text with embeddings. Chunk size 200–800 tokens, with 10–20 % overlap. Store source URL, chunk index, and an immutable document hash next to each chunk; the hash becomes your provenance anchor. A chunking sketch closes this step.
- For SFT: JSONL with one example per line. The typical schema is {"messages": [{"role": "system", ...}, {"role": "user", ...}, {"role": "assistant", ...}]}. Apply the base model’s chat template (Llama 3, Qwen 2.5, Mistral all have published templates) to convert messages into tokenized strings; an example follows this list.
- For continued pre-training: Parquet shards of raw text with a stable schema. Each row carries the text, the source identifier, the license, and a hash. Tokenize on the fly during training rather than pre-tokenizing to disk: tokenizer changes invalidate the cache.
If you are unsure which file format to use for storage and downstream pipelines, the trade-offs between Parquet, JSONL, and Arrow are covered in our companion article on dataset formats.
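And for the RAG bullet above, a minimal sketch of token-based chunking with overlap and a per-document provenance hash. The tokenizer choice and the 512/64 size–overlap pair are assumptions within the ranges given earlier.

```python
# Chunk a document into overlapping token windows, carrying provenance metadata.
import hashlib
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")

def chunk(text: str, source_url: str, size: int = 512, overlap: int = 64):
    ids = tokenizer.encode(text, add_special_tokens=False)
    doc_hash = hashlib.sha256(text.encode("utf-8")).hexdigest()  # immutable provenance anchor
    for i, start in enumerate(range(0, max(len(ids), 1), size - overlap)):
        yield {
            "text": tokenizer.decode(ids[start:start + size]),
            "source_url": source_url,
            "chunk_index": i,
            "doc_sha256": doc_hash,
        }

chunks = list(chunk("Your document text here.", "https://example.com/doc"))
```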
Step 4 — Choose the parameter-efficient method that fits your budget
Full fine-tuning of a 7B model requires roughly 70–110 GB of GPU memory. Most teams cannot or should not provision that. Three parameter-efficient alternatives dominate in 2026:
- LoRA — Low-Rank Adaptation. Trains rank-8 to rank-64 update matrices on selected layers. Memory footprint drops by 3–5×; quality is within 1–2 % of full fine-tuning on most benchmarks. The default.
- QLoRA — LoRA on a 4-bit quantized base model. Memory drops by 10–15× compared to full fine-tuning. A 7B model fine-tunes on a single 24 GB consumer GPU; a 13B fits on 32 GB; a 70B fits on a single H100. Slightly slower per step than LoRA, but most teams accept the trade-off.
- DoRA, NEFTune, ReFT — newer variants that add small accuracy gains in specific configurations. Worth benchmarking once your baseline LoRA pipeline works, not before.
Rule of thumb: start with QLoRA, rank 16, on the attention projection layers. Tune from there only if your eval set shows specific weaknesses.
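Translated into code, that rule of thumb looks roughly like this with transformers, bitsandbytes, and peft. The base model name and the Llama-style projection-module names are assumptions; adapt them to your architecture.

```python
# QLoRA baseline: 4-bit NF4 quantized base + rank-16 LoRA on attention projections.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B", quantization_config=bnb, device_map="auto"
)
lora = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # typically well under 1 % of total parameters
```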
Step 5 — The 2026 fine-tuning stack
Building a fine-tune for finance, regulation, or compliance?
We publish French regulatory and financial training corpora — pre-filtered, deduplicated, AI Act art. 10 ready, with per-document provenance.
The toolchain has consolidated significantly. A minimal, production-grade SFT pipeline in May 2026 uses:
- Base layer: Python 3.11+, PyTorch 2.5+, CUDA 12.x.
- Hugging Face ecosystem: transformers for model loading, datasets for streaming, peft for LoRA adapters, trl for SFT trainers and DPO, accelerate for multi-GPU.
- Throughput optimization: Unsloth provides patched kernels that reduce memory by 30–50 % and accelerate training by 2–3× on supported architectures. Axolotl offers a higher-level YAML config wrapper if you prefer config over code.
- Quantization: bitsandbytes for 4-bit and 8-bit, plus the native bf16 and fp16 paths.
- Inference and packaging: vLLM for serving, llama.cpp or MLX for on-device deployment, GGUF or ONNX for portable model artefacts.
If you are starting from zero, Unsloth’s documented notebooks are the fastest path to a working baseline. If you need fine-grained control or are running on non-NVIDIA hardware (AMD, Intel Gaudi, Apple Silicon), drop down to transformers + peft + trl directly.
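On the Unsloth path, the baseline collapses to a few lines. A sketch assuming a recent Unsloth release and one of its published pre-quantized checkpoints:

```python
# Unsloth fast path: pre-quantized 4-bit base plus patched LoRA kernels.
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/llama-3-8b-bnb-4bit",  # pre-quantized checkpoint from Unsloth
    max_seq_length=2048,
    load_in_4bit=True,
)
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
```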
Step 6 — Training run essentials
Six hyperparameters and decisions that disproportionately affect outcomes (a trainer-config sketch follows the list):
- Learning rate. For LoRA: 1e-4 to 3e-4. For full fine-tuning: 1e-5 to 5e-5. Use a linear warm-up over 3–10 % of steps, then cosine decay.
- Batch size. Effective batch size (per-device × gradient accumulation × num devices) of 32–128 is a workable starting range for SFT. Smaller batches overfit on small datasets; larger batches lose signal.
- Epochs. 1 to 3 epochs for SFT on a curated dataset of a few thousand examples. More epochs typically degrade base model behaviour through catastrophic forgetting.
- Sequence length. Set to the longest reasonable example, not the longest possible one. Padding inflates memory and slows training. Sort examples by length and use packing (multiple short examples per training sequence) when supported.
- Validation set. Hold out 5–10 % of examples, stratified by source and topic. Compute eval loss every 50–200 steps. Stop training when eval loss plateaus or rises.
- Checkpointing. Save every N steps and at the end of each epoch. Keep the last three checkpoints. The cheapest insurance against an aborted run.
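As referenced above, the six decisions map onto trl's SFTConfig roughly as follows. Field names track recent transformers/trl releases and may differ slightly in older ones; model is the adapter-wrapped model from Step 4, and train_ds / val_ds stand in for datasets in the Step 3 messages format.

```python
# One SFT run, encoding the six essentials as trl SFTConfig fields.
from trl import SFTTrainer, SFTConfig

config = SFTConfig(
    output_dir="out/qlora-run",
    learning_rate=2e-4,                  # LoRA range: 1e-4 to 3e-4
    warmup_ratio=0.05,                   # linear warm-up over ~5 % of steps
    lr_scheduler_type="cosine",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,       # effective batch size 32 on one GPU
    num_train_epochs=2,                  # 1-3 epochs for curated SFT data
    max_seq_length=2048,                 # longest reasonable example, not longest possible
    packing=True,                        # pack short examples into full sequences
    eval_strategy="steps",
    eval_steps=100,                      # eval loss every 100 steps
    save_steps=200,
    save_total_limit=3,                  # keep the last three checkpoints
)
trainer = SFTTrainer(model=model, args=config, train_dataset=train_ds, eval_dataset=val_ds)
trainer.train()
```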
Step 7 — Evaluate beyond loss
Training loss tells you the model learned something. It does not tell you whether the model is better than the base model on tasks you care about. Three evaluation layers in increasing order of cost:
- Automatic benchmarks. A small task-specific test set with exact-match or F1 scoring, plus a general benchmark (MMLU-style, language-specific) to detect regression on out-of-domain capabilities. Cheap, fast, but blind to fluency and style.
- LLM-as-judge. A larger model (or the same model with a structured rubric) scores outputs on a held-out test set across 3–5 axes (accuracy, helpfulness, faithfulness, format, safety). Useful but not reliable enough to ship without human spot-check.
- Human evaluation. 100–300 examples reviewed by domain experts, comparing base vs. fine-tuned outputs side-by-side. Expensive, slow, and the only signal that genuinely correlates with downstream user satisfaction. Reserve for go/no-go decisions.
Track all three over time. A fine-tune that gains five points on the task benchmark but loses ten on the general benchmark is usually a net regression — your fine-tune has acquired domain skill at the expense of generality.
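For the first layer, a minimal scoring helper, assuming predictions have already been generated by your own harness as (prediction, reference) string pairs:

```python
# Exact match plus token-level F1 over a held-out test set.
def token_f1(pred: str, ref: str) -> float:
    p, r = pred.lower().split(), ref.lower().split()
    common = sum(min(p.count(t), r.count(t)) for t in set(p) & set(r))
    if common == 0:
        return 0.0
    precision, recall = common / len(p), common / len(r)
    return 2 * precision * recall / (precision + recall)

def score(pairs):
    em = sum(p.strip() == r.strip() for p, r in pairs) / len(pairs)
    f1 = sum(token_f1(p, r) for p, r in pairs) / len(pairs)
    return {"exact_match": em, "f1": f1}

print(score([("Article 10", "Article 10"), ("data governance rules", "data governance")]))
```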
Step 8 — Deploy with the governance layer in place
Fine-tuned models deployed in the European Union in 2026 for high-risk use cases (finance, healthcare, employment, law enforcement, critical infrastructure) fall under the EU AI Act’s Article 10 obligations, which extend to the training data that produced them. Concretely, ship with:
- Training data provenance manifest. A per-document record of source URL, capture date, license, processing chain, and content hash. Stored next to the model artefact, not in a separate spreadsheet; a minimal writer sketch follows this list.
- Versioning. Each training run produces a model card with the dataset hash, base model identifier, hyperparameters, and evaluation scores. Re-training without re-versioning is a compliance defect.
- Retraction procedure. A documented path for removing a specific document (or a class of documents — e.g. a data subject’s personal data) and either re-training or accepting a measured drift. GDPR Article 17 makes this concrete; AI Act Article 10 makes it auditable.
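A minimal writer for the manifest bullet above. The field set follows the bullet; the one-element documents list and the output path are placeholders for your curated corpus and model directory.

```python
# Write a per-document provenance manifest (JSONL) next to the model artefact.
import hashlib
import json

documents = [{
    "source_url": "https://example.com/regulation.pdf",
    "capture_date": "2026-05-01",                        # ISO 8601 capture date
    "license": "proprietary-licensed",
    "processing_chain": ["extract", "dedup", "quality_filter"],
    "text": "Full extracted document text...",
}]

with open("provenance_manifest.jsonl", "w", encoding="utf-8") as f:
    for doc in documents:
        row = {
            "source_url": doc["source_url"],
            "capture_date": doc["capture_date"],
            "license": doc["license"],
            "processing_chain": doc["processing_chain"],
            "content_sha256": hashlib.sha256(doc["text"].encode("utf-8")).hexdigest(),
        }
        f.write(json.dumps(row, ensure_ascii=False) + "\n")
```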
Teams that bolt this layer on after deployment spend two to four months retrofitting. Teams that add it during data prep spend two to four days. The pattern transfers directly across modalities, and it is the one we apply to the regulated-text corpora we build for our own customers.
When fine-tuning is not the answer
Three diagnostic questions before you start a training run:
- Has a clean RAG baseline been measured on the same evaluation set? If not, build it first. Fine-tuning to fix a problem RAG would have solved is the most common waste.
- Does the base model already know your domain vocabulary? Run a short qualitative probe — ask it 20 domain questions. If responses are coherent but bland, SFT will help. If responses are hallucinatory or wrong on basics, you may need continued pre-training or a different base model.
- Is the data you would train on the right shape? For SFT you need instructions and responses, not documents. Converting unstructured documents into instruction–response pairs is a sub-project in itself.
Bottom line
The 2026 path to a useful domain-specific LLM is rarely from-scratch training. It is a curated dataset, a QLoRA fine-tune on a sensible open base model, an evaluation harness that catches regressions early, and a governance layer that lets the result be deployed in places that matter. The bottleneck is almost always the data, not the model.
If the data you need is French-language financial, regulatory, or economic text — codes, doctrine, EU regulations, prudential positions — we build, version, and license that corpus with the governance layer described above already in place. The methodology we use is the same methodology described in our writing on training datasets and AI Act compliance.
See also: the best public LLM datasets in 2026, what makes a corpus retrieval-friendly, and SFT vs DPO vs RLHF dataset shapes.
Frequently asked questions
Should I fine-tune or use RAG?
Use RAG when answers must reflect documents that change frequently and when traceability to source is mandatory. Use SFT when you need to teach a specific output format, persona, or task behaviour. The best production systems usually combine both.
How much data do I need to fine-tune?
For SFT, 1 000–10 000 carefully curated instruction-response pairs is the sweet spot. For continued pre-training, 1–10 B in-domain tokens. See our training data size article for stage-by-stage volumes.
Can I fine-tune on a single GPU?
Yes, with QLoRA. A 7B fits on 24 GB; 13B on 32 GB; 70B on a single H100. Throughput is lower than multi-GPU but the budget gap is two orders of magnitude.
Is my fine-tuned model subject to the EU AI Act?
If the model is deployed in the EU, or its output is used in the EU, for any Annex III high-risk use case (finance, healthcare, employment, biometrics, law enforcement, justice, critical infrastructure), then yes. Article 10 obligations apply from 2 August 2026.
What is the cheapest evaluation strategy?
Hold out 200–500 examples scored on a task-specific F1 metric, plus a small general benchmark to detect regression. Run after every training change. Reserve human eval for go/no-go decisions.
Need a vertical training corpus for finance or regulation?
We build, version, and license French regulatory, financial, and economic text corpora — AI Act art. 10 ready, per-document provenance, tiered licensing for sample / standard / premium / enterprise.