Editorial · French Corpus LLM · regulatory & generative AI

Author: Editorial Team

  • Building an audit-ready provenance trail for training datasets.

    An audit-ready provenance trail is the artefact that lets someone reconstruct, for any record in your training dataset, four answers: where did this come from, how did we get it, what did we do to it, and is it still allowed to be here? Without that trail, a regulator’s request to remove a data subject’s content becomes a months-long forensic exercise.

    Key takeaways

    • Per-record JSON-LD provenance is now a hard requirement under the EU AI Act, GDPR Article 17, and copyright transparency rules.
    • Use the W3C PROV-O vocabulary — regulators recognize it; they do not recognize in-house JSON schemas.
    • Sidecar files (provenance separate from training data) win over embedded fields at scale.
    • Hash-chain the shards. A signed manifest plus versioned object storage is forensically equivalent to a blockchain at 1 % the cost.
    • Built during ingestion: 5–10 % of dataset effort. Retrofitted after deployment: 5–10× the original effort.

    An audit-ready provenance trail is the artefact that lets someone reconstruct, for any record in your training dataset, the exact answer to four questions: where did this come from, how did we get it, what did we do to it, and is it still allowed to be here? Without that trail, a regulator’s request to remove a data subject’s content becomes a months-long forensic exercise. With it, the request becomes a one-day pipeline run. This guide walks through what an audit-ready provenance record contains, the W3C PROV-O vocabulary that has become the de facto standard, the implementation pattern that scales to billions of records, and the operational discipline that keeps the trail trustworthy.

    Why a provenance trail is now a hard requirement

    Built during ingestion: 5–10 % of dataset effort. Retrofitted under regulatory pressure: 5–10× the original dataset effort.

    Three regulatory pressures, all converging in 2026, have moved provenance from “nice to have” to “deploy-blocker”:

    • EU AI Act Article 10 requires documented data collection, preparation, and origin for every dataset used to train a high-risk system. Enforcement applies from 2 August 2026.
    • GDPR Article 17 gives data subjects a right to erasure, including from training datasets. A team that cannot identify which records came from which subject cannot honour the request without retraining from scratch.
    • Copyright clarity — EU Directive 2019/790 plus the AI Act’s transparency obligations for general-purpose AI providers require disclosure of the data sources used during training, with sufficient detail to allow rights-holders to verify or contest.

    The teams that come out of this period intact are the ones that built the layer during data preparation. The teams that retrofitted it after launch are the ones still negotiating timelines with their regulator a year in.

    Anatomy of an audit-ready provenance record

    For every individual training example — every row of your Parquet, every JSONL line, every annotation — the provenance record should answer the four questions. A minimal but production-grade record:

    {
      "record_id": "sha256:7f3c…a1b8",
      "source": {
        "name": "BOFiP-Impôts",
        "url": "https://bofip.impots.gouv.fr/bofip/12345-PGP.html",
        "license": "Licence Ouverte 2.0",
        "license_url": "https://www.etalab.gouv.fr/licence-ouverte-open-licence",
        "rights_holder": "DGFiP (Direction générale des Finances publiques)",
        "captured_at": "2026-05-08T14:22:31Z",
        "capture_method": "official-api"
      },
      "content_hash": "sha256:f2e8…9c44",
      "pipeline": {
        "extractor_version": "extractor-bofip@v1.4.2",
        "pipeline_commit": "git:c8380cc",
        "transformations": [
          "html_to_text",
          "language_detect",
          "minhash_dedup",
          "topical_filter"
        ]
      },
      "ai_act_declaration": {
        "intended_purpose": "high-risk-finance-llm-fine-tuning",
        "personal_data_present": false,
        "special_categories_present": false,
        "retraction_path": "DELETE /api/v1/records/{record_id}"
      },
      "ingested_at": "2026-05-13T09:14:02Z"
    }

    This is roughly 600 bytes per record uncompressed, ~150 bytes after gzip. For a 2 M-record corpus, the provenance sidecar weighs ~300 MB compressed — negligible relative to the training data itself.

    PROV-O — the vocabulary that aligns with regulator expectations

    The W3C PROV-O ontology is the standard vocabulary for representing provenance. It defines three core types:

    • Entity — a thing that exists. The raw HTML page, the cleaned text, the final Parquet record. Each is an Entity with its own URI.
    • Activity — a process that uses and produces Entities. Extraction, cleaning, deduplication, annotation are each Activities.
    • Agent — a person, organization, or piece of software responsible for an Activity. The DGFiP that published the original document, the extractor script, the annotator who reviewed the result.

    Serializing PROV-O as JSON-LD makes the representation machine-readable and Linked-Data-compliant in a single format. It is the closest the field has to a regulator-recognized standard. Three reasons to use it:

    1. An EU AI Act auditor will recognise PROV-O on sight. They will not recognise your in-house JSON schema.
    2. PROV-O graphs can be queried with SPARQL or any RDF tool, which makes “show me every record affected by removing this source” a single query (sketched just after this list).
    3. The vocabulary is stable. The W3C standard has not changed substantively since 2013. Investments in PROV-O compliance do not need to be rewritten when the next provenance fad arrives.
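
    As a concrete illustration of the three core types and of the single-query claim in point 2, here is a minimal sketch using Python’s rdflib, which ships the PROV namespace. The URIs are hypothetical placeholders, not part of any published schema:

    from rdflib import Graph, URIRef
    from rdflib.namespace import PROV, RDF

    g = Graph()

    # Hypothetical URIs for one record's lineage
    raw     = URIRef("urn:corpus:raw/7f3c")        # captured HTML page (Entity)
    record  = URIRef("urn:corpus:record/7f3c")     # training-ready record (Entity)
    extract = URIRef("urn:corpus:run/extract-42")  # extraction run (Activity)
    dgfip   = URIRef("urn:corpus:agent/DGFiP")     # publishing organisation (Agent)

    g.add((raw, RDF.type, PROV.Entity))
    g.add((record, RDF.type, PROV.Entity))
    g.add((extract, RDF.type, PROV.Activity))
    g.add((dgfip, RDF.type, PROV.Agent))
    g.add((record, PROV.wasGeneratedBy, extract))
    g.add((record, PROV.wasDerivedFrom, raw))
    g.add((raw, PROV.wasAttributedTo, dgfip))

    # "Every record affected by removing this source" as one SPARQL query
    q = """
    PREFIX prov: <http://www.w3.org/ns/prov#>
    SELECT ?record WHERE {
      ?record prov:wasDerivedFrom+ ?src .
      ?src prov:wasAttributedTo <urn:corpus:agent/DGFiP> .
    }
    """
    for row in g.query(q):
        print(row.record)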

    Implementation pattern — sidecar files, not embedded fields

    Two architectural choices in tension:

    • Embedded: the provenance fields live inside each training record. Simple, no joins needed. Drawback: changes to provenance schema require rewriting the entire dataset. Retraction is awkward because the record and its provenance share the same row.
    • Sidecar: provenance lives in a separate file (Parquet shard or JSON-LD graph) keyed by record_id. Slightly more complex to query, but provenance can evolve independently of the training data. Retraction is clean — a deletion in the sidecar plus the matching deletion in the dataset.

    The sidecar pattern wins at scale. The minor query overhead is paid back the first time you need to update the provenance schema (and you will, when the next regulatory guidance lands).

    Layout for a multi-source corpus that we use ourselves:

    /dataset/
      s5_package/
        sample/
          data-00000.parquet           (training records)
          data-00001.parquet
          ...
      s7_audit_trail/
        sample/
          provenance-00000.jsonl.gz    (PROV-O records, keyed by record_id)
          provenance-00001.jsonl.gz
          ...
        MANIFEST.json                  (dataset hash, version, signed)
        ATTESTATION_AI_ACT_ART10.md    (human-readable executive summary)

    The manifest hashes the entire provenance trail, the training data, and the pipeline code together. Tampering with any of the three breaks the manifest signature.

    Hash chains and immutability

    Want a corpus with provenance already built in?

    Every record in our French regulatory and financial corpus carries a PROV-O entry, every shard is hashed, every release ships with an Article 10 attestation document.

    Provenance records must be append-only and tamper-evident. A chain of content hashes provides both properties without requiring a blockchain or a notary service:

    1. Each provenance record carries the SHA-256 of its training record’s content.
    2. The provenance file (a single Parquet shard or JSONL chunk) carries the SHA-256 of the previous shard plus its own content. Removing or modifying a shard invalidates all downstream hashes.
    3. The manifest carries the SHA-256 of all shards plus the pipeline commit hash. Re-running the pipeline on the same source produces a different timestamp but the same content hashes — which is the property auditors look for.

    This is roughly the same construction as Git’s Merkle tree. For most teams, the operational cost is low: a small Python helper runs at the end of every dataset release and writes the hashes.
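
    A minimal sketch of such a helper, assuming the shard layout shown earlier (provenance-*.jsonl.gz files plus a MANIFEST.json). Signing the resulting manifest is left to whatever key infrastructure the team already uses:

    import hashlib
    import json
    from pathlib import Path

    def sha256_file(path: Path) -> str:
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return "sha256:" + h.hexdigest()

    def chain_shards(shard_dir: str, pipeline_commit: str) -> dict:
        """Hash each shard together with its predecessor's chain hash, then write
        a manifest covering every shard plus the pipeline commit."""
        prev = "sha256:" + "0" * 64  # genesis value for the first shard
        entries = []
        for shard in sorted(Path(shard_dir).glob("provenance-*.jsonl.gz")):
            content = sha256_file(shard)
            link = hashlib.sha256((prev + content).encode()).hexdigest()
            prev = "sha256:" + link
            entries.append({"shard": shard.name, "content_hash": content, "chain_hash": prev})
        manifest = {
            "pipeline_commit": pipeline_commit,
            "shards": entries,
            "manifest_hash": "sha256:" + hashlib.sha256(
                json.dumps(entries, sort_keys=True).encode()).hexdigest(),
        }
        (Path(shard_dir) / "MANIFEST.json").write_text(json.dumps(manifest, indent=2))
        return manifest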

    The retraction procedure

    The provenance trail’s value is most concrete when someone invokes a removal right. A workable retraction procedure has four stages:

    1. Identify. Query the provenance graph for all records matching the removal criterion (URL, rights-holder, document ID). Output: a list of record_id values.
    2. Tombstone. Mark the records as retracted in the provenance trail. The record remains in the file (immutability) but is flagged. Downstream consumers filter retracted records before use.
    3. Re-release. At the next scheduled release, the training data and provenance trail are regenerated without the retracted records. The manifest is updated; the dataset version bumps.
    4. Re-evaluate. The model that was trained on the previous version is either re-trained on the new version, or its continued use is justified with a documented assessment that the retracted content’s removal does not materially affect the model. Both are acceptable; the choice must be documented.

    The hard part is step 1. A team that built provenance during ingestion completes it in minutes. A team that relies on grep over original sources can spend weeks.
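
    What step 1 looks like in practice — a sketch assuming the sidecar layout and field names from the example record above (gzipped JSONL shards, a source.url field per record). The criterion here is a URL prefix, but the same loop works for a rights-holder or document ID:

    import glob
    import gzip
    import json

    def identify_for_retraction(prov_glob: str, url_prefix: str) -> list[str]:
        """Step 1 of the retraction procedure: collect record_ids whose source URL
        matches the removal criterion."""
        record_ids = []
        for path in sorted(glob.glob(prov_glob)):
            with gzip.open(path, "rt", encoding="utf-8") as f:
                for line in f:
                    rec = json.loads(line)
                    if rec["source"]["url"].startswith(url_prefix):
                        record_ids.append(rec["record_id"])
        return record_ids

    # Example: every record captured from a specific BOFiP document
    ids = identify_for_retraction(
        "dataset/s7_audit_trail/sample/provenance-*.jsonl.gz",
        "https://bofip.impots.gouv.fr/bofip/12345-PGP",
    )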

    What “trustworthy” provenance actually looks like to an auditor

    Three properties that distinguish provenance an auditor accepts from provenance they treat as suspicious:

    • Continuity. Provenance timestamps precede or coincide with the training-run timestamps. Provenance created weeks after a model release is interpreted as retroactive — possibly legitimate, often not.
    • Granularity. Per-record, not per-source. A “we used these 12 datasets” declaration is not provenance; it is metadata. A “this specific record came from this specific URL captured at this specific time under this specific license” record is provenance.
    • Verifiability. An auditor can pick a random record, read its provenance entry, and trace the chain back to the original source. Brittle chains (URLs that 404, hashes that do not reproduce, missing pipeline scripts) erode the entire trail’s credibility, not just the affected record’s.

    Common shortcuts that backfire

    Source-level is not enough

    "We documented sources, that’s enough" does not survive a per-record retraction request. The provenance must descend to the record granularity, or the retraction work scales with the dataset size, not the request size.

    • “We log to a SaaS data-catalog tool; that’s our provenance.” Most data-catalog tools document tables, not records. They are necessary but not sufficient.
    • “We rely on Git history for our pipeline scripts.” Git tells you what code existed when. It does not tell you which records were processed by which run. You need a runtime log keyed to record hashes, not just code commits.
    • “We use a blockchain to anchor our provenance.” Overkill for nearly every use case. A signed SHA-256 manifest stored in object storage with versioning enabled is forensically equivalent and several orders of magnitude cheaper.
    • “We documented sources, that’s enough.” Source-level documentation does not survive a per-record retraction request. The provenance must descend to the record granularity, or the retraction work scales with the dataset size, not the request size.

    Bottom line

    An audit-ready provenance trail is a per-record JSON-LD sidecar using the PROV-O vocabulary, organized as immutable hashed shards alongside the training data, with a manifest that signs the whole. It costs about 5–10 % of the engineering effort of building the dataset itself, when added during ingestion. It costs 5–10× the original dataset effort when added after deployment under regulatory pressure.

    This is the pattern we use for the French regulatory text corpus we publish — every record carries a PROV-O entry, every shard is hashed, every release ships with an Article 10 attestation document. For the regulatory layer that determines whether the trail is sufficient, see our article on EU AI Act Article 10. For the engineering decisions that make per-record provenance practical at scale, see our writing on dataset formats and training data size.

    See also: GDPR pseudonymization for LLM training data and the AI Act Article 10 datasheet template.

    Frequently asked questions

    Why PROV-O and not a custom schema?

    Regulators recognize W3C PROV-O on sight. They do not recognize in-house JSON schemas. The vocabulary has been stable since 2013 — investments in PROV-O compliance don’t need to be rewritten when the next provenance fad arrives.

    Sidecar files or embedded provenance?

    Sidecar wins at scale. Provenance can evolve independently of training data; retraction is clean (delete in sidecar plus same in dataset); schema changes don’t require rewriting the entire corpus.

    Do I need a blockchain?

    No. A signed SHA-256 manifest stored in object storage with versioning enabled is forensically equivalent and several orders of magnitude cheaper. Blockchains are overkill for nearly every training data use case.

    How does retraction actually work?

    Four stages: identify all records matching the criterion via provenance query (minutes), tombstone them in the trail, regenerate the dataset and manifest at the next release, and either re-train or document why continued use is acceptable.

    What if an auditor picks a random record?

    They can trace it back to the source URL, capture date, and license — within seconds. Brittle chains (404 URLs, non-reproducible hashes, missing pipeline scripts) erode the entire trail’s credibility, not just the affected record’s.

    Need a corpus with provenance already in place?

    Our French regulatory and financial corpus ships with PROV-O sidecar files, hash-chained shards, and an Article 10 attestation document per release.


    Keep reading

    Read next

    EU AI Act Article 10

    The regulatory layer that determines whether the trail is sufficient.

    Read next

    Choosing a dataset format

    Storage format decisions that make per-record provenance practical.

    Read next

    Training data size for LLMs

    Volume thresholds where provenance overhead becomes meaningful.

  • EU AI Act Article 10 — what training data documentation actually requires.

    Article 10 of the EU AI Act is the part that translates "the AI must be trustworthy" into engineering work product. It applies to every high-risk AI system placed on the EU market, and from 2 August 2026 it makes data governance a documentation deliverable, not an internal posture.

    Key takeaways

    • Article 10 applies to training, validation, AND test sets — not just training.
    • Eight documented governance practices required: design choices, data origin, preparation, assumptions, suitability, bias examination, mitigation, gap analysis.
    • Dataset Specification Document per dataset — 20–50 pages for a substantial LLM corpus.
    • Enforcement starts 2 August 2026. Fines reach €15 M or 3 % of worldwide annual turnover.
    • GDPR compliance is not enough. Article 10 is a separate, overlapping obligation.

    Article 10 of the EU AI Act is the part of the regulation that translates “the AI must be trustworthy” into engineering work product. It applies to every high-risk AI system placed on the EU market or used by an EU person, and it makes data governance a documentation deliverable, not an internal posture. The text is short — under 1 000 words. The implementation depth is multiple months of work for any team starting from a blank page. This guide walks through what Article 10 actually says, what each requirement means concretely for a training dataset, the enforcement timeline that matters in 2026, and the documentation artefacts you need to produce.

    What “high-risk” actually covers

    Article 10’s obligations only kick in for high-risk systems. Annex III of the Act defines the categories. The ones most likely to matter for an LLM-based product:

    • Biometric identification and categorisation — any model that infers identity, age, gender, or emotional state from a face, voice, or behavioural signal.
    • Employment and worker management — recruitment screening, CV filtering, performance evaluation, task allocation, termination decisions.
    • Access to essential services — credit scoring, insurance pricing, eligibility for public benefits, emergency services dispatch.
    • Law enforcement — risk assessment, evidence evaluation, lie detection, profiling.
    • Migration, asylum, and border control — visa and asylum risk scoring, document verification, identity assessment.
    • Administration of justice and democratic processes — assistance to judicial decisions, voting and election influence detection.
    • Critical infrastructure — safety components in road traffic, water supply, gas, electricity, digital infrastructure.

    An LLM-based product that touches any of these categories is high-risk by deployment, even if the underlying model is general-purpose. A general-purpose AI model marketed as “for HR screening” inherits the high-risk classification of the deployment use case.

    Article 10, paragraph by paragraph

    The Article has six paragraphs. The ones with concrete engineering implications:

    Paragraph 1 — Data governance applies to all training, validation, and testing sets

    The obligation is not limited to the training dataset. Validation and test sets must meet the same governance bar. In practice, this means the same provenance records, the same quality controls, the same bias audits — applied to every split, with documentation showing they were applied independently.

    Paragraph 2 — Specific governance practices

    The longest paragraph, listing eight required governance practices. Each must be documented:

    • Design choices — why was this dataset constructed in this way, what scope was excluded, what alternative sources were considered.
    • Data collection processes and origin — where each data point came from, how it was collected, under what legal basis if personal data was involved.
    • Data preparation operations — annotation, labelling, cleaning, enrichment, aggregation. The complete processing chain from raw source to training-ready format.
    • Formulation of relevant assumptions — what the data is assumed to measure or represent. Often the most overlooked requirement, and the one regulators will challenge first.
    • Assessment of availability, quantity, and suitability — does the dataset have enough representative examples for the deployment context.
    • Examination of bias — bias that could affect health and safety, bias that could lead to discrimination prohibited by EU law. Explicit, not implicit.
    • Appropriate measures to detect, prevent, and mitigate bias — what was done about the biases identified. Documentation of intent is not enough; documentation of action is required.
    • Identification of data gaps and shortcomings — what the dataset does not cover, and how that absence might affect downstream use.

    Paragraph 3 — Quality criteria

    Training, validation, and test sets must be: relevant, sufficiently representative, to the best extent possible free of errors, complete in view of the intended purpose, with appropriate statistical properties. Five criteria, each requiring measurable evidence.

    “Relevant” and “complete” are interpreted relative to the intended purpose declared in the technical documentation (Article 11). This means the intended purpose statement is no longer a marketing field — it becomes the yardstick against which dataset adequacy is judged.

    Paragraph 4 — Contextual specificity

    Datasets must reflect the geographical, contextual, behavioural, and functional setting where the system will be used. A credit-scoring model trained on Italian data and deployed in France triggers a documented assessment of cross-context transfer risk. A face-recognition model trained on a US-centric dataset and deployed in Northern Europe triggers the same.

    Paragraph 5 — Special categories of personal data

    When the dataset includes special categories of personal data (race, ethnicity, health, religion, sexual orientation), the controller may process them only for the purpose of bias detection and correction, only when strictly necessary, only with appropriate safeguards, and never for downstream prediction. This paragraph creates an explicit, narrow GDPR carve-out specifically for fairness work, with documentation requirements that exceed normal GDPR processing records.

    The documentation artefact — Dataset Specification Document

    Producing the Dataset Specification Document three months after training is finished is an order of magnitude more expensive than producing it during training. Build the layer in early.

    Article 10 does not specify a document name, but Article 11 plus Annex IV together require a Dataset Specification Document per dataset. The minimum content, distilled from the regulatory text plus the Commission’s August 2025 implementation guidance:

    1. Dataset identifier — name, version, content hash, date of construction.
    2. Intended purpose mapping — explicit link to the high-risk system’s Article 11 intended-purpose declaration.
    3. Composition — sources, volumes per source, license per source, time range, geographic coverage.
    4. Collection methodology — how each source was obtained, scraped, licensed, or annotated. Legal basis for any personal data processing.
    5. Preparation chain — every transformation from raw to final, with versioned scripts or pipelines.
    6. Assumptions and exclusions — what the dataset assumes about its content; what is deliberately excluded.
    7. Quality metrics — error rate, deduplication rate, language-detection accuracy, sampling adequacy. Numbers, not descriptions.
    8. Bias analysis — protected attributes examined, methodology, results, mitigation actions taken.
    9. Gap analysis — known limitations, populations or contexts underrepresented, downstream risks.
    10. Retention and retraction — how long the dataset is kept, how a data subject can request removal, how the system is re-evaluated after removal.

    For an LLM training corpus of any substantial size, this document runs 20 to 50 pages. Producing it three months after training is finished is an order of magnitude more expensive than producing it during training. Build the layer in early.

    Enforcement timeline — what is binding when

    Need Article 10-ready training data?

    Our French regulatory and financial corpus ships with a Dataset Specification Document, per-record JSON-LD provenance, and a retraction procedure already documented.

    The AI Act entered into force on 1 August 2024. Its provisions apply on a staggered schedule:

    • 2 February 2025: prohibitions (Article 5) on banned practices.
    • 2 August 2025: obligations for general-purpose AI models (Chapter V).
    • 2 August 2026: Article 10 and the rest of the high-risk obligations become applicable. Member State authorities can begin enforcement.
    • 2 August 2027: high-risk systems embedded in products subject to existing Union harmonisation legislation.

    The 2 August 2026 date matters because most teams operating high-risk systems in the EU today already need to be compliant in less than three months. Penalties for Article 10 non-compliance reach the higher of €15 million or 3 % of worldwide annual turnover. Other Act violations can go up to €35 million or 7 %, but Article 10 sits in the 3 % tier.

    Common misinterpretations

    No retroactive exception

    Authorities will compare documentation timestamps against training run timestamps. Obvious retroactive compliance attempts (creating provenance records months after training) are a red flag and may trigger deeper audit. Build the layer during data preparation, not after deployment.

    • “Our model is hosted in the US, so the AI Act doesn’t apply.” Wrong. The Act applies extraterritorially. If the output is used by a person in the EU, the provider’s obligations attach.
    • “We don’t train on personal data, so Article 10 is irrelevant.” Wrong. Article 10 applies to all training data, personal or not, for any high-risk use case.
    • “GDPR compliance is enough.” Wrong. GDPR governs personal-data processing legality. Article 10 governs dataset quality, governance, and documentation for high-risk AI. Different obligations, overlapping but distinct.
    • “We can document the dataset retroactively before audit.” Possible, expensive, and increasingly suspicious. Authorities will compare documentation timestamps against training run timestamps; obvious retroactive compliance attempts are a red flag.
    • “Open-source datasets we use absolve us of responsibility.” Wrong. The deployer carries responsibility for the dataset used, regardless of upstream origin. A documented diligence check on the upstream provenance is required, not an automatic pass-through.

    Practical implementation — minimum-viable Article 10 compliance

    For a team starting from no formal data governance, the minimum work to reach demonstrable Article 10 compliance:

    1. Map your high-risk surface. Which of your deployed AI systems fall under Annex III? List them by name. If none do, Article 10 may not apply to you yet — but the Commission’s general-purpose AI guidance still does.
    2. Inventory your datasets. Every dataset that has touched a high-risk system, including upstream public datasets, scraped data, synthetic data, and user-derived data. Record source, license, date, content hash — a minimal hashing sketch follows this list.
    3. Build a Dataset Specification Document per dataset. Use the 10-section template above. Start with the easy sections (composition, sources). The bias and gap analysis are harder and may require external help.
    4. Implement per-record provenance. A JSON-LD or similar audit trail attached to each training example, recording its source, license, processing chain, and ingestion timestamp. This becomes the evidence underpinning the dataset specification.
    5. Define the retraction procedure. Written. Tested. Owned by a named person. The hardest item on the list, and the one that will be requested first if a data subject ever invokes GDPR Article 17 against your training set.
    6. Annual review. Every Dataset Specification Document is reviewed and re-signed annually, even if the dataset has not changed. This produces the trail of ongoing oversight that Article 10 implicitly requires.
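
    A minimal sketch for the inventory step (step 2), assuming the corpus lives as files on disk. The source and license values are whatever your ingest layer recorded, so treat the field names and paths as illustrative:

    import hashlib
    import json
    from datetime import datetime, timezone
    from pathlib import Path

    def inventory_entry(path: Path, source: str, license_id: str) -> dict:
        """One dataset-inventory row: source, license, date, content hash."""
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        return {
            "file": str(path),
            "source": source,
            "license": license_id,
            "content_hash": "sha256:" + digest,
            "inventoried_at": datetime.now(timezone.utc).isoformat(),
        }

    # Example: hash every Parquet shard of a corpus into an inventory file
    rows = [inventory_entry(p, "BOFiP-Impôts", "Licence Ouverte 2.0")
            for p in sorted(Path("dataset").rglob("*.parquet"))]
    Path("dataset_inventory.jsonl").write_text(
        "\n".join(json.dumps(r, ensure_ascii=False) for r in rows))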

    Bottom line

    Article 10 turns dataset governance from an internal practice into a regulatory deliverable. The enforcement date is 2 August 2026. The minimum work — Annex III mapping, dataset inventory, per-dataset specification documents, per-record provenance trail, retraction procedure — is multiple months for a team starting from zero. The teams that survive comfortably are the ones that built the layer during data preparation, not bolted on after deployment.

    For a concrete pattern for the per-record provenance layer, see our companion guide on building an audit-ready provenance trail for training datasets. For the dataset construction discipline that makes Article 10 compliance practical at scale, see our writing on dataset formats and SFT dataset curation.

    See also: GDPR pseudonymization for LLM training data, the AI Act Article 10 datasheet template, and the French legal NLP landscape.

    Frequently asked questions

    Does the EU AI Act apply if my model is hosted in the US?

    Yes. The Act applies extraterritorially. If the output is used by a person in the EU, the provider’s obligations attach regardless of where the model is hosted.

    Is GDPR compliance enough?

    No. GDPR governs personal-data processing legality. Article 10 governs dataset quality, governance, and documentation for high-risk AI. Different obligations, overlapping but distinct.

    What counts as a ‘high-risk’ system?

    The Annex III categories: biometric identification, employment screening, access to essential services (credit, insurance), law enforcement, migration, judicial assistance, critical infrastructure safety. A general-purpose LLM becomes high-risk by deployment.

    Can I use open-source datasets and let upstream carry responsibility?

    No. The deployer carries responsibility for the dataset used, regardless of upstream origin. A documented diligence check on the upstream provenance is required.

    What are the penalties?

    Article 10 violations sit in the lower tier: €15 M or 3 % of worldwide annual turnover, whichever is higher. Other Act violations go up to €35 M or 7 %.

    Need AI Act-ready training data?

    Our French regulatory and financial corpus ships with Article 10 documentation in place — per-document provenance, retraction procedure, dataset specification document, attestation per release.


    Keep reading

    Read next

    Building an audit-ready provenance trail

    The per-record provenance pattern that satisfies Article 10’s documentation requirements.

    Read next

    How to train an LLM on your own data

    Where the governance layer plugs into a fine-tuning pipeline.

    Read next

    Choosing a dataset format

    The storage decisions that make per-record provenance scalable.

  • Training data size for LLMs — how many tokens do you actually need in 2026?

    "How many tokens do I need?" has eight different right answers, depending on whether you mean pre-training, continued pre-training, SFT, DPO, RLHF, distillation, RAG indexing, or evaluation. The Chinchilla scaling law was the first quantitative answer in 2022; three years on, the industry has both validated and substantially exceeded it.

    Key takeaways

    • Modern pre-training uses 100–500 tokens per parameter, not the Chinchilla 20 — inference compute dominates total cost.
    • Continued pre-training sweet spot: 1–10 B in-domain tokens with a 1:4 in-domain to general mix.
    • SFT sweet spot: 1 000–10 000 curated examples. Above 100 000, diminishing returns.
    • DPO and preference optimization: 10 000–50 000 pairs for style/safety tuning.
    • Eval set: 200–500 examples, never leaked into training. The discipline matters more than the size.

    “How many tokens do I need to train my model?” has eight different right answers, depending on whether you mean pre-training, continued pre-training, SFT, DPO, RLHF, distillation, RAG indexing, or evaluation. The Chinchilla scaling law gave the field its first quantitative answer in 2022. Three years later, the industry has both validated and substantially exceeded it. This guide walks through the actual data-size guidance for each training stage in 2026, the diminishing-returns thresholds, and the trade-offs that determine whether you should aim for “compute-optimal” or “inference-optimal” data volumes.

    The Chinchilla baseline — and why most modern models exceed it

    Inference compute wins

    Llama 3 used 15 T tokens at 8B (1 900 tokens/parameter — way over Chinchilla’s 20). Why? Inference compute dominates total cost over a model’s lifetime. Over-training trades training cost for cheaper inference forever — a winning trade for any deployed model.

    The Chinchilla paper from DeepMind established that, for a fixed compute budget, you should train a smaller model on more tokens rather than a larger model on fewer tokens. The headline finding: roughly 20 tokens per parameter is compute-optimal for pre-training. A 7 B model wants ~140 B tokens; a 70 B model wants ~1.4 T tokens.

    Three years later, every major open-weights release has gone well beyond Chinchilla-optimal:

    • Llama 3 (8 B and 70 B): 15 T tokens. That is ~1 900 tokens/parameter for the 8 B and ~210 for the 70 B.
    • DeepSeek-V3 (671 B mixture-of-experts, 37 B active): 14.8 T tokens.
    • Qwen 2.5 (7 B): 18 T tokens — over 2 500 tokens per parameter.

    This is not a refutation of Chinchilla; it is a different optimization. Chinchilla optimizes training compute. Modern teams optimize inference compute: a model that is “over-trained” (more tokens per parameter than Chinchilla-optimal) achieves the same quality at smaller parameter count, which means cheaper inference for the entire lifetime of the model. Inference compute dwarfs training compute over a model’s deployed life, so the trade-off favours over-training. If you are pre-training from scratch in 2026, aim for 100–500 tokens per parameter, not 20.

    Continued pre-training — moving the needle on a specific domain

    Continued pre-training resumes the base model’s autoregressive language-modelling objective on an in-domain corpus. The data-size thresholds are well established in the literature:

    • Below ~500 M tokens of in-domain data: measurable improvement on in-domain perplexity, often at the cost of out-of-domain capability. Marginal benefit over SFT for most use cases.
    • 1–10 B tokens in-domain: the sweet spot for vertical-language LLMs (medical, legal, finance, code). Enough to shift token frequencies and acquire domain vocabulary without catastrophically forgetting general competencies.
    • 10–100 B tokens in-domain: approaches the volume of a small dedicated pre-training. Genuinely changes the model’s distribution. Reserve for cases where a vertical foundation model is the deliverable.

    One practical rule: maintain a 1:4 in-domain to general-corpus ratio during continued pre-training. A pure in-domain run almost always degrades general fluency more than the in-domain gain justifies.

    SFT — examples, not tokens

    Need 1–10 B in-domain tokens of French regulatory text?

    Our corpus ships 460 M tokens of finance, regulation, and economic FR text in tiered packages — the right volume to swap into a continued pre-training mix.

    For supervised fine-tuning, the metric is examples (instruction–response pairs), not tokens. The empirical thresholds, validated across hundreds of public fine-tunes:

    • 50–500 examples: teaches the model a specific format or persona. Useful for tightly scoped tasks (always answer in JSON, always cite sources, refuse a specific class of request).
    • 1 000–10 000 examples: the LIMA range. With careful curation, this band produces high-quality general-purpose assistants. Most public open-source SFT recipes sit here.
    • 10 000–100 000 examples: diminishing returns set in. Each additional 10 K examples often adds 0.5–1 % on benchmarks, sometimes less. Worth it only if the additional examples cover gaps identified in evaluation.
    • Above 100 000 examples: needed for multi-task SFT covering a broad capability surface. Standard for the post-training stage of public foundation models. Rarely needed for vertical fine-tunes.

    Quality dominates quantity throughout this regime. A team that ships 2 000 expert-reviewed SFT examples will outperform a team that ships 50 000 LLM-generated examples on the same task. The composition of an SFT dataset is covered in detail in our companion article on SFT dataset formats.

    DPO and preference optimization — preference pairs

    Direct Preference Optimization and its variants (ORPO, KTO, SimPO) train on pairs of preferred and rejected responses to the same prompt. Data-size guidance:

    • 5 000–20 000 preference pairs: sufficient to tune style, formatting, and safety behaviour after SFT. The post-SFT alignment step for most open-weights releases.
    • 50 000–200 000 preference pairs: needed when DPO is the primary alignment vehicle (no separate SFT or weak SFT). The volume used by frontier alignment pipelines.

    Preference data quality is harder to measure than SFT quality because it is binary (preferred vs rejected) and depends on human-judgment consistency. Use the agreement rate between independent annotators as a quality proxy — below 75 %, the data is too noisy to train on.

    Evaluation data — the smallest, the most important

    An eval set is small in absolute terms but disproportionate in impact: it gates every decision about training.

    • 20–50 examples: enough to detect catastrophic regressions during development. Run after every training change. Must never overlap with the training data.
    • 200–500 examples: the production eval set. Stratified by task, source, and difficulty. Used for go/no-go decisions and tracked over time as a “regression budget”.
    • 1 000–5 000 examples: for benchmark releases or when shipping a model to a customer who will run their own evaluation. Documented with clear scope, scoring rubric, and known limitations.

    One discipline that matters more than size: never let an eval example leak into training data. The fastest way to inflate apparent quality and the fastest way to lose customer trust when the inflation is found out.
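
    A minimal exact-match leak check between an eval set and a training file, assuming JSONL with a text field (adjust the field name to your schema). Near-duplicate leakage needs MinHash or embedding similarity on top of this:

    import hashlib
    import json

    def _key(text: str) -> str:
        # Normalize whitespace and case before hashing so trivial reformatting still matches
        return hashlib.sha256(" ".join(text.lower().split()).encode()).hexdigest()

    def leaked_examples(train_path: str, eval_path: str, field: str = "text") -> int:
        """Count training examples that exactly duplicate an eval example."""
        with open(eval_path, encoding="utf-8") as f:
            eval_keys = {_key(json.loads(line)[field]) for line in f}
        hits = 0
        with open(train_path, encoding="utf-8") as f:
            for line in f:
                if _key(json.loads(line)[field]) in eval_keys:
                    hits += 1
        return hits

    assert leaked_examples("train.jsonl", "eval.jsonl") == 0, "eval examples leaked into training data"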

    RAG corpus — quality of chunks, not quantity of tokens

    For retrieval-augmented generation, the question of “how much data” reduces to “how much queryable, well-chunked, well-embedded text”. The thresholds where things change:

    • Under 10 000 documents: a simple vector index (FAISS, Qdrant, Pinecone) on a single machine. Embedding generation takes hours. The retrieval bottleneck is recall, not latency.
    • 10 000 – 1 million documents: hybrid search (BM25 + vectors) becomes worthwhile. Re-ranker on top of retriever starts to materially improve precision.
    • Above 1 million documents: the index becomes the dominant operational cost. Hierarchical retrieval (cluster-first, then chunk-rerank) replaces flat search. Embedding model choice matters more than vector store choice.

    Doubling the corpus rarely doubles answer quality. Doubling the chunk-quality (cleaner extractions, smarter chunking, deduplication) often does. RAG performance is overwhelmingly bottlenecked by chunk quality and re-ranker quality, not raw token count.

    Estimating cost before you commit

    A back-of-envelope formula that gets you within 30 % of the real GPU-hour cost for a pre-training or continued-pre-training run:

    GPU-hours ≈ 6 × N_params × N_tokens / (GPU_TFLOPS × 1e12 × 3600 × utilization)
    
    with utilization = 0.4 to 0.55 on modern stacks

    Plugging in a 7 B continued pre-training on 5 B tokens, with H100 at 989 TFLOPS bf16 and 50 % utilization: about 120 GPU-hours, or 5 days on a single H100, or 15 hours on an 8× H100 node. At cloud rates (~3 USD/H100-hour in mid-2026), that is roughly 360 USD — a useful sanity check before reserving a cluster.
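
    The same estimate as a few lines of Python, so the assumptions (peak TFLOPS and utilization) stay explicit rather than buried in mental arithmetic:

    def gpu_hours(n_params: float, n_tokens: float,
                  gpu_tflops: float = 989.0, utilization: float = 0.5) -> float:
        """Back-of-envelope training cost: ~6 * N * D FLOPs divided by effective throughput."""
        total_flops = 6 * n_params * n_tokens
        flops_per_gpu_hour = gpu_tflops * 1e12 * 3600 * utilization
        return total_flops / flops_per_gpu_hour

    # 7 B continued pre-training on 5 B tokens, H100 bf16, 50 % utilization
    hours = gpu_hours(7e9, 5e9)
    print(f"{hours:.0f} GPU-hours, ~{hours / 24:.1f} days on one GPU, ~${hours * 3:.0f} at 3 USD/h")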

    For SFT and DPO, the formula is similar but the relevant quantity is example-tokens × epochs, and the utilization is lower (more I/O bound). A 5 K-example SFT on a 7 B model with QLoRA fits in 1–3 hours on a single 24 GB GPU.

    The diminishing-returns line

    If adding 10 % more data yields under 0.3 % improvement on your headline metric, stop scaling data and start scaling something else — annotation quality, prompt design, model architecture.

    At every training stage there is a volume above which adding more data costs more than it gains. The signals that you have crossed it:

    • Eval loss plateaus over the last 20–30 % of training tokens.
    • Eval task accuracy stops improving while training loss continues to drop — the gap between training and eval grows.
    • Adding 10 % more data yields under 0.3 % improvement on the headline metric.
    • Specific failure modes (the ones your customers report) do not budge with more data of the same distribution.

    When any two of these fire, stop scaling data and start scaling something else — annotation quality, prompt design, model architecture, or simply ship the version you have and revisit after deployment data accumulates.

    Quick reference — token volumes by stage

    Stage | Minimum | Sweet spot | Diminishing returns
    From-scratch pre-training | 10× params (tokens) | 200–500× params | ~1 000× params
    Continued pre-training | ~500 M tokens | 1–10 B tokens | ~100 B tokens
    SFT | ~100 examples | 1 000–10 000 | ~100 000
    DPO / preference | ~2 000 pairs | 10 000–50 000 | ~200 000
    Eval set | 20–50 examples | 200–500 | ~5 000
    RAG corpus | any size | quality > quantity | indexing cost dominates

    Bottom line

    The correct answer to “how many tokens do you need” is “what training stage, for what model, with what quality bar”. Pre-training in 2026 wants 200–500 tokens per parameter, not the Chinchilla-optimal 20. Continued pre-training wants 1–10 B in-domain tokens with a general-corpus rehearsal mix. SFT wants 1 000–10 000 carefully curated examples. DPO wants 10 000–50 000 preference pairs. Evals want 200–500 examples and a discipline of never leaking them into training.

    If you are planning a vertical fine-tune and want to compare the actual token volumes that drove our French regulatory and financial text corpus, see our companion guides on SFT datasets and dataset formats — the storage and curation decisions that determine whether your token count translates into usable training signal.

    See also: the best public LLM datasets in 2026 and LLM corpus deduplication techniques.

    Frequently asked questions

    Is the 20-tokens-per-parameter Chinchilla rule still valid?

    Compute-optimal — yes. But modern teams optimize inference compute, not training compute. Over-training (100–500 tokens/parameter) gives a smaller model with the same quality and cheaper inference for the model’s entire deployed life.

    How many SFT examples do I really need?

    1 000–10 000 carefully curated. The 2023 LIMA paper plus three years of replication studies confirm that quality dominates quantity in this range. Above 100 000, returns diminish hard.

    How big should my evaluation set be?

    200–500 stratified examples is the production sweet spot. Smaller for development sanity, larger for benchmark release. Discipline matters more than size — never let eval leak into training.

    How much RAG data do I need?

    Quality of chunks dominates quantity of tokens. Below 10 K documents a flat vector index is fine; 10 K–1 M needs hybrid search and a reranker; above 1 M, index cost dominates and you need hierarchical retrieval.

    How do I know when to stop adding data?

    Four signals: eval loss plateaus, eval gap grows, 10 % more data yields under 0.3 % gain, and specific user-reported failures don’t move. When two of these fire, switch focus.

    Need a tokenized French corpus for continued pre-training?

    We deliver 460 M tokens of finance, regulation, and economic French text as Parquet shards — versioned, provenance-tracked, AI Act art. 10 ready.


    Keep reading

    Read next

    How to train an LLM on your own data

    Which training stage actually needs how much data.

    Read next

    SFT datasets — format and best practices

    What 1 000–10 000 high-quality SFT examples look like.

    Read next

    Choosing a dataset format

    Storage choices that scale with token volume.

  • Choosing a dataset format — Parquet vs JSONL vs Arrow for ML pipelines.

    The choice between Parquet, JSONL, and Arrow looks like a storage detail. It becomes a deployment-blocking constraint the first time your team needs to load 500 GB through a pipeline prototyped on a 200 MB JSONL file.

    Key takeaways

    • JSONL up to ~500 MB or ~1 M examples — beyond that, switch to Parquet.
    • Parquet for canonical storage: schema-typed, columnar, 3–5× compression, splittable.
    • Arrow for the trainer cache: memory-mapped, zero-copy, near-instant random access.
    • Never pre-tokenize into the storage format — tokenizer changes invalidate the cache.
    • 100–500 MB per Parquet file is the sweet spot; one giant file kills splittability.

    The choice between Parquet, JSONL, and Arrow looks like a storage-engineering detail. It becomes a deployment-blocking constraint the first time your team needs to load 500 GB of training data through a pipeline that was prototyped on a 200 MB JSONL file. This guide walks through each format’s actual behaviour at the volumes that matter, the operations they optimize for and against, and a decision rule that pins down which format to use at which stage of an ML pipeline.

    The three formats in one sentence each

    • JSONL — one JSON object per line, text-encoded, schema-free, human-readable, slow to parse, slow to compress, perfect for under-1 GB datasets and for the input/output boundary of a pipeline.
    • Parquet — columnar binary, schema-on-write, heavy compression, fast columnar scans, slow to write, slow to inspect by hand, the canonical storage format for any dataset above a few hundred megabytes.
    • Arrow — columnar in-memory representation that maps directly to Parquet on disk. Zero-copy reads, memory-mapped access, the bridge between disk storage and training-loop dataloaders.

    None of the three is universally better. They serve different stages of the data lifecycle.

    JSONL — when to choose it

    JSONL is the lingua franca of small-to-medium dataset exchange. Its advantages are operational, not technical:

    • Line-oriented. Streamable end-to-end. head, wc -l, grep, jq work without any specialized library.
    • Schema-free. Adding or removing a field per record costs nothing at write time. Useful during dataset construction when the schema is still in flux.
    • Git-friendly. Diffs are readable. Small datasets can live in a repo alongside the code that produced them.
    • Universally supported. Every training framework — TRL, Axolotl, Unsloth, raw PyTorch — accepts JSONL natively.

    The trade-offs become punishing at scale:

    • No compression by default. A 100 MB JSONL file compresses to 15–25 MB with gzip, but loading then decompressing on every epoch wastes throughput.
    • Slow to parse. Python json.loads is one of the slower hot paths in a typical training pipeline. At 50 K examples/s on a single core, an 80 M-example dataset takes 25 minutes per epoch just for parsing.
    • No column projection. Reading one field requires reading the whole record. Reading one record requires scanning until that record’s line break.
    • No types. A column that is sometimes a string, sometimes a number, sometimes null is silently accepted on write and explodes on read.

    Decision rule: use JSONL up to ~500 MB or ~1 M examples. Beyond that, convert to Parquet for storage and use JSONL only as an export format for sharing samples or feeding small jobs.

    Parquet — when to choose it

    Parquet is the storage format you want once your dataset stops fitting comfortably in RAM. Its design properties translate directly to ML pipeline benefits:

    • Columnar layout. Reading the text column without touching the metadata column is a real I/O saving — often 5–10× faster than row-oriented formats.
    • Type-safe schema. Each column has a declared type (string, int32, float64, list<string>, struct<…>). The schema is stored in the footer; downstream readers cannot silently coerce a string to a number.
    • Column-level compression. Each column is compressed independently with the codec best fit to its type. Text columns hit zstd or snappy; numerical columns hit dictionary or delta encoding. Typical compression ratios for text-heavy ML datasets: 3–5× over JSONL.
    • Predicate pushdown. Modern readers (PyArrow, DuckDB, Polars) can filter on column values before decompressing — a one-line filter on language code reads only the matching row groups.
    • Splittable. Files are organized in row groups (default ~128 MB). Distributed training and Spark-style processing can split files transparently.

    The trade-offs:

    • Not designed for append. Parquet files are written once. Updating a single record means rewriting the file (or the partition).
    • Schema-on-write. Changing a column type requires rewriting all files. Add columns liberally up-front; they cost nothing if unused.
    • Binary. Cannot be inspected with cat or head. Use parquet-tools, pyarrow, duckdb, or any of the visual Parquet viewers.
    • Tooling weight. Reading a 50 GB Parquet dataset in pure Python without PyArrow or Polars is impractical. Add the dependency.

    Decision rule: use Parquet for any dataset above 500 MB, for any dataset that will be read more than once, and for any dataset whose schema is stable enough to declare.
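
    A minimal conversion sketch with PyArrow, streaming JSONL into a zstd-compressed Parquet file. The schema shown is illustrative — declare your own columns explicitly rather than relying on inference from the first batch:

    import json
    import pyarrow as pa
    import pyarrow.parquet as pq

    # Illustrative flat schema; replace with your corpus columns
    schema = pa.schema([
        ("record_id", pa.string()),
        ("text", pa.string()),
        ("source", pa.string()),
        ("language", pa.string()),
    ])

    def jsonl_to_parquet(jsonl_path: str, parquet_path: str, batch_size: int = 100_000) -> None:
        """Stream a JSONL file into a Parquet file batch by batch, so nothing must fit in RAM."""
        writer = pq.ParquetWriter(parquet_path, schema, compression="zstd")
        batch = []
        with open(jsonl_path, encoding="utf-8") as f:
            for line in f:
                batch.append(json.loads(line))
                if len(batch) >= batch_size:
                    writer.write_table(pa.Table.from_pylist(batch, schema=schema))
                    batch = []
        if batch:
            writer.write_table(pa.Table.from_pylist(batch, schema=schema))
        writer.close()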

    Arrow — when to choose it

    Arrow is not really a competing storage format — it is the in-memory representation that Parquet, Feather, and most modern columnar engines deserialize into. The distinction matters because Arrow has its own on-disk variants (Feather / Arrow IPC) used in specific places:

    • Hugging Face datasets cache. When you call load_dataset(), the library materializes the dataset as an Arrow file on disk and memory-maps it during training. This gives near-zero-cost random access to billions of examples.
    • Inter-process and inter-language data exchange. Arrow IPC is the format Pandas, Polars, R, Julia, and Rust use to pass dataframes around without serialization overhead.
    • Streaming RPC. Arrow Flight is the de facto protocol for moving columnar data between services in 2026, replacing gRPC-with-Protobuf for analytics workloads.

    For dataset storage, Feather is rarely the right choice over Parquet: it has weaker compression and weaker tooling support outside the Arrow ecosystem. Use Feather only when the file lifetime is short (intermediate cache) or when zero-overhead read latency matters more than disk size.

    A canonical pipeline that uses all three

    Looking for production-grade Parquet shards of French regulatory text?

    Our corpus ships as Parquet with a stable 10-column schema, deduplicated, and ready to ingest into a Hugging Face datasets pipeline.

    JSONL at the edges where humans look. Parquet in the middle where machines store. Arrow at the end where trainers consume. Forcing one format across all stages produces either opaque binary or a pipeline parsing text.

    A production ML dataset pipeline in 2026 typically uses each format at the stage where it shines:

    1. Ingest layer (JSONL): raw extractions from APIs, scrapers, or curated sources land as JSONL. Schema is loose. Files are small enough to inspect by hand and small enough that re-running the extractor is cheap.
    2. Storage layer (Parquet): after deduplication, language filtering, and schema normalization, the canonical dataset is written as Parquet shards (one shard per source, partitioned by date or split). This is the version that gets versioned, hashed, and referenced in model cards.
    3. Training layer (Arrow): the trainer loads Parquet, materializes to Arrow in the OS page cache (via datasets.load_dataset or PyArrow), and memory-maps during the actual training loop. Random access is constant time; shuffling is cheap.
    4. Export layer (JSONL): when a tier of the dataset is released to a downstream consumer (sample for evaluation, premium for a paying customer), the chosen subset is exported back to JSONL for portability. The Parquet master remains the source of truth.

    This is the pattern we use ourselves for the French regulatory text corpus we publish — JSONL at the source edges, Parquet for the canonical store, Arrow under the trainer.
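
    The training-layer handoff in a few lines, with hypothetical file paths. load_dataset writes the Arrow cache and memory-maps it, so the dataset never needs to fit in RAM:

    from datasets import load_dataset

    # Hypothetical shard layout; any glob of Parquet files works
    ds = load_dataset(
        "parquet",
        data_files={"train": "corpus/parquet/part-*.parquet"},
        split="train",
    )

    print(ds.features)   # the typed schema read from the Parquet footers
    print(ds[123_456])   # constant-time random access through the memory-mapped Arrow cache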

    Common mistakes

    Most expensive mistake

    Pre-tokenizing into the storage format. Tokenizer version changes (new vocab, special tokens, model update) invalidate the entire cache. Store raw text; tokenize at trainer-time. Compute cost is negligible compared to the rebuild.

    • One giant Parquet file. Splittability is a Parquet feature only when there are multiple row groups, ideally across multiple files. Aim for 100–500 MB per file, ~50–200 MB per row group.
    • Mixing types in a JSON field. A field that is “string in 95 % of records, list of strings in 5 %” will be silently coerced on Parquet conversion. Normalize at ingest, not at the converter.
    • Pre-tokenizing into the storage format. Tokenizer changes (new vocabulary, special tokens, model version) invalidate the cache. Store raw text; tokenize on the fly inside the trainer. The compute cost is negligible compared to the cache rebuild cost.
    • Storing model outputs in the same file as inputs. Adds noise to your dataset hash and complicates retraction. Use a sidecar Parquet with a foreign-key column to the input record.
    • Ignoring file naming. Files like data.parquet, final_v2_NEW.parquet are time bombs. Use a deterministic naming scheme: {source}/{date}/part-{shard:04d}.parquet.

    Quick reference table

    Criterion | JSONL | Parquet | Arrow IPC / Feather
    Storage layout | Row-oriented text | Columnar binary | Columnar binary
    Compression | None (gzip optional) | Built-in, per column | Built-in (lighter)
    Schema | Implicit | Declared, typed | Declared, typed
    Best at scale | Below 1 GB | Above 500 MB | In-memory / cache
    Append support | Yes | No (write-once) | No
    Human inspection | Trivial | Needs tool | Needs tool
    Tool dependency | None | pyarrow, polars, duckdb | pyarrow
    Typical role | Edge / export | Canonical storage | Trainer cache

    Bottom line

    Use JSONL at the edges of your pipeline where humans look at the data. Use Parquet in the middle where machines store and version it. Use Arrow at the end where training loops consume it. Forcing one format across all stages produces either an unworkable pile of opaque binary or a pipeline that wastes most of its time parsing text.

    For more on how this storage choice interacts with downstream SFT and pre-training workflows, see our companion guides on SFT dataset formats and training data size for LLMs.

    See also: LLM corpus deduplication techniques and what makes a corpus retrieval-friendly.

    Frequently asked questions

    When does JSONL stop working?

    Around 500 MB or 1 M examples. Above that, parse time dominates, no column projection, no compression, no schema enforcement. Convert to Parquet for storage and keep JSONL only as an export.

    Parquet or Feather/Arrow IPC?

    Parquet for storage. Feather/Arrow IPC only for short-lived caches or inter-language data exchange. Feather has weaker compression and weaker tooling for long-term storage.

    Should I gzip my JSONL?

    Yes if you’re stuck with JSONL at scale and need to ship over the network. No if you’re training: the decompression on every epoch wastes throughput. Convert to Parquet instead.

    How big should each Parquet file be?

    100–500 MB per file, with row groups of 50–200 MB inside. Multiple files enable splittability; one giant file means a single reader streams it linearly.

    Does Hugging Face datasets use Parquet or Arrow?

    Both. load_dataset() reads Parquet (or other formats), materializes to Arrow on disk in the local cache, and memory-maps the Arrow file during training. You write Parquet; HF handles the Arrow side.

    Looking for Parquet-shipped French regulatory text?

    Our corpus uses the canonical pipeline described above — JSONL at edges, Parquet for the canonical store, Arrow under the trainer. Per-record provenance baked in.


    Keep reading

    Read next

    SFT datasets — format and best practices

    How JSONL shapes the typical SFT pipeline.

    Read next

    Training data size for LLMs

    Volume thresholds where format choices start to bite.

    Read next

    Building an audit-ready provenance trail

    How storage format interacts with per-record provenance.

  • SFT datasets — format, structure, and instruction-tuning best practices.

    A supervised fine-tuning dataset looks deceptively simple — inputs and target outputs. The difficulty is hidden in the format you choose, the template you apply, and the loss mask you compute. Get any wrong and your training run completes cleanly while teaching the model something different from what you intended.

    Key takeaways

    • Chat format (messages array) is the 2026 default — adopt unless you have a specific reason not to.
    • Always use tokenizer.apply_chat_template() — never hardcode the chat string.
    • Loss-mask the prompt; train only on the assistant tokens.
    • 1 000 curated examples outperform 50 000 unfiltered ones. LIMA is consensus, not opinion.
    • Version the dataset (hash + version id in the model card) — Article 10 expects it.

    A supervised fine-tuning dataset looks deceptively simple: a list of inputs and the responses you want the model to produce. The difficulty is hidden in three places — the format you choose, the template you apply, and the loss mask you compute. Get any of those wrong and the training run will complete cleanly while teaching the model something different from what you intended. This guide covers what an SFT dataset actually is, the three formats in use in 2026, the chat templates you cannot ignore, and the quality bar that distinguishes a usable dataset from a dataset that quietly poisons your fine-tune.

    What an SFT dataset is, and is not

    An SFT dataset is a curated collection of input–output examples that demonstrate the behaviour you want the model to learn. It is not a knowledge base, not a search index, and not raw text. Three properties separate it from anything else:

    • Each example has a clear target output. There is no ambiguity about what the model should produce.
    • Each example carries an implicit format contract. The model learns the prefix structure (system prompt, user role, special tokens) as much as it learns the content.
    • The dataset has a defined scope. A mixed dataset of summarization plus code generation plus tool calling teaches each task less well than three focused datasets, unless balanced with deliberate care.

    If your goal is to inject knowledge that the base model lacks, SFT is the wrong instrument — use retrieval-augmented generation or continued pre-training instead. SFT teaches behaviour and format, not facts.

    The three SFT dataset formats in 2026

    Three formats dominate, with different fit depending on whether you are doing single-turn instruction following, multi-turn assistant behaviour, or pre-templated training.

    Format A — Chat (messages array)

    The dominant format in 2026. Each example is a list of messages with a role field (system, user, assistant, occasionally tool) and a content field. JSONL with one example per line.

    {"messages": [
      {"role": "system", "content": "You are a financial compliance assistant."},
      {"role": "user", "content": "Summarise the key changes in DORA art. 5."},
      {"role": "assistant", "content": "DORA Article 5 introduces..."}
    ]}

    Best fit when the target deployment is a chat or assistant interface. Naturally supports multi-turn examples. Most modern training frameworks (TRL’s SFTTrainer, Axolotl, Unsloth) accept this format directly.

    Format B — Instruct (prompt/response pairs)

    The legacy Stanford Alpaca format. Three fields: instruction (what to do), input (optional context), output (the target response).

    {"instruction": "Classify the regulatory framework.",
     "input": "MiCA regulation 2023/1114",
     "output": "MiCA is the EU Markets in Crypto-Assets regulation..."}

    Useful for single-turn task-specific training, especially when the dataset originated as a classification or extraction set. Easier to construct from CSV-style sources but loses information when the target deployment is multi-turn. Convert to chat format if you plan to ship a conversational interface.

    Format C — Pre-templated text

    A single text field per example, already formatted with the model’s chat template and special tokens. The training framework does no further processing.

    {"text": "<|im_start|>systemnYou are a compliance assistant.<|im_end|>n<|im_start|>usernSummarise DORA art. 5.<|im_end|>n<|im_start|>assistantnDORA Article 5...<|im_end|>"}

    Use when you need full control over how the prompt is tokenized — for instance, when training on a custom template or when reproducing an exact published recipe. Avoid for general work: pre-templating couples the dataset to a specific tokenizer version, making the dataset non-portable.

    Chat templates: the silent regression source

    Common failure mode

    A Qwen-templated dataset trained on a Llama base model is a guaranteed accuracy regression. Always regenerate the templated text when you switch base models — do not reuse cached templates.

    Every modern base model ships with a chat template that defines how messages are serialized into a single string before tokenization. Llama 3, Qwen 2.5, Mistral, Gemma 2, and the open-weights cohort each use a different template, and using the wrong one will silently degrade performance.

    • Llama 3: <|begin_of_text|>, <|start_header_id|>role<|end_header_id|>, <|eot_id|>.
    • Qwen 2.5 / Qwen 3: ChatML — <|im_start|>role and <|im_end|>.
    • Mistral: [INST] ... [/INST] with optional system prefix.
    • Gemma 2: <start_of_turn>role and <end_of_turn>.

    Two practical rules:

    1. Always use tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=False) instead of hard-coding strings. The function is owned by the model authors and updates if the template evolves.
    2. If you switch base models mid-project, regenerate the templated text rather than reusing the cached version. A Qwen-templated dataset trained on a Llama base model is a guaranteed accuracy regression.
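
    Both rules in a minimal sketch; the model name is an example, and swapping it is all it takes to regenerate the templated text from the new base model's own template:

    from transformers import AutoTokenizer

    tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")  # swap the base model here
    messages = [
        {"role": "system", "content": "You are a financial compliance assistant."},
        {"role": "user", "content": "Summarise the key changes in DORA art. 5."},
        {"role": "assistant", "content": "DORA Article 5 introduces..."},
    ]
    # The tokenizer owns the template; no hand-built <|im_start|> strings anywhere.
    text = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=False)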

    Loss masking — train on the answer, not the question

    Building an SFT dataset for regulated industries?

    Our French finance and regulatory corpus ships with instruction-response pairs already filtered, deduplicated, and AI Act art. 10 documented.

    In SFT, you want the model to learn to generate the assistant’s response, not to memorize the prompt. The mechanism is loss masking: during training, the loss is computed only on the tokens belonging to the assistant turn(s). System and user tokens are masked out.

    TRL’s SFTTrainer handles this automatically when using the chat format and supplying a completion_only_loss=True flag (or by passing a response template marker). When in doubt, sanity-check by printing one batch’s loss mask and confirming that only the assistant tokens are unmasked. Teams that skip this check sometimes end up training their model to also predict the user’s next question — a subtle failure mode that does not appear in eval loss but does appear in production.
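
    A minimal version of that sanity check, assuming a constructed trainer and its tokenizer are in scope and that the collator follows the Hugging Face convention of marking ignored positions with -100:

    # Pull one batch and decode only the tokens that contribute to the loss.
    batch = next(iter(trainer.get_train_dataloader()))
    labels = batch["labels"][0]
    kept = batch["input_ids"][0][labels != -100]
    print(tokenizer.decode(kept))   # should read as assistant text only, never the user prompt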

    Quality bar — what to remove, what to keep

    1 000 curated SFT examples beat 50 000 unfiltered ones. The 2023 LIMA paper is consensus, not opinion.

    The 2023 LIMA paper established that 1 000 carefully filtered SFT examples can outperform 50 000 unfiltered ones. Three years of replication studies have confirmed the principle. The filters that matter:

    • Length filter. Drop examples shorter than a sane minimum (10 tokens of assistant response) and longer than your training sequence length. Both extremes corrupt training.
    • Refusal filter. Drop or rewrite “I cannot help with that” responses inherited from upstream datasets, unless refusal behaviour is exactly what you want to fine-tune. Most fine-tunes inherit refusals accidentally, and the refusal behaviour then generalizes well beyond the cases it was written for.
    • Duplicate filter. Exact and near-duplicate examples bias the model. MinHash LSH at the example level catches obvious cases; embedding-cosine clustering at sentence level catches paraphrase duplicates. A sketch follows this list.
    • Quality filter. Either human-reviewed at small scale (under 5 000 examples) or LLM-as-judge scored at larger scale, with the bottom 10–30 % dropped. Score on accuracy, faithfulness to the input, and format compliance.
    • Topical balance. If your dataset is heavily skewed toward one task or one style (a common artefact of LLM-generated data), the model overfits to that style. Either rebalance by sub-sampling or use task-mixing during training.
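
    The duplicate filter, sketched with the datasketch library; the 0.8 threshold and word-level shingles are illustrative starting points rather than a tuned recommendation:

    from datasketch import MinHash, MinHashLSH

    def minhash(text, num_perm=128):
        m = MinHash(num_perm=num_perm)
        for token in set(text.lower().split()):   # word-level shingles, the simplest choice
            m.update(token.encode("utf-8"))
        return m

    lsh = MinHashLSH(threshold=0.8, num_perm=128)
    kept = []
    for i, response in enumerate(responses):      # responses: assistant texts, assumed loaded upstream
        m = minhash(response)
        if not lsh.query(m):                      # nothing similar kept so far
            lsh.insert(f"ex-{i}", m)
            kept.append(response)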

    Where SFT data comes from

    Four sources, with different trade-offs:

    • Open-source datasets. Tulu 3, Open-Orca, UltraChat, Dolma-derived, and the curated lists maintained by the community. Free, large, but generic — usually a starting layer, not a finishing layer.
    • Synthetic data from a larger model. Generate questions and responses with a stronger model (GPT-4-class, Claude, larger open-weights). Cheap, scalable, but inherits biases and refusals of the generator model. License terms vary — read carefully if you plan to ship commercially.
    • Human-curated, in-domain. Domain experts write and review examples that reflect actual user queries. Slowest, most expensive, and the best quality. The default for vertical fine-tunes that need to perform in production.
    • Logs from a deployed system. Real user queries paired with reviewed model outputs (or corrected outputs). Highest signal because it matches production distribution. Requires consent, privacy review, and a redaction pipeline.

    A workable mix for a vertical fine-tune is 30 % open-source generic, 30 % synthetic in-domain, 30 % human-curated in-domain, 10 % production logs once available. Re-balance once you have real eval scores.

    Storage format and versioning

    JSONL is the lingua franca for SFT datasets. It is line-oriented, streamable, diff-friendly, and supported by every training library. For datasets above a few hundred megabytes, Parquet with a defined schema offers better compression and faster random access, at the cost of less convenient inspection. The choice between them depends on size and the rest of your pipeline; we cover it in detail in our article on dataset formats.

    Whatever format you choose, version the dataset. An SFT dataset with no version identifier is a model card defect. The minimum versioning record:

    • A SHA-256 of the content (concatenated example hashes, sorted deterministically).
    • A short identifier (semver or date-based) included in the model card.
    • A pinned reference to the generation or curation pipeline (git commit or container digest).
    • A change log noting additions, removals, and quality-filter version bumps.
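
    A sketch of the first two items; the version string in the comment is illustrative:

    import hashlib, json

    def dataset_hash(examples):
        per_example = sorted(
            hashlib.sha256(
                json.dumps(ex, sort_keys=True, ensure_ascii=False).encode("utf-8")
            ).hexdigest()
            for ex in examples
        )
        return hashlib.sha256("".join(per_example).encode("utf-8")).hexdigest()

    # Model card entry, e.g.: dataset "sft-fr-finance v2026.05", sha256 = dataset_hash(examples)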

    Putting it together — a one-page SFT dataset checklist

    • Format chosen: chat / instruct / pre-templated. Reason documented.
    • Chat template applied with the model’s official tokenizer.
    • Loss mask verified on a sample batch.
    • Filters applied: length, refusal, duplicate, quality, balance — each with measured drop rate.
    • Train/validation split: stratified by source, 5–10 % held out.
    • Provenance per example: source, license, generation method, capture date.
    • Dataset hash + version identifier recorded in the model card.
    • Retraction procedure: how a specific example or a class of examples is removed and the dataset re-hashed.

    A team that crosses every item off this list ships a fine-tune that works in production and that survives the documentation audit a regulated customer will eventually request. A team that skips half of them ships a fine-tune that demos well on the team’s laptop and then quietly fails the moment the data distribution shifts.

    For a fuller view of how an SFT dataset connects to the rest of the fine-tuning workflow, see our companion guide on training an LLM on your own data. For the regulatory layer that determines whether your fine-tuned model can be deployed in finance, healthcare, or law enforcement under the EU AI Act, see our article on AI Act Article 10 documentation.

    See also: SFT vs DPO vs RLHF dataset shapes and the French legal NLP landscape.

    Frequently asked questions

    Chat format or instruct format?

    Chat (messages array) for any product that ships as an assistant. Instruct (prompt/response) only for single-turn task-specific work where multi-turn doesn’t apply. Pre-templated text only when you need full tokenizer control.

    How do I avoid loss-masking bugs?

    Print one batch’s loss mask early in the run and verify only assistant tokens are unmasked. TRL’s SFTTrainer handles this automatically with the chat format and completion_only_loss=True.

    How many examples do I need?

    1 000–10 000 carefully filtered examples is the sweet spot. Above 100 000, diminishing returns set in: each additional 10 K adds 0.5–1 % on benchmarks at most.

    Is synthetic SFT data acceptable?

    Yes, mixed with human-curated data. Synthetic is cheap and scalable but inherits the generator model’s biases and refusals. Read license terms carefully if you ship commercially.

    How do I version an SFT dataset?

    SHA-256 of concatenated example hashes (sorted deterministically), plus a short version id and a pinned reference to the curation pipeline. Bumps every time examples are added, removed, or re-filtered.

    Need an SFT-ready corpus for finance or regulation?

    Our French regulatory and financial text corpus is delivered as Parquet shards with the structure SFT pipelines expect — pre-filtered, AI Act-documented, tier-licensed.



  • How to train an LLM on your own data — a practical 2026 guide.

    "Train an LLM on your own data" can mean six different things, and choosing the wrong one is the most expensive mistake teams make before writing a single line of code. This guide walks through the decision tree, data prep, the 2026 stack, and the governance layer that determines whether your fine-tune ships in a regulated industry.

    Key takeaways

    • Start with RAG; move to SFT only when RAG plateaus on task accuracy.
    • QLoRA on a 4-bit base model is the 2026 default — 70B fits on a single H100.
    • Data curation (dedup, quality filter, topical filter) matters more than hyperparameters.
    • 1 000 high-quality SFT examples beat 100 000 noisy ones — LIMA replicated three times over.
    • Ship with a provenance manifest — required for AI Act Article 10 compliance in the EU.

    “Train an LLM on your own data” can mean six different things, and choosing the wrong one is the most expensive mistake teams make before writing a single line of code. This guide walks through the decision tree, the data preparation steps that actually move the needle, the 2026 fine-tuning stack, and the governance layer that increasingly determines whether your fine-tuned model can be deployed in a regulated industry at all.

    Step 1 — Decide what “training on your own data” actually means

    Start with RAG. Move to SFT when RAG plateaus. Consider continued pre-training only when measurable domain-specific gaps remain.

    Four approaches sit under that phrase, with order-of-magnitude differences in cost, control, and lock-in:

    • Retrieval-Augmented Generation (RAG). The model is unchanged; your data is indexed in a vector store and injected into the prompt at query time. Cheapest, fastest to ship, easy to update. Best fit when answers must reflect documents that change weekly and when traceability to the source document is mandatory.
    • Supervised Fine-Tuning (SFT) with LoRA/QLoRA. A small set of adapter weights is trained on instruction–response pairs. Affordable (single GPU, hours rather than weeks), preserves the base model, and gives meaningful gains on domain-specific tasks. The default 2026 choice for most projects.
    • Continued pre-training. Resume the base model’s pre-training on a large in-domain corpus (10–100 B tokens). Useful when the base model has weak vocabulary or weak fluency in your domain. Expensive (multi-node, days to weeks of GPU time), and rarely the right first step.
    • From-scratch pre-training. Building a foundation model from the ground up. Justified only when no open-source base meets your latency, licensing, or sovereignty constraints. Budget for it in tens of thousands of GPU-hours and in legal review hours for the training corpus.

    Decision rule: start with RAG. Move to SFT when RAG plateaus on task accuracy. Consider continued pre-training only when you have measurable, domain-specific tokenization or fluency gaps that SFT cannot fix.

    Step 2 — Curate the data before you format it

    Quality > quantity

    For SFT specifically, 1 000 high-quality instruction-response pairs beat 100 000 low-quality ones. The 2023 LIMA paper plus three years of replication studies have made this consensus, not opinion.

    The most underrated step. Most “fine-tuned LLM” projects fail not at the training run, but at data quality. Three filters to apply before any formatting:

    • Deduplication. Exact and near-duplicate examples inflate training time and bias the model toward overrepresented patterns. Use MinHash LSH or SimHash to detect near-duplicates at the document and paragraph level. Expect 10 % to 40 % reduction in raw corpora.
    • Quality scoring. Length filters, language-ID filters, perplexity filters from a small reference model, and an LLM-as-judge pass on a representative sample. The aim is to remove machine-generated boilerplate, broken extractions, and out-of-scope content.
    • Topical filtering. If you are building a vertical model, keep only documents that match the target domain. A 100 K-document corpus where 80 % are on-topic outperforms a 500 K-document corpus where 30 % are on-topic — every time.

    For SFT specifically, the working rule is: 1 000 high-quality instruction–response pairs beat 100 000 low-quality ones. The 2023 LIMA paper and three years of replication studies have made this consensus, not opinion.

    Step 3 — Format the data for your training method

    The format that wins depends on the training stage:

    • For RAG: chunked text with embeddings. Chunk size 200–800 tokens, with 10–20 % overlap. Store source URL, chunk index, and an immutable document hash next to each chunk — the hash becomes your provenance anchor. A chunking sketch follows this list.
    • For SFT: JSONL with one example per line. Schema typically {"messages": [{"role": "system", ...}, {"role": "user", ...}, {"role": "assistant", ...}]}. Apply the base model’s chat template (Llama 3, Qwen 2.5, Mistral all have published templates) to convert messages to tokenized strings.
    • For continued pre-training: Parquet shards of raw text with a stable schema. Each row carries the text, the source identifier, the license, and a hash. Tokenize on the fly during training, do not pre-tokenize to disk — tokenizer changes invalidate the cache.
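
    A minimal chunking sketch for the RAG case, assuming any Hugging Face tokenizer; the 512-token window and 64-token overlap are illustrative values inside the ranges above:

    import hashlib

    def chunk_document(text, tokenizer, size=512, overlap=64):
        doc_hash = hashlib.sha256(text.encode("utf-8")).hexdigest()   # the provenance anchor
        ids = tokenizer.encode(text, add_special_tokens=False)
        chunks = []
        for i, start in enumerate(range(0, len(ids), size - overlap)):
            chunks.append({
                "chunk_index": i,
                "text": tokenizer.decode(ids[start:start + size]),
                "doc_hash": f"sha256:{doc_hash}",
            })
        return chunks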

    If you are unsure which file format to use for storage and downstream pipelines, the trade-offs between Parquet, JSONL, and Arrow are covered in our companion article on dataset formats.

    Step 4 — Choose the parameter-efficient method that fits your budget

    Full fine-tuning of a 7B model requires roughly 70–110 GB of GPU memory. Most teams cannot or should not provision that. Three parameter-efficient alternatives dominate in 2026:

    • LoRA — Low-Rank Adaptation. Trains rank-8 to rank-64 update matrices on selected layers. Memory footprint drops by 3–5×; quality is within 1–2 % of full fine-tuning on most benchmarks. The default.
    • QLoRA — LoRA on a 4-bit quantized base model. Memory drops by 10–15× compared to full fine-tuning. A 7B model fine-tunes on a single 24 GB consumer GPU; a 13B fits on 32 GB; a 70B fits on a single H100. Slightly slower per step than LoRA, but most teams accept the trade-off.
    • DoRA, NEFTune, ReFT — newer variants that add small accuracy gains in specific configurations. Worth benchmarking once your baseline LoRA pipeline works, not before.

    Rule of thumb: start with QLoRA, rank 16, on the attention projection layers. Tune from there only if your eval set shows specific weaknesses.
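
    That rule of thumb as a sketch with transformers, peft, and bitsandbytes; the base model name and the attention-projection module names follow Llama conventions and are illustrative:

    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig
    from peft import LoraConfig, get_peft_model

    bnb = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    )
    model = AutoModelForCausalLM.from_pretrained(
        "meta-llama/Llama-3.1-8B", quantization_config=bnb, device_map="auto"
    )
    lora = LoraConfig(
        r=16, lora_alpha=32, lora_dropout=0.05,
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],   # attention projections
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(model, lora)
    model.print_trainable_parameters()   # typically well under 1 % of the base model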

    Step 5 — The 2026 fine-tuning stack

    Building a fine-tune for finance, regulation, or compliance?

    We publish French regulatory and financial training corpora — pre-filtered, deduplicated, AI Act art. 10 ready, with per-document provenance.

    The toolchain has consolidated significantly. A minimal, production-grade SFT pipeline in May 2026 uses:

    • Base layer: Python 3.11+, PyTorch 2.5+, CUDA 12.x.
    • Hugging Face ecosystem: transformers for model loading, datasets for streaming, peft for LoRA adapters, trl for SFT trainers and DPO, accelerate for multi-GPU.
    • Throughput optimization: Unsloth provides patched kernels that reduce memory by 30–50 % and accelerate training by 2–3× on supported architectures. Axolotl offers a higher-level YAML config wrapper if you prefer config over code.
    • Quantization: bitsandbytes for 4-bit and 8-bit, plus the native bf16 and fp16 paths.
    • Inference and packaging: vLLM for serving, llama.cpp or MLX for on-device deployment, GGUF or ONNX for portable model artefacts.

    If you are starting from zero, Unsloth’s documented notebooks are the fastest path to a working baseline. If you need fine-grained control or are running on non-NVIDIA hardware (AMD, Intel Gaudi, Apple Silicon), drop down to transformers + peft + trl directly.

    Step 6 — Training run essentials

    Six hyperparameters and decisions that disproportionately affect outcomes:

    1. Learning rate. For LoRA: 1e-4 to 3e-4. For full fine-tuning: 1e-5 to 5e-5. Use a linear warm-up over 3–10 % of steps, then cosine decay.
    2. Batch size. Effective batch size (per-device × gradient accumulation × num devices) of 32–128 is a workable starting range for SFT. Much smaller batches make updates noisy; much larger ones dilute the signal from a small, curated dataset.
    3. Epochs. 1 to 3 epochs for SFT on a curated dataset of a few thousand examples. More epochs typically degrade base model behaviour through catastrophic forgetting.
    4. Sequence length. Set to the longest reasonable example, not the longest possible one. Padding inflates memory and slows training. Sort examples by length and use packing (multiple short examples per training sequence) when supported.
    5. Validation set. Hold out 5–10 % of examples, stratified by source and topic. Compute eval loss every 50–200 steps. Stop training when eval loss plateaus or rises.
    6. Checkpointing. Save every N steps and at the end of each epoch. Keep the last three checkpoints. The cheapest insurance against an aborted run.
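
    The six decisions mapped onto Hugging Face TrainingArguments as a sketch, using mid-range values from above; argument names occasionally shift between transformers releases, so check the version you run:

    from transformers import TrainingArguments

    args = TrainingArguments(
        output_dir="out/sft-run",
        learning_rate=2e-4,                     # LoRA range 1e-4 to 3e-4
        warmup_ratio=0.05,                      # 3-10 % warm-up, then cosine decay
        lr_scheduler_type="cosine",
        per_device_train_batch_size=4,
        gradient_accumulation_steps=16,         # effective batch of 64 on one GPU
        num_train_epochs=2,
        eval_strategy="steps", eval_steps=100,  # "evaluation_strategy" on older releases
        save_steps=200, save_total_limit=3,     # keep the last three checkpoints
        bf16=True,
    )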

    Step 7 — Evaluate beyond loss

    Training loss tells you the model learned something. It does not tell you whether the model is better than the base model on tasks you care about. Three evaluation layers in increasing order of cost:

    • Automatic benchmarks. A small task-specific test set with exact-match or F1 scoring, plus a general benchmark (MMLU-style, language-specific) to detect regression on out-of-domain capabilities. Cheap, fast, but blind to fluency and style.
    • LLM-as-judge. A larger model (or the same model with a structured rubric) scores outputs on a held-out test set across 3–5 axes (accuracy, helpfulness, faithfulness, format, safety). Useful but not reliable enough to ship without human spot-check.
    • Human evaluation. 100–300 examples reviewed by domain experts, comparing base vs. fine-tuned outputs side-by-side. Expensive, slow, and the only signal that genuinely correlates with downstream user satisfaction. Reserve for go/no-go decisions.

    Track all three over time. A fine-tune that gains five points on the task benchmark but loses ten on the general benchmark is usually a net regression — your fine-tune has acquired domain skill at the expense of generality.

    Step 8 — Deploy with the governance layer in place

    Fine-tuned models deployed in 2026 in the European Union for high-risk use cases (finance, healthcare, employment, law enforcement, critical infrastructure) fall under the EU AI Act’s Article 10 obligations. The same goes for the training data that produced them. Concretely, ship with:

    • Training data provenance manifest. A per-document record of source URL, capture date, license, processing chain, and content hash. Stored next to the model artefact, not in a separate spreadsheet.
    • Versioning. Each training run produces a model card with the dataset hash, base model identifier, hyperparameters, and evaluation scores. Re-training without re-versioning is a compliance defect.
    • Retraction procedure. A documented path for removing a specific document (or a class of documents — e.g. a data subject’s personal data) and either re-training or accepting a measured drift. GDPR Article 17 makes this concrete; AI Act Article 10 makes it auditable.

    Teams that bolt this layer on after deployment spend two to four months retrofitting. Teams that add it during data prep spend two to four days. The pattern transfers directly across modalities, including the regulated-text corpora we build for our own customers.

    When fine-tuning is not the answer

    Three diagnostic questions before you start a training run:

    1. Has a clean RAG baseline been measured on the same evaluation set? If not, build it first. Fine-tuning to fix a problem RAG would have solved is the most common waste.
    2. Does the base model already know your domain vocabulary? Run a short qualitative probe — ask it 20 domain questions. If responses are coherent but bland, SFT will help. If responses are hallucinatory or wrong on basics, you may need continued pre-training or a different base model.
    3. Is the data you would train on the right shape? For SFT you need instructions and responses, not documents. Converting unstructured documents into instruction–response pairs is a sub-project in itself.

    Bottom line

    The 2026 path to a useful domain-specific LLM is rarely from-scratch training. It is a curated dataset, a QLoRA fine-tune on a sensible open base model, an evaluation harness that catches regressions early, and a governance layer that lets the result be deployed in places that matter. The bottleneck is almost always the data, not the model.

    If the data you need is French-language financial, regulatory, or economic text — codes, doctrine, EU regulations, prudential positions — we build, version, and license that corpus with the governance layer described above already in place. The methodology we use is the same methodology described in our writing on training datasets and AI Act compliance.

    See also: the best public LLM datasets in 2026, what makes a corpus retrieval-friendly, and SFT vs DPO vs RLHF dataset shapes.

    Frequently asked questions

    Should I fine-tune or use RAG?

    Use RAG when answers must reflect documents that change frequently and when traceability to source is mandatory. Use SFT when you need to teach a specific output format, persona, or task behaviour. The best production systems usually combine both.

    How much data do I need to fine-tune?

    For SFT, 1 000–10 000 carefully curated instruction-response pairs is the sweet spot. For continued pre-training, 1–10 B in-domain tokens. See our training data size article for stage-by-stage volumes.

    Can I fine-tune on a single GPU?

    Yes, with QLoRA. A 7B fits on 24 GB; 13B on 32 GB; 70B on a single H100. Throughput is lower than multi-GPU but the budget gap is two orders of magnitude.

    Is my fine-tuned model subject to the EU AI Act?

    If the model is placed on the market or put into service in the EU, or its output is used in the EU, for any Annex III high-risk use case (finance, healthcare, employment, biometrics, law enforcement, justice, critical infrastructure), then yes. Article 10 obligations apply from 2 August 2026.

    What is the cheapest evaluation strategy?

    Hold out 200–500 examples scored on a task-specific F1 metric, plus a small general benchmark to detect regression. Run after every training change. Reserve human eval for go/no-go decisions.

    Need a vertical training corpus for finance or regulation?

    We build, version, and license French regulatory, financial, and economic text corpora — AI Act art. 10 ready, per-document provenance, tiered licensing for sample / standard / premium / enterprise.



  • How to create a training dataset for object detection.

    Object detection models are not bottlenecked by architecture anymore. Modern detectors converge on similar accuracy when given the same training data — the real bottleneck is the dataset, and the eight decisions you make before writing a line of training code.

    Key takeaways

    • Write a scope statement before you collect a single image — five answers, one page.
    • Source 70 % in-domain real data, 20 % public datasets, 10 % synthetic for rare classes.
    • Bounding boxes for 80 % of use cases; segmentation, keypoints, or oriented boxes only when justified.
    • Inter-annotator agreement above 0.85 IoU, gold-set audit on every annotator.
    • Split on capture sessions, not on frames. Augment within the operating envelope only.

    Object detection models are not bottlenecked by architecture choices anymore. Modern detectors — YOLO, DETR, RT-DETR, Grounding DINO — converge on similar accuracy when given the same training data. The real bottleneck is the dataset: how it is scoped, sourced, annotated, audited, and governed. A team can spend three months tuning a YOLOv8 backbone and gain two mAP points; the same three months invested in dataset quality often yields ten.

    This guide walks through the eight decisions that determine whether your training dataset will produce a deployable model or a stack of weights that fails the moment it leaves your test set. The emphasis is on practical engineering: what to write down, what to measure, where regulators will look.

    Step 1 — Define the detection scope before you collect a single image

    A dataset is a software artefact: scope statement, versioned source, audit trail, quality metric, maintenance schedule.

    Most failed object detection projects can be traced to an ambiguous scope statement written on day one. “Detect vehicles” is not a scope. “Detect cars, trucks, motorcycles, bicycles, and pedestrians in urban traffic scenes captured from a fixed CCTV pole between 2 m and 50 m distance, daytime conditions, no rain” is a scope.

    The scope statement must answer five questions in writing, ideally co-signed by the engineering lead and the business stakeholder:

    • Classes: exact list, with disambiguation rules. Is a van a car or a truck? Is a stroller a pedestrian?
    • Operating envelope: camera type, mounting height, viewing angle, distance range, lighting conditions, weather, time of day.
    • Edge classes: what is explicitly out of scope (background objects you should not detect, occlusion thresholds, minimum object size in pixels).
    • Acceptance metric: mAP at IoU 0.5, recall at fixed precision, latency budget. Pick one primary and at most two secondary.
    • Deployment surface: edge device with 4 GB RAM and INT8 quantization, or a beefy server with FP16. This drives input resolution and model size, which drives annotation granularity.

    Write this as a one-page document. Re-read it before every dataset decision. When an annotator asks “should I label this partially occluded bicycle?”, the answer is in the scope, not in a Slack thread.

    Step 2 — Source images that match the deployment distribution

    The single most common dataset mistake is over-representing easy examples. Publicly available image dumps (Open Images, COCO, web scrapes) skew toward well-lit, centered, unoccluded subjects. Your production stream does not.

    Three sourcing strategies, each with trade-offs:

    • In-the-wild collection from the deployment surface itself. The gold standard. Cameras in the actual location, capturing actual conditions. Slow, expensive, but the only source that matches the real distribution. Budget six to twelve weeks of collection before annotation starts.
    • Synthetic data from a 3D engine (Unreal, Unity, NVIDIA Omniverse). Useful for rare classes and dangerous scenarios (fire, accidents). The “domain gap” between synthetic and real is the silent killer — a model trained 100 % synthetic almost always fails in the real world. Use it as augmentation, not as the bulk.
    • Public dataset transfer. Start with a pre-trained backbone fine-tuned on a relevant public set, then fine-tune again on your in-domain data. Faster bootstrap, but inherits public-dataset biases (camera angles, geographic skew, class imbalance).

    A working ratio for industrial projects: 70 % in-domain real data, 20 % public dataset transfer, 10 % synthetic for rare classes. Adjust based on how exotic your deployment surface is.

    Step 3 — Choose the annotation primitive that matches the downstream task

    Bounding boxes are the default. They are fast to annotate (3 to 8 seconds per box for a trained annotator), supported by every detector, and sufficient for 80 % of business use cases. Use them unless you have a specific reason not to.

    Three cases where you need more:

    • Instance segmentation (polygon masks) — when objects overlap heavily, when you need pixel-accurate counts, or when downstream measurement of object area matters. Cost: 30 to 90 seconds per instance, three to ten times slower than boxes.
    • Keypoints — when pose matters (humans, robotic arms, animals). Adds skeletal structure on top of detection.
    • Oriented bounding boxes — when objects have a strong rotation axis (ships from satellite, books on a shelf, parking spots). Axis-aligned boxes waste capacity on background pixels and confuse non-max-suppression.

    The decision is not reversible at zero cost. Boxes annotated in pass one cannot be upgraded to polygons without re-annotating. Choose the maximum granularity you will plausibly need within the next 18 months, not the minimum that fits today’s model.

    Step 4 — Pick the right annotation tool, not the famous one

    The annotation tool market has three tiers:

    • Open-source, self-hosted: CVAT, Label Studio, Roboflow Annotate (community). Zero per-seat cost, full control over data. Choose if your dataset contains anything you cannot ship to a third party (regulated industries, IP-sensitive products, defense).
    • Managed SaaS: Roboflow (paid), Encord, V7 Labs, Supervisely. Faster onboarding, built-in QA workflows, automatic versioning. Pay per-image or per-seat. Choose if your bottleneck is annotation throughput, not data governance.
    • Full-service annotation vendors: Scale AI, Labelbox managed, Sama. They provide both tool and labour. Quality is variable and contractual — always benchmark against your own gold set before signing.

    Whichever tier you choose, three features are non-negotiable: project-level versioning so you can compare datasets across iterations, an export format you can parse (COCO JSON or YOLO TXT), and a permissions model that lets reviewers approve or reject individual annotations rather than entire batches.

    Step 5 — Write annotation guidelines that survive a thousand images

    Building a vision dataset for regulated industries?

    We apply the same governance pattern to high-stakes text and vision corpora — AI Act-ready, per-record provenance, tiered licensing.

    Annotation guidelines are a contract between you and your annotators. A weak guideline (“label all cars”) produces a dataset where annotator A draws boxes around partially visible cars and annotator B does not — a 5 to 15 mAP gap that no model architecture can fix.

    A workable guideline document has four sections:

    1. Class definitions with reference images. Three to five positive examples per class. Three to five negative or boundary examples. A car at 50 m looks different from a car at 5 m; both must appear in the reference set.
    2. Box-drawing rules. Include the wheels? Include the side mirrors? Tight or loose by some margin? Pick one and stick with it. The downstream model will learn whatever convention you choose, but only if it is consistent.
    3. Occlusion and truncation policy. Below what visible fraction is an object out of scope? Boxes for truncated objects: extend the box to where you estimate the full object would be, or limit to visible pixels?
    4. Difficult or rejected cases. A short list of images that should not be annotated at all (blurry, ambiguous, mislabeled in the scope). Annotators flag and skip.

    The guideline is a living document. Expect to revise it at least three times during a large annotation effort. Each revision triggers a re-audit of the data labelled before it.

    Step 6 — Build quality assurance into the pipeline, not after it

    Quality bar

    Reject the temptation to declare a dataset "done" before IAA and gold-set metrics stabilize. A 5 % systematic annotation error caps mAP at roughly 1 minus that rate, no matter how much compute you throw at training.

    Quality assurance for object detection annotations rests on three measurements taken continuously, not at the end:

    • Inter-annotator agreement (IAA). Have at least 5 % of images annotated by two annotators independently. Compute IoU-based agreement on matched boxes plus precision and recall on detected versus missed boxes. A workable target is IoU agreement above 0.85 and recall agreement above 0.95. An IoU sketch follows this list.
    • Gold-set audit. Maintain a fixed set of 100 to 500 expert-annotated reference images. Every annotator is scored against the gold set during onboarding and at random intervals. Annotators below a fixed accuracy threshold are retrained, not silently kept.
    • Model-in-the-loop review. Train an initial model on the first 10 to 20 % of data. Use it to predict annotations on the rest, and route disagreements (model says yes, annotator says no, or vice versa) to a human reviewer. This focuses expensive review time on the hard cases.
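
    A sketch of the IoU side of the IAA measurement; boxes are (x_min, y_min, x_max, y_max) tuples and the greedy best-match strategy is deliberately simple:

    def iou(a, b):
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
        union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
        return inter / union if union else 0.0

    def mean_matched_iou(boxes_a, boxes_b):
        # For each of annotator A's boxes, take the best-overlapping box from annotator B.
        scores = [max((iou(a, b) for b in boxes_b), default=0.0) for a in boxes_a]
        return sum(scores) / len(scores) if scores else 1.0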

    Reject the temptation to declare a dataset “done” before the IAA and gold-set metrics stabilize. A dataset with 5 % systematic annotation errors will cap your model’s mAP at roughly 1 minus that error rate, no matter how much compute you throw at training.

    Step 7 — Splits, augmentation, and class balance

    Three classical pitfalls remain alive in 2026:

    • Splitting on images instead of on capture sessions. If your camera ran for an hour and produced 36 000 frames, splitting them randomly into 80 / 10 / 10 means your validation set sees scenes nearly identical to training. Split on capture session, location, and time, not on individual frames. A split sketch follows this list.
    • Class imbalance compensation only through over-sampling. Over-sampling rare classes inflates training time without adding information. Use focal loss, class-weighted loss, or sample mining instead. Reserve over-sampling for extreme cases.
    • Augmentation that violates the operating envelope. If your deployment camera is always horizontal, do not flip vertically. If lighting is always daytime, do not synthesize nighttime variations. Augmentations should expand the dataset within the deployment distribution, not outside it.
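
    A session-level split sketch using scikit-learn; load_annotations is a hypothetical loader and the capture_session_id field name is illustrative:

    from sklearn.model_selection import GroupShuffleSplit

    frames = load_annotations("annotations.json")          # hypothetical loader, one record per frame
    sessions = [f["capture_session_id"] for f in frames]   # illustrative field name

    splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
    train_idx, holdout_idx = next(splitter.split(frames, groups=sessions))
    # Every frame from a given capture session lands on exactly one side of the split.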

    The split, the augmentation policy, and the loss function are dataset decisions, not training decisions. They belong to the same document as the scope statement.

    Step 8 — Governance, provenance, and the EU AI Act

    Since the EU AI Act entered into force in 2024 and its rules on high-risk systems came into application in 2026, training-data governance is no longer optional for systems deployed in the European Union. Object detection is high-risk in several declared use cases (biometric identification, employment screening, law enforcement, critical infrastructure).

    Concretely, an AI Act art. 10 compliant training dataset records for every image and every annotation:

    • Provenance — source URL, capture date, capture device, geographic origin, copyright owner, and the licence under which the image is used.
    • Processing chain — what filters, augmentations, or transformations were applied, with the version of the code that applied them.
    • Annotation lineage — who labelled what when, against which version of the guidelines, with which tool, reviewed by whom.
    • Representativeness statement — declared demographic, geographic, and temporal coverage, with known gaps explicitly listed.
    • Retraction procedure — how a data subject can request removal of their image and how the model is retrained or re-evaluated as a result.

    Most teams add this layer after the fact, painfully. The cheap version is to add it during annotation — a JSON-LD record per annotation, hashed and append-only, that lives next to the Parquet shards. The same approach works for any modality: text corpora, audio datasets, sensor logs. We use it for the French regulatory and financial text corpus we publish; the principles transfer directly to vision.

    After the first model: the continuous improvement loop

    A training dataset is not delivered, it is maintained. Three feedback loops keep it alive after the first model ships:

    • Production sampling. A small fraction of production inferences (1 to 5 %, randomly sampled) is logged with the image, the prediction, and the timestamp. Periodically audited by humans. Cases where the model errs become next-iteration training data.
    • Hard-case mining. Cases where the model is uncertain (low confidence, high entropy, ensemble disagreement) are routed to human review. They are usually worth ten random samples in the next iteration.
    • Drift monitoring. Compare the distribution of production inputs to the training distribution monthly. When drift exceeds a threshold (KL divergence, embedding-space distance), trigger a recollection campaign before performance degrades.
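
    One cheap drift signal, sketched: KL divergence between histograms of a scalar image statistic (mean brightness here, chosen only for illustration); training_sample and production_sample are assumed to be lists of image arrays already loaded:

    import numpy as np
    from scipy.stats import entropy

    def brightness_hist(images, bins=32):
        means = [float(np.asarray(img).mean()) for img in images]   # pixel values in [0, 255]
        hist, _ = np.histogram(means, bins=bins, range=(0, 255))
        return hist + 1e-9                                          # avoid empty bins

    # entropy(p, q) with two arguments is the KL divergence D(p || q)
    drift = entropy(brightness_hist(production_sample), brightness_hist(training_sample))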

    Bottom line

    A training dataset for object detection is a software artefact. It has a scope statement, a versioned source, an audit trail, a quality metric, and a maintenance schedule. Teams that treat it that way ship models that survive deployment. Teams that treat it as a one-shot folder of images annotated by interns ship models that need to be rebuilt every six months.

    The exact same engineering principles apply to non-vision datasets. We build and publish industrial-grade text corpora for regulated industries — the methodology described above, applied to French financial and regulatory text rather than images, is what produces a dataset a tier-1 bank or a regtech vendor can actually deploy. If you are building a high-stakes detection pipeline and want to compare notes, our team is reachable through the contact page.

    See also: the best public LLM datasets in 2026.

    Frequently asked questions

    Bounding boxes or segmentation?

    Bounding boxes for 80 % of use cases — they are fast to annotate (3–8 seconds per box) and supported by every detector. Use segmentation only when objects overlap heavily, when you need pixel counts, or when downstream measurement of object area matters.

    How big should my training set be?

    Object detection is more sensitive to representativeness than to raw volume. 2 000–5 000 well-annotated images on the deployment distribution outperform 50 000 generic images. Scale once you’ve measured the gap, not before.

    Do I need synthetic data?

    Useful for rare classes (fire, accidents, low-incidence defects) where real data is impossible or unethical to collect. Treat it as augmentation, not as the bulk. A model trained 100 % synthetic almost always fails in real deployment.

    How do I avoid overfitting to one camera?

    Vary capture conditions during collection (lighting, angle, time of day) and split your data on capture sessions, not on individual frames. Frame-level random splits leak distribution and inflate validation scores.

    Does the EU AI Act apply to object detection systems?

    Yes, when deployed for Annex III high-risk use cases — biometric identification, law enforcement, employment screening, critical infrastructure safety. Article 10 obligations on training data documentation apply from 2 August 2026.

    Need governance-ready training data for a regulated AI system?

    We build vertical corpora with AI Act art. 10 attestation, per-record provenance, and tiered licensing — currently published in French finance, regulation, and economic text.

