
EU AI Act Article 10 — what training data documentation actually requires.

Article 10 of the EU AI Act is the part that translates "the AI must be trustworthy" into engineering work product. It applies to every high-risk AI system placed on the EU market, and from 2 August 2026 it makes data governance a documentation deliverable, not an internal posture.

Key takeaways

  • Article 10 applies to training, validation, AND test sets — not just training.
  • Eight documented governance practices required: design choices, data origin, preparation, assumptions, suitability, bias examination, mitigation, gap analysis.
  • Dataset Specification Document per dataset — 20–50 pages for a substantial LLM corpus.
  • Enforcement starts 2 August 2026. Fines reach €15 M or 3 % of worldwide annual turnover.
  • GDPR compliance is not enough. Article 10 is a separate, overlapping obligation.

Article 10’s reach is broad: it covers every high-risk AI system placed on the EU market or whose output is used by a person in the EU. The text itself is short, under 1 000 words. The implementation depth is multiple months of work for any team starting from a blank page. This guide walks through what Article 10 actually says, what each requirement means concretely for a training dataset, the enforcement timeline that matters in 2026, and the documentation artefacts you need to produce.

What “high-risk” actually covers

Article 10’s obligations only kick in for high-risk systems. Annex III of the Act defines the categories. The ones most likely to matter for an LLM-based product:

  • Biometric identification and categorisation — any model that infers identity, age, gender, or emotional state from a face, voice, or behavioural signal.
  • Employment and worker management — recruitment screening, CV filtering, performance evaluation, task allocation, termination decisions.
  • Access to essential services — credit scoring, insurance pricing, eligibility for public benefits, emergency services dispatch.
  • Law enforcement — risk assessment, evidence evaluation, lie detection, profiling.
  • Migration, asylum, and border control — visa and asylum risk scoring, document verification, identity assessment.
  • Administration of justice and democratic processes — assistance to judicial decisions, voting and election influence detection.
  • Critical infrastructure — safety components in road traffic, water supply, gas, electricity, digital infrastructure.

An LLM-based product that touches any of these categories is high-risk by deployment, even if the underlying model is general-purpose. A general-purpose AI model marketed as “for HR screening” inherits the high-risk classification of the deployment use case.

Article 10, paragraph by paragraph

The Article has six paragraphs. The ones with concrete engineering implications:

Paragraph 1 — Data governance applies to all training, validation, and testing sets

The obligation is not limited to the training dataset. Validation and test sets must meet the same governance bar. In practice, this means the same provenance records, the same quality controls, the same bias audits — applied to every split, with documentation showing they were applied independently.
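What “applied to every split” can look like in engineering terms, as a minimal sketch (the checks, file layout, and source names are illustrative, not mandated by the Act):

    import hashlib
    import json
    from datetime import datetime, timezone
    from pathlib import Path

    def audit_split(name: str, records: list[dict]) -> dict:
        """Run one identical governance pass over a split; return an audit record.

        The two checks below are placeholders: substitute your real
        provenance, quality, and bias routines.
        """
        payload = json.dumps(records, sort_keys=True).encode("utf-8")
        return {
            "split": name,
            "n_records": len(records),
            "content_sha256": hashlib.sha256(payload).hexdigest(),
            "provenance_complete": all("source" in r for r in records),
            "audited_at": datetime.now(timezone.utc).isoformat(),
        }

    splits = {
        "train": [{"text": "…", "source": "legifrance"}],
        "validation": [{"text": "…", "source": "legifrance"}],
        "test": [{"text": "…", "source": "amf"}],
    }
    Path("governance").mkdir(exist_ok=True)
    for name, records in splits.items():
        audit = audit_split(name, records)
        Path(f"governance/{name}_audit.json").write_text(json.dumps(audit, indent=2))

The point is structural: one audit function, three independent runs, three persisted records whose timestamps and content hashes an auditor can later check against the training run.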

Paragraph 2 — Specific governance practices

The longest paragraph, listing eight required governance practices. Each must be documented:

  • Design choices — why was this dataset constructed in this way, what scope was excluded, what alternative sources were considered.
  • Data collection processes and origin — where each data point came from, how it was collected, under what legal basis if personal data was involved.
  • Data preparation operations — annotation, labelling, cleaning, enrichment, aggregation. The complete processing chain from raw source to training-ready format.
  • Formulation of relevant assumptions — what the data is assumed to measure or represent. Often the most overlooked requirement, and the one regulators will challenge first.
  • Assessment of availability, quantity, and suitability — does the dataset have enough representative examples for the deployment context.
  • Examination of bias — bias that could affect health and safety, bias that could lead to discrimination prohibited by EU law. Explicit, not implicit.
  • Appropriate measures to detect, prevent, and mitigate bias — what was done about the biases identified. Documentation of intent is not enough; documentation of action is required.
  • Identification of data gaps and shortcomings — what the dataset does not cover, and how that absence might affect downstream use.

Paragraph 3 — Quality criteria

Training, validation, and test sets must be relevant, sufficiently representative, to the best extent possible free of errors, and complete in view of the intended purpose, with appropriate statistical properties. Five criteria, each requiring measurable evidence.

“Relevant” and “complete” are interpreted relative to the intended purpose declared in the technical documentation (Article 11). This means the intended purpose statement is no longer a marketing field — it becomes the yardstick against which dataset adequacy is judged.
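Turning two of these criteria into numbers can be as simple as the sketch below. The Act prescribes no particular metric, so exact-hash deduplication and a manually audited random sample are assumptions here, not requirements:

    import hashlib

    def duplicate_rate(texts: list[str]) -> float:
        """Share of exact duplicates: one measurable input to 'free of errors'.
        Exact hashing is the simplest choice; near-duplicate detection is stricter.
        """
        digests = [hashlib.sha256(t.encode("utf-8")).hexdigest() for t in texts]
        return 1.0 - len(set(digests)) / len(digests)

    def estimated_error_rate(audit_outcomes: list[bool]) -> float:
        """Error rate over a manually audited random sample;
        audit_outcomes[i] is True when the i-th sampled record failed review.
        """
        return sum(audit_outcomes) / len(audit_outcomes)

    corpus = ["Arrêté du 12 mars…", "Arrêté du 12 mars…", "Décision AMF n° 2024-XX…"]
    print(f"duplicate rate: {duplicate_rate(corpus):.3f}")                             # 0.333
    print(f"error rate:     {estimated_error_rate([False, True, False, False]):.3f}")  # 0.250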

Paragraph 4 — Contextual specificity

Datasets must reflect the geographical, contextual, behavioural, and functional setting where the system will be used. A credit-scoring model trained on Italian data and deployed in France triggers a documented assessment of cross-context transfer risk. A face-recognition model trained on a US-centric dataset and deployed in Northern Europe triggers the same.
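A minimal way to make that assessment auditable: measure how much of the training data actually comes from the deployment context, and let a low share trigger the documented transfer-risk assessment. The field name, threshold, and distribution below are illustrative:

    from collections import Counter

    def context_coverage(records: list[dict], deployment_geo: str) -> float:
        """Share of training records matching the deployment geography.
        Field name and threshold are this team's choices, not Act requirements."""
        counts = Counter(r["geography"] for r in records)
        return counts[deployment_geo] / len(records)

    # Mirrors the credit-scoring example: mostly Italian data, French deployment.
    records = [{"geography": "IT"}] * 90 + [{"geography": "FR"}] * 10
    coverage = context_coverage(records, deployment_geo="FR")
    if coverage < 0.30:  # illustrative threshold
        print(f"FR coverage {coverage:.0%}: document a cross-context transfer assessment")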

Paragraph 5 — Special categories of personal data

When the dataset includes special categories of personal data (race, ethnicity, health, religion, sexual orientation), the provider may process them only for the purpose of bias detection and correction, only when strictly necessary, only with appropriate safeguards, and never for downstream prediction. This paragraph creates an explicit, narrow GDPR carve-out specifically for fairness work, with documentation requirements that exceed normal GDPR processing records.

The documentation artefact — Dataset Specification Document


Article 10 does not specify a document name, but Article 11 plus Annex IV together amount to requiring a Dataset Specification Document per dataset. The minimum content, distilled from the regulatory text plus the Commission’s August 2025 implementation guidance:

  1. Dataset identifier — name, version, content hash, date of construction.
  2. Intended purpose mapping — explicit link to the high-risk system’s Article 11 intended-purpose declaration.
  3. Composition — sources, volumes per source, license per source, time range, geographic coverage.
  4. Collection methodology — how each source was obtained, scraped, licensed, or annotated. Legal basis for any personal data processing.
  5. Preparation chain — every transformation from raw to final, with versioned scripts or pipelines.
  6. Assumptions and exclusions — what the dataset assumes about its content; what is deliberately excluded.
  7. Quality metrics — error rate, deduplication rate, language-detection accuracy, sampling adequacy. Numbers, not descriptions.
  8. Bias analysis — protected attributes examined, methodology, results, mitigation actions taken.
  9. Gap analysis — known limitations, populations or contexts underrepresented, downstream risks.
  10. Retention and retraction — how long the dataset is kept, how a data subject can request removal, how the system is re-evaluated after removal.

For an LLM training corpus of any substantial size, this document runs 20 to 50 pages. Producing it three months after training is finished is an order of magnitude more expensive than producing it during training. Build the layer in early.
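A machine-readable skeleton of the ten sections makes a workable first draft. Everything below is a placeholder: the Act mandates the information, not this or any particular schema:

    import json
    from datetime import date

    spec = {
        "1_dataset_identifier": {
            "name": "<dataset name>",
            "version": "2026.1",
            "content_sha256": "<hash of the frozen dataset artefact>",
            "constructed": date(2026, 1, 15).isoformat(),
        },
        "2_intended_purpose_mapping": "<link to the system's Article 11 declaration>",
        "3_composition": {
            "sources": [{"name": "<source>", "license": "<license>", "volume": 0}],
            "time_range": "<coverage period>",
            "geography": ["FR"],
        },
        "4_collection_methodology": "<how each source was obtained; legal basis>",
        "5_preparation_chain": ["ingest@3.2", "dedup@1.4", "pii-scrub@2.0"],  # versioned steps
        "6_assumptions_and_exclusions": "<what the data is assumed to represent>",
        "7_quality_metrics": {"error_rate": None, "dedup_rate": None},  # numbers, not prose
        "8_bias_analysis": "<attributes examined, methodology, results, mitigations>",
        "9_gap_analysis": "<known limitations and downstream risks>",
        "10_retention_and_retraction": "<retention period; removal procedure; named owner>",
    }
    with open("dataset_spec.json", "w", encoding="utf-8") as f:
        json.dump(spec, f, indent=2, ensure_ascii=False)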

Enforcement timeline — what is binding when


The AI Act entered into force on 1 August 2024. Its provisions apply on a staggered schedule:

  • 2 February 2025: prohibitions (Article 5) on banned practices.
  • 2 August 2025: obligations for general-purpose AI models (Chapter V).
  • 2 August 2026: Article 10 and the rest of the high-risk obligations become applicable. Member State authorities can begin enforcement.
  • 2 August 2027: high-risk systems embedded in products subject to existing Union harmonisation legislation.

The 2 August 2026 date matters because teams operating high-risk systems in the EU today have less than three months left to become compliant. Penalties for Article 10 non-compliance reach the higher of €15 million or 3 % of worldwide annual turnover. Other Act violations can go up to €35 million or 7 %, but Article 10 sits in the 3 % tier.

Common misinterpretations


  • “Our model is hosted in the US, so the AI Act doesn’t apply.” Wrong. The Act applies extraterritorially. If the output is used by a person in the EU, the provider’s obligations attach.
  • “We don’t train on personal data, so Article 10 is irrelevant.” Wrong. Article 10 applies to all training data, personal or not, for any high-risk use case.
  • “GDPR compliance is enough.” Wrong. GDPR governs personal-data processing legality. Article 10 governs dataset quality, governance, and documentation for high-risk AI. Different obligations, overlapping but distinct.
  • “We can document the dataset retroactively before audit.” Possible, expensive, and increasingly suspicious. Authorities will compare documentation timestamps against training run timestamps; obvious retroactive compliance attempts are a red flag.
  • “Open-source datasets we use absolve us of responsibility.” Wrong. Whoever places the high-risk system on the market carries responsibility for the dataset used, regardless of upstream origin. A documented diligence check on the upstream provenance is required, not an automatic pass-through.

Practical implementation — minimum-viable Article 10 compliance

For a team starting from no formal data governance, the minimum work to reach demonstrable Article 10 compliance:

  1. Map your high-risk surface. Which of your deployed AI systems fall under Annex III? List them by name. If none do, Article 10 may not apply to you yet — but the Commission’s general-purpose AI guidance still does.
  2. Inventory your datasets. Every dataset that has touched a high-risk system, including upstream public datasets, scraped data, synthetic data, and user-derived data. Record source, license, date, content hash.
  3. Build a Dataset Specification Document per dataset. Use the 10-section template above. Start with the easy sections (composition, sources). The bias and gap analysis are harder and may require external help.
  4. Implement per-record provenance. A JSON-LD or similar audit trail attached to each training example, recording its source, license, processing chain, and ingestion timestamp. This becomes the evidence underpinning the dataset specification; a sketch follows after this list.
  5. Define the retraction procedure. Written. Tested. Owned by a named person. The hardest item on the list, and the one that will be requested first if a data subject ever invokes GDPR Article 17 against your training set.
  6. Annual review. Every Dataset Specification Document is reviewed and re-signed annually, even if the dataset has not changed. This produces the trail of ongoing oversight that Article 10 implicitly requires.
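For step 4, one possible shape for a per-record entry, loosely borrowing JSON-LD keys from the W3C PROV vocabulary. Article 10 does not mandate a serialization format, and the source URL, license identifier, and pipeline step names below are hypothetical:

    import hashlib
    import json
    from datetime import datetime, timezone

    def provenance_record(text: str, source_url: str, license_id: str,
                          pipeline_steps: list[str]) -> dict:
        """Build one audit-trail entry for one training example."""
        return {
            "@context": {"prov": "http://www.w3.org/ns/prov#"},
            "@type": "prov:Entity",
            "content_sha256": hashlib.sha256(text.encode("utf-8")).hexdigest(),
            "prov:wasDerivedFrom": source_url,      # where the record came from
            "license": license_id,                  # license at point of ingestion
            "prov:wasGeneratedBy": pipeline_steps,  # versioned processing chain
            "ingested_at": datetime.now(timezone.utc).isoformat(),
        }

    record = provenance_record(
        text="Texte d'exemple…",
        source_url="https://example.org/source-document",
        license_id="etalab-2.0",
        pipeline_steps=["ingest@3.2", "dedup@1.4", "pii-scrub@2.0"],
    )
    print(json.dumps(record, indent=2, ensure_ascii=False))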

Bottom line

Article 10 turns dataset governance from an internal practice into a regulatory deliverable. The enforcement date is 2 August 2026. The minimum work — Annex III mapping, dataset inventory, per-dataset specification documents, per-record provenance trail, retraction procedure — is multiple months for a team starting from zero. The teams that survive comfortably are the ones that built the layer during data preparation, not bolted on after deployment.

For a concrete pattern for the per-record provenance layer, see our companion guide on building an audit-ready provenance trail for training datasets. For the dataset construction discipline that makes Article 10 compliance practical at scale, see our writing on dataset formats and SFT dataset curation.

See also: GDPR pseudonymization for LLM training data, the AI Act Article 10 datasheet template, and the French legal NLP landscape.

Frequently asked questions

Does the EU AI Act apply if my model is hosted in the US?

Yes. The Act applies extraterritorially. If the output is used by a person in the EU, the provider’s obligations attach regardless of where the model is hosted.

Is GDPR compliance enough?

No. GDPR governs personal-data processing legality. Article 10 governs dataset quality, governance, and documentation for high-risk AI. Different obligations, overlapping but distinct.

What counts as a ‘high-risk’ system?

The Annex III categories: biometric identification, employment screening, access to essential services (credit, insurance), law enforcement, migration, judicial assistance, critical infrastructure safety. A general-purpose LLM becomes high-risk by deployment.

Can I use open-source datasets and let upstream carry responsibility?

No. Whoever places the high-risk system on the market carries responsibility for the dataset used, regardless of upstream origin. A documented diligence check on the upstream provenance is required.

What are the penalties?

Article 10 violations sit in the lower tier: €15 M or 3 % of worldwide annual turnover, whichever is higher. Other Act violations go up to €35 M or 7 %.

Need AI Act-ready training data?

Our French regulatory and financial corpus ships with Article 10 documentation in place — per-document provenance, retraction procedure, dataset specification document, attestation per release.

