Editorial · French Corpus LLM · regulatory & generative AI

Category: AI Act & Governance

EU AI Act, GDPR, and dataset governance for high-risk AI systems — what regulators expect, how to build audit-ready training data, provenance, lineage, and retraction procedures.

  • Datasheets for datasets — the AI Act Article 10 compliance template.

“Datasheets for datasets” was a 2018 academic proposal. In 2026 it is the closest thing AI Act Article 10 compliance has to a recognized documentation template. What the framework gets right, what it misses for general-purpose LLM corpora, and how to actually compile one a regulator can read.

    Key takeaways

    • Datasheets for datasets (Gebru et al. 2018) defines seven sections that map cleanly onto AI Act Article 10 requirements. Use them as your spine — auditors recognize the structure.
    • The framework predates LLM training corpora. You need to extend Composition and Collection Process to cover provenance per source, deduplication strategy, and pseudonymization. Don’t ship without those.
    • A datasheet is a static document but the dataset is not. Version the datasheet alongside the dataset, sign the PDF, and keep a changelog. “DSD v1.3, signed 2026-05-15” is more credible than a dated wiki page.
    • Composition and Distribution are where the AI Act Article 10 audit risk lives. Bias notes, licensing per source, and downstream restrictions belong there — not in a separate compliance document.
    • A signed PDF with a published SHA-256 is the cheapest forgery-resistance you can ship. PAdES PKCS#7 self-signed is enough at this stage; the value is the hash chain, not the certificate authority.

    What “datasheets for datasets” actually is

    Timnit Gebru, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna Wallach, Hal Daumé III, and Kate Crawford published “Datasheets for Datasets” in 2018. The proposal was simple: every dataset used to train a model should ship with a structured document that answers a fixed set of questions, the way an electronic component ships with a datasheet. The questions are organized into seven sections — Motivation, Composition, Collection Process, Preprocessing/Cleaning/Labeling, Uses, Distribution, and Maintenance.

    The framework was not a regulatory requirement at the time. It was a community proposal that gained traction because it formalized what good data-science practice looked like. Hugging Face adopted a variant for its Dataset Cards. Papers With Code embedded it. By 2023 it was the most cited piece of dataset documentation infrastructure outside the ML-fairness literature.

    Why it became the de-facto Article 10 template

Article 10 of the AI Act requires data governance and management practices “concerning training, validation and testing data sets,” including their relevance, representativeness, examination of biases, and identification of data gaps. The regulation does not prescribe a format. But when a national supervisor — or a 2026 EU Commission staff working paper — asks “show me your dataset documentation,” the closest thing to a recognized answer in regulated industries is a Gebru-style datasheet.

    WHAT REGULATORS CHECK FOR

    Per-source licensing, per-source counts, biases documented even if uncomfortable, a preprocessing audit trail, a maintenance plan. Those five points are where most 2026 datasheets fall down. A Gebru-style structure forces you to address each.

    The seven sections, mapped to AI Act obligations

| Datasheet section | AI Act Article 10 hook | What auditors check |
| --- | --- | --- |
| Motivation | Purpose statement, intended uses | Is the stated purpose plausible against the data? |
| Composition | Relevance, representativeness | Per-source counts, language distribution, time coverage |
| Collection Process | Data acquisition, consent posture | Licensing per source, scraping vs paid, opt-outs |
| Preprocessing | Examination of biases, gap analysis | Dedup, filter, pseudonymization, what was dropped |
| Uses | Restrictions on downstream uses | Acceptable use, forbidden use, foreseeable misuse |
| Distribution | Versioning, third-party sharing | Format, hash, signature, retention |
| Maintenance | Refresh cadence, correction process | Quarterly diff, errata process, contact |

    What the original framework misses for LLM corpora

    Gebru et al. wrote with classifier and tabular datasets in mind. Three gaps matter for LLM training data:

    • Token-level statistics. The 2018 schema asks about “instances” and “labels.” For an LLM corpus you also need token counts per source, average document length, and the distribution of document sizes. A single number for “total tokens” is not enough.
    • Cross-source deduplication. Composition assumes sources are independent. In an LLM corpus they routinely overlap — a parliamentary debate quoted in a court ruling quoted in a regulator’s decision. Document your dedup method (exact match, MinHash LSH, semantic) and the drop rate per source.
    • Pseudonymization scope. The original framework asks about sensitive data broadly. For GDPR/AI Act you need a specific section on personal-name pseudonymization: method, detector, mapping strategy, per-document counts.

    A datasheet that reads like marketing copy will not survive a regulator audit. The right tone is plain and quantitative: what we did, how many, with what known limits.

    A working template — section by section

    Below is the structure we ship on our own French finance and regulatory corpus, with the specific deltas from the 2018 schema marked. Adapt to your context — but if you cannot answer one of these sub-questions, that is a finding.

    • Motivation: purpose of the dataset, who funded it, intended use cases, who is excluded from the user list (e.g., consumer-facing apps), and why the dataset exists rather than reusing public alternatives.
• Composition: total documents, total characters, total tokens, language (with confidence threshold), per-source breakdown including counts and licensing, temporal coverage (oldest and newest document per source), and the document size distribution at the 25/50/75/95 percentile (a sketch of that computation follows this list).
    • Collection Process: data sources URL by URL, acquisition method (scraping, bulk archive, API, paid license), date range of acquisition, robots.txt compliance posture, and per-source HTTP/PDF/HTML notes.
    • Preprocessing: stage 1 normalization, stage 2 quality scoring methods, stage 3 deduplication methods + drop rates per source, stage 4 LLM-judge or rule-based scoring if any, and the pseudonymization layer (detector, mapping, counts).
    • Uses: recommended uses, foreseeable misuse, restrictions, the model alignment plan, and what the dataset should NOT be used for (e.g., predicting individuals).
    • Distribution: formats shipped (Parquet, JSONL, Snowflake share), file hashes, signatures, tier definitions if you split the corpus, retention policy.
    • Maintenance: refresh cadence, semantic changelog process, contact channel for errata, deprecation policy.
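If you want the percentile line in Composition to be reproducible, the computation is small enough to ship alongside the datasheet. A minimal sketch, assuming per-document token counts are already computed; the values below are illustrative:

import numpy as np

def size_distribution(token_counts: list[int]) -> dict[str, float]:
    # Document-size distribution for the Composition section.
    p25, p50, p75, p95 = np.percentile(token_counts, [25, 50, 75, 95])
    return {"p25": float(p25), "p50": float(p50), "p75": float(p75), "p95": float(p95)}

print(size_distribution([120, 480, 950, 2_300, 8_800, 410, 640]))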

    A signed Dataset Specification we ship every release

We publish a 54-page bundle (master DSD + Source Licenses + Glossary) with every release. PAdES PKCS#7 self-signed, RSA-4096, SHA-256 published. The format is open for you to copy.

    Signing — PAdES PKCS#7 and why it matters

    A datasheet is a static artifact. If you publish it as a PDF and someone modifies the PDF later, you have no way to detect the change. PAdES PKCS#7 self-signed embeds a cryptographic signature into the PDF structure. Verifiers see the signature, can extract the SHA-256, and can confirm whether the bytes match what you published.

    Self-signed certificates are the right starting point. The value is not chain-of-trust to a CA — it is integrity of the bytes against a hash you publish on your website. RSA-4096 is current practice. Use OpenSSL or open-source Python libraries like endesive or pyHanko. Total cost for a stable signing setup: an afternoon of work.
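A minimal signing sketch with pyHanko, assuming a self-signed RSA-4096 key pair already generated with OpenSSL; all file names here are illustrative, not a prescribed layout:

# Embed a PAdES signature in the datasheet PDF with pyHanko.
# Assumes key.pem / cert.pem exist, e.g. from:
#   openssl req -x509 -newkey rsa:4096 -nodes -days 1095 \
#     -keyout key.pem -out cert.pem -subj "/CN=Dataset Specification Signing"
import hashlib

from pyhanko.pdf_utils.incremental_writer import IncrementalPdfFileWriter
from pyhanko.sign import signers

signer = signers.SimpleSigner.load("key.pem", "cert.pem", key_passphrase=None)

with open("dsd_v1.3.pdf", "rb") as inf:
    out = signers.sign_pdf(
        IncrementalPdfFileWriter(inf),
        signers.PdfSignatureMetadata(field_name="DatasetSpecSignature"),
        signer=signer,
    )

signed = out.getvalue()
with open("dsd_v1.3_signed.pdf", "wb") as outf:
    outf.write(signed)

# Publish this hash next to the release notes.
print("SHA-256:", hashlib.sha256(signed).hexdigest())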

    Versioning — how to keep the datasheet fresh

    Treat the datasheet as code. Bump it on every dataset release. A v1.0 datasheet shipped with v1.0 of the dataset; v1.1 ships with the v1.1 dataset and a changelog row. The changelog should answer: what sources were added or dropped, what counts changed, what preprocessing was added, and what the new SHA-256 of the dataset bundle is.

    For LLM corpora that grow quarterly, a semantic changelog per release is worth more than a full diff. “DORA RTS published in March 2025 added 14 documents to eurlex_fr; no other source changed substantively.” That is the kind of summary that Enterprise buyers re-read at renewal time to justify their subscription.

    Common failure modes in 2026 datasheets

    From reviewing publicly shipped datasheets in late 2025 and early 2026, the recurring issues are predictable:

    • Aggregate-only statistics. “5 billion tokens, mostly English.” Auditors want per-source counts and the methodology by which you compute them.
    • Hand-waved licensing. A line like “all sources permit redistribution” is not enough. Each source gets a specific license name and a link to the license text.
    • Missing bias section. If you cannot articulate a bias in your corpus, you have not analyzed it. Every corpus has biases. Document the visible ones, and the ones you suspect but cannot quantify, and explain why you ship anyway.
    • No signature. An unsigned PDF on a marketing page is a starting point, not an audit artifact. Sign it.
    • Stale changelog. A v1.2.0 datasheet that mentions changes from v0.8 to v0.9 and stops there means the maintenance section is fiction.

    See a working datasheet in production

    Our 54-page Dataset Specification ships with every release. Signed PDF, SHA-256 published, per-source licenses, per-document audit trail. Open format you can copy.

    Frequently asked questions

    Is a datasheet required by the AI Act?

    Not by name. Article 10 requires “appropriate data governance and management practices” and the regulation references documentation in several places. A Gebru-style datasheet is the most efficient way to demonstrate compliance with the letter and spirit of those obligations. It is not the only acceptable format.

    How long should the datasheet be?

    Long enough to be specific, short enough to be readable. Our own bundle is 54 pages across three documents (master DSD, Source Licenses, Glossary) for a 14-source, 2-million-document corpus. A small classifier dataset might land in 8–15 pages.

    Can I publish the datasheet on a wiki instead of a PDF?

    You can, but you give up signature and version stability. A wiki page changes silently. The cheapest hybrid is to maintain the source in Markdown, publish HTML pages for discoverability and search, and generate a signed PDF for every release. The PDF is the authoritative artifact.

    What about confidentiality — what if some sources are under NDA?

    Aggregate the confidential sources under a single named category and document the aggregation policy. Better than not mentioning them. Worst case for a regulator: an undisclosed source surfaces during audit because the model leaks something traceable.

    How does this differ from a model card?

    Model cards describe a trained model — intended uses, evaluation, ethical considerations. Datasheets describe a dataset. They overlap on biases and intended uses but answer different questions. A complete documentation pack ships both, with cross-references.


    Keep reading

    Read next

    EU AI Act Article 10 — what training data documentation actually requires

    The article 10 obligations decoded for general-purpose AI providers, with the documentation deliverables a 2026 audit expects.

    Read next

    Building an audit-ready provenance trail for training datasets

    PROV-O modeling, per-document JSON-LD, signed manifests. The pieces of a trail an auditor will follow.

    Read next

    GDPR pseudonymization for LLM training data — patterns and pitfalls

    What pseudonymization means under Article 4(5), where regex falls short, and how to report it cleanly in a datasheet.

  • GDPR pseudonymization for LLM training data — patterns and pitfalls.

    Training data from regulator corpora — CNIL, ACPR, sanctioning bodies — is full of personal names. The right answer is not redaction, it is pseudonymization. Article 4(5) of the GDPR is specific about what counts, and the gap between “we replaced some names” and “this dataset is auditably safe” is wider than most teams realize.

    Key takeaways

    • Anonymization removes any chance of re-identification; pseudonymization keeps a controlled re-identification path. Under GDPR they are different regimes, with different downstream obligations.
    • Regex detection of civil titles (M., Mme, Maître, Dr.) finds about 70–90 % of named persons in French regulator text. The remaining gap is mostly bare surnames after first mention, and entity-name confusions.
    • Per-document mapping is the right default. Cross-document linkability is a feature for entity-linking tasks but a liability for GDPR scope. Most teams should keep numbering local to each document.
    • Companies, public bodies, and place names are not personal data under Article 4(1). Preserving them keeps the dataset useful for finance and regtech work. Mixing the two creates worse downstream signal.
    • AI Act Article 10 expects measurable data governance. “We pseudonymized X documents, with Y unique persons mapped to Z substitutions, with a documented detector and a known-limit list” is what a 2026 audit wants to see.

    Anonymization vs pseudonymization in GDPR Article 4

    The GDPR draws a hard line between the two. Anonymization means the data subject cannot be re-identified by any reasonably likely means, taking into account “all the means reasonably likely to be used,” including those of the controller. Anonymized data is outside the scope of GDPR. Pseudonymization, defined in Article 4(5), means the data can no longer be attributed to a specific person without additional information held separately. Pseudonymized data is still personal data — GDPR still applies, but you get lighter obligations on certain processing activities, including some research.

    Training data falls firmly in the pseudonymization regime when you replace names with stable aliases. The mapping itself is the “additional information” the GDPR talks about. If you keep that mapping, you keep the regime. If you destroy it after the pass, you push toward anonymization — provided the residual text cannot itself re-identify the subject through context.

    WHY IT MATTERS

    A LinkedIn-grade backgrounder on a public sanction decision is enough context to re-identify someone even if you strip their name. True anonymization is a high bar. Pseudonymization is what most LLM datasets can actually claim, and it is what the AI Act Article 10 documentation expects you to claim accurately.

    Why training data on regulator corpora needs this layer

    Open data from French regulators — CNIL deliberations, ACPR sanctions, Cour de cassation and Conseil d’État rulings — names individuals routinely. Compliance officers, directors, professionals being sanctioned, even rapporteurs. The text is public, so republishing it is legal. But training a model on it without a pseudonymization layer ships those names into the model weights and into anything the model generates.

Concrete numbers from a recent pass we ran on the full CNIL deliberation archive (8,126 substantive documents, 1979–2025): 92 % of documents contained at least one personal name. 20,460 unique persons were detected across the corpus. 21,750 individual substitutions applied. Less than 1 % of substantive deliberations escape the pass — admin closure letters, mostly. The proportions will be similar on any regulator corpus.

    Detection — what a title-based regex captures

    The simplest reliable detector is a regex anchored on civil titles: M., Mme, Madame, Monsieur, Maître, Me, Dr., Pr., MM., plus the English equivalents. After the title, match 1–4 capitalized tokens including French particles (de, du, des, d’). On French regulator text this lands roughly 70–90 % of person mentions in one pass.

    What gets missed: bare surname references after a first titled mention (“Dupont a indiqué…” without “M. Dupont”), fully lowercased OCR artefacts, and names that collide with common nouns (“Petit”, “Leblanc”, “Roy”). To recover the bare surname cases, do a second pass: for each detected person, look for the last token of their full name later in the document, and only substitute it if no other person in the same document shares that surname. This catches 5–10 additional substitutions per long document without introducing false matches.

    Regex is not the right tool for general NER. It is the right tool for the narrow “title + capitalized phrase” pattern in French legal text — high precision, well-bounded, auditable.
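A minimal sketch of that detector, with a simplified title list and particle handling relative to a production rule set:

import re

# Civil titles that anchor the match (illustrative subset).
TITLE = r"(?:M\.|MM\.|Mme|Madame|Monsieur|Maître|Me|Dr\.|Pr\.)"
# One capitalized token, optionally preceded by a French particle.
TOKEN = r"(?:(?:de|du|des|d’)\s+)?[A-ZÀ-Ý][\w’\-]+"
# Title followed by 1–4 such tokens.
PERSON = re.compile(rf"{TITLE}\s+{TOKEN}(?:\s+{TOKEN}){{0,3}}")

text = "Le rapporteur a entendu M. Jean de La Fontaine, puis Mme Dupont."
print([m.group(0) for m in PERSON.finditer(text)])
# ['M. Jean de La Fontaine', 'Mme Dupont']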

    Mapping — per-document vs cross-document

    Per-document mapping resets the alias counter at every document. Two appearances of [P1] in different documents may or may not be the same person — the corpus does not encode that information anywhere. Cross-document mapping uses a global counter, so [P42] always refers to the same person across the corpus.

    Per-document is the right default for compliance datasets. It removes the linkability graph that a cross-document mapping creates. It is also harder to back-identify — even with the aliased text plus public knowledge, you cannot stitch a profile across documents. The trade-off is that the corpus loses some signal for entity-linking and co-reference resolution tasks. For finance and regulatory training, that signal is rarely valuable enough to justify the GDPR risk.

| Mapping mode | Cross-doc linkability | Useful for | GDPR posture |
| --- | --- | --- | --- |
| Per-document | None | Compliance LLMs, general SFT | Strongest |
| Cross-document | Full | Entity linking, co-reference | Requires separate justification |
| Hybrid (per-source) | Within source only | Targeted research | Document the boundary |
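A minimal sketch of the per-document mode described above, assuming a detect_persons callable (such as the title-anchored detector sketched earlier) that returns bare name strings:

def pseudonymize(text: str, detect_persons) -> tuple[str, dict[str, str]]:
    # Counter resets per call, so [P1] in one document is unrelated
    # to [P1] in any other.
    mapping: dict[str, str] = {}
    for name in detect_persons(text):
        if name not in mapping:
            mapping[name] = f"[P{len(mapping) + 1}]"
    # Replace longest names first so "Jean Dupont" wins over bare "Dupont".
    for name in sorted(mapping, key=len, reverse=True):
        text = text.replace(name, mapping[name])
    return text, mapping  # mapping stored separately, access-controlled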

    What survives the pass (and why that is intentional)

    Companies, public bodies, and place names are not personal data under Article 4(1) of the GDPR. They identify legal persons or geographic entities, not natural persons. A pseudonymization pass on a regulator corpus should leave them alone. This is what keeps the dataset useful for downstream tasks like sanction analysis (“which banks were fined for AML failures?”) or jurisdictional studies (“how often does the Conseil d’État cite the CJEU?”).

    The non-obvious part is the title prefix. We keep M. and Mme in front of the alias: M. [P1] rather than [P1]. The reason is downstream learning. A model that sees the structural pattern can still learn that a civil title precedes a person reference, which matters for French legal style — court decisions and ACPR sanctions follow it strictly. Stripping the title flattens that signal.

    A pseudonymized FR finance + regulatory corpus, ready for SFT

    We ship a French finance, regulatory and economic LLM corpus with ACPR-pattern pseudonymization applied per document. Article 10 audit trail per row, signed Dataset Specification, quarterly refresh.

    Pitfalls — back-identification, OCR, partial overlap

    Three failure modes worth instrumenting against:

    • Back-identification via context. Even with the name replaced, a sanction decision often gives enough context (entity, role, date) to re-identify a specific person. Pseudonymization reduces, but does not eliminate, this risk. Acknowledge it.
    • OCR noise. Old PDFs use ligatures and spacing that break the regex. Run a normalization pass first (Unicode NFC, ligature decomposition, whitespace collapse). If you skip this, you ship a corpus where 10–30 % of names slip through in pre-2010 documents.
• Partial overlap with entities. “M. Dupont” replaced cleanly, but the document later says “la société Dupont SARL.” The bare surname match must not fire there. Anchor your bare-surname pass on word boundaries and exclude tokens followed by SA, SARL, SAS, SASU, etc.; see the sketch below.
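A minimal sketch of the guarded bare-surname pass; the suffix list is illustrative, not exhaustive:

import re

COMPANY_SUFFIX = r"(?:SASU|SARL|SAS|SA|SNC|SCI)\b"

def substitute_bare_surname(text: str, surname: str, alias: str) -> str:
    # Word-bounded surname, NOT followed by a company suffix.
    pattern = re.compile(rf"\b{re.escape(surname)}\b(?!\s+{COMPANY_SUFFIX})")
    return pattern.sub(alias, text)

text = "Dupont a indiqué que la société Dupont SARL contestait la décision."
print(substitute_bare_surname(text, "Dupont", "[P1]"))
# [P1] a indiqué que la société Dupont SARL contestait la décision.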

    Tooling — Presidio, spaCy, dedicated regex, hybrid

    Microsoft Presidio gives you a pre-built PII analyzer with French support. It uses a mix of NER and patterns and reports recognized entity types with confidence scores. Useful as a sanity check. spaCy + a French model (fr_core_news_lg or a specialized legal model) gives you free PERSON detection — strong recall but variable precision on legal-style text where many proper nouns are not persons.

    For French regulatory text specifically, a dedicated regex anchored on civil titles outperforms general NER in our experience: higher precision, fully auditable rules, trivially testable. The right architecture is hybrid — the regex pass first for high-precision substitution with civil titles, then a NER pass to flag potential misses for human review. The hybrid keeps the audit trail clean.

    Reporting — what to write in your dataset specification

    Article 10 of the AI Act expects measurable data governance. For pseudonymization, your dataset specification should record, at minimum:

    • The detector method (regex pattern, NER model + version, or hybrid).
    • The mapping strategy (per-document or cross-document).
    • Aggregate counts: documents touched, unique persons detected, total substitutions.
    • The known-limit list (cases the detector misses, with examples).
    • Whether companies, places, article references, and civil titles are preserved.
    • The pre-pseudo backup retention policy (separate access controls, retention period).

    On our own corpus, that section reads: “ACPR-pattern regex detector matching title + 1–4 capitalized tokens; per-document mapping; numbering resets between documents; companies and Article-level references preserved; pre-pseudo backup retained on a separate access-controlled volume.” That sentence is what an audit team actually wants. The reproducible counts back it up.

    A French regulatory corpus that already does this

    We extract, pseudonymize, score, and audit-trail French regulator and finance text. ACPR, CNIL, Cour de cassation, JORF, EUR-Lex. AI Act Article 10 signed Dataset Specification ships with every release.

    Frequently asked questions

    Is pseudonymized training data still personal data under GDPR?

    Yes. Article 4(5) makes that explicit. If a mapping exists that could re-identify the data subject, the data remains personal data. The mapping itself is the additional information GDPR refers to. You get lighter obligations on certain research-adjacent processing, but you do not exit GDPR scope unless you can claim true anonymization.

    How is this different from redaction?

    Redaction destroys the token: “[REDACTED].” Pseudonymization replaces it with a stable alias: “[P1].” A model trained on pseudonymized text still learns that a specific entity is referenced multiple times within a document; the alias preserves co-reference within the doc. Redaction destroys that signal. For LLM training, pseudonymization is almost always the right choice.

    Should I pseudonymize company names too?

    Only if you have a specific reason. Companies are not data subjects under Article 4(1), so the GDPR does not require it. Stripping them removes signal a regtech model needs — sanction prediction, AML pattern detection, comparative jurisprudence. Pseudonymize companies only when you have a legitimate competitive-intelligence concern, and document that scope choice.

    What about persons who waived their privacy through public statements?

    GDPR does not have a public-figure carve-out as broad as US privacy law. Article 9 has an exception for data manifestly made public by the data subject, but that applies to special-category data, and even there the courts read it narrowly. Default to pseudonymization, then carve out specific personas only with documented legal review.

    How do I verify the pass actually worked?

    Three checks. (1) Spot-check: pick 50 documents at random, read them, count any name that survived the pass. (2) Pattern audit: search the post-pseudo corpus for M. [A-Z] and similar — every hit is a survivor or a false negative to investigate. (3) Reverse map: for each detected alias, confirm the original name in the pre-pseudo backup and the count of substitutions. All three should be in your audit log.
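A minimal sketch of check (2), assuming a hypothetical corpus_pseudo/ directory of post-pass text files:

import re
from pathlib import Path

# Titled names that survived the pass. Aliases like "M. [P1]" start with "[",
# which the character class excludes, so every hit needs human review.
SURVIVOR = re.compile(r"\b(?:M\.|MM\.|Mme|Madame|Monsieur|Maître|Dr\.)\s+[A-ZÀ-Ý][\w’\-]+")

for doc in Path("corpus_pseudo").glob("*.txt"):
    for m in SURVIVOR.finditer(doc.read_text(encoding="utf-8")):
        print(f"{doc.name}: {m.group(0)}")  # log for the audit trail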


    Keep reading

    Read next

    EU AI Act Article 10 — what training data documentation actually requires

    Decoded Article 10 obligations for general-purpose AI providers and what they mean for your dataset documentation.

    Read next

    Building an audit-ready provenance trail for training datasets

    PROV-O modeling, per-document JSON-LD, signed manifests — the pieces of an audit trail regulators expect to see.

    Read next

    Choosing a dataset format — Parquet vs JSONL vs Arrow

    Format choices that affect ingestion speed, compression, schema enforcement, and downstream pipeline ergonomics.

  • Building an audit-ready provenance trail for training datasets.

    An audit-ready provenance trail is the artefact that lets someone reconstruct, for any record in your training dataset, four answers: where did this come from, how did we get it, what did we do to it, and is it still allowed to be here? Without that trail, a regulator’s request to remove a data subject’s content becomes a months-long forensic exercise.

    Key takeaways

    • Per-record JSON-LD provenance is now a hard requirement under the EU AI Act, GDPR Article 17, and copyright transparency rules.
• Use the W3C PROV-O vocabulary — regulators recognize it; they do not recognize in-house schemas.
    • Sidecar files (provenance separate from training data) win over embedded fields at scale.
• Hash-chain the shards. A signed manifest plus versioned object storage is forensically equivalent to a blockchain at 1 % of the cost.
    • Built during ingestion: 5–10 % of dataset effort. Retrofitted after deployment: 5–10× the original effort.

Built well, the trail turns a regulator’s removal request from a months-long forensic exercise into a one-day pipeline run. This guide walks through what an audit-ready provenance record contains, the W3C PROV-O vocabulary that has become the de facto standard, the implementation pattern that scales to billions of records, and the operational discipline that keeps the trail trustworthy.

    Why a provenance trail is now a hard requirement

    Built during ingestion: 5–10 % of dataset effort. Retrofitted under regulatory pressure: 5–10× the original dataset effort.

    Three regulatory pressures, all converging in 2026, have moved provenance from “nice to have” to “deploy-blocker”:

    • EU AI Act Article 10 requires documented data collection, preparation, and origin for every dataset used to train a high-risk system. Enforcement applies from 2 August 2026.
    • GDPR Article 17 gives data subjects a right to erasure, including from training datasets. A team that cannot identify which records came from which subject cannot honour the request without retraining from scratch.
    • Copyright clarity — EU Directive 2019/790 plus the AI Act’s transparency obligations for general-purpose AI providers require disclosure of the data sources used during training, with sufficient detail to allow rights-holders to verify or contest.

    The teams that come out of this period intact are the ones that built the layer during data preparation. The teams that retrofitted it after launch are the ones still negotiating timelines with their regulator a year in.

    Anatomy of an audit-ready provenance record

    For every individual training example — every row of your Parquet, every JSONL line, every annotation — the provenance record should answer the four questions. A minimal but production-grade record:

    {
      "record_id": "sha256:7f3c…a1b8",
      "source": {
        "name": "BOFiP-Impôts",
        "url": "https://bofip.impots.gouv.fr/bofip/12345-PGP.html",
        "license": "Licence Ouverte 2.0",
        "license_url": "https://www.etalab.gouv.fr/licence-ouverte-open-licence",
        "rights_holder": "DGFiP (Direction générale des Finances publiques)",
        "captured_at": "2026-05-08T14:22:31Z",
        "capture_method": "official-api"
      },
      "content_hash": "sha256:f2e8…9c44",
      "pipeline": {
        "extractor_version": "extractor-bofip@v1.4.2",
        "pipeline_commit": "git:c8380cc",
        "transformations": [
          "html_to_text",
          "language_detect",
          "minhash_dedup",
          "topical_filter"
        ]
      },
      "ai_act_declaration": {
        "intended_purpose": "high-risk-finance-llm-fine-tuning",
        "personal_data_present": false,
        "special_categories_present": false,
        "retraction_path": "DELETE /api/v1/records/{record_id}"
      },
      "ingested_at": "2026-05-13T09:14:02Z"
    }

    This is roughly 600 bytes per record uncompressed, ~150 bytes after gzip. For a 2 M-record corpus, the provenance sidecar weighs ~300 MB compressed — negligible relative to the training data itself.

    PROV-O — the vocabulary that aligns with regulator expectations

    The W3C PROV-O ontology is the standard vocabulary for representing provenance. It defines three core types:

    • Entity — a thing that exists. The raw HTML page, the cleaned text, the final Parquet record. Each is an Entity with its own URI.
    • Activity — a process that uses and produces Entities. Extraction, cleaning, deduplication, annotation are each Activities.
    • Agent — a person, organization, or piece of software responsible for an Activity. The DGFiP that published the original document, the extractor script, the annotator who reviewed the result.

    The PROV-JSON-LD serialization makes this representation machine-readable and Linked-Data-compliant in one format. It is the closest the field has to a regulator-recognized standard. Three reasons to use it:

1. An EU AI Act auditor will recognize PROV-O on sight. They will not recognize your in-house JSON schema.
2. PROV-O graphs can be queried with SPARQL or any RDF tool, which makes “show me every record affected by removing this source” a single query; a sketch follows this list.
    3. The vocabulary is stable. The W3C standard has not changed substantively since 2013. Investments in PROV-O compliance do not need to be rewritten when the next provenance fad arrives.
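A minimal sketch of the query in point 2, using rdflib (which bundles JSON-LD parsing from version 6 onward); the shard file name and source URL are illustrative:

from rdflib import Graph

g = Graph()
g.parse("provenance-00000.jsonld", format="json-ld")

QUERY = """
PREFIX prov: <http://www.w3.org/ns/prov#>
SELECT ?record WHERE {
  ?record a prov:Entity ;
          prov:wasDerivedFrom <https://bofip.impots.gouv.fr/bofip/12345-PGP.html> .
}
"""
for row in g.query(QUERY):
    print(row.record)  # every training record derived from that source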

    Implementation pattern — sidecar files, not embedded fields

    Two architectural choices in tension:

    • Embedded: the provenance fields live inside each training record. Simple, no joins needed. Drawback: changes to provenance schema require rewriting the entire dataset. Retraction is awkward because the record and its provenance share the same row.
• Sidecar: provenance lives in a separate file (Parquet shard or JSON-LD graph) keyed by record_id. Slightly more complex to query, but provenance can evolve independently of the training data. Retraction is clean — a deletion in the sidecar plus the matching deletion in the dataset.

    The sidecar pattern wins at scale. The minor query overhead is paid back the first time you need to update the provenance schema (and you will, when the next regulatory guidance lands).

    Layout for a multi-source corpus that we use ourselves:

    /dataset/
      s5_package/
        sample/
          data-00000.parquet           (training records)
          data-00001.parquet
          ...
      s7_audit_trail/
        sample/
          provenance-00000.jsonl.gz    (PROV-O records, keyed by record_id)
          provenance-00001.jsonl.gz
          ...
        MANIFEST.json                  (dataset hash, version, signed)
        ATTESTATION_AI_ACT_ART10.md    (human-readable executive summary)

    The manifest hashes the entire provenance trail, the training data, and the pipeline code together. Tampering with any of the three breaks the manifest signature.

    Hash chains and immutability

    Want a corpus with provenance already built in?

    Every record in our French regulatory and financial corpus carries a PROV-O entry, every shard is hashed, every release ships with an Article 10 attestation document.

    Provenance records must be append-only and tamper-evident. A chain of content hashes provides both properties without requiring a blockchain or a notary service:

    1. Each provenance record carries the SHA-256 of its training record’s content.
    2. The provenance file (a single Parquet shard or JSONL chunk) carries the SHA-256 of the previous shard plus its own content. Removing or modifying a shard invalidates all downstream hashes.
    3. The manifest carries the SHA-256 of all shards plus the pipeline commit hash. Re-running the pipeline on the same source produces a different timestamp but the same content hashes — which is the property auditors look for.

    This is roughly the same construction as Git’s Merkle tree. For most teams, the operational cost is low: a small Python helper runs at the end of every dataset release and writes the hashes.
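A minimal sketch of such a helper, following steps 2 and 3 and the shard layout shown earlier; the manifest-signing step is omitted:

import hashlib
import json
from pathlib import Path

def sha256_file(path: Path) -> str:
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def build_manifest(shards: list[Path], pipeline_commit: str) -> dict:
    prev, entries = "genesis", []
    for shard in shards:
        content = sha256_file(shard)
        # Chain link covers the previous link plus this shard's content hash,
        # so editing any shard invalidates everything downstream.
        chained = hashlib.sha256(f"{prev}{content}".encode()).hexdigest()
        entries.append({"shard": shard.name, "content": content, "chain": chained})
        prev = chained
    return {"pipeline_commit": pipeline_commit, "shards": entries, "head": prev}

shards = sorted(Path("s7_audit_trail/sample").glob("provenance-*.jsonl.gz"))
manifest = build_manifest(shards, pipeline_commit="git:c8380cc")
Path("s7_audit_trail/MANIFEST.json").write_text(json.dumps(manifest, indent=2))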

    The retraction procedure

    The provenance trail’s value is most concrete when someone invokes a removal right. A workable retraction procedure has four stages:

    1. Identify. Query the provenance graph for all records matching the removal criterion (URL, rights-holder, document ID). Output: a list of record_id values.
    2. Tombstone. Mark the records as retracted in the provenance trail. The record remains in the file (immutability) but is flagged. Downstream consumers filter retracted records before use.
    3. Re-release. At the next scheduled release, the training data and provenance trail are regenerated without the retracted records. The manifest is updated; the dataset version bumps.
    4. Re-evaluate. The model that was trained on the previous version is either re-trained on the new version, or its continued use is justified with a documented assessment that the retracted content’s removal does not materially affect the model. Both are acceptable; the choice must be documented.

    The hard part is step 1. A team that built provenance during ingestion completes it in minutes. A team that relies on grep over original sources can spend weeks.
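A minimal sketch of stages 1 and 2, assuming the sidecar layout above; the tombstone file name is illustrative. Tombstones live in a separate append-only file so the original shards stay immutable:

import gzip
import json
from datetime import datetime, timezone
from pathlib import Path

def retract_by_url(sidecar_dir: Path, url_prefix: str, reason: str) -> list[str]:
    # Stage 1: query the sidecar for records matching the removal criterion.
    hits = []
    for shard in sorted(sidecar_dir.glob("provenance-*.jsonl.gz")):
        with gzip.open(shard, "rt", encoding="utf-8") as f:
            for line in f:
                rec = json.loads(line)
                if rec["source"]["url"].startswith(url_prefix):
                    hits.append(rec["record_id"])
    # Stage 2: append tombstones; downstream consumers filter these ids out.
    stamp = datetime.now(timezone.utc).isoformat()
    with (sidecar_dir / "tombstones.jsonl").open("a", encoding="utf-8") as out:
        for rid in hits:
            out.write(json.dumps({"record_id": rid, "retracted_at": stamp,
                                  "reason": reason}) + "\n")
    return hits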

    What “trustworthy” provenance actually looks like to an auditor

    Three properties that distinguish provenance an auditor accepts from provenance they treat as suspicious:

    • Continuity. Provenance timestamps lie on or near the training-run timestamps. Provenance created weeks after a model release is interpreted as retroactive — possibly legitimate, often not.
    • Granularity. Per-record, not per-source. A “we used these 12 datasets” declaration is not provenance; it is metadata. A “this specific record came from this specific URL captured at this specific time under this specific license” record is provenance.
    • Verifiability. An auditor can pick a random record, read its provenance entry, and trace the chain back to the original source. Brittle chains (URLs that 404, hashes that do not reproduce, missing pipeline scripts) erode the entire trail’s credibility, not just the affected record’s.

    Common shortcuts that backfire

    Source-level is not enough

    "We documented sources, that’s enough" does not survive a per-record retraction request. The provenance must descend to the record granularity, or the retraction work scales with the dataset size, not the request size.

    • “We log to a SaaS data-catalog tool; that’s our provenance.” Most data-catalog tools document tables, not records. They are necessary but not sufficient.
    • “We rely on Git history for our pipeline scripts.” Git tells you what code existed when. It does not tell you which records were processed by which run. You need a runtime log keyed to record hashes, not just code commits.
    • “We use a blockchain to anchor our provenance.” Overkill for nearly every use case. A signed SHA-256 manifest stored in object storage with versioning enabled is forensically equivalent and several orders of magnitude cheaper.
    • “We documented sources, that’s enough.” Source-level documentation does not survive a per-record retraction request. The provenance must descend to the record granularity, or the retraction work scales with the dataset size, not the request size.

    Bottom line

    An audit-ready provenance trail is a per-record JSON-LD sidecar using the PROV-O vocabulary, organized as immutable hashed shards alongside the training data, with a manifest that signs the whole. It costs about 5–10 % of the engineering effort of building the dataset itself, when added during ingestion. It costs 5–10× the original dataset effort when added after deployment under regulatory pressure.

    This is the pattern we use for the French regulatory text corpus we publish — every record carries a PROV-O entry, every shard is hashed, every release ships with an Article 10 attestation document. For the regulatory layer that determines whether the trail is sufficient, see our article on EU AI Act Article 10. For the engineering decisions that make per-record provenance practical at scale, see our writing on dataset formats and training data size.

    See also: GDPR pseudonymization for LLM training data and the AI Act Article 10 datasheet template.

    Frequently asked questions

    Why PROV-O and not a custom schema?

    Regulators recognize W3C PROV-O on sight. They do not recognize in-house JSON schemas. The vocabulary has been stable since 2013 — investments in PROV-O compliance don’t need to be rewritten when the next provenance fad arrives.

    Sidecar files or embedded provenance?

    Sidecar wins at scale. Provenance can evolve independently of training data; retraction is clean (delete in sidecar plus same in dataset); schema changes don’t require rewriting the entire corpus.

    Do I need a blockchain?

    No. A signed SHA-256 manifest stored in object storage with versioning enabled is forensically equivalent and several orders of magnitude cheaper. Blockchains are overkill for nearly every training data use case.

    How does retraction actually work?

    Four stages: identify all records matching the criterion via provenance query (minutes), tombstone them in the trail, regenerate the dataset and manifest at the next release, and either re-train or document why continued use is acceptable.

    What if an auditor picks a random record?

    They can trace it back to the source URL, capture date, and license — within seconds. Brittle chains (404 URLs, non-reproducible hashes, missing pipeline scripts) erode the entire trail’s credibility, not just the affected record’s.

    Need a corpus with provenance already in place?

    Our French regulatory and financial corpus ships with PROV-O sidecar files, hash-chained shards, and an Article 10 attestation document per release.


    Keep reading

    Read next

    EU AI Act Article 10

    The regulatory layer that determines whether the trail is sufficient.

    Read next

    Choosing a dataset format

    Storage format decisions that make per-record provenance practical.

    Read next

    Training data size for LLMs

    Volume thresholds where provenance overhead becomes meaningful.

  • EU AI Act Article 10 — what training data documentation actually requires.

    Article 10 of the EU AI Act is the part that translates "the AI must be trustworthy" into engineering work product. It applies to every high-risk AI system placed on the EU market, and from 2 August 2026 it makes data governance a documentation deliverable, not an internal posture.

    Key takeaways

    • Article 10 applies to training, validation, AND test sets — not just training.
    • Eight documented governance practices required: design choices, data origin, preparation, assumptions, suitability, bias examination, mitigation, gap analysis.
    • Dataset Specification Document per dataset — 20–50 pages for a substantial LLM corpus.
    • Enforcement starts 2 August 2026. Fines reach €15 M or 3 % of worldwide annual turnover.
    • GDPR compliance is not enough. Article 10 is a separate, overlapping obligation.

The Article applies to every high-risk AI system placed on the EU market or whose output is used by a person in the EU. The text is short, under 1 000 words, but the implementation depth is multiple months of work for any team starting from a blank page. This guide walks through what Article 10 actually says, what each requirement means concretely for a training dataset, the enforcement timeline that matters in 2026, and the documentation artefacts you need to produce.

    What “high-risk” actually covers

    Article 10’s obligations only kick in for high-risk systems. Annex III of the Act defines the categories. The ones most likely to matter for an LLM-based product:

    • Biometric identification and categorisation — any model that infers identity, age, gender, or emotional state from a face, voice, or behavioural signal.
    • Employment and worker management — recruitment screening, CV filtering, performance evaluation, task allocation, termination decisions.
    • Access to essential services — credit scoring, insurance pricing, eligibility for public benefits, emergency services dispatch.
    • Law enforcement — risk assessment, evidence evaluation, lie detection, profiling.
    • Migration, asylum, and border control — visa and asylum risk scoring, document verification, identity assessment.
    • Administration of justice and democratic processes — assistance to judicial decisions, voting and election influence detection.
    • Critical infrastructure — safety components in road traffic, water supply, gas, electricity, digital infrastructure.

    An LLM-based product that touches any of these categories is high-risk by deployment, even if the underlying model is general-purpose. A general-purpose AI model marketed as “for HR screening” inherits the high-risk classification of the deployment use case.

    Article 10, paragraph by paragraph

    The Article has six paragraphs. The ones with concrete engineering implications:

    Paragraph 1 — Data governance applies to all training, validation, and testing sets

    The obligation is not limited to the training dataset. Validation and test sets must meet the same governance bar. In practice, this means the same provenance records, the same quality controls, the same bias audits — applied to every split, with documentation showing they were applied independently.

    Paragraph 2 — Specific governance practices

    The longest paragraph, listing eight required governance practices. Each must be documented:

    • Design choices — why was this dataset constructed in this way, what scope was excluded, what alternative sources were considered.
    • Data collection processes and origin — where each data point came from, how it was collected, under what legal basis if personal data was involved.
    • Data preparation operations — annotation, labelling, cleaning, enrichment, aggregation. The complete processing chain from raw source to training-ready format.
    • Formulation of relevant assumptions — what the data is assumed to measure or represent. Often the most overlooked requirement, and the one regulators will challenge first.
    • Assessment of availability, quantity, and suitability — does the dataset have enough representative examples for the deployment context.
    • Examination of bias — bias that could affect health and safety, bias that could lead to discrimination prohibited by EU law. Explicit, not implicit.
    • Appropriate measures to detect, prevent, and mitigate bias — what was done about the biases identified. Documentation of intent is not enough; documentation of action is required.
    • Identification of data gaps and shortcomings — what the dataset does not cover, and how that absence might affect downstream use.

    Paragraph 3 — Quality criteria

Training, validation, and test sets must be: relevant, sufficiently representative, to the best extent possible free of errors, complete in view of the intended purpose, with appropriate statistical properties. Five criteria, each requiring measurable evidence.

    “Relevant” and “complete” are interpreted relative to the intended purpose declared in the technical documentation (Article 11). This means the intended purpose statement is no longer a marketing field — it becomes the yardstick against which dataset adequacy is judged.

    Paragraph 4 — Contextual specificity

    Datasets must reflect the geographical, contextual, behavioural, and functional setting where the system will be used. A credit-scoring model trained on Italian data and deployed in France triggers a documented assessment of cross-context transfer risk. A face-recognition model trained on a US-centric dataset and deployed in Northern Europe triggers the same.

    Paragraph 5 — Special categories of personal data

When the dataset includes special categories of personal data (race, ethnicity, health, religion, sexual orientation), the provider may process them only for the purpose of bias detection and correction, only when strictly necessary, only with appropriate safeguards, and never for downstream prediction. This paragraph creates an explicit, narrow GDPR carve-out specifically for fairness work, with documentation requirements that exceed normal GDPR processing records.

    The documentation artefact — Dataset Specification Document

    Producing the Dataset Specification Document three months after training is finished is an order of magnitude more expensive than producing it during training. Build the layer in early.

Article 10 does not specify a document name, but Article 11 plus Annex IV together require technical documentation that amounts to a Dataset Specification Document per dataset. The minimum content, distilled from the regulatory text plus the Commission’s August 2025 implementation guidance:

    1. Dataset identifier — name, version, content hash, date of construction.
    2. Intended purpose mapping — explicit link to the high-risk system’s Article 11 intended-purpose declaration.
    3. Composition — sources, volumes per source, license per source, time range, geographic coverage.
    4. Collection methodology — how each source was obtained, scraped, licensed, or annotated. Legal basis for any personal data processing.
    5. Preparation chain — every transformation from raw to final, with versioned scripts or pipelines.
    6. Assumptions and exclusions — what the dataset assumes about its content; what is deliberately excluded.
    7. Quality metrics — error rate, deduplication rate, language-detection accuracy, sampling adequacy. Numbers, not descriptions.
    8. Bias analysis — protected attributes examined, methodology, results, mitigation actions taken.
    9. Gap analysis — known limitations, populations or contexts underrepresented, downstream risks.
    10. Retention and retraction — how long the dataset is kept, how a data subject can request removal, how the system is re-evaluated after removal.

    For an LLM training corpus of any substantial size, this document runs 20 to 50 pages. Producing it three months after training is finished is an order of magnitude more expensive than producing it during training. Build the layer in early.

    Enforcement timeline — what is binding when

    Need Article 10-ready training data?

    Our French regulatory and financial corpus ships with a Dataset Specification Document, per-record JSON-LD provenance, and a retraction procedure already documented.

    The AI Act entered into force on 1 August 2024. Its provisions apply on a staggered schedule:

    • 2 February 2025: prohibitions (Article 5) on banned practices.
    • 2 August 2025: obligations for general-purpose AI models (Chapter V).
    • 2 August 2026: Article 10 and the rest of the high-risk obligations become applicable. Member State authorities can begin enforcement.
    • 2 August 2027: high-risk systems embedded in products subject to existing Union harmonisation legislation.

The 2 August 2026 date matters because most teams operating high-risk systems in the EU today have less than three months left to reach compliance. Penalties for Article 10 non-compliance reach the higher of €15 million or 3 % of worldwide annual turnover. Other Act violations can go up to €35 million or 7 %, but Article 10 sits in the 3 % tier.

    Common misinterpretations

    No retroactive exception

    Authorities will compare documentation timestamps against training run timestamps. Obvious retroactive compliance attempts (creating provenance records months after training) are a red flag and may trigger deeper audit. Build the layer during data preparation, not after deployment.

    • “Our model is hosted in the US, so the AI Act doesn’t apply.” Wrong. The Act applies extraterritorially. If the output is used by a person in the EU, the provider’s obligations attach.
    • “We don’t train on personal data, so Article 10 is irrelevant.” Wrong. Article 10 applies to all training data, personal or not, for any high-risk use case.
    • “GDPR compliance is enough.” Wrong. GDPR governs personal-data processing legality. Article 10 governs dataset quality, governance, and documentation for high-risk AI. Different obligations, overlapping but distinct.
    • “We can document the dataset retroactively before audit.” Possible, expensive, and increasingly suspicious. Authorities will compare documentation timestamps against training run timestamps; obvious retroactive compliance attempts are a red flag.
    • “Open-source datasets we use absolve us of responsibility.” Wrong. The deployer carries responsibility for the dataset used, regardless of upstream origin. A documented diligence check on the upstream provenance is required, not an automatic pass-through.

    Practical implementation — minimum-viable Article 10 compliance

    For a team starting from no formal data governance, the minimum work to reach demonstrable Article 10 compliance:

    1. Map your high-risk surface. Which of your deployed AI systems fall under Annex III? List them by name. If none do, Article 10 may not apply to you yet — but the Commission’s general-purpose AI guidance still does.
    2. Inventory your datasets. Every dataset that has touched a high-risk system, including upstream public datasets, scraped data, synthetic data, and user-derived data. Record source, license, date, content hash.
    3. Build a Dataset Specification Document per dataset. Use the 10-section template above. Start with the easy sections (composition, sources). The bias and gap analysis are harder and may require external help.
    4. Implement per-record provenance. A JSON-LD or similar audit trail attached to each training example, recording its source, license, processing chain, and ingestion timestamp. This becomes the evidence underpinning the dataset specification.
    5. Define the retraction procedure. Written. Tested. Owned by a named person. The hardest item on the list, and the one that will be requested first if a data subject ever invokes GDPR Article 17 against your training set.
    6. Annual review. Every Dataset Specification Document is reviewed and re-signed annually, even if the dataset has not changed. This produces the trail of ongoing oversight that Article 10 implicitly requires.

    Bottom line

    Article 10 turns dataset governance from an internal practice into a regulatory deliverable. The enforcement date is 2 August 2026. The minimum work — Annex III mapping, dataset inventory, per-dataset specification documents, per-record provenance trail, retraction procedure — is multiple months for a team starting from zero. The teams that survive comfortably are the ones that built the layer during data preparation, not bolted on after deployment.

    For a concrete pattern for the per-record provenance layer, see our companion guide on building an audit-ready provenance trail for training datasets. For the dataset construction discipline that makes Article 10 compliance practical at scale, see our writing on dataset formats and SFT dataset curation.

    See also: GDPR pseudonymization for LLM training data, the AI Act Article 10 datasheet template, and the French legal NLP landscape.

    Frequently asked questions

    Does the EU AI Act apply if my model is hosted in the US?

    Yes. The Act applies extraterritorially. If the output is used by a person in the EU, the provider’s obligations attach regardless of where the model is hosted.

    Is GDPR compliance enough?

    No. GDPR governs personal-data processing legality. Article 10 governs dataset quality, governance, and documentation for high-risk AI. Different obligations, overlapping but distinct.

    What counts as a ‘high-risk’ system?

    The Annex III categories: biometric identification, employment screening, access to essential services (credit, insurance), law enforcement, migration, judicial assistance, critical infrastructure safety. A general-purpose LLM becomes high-risk by deployment.

    Can I use open-source datasets and let upstream carry responsibility?

    No. The deployer carries responsibility for the dataset used, regardless of upstream origin. A documented diligence check on the upstream provenance is required.

    What are the penalties?

    Article 10 violations sit in the lower tier: €15 M or 3 % of worldwide annual turnover, whichever is higher. Other Act violations go up to €35 M or 7 %.

    Need AI Act-ready training data?

    Our French regulatory and financial corpus ships with Article 10 documentation in place — per-document provenance, retraction procedure, dataset specification document, attestation per release.


    Keep reading

    Read next

    Building an audit-ready provenance trail

    The per-record provenance pattern that satisfies Article 10’s documentation requirements.

    Read next

    How to train an LLM on your own data

    Where the governance layer plugs into a fine-tuning pipeline.

    Read next

    Choosing a dataset format

    The storage decisions that make per-record provenance scalable.