Essay · French Corpus LLM · Finaleads LLC

Building an audit-ready provenance trail for training datasets.

Key takeaways

  • Per-record provenance is now effectively a hard requirement under the EU AI Act, GDPR Article 17, and copyright transparency rules; JSON-LD records are the practical way to meet it.
  • Use the W3C PROV-O vocabulary — regulators recognize it; in-house schemas they don’t.
  • Sidecar files (provenance separate from training data) win over embedded fields at scale.
  • Hash-chain the shards. A signed manifest plus versioned object storage is forensically equivalent to a blockchain at 1 % of the cost.
  • Built during ingestion: 5–10 % of dataset effort. Retrofitted after deployment: 5–10× the original effort.

An audit-ready provenance trail is the artefact that lets someone reconstruct, for any record in your training dataset, exact answers to four questions: where did this come from, how did we get it, what did we do to it, and is it still allowed to be here? Without that trail, a regulator’s request to remove a data subject’s content becomes a months-long forensic exercise. With it, the request becomes a one-day pipeline run. This guide walks through what an audit-ready provenance record contains, the W3C PROV-O vocabulary that has become the de facto standard, the implementation pattern that scales to billions of records, and the operational discipline that keeps the trail trustworthy.

Why a provenance trail is now a hard requirement

Built during ingestion: 5–10 % of dataset effort. Retrofitted under regulatory pressure: 5–10× the original dataset effort.

Three regulatory pressures, all converging in 2026, have moved provenance from “nice to have” to “deploy-blocker”:

  • EU AI Act Article 10 requires documented data collection, preparation, and origin for every dataset used to train a high-risk system. Enforcement applies from 2 August 2026.
  • GDPR Article 17 gives data subjects a right to erasure, including from training datasets. A team that cannot identify which records came from which subject cannot honour the request without retraining from scratch.
  • Copyright clarity — EU Directive 2019/790 plus the AI Act’s transparency obligations for general-purpose AI providers require disclosure of the data sources used during training, with sufficient detail to allow rights-holders to verify or contest.

The teams that come out of this period intact are the ones that built the provenance layer during data preparation. The teams that retrofitted it after launch are the ones still negotiating timelines with their regulator a year in.

Anatomy of an audit-ready provenance record

For every individual training example — every row of your Parquet, every JSONL line, every annotation — the provenance record should answer the four questions. A minimal but production-grade record:

{
  "record_id": "sha256:7f3c…a1b8",
  "source": {
    "name": "BOFiP-Impôts",
    "url": "https://bofip.impots.gouv.fr/bofip/12345-PGP.html",
    "license": "Licence Ouverte 2.0",
    "license_url": "https://www.etalab.gouv.fr/licence-ouverte-open-licence",
    "rights_holder": "DGFiP (Direction générale des Finances publiques)",
    "captured_at": "2026-05-08T14:22:31Z",
    "capture_method": "official-api"
  },
  "content_hash": "sha256:f2e8…9c44",
  "pipeline": {
    "extractor_version": "extractor-bofip@v1.4.2",
    "pipeline_commit": "git:c8380cc",
    "transformations": [
      "html_to_text",
      "language_detect",
      "minhash_dedup",
      "topical_filter"
    ]
  },
  "ai_act_declaration": {
    "intended_purpose": "high-risk-finance-llm-fine-tuning",
    "personal_data_present": false,
    "special_categories_present": false,
    "retraction_path": "DELETE /api/v1/records/{record_id}"
  },
  "ingested_at": "2026-05-13T09:14:02Z"
}

This is roughly 600 bytes per record uncompressed, ~150 bytes after gzip. For a 2 M-record corpus, the provenance sidecar weighs ~300 MB compressed — negligible relative to the training data itself.

PROV-O — the vocabulary that aligns with regulator expectations

The W3C PROV-O ontology is the standard vocabulary for representing provenance. It defines three core types:

  • Entity — a thing that exists. The raw HTML page, the cleaned text, the final Parquet record. Each is an Entity with its own URI.
  • Activity — a process that uses and produces Entities. Extraction, cleaning, deduplication, annotation are each Activities.
  • Agent — a person, organization, or piece of software responsible for an Activity. The DGFiP that published the original document, the extractor script, the annotator who reviewed the result.

The PROV-JSON-LD serialization makes this representation machine-readable and Linked-Data-compliant in one format. It is the closest the field has to a regulator-recognized standard. Three reasons to use it:

  1. An EU AI Act auditor will recognise PROV-O on sight. They will not recognise your in-house JSON schema.
  2. PROV-O graphs can be queried with SPARQL or any RDF tool, which makes “show me every record affected by removing this source” a single query.
  3. The vocabulary is stable. The W3C standard has not changed substantively since 2013. Investments in PROV-O compliance do not need to be rewritten when the next provenance fad arrives.
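
To make the three core types concrete, here is a minimal PROV-JSON-LD sketch for one extracted page; the ex: prefix and the URIs are illustrative, not a fixed naming scheme:

{
  "@context": {
    "prov": "http://www.w3.org/ns/prov#",
    "ex": "https://example.org/prov/"
  },
  "@graph": [
    { "@id": "ex:raw-html-12345", "@type": "prov:Entity" },
    {
      "@id": "ex:clean-text-12345",
      "@type": "prov:Entity",
      "prov:wasDerivedFrom": { "@id": "ex:raw-html-12345" },
      "prov:wasGeneratedBy": { "@id": "ex:extract-run-2026-05-13" }
    },
    {
      "@id": "ex:extract-run-2026-05-13",
      "@type": "prov:Activity",
      "prov:used": { "@id": "ex:raw-html-12345" },
      "prov:wasAssociatedWith": { "@id": "ex:extractor-bofip-v1.4.2" }
    },
    { "@id": "ex:extractor-bofip-v1.4.2", "@type": "prov:SoftwareAgent" }
  ]
}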

Implementation pattern — sidecar files, not embedded fields

Two architectural choices in tension:

  • Embedded: the provenance fields live inside each training record. Simple, no joins needed. Drawback: changes to provenance schema require rewriting the entire dataset. Retraction is awkward because the record and its provenance share the same row.
  • Sidecar: provenance lives in a separate file (Parquet shard or JSON-LD graph) keyed by record_id. Slightly more complex to query, but provenance can evolve independently of the training data, and retraction is clean: one deletion in the sidecar and a matching deletion in the dataset.

The sidecar pattern wins at scale. The minor query overhead is paid back the first time you need to update the provenance schema (and you will, when the next regulatory guidance lands).

The layout we use ourselves for a multi-source corpus:

/dataset/
  s5_package/
    sample/
      data-00000.parquet           (training records)
      data-00001.parquet
      ...
  s7_audit_trail/
    sample/
      provenance-00000.jsonl.gz    (PROV-O records, keyed by record_id)
      provenance-00001.jsonl.gz
      ...
    MANIFEST.json                  (dataset hash, version, signed)
    ATTESTATION_AI_ACT_ART10.md    (human-readable executive summary)
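
The sidecar’s query overhead is concrete and small: one index build over the sidecar plus a dictionary lookup per record. A sketch using pyarrow, assuming the training shards carry a record_id column matching the sidecar keys:

import gzip
import json
import pyarrow.parquet as pq

# Build the join index: record_id -> provenance entry.
provenance = {}
with gzip.open("s7_audit_trail/sample/provenance-00000.jsonl.gz",
               "rt", encoding="utf-8") as f:
    for line in f:
        entry = json.loads(line)
        provenance[entry["record_id"]] = entry

# Walk one training shard and look up each record's license.
table = pq.read_table("s5_package/sample/data-00000.parquet")
for record_id in table.column("record_id").to_pylist():
    license_name = provenance[record_id]["source"]["license"]
    # ... filter, report, or attach downstream as needed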

The manifest hashes the entire provenance trail, the training data, and the pipeline code together. Tampering with any of the three breaks the manifest signature.
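
A sketch of what that manifest might contain (field names are illustrative; the signature comes from whatever signing machinery you already operate, and the chained hashes are explained in the next section):

{
  "dataset_version": "2026.05.0",
  "pipeline_commit": "git:c8380cc",
  "data_shards": [
    { "file": "data-00000.parquet", "sha256": "sha256:…" }
  ],
  "provenance_shards": [
    { "file": "provenance-00000.jsonl.gz", "sha256": "sha256:…", "chained_sha256": "sha256:…" }
  ],
  "signature": "…"
}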

Hash chains and immutability

Provenance records must be append-only and tamper-evident. A chain of content hashes provides both properties without requiring a blockchain or a notary service:

  1. Each provenance record carries the SHA-256 of its training record’s content.
  2. The provenance file (a single Parquet shard or JSONL chunk) carries the SHA-256 of the previous shard plus its own content. Removing or modifying a shard invalidates all downstream hashes.
  3. The manifest carries the SHA-256 of all shards plus the pipeline commit hash. Re-running the pipeline on the same source produces a different timestamp but the same content hashes — which is the property auditors look for.

This is roughly the same construction as Git’s Merkle tree. For most teams, the operational cost is low: a small Python helper runs at the end of every dataset release and writes the hashes.
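
A minimal sketch of such a helper, following the chaining scheme above (paths match the layout shown earlier; signing the resulting manifest is left to your existing key infrastructure):

import hashlib
import json
import pathlib

def sha256_file(path: pathlib.Path) -> str:
    """Stream one file through SHA-256 and return a prefixed digest."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return "sha256:" + h.hexdigest()

def build_manifest(shard_dir: str, pipeline_commit: str) -> dict:
    """Hash every provenance shard, chaining each shard's hash to its predecessor's."""
    shards, prev = [], ""
    for shard in sorted(pathlib.Path(shard_dir).glob("provenance-*.jsonl.gz")):
        content_hash = sha256_file(shard)
        chained = "sha256:" + hashlib.sha256((prev + content_hash).encode()).hexdigest()
        shards.append({"file": shard.name,
                       "content_hash": content_hash,
                       "chained_hash": chained})
        prev = chained  # removing or editing any shard now breaks every later hash
    return {"pipeline_commit": pipeline_commit, "shards": shards}

manifest = build_manifest("s7_audit_trail/sample", "git:c8380cc")
pathlib.Path("s7_audit_trail/MANIFEST.json").write_text(
    json.dumps(manifest, indent=2))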

The retraction procedure

The provenance trail’s value is most concrete when someone invokes a removal right. A workable retraction procedure has four stages:

  1. Identify. Query the provenance graph for all records matching the removal criterion (URL, rights-holder, document ID). Output: a list of record_id values.
  2. Tombstone. Mark the records as retracted in the provenance trail. The record remains in the file (immutability) but is flagged, and downstream consumers filter retracted records before use; a sketch of a tombstone entry follows this list.
  3. Re-release. At the next scheduled release, the training data and provenance trail are regenerated without the retracted records. The manifest is updated; the dataset version bumps.
  4. Re-evaluate. The model that was trained on the previous version is either re-trained on the new version, or its continued use is justified with a documented assessment that the retracted content’s removal does not materially affect the model. Both are acceptable; the choice must be documented.
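
A tombstone at stage 2 can be a small append-only entry in the trail; a sketch with illustrative fields:

{
  "record_id": "sha256:7f3c…a1b8",
  "retracted": true,
  "retracted_at": "2026-06-02T10:00:00Z",
  "retraction_basis": "gdpr-article-17-erasure-request"
}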

The hard part is stage 1. A team that built provenance during ingestion completes it in minutes. A team that relies on grep over original sources can spend weeks.
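
Against the JSONL sidecars, stage 1 can be as small as the following sketch (retrieval by source URL; the schema matches the record shown earlier):

import gzip
import json
import pathlib

def identify(sidecar_dir: str, source_url: str) -> list[str]:
    """Stage 1: collect every record_id whose provenance points at the retracted source."""
    hits = []
    for shard in sorted(pathlib.Path(sidecar_dir).glob("provenance-*.jsonl.gz")):
        with gzip.open(shard, "rt", encoding="utf-8") as f:
            for line in f:
                entry = json.loads(line)
                if entry["source"]["url"] == source_url:
                    hits.append(entry["record_id"])
    return hits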

What “trustworthy” provenance actually looks like to an auditor

Three properties that distinguish provenance an auditor accepts from provenance they treat as suspicious:

  • Continuity. Provenance timestamps coincide with, or closely precede, the training-run timestamps. Provenance created weeks after a model release is interpreted as retroactive — possibly legitimate, often not.
  • Granularity. Per-record, not per-source. A “we used these 12 datasets” declaration is not provenance; it is metadata. A “this specific record came from this specific URL captured at this specific time under this specific license” record is provenance.
  • Verifiability. An auditor can pick a random record, read its provenance entry, and trace the chain back to the original source. Brittle chains (URLs that 404, hashes that do not reproduce, missing pipeline scripts) erode the entire trail’s credibility, not just the affected record’s.
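
Verifiability is cheap to support: a spot-check reduces to recomputing one hash. A sketch, assuming content hashes are taken over each record’s UTF-8 text:

import hashlib

def spot_check(record_text: str, prov_entry: dict) -> bool:
    """Recompute one record's content hash and compare it to the trail's entry."""
    digest = "sha256:" + hashlib.sha256(record_text.encode("utf-8")).hexdigest()
    return digest == prov_entry["content_hash"]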

Common shortcuts that backfire

  • “We log to a SaaS data-catalog tool; that’s our provenance.” Most data-catalog tools document tables, not records. They are necessary but not sufficient.
  • “We rely on Git history for our pipeline scripts.” Git tells you what code existed when. It does not tell you which records were processed by which run. You need a runtime log keyed to record hashes, not just code commits.
  • “We use a blockchain to anchor our provenance.” Overkill for nearly every use case. A signed SHA-256 manifest stored in object storage with versioning enabled is forensically equivalent and several orders of magnitude cheaper.
  • “We documented sources, that’s enough.” Source-level documentation does not survive a per-record retraction request. The provenance must descend to the record granularity, or the retraction work scales with the dataset size, not the request size.

Bottom line

An audit-ready provenance trail is a per-record JSON-LD sidecar using the PROV-O vocabulary, organized as immutable hashed shards alongside the training data, with a manifest that signs the whole. It costs about 5–10 % of the engineering effort of building the dataset itself, when added during ingestion. It costs 5–10× the original dataset effort when added after deployment under regulatory pressure.

This is the pattern we use for the French regulatory text corpus we publish — every record carries a PROV-O entry, every shard is hashed, every release ships with an Article 10 attestation document. For the regulatory layer that determines whether the trail is sufficient, see our article on EU AI Act Article 10. For the engineering decisions that make per-record provenance practical at scale, see our writing on dataset formats and training data size.

See also: GDPR pseudonymization for LLM training data and the AI Act Article 10 datasheet template.

Frequently asked questions

Why PROV-O and not a custom schema?

Regulators recognize W3C PROV-O on sight. They do not recognize in-house JSON schemas. The vocabulary has been stable since 2013 — investments in PROV-O compliance don’t need to be rewritten when the next provenance fad arrives.

Sidecar files or embedded provenance?

Sidecar wins at scale. Provenance can evolve independently of training data; retraction is clean (delete in sidecar plus same in dataset); schema changes don’t require rewriting the entire corpus.

Do I need a blockchain?

No. A signed SHA-256 manifest stored in object storage with versioning enabled is forensically equivalent and several orders of magnitude cheaper. Blockchains are overkill for nearly every training data use case.

How does retraction actually work?

Four stages: identify all records matching the criterion via provenance query (minutes), tombstone them in the trail, regenerate the dataset and manifest at the next release, and either re-train or document why continued use is acceptable.

What if an auditor picks a random record?

They can trace it back to the source URL, capture date, and license — within seconds. Brittle chains (404 URLs, non-reproducible hashes, missing pipeline scripts) erode the entire trail’s credibility, not just the affected record’s.

Need a corpus with provenance already in place?

Our French regulatory and financial corpus ships with PROV-O sidecar files, hash-chained shards, and an Article 10 attestation document per release.


