“Datasheets for datasets” was a 2018 academic proposal. In 2026 it is the closest thing Article 10 of the AI Act has to a recognized documentation template. What the framework gets right, what it misses for general-purpose LLM corpora, and how to actually compile one a regulator can read.
Key takeaways
- Datasheets for datasets (Gebru et al. 2018) defines seven sections that map cleanly onto AI Act Article 10 requirements. Use them as your spine — auditors recognize the structure.
- The framework predates LLM training corpora. You need to extend Composition and Collection Process to cover provenance per source, deduplication strategy, and pseudonymization. Don’t ship without those.
- A datasheet is a static document but the dataset is not. Version the datasheet alongside the dataset, sign the PDF, and keep a changelog. “DSD v1.3, signed 2026-05-15” is more credible than a dated wiki page.
- Composition and Distribution are where the AI Act Article 10 audit risk lives. Bias notes, licensing per source, and downstream restrictions belong there — not in a separate compliance document.
- A signed PDF with a published SHA-256 is the cheapest forgery-resistance you can ship. PAdES PKCS#7 self-signed is enough at this stage; the value is the hash chain, not the certificate authority.
In this article
- What “datasheets for datasets” actually is
- Why it became the de-facto Article 10 template
- The seven sections, mapped to AI Act obligations
- What the original framework misses for LLM corpora
- A working template — section by section
- Signing — PAdES PKCS#7 and why it matters
- Versioning — how to keep the datasheet fresh
- Common failure modes in 2026 datasheets
- Frequently asked questions
What “datasheets for datasets” actually is
Timnit Gebru, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna Wallach, Hal Daumé III, and Kate Crawford published “Datasheets for Datasets” in 2018. The proposal was simple: every dataset used to train a model should ship with a structured document that answers a fixed set of questions, the way an electronic component ships with a datasheet. The questions are organized into seven sections — Motivation, Composition, Collection Process, Preprocessing/Cleaning/Labeling, Uses, Distribution, and Maintenance.
The framework was not a regulatory requirement at the time. It was a community proposal that gained traction because it formalized what good data-science practice looked like. Hugging Face adopted a variant for its Dataset Cards. Papers With Code embedded it. By 2023 it was the most cited piece of dataset documentation infrastructure outside the ML-fairness literature.
Why it became the de-facto Article 10 template
Article 10 of the AI Act requires data governance and management practices “concerning training, validation and testing data sets,” including their relevance, representativeness, examination of biases, and identification of data gaps. The regulation does not prescribe a format. But when a national supervisor, or a 2026 EU Commission staff working paper, asks “show me your dataset documentation,” the closest thing regulated industries have to a recognized answer is a Gebru-style datasheet.
WHAT REGULATORS CHECK FOR
Per-source licensing, per-source counts, biases documented even if uncomfortable, a preprocessing audit trail, a maintenance plan. Those five points are where most 2026 datasheets fall down. A Gebru-style structure forces you to address each.
The seven sections, mapped to AI Act obligations
| Datasheet section | AI Act Article 10 hook | What auditors check |
|---|---|---|
| Motivation | Purpose statement, intended uses | Is the stated purpose plausible against the data? |
| Composition | Relevance, representativeness | Per-source counts, language distribution, time coverage |
| Collection Process | Data acquisition, consent posture | Licensing per source, scraping vs paid, opt-outs |
| Preprocessing | Examination of biases, gap analysis | Dedup, filter, pseudonymization, what was dropped |
| Uses | Restrictions on downstream uses | Acceptable use, forbidden use, foreseeable misuse |
| Distribution | Versioning, third-party sharing | Format, hash, signature, retention |
| Maintenance | Refresh cadence, correction process | Quarterly diff, errata process, contact |
What the original framework misses for LLM corpora
Gebru et al. wrote with classifier and tabular datasets in mind. Three gaps matter for LLM training data:
- Token-level statistics. The 2018 schema asks about “instances” and “labels.” For an LLM corpus you also need token counts per source, average document length, and the distribution of document sizes. A single number for “total tokens” is not enough.
- Cross-source deduplication. Composition assumes sources are independent. In an LLM corpus they routinely overlap — a parliamentary debate quoted in a court ruling quoted in a regulator’s decision. Document your dedup method (exact match, MinHash LSH, semantic) and the drop rate per source; a minimal accounting sketch follows this list.
- Pseudonymization scope. The original framework asks about sensitive data broadly. For GDPR/AI Act you need a specific section on personal-name pseudonymization: method, detector, mapping strategy, per-document counts.
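To make the dedup accounting concrete, here is a minimal sketch using the open-source datasketch library for MinHash LSH. The shingle size, the 0.8 similarity threshold, and the per-source bookkeeping are illustrative assumptions, not production settings.

```python
# Sketch: cross-source near-duplicate detection with MinHash LSH (datasketch).
# Shingle size and the 0.8 threshold are illustrative assumptions.
from collections import defaultdict
from datasketch import MinHash, MinHashLSH

def minhash(text: str, num_perm: int = 128, shingle: int = 5) -> MinHash:
    """Build a MinHash signature from character shingles."""
    m = MinHash(num_perm=num_perm)
    for i in range(max(len(text) - shingle + 1, 1)):
        m.update(text[i:i + shingle].encode("utf-8"))
    return m

def dedup_report(docs):
    """docs: iterable of (doc_id, source, text). Returns kept/dropped counts per source."""
    lsh = MinHashLSH(threshold=0.8, num_perm=128)
    dropped = defaultdict(int)
    kept = defaultdict(int)
    for doc_id, source, text in docs:
        sig = minhash(text)
        if lsh.query(sig):          # near-duplicate of an already-kept document
            dropped[source] += 1
        else:
            lsh.insert(doc_id, sig)
            kept[source] += 1
    return {
        s: {
            "kept": kept[s],
            "dropped": dropped[s],
            "drop_rate": dropped[s] / max(kept[s] + dropped[s], 1),
        }
        for s in set(kept) | set(dropped)
    }
```

The per-source drop rate this produces is exactly the number that belongs in the Preprocessing section of the datasheet.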
A datasheet that reads like marketing copy will not survive a regulator audit. The right tone is plain and quantitative: what we did, how many, with what known limits.
A working template — section by section
Below is the structure we ship on our own French finance and regulatory corpus, with the specific deltas from the 2018 schema marked. Adapt to your context — but if you cannot answer one of these sub-questions, that is a finding.
- Motivation: purpose of the dataset, who funded it, intended use cases, who is excluded from the user list (e.g., consumer-facing apps), and why the dataset exists rather than reusing public alternatives.
- Composition: total documents, total characters, total tokens, language (with confidence threshold), per-source breakdown including counts and licensing, temporal coverage (oldest and newest document per source), and the document size distribution at the 25/50/75/95 percentile (a computation sketch follows this list).
- Collection Process: data sources URL by URL, acquisition method (scraping, bulk archive, API, paid license), date range of acquisition, robots.txt compliance posture, and per-source HTTP/PDF/HTML notes.
- Preprocessing: stage 1 normalization, stage 2 quality scoring methods, stage 3 deduplication methods + drop rates per source, stage 4 LLM-judge or rule-based scoring if any, and the pseudonymization layer (detector, mapping, counts).
- Uses: recommended uses, foreseeable misuse, restrictions, the model alignment plan, and what the dataset should NOT be used for (e.g., predicting individuals).
- Distribution: formats shipped (Parquet, JSONL, Snowflake share), file hashes, signatures, tier definitions if you split the corpus, retention policy.
- Maintenance: refresh cadence, semantic changelog process, contact channel for errata, deprecation policy.
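As a concrete illustration of the Composition numbers, here is a minimal sketch that computes per-source document counts, token totals, and size percentiles. The JSONL layout (one record per document with "source" and "text" fields) and the whitespace token count are assumptions; substitute your corpus format and your real tokenizer.

```python
# Sketch: per-source counts, token totals, and document-size percentiles
# for the Composition section. The JSONL layout and whitespace tokenization
# are illustrative assumptions.
import json
from collections import defaultdict
from statistics import quantiles

def composition_stats(jsonl_path: str) -> dict:
    docs = defaultdict(int)
    tokens = defaultdict(int)
    lengths = defaultdict(list)
    with open(jsonl_path, encoding="utf-8") as f:
        for line in f:
            rec = json.loads(line)
            src, text = rec["source"], rec["text"]
            n_tokens = len(text.split())   # swap in your real tokenizer
            docs[src] += 1
            tokens[src] += n_tokens
            lengths[src].append(n_tokens)

    report = {}
    for src in docs:
        vals = sorted(lengths[src])
        if len(vals) >= 2:
            q = quantiles(vals, n=100, method="inclusive")
            p25, p50, p75, p95 = q[24], q[49], q[74], q[94]
        else:
            p25 = p50 = p75 = p95 = vals[0]
        report[src] = {
            "documents": docs[src],
            "tokens": tokens[src],
            "p25": p25, "p50": p50, "p75": p75, "p95": p95,
        }
    return report
```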
A signed Dataset Specification with every release
We publish a 54-page bundle (master DSD + Source Licenses + Glossary) with every release: PAdES PKCS#7 self-signed, RSA-4096, SHA-256 published. The format is open and free to copy.
Signing — PAdES PKCS#7 and why it matters
A datasheet is a static artifact. If you publish it as a PDF and someone modifies the PDF later, you have no way to detect the change. A self-signed PAdES PKCS#7 signature is embedded directly in the PDF structure. Verifiers see the signature, can extract the SHA-256, and can confirm whether the bytes match what you published.
Self-signed certificates are the right starting point. The value is not chain-of-trust to a CA — it is integrity of the bytes against a hash you publish on your website. RSA-4096 is current practice. Use OpenSSL or open-source Python libraries like endesive or pyHanko. Total cost for a stable signing setup: an afternoon of work.
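A minimal signing sketch with pyHanko, assuming an unencrypted key and a self-signed certificate already on disk; the file names and the signature field name are placeholders, not a convention.

```python
# Sketch: sign the datasheet PDF with pyHanko and publish its SHA-256.
# File names and the signature field name are placeholders.
# The self-signed RSA-4096 key pair can be produced beforehand with, e.g.:
#   openssl req -x509 -newkey rsa:4096 -keyout dsd_key.pem -out dsd_cert.pem -days 730 -nodes
import hashlib
from pyhanko.pdf_utils.incremental_writer import IncrementalPdfFileWriter
from pyhanko.sign import signers

signer = signers.SimpleSigner.load(
    "dsd_key.pem", "dsd_cert.pem", key_passphrase=None
)

with open("datasheet-v1.3.pdf", "rb") as inf:
    writer = IncrementalPdfFileWriter(inf)
    signed = signers.sign_pdf(
        writer,
        signers.PdfSignatureMetadata(field_name="DatasheetSignature"),
        signer=signer,
    )

with open("datasheet-v1.3-signed.pdf", "wb") as outf:
    outf.write(signed.getbuffer())

# Publish this digest next to the download so anyone can verify the bytes.
with open("datasheet-v1.3-signed.pdf", "rb") as f:
    print("SHA-256:", hashlib.sha256(f.read()).hexdigest())
```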
Versioning — how to keep the datasheet fresh
Treat the datasheet as code. Bump it on every dataset release. A v1.0 datasheet shipped with v1.0 of the dataset; v1.1 ships with the v1.1 dataset and a changelog row. The changelog should answer: what sources were added or dropped, what counts changed, what preprocessing was added, and what the new SHA-256 of the dataset bundle is.
For LLM corpora that grow quarterly, a semantic changelog per release is worth more than a full diff. “DORA RTS published in March 2025 added 14 documents to eurlex_fr; no other source changed substantively.” That is the kind of summary that enterprise buyers re-read at renewal time to justify their subscription.
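One way to keep that changelog honest is to make it machine-readable and append a row per release next to the datasheet source. The field names and the changelog.json layout below are illustrative assumptions, not a standard.

```python
# Sketch: append a machine-readable changelog row per datasheet release.
# Field names and the changelog.json layout are illustrative assumptions.
import hashlib
import json
from datetime import date
from pathlib import Path

def bundle_sha256(path: str) -> str:
    """Stream the dataset bundle and return its SHA-256 hex digest."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def append_changelog(version: str, summary: str, sources_added, sources_dropped,
                     bundle_path: str, changelog_path: str = "changelog.json"):
    log_file = Path(changelog_path)
    entries = json.loads(log_file.read_text()) if log_file.exists() else []
    entries.append({
        "datasheet_version": version,
        "date": date.today().isoformat(),
        "summary": summary,              # the semantic, human-readable line
        "sources_added": sources_added,
        "sources_dropped": sources_dropped,
        "bundle_sha256": bundle_sha256(bundle_path),
    })
    log_file.write_text(json.dumps(entries, indent=2, ensure_ascii=False))
```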
Common failure modes in 2026 datasheets
From reviewing publicly shipped datasheets in late 2025 and early 2026, the recurring issues are predictable:
- Aggregate-only statistics. “5 billion tokens, mostly English.” Auditors want per-source counts and the methodology by which you compute them.
- Hand-waved licensing. A line like “all sources permit redistribution” is not enough. Each source gets a specific license name and a link to the license text.
- Missing bias section. If you cannot articulate a bias in your corpus, you have not analyzed it. Every corpus has biases. Document the visible ones, and the ones you suspect but cannot quantify, and explain why you ship anyway.
- No signature. An unsigned PDF on a marketing page is a starting point, not an audit artifact. Sign it.
- Stale changelog. A v1.2.0 datasheet that mentions changes from v0.8 to v0.9 and stops there means the maintenance section is fiction.
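The first two failure modes lend themselves to an automated self-check. A minimal sketch, assuming a machine-readable companion manifest (datasheet.json) with a "sources" list; the field names are illustrative, not a standard.

```python
# Sketch: lint a datasheet manifest for aggregate-only statistics and
# hand-waved licensing. The manifest layout is an illustrative assumption.
import json
import sys

REQUIRED_PER_SOURCE = ("name", "documents", "tokens", "license_name", "license_url")

def lint(manifest_path: str) -> list[str]:
    findings = []
    with open(manifest_path, encoding="utf-8") as f:
        sources = json.load(f).get("sources", [])
    if not sources:
        findings.append("no per-source breakdown at all (aggregate-only statistics)")
    for src in sources:
        for field in REQUIRED_PER_SOURCE:
            if not src.get(field):
                findings.append(f"source {src.get('name', '?')}: missing '{field}'")
    return findings

if __name__ == "__main__":
    problems = lint(sys.argv[1] if len(sys.argv) > 1 else "datasheet.json")
    print("\n".join(problems) or "no findings")
```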
See a working datasheet in production
Our 54-page Dataset Specification ships with every release. Signed PDF, SHA-256 published, per-source licenses, per-document audit trail. Open format you can copy.
Frequently asked questions
Is a datasheet required by the AI Act?
Not by name. Article 10 requires “appropriate data governance and management practices” and the regulation references documentation in several places. A Gebru-style datasheet is the most efficient way to demonstrate compliance with the letter and spirit of those obligations. It is not the only acceptable format.
How long should the datasheet be?
Long enough to be specific, short enough to be readable. Our own bundle is 54 pages across three documents (master DSD, Source Licenses, Glossary) for a 14-source, 2-million-document corpus. A small classifier dataset might land in 8–15 pages.
Can I publish the datasheet on a wiki instead of a PDF?
You can, but you give up signature and version stability. A wiki page changes silently. The cheapest hybrid is to maintain the source in Markdown, publish HTML pages for discoverability and search, and generate a signed PDF for every release. The PDF is the authoritative artifact.
What about confidentiality — what if some sources are under NDA?
Aggregate the confidential sources under a single named category and document the aggregation policy. That is better than not mentioning them at all. The worst case for a regulator is an undisclosed source that surfaces during audit because the model leaks something traceable.
How does this differ from a model card?
Model cards describe a trained model — intended uses, evaluation, ethical considerations. Datasheets describe a dataset. They overlap on biases and intended uses but answer different questions. A complete documentation pack ships both, with cross-references.
Keep reading
- EU AI Act Article 10 — what training data documentation actually requires: the Article 10 obligations decoded for general-purpose AI providers, with the documentation deliverables a 2026 audit expects.
- Building an audit-ready provenance trail for training datasets: PROV-O modeling, per-document JSON-LD, signed manifests. The pieces of a trail an auditor will follow.
- GDPR pseudonymization for LLM training data — patterns and pitfalls: what pseudonymization means under Article 4(5), where regex falls short, and how to report it cleanly in a datasheet.