Dataset data sheet

What is in our data products — and what is not

A single page that answers the questions your procurement, legal and AI governance teams will ask before buying. Last updated: 13 May 2026 · Version 1.0.

This page describes the content of the datasets we license commercially and the legal basis on which we share them. It does not describe how we handle personal data of website visitors and customers — see the separate Privacy Notice for that.

1. What our datasets contain

Our commercial datasets are curated training corpora built from authoritative public sources. Each release contains millions of textual documents drawn from a single subject vertical (for example: finance and regulation, public sector, legal, healthcare) or from a tailored scope agreed with a customer in a custom engagement.

Every document is the cleaned plain text of an original public publication, paired with structured metadata: source identifier, original URL, upstream identifier, extraction timestamp, language tag, computed quality features, and the licence terms under which the source publishes it.

2. Source landscape

We ingest exclusively from official public sources. The categories below describe the types of source we use; the per-release dataset card lists the exact source URLs.

National legal gazettes and codified legislation
Financial regulators and securities authorities
Central banks (working papers, policy notes, statistical bulletins)
Treasury and finance ministries
Sectoral regulators (telecom, energy, healthcare, transport)
Official scientific repositories and research-funding bodies
European Union legal instruments (regulations, directives, decisions)

We do not ingest from: social media, anonymous forums, scraped marketing copy, copyrighted books or articles outside open-access licensing, or any source whose redistribution terms are unclear.

3. Personal Data declaration

Our datasets are not designed to contain, and to the best of our knowledge do not contain, Personal Data within the meaning of the GDPR or equivalent regulations.

Our editorial criteria specifically exclude sources that publish individual-level data. The text we ship is regulatory, legislative, doctrinal or institutional in nature. Where names of public officials appear within published official acts (for example, signatories of a decree or members of a regulatory committee), they appear as part of the official record of a public function, are already public information, and are not used by us to identify or profile individuals.

We do not generate, augment or include synthetic Personal Data in our listing materials or in the datasets themselves. We do not include sensitive Personal Data (special categories under GDPR Article 9 or analogous categories under other regulations).

If a customer identifies a record that they reasonably believe contains Personal Data, we will assess it and, where appropriate, remove or redact the affected document from the next release. To submit a request, write to compliance@frenchcorpus.com.

4. Licence chain

Every document in our datasets is sourced under an open public licence and is shipped to the customer with the upstream licence preserved at the row level.

Upstream category	Typical licence	Commercial reuse
French open public data (regulators, ministries, central bank, treasury)	Licence Ouverte 2.0	Permitted with attribution
European Union legal instruments and publications	Decision (EU) 2011/833 reuse policy / CC-BY-4.0	Permitted with attribution
Other national public sources, on a case-by-case basis	National open-data licence equivalent	Documented per release

The transformations we apply to the source text — text normalisation, deduplication, quality scoring, packaging — are content-preserving. They do not generate a new copyrightable derivative work. The customer receives the dataset under the most restrictive licence among the sources included in the release, plus a clear attribution requirement.
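To illustrate what a content-preserving transformation looks like, here is a minimal sketch of text normalisation followed by exact deduplication. It is not our production pipeline: the function names `normalise` and `deduplicate`, the choice of Unicode NFC, and the whitespace-collapsing rule are assumptions made for the example.

```python
import hashlib
import re
import unicodedata


def normalise(text: str) -> str:
    """Apply Unicode NFC and collapse whitespace runs; the wording is untouched.

    Assumption: collapsing all whitespace is a simplification for this sketch.
    """
    text = unicodedata.normalize("NFC", text)
    return re.sub(r"\s+", " ", text).strip()


def deduplicate(documents: list[str]) -> list[str]:
    """Keep the first occurrence of each document, comparing normalised text."""
    seen: set[str] = set()
    unique: list[str] = []
    for doc in documents:
        cleaned = normalise(doc)
        digest = hashlib.sha256(cleaned.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(cleaned)
    return unique


# Two of these three strings differ only in spacing.
docs = [
    "Article 1.  The  decree applies.",
    "Article 1. The decree applies.",
    "Article 2.",
]
unique_docs = deduplicate(docs)
print(len(unique_docs))
```

Neither step adds, paraphrases or reorders wording, which is the sense in which such transformations preserve content.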

5. Intellectual property and authorisation

FINALEADS LLC, operating as French Corpus LLM, confirms that it holds all rights necessary to share and sell each dataset under the licences declared in the dataset card shipped with the release. We do not include infringing or unlawful material in our datasets, listing materials or marketing assets. Where we use third-party content in promotional materials (graphics, fonts, imagery), we hold the corresponding licences.

6. Schema and fields

Each shard in a release carries columns from the following groups. The exact schema for a given release is documented in the dataset card.

Group	Field examples	Purpose
Identification	doc id, source, URL, upstream identifier, extraction timestamp	Per-row provenance & reproducibility
Content	cleaned text, language tag, language confidence, character and token counts	The data your model trains on
Quality scoring	statistical features (repetition, charset balance, lexical diversity), composite quality score, quality tier	Filter rows to your training budget
Optional governance	LLM-judged scores, sub-vertical labels (where ordered)	Tier-specific premium signals
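As an illustration of filtering rows to a training budget, the sketch below greedily selects the highest-scoring rows that fit a token budget. The column names `doc_id`, `quality_score` and `token_count` are hypothetical stand-ins, and the function `select_within_budget` is not part of any release; the dataset card gives the actual column names and score range.

```python
def select_within_budget(rows, token_budget, min_score=0.0):
    """Greedily pick the highest-scoring rows whose token counts fit the budget.

    Rows that would overflow the budget are skipped, not truncated.
    """
    eligible = [r for r in rows if r["quality_score"] >= min_score]
    eligible.sort(key=lambda r: r["quality_score"], reverse=True)
    selected, used = [], 0
    for row in eligible:
        if used + row["token_count"] <= token_budget:
            selected.append(row)
            used += row["token_count"]
    return selected, used


# Hypothetical rows; real shards carry many more columns.
rows = [
    {"doc_id": "a", "quality_score": 0.9, "token_count": 500},
    {"doc_id": "b", "quality_score": 0.4, "token_count": 300},
    {"doc_id": "c", "quality_score": 0.7, "token_count": 800},
    {"doc_id": "d", "quality_score": 0.8, "token_count": 400},
]
selected, used = select_within_budget(rows, token_budget=1000, min_score=0.5)
print([r["doc_id"] for r in selected], used)
```

A greedy pass is simple and predictable but not optimal. A customer with a strict budget might prefer to filter by quality tier first and then sample within the tier.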

7. Refresh frequency

Sample tier: snapshot, not refreshed.
Limited Trial tier: Snowflake-only, login-gated, 30-day evaluation window — refresh tied to the same release as Standard.
Standard tier: quarterly refresh.
Premium tier: monthly refresh, with delta feeds available on request.
Enterprise tier: monthly refresh by default, with customer-specific cadence available.
Custom engagements: cadence agreed in the statement of work.

Each refresh produces a new versioned release with its own dataset card. Customers receive the new release alongside the prior one for a transition period, after which the older release is archived.

8. Region availability

Datasets are built and curated on infrastructure hosted in the European Union (France). The data delivery region depends on the channel:

Snowflake Marketplace: available in selected Snowflake regions; the deployment region for each listing is stated in the listing description on Snowflake.
Direct procurement: delivery via signed URLs from an object storage bucket in a customer-chosen region (EU by default, US or other region on request).
Enterprise tier: private replication into the customer’s own data platform region available on request.

9. Format and delivery

Parquet (Apache Arrow-compatible, zstd-compressed) — primary format for bulk loading into data warehouses and model training pipelines.
JSON Lines (gzip-compressed) — secondary format for streaming and traditional NLP tooling.
Per-release manifest, dataset card, licence chain document and audit-trail summary — all in human-readable Markdown alongside the data.
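For the JSON Lines format, a reader needs only the Python standard library. The sketch below writes a tiny stand-in shard and reads it back. The file name `shard-00000.jsonl.gz` and the field names `doc_id`, `text` and `language` are assumptions for the example, not the shipped schema.

```python
import gzip
import json
import tempfile
from pathlib import Path


def iter_jsonl_gz(path):
    """Yield one dict per non-empty line of a gzip-compressed JSON Lines file."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


# Write a tiny stand-in shard so the example runs without real data.
sample_rows = [
    {"doc_id": "doc-1", "text": "Article 1.", "language": "fr"},
    {"doc_id": "doc-2", "text": "Article 2.", "language": "fr"},
]
with tempfile.TemporaryDirectory() as tmp:
    shard_path = Path(tmp) / "shard-00000.jsonl.gz"
    with gzip.open(shard_path, "wt", encoding="utf-8") as handle:
        for row in sample_rows:
            handle.write(json.dumps(row, ensure_ascii=False) + "\n")
    loaded = list(iter_jsonl_gz(shard_path))

print(len(loaded))
```

Because the reader is a generator, a large shard can be processed row by row without loading it into memory. Parquet shards would typically be read with a columnar library instead, for example `pyarrow.parquet.read_table` from pyarrow.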

10. Audit trail shipped with every release

Dataset card — complete description of the release: composition, statistics, schema, intended use, known limitations.
Licence chain document — per-source licence, attribution requirements and the overall licence the customer receives.
Schema dictionary — every column, its type and its source stage.
Statistics — document counts, character and token estimates, source mix.
Pipeline integrity hash — a deterministic identifier of the exact pipeline code that built the release, enabling end-to-end auditability.

These documents are designed to plug into your AI governance workflow and to support the data-governance and transparency expectations of the EU AI Act for general-purpose model training data.
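One way a deterministic identifier of pipeline code can be computed is to hash every source file in a fixed order. The sketch below is an illustration under stated assumptions, not the scheme used for our releases: the function name `pipeline_hash`, the choice of SHA-256, and the restriction to `*.py` files are all hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def pipeline_hash(root: Path) -> str:
    """Return a SHA-256 digest over every *.py file under root.

    Files are visited in sorted order of their relative POSIX path, and both
    the path and the bytes are hashed, so renames and edits change the digest.
    """
    files = sorted(
        (p.relative_to(root).as_posix(), p)
        for p in root.rglob("*.py")
        if p.is_file()
    )
    digest = hashlib.sha256()
    for rel, path in files:
        digest.update(rel.encode("utf-8"))
        digest.update(b"\0")
        digest.update(path.read_bytes())
        digest.update(b"\0")
    return digest.hexdigest()


# Demonstrate on a throwaway directory with two stand-in pipeline files.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "clean.py").write_text("STEP = 'normalise'\n", encoding="utf-8")
    (root / "score.py").write_text("STEP = 'score'\n", encoding="utf-8")
    first = pipeline_hash(root)
    second = pipeline_hash(root)
    (root / "score.py").write_text("STEP = 'score-v2'\n", encoding="utf-8")
    changed = pipeline_hash(root)

print(first == second, first != changed)
```

The same code always yields the same digest, and any edit yields a different one. An auditor can therefore check that two releases were built by identical code without reading the code itself.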

11. Acceptable use and restrictions

Customers may use the datasets for lawful purposes consistent with the licence terms shipped in the release. We expect customers to:

Preserve the per-document attribution required by the upstream source.
Comply with any AI-specific obligation that applies to their downstream use (notably, the EU AI Act where applicable).
Refrain from re-aggregating the data with other sources in a way that would defeat the source attribution chain.
Refrain from any attempt to identify or profile individuals named incidentally in the published official text.
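As an aid to the attribution expectation above, a customer could derive a summary notice from the per-row provenance fields. The sketch below is a minimal illustration: the field names `source` and `licence`, the source names prefixed `example-`, and the function `attribution_notice` are hypothetical. The summary complements the per-row fields and does not replace them.

```python
from collections import Counter


def attribution_notice(rows):
    """Return one sorted attribution line per distinct (source, licence) pair."""
    counts = Counter((row["source"], row["licence"]) for row in rows)
    return [
        f"{source} ({licence}): {count} document(s)"
        for (source, licence), count in sorted(counts.items())
    ]


# Hypothetical rows carrying only the two fields the sketch needs.
rows = [
    {"source": "example-regulator", "licence": "Licence Ouverte 2.0"},
    {"source": "example-eu-publications", "licence": "CC-BY-4.0"},
    {"source": "example-regulator", "licence": "Licence Ouverte 2.0"},
]
notice = attribution_notice(rows)
for line in notice:
    print(line)
```

Keeping the row-level `source` and `licence` values through any downstream merge keeps such a notice reproducible, and helps avoid re-aggregation that breaks the attribution chain.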

12. Questions about a specific release

For technical questions about a specific dataset (schema, sources included, sample access, AI Act integration), write to compliance@frenchcorpus.com. For commercial questions, write to contact@frenchcorpus.com.

This Data Sheet describes the standard properties of our data products. Any individual release is governed by the terms in its dataset card and by the commercial agreement signed with the customer.

Available on 🤗 Hugging Face Datasets 📄 Signed Dataset Specification ❄ Snowflake Marketplace · live
