Source Licences — French Premium Web Corpus v1.2.0

Full enumeration of every licence applicable to source materials in release fpwc-v1.2.0-2026-05-15, with the producer’s interpretation of each licence’s commercial-reuse permissions and obligations.

Companion documents: Master Dataset Specification · Glossary


Source licences — French Premium Web Corpus v1.1.0

Document scope. This document enumerates every licence applicable to source materials in release fpwc-v1.1.0-2026-05-14, with the producer’s interpretation of each licence’s commercial-reuse permissions and obligations. It is a companion to the master DATASET_SPECIFICATION.md.

Disclaimer. The interpretations below are the producer’s good-faith reading of the relevant texts and applicable case law as of 2026-05-14. They are not legal advice. A consumer planning a commercial deployment of an AI system trained on this corpus should obtain its own legal opinion.


1. Licence Ouverte 2.0 (Etalab)

Full title: Licence Ouverte 2.0 / Open Licence 2.0
Issuer: Etalab, French inter-ministerial mission for state open data, under the Direction interministérielle du numérique (DINUM).
Official text: https://www.etalab.gouv.fr/licence-ouverte-open-licence
English translation (official): https://github.com/etalab/licence-ouverte/blob/master/LO.en.md

1.1 Applicable sources in the corpus

The Licence Ouverte 2.0 applies to the following sources in the corpus:

  • legifrance_jorf — DILA Journal officiel de la République française
  • legifrance_legi — DILA Legifrance LEGI consolidated codes
  • bofip — DGFiP BOFiP-Impôts tax doctrine
  • acpr — ACPR doctrine and sanction decisions
  • amf — AMF doctrine (planned for v1.2 inclusion)
  • bdf — Banque de France institutional publications (subject to producer verification per publication; see §1.5 below)
  • dgtresor — DGTrésor Trésor-Info publications

1.2 Permissions granted by Licence Ouverte 2.0

Subject to the obligations in §1.3, the licence grants a free, non-exclusive, worldwide, perpetual right to:

  • Reproduce, copy, publish, and transmit the information.
  • Distribute and redistribute the information.
  • Adapt, modify, extract, and transform the information, including for the creation of derivative works.
  • Commercialize the information, including by combining it with other information or by including it in a product or service.

Both natural and legal persons are explicitly contemplated as licensees. The licence is irrevocable for as long as the licensee complies with its obligations.

1.3 Obligations imposed by Licence Ouverte 2.0

The licensee must:

  • Acknowledge the source of the information at each reuse, by citing the original producer and the date of the last update of the information, in a manner that does not suggest the original producer endorses or recommends the reuse.
  • Make the information available under the same licence (Licence Ouverte 2.0) when redistributing it, or under a licence compatible with Licence Ouverte 2.0 (the licence text explicitly lists CC-BY 2.0, CC-BY 3.0, and CC-BY 4.0 as compatible).

The licence does not require:

  • The publication of derivative works under the same licence (only redistribution of the original information requires this — derivative works can be distributed under any licence).
  • Notification to the original producer of intent to reuse or redistribute.
  • Payment of royalties.

1.4 Producer’s interpretation for corpus context

The French Premium Web Corpus is a derivative work that combines selected, filtered, deduplicated, quality-scored, and contextually classified extractions from Licence Ouverte 2.0 sources. The producer interprets the licence as permitting:

  • The construction of the corpus itself — derivation, transformation, and combination with non-Licence-Ouverte content (such as EUR-Lex content) is explicitly permitted.
  • The commercial distribution of the corpus to paying customers — commercial reuse is explicitly permitted.
  • The licensing of the corpus to customers under producer-defined terms — derivative works may be distributed under any licence.

The producer satisfies the attribution obligation by:

  • Recording the source identifier in every row of the corpus (source column).
  • Recording the original document URL where applicable (url column).
  • Recording the source provider in this Licences document and in the per-tier LICENSE.md autogenerated documentation.
  • Stating the date of capture for each document (capture_date column).
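
As an illustration of how the per-row attribution fields listed above could be checked before a release, here is a minimal Python sketch. The column names source, url, and capture_date and the source identifiers come from this document (§1.1 and §2.1). The function name, the dict-based row representation, and the ISO 8601 format for capture_date are assumptions made for the example, not a description of the producer's actual pipeline.

```python
from datetime import date

# Source identifiers listed in §1.1 and §2.1 of this document.
KNOWN_SOURCES = {
    "legifrance_jorf", "legifrance_legi", "bofip", "acpr", "amf",
    "bdf", "dgtresor", "eurlex_fr", "eurlex_playwright",
}


def attribution_problems(row: dict) -> list[str]:
    """Return the attribution problems found in one corpus row (empty if none)."""
    problems = []

    source = row.get("source")
    if not source:
        problems.append("missing source")
    elif source not in KNOWN_SOURCES:
        problems.append(f"unknown source: {source}")

    # Assumption: capture_date is stored as an ISO 8601 string (YYYY-MM-DD).
    try:
        date.fromisoformat(row.get("capture_date"))
    except (TypeError, ValueError):
        problems.append("missing or malformed capture_date")

    # The url column is filled only "where applicable", so an absent url is
    # accepted; a present one should at least look like a web address.
    url = row.get("url")
    if url is not None and not url.startswith(("http://", "https://")):
        problems.append("malformed url")

    return problems
```

A row passes when the returned list is empty; any other result names the fields that would need correcting before the row can carry a usable attribution.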

The producer does not redistribute the underlying Licence Ouverte 2.0 information under a different licence. The corpus distribution constitutes a derivative work, and the producer’s commercial licence to corpus customers does not purport to relicense the underlying Licence Ouverte 2.0 content.

1.5 Banque de France caveat

Banque de France publications are made available under multiple licensing regimes depending on the publication series. The corpus includes only publications explicitly designated as institutional publications under Licence Ouverte 2.0 (annual reports, working papers, projections-economiques, debats-economiques, rapport-investissement-responsable). Statistical data series and confidential supervisory data are excluded.

1.6 Verification

The producer reviewed the Licence Ouverte 2.0 designation on the DILA, DGFiP, ACPR, AMF, BdF, and DGTrésor portals on 2026-05-08. No revocation or modification has been observed as of 2026-05-14. The producer will re-verify on each monthly release cycle.


2. EU Commission Decision 2011/833/EU

Full title: Commission Decision of 12 December 2011 on the reuse of Commission documents (2011/833/EU, OJ L 330, 14.12.2011)
Issuer: European Commission
Official text: http://data.europa.eu/eli/dec/2011/833/oj
Successor framework: Reused and amended by Commission Decision (EU) 2025/123 of 30 January 2025, which updates the framework but preserves the core open-reuse principles for documents that are not subject to specific exclusion.

2.1 Applicable sources in the corpus

This Decision (and its successor) applies to the following sources in the corpus:

  • eurlex_fr — EUR-Lex regulations and directives in French (via the upstream joelniklaus/eurlex_resources HF dataset, which itself relies on this Decision)
  • eurlex_playwright — Post-2022 EU acts extracted directly from eur-lex.europa.eu (DORA, MiCA, AI Act, CSRD, ESRS, AMLD6, MiFID II review, CRR3, CRD VI, NIS2)

2.2 Permissions granted by Decision 2011/833/EU

Article 2 of the Decision grants the right to reuse Commission documents, including for commercial purposes, without restriction other than:

  • The obligation to acknowledge the source.
  • The obligation not to distort the original meaning of the documents.
  • Confirmation that the Commission cannot be held liable for any consequence stemming from the reuse.

Reuse for commercial purposes (Article 2(2)) is explicitly permitted. The producer reads this permission as extending to the training of AI systems and the construction of AI training datasets.

2.3 Exclusions under the Decision

The Decision does not apply to:

  • Documents containing personal data subject to higher-tier protection (the corpus handles such data under the personal data handling policy in §4.2 of the Dataset Specification).
  • Documents for which intellectual property rights are held by third parties (the corpus does not include such documents).
  • Documents that fall under Article 4 (security exclusions, military exclusions) — none of these apply to the EU legislative texts in the corpus.

2.4 Producer’s interpretation for corpus context

The producer interprets the Decision (and its successor 2025/123) as permitting use of the EU legislative and regulatory texts in the corpus, including for commercial AI training data distribution, subject only to the conditions listed in §2.2. The acknowledgement obligation is satisfied by per-row source tagging and by reference in this Licences document.

2.5 Upstream provenance via Hugging Face

For documents sourced from joelniklaus/eurlex_resources, the upstream dataset is itself published under CC-BY-4.0 (see §3 below) on Hugging Face Hub. The producer treats the upstream HF dataset as the immediate provenance, while recognising that the underlying content remains EU public sector documents subject to Decision 2011/833/EU.
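
The layering described above can be summarised as a lookup from source identifier to the licences the producer associates with it. The sketch below is illustrative only: the source identifiers and licence names are taken from §1.1, §2.1, and this section, while the dictionary layout and the function name licences_for are hypothetical.

```python
LICENCE_OUVERTE = "Licence Ouverte 2.0"
EU_DECISION = "Decision 2011/833/EU"
CC_BY = "CC-BY-4.0"

# Immediate provenance first, underlying regime second.
SOURCE_LICENCES = {
    "legifrance_jorf": (LICENCE_OUVERTE,),
    "legifrance_legi": (LICENCE_OUVERTE,),
    "bofip": (LICENCE_OUVERTE,),
    "acpr": (LICENCE_OUVERTE,),
    "amf": (LICENCE_OUVERTE,),
    "bdf": (LICENCE_OUVERTE,),
    "dgtresor": (LICENCE_OUVERTE,),
    # Obtained through the upstream Hugging Face dataset, so both layers apply.
    "eurlex_fr": (CC_BY, EU_DECISION),
    # Extracted directly from eur-lex.europa.eu, so only the Decision applies.
    "eurlex_playwright": (EU_DECISION,),
}


def licences_for(source: str) -> tuple[str, ...]:
    """Return the licences recorded for a source identifier."""
    try:
        return SOURCE_LICENCES[source]
    except KeyError:
        raise ValueError(f"no licence recorded for source {source!r}") from None
```

Raising on an unknown identifier, rather than returning an empty tuple, is a deliberate choice in this sketch: a row without a recorded licence should stop a release instead of passing silently.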


3. Creative Commons Attribution 4.0 International (CC-BY-4.0)

Full title: Creative Commons Attribution 4.0 International Public License
Issuer: Creative Commons Corporation
Official text: https://creativecommons.org/licenses/by/4.0/legalcode

3.1 Applicable sources in the corpus

CC-BY-4.0 applies to the corpus content sourced from the joelniklaus/eurlex_resources Hugging Face dataset, which is published by Joel Niklaus and contributors under CC-BY-4.0.

3.2 Permissions granted by CC-BY-4.0

The licence grants a worldwide, royalty-free, non-sublicensable, non-exclusive, irrevocable right to:

  • Share — copy and redistribute the material in any medium or format.
  • Adapt — remix, transform, and build upon the material.

These rights are granted for any purpose, including commercial use.

3.3 Obligations imposed by CC-BY-4.0

The licensee must:

  • Give appropriate credit to the original creator (Joel Niklaus and contributors of joelniklaus/eurlex_resources).
  • Provide a link to the licence (https://creativecommons.org/licenses/by/4.0/).
  • Indicate if changes were made to the material.
  • Not apply legal terms or technological measures that legally restrict others from doing anything the licence permits.

3.4 Producer’s interpretation for corpus context

The producer credits Joel Niklaus and the contributors of joelniklaus/eurlex_resources in this Licences document and in the per-tier LICENSE.md autogenerated documentation. The producer indicates that the content has been adapted (filtered, deduplicated, quality-scored) for the corpus distribution.

The producer’s commercial licence to corpus customers does not purport to revoke or limit the customers’ rights under CC-BY-4.0 with respect to the underlying content; customers receiving the corpus retain their rights to the CC-BY-4.0 content under the original licence terms.


4. Apache License 2.0 (Qwen and DistilCamemBERT)

Full title: Apache License, Version 2.0
Issuer: The Apache Software Foundation
Official text: https://www.apache.org/licenses/LICENSE-2.0

4.1 Applicable to

The Apache 2.0 licence applies to the model weights used in the producer’s pipeline, not to the corpus content itself:

  • Qwen2.5-7B-Instruct by Alibaba Cloud, used as the LLM-judge teacher in stage 4 of the pipeline.
  • DistilCamemBERT-base by cmarkea, used as the student model in stage 4b distillation.

4.2 Producer’s interpretation

In the producer’s reading, the Apache 2.0 licence permits the use of Apache 2.0 model weights to produce scoring annotations on third-party content (the corpus). The resulting annotations are not subject to the Apache 2.0 licence; they are independent derivations.

The producer does not redistribute the Qwen or DistilCamemBERT model weights. They are used solely at training/inference time on the producer’s infrastructure.


5. Composite licensing of the corpus output

The corpus as a whole is a composite work combining:

  • Content under Licence Ouverte 2.0 (~98 % of documents by count, ~85 % by tokens)
  • Content under EU Decision 2011/833/EU (for eurlex_fr, incorporated via the CC-BY-4.0 upstream dataset; ~2 % by count, ~15 % by tokens)
  • Producer-generated derivations (quality scores, sub-vertical classifications, distil scores, JSON-LD provenance records)
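
The two kinds of share quoted above (by document count and by tokens) can be derived from per-row data. The following Python sketch shows one way to do so. It assumes each row exposes a source column, as described in §1.4, and a token_count column, which is a hypothetical name not confirmed by this document; licence_of is a caller-supplied function mapping a source identifier to a licence label.

```python
from collections import defaultdict


def licence_shares(rows, licence_of):
    """Return {licence label: (share of documents, share of tokens)}.

    rows: iterable of dicts with 'source' and 'token_count' keys
          ('token_count' is an assumed column name).
    licence_of: callable mapping a source identifier to a licence label.
    """
    docs = defaultdict(int)
    tokens = defaultdict(int)
    for row in rows:
        label = licence_of(row["source"])
        docs[label] += 1
        tokens[label] += row["token_count"]

    total_docs = sum(docs.values())
    total_tokens = sum(tokens.values())
    if total_docs == 0 or total_tokens == 0:
        raise ValueError("cannot compute shares for an empty corpus")

    return {
        label: (docs[label] / total_docs, tokens[label] / total_tokens)
        for label in docs
    }
```

With three 100-token documents from a Licence Ouverte source and one 300-token document from an EUR-Lex source, the function reports 75 % and 25 % by count but 50 % each by tokens, which is the same kind of divergence between the two measures that appears in the figures above.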

5.1 Producer’s licensing model

The producer distributes the corpus under tier-specific commercial licences:

  • Sample tier — free preview under producer terms (no commercial redistribution; evaluation only). Underlying public-sector content remains under Licence Ouverte 2.0 and Decision 2011/833/EU.
  • Standard / Premium tiers — commercial licence to train AI systems; redistribution of the corpus by the licensee is prohibited but the licensee retains all rights under the upstream Licence Ouverte 2.0, Decision 2011/833/EU, and CC-BY-4.0 with respect to the underlying content.
  • Enterprise tier — broader commercial rights, including the right for the enterprise customer to use derived datasets internally; the enterprise customer also retains upstream rights.

5.2 Attribution wording suggested for licensees

Licensees of the corpus who publish AI systems or derived models trained on the corpus are encouraged, though not required by the producer, to include the following acknowledgement in their model card or system documentation:

This system has been trained, in part or in whole, on the French Premium Web Corpus (release fpwc-v1.1.0-2026-05-14), published by FINALEADS LLC, which combines content from the French Open Data programme (Licence Ouverte 2.0, sources: DILA, DGFiP, ACPR, AMF, Banque de France, DGTrésor) and from the European Union public-sector reuse framework (Decision 2011/833/EU). The corpus is distributed by its producer under a commercial licence.
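
For licensees who generate model cards programmatically, the acknowledgement above can be kept as a template with the release identifier filled in at build time. This is a minimal sketch: the wording is copied from the suggested acknowledgement, while the function name and the check on the fpwc- prefix (inferred from the release identifiers quoted in this document) are assumptions.

```python
ACKNOWLEDGEMENT_TEMPLATE = (
    "This system has been trained, in part or in whole, on the French Premium "
    "Web Corpus (release {release}), published by FINALEADS LLC, which combines "
    "content from the French Open Data programme (Licence Ouverte 2.0, sources: "
    "DILA, DGFiP, ACPR, AMF, Banque de France, DGTrésor) and from the European "
    "Union public-sector reuse framework (Decision 2011/833/EU). The corpus is "
    "distributed by its producer under a commercial licence."
)


def acknowledgement(release: str) -> str:
    """Fill the suggested acknowledgement with a release identifier."""
    # Assumption: release identifiers start with "fpwc-", as in this document.
    if not release.startswith("fpwc-"):
        raise ValueError(f"unexpected release identifier: {release!r}")
    return ACKNOWLEDGEMENT_TEMPLATE.format(release=release)
```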


6. Future verification cadence

The producer commits to re-verifying every licence designation at each monthly release cycle. If an upstream provider withdraws or modifies a licence designation, the affected content will be removed from the next release, within 30 days of receipt of formal notice, as detailed in §10.4 of the Dataset Specification.
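
The 30-day window can be tracked with simple date arithmetic. The sketch below assumes the window is counted in calendar days from the date the formal notice is received; that reading, and the function names, are assumptions, and §10.4 of the Dataset Specification remains the authoritative description.

```python
from datetime import date, timedelta

# Assumption: the 30 days are calendar days counted from receipt of notice.
REMOVAL_WINDOW = timedelta(days=30)


def removal_deadline(notice_received: date) -> date:
    """Latest date by which the affected content should be removed."""
    return notice_received + REMOVAL_WINDOW


def is_overdue(notice_received: date, today: date) -> bool:
    """True if the removal deadline has already passed on the given day."""
    return today > removal_deadline(notice_received)
```

For example, a notice received on 2026-05-14 gives a deadline of 2026-06-13 under this reading.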


Questions about licensing, including requests for clarification or for redistribution permission beyond the producer’s tier terms, should be addressed to compliance@frenchcorpus.com.