Glossary — French Premium Web Corpus v1.1.0
Document scope. Definitions of key technical, legal, and regulatory terms used in the master DATASET_SPECIFICATION.md and its companion documents. The glossary is intended to enable a non-specialist auditor or compliance officer to read the specification without external reference, while preserving the precision required for an Article 10 documentation artefact.
A
ACPR — Autorité de contrôle prudentiel et de résolution. The French prudential supervisory authority for banking, insurance, and resolution, established within the Banque de France. The ACPR publishes positions, recommendations, notices, and sanction decisions. Source acpr in the corpus.
AI Act — Regulation (EU) 2024/1689 of the European Parliament and of the Council of 13 June 2024 laying down harmonised rules on artificial intelligence and amending certain Union legislative acts. The EU framework for risk-tiered regulation of AI systems. Article 10 of the AI Act sets out data governance obligations for high-risk AI systems, which this Dataset Specification Document is designed to support.
AMF — Autorité des marchés financiers. The French financial markets supervisor, with jurisdiction over securities markets, asset management, and listed-company disclosures. Source amf in the corpus.
AMLD6 — Sixth Anti-Money Laundering Directive, Directive (EU) 2024/1640. Recent EU directive captured via the eurlex_playwright extractor.
Annex III — Annex III of the AI Act, listing the use cases that classify an AI system as “high-risk.” See §2.3 of the Dataset Specification for the corpus’s intended-purpose mapping to Annex III categories.
Apache 2.0 — Open-source software licence under which the Qwen2.5-7B-Instruct and DistilCamemBERT model weights used in the pipeline are distributed. See LICENSES.md §4.
Arrow — Apache Arrow, an in-memory columnar data representation used by Hugging Face datasets for cache and by PyArrow for Parquet read/write. The corpus is delivered in Parquet (on-disk) but consumers typically materialise to Arrow at training time.
Audit trail — In the context of this corpus, the set of per-document JSON-LD provenance records using the W3C PROV-O vocabulary, stored at s7_audit_trail/. See §5.9 of the Dataset Specification.
B
BdF — Banque de France. The French central bank, which also operates ACPR as its prudential supervision arm. Source bdf in the corpus.
Bias analysis — A required practice under Article 10(2)(f) of the AI Act. The corpus’s bias analysis is presented in §8 of the Dataset Specification.
BOFiP — Bulletin officiel des Finances publiques – Impôts. The official compendium of French tax doctrine published by the Direction générale des Finances publiques (DGFiP). Source bofip in the corpus.
C
CC-BY-4.0 — Creative Commons Attribution 4.0 International Public License. The licence under which the upstream joelniklaus/eurlex_resources Hugging Face dataset is distributed. See LICENSES.md §3.
CELEX — Communitatis Europeae Lex. The identifier scheme for EU legal documents in EUR-Lex, in the format 32024R1689 (sector code, year, document type, document number). The AI Act has CELEX 32024R1689.
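The common CELEX shape described above can be split mechanically. A minimal sketch (illustrative only: real CELEX numbers have further variants, e.g. consolidated versions, that this regex does not cover):

```python
import re

# Common CELEX form: sector digit, four-digit year, document-type letter,
# four-digit document number.
CELEX_RE = re.compile(r"^(?P<sector>\d)(?P<year>\d{4})(?P<type>[A-Z])(?P<number>\d{4})$")

def parse_celex(celex: str) -> dict:
    m = CELEX_RE.match(celex)
    if m is None:
        raise ValueError(f"not a recognised CELEX number: {celex!r}")
    return m.groupdict()

# The AI Act: sector 3 (legislation), year 2024, type R (regulation), number 1689.
ai_act = parse_celex("32024R1689")
```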
CGI — Code général des impôts. French general tax code, captured in the legifrance_legi source.
CMF — Code monétaire et financier. French monetary and financial code, captured in the legifrance_legi source.
Composite quality score — In stage 4 of the pipeline, a weighted combination of coherence, legal density, finance value, and inverse AI-slop signal, predicted by the distilled DistilCamemBERT classifier. Used to define the Premium tier’s quality threshold. See §5.6 of the Dataset Specification.
Continued pre-training — In LLM training terminology, the practice of resuming the autoregressive language-modelling pre-training of a base model on a new, typically domain-specific, corpus. The FPWC is designed in part for this use case. See §2.1.
CRD VI — Capital Requirements Directive VI, Directive (EU) 2024/1619. Recent EU directive captured via eurlex_playwright.
CRR3 — Capital Requirements Regulation 3, Regulation (EU) 2024/1623. Recent EU regulation captured via eurlex_playwright.
CSRD — Corporate Sustainability Reporting Directive, Directive (EU) 2022/2464. Captured via eurlex_playwright.
D
Dataset Specification Document — The artefact this glossary supports. Per Article 11 of the AI Act and Annex IV, a structured document required for high-risk AI systems describing the training data sets.
Deduplication — In stage 3 of the pipeline, the process of identifying and removing near-duplicate documents using MinHash LSH. The corpus deduplication drop rate is 32.2 % of stage 2 output. See §5.3 of the Dataset Specification.
Delta Sharing — An open protocol for secure data sharing across organisations, developed by Databricks and supported as a distribution mechanism by Snowflake, AWS, and others. The corpus is planned for Delta Sharing distribution post-marketplace approval.
DGFiP — Direction générale des Finances publiques. The French general directorate of public finances, publisher of the BOFiP. Provider of the bofip source.
DGTrésor — Direction générale du Trésor. The French general directorate of the Treasury, publisher of Trésor-Info. Source dgtresor in the corpus.
DILA — Direction de l’information légale et administrative. The French government’s official directorate for legal and administrative publishing. Publisher of the Journal Officiel and the Legifrance LEGI codes. Provider of the legifrance_jorf and legifrance_legi sources.
DistilCamemBERT — A distilled French-language BERT model produced by cmarkea, Apache 2.0 licensed. Used in stage 4b of the pipeline as the student model that learns from Qwen2.5-7B teacher labels. See §5.6.
DORA — Digital Operational Resilience Act, Regulation (EU) 2022/2554. Captured via eurlex_playwright.
DoRA — Weight-Decomposed Low-Rank Adaptation. A LoRA variant. Mentioned only in companion blog content, not used in the pipeline.
E
EBA — European Banking Authority. EU banking authority; its output is represented in the corpus only indirectly, via the EU regulations it implements.
ECB — European Central Bank. Present in the corpus indirectly via EU monetary regulations.
EIOPA — European Insurance and Occupational Pensions Authority. Present in the corpus indirectly via EU insurance regulations.
Erasure — Under GDPR Article 17, the right of a data subject to obtain the deletion of personal data concerning them. The corpus’s erasure procedure is detailed in §10.3 of the Dataset Specification.
ESMA — European Securities and Markets Authority. Present in the corpus indirectly via EU securities regulations.
ESRS — European Sustainability Reporting Standards, Regulation (EU) 2023/2772. Captured via eurlex_playwright.
EU AI Act — See AI Act.
EUR-Lex — The European Union’s official online portal for EU law. The corpus’s eurlex_fr source contains French-language EU regulations and directives 1973–2022 (via the upstream HF dataset), and eurlex_playwright covers post-2022 acts.
F
FastText LID-176 — Facebook AI Research’s language identification model supporting 176 languages, used in stage 2 of the pipeline for per-document language identification.
FPWC — French Premium Web Corpus, the short identifier for this corpus.
G
Gap analysis — A required practice under Article 10(2)(h) of the AI Act. The corpus’s gap analysis is presented in §9 of the Dataset Specification.
GDPR — General Data Protection Regulation, Regulation (EU) 2016/679. Personal data handling in the corpus is governed by GDPR alongside the AI Act.
Gold-set audit — Methodology for sampling-based quality verification where a held-out reference set is manually annotated and used as ground truth. Mentioned in companion blog content as standard practice.
H
Hash chain — A sequence of content hashes where each member includes the hash of its predecessor, making any tampering immediately detectable. The corpus uses hash chains across pipeline stages to ensure cryptographic verifiability. See BIAS_ANALYSIS.md and §1 of the Dataset Specification.
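The chaining property can be sketched in a few lines (illustrative only; the corpus's actual chaining scheme is the one defined in the Dataset Specification):

```python
import hashlib

# Each stage digest covers its own payload plus the previous digest, so
# editing any earlier stage invalidates every later digest.
def chain(payloads: list[bytes]) -> list[str]:
    digests: list[str] = []
    prev = b""
    for payload in payloads:
        digest = hashlib.sha256(prev + payload).hexdigest()
        digests.append(digest)
        prev = digest.encode("ascii")
    return digests

stages = [b"s1:raw", b"s2:filtered", b"s3:deduplicated"]
original = chain(stages)
tampered = chain([b"s1:raw(modified)", b"s2:filtered", b"s3:deduplicated"])
# Tampering with stage 1 changes every subsequent digest, not just the first.
assert original[0] != tampered[0] and original[2] != tampered[2]
```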
High-risk AI system — A category of AI system defined in Article 6 and Annex III of the AI Act, subject to the heaviest documentation, governance, and conformity-assessment obligations.
Hugging Face — Online platform and software ecosystem for sharing models and datasets. The producer uses HF Hub indirectly (for the upstream joelniklaus/eurlex_resources dataset) but does not publish the FPWC on the public HF Hub.
I
IAA — Inter-annotator agreement. A measure of consistency between two or more independent annotators of the same data. Mentioned in §7 of the Dataset Specification as a methodology not yet performed with independent reviewers.
Intended purpose — In AI Act terminology (Article 3(12)), the use for which an AI system is intended by its provider. The FPWC’s intended purposes are listed in §2.1 of the Dataset Specification.
J
JORF — Journal officiel de la République française. The official gazette of the French Republic, publishing laws, decrees, ministerial orders, and appointments. Source legifrance_jorf in the corpus.
JSON-LD — JavaScript Object Notation for Linked Data. A serialisation format for Linked Data within JSON. Used in the corpus’s audit trail to represent W3C PROV-O provenance records.
JSONL — JSON Lines. A line-oriented format where each line is a valid JSON object. The Sample tier ships as JSONL; per-document audit trail records ship as gzipped JSONL.
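Reading and writing a gzipped JSONL shard needs only the standard library. A self-contained round-trip sketch (record fields here are illustrative, not the corpus schema):

```python
import gzip
import io
import json

records = [{"doc_id": "a"}, {"doc_id": "b"}]

# Write: one JSON object per line, gzip-compressed.
buf = io.BytesIO()
with gzip.open(buf, "wt", encoding="utf-8") as f:
    for rec in records:
        f.write(json.dumps(rec, ensure_ascii=False) + "\n")

# Read: decompress and parse line by line.
buf.seek(0)
with gzip.open(buf, "rt", encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f]
assert loaded == records
```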
L
LEGI — The DILA archive series for the consolidated French legal codes. Source legifrance_legi in the corpus.
Licence Ouverte 2.0 — Open Licence 2.0, the French government’s open licensing scheme published by Etalab. The licence covering most French public-sector content in the corpus. See LICENSES.md §1.
LIMA — “LIMA: Less Is More for Alignment,” a 2023 paper by Meta AI showing that 1 000 carefully curated SFT examples can outperform training on 50 000 unfiltered ones. Mentioned in companion blog content; not used directly in the pipeline.
LLM-as-judge — A methodology where a large language model is used to score the quality of other documents or model outputs. Used in stage 4 of the pipeline with Qwen2.5-7B-Instruct as the judge model.
LoRA — Low-Rank Adaptation. A parameter-efficient fine-tuning method for LLMs. Mentioned in companion blog content; not used directly in the pipeline.
M
MiCA — Markets in Crypto-Assets Regulation, Regulation (EU) 2023/1114. Captured via eurlex_playwright.
MiFID II — Markets in Financial Instruments Directive II, Directive 2014/65/EU. Original act included in eurlex_fr; recent revision captured via eurlex_playwright.
MinHash LSH — A technique combining MinHash signatures with locality-sensitive hashing for efficient detection of near-duplicate documents. Used in stage 3 of the pipeline at 128 permutations and 16 bands of 8 rows (effective Jaccard threshold ≈ 0.7). See §5.3 of the Dataset Specification.
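A pure-Python sketch of the two ingredients (illustrative only; the pipeline uses a library implementation, and the salted-hash trick below merely simulates independent permutations):

```python
import hashlib

NUM_PERMS = 128  # mirrors the stage-3 configuration

def minhash(tokens: set[str], num_perms: int = NUM_PERMS) -> list[int]:
    # Simulate num_perms independent hash functions by salting one hash.
    return [
        min(int.from_bytes(hashlib.sha256(f"{i}:{t}".encode()).digest()[:8], "big")
            for t in tokens)
        for i in range(num_perms)
    ]

def estimated_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    # Fraction of matching signature positions estimates Jaccard similarity.
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

# Banding: with b bands of r rows, the standard approximation for the
# similarity at which a pair becomes likely to collide is (1/b) ** (1/r).
b, r = 16, 8
threshold = (1 / b) ** (1 / r)  # ≈ 0.707 for the stage-3 parameters
```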
MSA — Master Service Agreement. A commercial contract template used between the producer and corpus customers. Out of scope of the Dataset Specification.
N
NIS2 — Network and Information Security Directive 2, Directive (EU) 2022/2555. Captured via eurlex_playwright.
O
OVH — OVHcloud SAS, French sovereign cloud provider. Hosts the corpus’s build infrastructure and storage. EU data residency at the SBG5 (Strasbourg) data centre.
P
Parquet — Apache Parquet, a columnar binary file format for analytics data. The Standard, Premium, and Enterprise tiers are distributed as Parquet shards.
Pipeline content SHA-256 — The cryptographic hash of the combined source data state, pipeline code state, and configuration that produced a corpus release. Embedded in every per-tier MANIFEST.json. Release v1.1.0’s pipeline content SHA-256 is 79fb405b21ba7aee68eb088bbb89f3c0599ae9fa093b88a124c350b5522ae8db.
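The construction can be illustrated as a hash over component hashes, taken in a fixed order so the result is reproducible. A hypothetical sketch (component names and layout are placeholders; the authoritative construction is the one recorded in MANIFEST.json):

```python
import hashlib

def pipeline_content_sha256(components: dict[str, bytes]) -> str:
    # Hash each component, then hash the sorted (name, digest) lines so the
    # release hash is independent of dict insertion order.
    outer = hashlib.sha256()
    for name in sorted(components):
        inner = hashlib.sha256(components[name]).hexdigest()
        outer.update(f"{name}:{inner}\n".encode("ascii"))
    return outer.hexdigest()

h1 = pipeline_content_sha256({"sources": b"...", "code": b"...", "config": b"v1"})
h2 = pipeline_content_sha256({"sources": b"...", "code": b"...", "config": b"v2"})
assert h1 != h2  # any change to any component changes the release hash
```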
PROV-O — W3C Provenance Ontology. A W3C standard vocabulary for representing provenance information. Used in the corpus’s per-document audit trail records.
Q
Qwen2.5-7B-Instruct — A 7-billion-parameter large language model by Alibaba Cloud, Apache 2.0 licensed. Used as the LLM-judge teacher in stage 4 of the pipeline.
R
RAG — Retrieval-Augmented Generation. An AI architecture combining a retrieval index with a generative model. The corpus is suitable for RAG indexing per §2.1 of the Dataset Specification.
Representativeness statement — A required practice under Article 10(3) of the AI Act. The corpus’s representativeness considerations are discussed in §8 and §9 of the Dataset Specification.
Retraction — In the corpus context, the procedure for removing affected records following an erasure request or licence withdrawal, detailed in §10.3 of the Dataset Specification.
S
Sample tier — The free, lead-generation preview of the corpus, containing 500 stratified-sample documents. Distributed as a single JSONL file.
Sample-then-distil — A training pattern where a small subset of data is labelled by an expensive teacher model, then the labels are used to train a cheaper student model that is applied to the full dataset. Used in stages 4 and 4b of the pipeline.
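A toy illustration of the pattern (everything here is synthetic: the real pipeline uses an LLM teacher and a DistilCamemBERT student, not keyword rules):

```python
import random

random.seed(0)

def teacher_score(doc: str) -> int:
    # Stand-in for the expensive LLM judge.
    return 1 if "article" in doc else 0

corpus = [f"article {i}" if i % 2 else f"blog {i}" for i in range(1000)]

# 1. Label only a small sample with the expensive teacher.
sample = random.sample(corpus, 50)
labelled = [(doc, teacher_score(doc)) for doc in sample]
assert any(y for _, y in labelled)  # sanity: sample contains positives

# 2. "Train" a cheap student on the teacher labels (here: keyword extraction).
positive_words = {w for doc, y in labelled if y == 1
                  for w in doc.split() if not w.isdigit()}

def student_score(doc: str) -> int:
    return 1 if positive_words & set(doc.split()) else 0

# 3. Apply the cheap student to the full corpus.
predictions = [student_score(doc) for doc in corpus]
```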
SFT — Supervised Fine-Tuning. In LLM training, the practice of fine-tuning a base model on labelled instruction-response pairs. The FPWC ships raw text only; SFT derivations are an explicit out-of-scope post-processing step left to the consumer.
SHA-256 — A 256-bit cryptographic hash function used in the corpus for content integrity and pipeline reproducibility verification.
Snowflake — A cloud data warehouse platform. The producer is pursuing a Snowflake Marketplace listing for the corpus.
Sub-vertical classifier — In stage 3 of the pipeline, a rule-based classifier assigning each topical document to one of six sub-verticals (regtech / risque / fiscalité / macro / corporate / autre).
Subject access request — Under GDPR Article 15, the right of a data subject to obtain confirmation as to whether personal data concerning them is being processed and to access that data. The corpus’s procedure is detailed in §10.2 of the Dataset Specification.
T
Tier — A commercial packaging level of the corpus (Sample, Standard, Premium, Enterprise). See §3.3 of the Dataset Specification.
Topical filter — In stage 3 of the pipeline, a rule-based classifier assigning each document a binary is_finance_topical flag. Threshold tuned to produce an 8.78 % topical-positive rate on the 2 044 132-document input.
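The shape of such a rule-based flag can be sketched as follows (the production keyword list and threshold are not public; the terms and the two-hit threshold below are placeholders):

```python
# Hypothetical keyword list; not the pipeline's actual rules.
FINANCE_TERMS = {"banque", "crédit", "marché", "fiscalité", "assurance", "titre"}
MIN_HITS = 2  # hypothetical threshold

def is_finance_topical(text: str) -> bool:
    words = text.lower().split()
    hits = sum(w.strip(".,;:!?") in FINANCE_TERMS for w in words)
    return hits >= MIN_HITS

assert is_finance_topical("La banque accorde un crédit immobilier.")
assert not is_finance_topical("Recette de la tarte aux pommes.")
```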
V
vLLM — A high-throughput LLM inference library. Used by the producer in early prototypes but not in the final pipeline: the Qwen2.5-7B-Instruct teacher run uses native transformers inference instead, for compatibility with the available CUDA 12.2 driver.
W
Wyoming LLC — The producer’s legal entity, FINALEADS LLC, incorporated in the State of Wyoming, United States. Operates the corpus under the brand French Corpus LLM.
Z
zstd — A compression algorithm used as the default compression codec for Parquet column data in the corpus distribution. Selected for its high compression ratio on structured text columns and its fast decompression speed at training time.
Comments or suggestions for additional entries should be addressed to compliance@frenchcorpus.com.