French legal NLP sat in the academic margin for a decade. In 2026 it is a distinct vertical with its own corpora, its own evaluation tasks, and a small but growing set of usable benchmarks. The landscape is still fragmented — knowing where the open infrastructure lives is half the work.
Key takeaways
- The DILA open data archives (JORF, LEGI, CASS, JADE, CONSTIT, KALI, CIRC, CNIL) are the foundation. Roughly 2.7 M documents and 2 B+ tokens across French statutes, jurisprudence, and regulator decisions.
- EUR-Lex French translations cover EU law through 2022 in the bulk archive. Post-2022 regulations (DORA, MiCA, AI Act, CSRD) require Cellar API or direct extraction — the bulk feed has not caught up.
- Benchmarks for FR legal NLP are thin. LegalBench (multilingual) covers some FR tasks; FQuAD is general-purpose. A dedicated FR legal benchmark is the open opportunity for 2026.
- CamemBERT v2 (2024) and CroissantLLM (2024) are the strongest FR-pretrained baselines for legal text. Generic multilingual models (Llama-3, Qwen 2.5) catch up when fine-tuned with vertical FR data.
- GDPR + AI Act compliance is the moat. A French legal model trained on a pseudonymized, audit-trailed corpus is the deployment that survives an enterprise procurement review. Generic FR web text is not.
In this article
- What “French legal NLP” covers
- The foundation — DILA open data archives
- EU-origin French legal text — EUR-Lex and Cellar
- Regulator doctrine — ACPR, AMF, CNIL, BdF, BOFiP
- Pretrained French models — what to start from
- Benchmarks — what exists and what doesn’t
- Evaluation tasks worth building
- Compliance — GDPR + AI Act as deployment gates
- Frequently asked questions
What “French legal NLP” covers
“French legal NLP” is shorthand for a cluster of related tasks on French-language text from legal and regulatory sources. The cluster spans statute interpretation, case-law analysis, regulatory text retrieval, contract analysis, compliance-question answering, sanctions analysis, and a long tail of more specific applications (tax doctrine, employment law, social housing law).
Three things make this cluster distinct from English legal NLP and from generic French NLP. The legal style is dense and formal — long sentences, strict citation conventions, archaic vocabulary in older texts. The source structure matters — article numbers, CELEX numbers, and court decision references are precise identifiers, not approximations. The compliance posture is non-negotiable — anything shipped to a French enterprise customer in 2026 has to clear GDPR and (soon) AI Act review.
The foundation — DILA open data archives
The Direction de l’information légale et administrative (DILA) publishes the canonical French legal text under the Licence Ouverte 2.0. All of it is downloadable as bulk archives from echanges.dila.gouv.fr/OPENDATA/. The major collections:
| Archive | Content | Documents | Tokens (approx) |
|---|---|---|---|
| JORF | Journal officiel de la République française — laws, decrees, orders 1947–present | ~1.9 M | ~470 M |
| LEGI | Consolidated French codes (Civil, Commercial, Monetary, Tax…) | ~150 K articles | ~50 M |
| CASS | Cour de cassation rulings 1790–present | ~144 K | ~250 M |
| JADE | Conseil d’État administrative rulings 1990–present | ~552 K | ~1.4 B |
| CONSTIT | Conseil constitutionnel decisions 1958–present | ~7 K | ~15 M |
| KALI | National collective bargaining agreements | ~290 K articles | ~100 M |
| CIRC | Ministerial circulars (administrative interpretation) | ~30 K | ~60 M |
| CNIL | CNIL deliberations 1979–present (GDPR + AI Act doctrine) | ~26 K (8 K substantive) | ~18 M |
Total raw: roughly 2.7 million documents and 2 billion tokens. After deduplication (MinHash LSH at a 0.7 Jaccard similarity threshold), expect 1.6–1.8 million documents to survive. This is the single largest open corpus of French legal text by an order of magnitude.
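The deduplication step can be sketched in pure Python. This is a minimal MinHash near-duplicate check at the 0.7 threshold mentioned above — production pipelines use a library with LSH banding (e.g. datasketch) to avoid pairwise comparison; the shingle size and permutation count here are illustrative choices, and the two sample sentences are invented.

```python
import hashlib

NUM_PERM = 128  # number of hash permutations; illustrative choice

def shingles(text: str, k: int = 5) -> set[str]:
    """Character k-shingles; word shingles are a common alternative."""
    text = " ".join(text.lower().split())
    return {text[i:i + k] for i in range(max(1, len(text) - k + 1))}

def minhash(sh: set[str]) -> list[int]:
    """One salted hash per permutation slot; keep the minimum per slot."""
    sig = []
    for seed in range(NUM_PERM):
        sig.append(min(
            int.from_bytes(hashlib.blake2b(
                f"{seed}:{s}".encode(), digest_size=8).digest(), "big")
            for s in sh))
    return sig

def est_jaccard(a: list[int], b: list[int]) -> float:
    """Fraction of agreeing slots estimates the Jaccard similarity."""
    return sum(x == y for x, y in zip(a, b)) / NUM_PERM

doc1 = "Le présent arrêté entre en vigueur le lendemain de sa publication."
doc2 = "Le présent arrêté entre en vigueur le jour de sa publication."
sig1, sig2 = minhash(shingles(doc1)), minhash(shingles(doc2))
is_duplicate = est_jaccard(sig1, sig2) >= 0.7  # the 0.7 threshold above
```

At corpus scale the signatures are bucketed with LSH banding so only candidate pairs are compared, which is what makes 2.7 M documents tractable.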
EXTRACTION PATTERN
All DILA archives share a common XML envelope (META_COMMUN/ID + BLOC_TEXTUEL/CONTENU). One extractor handles all archives with source-specific filename patterns (JURITEXT for CASS, CETATEXT for JADE, CNILTEXT for CNIL, etc.). Open-source DILA bulk extractors are scarce. Roll your own or use ours.
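The shared envelope makes the extractor small. Here is a sketch using Python's stdlib ElementTree, assuming the META_COMMUN/ID + BLOC_TEXTUEL/CONTENU layout described above; the real archives nest these elements at varying depths per collection, and CONTENU carries inline markup, so the text is gathered with itertext(). The sample document is invented for illustration.

```python
import xml.etree.ElementTree as ET

def extract_doc(xml_string: str) -> dict:
    """Pull the document ID and text body from a DILA-style XML envelope."""
    root = ET.fromstring(xml_string)
    # Descendant search, since nesting depth varies across collections.
    doc_id = root.findtext(".//META_COMMUN/ID", default="")
    contenu = root.find(".//BLOC_TEXTUEL/CONTENU")
    # CONTENU mixes inline tags with text; itertext() flattens both.
    text = "".join(contenu.itertext()).strip() if contenu is not None else ""
    return {"id": doc_id, "text": text}

sample = """<TEXTE_JURI>
  <META><META_COMMUN><ID>JURITEXT000007000001</ID></META_COMMUN></META>
  <TEXTE><BLOC_TEXTUEL><CONTENU>
    <p>La cour de cassation casse et annule.</p>
  </CONTENU></BLOC_TEXTUEL></TEXTE>
</TEXTE_JURI>"""
doc = extract_doc(sample)
```

The per-source part reduces to filename routing (JURITEXT*, CETATEXT*, CNILTEXT*) on top of this one function.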
EU-origin French legal text — EUR-Lex and Cellar
French translations of EU regulations, directives, decisions, and case law are available through EUR-Lex (web UI) and the underlying Cellar repository (API + bulk). The bulk archive covers content through 2022 reliably. For 2023+ content — DORA, MiCA, AI Act, CSRD, CRD VI, AMLR — you need to query the Cellar SPARQL endpoint or scrape EUR-Lex directly with proper cookie handling, because the bulk feed has not caught up and EUR-Lex’s recent anti-bot WAF blocks naive scrapers.
Practical numbers: a complete EUR-Lex FR bulk through 2022 is ~100 K documents and ~400 M tokens. Adding the priority post-2022 regulations is a much smaller volume (a few hundred CELEX numbers) but a much higher per-document value for any compliance-oriented downstream task.
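Fetching the priority post-2022 texts by CELEX number can be sketched against the public Cellar SPARQL endpoint. The CDM property name below (cdm:resource_legal_id_celex) should be verified against the current Cellar ontology before relying on it, and the query here only builds the request — it does not handle paging, language selection, or retries.

```python
from urllib.parse import urlencode

CELLAR_SPARQL = "http://publications.europa.eu/webapi/rdf/sparql"

def build_celex_query(celex_ids: list[str]) -> str:
    """SPARQL sketch: resolve each CELEX number to its Cellar work URI.

    Property names follow the Cellar Common Data Model; check them
    against the published ontology before production use.
    """
    values = " ".join(f'"{c}"' for c in celex_ids)
    return f"""
PREFIX cdm: <http://publications.europa.eu/ontology/cdm#>
SELECT ?work ?celex WHERE {{
  VALUES ?celex {{ {values} }}
  ?work cdm:resource_legal_id_celex ?celex .
}}"""

# Two of the post-2022 CELEX numbers named above: DORA and the AI Act.
query = build_celex_query(["32022R2554", "32024R1689"])
request_url = CELLAR_SPARQL + "?" + urlencode(
    {"query": query, "format": "application/sparql-results+json"})
# GET request_url with any HTTP client; results arrive as SPARQL JSON.
```

A few hundred such lookups cover the entire post-2022 priority list, which is why this gap is cheap to close relative to its downstream value.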
Regulator doctrine — ACPR, AMF, CNIL, BdF, BOFiP
The regulator doctrine layer — published positions, sanctions, recommendations, interpretive guidance — is where the modern regulatory thinking lives. Each regulator ships content under Licence Ouverte 2.0, but in different formats and with different extraction effort:
- ACPR (prudential supervisor): ~800–1,500 documents, HTML pages with downloadable PDFs. Sanctions and positions are the most valuable subset.
- AMF (market supervisor): ~7,000+ documents in the sitemap. Many are PDF-landing pages; the actual content lives in attached PDFs. Hybrid HTML + PDF extraction required.
- CNIL (data protection): ~8,000 substantive deliberations from 1979–2025. Available via the DILA bulk archive — the single richest open source of GDPR doctrine in French.
- Banque de France: ~4,500 publications on the modern site (after filtering admin URLs). Bulletin, Working Papers, Financial Stability Reviews. The legacy publications.banque-france.fr archive holds another ~5,000–10,000 historical articles requiring AJAX-aware crawling.
- BOFiP (tax administration): ~8,000 documents covering French tax doctrine. Stable structure, HTML-only.
- DG Trésor: ~20,000 articles from tresor.economie.gouv.fr, including Trésor-Éco letters, Trésor-Info briefs, and thematic notes.
The compliance moat is not the volume of legal text. It is the depth of regulator doctrine. CNIL and ACPR matter more than another 100 M tokens of JORF.
Pretrained French models — what to start from
Pretrained FR models split into two groups. French-first: CamemBERT v2 (2024, encoder, BERT-style for classification and NER), CroissantLLM (2024, decoder, 1.3B parameters), Vigogne, Lucie. Multilingual that include strong FR: Llama-3-8B-Instruct, Qwen 2.5-7B-Instruct, Mistral-Nemo-12B, Llama-3.1-70B, EuroLLM.
For legal-specific work in 2026, two patterns work. (1) Encoder for retrieval and classification: CamemBERT v2 fine-tuned on your specific corpus, with sentence-BERT training for retrieval embeddings. (2) Decoder for generation and QA: a multilingual open model (Llama-3.1 or Qwen 2.5 at 7B or 14B) fine-tuned with French legal SFT examples. The French-first generative models are still smaller and less competitive than the multilingual ones at the 7B+ scale.
A French finance + regulatory corpus, ready to train on
DILA + ACPR + AMF + CNIL + BdF + BOFiP + EUR-Lex FR, deduplicated, pseudonymized, audit-trailed. Pre-computed embeddings ship with each release. Plug into your SFT or RAG pipeline.
Benchmarks — what exists and what doesn’t
FR legal benchmarks are thin in 2026. The visible options:
- LegalBench (multilingual, Stanford 2023): includes some FR tasks, narrow coverage on French-specific concepts.
- FQuAD: general-purpose French extractive QA on Wikipedia. Not legal, useful as a generic French baseline.
- SocFR: French social-text classification — useful for tone and sentiment, not legal reasoning.
- NER on French legal: a few academic datasets exist (typically built around fine-tuned CamemBERT or French BERT variants for entity extraction on court rulings), but most are small and not consistently public.
- Internal evals: every team building French legal AI builds its own eval set from a few hundred expert-curated questions. The lack of a public standard benchmark is the biggest gap.
An open French legal benchmark — 500–1000 expert-validated questions across statute interpretation, jurisprudence reasoning, and regulator doctrine application — would be a high-leverage open contribution for 2026. The field is one well-curated dataset away from rigor comparable to English legal NLP.
Evaluation tasks worth building
If you build vertical FR legal AI, design evaluation around the tasks that pay. The categories that recur in real customer requirements:
- Citation accuracy: given a regulatory question, does the model return correct article references and CELEX numbers? Easy to grade automatically — string match plus structural check.
- Jurisdiction reasoning: given a query, does the model pick the right jurisdiction (national vs EU, civil vs administrative, etc.) and explain the choice?
- Temporal correctness: given a question about “current law,” does the model use the in-force version, not a superseded one?
- Faithfulness to source: when the model paraphrases a regulation, does the paraphrase preserve the legal effect, or shift it?
- Refusal on out-of-scope: when asked something the corpus does not cover, does the model refuse cleanly rather than confabulate?
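The first of these, citation accuracy, is the easiest to automate: string match plus structural check, as noted above. Here is a minimal regex grader; the patterns for French article references and CELEX numbers are illustrative assumptions, not a complete grammar, and the sample answer is invented.

```python
import re

# Structural patterns (illustrative, not exhaustive): French article
# references like "article L. 511-1" and CELEX numbers like "32022R2554".
ARTICLE_RE = re.compile(r"article\s+[LRD]?\.?\s*\d+(?:-\d+)*", re.IGNORECASE)
CELEX_RE = re.compile(r"\b[0-9]{5}[A-Z]{1,2}[0-9]{4}\b")

def grade_citations(answer: str, expected: set[str]) -> dict:
    """String match plus structural check: every expected citation must
    appear, and anything citation-shaped but unexpected is flagged."""
    found = set(ARTICLE_RE.findall(answer)) | set(CELEX_RE.findall(answer))
    normalized = {c.lower().replace(" ", "") for c in found}
    wanted = {c.lower().replace(" ", "") for c in expected}
    return {
        "recall": len(wanted & normalized) / len(wanted) if wanted else 1.0,
        "spurious": sorted(normalized - wanted),
    }

result = grade_citations(
    "Voir l'article L. 511-1 du code monétaire et le règlement 32022R2554.",
    expected={"article L. 511-1", "32022R2554"},
)
```

Spurious citations matter as much as recall: a confidently wrong article number is worse than a refusal.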
Compliance — GDPR + AI Act as deployment gates
Two compliance reviews decide whether a French legal AI product reaches an enterprise customer. GDPR: the model is trained on text that contains personal data (court rulings, regulator sanctions). Pseudonymization with documented method, scope, and counts is the minimum baseline. AI Act: from August 2026, general-purpose AI providers must demonstrate Article 10 compliance — data governance, documentation, bias examination.
A French legal model whose underlying corpus does not have these layers is not deployable to an enterprise customer in late 2026. The cost of adding them after the fact is high — you have to retrace your data lineage and redo your pseudonymization pass. The corpora that ship these layers in v1.0 are the ones that will be the production backbone of FR legal AI in 2026–2027.
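The "documented method, scope, and counts" baseline can be made concrete with a small sketch. The two regex patterns below (names after civility titles, email addresses) are illustrative only; real pipelines combine regex with NER, and the sample sentence is invented.

```python
import re
from collections import Counter

# Illustrative patterns; production pipelines combine regex with NER.
PATTERNS = {
    "person_after_title": re.compile(r"\b(?:M\.|Mme|Mlle)\s+[A-ZÀ-Ý][\w-]+"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
}

def pseudonymize(text: str) -> tuple[str, dict]:
    """Replace matches with placeholders and return the audit record
    the GDPR baseline asks for: method, scope, and per-pattern counts."""
    counts = Counter()
    for name, pattern in PATTERNS.items():
        text, n = pattern.subn(f"[{name.upper()}]", text)
        counts[name] = n
    audit = {"method": "regex", "scope": sorted(PATTERNS), "counts": dict(counts)}
    return text, audit

clean, audit = pseudonymize(
    "La formation a entendu M. Dupont (contact : j.dupont@example.fr).")
```

The audit dict, aggregated per release, is exactly the kind of artifact an Article 10 documentation review asks to see.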
Build your French legal AI on a production-ready vertical corpus
DILA + regulators + EUR-Lex FR, deduplicated, pseudonymized, AI Act Article 10 audit-trailed. The compliance gate cleared in v1.0.
Frequently asked questions
Is French legal NLP its own field or just a subset of legal NLP?
Subset in methodology, distinct in resources. The core techniques — retrieval, NER, RAG, classification — transfer from English legal NLP without modification. The distinct part is the corpora, the citation conventions, the institutional structure, and the language style of French legal writing. In practice, French legal NLP teams build separate pretrained encoders and fine-tune separate models.
Can I use OpenAI or Anthropic models for French legal use cases?
Yes, with caveats. Both perform well on French legal text out of the box for retrieval and basic reasoning. For citation-bearing answers and regulator-specific interpretation, open models fine-tuned on French legal text close the gap and clear French enterprise compliance review more cleanly (data residency, no foreign cloud).
What about Quebec or Belgian French legal text?
Adjacent but not interchangeable. The DILA archives are French (France). Quebec law lives in CanLII; Belgian law in Justel. The legal style overlaps with metropolitan French but the citation conventions, statutory structure, and case-law tradition differ enough that a model trained on DILA underperforms on Quebec or Belgian text without specific fine-tuning.
How big a model do I need for French legal QA?
For citation retrieval and extractive QA, a 7B model is sufficient when paired with a good retriever. For generative interpretation and multi-step legal reasoning, 14B+ shows measurable improvements. Above 70B the marginal gain on French legal tasks is small unless the use case is multi-jurisdictional or multilingual within the answer.
What is the cheapest way to validate that French legal AI will work on my use case?
Three steps. (1) Build a 50–100 question eval set with domain expert answers. (2) Run a baseline with a multilingual model + dense retrieval over a French legal corpus. (3) Have the domain expert score the responses. The accuracy on this small eval is a strong leading indicator of whether scaling the approach makes sense. Total cost: 1–2 weeks of work.
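The three steps above reduce to a small scoring loop. Everything here is a stand-in: model_answer represents your baseline (multilingual model plus retrieval), expert_score represents step 3, and the two eval items are toy examples.

```python
def score_eval_set(eval_set, model_answer, expert_score):
    """Run a baseline over a small expert eval set and report accuracy.

    eval_set: list of {"question", "reference"} dicts (step 1)
    model_answer: callable question -> answer, your RAG baseline (step 2)
    expert_score: callable (answer, reference) -> bool (step 3)
    """
    results = [expert_score(model_answer(item["question"]), item["reference"])
               for item in eval_set]
    return sum(results) / len(results)

# Toy stand-ins, purely illustrative:
eval_set = [
    {"question": "Quel article régit l'agrément bancaire ?",
     "reference": "L. 511-10"},
    {"question": "Quel règlement couvre la résilience opérationnelle ?",
     "reference": "DORA"},
]
accuracy = score_eval_set(
    eval_set,
    model_answer=lambda q: "DORA" if "résilience" in q else "L. 511-10",
    expert_score=lambda ans, ref: ref in ans,
)
```

Swapping the lambdas for a real retriever-backed model and a human grading pass is the whole validation exercise.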
Keep reading
Read next
EU AI Act Article 10 — what training data documentation actually requires
Article 10 obligations decoded for AI providers — the documentation deliverables a regulator audit expects.
GDPR pseudonymization for LLM training data — patterns and pitfalls
Pseudonymization for regulator corpora — what the GDPR requires, where regex falls short, what to report in your dataset specification.
Best public datasets for training generative AI models in 2026
The pretraining and post-training datasets you can train on without legal review — where FR fits in the broader open landscape.