Training data from regulator corpora — CNIL, ACPR, sanctioning bodies — is full of personal names. The right answer is not redaction, it is pseudonymization. Article 4(5) of the GDPR is specific about what counts, and the gap between “we replaced some names” and “this dataset is auditably safe” is wider than most teams realize.
Key takeaways
- Anonymization removes every reasonably likely means of re-identification; pseudonymization keeps a controlled re-identification path. Under GDPR they are different regimes, with different downstream obligations.
- Regex detection of civil titles (M., Mme, Maître, Dr.) finds about 70–90% of named persons in French regulator text. The remaining gap is mostly bare surnames after first mention and entity-name confusions.
- Per-document mapping is the right default. Cross-document linkability is a feature for entity-linking tasks but a liability for GDPR scope. Most teams should keep numbering local to each document.
- Companies, public bodies, and place names are not personal data under Article 4(1). Preserving them keeps the dataset useful for finance and regtech work. Mixing the two creates worse downstream signal.
- AI Act Article 10 expects measurable data governance. “We pseudonymized X documents, with Y unique persons mapped to Z substitutions, with a documented detector and a known-limit list” is what a 2026 audit wants to see.
In this article
- Anonymization vs pseudonymization in GDPR Article 4
- Why training data on regulator corpora needs this layer
- Detection — what a title-based regex captures
- Mapping — per-document vs cross-document
- What survives the pass (and why that is intentional)
- Pitfalls — back-identification, OCR, partial overlap
- Tooling — Presidio, spaCy, dedicated regex, hybrid
- Reporting — what to write in your dataset specification
- Frequently asked questions
Anonymization vs pseudonymization in GDPR Article 4
The GDPR draws a hard line between the two. Anonymization (addressed in Recital 26 rather than in the articles) means the data subject cannot be re-identified by any reasonably likely means, taking into account “all the means reasonably likely to be used,” including those of the controller. Anonymized data is outside the scope of the GDPR. Pseudonymization, defined in Article 4(5), means the data can no longer be attributed to a specific person without additional information held separately. Pseudonymized data is still personal data — GDPR still applies, but you get lighter obligations on certain processing activities, including some research.
Training data falls firmly in the pseudonymization regime when you replace names with stable aliases. The mapping itself is the “additional information” the GDPR talks about. If you keep that mapping, you keep the regime. If you destroy it after the pass, you push toward anonymization — provided the residual text cannot itself re-identify the subject through context.
WHY IT MATTERS
A LinkedIn-grade backgrounder on a public sanction decision is enough context to re-identify someone even if you strip their name. True anonymization is a high bar. Pseudonymization is what most LLM datasets can actually claim, and it is what the AI Act Article 10 documentation expects you to claim accurately.
Why training data on regulator corpora needs this layer
Open data from French regulators — CNIL deliberations, ACPR sanctions, Cour de cassation and Conseil d’État rulings — names individuals routinely. Compliance officers, directors, professionals being sanctioned, even rapporteurs. The text is public, so republishing it is legal. But training a model on it without a pseudonymization layer ships those names into the model weights and into anything the model generates.
Concrete numbers from a recent pass we ran on the full CNIL deliberation archive (8,126 substantive documents, 1979–2025): 92% of documents contained at least one personal name, 20,460 unique persons were detected across the corpus, and 21,750 individual substitutions were applied. Fewer than 1% of substantive deliberations escape this — admin closure letters, mostly. Expect similar proportions on any regulator corpus.
Detection — what a title-based regex captures
The simplest reliable detector is a regex anchored on civil titles: M., Mme, Madame, Monsieur, Maître, Me, Dr., Pr., MM., plus the English equivalents. After the title, match 1–4 capitalized tokens including French particles (de, du, des, d’). On French regulator text this lands roughly 70–90% of person mentions in one pass.
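A minimal sketch of such a detector in Python, assuming the title-plus-tokens pattern described above; the title and particle inventories here are illustrative subsets, not the production pattern:

```python
import re

# Illustrative title and particle lists; extend for production use.
TITLES = r"(?:M\.|MM\.|Mme|Madame|Monsieur|Ma[iî]tre|Me|Dr\.|Pr\.)"
PARTICLE = r"(?:(?:de|du|des)\s+|d')"
NAME = rf"{PARTICLE}?[A-ZÀ-Ý][\w'-]+"
PERSON = re.compile(rf"\b({TITLES})\s+((?:{NAME}\s+){{0,3}}{NAME})")

def detect_persons(text: str) -> list[tuple[str, str]]:
    """Return (title, full_name) pairs in order of appearance."""
    return [(m.group(1), m.group(2)) for m in PERSON.finditer(text)]
```

Anchoring on the title is what gives the pass its precision: a capitalized token is only treated as a name when a civil title immediately precedes it.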
What gets missed: bare surname references after a first titled mention (“Dupont a indiqué…” without “M. Dupont”), fully lowercased OCR artefacts, and names that collide with common nouns (“Petit”, “Leblanc”, “Roy”). To recover the bare surname cases, do a second pass: for each detected person, look for the last token of their full name later in the document, and only substitute it if no other person in the same document shares that surname. This catches 5–10 additional substitutions per long document without introducing false matches.
Regex is not the right tool for general NER. It is the right tool for the narrow “title + capitalized phrase” pattern in French legal text — high precision, well-bounded, auditable.
Mapping — per-document vs cross-document
Per-document mapping resets the alias counter at every document. Two appearances of [P1] in different documents may or may not be the same person — the corpus does not encode that information anywhere. Cross-document mapping uses a global counter, so [P42] always refers to the same person across the corpus.
Per-document is the right default for compliance datasets. It removes the linkability graph that a cross-document mapping creates. It is also harder to back-identify — even with the aliased text plus public knowledge, you cannot stitch a profile across documents. The trade-off is that the corpus loses some signal for entity-linking and co-reference resolution tasks. For finance and regulatory training, that signal is rarely valuable enough to justify the GDPR risk.
| Mapping mode | Cross-doc linkability | Useful for | GDPR posture |
|---|---|---|---|
| Per-document | None | Compliance LLMs, general SFT | Strongest |
| Cross-document | Full | Entity linking, co-reference | Requires separate justification |
| Hybrid (per-source) | Within source only | Targeted research | Document the boundary |
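A sketch of the per-document mode, with the alias counter local to each call and the civil title kept in front of the alias; the detector regex is simplified for brevity:

```python
import re

# Simplified detector. Per-document mapping means the counter restarts
# with every call, so [P1] is only meaningful inside one document.
TITLE_NAME = re.compile(
    r"\b(M\.|MM\.|Mme|Madame|Monsieur|Maître|Me|Dr\.|Pr\.)\s+"
    r"([A-ZÀ-Ý][\w'-]+(?:\s+[A-ZÀ-Ý][\w'-]+){0,3})"
)

def pseudonymize_per_document(doc: str) -> tuple[str, dict[str, str]]:
    mapping: dict[str, str] = {}
    def repl(m: re.Match) -> str:
        alias = mapping.setdefault(m.group(2), f"[P{len(mapping) + 1}]")
        # keep the civil title in front of the alias: "M. [P1]"
        return f"{m.group(1)} {alias}"
    return TITLE_NAME.sub(repl, doc), mapping
```

Because mapping lives inside the function, nothing links aliases across calls; the per-document mapping is exactly the "additional information" you store separately under Article 4(5).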
What survives the pass (and why that is intentional)
Companies, public bodies, and place names are not personal data under Article 4(1) of the GDPR. They identify legal persons or geographic entities, not natural persons. A pseudonymization pass on a regulator corpus should leave them alone. This is what keeps the dataset useful for downstream tasks like sanction analysis (“which banks were fined for AML failures?”) or jurisdictional studies (“how often does the Conseil d’État cite the CJEU?”).
The non-obvious part is the title prefix. We keep M. and Mme in front of the alias: M. [P1] rather than [P1]. The reason is downstream learning. A model that sees the structural pattern can still learn that a civil title precedes a person reference, which matters for French legal style — court decisions and ACPR sanctions follow it strictly. Stripping the title flattens that signal.
A pseudonymized FR finance + regulatory corpus, ready for SFT
We ship a French finance, regulatory and economic LLM corpus with ACPR-pattern pseudonymization applied per document. Article 10 audit trail per row, signed Dataset Specification, quarterly refresh.
Pitfalls — back-identification, OCR, partial overlap
Three failure modes worth instrumenting against:
- Back-identification via context. Even with the name replaced, a sanction decision often gives enough context (entity, role, date) to re-identify a specific person. Pseudonymization reduces, but does not eliminate, this risk. Acknowledge it.
- OCR noise. Old PDFs use ligatures and spacing that break the regex. Run a normalization pass first (Unicode NFC, ligature decomposition, whitespace collapse). If you skip this, you ship a corpus where 10–30% of names slip through in pre-2010 documents.
- Partial overlap with entities. “M. Dupont” replaced cleanly, but the document later says “la société Dupont SARL.” The bare surname match must not fire there. Anchor your bare-surname pass on word boundaries and exclude tokens followed by SA, SARL, SAS, SAS unipersonnelle, etc.
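The normalization pre-pass from the OCR bullet can be as small as this; the ligature table is a partial, illustrative list:

```python
import re
import unicodedata

# NFC does not decompose compatibility ligatures, so handle them
# explicitly (partial list; extend as your OCR audit finds more).
LIGATURES = {"ﬁ": "fi", "ﬂ": "fl", "ﬀ": "ff", "ﬃ": "ffi", "ﬄ": "ffl"}

def normalize_for_detection(text: str) -> str:
    """NFC normalization, ligature decomposition, whitespace collapse.
    Run this before the name regex on OCR'd documents."""
    text = unicodedata.normalize("NFC", text)
    for lig, plain in LIGATURES.items():
        text = text.replace(lig, plain)
    return re.sub(r"\s+", " ", text).strip()
```

Note that Python's \s also collapses non-breaking spaces, which French typography uses before titles and punctuation and which would otherwise break the title-anchored regex.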
Tooling — Presidio, spaCy, dedicated regex, hybrid
Microsoft Presidio gives you a pre-built PII analyzer with French support. It uses a mix of NER and patterns and reports recognized entity types with confidence scores. Useful as a sanity check. spaCy + a French model (fr_core_news_lg or a specialized legal model) gives you out-of-the-box person detection via the PER entity label — strong recall but variable precision on legal-style text, where many proper nouns are not persons.
For French regulatory text specifically, a dedicated regex anchored on civil titles outperforms general NER in our experience: higher precision, fully auditable rules, trivially testable. The right architecture is hybrid — the regex pass first for high-precision substitution with civil titles, then a NER pass to flag potential misses for human review. The hybrid keeps the audit trail clean.
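A hedged sketch of that hybrid: regex substitutes in pass 1, and pass 2 only flags candidates for review. The NER stage is stubbed out here with a naive heuristic; in production you would load spaCy's fr_core_news_lg and keep entities labeled PER. The detector is simplified to single-token names for brevity:

```python
import re
from typing import Callable

TITLED = re.compile(r"\b(M\.|Mme|Maître|Me|Dr\.)\s+([A-ZÀ-Ý][\w'-]+)")
TITLE_WORDS = {"M.", "MM.", "Mme", "Madame", "Monsieur", "Maître", "Me", "Dr.", "Pr."}

def hybrid_pass(text: str, ner_persons: Callable[[str], list[str]]):
    """Pass 1: regex substitution on titled names (high precision,
    auditable). Pass 2: an NER detector flags residual person-like
    spans for human review; nothing is auto-substituted in pass 2."""
    mapping: dict[str, str] = {}
    def repl(m: re.Match) -> str:
        alias = mapping.setdefault(m.group(2), f"[P{len(mapping) + 1}]")
        return f"{m.group(1)} {alias}"
    substituted = TITLED.sub(repl, text)
    flagged = [p for p in ner_persons(substituted) if p not in TITLE_WORDS]
    return substituted, flagged

# Stand-in for the real NER stage: this naive heuristic just returns
# mid-sentence capitalized tokens that survived pass 1.
def naive_ner(text: str) -> list[str]:
    return re.findall(r"(?<![.!?] )\b[A-ZÀ-Ý][a-zà-ÿ]+\b", text)
```

The split keeps the audit trail clean: every substitution traces to a deterministic rule, and every statistical judgment goes through a human.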
Reporting — what to write in your dataset specification
Article 10 of the AI Act expects measurable data governance. For pseudonymization, your dataset specification should record, at minimum:
- The detector method (regex pattern, NER model + version, or hybrid).
- The mapping strategy (per-document or cross-document).
- Aggregate counts: documents touched, unique persons detected, total substitutions.
- The known-limit list (cases the detector misses, with examples).
- Whether companies, places, article references, and civil titles are preserved.
- The pre-pseudo backup retention policy (separate access controls, retention period).
On our own corpus, that section reads: “ACPR-pattern regex detector matching title + 1–4 capitalized tokens; per-document mapping; numbering resets between documents; companies and Article-level references preserved; pre-pseudo backup retained on a separate access-controlled volume.” That sentence is what an audit team actually wants. The reproducible counts back it up.
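The aggregate counts are cheap to produce from per-document results. A sketch, assuming each document's pass returns its mapping and substitution count (the field names here are illustrative, not a fixed schema):

```python
import json

def spec_counts(results: list[dict]) -> dict:
    """Aggregate per-document pseudonymization results into the
    figures an Article 10 dataset specification reports. With
    per-document mapping, unique persons are counted per document,
    since no cross-document identity exists to deduplicate on."""
    return {
        "documents_processed": len(results),
        "documents_touched": sum(1 for r in results if r["mapping"]),
        "unique_persons_detected": sum(len(r["mapping"]) for r in results),
        "total_substitutions": sum(r["substitutions"] for r in results),
    }

results = [
    {"mapping": {"Dupont": "[P1]", "Martin": "[P2]"}, "substitutions": 5},
    {"mapping": {}, "substitutions": 0},
    {"mapping": {"Leroy": "[P1]"}, "substitutions": 2},
]
print(json.dumps(spec_counts(results), indent=2))
```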
A French regulatory corpus that already does this
We extract, pseudonymize, score, and audit-trail French regulator and finance text. ACPR, CNIL, Cour de cassation, JORF, EUR-Lex. AI Act Article 10 signed Dataset Specification ships with every release.
Frequently asked questions
Is pseudonymized training data still personal data under GDPR?
Yes. Article 4(5) makes that explicit. If a mapping exists that could re-identify the data subject, the data remains personal data. The mapping itself is the additional information GDPR refers to. You get lighter obligations on certain research-adjacent processing, but you do not exit GDPR scope unless you can claim true anonymization.
How is this different from redaction?
Redaction destroys the token: “[REDACTED].” Pseudonymization replaces it with a stable alias: “[P1].” A model trained on pseudonymized text still learns that a specific entity is referenced multiple times within a document; the alias preserves co-reference within the doc. Redaction destroys that signal. For LLM training, pseudonymization is almost always the right choice.
Should I pseudonymize company names too?
Only if you have a specific reason. Companies are not data subjects under Article 4(1), so the GDPR does not require it. Stripping them removes signal a regtech model needs — sanction prediction, AML pattern detection, comparative jurisprudence. Pseudonymize companies only when you have a legitimate competitive-intelligence concern, and document that scope choice.
What about persons who waived their privacy through public statements?
GDPR does not have a public-figure carve-out as broad as US privacy law. Article 9 has an exception for data manifestly made public by the data subject, but that applies to special-category data, and even there the courts read it narrowly. Default to pseudonymization, then carve out specific personas only with documented legal review.
How do I verify the pass actually worked?
Three checks. (1) Spot-check: pick 50 documents at random, read them, and count any name that survived the pass. (2) Pattern audit: search the post-pseudo corpus for a civil title followed by a capitalized word (the regex M\.\s+[A-Z], for instance) — every hit is a survivor or a false negative to investigate. (3) Reverse map: for each detected alias, confirm the original name in the pre-pseudo backup and the count of substitutions. All three belong in your audit log.
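Check (2) is a few lines. The survivor pattern list below is illustrative and should mirror your detector's title inventory; an alias like "[P1]" after a title will not match, so every hit is worth a look:

```python
import re

# A civil title followed by a real capitalized name indicates a
# survivor; a substituted alias "[P1]" does not match the name part.
SURVIVOR_PATTERNS = [
    re.compile(r"\b(?:M\.|Mme|Maître|Me|Dr\.)\s+[A-ZÀ-Ý][a-zà-ÿ'-]+"),
]

def audit_survivors(corpus: dict[str, str]) -> dict[str, list[str]]:
    """Return doc_id -> matched snippets for every hit to investigate."""
    hits: dict[str, list[str]] = {}
    for doc_id, text in corpus.items():
        found = [m.group(0) for pat in SURVIVOR_PATTERNS for m in pat.finditer(text)]
        if found:
            hits[doc_id] = found
    return hits
```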
Keep reading
- EU AI Act Article 10 — what training data documentation actually requires. Decoded Article 10 obligations for general-purpose AI providers and what they mean for your dataset documentation.
- Building an audit-ready provenance trail for training datasets. PROV-O modeling, per-document JSON-LD, signed manifests — the pieces of an audit trail regulators expect to see.
- Choosing a dataset format — Parquet vs JSONL vs Arrow. Format choices that affect ingestion speed, compression, schema enforcement, and downstream pipeline ergonomics.