Editorial · French Corpus LLM · regulatory & generative AI
Field notes.
Field notes on training data, LLM fine-tuning, and regulatory compliance for AI in regulated industries — written by the team building the French Premium Web Corpus.
Latest articles
- French legal NLP — corpora, benchmarks, and evaluation tasks in 2026.
  French legal NLP has matured into a distinct vertical with its own corpora, evaluation tasks, and tooling — but the landscape is still fragmented.…
- Instruction tuning vs SFT vs RLHF — choosing the right dataset shape.
  Instruction tuning, SFT, and RLHF look interchangeable until you have to build the dataset. Format differs, signal differs, evaluation differs. Here is how to…
- RAG datasets — what makes a corpus retrieval-friendly in 2026.
  RAG performance is more about your corpus than your retriever. Chunking, metadata, provenance per document, and pre-computed embeddings are the ingredients that decide whether…
- Data deduplication for LLM corpora — MinHash LSH, exact match, semantic.
  Deduplication is one of the highest-ROI steps in an LLM data pipeline. Exact match is easy and shallow. MinHash LSH is the workhorse. Semantic…
- Best public datasets for training generative AI models in 2026.
  A 2026 map of the public datasets you can train a generative model on — with licenses, sizes, language coverage, and the trade-offs that…
- Datasheets for datasets — the AI Act Article 10 compliance template.
  Datasheets for datasets, the 2018 framework by Gebru et al., is now the closest thing the AI Act Article 10 has to a recognized…
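The deduplication entry above calls MinHash LSH the workhorse. As a rough, stdlib-only sketch of the MinHash half (signatures whose positional agreement rate estimates Jaccard similarity; the LSH banding step is omitted), assuming character shingles and seeded blake2b as the hash family:

```python
import hashlib


def shingles(text: str, k: int = 5) -> set[str]:
    """Overlapping character k-grams of the input text."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def minhash_signature(items: set[str], num_perm: int = 64) -> list[int]:
    """One minimum per seeded hash function; the fraction of equal
    positions across two signatures estimates the Jaccard similarity
    of the underlying shingle sets."""
    sig = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")  # seeds a distinct hash function
        sig.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode(), digest_size=8, salt=salt).digest(),
                "big",
            )
            for s in items
        ))
    return sig


def estimated_jaccard(a: list[int], b: list[int]) -> float:
    """Fraction of signature positions that agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)
```

In a real pipeline the signatures would be banded into LSH buckets so that only candidate near-duplicate pairs are ever compared; libraries such as datasketch package both steps.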
🧠 LLM Training Data
- Instruction tuning vs SFT vs RLHF — choosing the right dataset shape.
  Instruction tuning, SFT, and RLHF look interchangeable until you have to build the dataset. Format differs, signal differs,…
- RAG datasets — what makes a corpus retrieval-friendly in 2026.
  RAG performance is more about your corpus than your retriever. Chunking, metadata, provenance per document, and pre-computed embeddings…
- Best public datasets for training generative AI models in 2026.
  A 2026 map of the public datasets you can train a generative model on — with licenses, sizes,…
⚙️ Dataset Engineering
- Data deduplication for LLM corpora — MinHash LSH, exact match, semantic.
  Deduplication is one of the highest-ROI steps in an LLM data pipeline. Exact match is easy and shallow.…
- Training data size for LLMs — how many tokens do you actually need in 2026?
  Chinchilla scaling laws versus modern over-training. Concrete token volumes for pre-training, continued pre-training, SFT, DPO, evaluation, and RAG.…
- Choosing a dataset format — Parquet vs JSONL vs Arrow for ML pipelines.
  When to use JSONL, Parquet, or Arrow for ML datasets. The actual behaviour at scale, the operations each…
📋 AI Act & Governance
- Datasheets for datasets — the AI Act Article 10 compliance template.
  Datasheets for datasets, the 2018 framework by Gebru et al., is now the closest thing the AI Act…
- GDPR pseudonymization for LLM training data — patterns and pitfalls.
  Pseudonymization is the missing layer in most LLM training pipelines. Article 4(5) of the GDPR is specific about…
- Building an audit-ready provenance trail for training datasets.
  A per-record provenance trail using PROV-O and JSON-LD sidecars is now a regulatory necessity under the EU AI…
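The pseudonymization entry above leans on Article 4(5) GDPR, under which pseudonymized data may no longer be attributed to a person without additional information that is kept separately. A minimal sketch of one common pattern, deterministic keyed-hash pseudonymization, where the key is that separately kept additional information (the function name and token format here are illustrative, not from the article):

```python
import hashlib
import hmac


def pseudonymize(value: str, key: bytes) -> str:
    """Replace an identifier with a stable token.

    The same value and key always yield the same token, so joins across
    records in a training corpus still work; without the key, the mapping
    can be neither recomputed nor reversed.
    """
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return "pseu_" + digest[:16]
```

Rotating or destroying the key breaks linkage across dataset versions, which is one of the pitfalls the pattern has to plan for.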
🎯 Object Detection
- How to create a training dataset for object detection.
  A practical eight-step guide to building a production-grade object detection training dataset — scoping, sourcing, annotation, QA, splits,…