Editorial · French Corpus LLM · regulatory & generative AI
Field notes.
Field notes on training data, LLM fine-tuning, and regulatory compliance for AI in regulated industries — written by the team building the French Premium Web Corpus.
Latest articles
- French legal NLP — corpora, benchmarks, and evaluation tasks in 2026.
  French legal NLP has matured into a distinct vertical with its own corpora, evaluation tasks, and tooling — but the landscape is still fragmented.…
- Instruction tuning vs SFT vs RLHF — choosing the right dataset shape.
  Instruction tuning, SFT, and RLHF look interchangeable until you have to build the dataset. Format differs, signal differs, evaluation differs. Here is how to…
- RAG datasets — what makes a corpus retrieval-friendly in 2026.
  RAG performance is more about your corpus than your retriever. Chunking, metadata, provenance per document, and pre-computed embeddings are the ingredients that decide whether…
- Data deduplication for LLM corpora — MinHash LSH, exact match, semantic.
  Deduplication is one of the highest-ROI steps in an LLM data pipeline. Exact match is easy and shallow. MinHash LSH is the workhorse. Semantic…
- Best public datasets for training generative AI models in 2026.
  A 2026 map of the public datasets you can train a generative model on — with licenses, sizes, language coverage, and the trade-offs that…
- Datasheets for datasets — the AI Act Article 10 compliance template.
  Datasheets for datasets, the 2018 framework by Gebru et al., is now the closest thing the AI Act Article 10 has to a recognized…
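The deduplication entry above calls MinHash LSH the workhorse. As a rough, stdlib-only sketch of the MinHash half (signatures whose positional agreement rate estimates Jaccard similarity; the LSH banding step is omitted), assuming character shingles and seeded blake2b as the hash family:

```python
import hashlib


def shingles(text: str, k: int = 5) -> set[str]:
    """Overlapping character k-grams of the input text."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def minhash_signature(items: set[str], num_perm: int = 64) -> list[int]:
    """One minimum per seeded hash function; the fraction of equal
    positions across two signatures estimates the Jaccard similarity
    of the underlying shingle sets."""
    sig = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")  # seeds a distinct hash function
        sig.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode(), digest_size=8, salt=salt).digest(),
                "big",
            )
            for s in items
        ))
    return sig


def estimated_jaccard(a: list[int], b: list[int]) -> float:
    """Fraction of signature positions that agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)
```

In a real pipeline the signatures would be banded into LSH buckets so that only candidate near-duplicate pairs are ever compared; libraries such as datasketch package both steps.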
🧠 LLM Training Data
- Instruction tuning vs SFT vs RLHF — choosing the right dataset shape.
  Instruction tuning, SFT, and RLHF look interchangeable until you have to build the dataset. Format differs, signal differs,…
- RAG datasets — what makes a corpus retrieval-friendly in 2026.
  RAG performance is more about your corpus than your retriever. Chunking, metadata, provenance per document, and pre-computed embeddings…
- Best public datasets for training generative AI models in 2026.
  A 2026 map of the public datasets you can train a generative model on — with licenses, sizes,…
⚙️ Dataset Engineering
- Data deduplication for LLM corpora — MinHash LSH, exact match, semantic.
  Deduplication is one of the highest-ROI steps in an LLM data pipeline. Exact match is easy and shallow.…
- Training data size for LLMs — how many tokens do you actually need in 2026?
  Chinchilla scaling laws versus modern over-training. Concrete token volumes for pre-training, continued pre-training, SFT, DPO, evaluation, and RAG.…
- Choosing a dataset format — Parquet vs JSONL vs Arrow for ML pipelines.
  When to use JSONL, Parquet, or Arrow for ML datasets. The actual behaviour at scale, the operations each…
📋 AI Act & Governance
- Datasheets for datasets — the AI Act Article 10 compliance template.
  Datasheets for datasets, the 2018 framework by Gebru et al., is now the closest thing the AI Act…
- GDPR pseudonymization for LLM training data — patterns and pitfalls.
  Pseudonymization is the missing layer in most LLM training pipelines. Article 4(5) of the GDPR is specific about…
- Building an audit-ready provenance trail for training datasets.
  A per-record provenance trail using PROV-O and JSON-LD sidecars is now a regulatory necessity under the EU AI…
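The pseudonymization entry above leans on Article 4(5) GDPR, under which pseudonymized data may no longer be attributed to a person without additional information that is kept separately. A minimal sketch of one common pattern, deterministic keyed-hash pseudonymization, where the key is that separately kept additional information (the function name and token format here are illustrative, not from the article):

```python
import hashlib
import hmac


def pseudonymize(value: str, key: bytes) -> str:
    """Replace an identifier with a stable token.

    The same value and key always yield the same token, so joins across
    records in a training corpus still work; without the key, the mapping
    can be neither recomputed nor reversed.
    """
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return "pseu_" + digest[:16]
```

Rotating or destroying the key breaks linkage across dataset versions, which is one of the pitfalls the pattern has to plan for.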
🎯 Object Detection
- How to create a training dataset for object detection.
  A practical eight-step guide to building a production-grade object detection training dataset — scoping, sourcing, annotation, QA, splits,…