Essay · French Corpus LLM · Finaleads LLC

RAG datasets — what makes a corpus retrieval-friendly in 2026.

Most teams shopping for retrieval-augmented generation focus on the retriever — vector DB, reranker, embedding model. The data is treated as input. That is backwards. RAG quality is decided in your corpus structure: how you chunk, what you index, what metadata you keep, and whether the model can trust what it retrieves.

Key takeaways

  • A retrieval-friendly corpus ships with per-chunk metadata, not just per-document. Source URL, publication date, jurisdiction, document type — at the chunk level, queryable as filters.
  • Pre-computed embeddings move RAG from a research prototype to a 10-minute deployment. E5-multilingual-large at 1024 dimensions is the 2026 default for multilingual production. Ship them with the corpus.
  • A chunk size of 512 tokens with 64-token overlap is the production default. Smaller chunks improve precision but kill cross-paragraph context. Domain-specific tuning is real — legal text wants bigger chunks than news.
  • Provenance per chunk is the AI Act Article 10 footprint of RAG. When your retriever surfaces a chunk, the citation chain has to lead back to a canonical source. Skip this and your enterprise customer’s audit fails.
  • Hybrid retrieval (dense + sparse + reranker) beats pure dense by 5–15 % on regulatory text. Even with a perfect embedding model, sparse signal (exact match on a CELEX number, an article reference) catches what semantics can’t.

What makes a corpus “retrieval-friendly”

Two corpora with the same raw content can produce very different RAG quality. The difference is in five preparation choices: chunking strategy, metadata schema, embedding model, provenance density, and source-aware filtering. A retrieval-friendly corpus is one where those choices are made deliberately and shipped alongside the text.

The opposite — a “pretraining-friendly” corpus — optimizes for raw token throughput, light cleaning, broad license, no chunking, no embeddings. Pleias Common Corpus, FineWeb-2, RedPajama are pretraining-shaped. They will work in a RAG pipeline, but you do all the preparation work yourself. A vertical corpus shipped for RAG already has that work done.

Chunking — size, overlap, boundaries

A chunk size of 512 tokens with 64-token overlap is the production default for E5- and BGE-class embedding models. Smaller chunks (128–256 tokens) improve precision on specific-fact retrieval but lose cross-paragraph context — a model that retrieves a single sentence often cannot reason about the surrounding argument. Larger chunks (1024+ tokens) are good for long-context models but reduce recall granularity.

Boundary detection matters more than size. Splitting mid-sentence is the most common mistake. The right boundary order is: section heading > paragraph > sentence > token. For legal text specifically, also respect article boundaries — never split an article across two chunks if you can avoid it, because the article number is the citation key.

| Domain | Chunk size (tokens) | Overlap (tokens) | Rationale |
| --- | --- | --- | --- |
| News / web | 256–512 | 32–64 | Short, self-contained paragraphs |
| Legal / regulatory | 512–768 | 64–128 | Articles are the natural unit |
| Scientific papers | 512–1024 | 128 | Cross-section reasoning |
| Code / API docs | 256–512 | 64 | Function-level boundaries |
| Conversational logs | 128–256 | 32 | Turn-level retrieval |

Anchor on structure

If your source documents have a table of contents or visible structure (HTML headings, markdown sections, PDF outline), use it. Splitting on existing structure gives chunks that humans understand as units. Random splitting at 512 tokens generates chunks that look fine to the retriever but make no sense in a citation.
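A minimal sketch of structure-first chunking under those defaults (512-token chunks, 64-token overlap), assuming paragraphs are separated by blank lines and tokens are approximated by whitespace words; a production chunker would use the embedding model's tokenizer and also split oversized paragraphs at sentence boundaries.

```python
# Structure-first chunker: split on paragraphs, then pack paragraphs into
# ~512-token windows with a 64-token overlap carried between chunks.
# Token counts are approximated with whitespace splitting for illustration.

def chunk_document(text: str, max_tokens: int = 512, overlap: int = 64) -> list[str]:
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current, current_len = [], [], 0

    for para in paragraphs:
        para_len = len(para.split())
        if current and current_len + para_len > max_tokens:
            chunks.append("\n\n".join(current))
            # Carry the tail of the closed chunk forward as overlap.
            tail = " ".join(" ".join(current).split()[-overlap:])
            current, current_len = [tail], len(tail.split())
        current.append(para)          # a paragraph longer than max_tokens would
        current_len += para_len       # still need a sentence-level split

    if current:
        chunks.append("\n\n".join(current))
    return chunks
```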

Metadata per chunk — what to keep, what to drop

The minimum metadata schema per chunk: chunk ID, parent document ID, source URL, publication date, chunk position within document, chunk text length. Anything you might want to filter on at query time should be a metadata field, not a tag inside the text.
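Concretely, the minimum schema might look like the record below; the field names and values are illustrative, not a fixed spec.

```python
# Illustrative per-chunk record; field names are examples, not a required schema.
chunk_record = {
    "chunk_id": "doc_00042_c007",
    "doc_id": "doc_00042",
    "source_url": "https://example.org/some-regulation",  # hypothetical URL
    "published": "2024-03-18",
    "position": 7,        # chunk index within the parent document
    "text_len": 1830,     # length of the chunk text in characters (or tokens)
    "text": "…chunk text…",
}
```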

Domain-specific metadata that earns its keep:

  • Legal corpora: jurisdiction, court level, article reference, CELEX number, date of decision, citation list (which other documents this chunk cites).
  • Financial corpora: entity tickers mentioned, currency, time period discussed, document type (10-K, regulator decision, research note), regulator if any.
  • Multilingual corpora: language, language confidence, translation fidelity score if the chunk is translated content.
  • Regulatory corpora: regulator name, regulatory topic (e.g., AML, GDPR, prudential), authority of the document (binding, soft law, guidance).

A retriever that cannot filter by jurisdiction or date is useless on regulatory questions. Metadata is the dial that turns RAG from research demo to production tool.
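A toy illustration of that filtering: metadata is applied before ranking, with brute-force cosine search over in-memory records standing in for the vector store. In production the same filter would be pushed down into the vector database; field names here are illustrative, matching the domain metadata just listed.

```python
import numpy as np

def filtered_search(query_vec, chunks, jurisdiction="FR", min_date="2023-01-01", k=5):
    """Filter chunks on metadata first, then rank the survivors by cosine similarity.
    Assumes each chunk dict carries an L2-normalized 'embedding' plus metadata fields."""
    candidates = [c for c in chunks
                  if c["jurisdiction"] == jurisdiction and c["published"] >= min_date]
    if not candidates:
        return []
    mat = np.stack([c["embedding"] for c in candidates])        # (n, dim)
    scores = mat @ (query_vec / np.linalg.norm(query_vec))      # cosine, since rows are normalized
    order = np.argsort(-scores)[:k]
    return [(candidates[i]["chunk_id"], float(scores[i])) for i in order]
```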

Pre-computed embeddings — model and dimensions

Three reasons to pre-compute and ship embeddings with the corpus: (1) speed to first query — your customer indexes Parquet, not text; (2) consistency — everyone using the corpus uses the same embedding, so retrieval results are reproducible; (3) cost — embedding 2 million documents once on the corpus owner’s GPU is cheaper than every customer doing it themselves.
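A sketch of the pre-compute step with sentence-transformers and pyarrow. It uses multilingual-e5-large (the non-instruct variant, whose "passage:" prefix convention is simple); the instruct variant in the table below expects a different prompt format, so check its model card before swapping it in. Paths and the chunk loader are placeholders.

```python
from sentence_transformers import SentenceTransformer
import pyarrow as pa
import pyarrow.parquet as pq

model = SentenceTransformer("intfloat/multilingual-e5-large")

chunks = load_chunks()  # hypothetical helper returning [{"chunk_id": ..., "text": ...}, ...]
# E5 (non-instruct) expects a "passage: " prefix on the document side.
texts = ["passage: " + c["text"] for c in chunks]
embeddings = model.encode(texts, batch_size=64, normalize_embeddings=True)

table = pa.table({
    "chunk_id":  [c["chunk_id"] for c in chunks],
    "text":      [c["text"] for c in chunks],
    "embedding": [e.tolist() for e in embeddings],   # list<float>, 1024-dim
})
pq.write_table(table, "corpus_chunks.parquet")
```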

Model choice as of 2026:

| Model | Dim | Languages | Posture |
| --- | --- | --- | --- |
| E5-multilingual-large-instruct | 1024 | 100+ | Production default 2026 |
| BGE-multilingual-gemma2 | 1024 | 100+ | Strong multilingual |
| mGTE-large | 1024 | 70+ | Strong on legal / long-doc |
| multilingual-e5-base | 768 | 100+ | Lighter, faster |
| CamemBERT v2 + projection | 768 | FR only | FR-specific fine-tune |

Storage: float32 at 1024 dim is 4 KB per chunk; INT8-quantized is 1 KB. For a 2-million-chunk corpus, that is 8 GB float32 vs 2 GB INT8 — both manageable. INT8 quantization loses around 0.5–1.5 % retrieval quality, which is usually worth it.
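A rough sketch of symmetric per-vector INT8 scalar quantization; vector databases implement this internally, so this is only to make the 4 KB-to-1 KB trade-off concrete.

```python
import numpy as np

def quantize_int8(vecs: np.ndarray):
    """Symmetric per-vector scalar quantization: 4 bytes/dim (float32) -> 1 byte/dim (int8)."""
    scales = np.abs(vecs).max(axis=1, keepdims=True) / 127.0   # one scale per vector
    q = np.round(vecs / scales).astype(np.int8)
    return q, scales.astype(np.float32)

def dequantize_int8(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    """Recover approximate float32 vectors for scoring."""
    return q.astype(np.float32) * scales
```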

A French finance + regulatory corpus with E5-multilingual embeddings

We ship the corpus with pre-computed embeddings, per-chunk metadata, jurisdiction and citation graph. Plug into Pinecone, Weaviate, or pgvector in 10 minutes.

Hybrid retrieval — why dense alone is not enough

Pure dense retrieval (cosine on embeddings) is the 2022 default. In 2026 it is no longer enough for regulatory or legal retrieval. The reason: semantic models smooth over exact identifiers. A query for “Article L. 612-15” or CELEX “32023R1114” (MiCA) is a precise lookup, not a semantic one. Dense retrieval lands somewhere close but not exact; sparse retrieval (BM25) lands on it.

Standard hybrid: BM25 for sparse, E5-multilingual or BGE for dense, fused via reciprocal rank fusion or learned score weighting, then a cross-encoder reranker (BGE-reranker or Cohere Rerank) on the top 25 candidates. On internal evals across French regulatory text, this beat pure dense by 5–15 % nDCG@10. The cost is operational complexity, not training cost — both retrievers run cheap once the index is built.
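Reciprocal rank fusion itself is only a few lines; the sketch below merges a BM25 ranking and a dense ranking of chunk IDs, after which the fused top candidates go to the cross-encoder.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several best-first ranked lists of chunk IDs with the RRF score sum(1 / (k + rank))."""
    scores = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Typical use: fuse the two retrievers, then rerank the fused head.
# fused_top25 = reciprocal_rank_fusion([bm25_top100, dense_top100])[:25]
```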

Provenance and citation — the audit angle

When your retriever surfaces a chunk and the model uses it, the citation chain should lead back to a canonical source. For an enterprise compliance use case, that means: the chunk has a stable URL or document ID, the document has an audit trail (when extracted, from where, with what license), and the model’s answer can include the citation in the output.

This is also the AI Act Article 10 footprint of RAG. A generative AI provider deploying a regulated-domain assistant has to demonstrate where the retrieved content came from. A corpus that ships per-chunk provenance JSON-LD makes that demonstration trivial — you read the audit field directly. A corpus that does not requires you to rebuild the trail yourself.
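An illustrative shape for such a per-chunk provenance record, using schema.org vocabulary; the field choices and values below are an assumption, not a prescribed AI Act format.

```json
{
  "@context": "https://schema.org",
  "@type": "CreativeWork",
  "identifier": "doc_00042_c007",
  "isPartOf": "doc_00042",
  "url": "https://example.org/canonical-source",
  "datePublished": "2024-03-18",
  "dateCreated": "2026-01-05",
  "license": "https://creativecommons.org/licenses/by/4.0/",
  "description": "Extracted from the source HTML on 2026-01-05; checksum recorded in the document audit trail."
}
```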

Eval — measuring retrieval quality on your corpus

Two evals to run on any RAG corpus before deployment: a synthetic-query eval and a real-query eval. Synthetic: for each document, use a Llama- or Qwen-class model to generate 1–3 questions that can be answered only from that document; run those questions against the index and measure recall@K — does a chunk from the source document come back in the top K? Real: pull 100 actual user questions from your support tickets or sales calls; have a domain expert score the retrieved chunks for relevance.
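The synthetic eval reduces to a recall@K loop once you have (question, source document) pairs; a minimal sketch, assuming a search function that returns ranked document IDs.

```python
def recall_at_k(eval_pairs, search_fn, k: int = 10) -> float:
    """eval_pairs: list of (question, expected_doc_id) tuples.
    search_fn(question, top_k) must return a ranked list of document IDs."""
    hits = 0
    for question, expected_doc_id in eval_pairs:
        retrieved = search_fn(question, top_k=k)
        if expected_doc_id in retrieved:
            hits += 1
    return hits / len(eval_pairs)
```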

Benchmarks: BEIR is the canonical retrieval benchmark in 2026, but its domains are broad — for vertical RAG you need a domain-specific eval. MIRAGE (medical), LegalBench (legal), MIRACL (multilingual general). For French regulatory specifically, no public eval exists yet; build your own from 100–500 expert-curated query-document pairs.

Common failure patterns in 2026 RAG deployments

  • Splitting on tokens, not structure. Chunks land mid-sentence, the retriever surfaces incomplete information, the model generates plausible nonsense.
  • Missing date metadata. A query about “the current regulation” retrieves a 2018 version because the chunk has no date filter. Avoidable if every chunk carries publication date and the retriever respects it.
  • Embedding the wrong text. Embedding the chunk including boilerplate, footer, or page number degrades retrieval. Clean before you embed.
  • Single embedding model. Multilingual corpus, monolingual embedding. Recall on the non-English subset drops by 30 %+. Use a multilingual model from the start.
  • No reranker. Top-K candidates from dense retrieval are noisy. A cross-encoder reranker on top 25–50 is the single highest-ROI addition to a 2026 RAG pipeline.
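A sketch of that reranking step with sentence-transformers' CrossEncoder; the checkpoint name is one plausible open-weight choice, and any cross-encoder reranker is called the same way.

```python
from sentence_transformers import CrossEncoder

# A BGE-reranker-class checkpoint; swap in whichever cross-encoder you deploy.
reranker = CrossEncoder("BAAI/bge-reranker-v2-m3")

def rerank(query: str, candidates: list[dict], top_k: int = 10) -> list[dict]:
    """candidates: retrieved chunks with a 'text' field, e.g. the hybrid top 25-50."""
    scores = reranker.predict([(query, c["text"]) for c in candidates])
    order = sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)
    return [candidates[i] for i in order[:top_k]]
```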

A vertical French corpus ready for RAG

Per-chunk metadata (jurisdiction, date, citation graph), E5-multilingual embeddings, audit-trail per document. Plug-and-play for finance, regulatory, and economic RAG.

Frequently asked questions

Do I need a vector database or is a flat file enough?

For under 1 million chunks, a flat file with FAISS or HNSWlib is fine and removes a service dependency. For larger corpora or strict latency SLAs, a managed vector DB (Pinecone, Weaviate, Qdrant Cloud) saves you operational work. pgvector inside PostgreSQL is the right answer if you already run Postgres and the corpus fits.
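A sketch of the flat-file route, assuming the corpus ships as Parquet with L2-normalized embeddings (so inner product equals cosine) and the file and column names from the embedding example above.

```python
import faiss
import numpy as np
import pyarrow.parquet as pq

table = pq.read_table("corpus_chunks.parquet")
embeddings = np.array(table["embedding"].to_pylist(), dtype="float32")  # (n_chunks, dim)
chunk_ids = table["chunk_id"].to_pylist()

index = faiss.IndexFlatIP(embeddings.shape[1])   # exact inner-product search, no service to run
index.add(embeddings)

def search(query_vec: np.ndarray, k: int = 10):
    """query_vec: a single L2-normalized embedding of the user query."""
    scores, idx = index.search(query_vec.reshape(1, -1).astype("float32"), k)
    return [(chunk_ids[i], float(s)) for i, s in zip(idx[0], scores[0])]
```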

How often should I re-embed?

Re-embed when the embedding model changes or when a corpus refresh adds significant new content (10 %+). Embedding only the new content incrementally is the standard pattern. A full re-embed every 6–12 months keeps the corpus aligned with model improvements without massive cost.

Can I use OpenAI embeddings instead of open-source ones?

Yes, but you create vendor lock-in and a network dependency. OpenAI text-embedding-3-large is currently top-tier on English and good on multilingual, but it is a closed-weight API. For production RAG on a corpus you ship to customers, open-weight (E5, BGE, mGTE) is the more durable choice.

Does fine-tuning the embedding model help?

On domain-specific corpora, yes. Contrastive fine-tuning with Sentence-Transformers (SimCSE-style or a multiple-negatives ranking loss) on (query, relevant_chunk) pairs from your domain typically improves domain retrieval by 5–15 % with a few thousand pairs. Curating the pairs costs more than the compute.
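A minimal sketch of such a fine-tune with sentence-transformers and a multiple-negatives ranking loss (in-batch negatives); the base model, hyperparameters, and the pair loader are placeholders.

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("intfloat/multilingual-e5-base")

# pairs: (query, relevant_chunk_text) tuples from your domain; load_domain_pairs() is hypothetical.
train_examples = [InputExample(texts=[q, chunk]) for q, chunk in load_domain_pairs()]
train_loader = DataLoader(train_examples, shuffle=True, batch_size=32)

# In-batch negatives: every other chunk in the batch acts as a negative for each query.
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(train_objectives=[(train_loader, train_loss)], epochs=1, warmup_steps=100)
model.save("e5-base-domain-ft")
```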

What about long-context models — do they replace RAG?

They reduce some classes of RAG use cases but do not replace RAG for large corpora. A 200K-context model can read an entire book, but it cannot read a 2-million-document corpus. RAG is the necessary architecture for corpus-scale knowledge. Long-context models complement RAG by accepting more retrieved chunks per query, which lets you operate at lower retrieval precision.


Keep reading

  • How to train an LLM on your own data — a practical 2026 guide. End-to-end fine-tuning playbook for a small team — hardware, dataset prep, evaluation.
  • Choosing a dataset format — Parquet vs JSONL vs Arrow. Format choices that decide how fast you can re-embed and re-index when the model changes.
  • Training data size for LLMs — how many tokens do you actually need in 2026? Tokens for pretraining and chunks for RAG live in different mental models. Both matter.
