
Data deduplication for LLM corpora — MinHash LSH, exact match, semantic.

Deduplication is one of the highest-ROI steps in an LLM data pipeline. Skipping it drags down model quality, inflates training cost, and confuses your evaluation signal. The three methods on the table — exact match, MinHash LSH, semantic — are not interchangeable. Each catches a different class of duplicate.

Key takeaways

  • Exact deduplication catches byte-identical content — fast, cheap, and only removes the easiest 5–15 % of duplicates in a mixed corpus. Always run it, but never stop there.
  • MinHash LSH at Jaccard threshold 0.7–0.8 with 5-shingles is the production workhorse. On a multi-source French corpus (2 M docs), it removes 28–32 % of duplicates that exact matching misses.
  • Semantic dedup with sentence embeddings is the 2026 frontier. It catches paraphrases and translations that both exact match and MinHash miss. The cost is GPU compute; the ROI depends on whether your corpus has translation overlap.
  • Cluster-then-keep-highest-quality is the right policy for picking survivors. Random sampling and keep-first both throw away the wrong copy too often. Track the survival rule in your dataset specification.
  • Cross-source dedup matters most. The same parliamentary debate quoted in a court ruling that is itself quoted in a regulator’s decision is a routine pattern in regulatory corpora. Don’t skip the cross-source pass.

Why deduplication matters for LLMs

Three concrete reasons. Quality: training on duplicates over-weights the duplicated content, biases the model toward whatever style is repeated, and degrades perplexity on held-out evaluations. Cost: every duplicate is a token you pay to train on without getting incremental signal. At scale, that is real money. Evaluation: if duplicates leak into your eval set, your benchmark numbers are inflated by content the model already memorized.

The Mistral team, the Falcon team, and the FineWeb team all reported similar findings when they published their pipelines: aggressive deduplication is the single highest-ROI data-quality intervention. The 2022–2023 papers on training-data deduplication (Lee et al., Penedo et al., the SlimPajama report) consistently show that going from exact-only to MinHash-LSH improves downstream task performance with no change to model architecture or training recipe.

Exact-match dedup — what it does and does not catch

Hash each document with SHA-256, group by hash, keep one per group. This catches byte-identical content — typically a mirror, a CDN cache, or a republication that preserved formatting exactly. In a mixed corpus, exact-match removes around 5–15 % of total documents. The rest of the duplicates, often 80–90 % of them, differ by whitespace, trivial reformatting, header changes, or one-byte modifications; exact-match misses all of them.

Two cheap upgrades: normalize whitespace and Unicode form before hashing, and hash the first 1024 characters and the last 1024 characters separately to detect identical body with different boilerplate. Neither replaces MinHash LSH but both add a fast pre-filter that reduces the LSH workload.
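
A minimal sketch of both upgrades, assuming NFC as the Unicode normalization form (the 1024-character windows follow the text above):

```python
import hashlib
import unicodedata

def normalized_hashes(text: str) -> tuple[str, str, str]:
    """Return (full, head, tail) SHA-256 digests after normalization.
    NFC plus whitespace collapsing catches duplicates that differ only
    in encoding or spacing; the 1024-char head/tail hashes flag
    identical bodies wrapped in different boilerplate."""
    norm = " ".join(unicodedata.normalize("NFC", text).split())
    full = hashlib.sha256(norm.encode("utf-8")).hexdigest()
    head = hashlib.sha256(norm[:1024].encode("utf-8")).hexdigest()
    tail = hashlib.sha256(norm[-1024:].encode("utf-8")).hexdigest()
    return full, head, tail
```

Group on the full hash first; documents that agree on head or tail hashes but not the full hash are candidates for the LSH pass, not automatic drops.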

MinHash LSH — the workhorse

MinHash is a locality-sensitive hashing technique that estimates the Jaccard similarity between two sets without comparing them element by element. For documents, the set is usually the k-shingles (contiguous sequences of k tokens). Each position of the MinHash signature matches across two documents with probability equal to their Jaccard similarity, so two documents that share 80 % of their shingles agree on close to 80 % of their signature. The LSH part buckets documents whose signatures agree on every row of at least one band, so candidate duplicates are retrieved in near-constant time per query.

The Python datasketch library is the canonical implementation. Build a MinHashLSH with threshold=0.7 and num_perm=128, compute a MinHash per document, insert into the LSH, query back, and use a union-find to cluster the resulting pairs into duplicate groups. Wall-clock for 2 million documents at this configuration: roughly an hour on a single CPU machine. Memory: 32–64 GB depending on document length.
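
A sketch of that recipe (the load_tokenized_docs loader is a hypothetical stand-in for your own corpus reader):

```python
from datasketch import MinHash, MinHashLSH

def shingles(tokens, k=5):
    """k-shingles: the set of contiguous k-token sequences."""
    return {" ".join(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}

corpus = load_tokenized_docs()  # hypothetical loader: {doc_id: token list}

lsh = MinHashLSH(threshold=0.7, num_perm=128)
minhashes = {}
for doc_id, tokens in corpus.items():
    m = MinHash(num_perm=128)
    for sh in shingles(tokens):
        m.update(sh.encode("utf-8"))
    lsh.insert(doc_id, m)
    minhashes[doc_id] = m

# Union-find clusters the candidate pairs into duplicate groups.
parent = {d: d for d in minhashes}
def find(x):
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path halving
        x = parent[x]
    return x

for doc_id, m in minhashes.items():
    for cand in lsh.query(m):  # query also returns doc_id itself
        if cand != doc_id:
            parent[find(cand)] = find(doc_id)

# Documents sharing a root form one duplicate cluster.
```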

WATCH OUT

MinHash with k-shingles is sensitive to document length. Short documents (under 100 tokens) produce few shingles, so the Jaccard estimator is noisy and false positives and false negatives both rise. For corpora with many short documents, mix a short-doc exact-prefix bucket with a long-doc LSH bucket.
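
A sketch of the routing rule (the bucket names and the 100-token cutoff follow the text above):

```python
def dedup_route(tokens, k=5, min_tokens=100):
    """A doc of n tokens yields n - k + 1 k-shingles; below ~100 tokens
    that is too few for a stable Jaccard estimate. Route short docs to
    an exact-prefix bucket, long docs to the MinHash LSH pass."""
    return "exact_prefix" if len(tokens) < min_tokens else "minhash_lsh"
```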

Tuning MinHash LSH — k-shingles, num_perm, threshold

| Parameter | Common range | Effect | Default to start |
|---|---|---|---|
| k (shingle size) | 3–7 tokens | Smaller k = more matches, more noise; larger k = stricter | 5 |
| num_perm | 64–256 | More perms = more precise Jaccard estimate, more compute | 128 |
| threshold | 0.5–0.9 | Lower = aggressive dedup, more false positives | 0.7 |
| bands × rows | derived from threshold + num_perm | Trade-off: recall vs precision | auto |
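
The bands × rows trade-off follows from the LSH collision formula: with b bands of r rows (b · r = num_perm), two documents at Jaccard similarity s collide in at least one band with probability 1 − (1 − s^r)^b. A quick sketch to inspect the curve; the 16 × 8 split is one illustrative factorization of num_perm = 128, not necessarily what datasketch derives for threshold 0.7:

```python
def candidate_prob(s: float, b: int = 16, r: int = 8) -> float:
    """P(two docs collide in at least one band) at Jaccard similarity s,
    with b bands of r rows each (b * r = num_perm)."""
    return 1 - (1 - s**r) ** b

for s in (0.5, 0.6, 0.7, 0.8, 0.9):
    print(f"s={s:.1f}  P(candidate)={candidate_prob(s):.3f}")
```

The output shows the characteristic S-curve: pairs well below the threshold almost never become candidates, pairs well above it almost always do.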

The CCNet recipe (Wenzek et al.) and the FineWeb recipe both start at threshold=0.7, k=5, num_perm=128. The Pleias Common Corpus team uses a slightly stricter 0.8 threshold. If you want a single starting configuration without tuning, use the CCNet defaults — they generalize well across European languages and produce stable dedup rates.

Most teams over-tune MinHash. Pick the CCNet defaults, run, audit a hundred pairs, and only then adjust. A random walk over hyperparameters wastes more time than it saves.

Semantic deduplication — when and how

Semantic dedup uses sentence embeddings (E5-multilingual-large, BGE-multilingual, Cohere Embed) to find documents that say the same thing in different words. The SemDeDup paper (Abbas et al., 2023) showed measurable gains on top of MinHash LSH for CommonCrawl-derived corpora — particularly for content that has been paraphrased across news sites or translated between languages.

Practical setup: embed each document (or each chunk if long), build an approximate nearest-neighbor index (FAISS, ScaNN, HNSW), threshold at cosine similarity around 0.92–0.95, cluster, keep highest-quality survivor. The compute cost is non-trivial — embedding 2 million documents on a single V100S takes around 6 hours; the ANN index build adds another hour. Worth it for corpora with significant translation or paraphrase overlap; usually overkill for single-language single-domain corpora that MinHash already covers.
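
A sketch of that setup with sentence-transformers and FAISS. The model name, the 0.93 cutoff, and the load_corpus_texts loader are illustrative assumptions; swap the flat index for HNSW or ScaNN at real corpus sizes:

```python
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

texts = load_corpus_texts()  # hypothetical loader returning list[str]

# Model choice is illustrative; any strong multilingual encoder works.
model = SentenceTransformer("intfloat/multilingual-e5-large")
emb = model.encode(texts, normalize_embeddings=True, batch_size=64)
emb = np.asarray(emb, dtype="float32")

# With unit-norm vectors, inner product equals cosine similarity.
index = faiss.IndexFlatIP(emb.shape[1])  # exact search; use HNSW at scale
index.add(emb)

sims, ids = index.search(emb, 10)  # top-10 neighbors per document
pairs = [
    (i, int(j))
    for i, (row_s, row_i) in enumerate(zip(sims, ids))
    for s, j in zip(row_s, row_i)
    if j != i and s >= 0.93  # cutoff inside the 0.92–0.95 band above
]
# Cluster `pairs` (union-find, as in the MinHash pass) and keep the
# highest-quality survivor per cluster.
```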

A French corpus with deduplication audited per release

Our pipeline runs exact-match, MinHash LSH at threshold 0.7 with k=5 and num_perm=128, and tracks the drop rate per source. The numbers are in the signed Dataset Specification.

Cross-source duplicates — the regulatory case

Within a single source, dedup is straightforward — the duplicates are usually mirrors or near-mirrors. Cross-source is where the interesting cases live. In a French regulatory corpus, three patterns repeat:

  • Citation overlap. A Cour de cassation ruling quotes the trial court ruling, which quotes the statute, which is also present in the LEGI archive. Same text appears in 3–5 sources.
  • Republication. JORF publishes an arrêté; a regulator republishes it on its own site with a different header; an industry association republishes the regulator’s version with annotations. Same body, three documents.
  • Translation chains. An EU regulation is published in French in EUR-Lex, transposed into French law in JORF, summarized by the ACPR in a position document, and later interpreted by the Cour de cassation. The shared substantive content varies in framing, but the legal text propagates.

Skipping cross-source dedup on a regulatory corpus leaves 15–30 % of the docs as near-duplicates of each other. The model trained on this learns the regulatory style as “repeat the same paragraph in slightly different framings,” which is not what you want.

Survivor selection — which copy to keep

Given a cluster of N duplicates, you must keep exactly one. The three policies in practice: keep-first (whichever was indexed first), keep-random, and keep-highest-quality. The third is the right default — pick the document with the highest stage 2 quality score, break ties by lexicographic ID for reproducibility.

For cross-source clusters, also encode a source-priority list. In a finance + regulatory corpus, an example priority might be: LEGI (canonical statute) > JORF (published version) > regulator republication > industry annotation. This keeps the primary-source version of each duplicate group and discards the derivative copies.
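
A sketch of the combined policy; the quality and source_rank maps are assumed inputs from the quality-scoring stage and the priority list:

```python
def pick_survivor(cluster, quality, source_rank):
    """cluster: list of doc IDs in one duplicate group.
    quality: {doc_id: stage-2 quality score, higher is better}.
    source_rank: {doc_id: priority, lower is more canonical
                  (e.g. LEGI=0, JORF=1, regulator=2, industry=3)}.
    Sorts by source priority, then quality, then lexicographic ID,
    so reruns always pick the same survivor."""
    return min(cluster, key=lambda d: (source_rank[d], -quality[d], d))
```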

Numbers from a real corpus

Concrete numbers from our own pipeline on the French finance + regulatory corpus, as of v1.2:

| Source | Input docs | After exact dedup | After MinHash LSH 0.7 | Drop % |
|---|---|---|---|---|
| JORF (1947–2026) | 1,869,284 | 1,869,284 | 1,341,691 | 28.2 % |
| CASS jurisprudence | 144,159 | 143,898 | 143,737 | 0.3 % |
| JADE Conseil d’État | 552,576 | 551,902 | 513,709 | 7.0 % |
| EUR-Lex FR | 143,889 | 143,889 | 100,433 | 30.2 % |
| CNIL deliberations | 8,126 | 8,126 | 5,837 | 28.2 % |

JORF and CNIL show the highest drop rates because both have a lot of template-style repetition (similar decrees, similar authorization-type deliberations). EUR-Lex hits a 30 % drop because the bulk archive includes substantial overlap between consolidated and non-consolidated versions of the same regulations. The CASS rate is low because rulings genuinely differ document-to-document. These numbers are stable across our monthly refresh.

A pre-deduped French corpus, audited per release

Cross-source MinHash LSH at threshold 0.7, k=5, num_perm=128. Drop rates published per source in the signed Dataset Specification. You can re-verify against the source manifests.

Frequently asked questions

Should I deduplicate before or after quality filtering?

Dedup after coarse filters (language, minimum length) and before fine quality scoring. The reason: quality scoring on duplicates wastes compute, but dedup on documents that fail language detection produces noisy clusters. Sequence: language → min-length → exact dedup → MinHash LSH → quality scoring.

What threshold should I use if I have a small corpus?

For small corpora (under 100K documents), a stricter threshold like 0.8 or 0.85 is safer. False positives matter more when each surviving document carries more weight. Audit 50 pairs at the boundary and adjust. Production thresholds for large generalist corpora cluster around 0.7.

Does deduplication hurt minority dialects or rare languages?

It can. Aggressive dedup on a multilingual corpus where one language is overwhelmingly represented may strip too much of the dominant language while leaving rare-language duplicates intact. Best practice: dedup per-language buckets independently, then merge.
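
A sketch of that pattern; the lang field and the injected dedup_fn are assumptions about your document schema and pipeline:

```python
from collections import defaultdict

def dedup_per_language(docs, dedup_fn):
    """Bucket by detected language, dedup each bucket independently,
    then merge. docs: iterable of dicts with a 'lang' field (assumed
    schema); dedup_fn: any of the passes sketched above."""
    buckets = defaultdict(list)
    for d in docs:
        buckets[d["lang"]].append(d)
    return [kept for bucket in buckets.values() for kept in dedup_fn(bucket)]
```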

How do I know dedup actually helped?

Two checks. (1) Train two small models, one with dedup and one without, and evaluate on a held-out perplexity set; the deduped model typically comes out 1–3 % lower in perplexity. (2) Probe the eval set for memorization with a string-matching check. A higher memorization rate without dedup is the visible signal.
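
A minimal sketch of such a probe, using whitespace tokenization and a first-window check for simplicity (production probes sample multiple windows and use a Bloom filter instead of an in-memory set):

```python
def build_ngram_set(train_texts, n=50):
    """Sliding-window n-gram set over the training corpus.
    Memory-heavy at scale; a Bloom filter is the usual substitute."""
    grams = set()
    for text in train_texts:
        toks = text.split()
        grams.update(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))
    return grams

def memorization_rate(eval_texts, train_grams, n=50):
    """Fraction of eval docs whose first n tokens appear verbatim
    somewhere in the training corpus."""
    hits = sum(tuple(t.split()[:n]) in train_grams for t in eval_texts)
    return hits / max(len(eval_texts), 1)
```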

Is exact-match dedup ever enough?

Only for very curated corpora (academic textbooks, manually maintained article collections) where the source material does not republish itself. Anything derived from CommonCrawl, web scraping, or multi-publisher news always needs MinHash on top.


