Editorial · French Corpus LLM · regulatory & generative AI

Category: Dataset Engineering

Practical engineering for production datasets — formats, schemas, splits, versioning, quality assurance, deduplication, and reproducible pipelines from raw source to release.

  • Data deduplication for LLM corpora — MinHash LSH, exact match, semantic.

    Deduplication is one of the highest-ROI steps in an LLM data pipeline. Skipping it drags down model quality, inflates training cost, and confuses your evaluation signal. The three methods on the table — exact match, MinHash LSH, semantic — are not interchangeable. Each catches a different class of duplicate.

    Key takeaways

    • Exact deduplication catches byte-identical content — fast, cheap, and only removes the easiest 5–15 % of duplicates in a mixed corpus. Always run it, but never stop there.
    • MinHash LSH at Jaccard threshold 0.7–0.8 with 5-shingles is the production workhorse. On a multi-source French corpus (2 M docs), it removes 28–32 % of duplicates that exact matching misses.
    • Semantic dedup with sentence embeddings is the 2026 frontier. It catches paraphrases and translations that both exact matching and MinHash miss. Cost is GPU compute; ROI depends on whether your corpus has translation overlap.
    • Cluster-then-keep-highest-quality is the right policy for picking survivors. Random sampling and keep-first both throw away the wrong copy too often. Track the survival rule in your dataset specification.
    • Cross-source dedup matters most. The same parliamentary debate quoted in a court ruling, itself quoted in a regulator's decision, is a routine pattern in regulatory corpora. Don't skip the cross-source pass.

    Why deduplication matters for LLMs

    Three concrete reasons. Quality: training on duplicates over-weights the duplicated content, biases the model toward whatever style is repeated, and degrades perplexity on held-out evaluations. Cost: every duplicate is a token you pay to train on without getting incremental signal. At scale, that is real money. Evaluation: if duplicates leak into your eval set, your benchmark numbers are inflated by content the model already memorized.

    The Mistral team, the Falcon team, and the FineWeb team all reported similar findings when they published their pipelines: aggressive deduplication is the single highest-ROI data-quality intervention. The 2022–2023 papers on training-data deduplication (Lee et al., Penedo et al., the SlimPajama report) consistently show that going from exact-only to MinHash-LSH improves downstream task performance with no change to model architecture or training recipe.

    Exact-match dedup — what it does and does not catch

    Hash each document with SHA-256, group by hash, keep one per group. This catches byte-identical content — typically a mirror, a CDN cache, or a republication that preserved formatting exactly. In a mixed corpus, exact-match removes around 5–15 % of total documents. The remaining 80–90 % of duplicates differ by whitespace, trivial reformatting, header changes, or one-byte modifications — exact-match misses all of them.

    Two cheap upgrades: normalize whitespace and Unicode form before hashing, and hash the first 1024 characters and the last 1024 characters separately to detect identical body with different boilerplate. Neither replaces MinHash LSH but both add a fast pre-filter that reduces the LSH workload.
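
    A minimal sketch of those two upgrades (Unicode and whitespace normalization, plus separate prefix and suffix hashes), assuming documents arrive as (doc_id, text) pairs; the helper names are illustrative, not a fixed API:

    import hashlib
    import unicodedata

    def normalized_hashes(text: str, edge: int = 1024) -> tuple[str, str, str]:
        """Return (full, prefix, suffix) SHA-256 hashes of a normalized document."""
        # Normalize Unicode form and collapse whitespace before hashing, so trivial
        # formatting differences no longer defeat exact matching.
        norm = unicodedata.normalize("NFKC", text)
        norm = " ".join(norm.split())
        full = hashlib.sha256(norm.encode("utf-8")).hexdigest()
        prefix = hashlib.sha256(norm[:edge].encode("utf-8")).hexdigest()
        suffix = hashlib.sha256(norm[-edge:].encode("utf-8")).hexdigest()
        return full, prefix, suffix

    def exact_dedup(docs):
        """Keep the first document per normalized full hash; yield survivors."""
        seen = set()
        for doc_id, text in docs:
            full, prefix, suffix = normalized_hashes(text)
            if full in seen:
                continue
            seen.add(full)
            # The edge hashes travel with the survivor as a cheap pre-filter
            # before the MinHash LSH pass.
            yield doc_id, text, (prefix, suffix)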

    MinHash LSH — the workhorse

    MinHash is a locality-sensitive hashing technique that estimates Jaccard similarity between two sets in sub-linear time. For documents, the set is usually the k-shingles (contiguous sequences of k tokens). Two documents that share 80 % of their shingles produce MinHash signatures that match with probability close to 80 %. The LSH part groups documents whose signatures match in many bands, so candidate duplicates are retrieved in near-constant time per query.

    The Python datasketch library is the canonical implementation. Build a MinHashLSH with threshold=0.7 and num_perm=128, compute a MinHash per document, insert into the LSH, query back, and use a union-find to cluster the resulting pairs into duplicate groups. Wall-clock for 2 million documents at this configuration: roughly an hour on a single CPU machine. Memory: 32–64 GB depending on document length.
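
    A minimal sketch of that flow with datasketch, taking documents as (doc_id, text) pairs; the whitespace tokenizer and helper names are simplifications for illustration, not the library's own pipeline:

    from datasketch import MinHash, MinHashLSH

    K = 5           # shingle size in tokens
    NUM_PERM = 128  # MinHash permutations

    def shingles(text, k=K):
        toks = text.lower().split()
        return {" ".join(toks[i:i + k]) for i in range(max(len(toks) - k + 1, 1))}

    def minhash(text):
        m = MinHash(num_perm=NUM_PERM)
        for sh in shingles(text):
            m.update(sh.encode("utf-8"))
        return m

    def duplicate_clusters(docs, threshold=0.7):
        """docs: iterable of (doc_id, text). Returns a doc_id -> cluster-root map."""
        lsh = MinHashLSH(threshold=threshold, num_perm=NUM_PERM)
        sigs = {}
        for doc_id, text in docs:
            sigs[doc_id] = minhash(text)
            lsh.insert(doc_id, sigs[doc_id])

        # Union-find over the candidate pairs returned by the LSH queries.
        parent = {d: d for d in sigs}
        def find(x):
            while parent[x] != x:
                parent[x] = parent[parent[x]]
                x = parent[x]
            return x

        for doc_id, sig in sigs.items():
            for other in lsh.query(sig):
                if other != doc_id:
                    parent[find(other)] = find(doc_id)
        return {d: find(d) for d in parent}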

    WATCH OUT

    MinHash with k-shingles is sensitive to document length. Short documents (under 100 tokens) produce few shingles, so the Jaccard estimator is noisy and false positives and false negatives both rise. For corpora with many short documents, mix a short-doc exact-prefix bucket with a long-doc LSH bucket.

    Tuning MinHash LSH — k-shingles, num_perm, threshold

    Parameter          Common range                        Effect                                                       Default to start
    k (shingle size)   3–7 tokens                          Smaller k = more matches, more noise; larger k = stricter    5
    num_perm           64–256                              More perms = more precise Jaccard estimate, more compute     128
    threshold          0.5–0.9                             Lower = aggressive dedup, more false positives               0.7
    bands × rows       derived from threshold + num_perm   Trade-off recall vs precision                                auto

    The CCNet recipe (Wenzek et al.) and the FineWeb recipe both start at threshold=0.7, k=5, num_perm=128. The Pleias Common Corpus team uses a slightly stricter 0.8 threshold. If you want a single starting configuration without tuning, use the CCNet defaults — they generalize well across European languages and produce stable dedup rates.

    Most teams over-tune MinHash. Pick CCNet defaults, run, audit a hundred pairs, only then adjust. Random walk on hyperparameters wastes more time than it saves.

    Semantic deduplication — when and how

    Semantic dedup uses sentence embeddings (multilingual-E5-large, BGE-multilingual, Cohere Embed) to find documents that say the same thing in different words. The 2023 SemDeDup paper showed measurable gains on top of MinHash LSH for CommonCrawl-derived corpora — particularly for content that has been paraphrased across news sites or translated between languages.

    Practical setup: embed each document (or each chunk if long), build an approximate nearest-neighbor index (FAISS, ScaNN, HNSW), threshold at cosine similarity around 0.92–0.95, cluster, keep highest-quality survivor. The compute cost is non-trivial — embedding 2 million documents on a single V100S takes around 6 hours; the ANN index build adds another hour. Worth it for corpora with significant translation or paraphrase overlap; usually overkill for single-language single-domain corpora that MinHash already covers.
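
    A sketch of that setup with sentence-transformers and a flat FAISS index, assuming L2-normalized embeddings so inner product equals cosine similarity; the model name, threshold, and neighbor count are illustrative starting points, not fixed choices:

    import faiss
    import numpy as np
    from sentence_transformers import SentenceTransformer

    def semantic_pairs(texts, model_name="intfloat/multilingual-e5-large",
                       threshold=0.93, k=10):
        """Return index pairs (i, j) whose cosine similarity exceeds the threshold."""
        model = SentenceTransformer(model_name)
        emb = model.encode(texts, normalize_embeddings=True, batch_size=64)
        emb = np.asarray(emb, dtype="float32")

        index = faiss.IndexFlatIP(emb.shape[1])  # inner product == cosine on unit vectors
        index.add(emb)

        scores, neighbors = index.search(emb, k)
        pairs = set()
        for i, (row_s, row_n) in enumerate(zip(scores, neighbors)):
            for s, j in zip(row_s, row_n):
                if j != i and s >= threshold:
                    pairs.add((min(i, int(j)), max(i, int(j))))
        # Feed these pairs into the same union-find clustering as the MinHash stage.
        return pairs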

    A French corpus with deduplication audited per release

    Our pipeline runs exact-match, MinHash LSH at threshold 0.7 with k=5 and num_perm=128, and tracks the drop rate per source. The numbers are in the signed Dataset Specification.

    Cross-source duplicates — the regulatory case

    Within a single source, dedup is straightforward — the duplicates are usually mirrors or near-mirrors. Cross-source is where the interesting cases live. In a French regulatory corpus, three patterns repeat:

    • Citation overlap. A Cour de cassation ruling quotes the trial court ruling, which quotes the statute, which is also present in the LEGI archive. Same text appears in 3–5 sources.
    • Republication. JORF publishes an arrêté; a regulator republishes it on its own site with a different header; an industry association republishes the regulator's version with annotations. Same body, three documents.
    • Translation chains. An EU regulation is published in EUR-Lex in French, transposed into a French law in JORF, summarized by ACPR in a position document, interpreted by Cour de cassation later. The shared substantive content varies in framing but the legal text propagates.

    Skipping cross-source dedup on a regulatory corpus leaves 15–30 % of the docs as near-duplicates of each other. The model trained on this learns the regulatory style as “repeat the same paragraph in slightly different framings,” which is not what you want.

    Survivor selection — which copy to keep

    Given a cluster of N duplicates, you must keep exactly one. The three policies in practice: keep-first (whichever was indexed first), keep-random, and keep-highest-quality. The third is the right default — pick the document with the highest stage 2 quality score, break ties by lexicographic ID for reproducibility.

    For cross-source clusters, also encode a source-priority list. In a finance + regulatory corpus, an example priority might be: LEGI (canonical statute) > JORF (published version) > regulator republication > industry annotation. This keeps the primary-source version of each duplicate group and discards the derivative copies.
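
    A sketch of that survivor rule; the priority map and field names are illustrative, the ordering logic (source priority, then quality, then a reproducible tie-break) is the part that matters:

    # Illustrative source priority: lower rank = more canonical.
    SOURCE_PRIORITY = {"LEGI": 0, "JORF": 1, "regulator": 2, "industry": 3}

    def pick_survivor(cluster: list[dict]) -> dict:
        """cluster: records with 'id', 'source', 'quality_score' fields (names assumed)."""
        return min(
            cluster,
            key=lambda doc: (
                SOURCE_PRIORITY.get(doc["source"], len(SOURCE_PRIORITY)),  # primary source first
                -doc["quality_score"],                                     # then highest quality score
                doc["id"],                                                 # lexicographic tie-break for reproducibility
            ),
        )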

    Numbers from a real corpus

    Concrete numbers from our own pipeline on the French finance + regulatory corpus, as of v1.2:

    Source                Input docs    After exact dedup    After MinHash LSH 0.7    Drop %
    JORF (1947–2026)      1,869,284     1,869,284            1,341,691                28.2 %
    CASS jurisprudence    144,159       143,898              143,737                  0.3 %
    JADE Conseil d'État   552,576       551,902              513,709                  7.0 %
    EUR-Lex FR            143,889       143,889              100,433                  30.2 %
    CNIL deliberations    8,126         8,126                5,837                    28.2 %

    JORF and CNIL show the highest drop rates because both have a lot of template-style repetition (similar decrees, similar authorization-type deliberations). EUR-Lex hits a 30 % drop because the bulk archive includes substantial overlap between consolidated and non-consolidated versions of the same regulations. The CASS rate is low because rulings genuinely differ document-to-document. These numbers are stable across our monthly refresh.

    A pre-deduped French corpus, audited per release

    Cross-source MinHash LSH at threshold 0.7, k=5, num_perm=128. Drop rates published per source in the signed Dataset Specification. You can re-verify against the source manifests.

    Frequently asked questions

    Should I deduplicate before or after quality filtering?

    Dedup after coarse filters (language, minimum length) and before fine quality scoring. The reason: quality scoring on duplicates wastes compute, but dedup on documents that fail language detection produces noisy clusters. Sequence: language → min-length → exact dedup → MinHash LSH → quality scoring.

    What threshold should I use if I have a small corpus?

    For small corpora (under 100K documents), a stricter threshold like 0.8 or 0.85 is safer. False positives matter more when each surviving document carries more weight. Audit 50 pairs at the boundary and adjust. Production thresholds for large generalist corpora cluster around 0.7.

    Does deduplication hurt minority dialects or rare languages?

    It can. On a multilingual corpus where one language is overwhelmingly represented, a single global dedup pass is effectively tuned for the dominant language; the same thresholds behave unevenly on under-represented languages, where every removed document costs far more. Best practice: dedup per-language buckets independently, then merge.

    How do I know dedup actually helped?

    Two checks. (1) Train two small models, one with dedup and one without, evaluate on a held-out perplexity set — the deduped model typically wins by 1–3 % perplexity. (2) Check for memorization on the eval set with a string-matching probe. Higher memorization rate without dedup is the visible signal.

    Is exact-match dedup ever enough?

    Only for very curated corpora (academic textbooks, manually maintained article collections) where the source material does not republish itself. Anything derived from CommonCrawl, web scraping, or multi-publisher news always needs MinHash on top.


    Keep reading

    Read next

    Choosing a dataset format — Parquet vs JSONL vs Arrow

    Format choice that affects how fast your dedup pass reads from disk and how the filtered survivors are written back.

    Read next

    Training data size for LLMs — how many tokens do you actually need in 2026?

    Dedup reduces token count by 20–40 %. Knowing your real post-dedup count is the right input to Chinchilla-style planning.

    Read next

    Best public datasets for training generative AI models in 2026

    Each major public dataset ships with a different deduplication posture. Knowing that matters more than headline token count.

  • Training data size for LLMs — how many tokens do you actually need in 2026?

    "How many tokens do I need?" has eight different right answers, depending on whether you mean pre-training, continued pre-training, SFT, DPO, RLHF, distillation, RAG indexing, or evaluation. The Chinchilla scaling law was the first quantitative answer in 2022; three years on, the industry has both validated and substantially exceeded it.

    Key takeaways

    • Modern pre-training uses 100–500 tokens per parameter, not the Chinchilla 20 — inference compute dominates total cost.
    • Continued pre-training sweet spot: 1–10 B in-domain tokens with a 1:4 in-domain to general mix.
    • SFT sweet spot: 1 000–10 000 curated examples. Above 100 000, diminishing returns.
    • DPO and preference optimization: 10 000–50 000 pairs for style/safety tuning.
    • Eval set: 200–500 examples, never leaked into training. The discipline matters more than the size.

    “How many tokens do I need to train my model?” has eight different right answers, depending on whether you mean pre-training, continued pre-training, SFT, DPO, RLHF, distillation, RAG indexing, or evaluation. The Chinchilla scaling law gave the field its first quantitative answer in 2022. Three years later, the industry has both validated and substantially exceeded it. This guide walks through the actual data-size guidance for each training stage in 2026, the diminishing-returns thresholds, and the trade-offs that determine whether you should aim for “compute-optimal” or “inference-optimal” data volumes.

    The Chinchilla baseline — and why most modern models exceed it

    Inference compute wins

    Llama 3 used 15 T tokens at 8B (1 900 tokens/parameter — way over Chinchilla’s 20). Why? Inference compute dominates total cost over a model’s lifetime. Over-training trades training cost for cheaper inference forever — a winning trade for any deployed model.

    The Chinchilla paper from DeepMind established that, for a fixed compute budget, you should train a smaller model on more tokens rather than a larger model on fewer tokens. The headline finding: roughly 20 tokens per parameter is compute-optimal for pre-training. A 7 B model wants ~140 B tokens; a 70 B model wants ~1.4 T tokens.

    Three years later, every major open-weights release has gone well beyond Chinchilla-optimal:

    • Llama 3 (8 B and 70 B): 15 T tokens. That is ~1 900 tokens/parameter for the 8 B and ~210 for the 70 B.
    • DeepSeek-V3 (671 B mixture-of-experts, 37 B active): 14.8 T tokens.
    • Qwen 2.5 (7 B): 18 T tokens — over 2 500 tokens per parameter.

    This is not a refutation of Chinchilla; it is a different optimization. Chinchilla optimizes training compute. Modern teams optimize inference compute: a model that is “over-trained” (more tokens per parameter than Chinchilla-optimal) achieves the same quality at smaller parameter count, which means cheaper inference for the entire lifetime of the model. Inference compute dwarfs training compute over a model’s deployed life, so the trade-off favours over-training. If you are pre-training from scratch in 2026, aim for 100–500 tokens per parameter, not 20.

    Continued pre-training — moving the needle on a specific domain

    Continued pre-training resumes the base model’s autoregressive language-modelling objective on an in-domain corpus. The data-size thresholds are well established in the literature:

    • Below ~500 M tokens of in-domain data: measurable improvement on in-domain perplexity, often at the cost of out-of-domain capability. Marginal benefit over SFT for most use cases.
    • 1–10 B tokens in-domain: the sweet spot for vertical-language LLMs (medical, legal, finance, code). Enough to shift token frequencies and acquire domain vocabulary without catastrophically forgetting general competencies.
    • 10–100 B tokens in-domain: approaches the volume of a small dedicated pre-training. Genuinely changes the model’s distribution. Reserve for cases where a vertical foundation model is the deliverable.

    One practical rule: maintain a 1:4 in-domain to general-corpus ratio during continued pre-training. A pure in-domain run almost always degrades general fluency more than the in-domain gain justifies.
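
    One way to express that mix with Hugging Face datasets, assuming both corpora are Parquet shards with a shared text column; the paths, and the example-level (rather than token-level) weighting, are simplifying assumptions:

    from datasets import load_dataset, interleave_datasets

    # Paths are placeholders; both datasets are assumed to expose a 'text' column.
    in_domain = load_dataset("parquet", data_files="domain/*.parquet",
                             split="train", streaming=True)
    general = load_dataset("parquet", data_files="general/*.parquet",
                           split="train", streaming=True)

    # 1:4 in-domain to general ratio, i.e. ~20 % in-domain examples on average.
    mixed = interleave_datasets(
        [in_domain, general],
        probabilities=[0.2, 0.8],
        seed=42,
        stopping_strategy="all_exhausted",
    )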

    SFT — examples, not tokens

    Need 1–10 B in-domain tokens of French regulatory text?

    Our corpus ships 460 M tokens of finance, regulation, and economic FR text in tiered packages — the right volume to swap into a continued pre-training mix.

    For supervised fine-tuning, the metric is examples (instruction–response pairs), not tokens. The empirical thresholds, validated across hundreds of public fine-tunes:

    • 50–500 examples: teaches the model a specific format or persona. Useful for tightly scoped tasks (always answer in JSON, always cite sources, refuse a specific class of request).
    • 1 000–10 000 examples: the LIMA range. With careful curation, this band produces high-quality general-purpose assistants. Most public open-source SFT recipes sit here.
    • 10 000–100 000 examples: diminishing returns set in. Each additional 10 K examples often adds 0.5–1 % on benchmarks, sometimes less. Worth it only if the additional examples cover gaps identified in evaluation.
    • Above 100 000 examples: needed for multi-task SFT covering a broad capability surface. Standard for the post-training stage of public foundation models. Rarely needed for vertical fine-tunes.

    Quality dominates quantity throughout this regime. A team that ships 2 000 expert-reviewed SFT examples will outperform a team that ships 50 000 LLM-generated examples on the same task. The composition of an SFT dataset is covered in detail in our companion article on SFT dataset formats.

    DPO and preference optimization — preference pairs

    Direct Preference Optimization and its variants (ORPO, KTO, SimPO) train on pairs of preferred and rejected responses to the same prompt. Data-size guidance:

    • 5 000–20 000 preference pairs: sufficient to tune style, formatting, and safety behaviour after SFT. The post-SFT alignment step for most open-weights releases.
    • 50 000–200 000 preference pairs: needed when DPO is the primary alignment vehicle (no separate SFT or weak SFT). The volume used by frontier alignment pipelines.

    Preference data quality is harder to measure than SFT quality because it is binary (preferred vs rejected) and depends on human-judgment consistency. Use the agreement rate between independent annotators as a quality proxy — below 75 %, the data is too noisy to train on.
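
    A minimal sketch of that proxy as raw pairwise agreement (not chance-corrected kappa), with toy labels standing in for real annotations:

    def agreement_rate(labels_a: list[str], labels_b: list[str]) -> float:
        """Raw agreement between two annotators labelling the same prompts."""
        assert len(labels_a) == len(labels_b)
        matches = sum(a == b for a, b in zip(labels_a, labels_b))
        return matches / len(labels_a)

    # Toy preference picks, one per prompt ("A" = response A preferred, etc.)
    annotator_1 = ["A", "B", "A", "A", "B"]
    annotator_2 = ["A", "B", "B", "A", "B"]
    print(agreement_rate(annotator_1, annotator_2))  # 0.8 — above the 75 % floor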

    Evaluation data — the smallest, the most important

    An eval set is small in absolute terms but disproportionate in impact: it gates every decision about training.

    • 20–50 examples: enough to detect catastrophic regressions during development. Run after every training change. Should never be touched by training data.
    • 200–500 examples: the production eval set. Stratified by task, source, and difficulty. Used for go/no-go decisions and tracked over time as a “regression budget”.
    • 1 000–5 000 examples: for benchmark releases or when shipping a model to a customer who will run their own evaluation. Documented with clear scope, scoring rubric, and known limitations.

    One discipline that matters more than size: never let an eval example leak into training data. The fastest way to inflate apparent quality and the fastest way to lose customer trust when the inflation is found out.

    RAG corpus — quality of chunks, not quantity of tokens

    For retrieval-augmented generation, the question of “how much data” reduces to “how much queryable, well-chunked, well-embedded text”. The thresholds where things change:

    • Under 10 000 documents: a simple vector index (FAISS, Qdrant, Pinecone) on a single machine. Embedding generation takes hours. The retrieval bottleneck is recall, not latency.
    • 10 000 – 1 million documents: hybrid search (BM25 + vectors) becomes worthwhile. Re-ranker on top of retriever starts to materially improve precision.
    • Above 1 million documents: the index becomes the dominant operational cost. Hierarchical retrieval (cluster-first, then chunk-rerank) replaces flat search. Embedding model choice matters more than vector store choice.

    Doubling the corpus rarely doubles answer quality. Doubling the chunk-quality (cleaner extractions, smarter chunking, deduplication) often does. RAG performance is overwhelmingly bottlenecked by chunk quality and re-ranker quality, not raw token count.

    Estimating cost before you commit

    A back-of-envelope formula that gets you within 30 % of the real GPU-hour cost for a pre-training or continued-pre-training run:

    GPU-hours ≈ 6 × N_params × N_tokens / (GPU_TFLOPS × 1e12 × 3600 × utilization)
    
    with utilization = 0.4 to 0.55 on modern stacks

    Plugging in a 7 B continued pre-training on 5 B tokens, with H100 at 989 TFLOPS bf16 and 50 % utilization: about 120 GPU-hours, or five days on a single H100, or roughly 15 hours on an 8× H100 node. At cloud rates (~3 USD/H100-hour in mid-2026), that is roughly 350 USD — a useful sanity check before reserving a cluster.
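
    The same formula as a small helper, with the worked example above plugged in; the ~3 USD/H100-hour price is the figure cited in the text, not a quote:

    def gpu_hours(n_params: float, n_tokens: float,
                  gpu_tflops: float, utilization: float) -> float:
        """Back-of-envelope training cost: FLOPs ≈ 6 × params × tokens."""
        flops = 6 * n_params * n_tokens
        return flops / (gpu_tflops * 1e12 * 3600 * utilization)

    # 7 B continued pre-training on 5 B tokens, H100 bf16 at 989 TFLOPS, 50 % utilization
    hours = gpu_hours(7e9, 5e9, 989, 0.5)
    print(round(hours))      # ≈ 118 GPU-hours
    print(round(hours * 3))  # ≈ 354 USD at ~3 USD per H100-hour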

    For SFT and DPO, the formula is similar but the relevant quantity is example-tokens × epochs, and the utilization is lower (more I/O bound). A 5 K-example SFT on a 7 B model with QLoRA fits in 1–3 hours on a single 24 GB GPU.

    The diminishing-returns line

    If adding 10 % more data yields under 0.3 % improvement on your headline metric, stop scaling data and start scaling something else — annotation quality, prompt design, model architecture.

    At every training stage there is a volume above which adding more data costs more than it gains. The signals that you have crossed it:

    • Eval loss plateaus over the last 20–30 % of training tokens.
    • Eval task accuracy stops improving while training loss continues to drop — the gap between training and eval grows.
    • Adding 10 % more data yields under 0.3 % improvement on the headline metric.
    • Specific failure modes (the ones your customers report) do not budge with more data of the same distribution.

    When any two of these fire, stop scaling data and start scaling something else — annotation quality, prompt design, model architecture, or simply ship the version you have and revisit after deployment data accumulates.

    Quick reference — token volumes by stage

    Stage                       Minimum               Sweet spot            Diminishing returns
    From-scratch pre-training   10× params (tokens)   200–500× params       ~1 000× params
    Continued pre-training      ~500 M tokens         1–10 B tokens         ~100 B tokens
    SFT                         ~100 examples         1 000–10 000          ~100 000
    DPO / preference            ~2 000 pairs          10 000–50 000         ~200 000
    Eval set                    20–50 examples        200–500               ~5 000
    RAG corpus                  any size              quality > quantity    indexing cost dominates

    Bottom line

    The correct answer to “how many tokens do you need” is “what training stage, for what model, with what quality bar”. Pre-training in 2026 wants 200–500 tokens per parameter, not the Chinchilla-optimal 20. Continued pre-training wants 1–10 B in-domain tokens with a general-corpus rehearsal mix. SFT wants 1 000–10 000 carefully curated examples. DPO wants 10 000–50 000 preference pairs. Evals want 200–500 examples and a discipline of never leaking them into training.

    If you are planning a vertical fine-tune and want to compare the actual token volumes that drove our French regulatory and financial text corpus, see our companion guides on SFT datasets and dataset formats — the storage and curation decisions that determine whether your token count translates into usable training signal.

    See also: the best public LLM datasets in 2026 and LLM corpus deduplication techniques.

    Frequently asked questions

    Is the 20-tokens-per-parameter Chinchilla rule still valid?

    Compute-optimal — yes. But modern teams optimize inference compute, not training compute. Over-training (100–500 tokens/parameter) gives a smaller model with the same quality and cheaper inference for the model’s entire deployed life.

    How many SFT examples do I really need?

    1 000–10 000 carefully curated. The 2023 LIMA paper plus three years of replication studies confirm that quality dominates quantity in this range. Above 100 000, returns diminish hard.

    How big should my evaluation set be?

    200–500 stratified examples is the production sweet spot. Smaller for development sanity, larger for benchmark release. Discipline matters more than size — never let eval leak into training.

    How much RAG data do I need?

    Quality of chunks dominates quantity of tokens. Below 10 K documents a flat vector index is fine; 10 K–1 M needs hybrid search and a reranker; above 1 M, index cost dominates and you need hierarchical retrieval.

    How do I know when to stop adding data?

    Four signals: eval loss plateaus, eval gap grows, 10 % more data yields under 0.3 % gain, and specific user-reported failures don’t move. When two of these fire, switch focus.

    Need a tokenized French corpus for continued pre-training?

    We deliver 460 M tokens of finance, regulation, and economic French text as Parquet shards — versioned, provenance-tracked, AI Act art. 10 ready.


    Keep reading

    Read next

    How to train an LLM on your own data

    Which training stage actually needs how much data.

    Read next

    SFT datasets — format and best practices

    What 1 000–10 000 high-quality SFT examples look like.

    Read next

    Choosing a dataset format

    Storage choices that scale with token volume.

  • Choosing a dataset format — Parquet vs JSONL vs Arrow for ML pipelines.

    The choice between Parquet, JSONL, and Arrow looks like a storage detail. It becomes a deployment-blocking constraint the first time your team needs to load 500 GB through a pipeline prototyped on a 200 MB JSONL file.

    Key takeaways

    • JSONL up to ~500 MB or ~1 M examples — beyond that, switch to Parquet.
    • Parquet for canonical storage: schema-typed, columnar, 3–5× compression, splittable.
    • Arrow for the trainer cache: memory-mapped, zero-copy, near-instant random access.
    • Never pre-tokenize into the storage format — tokenizer changes invalidate the cache.
    • 100–500 MB per Parquet file is the sweet spot; one giant file kills splittability.

    The choice between Parquet, JSONL, and Arrow looks like a storage-engineering detail. It becomes a deployment-blocking constraint the first time your team needs to load 500 GB of training data through a pipeline that was prototyped on a 200 MB JSONL file. This guide walks through each format’s actual behaviour at the volumes that matter, the operations they optimize for and against, and a decision rule that pins down which format to use at which stage of an ML pipeline.

    The three formats in one sentence each

    • JSONL — one JSON object per line, text-encoded, schema-free, human-readable, slow to parse, slow to compress, perfect for under-1 GB datasets and for the input/output boundary of a pipeline.
    • Parquet — columnar binary, schema-on-write, heavy compression, fast columnar scans, slow to write, slow to inspect by hand, the canonical storage format for any dataset above a few hundred megabytes.
    • Arrow — columnar in-memory representation that maps directly to Parquet on disk. Zero-copy reads, memory-mapped access, the bridge between disk storage and training-loop dataloaders.

    None of the three is universally better. They serve different stages of the data lifecycle.

    JSONL — when to choose it

    JSONL is the lingua franca of small-to-medium dataset exchange. Its advantages are operational, not technical:

    • Line-oriented. Streamable end-to-end. head, wc -l, grep, jq work without any specialized library.
    • Schema-free. Adding or removing a field per record costs nothing at write time. Useful during dataset construction when the schema is still in flux.
    • Git-friendly. Diffs are readable. Small datasets can live in a repo alongside the code that produced them.
    • Universally supported. Every training framework — TRL, Axolotl, Unsloth, raw PyTorch — accepts JSONL natively.

    The trade-offs become punishing at scale:

    • No compression by default. A 100 MB JSONL file compresses to 15–25 MB with gzip, but loading then decompressing on every epoch wastes throughput.
    • Slow to parse. Python json.loads is one of the slower hot paths in a typical training pipeline. At 50 K examples/s on a single core, an 80 M-example dataset takes 25 minutes per epoch just for parsing.
    • No column projection. Reading one field requires reading the whole record. Reading one record requires scanning until that record’s line break.
    • No types. A column that is sometimes a string, sometimes a number, sometimes null is silently accepted on write and explodes on read.

    Decision rule: use JSONL up to ~500 MB or ~1 M examples. Beyond that, convert to Parquet for storage and use JSONL only as an export format for sharing samples or feeding small jobs.
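
    A minimal conversion sketch with PyArrow; the file paths, codec, and row-group size are starting points to adjust, not prescriptions:

    import pyarrow.json as paj
    import pyarrow.parquet as pq

    # pyarrow.json reads newline-delimited JSON natively; paths are placeholders.
    table = paj.read_json("raw/extract.jsonl")

    pq.write_table(
        table,
        "store/extract.parquet",
        compression="zstd",     # good ratio on text-heavy columns
        row_group_size=50_000,  # rows per row group; tune toward 50–200 MB groups
    )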

    Parquet — when to choose it

    Parquet is the storage format you want once your dataset stops fitting comfortably in RAM. Its design properties translate directly to ML pipeline benefits:

    • Columnar layout. Reading the text column without touching the metadata column is a real I/O saving — often 5–10× faster than row-oriented formats.
    • Type-safe schema. Each column has a declared type (string, int32, float64, list<string>, struct<…>). The schema is stored in the footer; downstream readers cannot silently coerce a string to a number.
    • Column-level compression. Each column is compressed independently with the codec best fit to its type. Text columns hit zstd or snappy; numerical columns hit dictionary or delta encoding. Typical compression ratios for text-heavy ML datasets: 3–5× over JSONL.
    • Predicate pushdown. Modern readers (PyArrow, DuckDB, Polars) can filter on column values before decompressing — a one-line filter on language code reads only the matching row groups (see the sketch after this list).
    • Splittable. Files are organized in row groups (default ~128 MB). Distributed training and Spark-style processing can split files transparently.
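
    A sketch of column projection and predicate pushdown together with PyArrow; the column names and shard directory are illustrative:

    import pyarrow.parquet as pq

    # Only the projected column is read from disk, and only row groups whose
    # statistics can match the filter are decompressed.
    french_text = pq.read_table(
        "store/",                      # a directory of Parquet shards
        columns=["text"],
        filters=[("lang", "=", "fr")],
    )
    # DuckDB equivalent: duckdb.sql("SELECT text FROM 'store/*.parquet' WHERE lang = 'fr'")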

    The trade-offs:

    • Not designed for append. Parquet files are written once. Updating a single record means rewriting the file (or the partition).
    • Schema-on-write. Changing a column type requires rewriting all files. Add columns liberally up-front; they cost nothing if unused.
    • Binary. Cannot be inspected with cat or head. Use parquet-tools, pyarrow, duckdb, or any of the visual Parquet viewers.
    • Tooling weight. Reading a 50 GB Parquet dataset in pure Python without PyArrow or Polars is impractical. Add the dependency.

    Decision rule: use Parquet for any dataset above 500 MB, for any dataset that will be read more than once, and for any dataset whose schema is stable enough to declare.

    Arrow — when to choose it

    Arrow is not really a competing storage format — it is the in-memory representation that Parquet, Feather, and most modern columnar engines deserialize into. The distinction matters because Arrow has its own on-disk variants (Feather / Arrow IPC) used in specific places:

    • Hugging Face datasets cache. When you call load_dataset(), the library materializes the dataset as an Arrow file on disk and memory-maps it during training. This gives near-zero-cost random access to billions of examples.
    • Inter-process and inter-language data exchange. Arrow IPC is the format Pandas, Polars, R, Julia, and Rust use to pass dataframes around without serialization overhead.
    • Streaming RPC. Arrow Flight is the de facto protocol for moving columnar data between services in 2026, replacing gRPC-with-Protobuf for analytics workloads.

    For dataset storage, Feather is rarely the right choice over Parquet: it has weaker compression and weaker tooling support outside the Arrow ecosystem. Use Feather only when the file lifetime is short (intermediate cache) or when zero-overhead read latency matters more than disk size.

    A canonical pipeline that uses all three

    Looking for production-grade Parquet shards of French regulatory text?

    Our corpus ships as Parquet with a stable 10-column schema, deduplicated, and ready to ingest into a Hugging Face datasets pipeline.

    JSONL at the edges where humans look. Parquet in the middle where machines store. Arrow at the end where trainers consume. Forcing one format across all stages produces either opaque binary or a pipeline parsing text.

    A production ML dataset pipeline in 2026 typically uses each format at the stage where it shines:

    1. Ingest layer (JSONL): raw extractions from APIs, scrapers, or curated sources land as JSONL. Schema is loose. Files are small enough to inspect by hand and small enough that re-running the extractor is cheap.
    2. Storage layer (Parquet): after deduplication, language filtering, and schema normalization, the canonical dataset is written as Parquet shards (one shard per source, partitioned by date or split). This is the version that gets versioned, hashed, and referenced in model cards.
    3. Training layer (Arrow): the trainer loads Parquet, materializes it to an Arrow cache file on disk (via datasets.load_dataset or PyArrow), and memory-maps it during the actual training loop. Random access is constant time; shuffling is cheap.
    4. Export layer (JSONL): when a tier of the dataset is released to a downstream consumer (sample for evaluation, premium for a paying customer), the chosen subset is exported back to JSONL for portability. The Parquet master remains the source of truth.

    This is the pattern we use ourselves for the French regulatory text corpus we publish — JSONL at the source edges, Parquet for the canonical store, Arrow under the trainer.
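
    A sketch of the Parquet-to-Arrow handoff in the training layer, using Hugging Face datasets; the glob pattern is a placeholder for your own shard layout:

    from datasets import load_dataset

    # load_dataset materializes the Parquet shards into an Arrow cache on disk
    # and memory-maps it, so random access during training is effectively free.
    ds = load_dataset(
        "parquet",
        data_files={"train": "store/*/part-*.parquet"},
        split="train",
    )
    print(ds.cache_files)  # the Arrow files backing the memory-mapped dataset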

    Common mistakes

    Most expensive mistake

    Pre-tokenizing into the storage format. Tokenizer version changes (new vocab, special tokens, model update) invalidate the entire cache. Store raw text; tokenize at trainer-time. Compute cost is negligible compared to the rebuild.

    • One giant Parquet file. Splittability is a Parquet feature only when there are multiple row groups, ideally across multiple files. Aim for 100–500 MB per file, ~50–200 MB per row group.
    • Mixing types in a JSON field. A field that is “string in 95 % of records, list of strings in 5 %” will be silently coerced on Parquet conversion. Normalize at ingest, not at the converter.
    • Pre-tokenizing into the storage format. Tokenizer changes (new vocabulary, special tokens, model version) invalidate the cache. Store raw text; tokenize on the fly inside the trainer. The compute cost is negligible compared to the cache rebuild cost.
    • Storing model outputs in the same file as inputs. Adds noise to your dataset hash and complicates retraction. Use a sidecar Parquet with a foreign-key column to the input record.
    • Ignoring file naming. Files like data.parquet, final_v2_NEW.parquet are time bombs. Use a deterministic naming scheme: {source}/{date}/part-{shard:04d}.parquet.

    Quick reference table

    Criterion          JSONL                  Parquet                    Arrow IPC / Feather
    Storage layout     Row-oriented text      Columnar binary            Columnar binary
    Compression        None (gzip optional)   Built-in, per column       Built-in (lighter)
    Schema             Implicit               Declared, typed            Declared, typed
    Best at scale      Below 1 GB             Above 500 MB               In-memory / cache
    Append support     Yes                    No (write-once)            No
    Human inspection   Trivial                Needs tool                 Needs tool
    Tool dependency    None                   pyarrow, polars, duckdb    pyarrow
    Typical role       Edge / export          Canonical storage          Trainer cache

    Bottom line

    Use JSONL at the edges of your pipeline where humans look at the data. Use Parquet in the middle where machines store and version it. Use Arrow at the end where training loops consume it. Forcing one format across all stages produces either an unworkable pile of opaque binary or a pipeline that wastes most of its time parsing text.

    For more on how this storage choice interacts with downstream SFT and pre-training workflows, see our companion guides on SFT dataset formats and training data size for LLMs.

    See also: LLM corpus deduplication techniques and what makes a corpus retrieval-friendly.

    Frequently asked questions

    When does JSONL stop working?

    Around 500 MB or 1 M examples. Above that, parse time dominates, no column projection, no compression, no schema enforcement. Convert to Parquet for storage and keep JSONL only as an export.

    Parquet or Feather/Arrow IPC?

    Parquet for storage. Feather/Arrow IPC only for short-lived caches or inter-language data exchange. Feather has weaker compression and weaker tooling for long-term storage.

    Should I gzip my JSONL?

    Yes if you’re stuck with JSONL at scale and need to ship over the network. No if you’re training: the decompression on every epoch wastes throughput. Convert to Parquet instead.

    How big should each Parquet file be?

    100–500 MB per file, with row groups of 50–200 MB inside. Multiple files enable splittability; one giant file means a single reader streams it linearly.

    Does Hugging Face datasets use Parquet or Arrow?

    Both. load_dataset() reads Parquet (or other formats), materializes to Arrow on disk in the local cache, and memory-maps the Arrow file during training. You write Parquet; HF handles the Arrow side.

    Looking for Parquet-shipped French regulatory text?

    Our corpus uses the canonical pipeline described above — JSONL at edges, Parquet for the canonical store, Arrow under the trainer. Per-record provenance baked in.


    Keep reading

    Read next

    SFT datasets — format and best practices

    How JSONL shapes the typical SFT pipeline.

    Read next

    Training data size for LLMs

    Volume thresholds where format choices start to bite.

    Read next

    Building an audit-ready provenance trail

    How storage format interacts with per-record provenance.