The choice between Parquet, JSONL, and Arrow looks like a storage detail. It becomes a deployment-blocking constraint the first time your team needs to load 500 GB through a pipeline prototyped on a 200 MB JSONL file.
Key takeaways
- JSONL up to ~500 MB or ~1 M examples — beyond that, switch to Parquet.
- Parquet for canonical storage: schema-typed, columnar, 3–5× compression, splittable.
- Arrow for the trainer cache: memory-mapped, zero-copy, near-instant random access.
- Never pre-tokenize into the storage format — tokenizer changes invalidate the cache.
- 100–500 MB per Parquet file is the sweet spot; one giant file kills splittability.
In this article
This guide walks through each format’s actual behaviour at the volumes that matter, the operations they optimize for and against, and a decision rule that pins down which format to use at which stage of an ML pipeline.
The three formats in one sentence each
- JSONL — one JSON object per line, text-encoded, schema-free, human-readable, slow to parse, uncompressed by default, perfect for under-1 GB datasets and for the input/output boundary of a pipeline.
- Parquet — columnar binary, schema-on-write, heavy compression, fast columnar scans, slow to write, hard to inspect by hand, the canonical storage format for any dataset above a few hundred megabytes.
- Arrow — columnar in-memory representation that maps directly to Parquet on disk. Zero-copy reads, memory-mapped access, the bridge between disk storage and training-loop dataloaders.
None of the three is universally better. They serve different stages of the data lifecycle.
JSONL — when to choose it
JSONL is the lingua franca of small-to-medium dataset exchange. Its advantages are operational, not technical:
- Line-oriented. Streamable end-to-end. `head`, `wc -l`, `grep`, and `jq` work without any specialized library (see the sketch after this list).
- Schema-free. Adding or removing a field per record costs nothing at write time. Useful during dataset construction when the schema is still in flux.
- Git-friendly. Diffs are readable. Small datasets can live in a repo alongside the code that produced them.
- Universally supported. Every training framework — TRL, Axolotl, Unsloth, raw PyTorch — accepts JSONL natively.
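Because each record is one line of plain text, a JSONL file can be streamed and filtered with nothing beyond the standard library. A minimal sketch; the filename and the `lang` and `text` fields are placeholders for whatever your records actually contain:

```python
import json

# Stream the file one record at a time: memory use stays flat no matter
# how large the file is, because it is never fully loaded.
with open("examples.jsonl", encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)
        if record.get("lang") == "fr":          # keep only French records
            print(record["text"][:80])
```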
The trade-offs become punishing at scale:
- No compression by default. A 100 MB JSONL file compresses to 15–25 MB with gzip, but loading then decompressing on every epoch wastes throughput.
- Slow to parse. Python’s `json.loads` is one of the slower hot paths in a typical training pipeline. At 50 K examples/s on a single core, an 80 M-example dataset takes roughly 27 minutes per epoch just for parsing.
- No column projection. Reading one field requires reading the whole record. Reading one record requires scanning until that record’s line break.
- No types. A column that is sometimes a string, sometimes a number, sometimes null is silently accepted on write and explodes on read.
Decision rule: use JSONL up to ~500 MB or ~1 M examples. Beyond that, convert to Parquet for storage and use JSONL only as an export format for sharing samples or feeding small jobs.
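The conversion itself is a few lines with PyArrow. A minimal sketch, assuming a flat schema that PyArrow can infer; filenames are placeholders:

```python
import pyarrow.json as paj
import pyarrow.parquet as pq

# pyarrow.json.read_json expects newline-delimited JSON (i.e. JSONL)
# and infers an Arrow schema from the records.
table = paj.read_json("examples.jsonl")

# Write a typed, zstd-compressed Parquet file from the inferred table.
pq.write_table(table, "examples.parquet", compression="zstd")
print(table.schema)   # check what was inferred before trusting it
```

If the inferred schema is wrong (a column that should be a string came out as an integer, say), declare an explicit `pyarrow.schema` and pass it through `pyarrow.json.ParseOptions(explicit_schema=...)`.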
Parquet — when to choose it
Parquet is the storage format you want once your dataset stops fitting comfortably in RAM. Its design properties translate directly to ML pipeline benefits:
- Columnar layout. Reading the `text` column without touching the `metadata` column is a real I/O saving — often 5–10× faster than row-oriented formats (see the read sketch after this list).
- Type-safe schema. Each column has a declared type (string, int32, float64, list<string>, struct<…>). The schema is stored in the footer; downstream readers cannot silently coerce a string to a number.
- Column-level compression. Each column is compressed independently with the codec best fit to its type. Text columns hit zstd or snappy; numerical columns hit dictionary or delta encoding. Typical compression ratios for text-heavy ML datasets: 3–5× over JSONL.
- Predicate pushdown. Modern readers (PyArrow, DuckDB, Polars) can filter on column values before decompressing — a one-line filter on language code reads only the matching row groups.
- Splittable. Files are organized in row groups (default ~128 MB). Distributed training and Spark-style processing can split files transparently.
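Column projection and predicate pushdown are each one argument away in PyArrow. A minimal sketch; `examples.parquet` and the `text` / `lang` columns are assumptions carried over from the conversion example above:

```python
import pyarrow.parquet as pq

# Column projection: only the "text" column is read from disk.
# Predicate pushdown: row groups whose "lang" statistics cannot match "fr"
# are skipped before any decompression happens.
table = pq.read_table(
    "examples.parquet",
    columns=["text"],
    filters=[("lang", "==", "fr")],
)
print(table.num_rows)
```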
The trade-offs:
- Not designed for append. Parquet files are written once. Updating a single record means rewriting the file (or the partition).
- Schema-on-write. Changing a column type requires rewriting all files. Add columns liberally up-front; they cost nothing if unused.
- Binary. Cannot be inspected with `cat` or `head`. Use `parquet-tools`, `pyarrow`, `duckdb`, or any of the visual Parquet viewers (see the inspection sketch after this list).
- Tooling weight. Reading a 50 GB Parquet dataset in pure Python without PyArrow or Polars is impractical. Add the dependency.
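For the inspection problem, DuckDB is the lightest habit to build: it queries Parquet files in place, with no load step. A sketch with assumed file and column names:

```python
import duckdb

# Peek at the first rows without materializing anything in Python.
duckdb.sql("SELECT * FROM 'examples.parquet' LIMIT 5").show()

# Aggregate directly over the file: row count and average text length.
duckdb.sql("""
    SELECT count(*) AS rows, avg(length(text)) AS avg_chars
    FROM 'examples.parquet'
""").show()
```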
Decision rule: use Parquet for any dataset above 500 MB, for any dataset that will be read more than once, and for any dataset whose schema is stable enough to declare.
Arrow — when to choose it
Arrow is not really a competing storage format — it is the in-memory representation that Parquet, Feather, and most modern columnar engines deserialize into. The distinction matters because Arrow has its own on-disk variants (Feather / Arrow IPC) used in specific places:
- Hugging Face `datasets` cache. When you call `load_dataset()`, the library materializes the dataset as an Arrow file on disk and memory-maps it during training. This gives near-zero-cost random access to billions of examples.
- Inter-process and inter-language data exchange. Arrow IPC is the format Pandas, Polars, R, Julia, and Rust use to pass dataframes around without serialization overhead.
- Streaming RPC. Arrow Flight is the de facto protocol for moving columnar data between services in 2026, replacing gRPC-with-Protobuf for analytics workloads.
For dataset storage, Feather is rarely the right choice over Parquet: it has weaker compression and weaker tooling support outside the Arrow ecosystem. Use Feather only when the file lifetime is short (intermediate cache) or when zero-overhead read latency matters more than disk size.
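For completeness, this is what the short-lived cache case looks like. A minimal sketch, assuming pyarrow; note that memory-mapping only pays off when the file is written uncompressed:

```python
import pyarrow as pa
import pyarrow.feather as feather

table = pa.table({"text": ["bonjour", "salut"], "lang": ["fr", "fr"]})

# Write an uncompressed Arrow/Feather file so it can be memory-mapped.
feather.write_feather(table, "cache.arrow", compression="uncompressed")

# Re-open it memory-mapped: the OS pages data in lazily, so "loading"
# is near-instant regardless of file size.
cached = feather.read_table("cache.arrow", memory_map=True)
print(cached.num_rows)
```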
A canonical pipeline that uses all three
A production ML dataset pipeline in 2026 typically uses each format at the stage where it shines:
- Ingest layer (JSONL): raw extractions from APIs, scrapers, or curated sources land as JSONL. Schema is loose. Files are small enough to inspect by hand and small enough that re-running the extractor is cheap.
- Storage layer (Parquet): after deduplication, language filtering, and schema normalization, the canonical dataset is written as Parquet shards (one shard per source, partitioned by date or split). This is the version that gets versioned, hashed, and referenced in model cards.
- Training layer (Arrow): the trainer loads Parquet, materializes to Arrow in the OS page cache (via `datasets.load_dataset` or PyArrow), and memory-maps it during the actual training loop. Random access is constant time; shuffling is cheap.
- Export layer (JSONL): when a tier of the dataset is released to a downstream consumer (sample for evaluation, premium for a paying customer), the chosen subset is exported back to JSONL for portability. The Parquet master remains the source of truth. (See the end-to-end sketch after this list.)
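Wired together with Hugging Face `datasets`, the storage, training, and export layers look roughly like this. A sketch with placeholder paths; it assumes the Parquet shards from the storage layer already exist:

```python
from datasets import load_dataset

# Storage -> training: point load_dataset at the Parquet shards. The library
# converts them into an Arrow cache on disk and memory-maps that cache.
train = load_dataset(
    "parquet",
    data_files={"train": "shards/part-*.parquet"},
    split="train",
)

# Shuffling is an index permutation over the memory-mapped Arrow file: cheap.
train = train.shuffle(seed=42)

# Training -> export: dump a small JSONL sample for humans or downstream jobs.
train.select(range(1_000)).to_json("sample.jsonl")
```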
This is the pattern we use ourselves for the French regulatory text corpus we publish — JSONL at the source edges, Parquet for the canonical store, Arrow under the trainer.
Common mistakes
Most expensive mistake
Pre-tokenizing into the storage format. Tokenizer version changes (new vocab, special tokens, model update) invalidate the entire cache. Store raw text; tokenize at trainer-time. Compute cost is negligible compared to the rebuild.
- One giant Parquet file. Splittability is a Parquet feature only when there are multiple row groups, ideally across multiple files. Aim for 100–500 MB per file, ~50–200 MB per row group.
- Mixing types in a JSON field. A field that is “string in 95 % of records, list of strings in 5 %” will either fail the Parquet conversion or be silently coerced, depending on the converter. Normalize at ingest, not at the converter.
- Storing model outputs in the same file as inputs. Adds noise to your dataset hash and complicates retraction. Use a sidecar Parquet with a foreign-key column to the input record.
- Ignoring file naming. Files like `data.parquet` and `final_v2_NEW.parquet` are time bombs. Use a deterministic naming scheme: `{source}/{date}/part-{shard:04d}.parquet`.
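The fix for the pre-tokenization mistake called out above is to keep tokenization inside the trainer, applied lazily at access time. A minimal sketch with Hugging Face `datasets` and `transformers`; the model name, paths, and column name are placeholders:

```python
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")   # any tokenizer will do

ds = load_dataset("parquet", data_files="shards/part-*.parquet", split="train")

def tokenize(batch):
    # Runs on each batch as it is accessed; nothing is written back to disk,
    # so swapping the tokenizer later never invalidates the stored dataset.
    return tokenizer(batch["text"], truncation=True, max_length=1024)

ds.set_transform(tokenize)
print(ds[0].keys())   # input_ids, attention_mask, computed on the fly
```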
Quick reference table
| Criterion | JSONL | Parquet | Arrow IPC / Feather |
|---|---|---|---|
| Storage layout | Row-oriented text | Columnar binary | Columnar binary |
| Compression | None (gzip optional) | Built-in, per column | Built-in (lighter) |
| Schema | Implicit | Declared, typed | Declared, typed |
| Best at scale | Below 1 GB | Above 500 MB | In-memory / cache |
| Append support | Yes | No (write-once) | No |
| Human inspection | Trivial | Needs tool | Needs tool |
| Tool dependency | None | pyarrow, polars, duckdb | pyarrow |
| Typical role | Edge / export | Canonical storage | Trainer cache |
Bottom line
Use JSONL at the edges of your pipeline where humans look at the data. Use Parquet in the middle where machines store and version it. Use Arrow at the end where training loops consume it. Forcing one format across all stages produces either an unworkable pile of opaque binary or a pipeline that wastes most of its time parsing text.
For more on how this storage choice interacts with downstream SFT and pre-training workflows, see our companion guides on SFT dataset formats and training data size for LLMs.
See also: LLM corpus deduplication techniques and what makes a corpus retrieval-friendly.
Frequently asked questions
When does JSONL stop working?
Around 500 MB or 1 M examples. Above that, parse time dominates and you get no column projection, no compression, and no schema enforcement. Convert to Parquet for storage and keep JSONL only as an export format.
Parquet or Feather/Arrow IPC?
Parquet for storage. Feather/Arrow IPC only for short-lived caches or inter-language data exchange. Feather has weaker compression and weaker tooling for long-term storage.
Should I gzip my JSONL?
Yes if you’re stuck with JSONL at scale and need to ship over the network. No if you’re training: the decompression on every epoch wastes throughput. Convert to Parquet instead.
How big should each Parquet file be?
100–500 MB per file, with row groups of 50–200 MB inside. Multiple files enable splittability; one giant file means a single reader streams it linearly.
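A minimal sketch of how to hit those targets with `pyarrow.dataset`; the stand-in table and the row-count limits are assumptions you would tune against your average record size:

```python
import pyarrow as pa
import pyarrow.dataset as ds

# Stand-in table; in practice this comes from your cleaning pipeline.
table = pa.table({"text": ["bonjour"] * 1_000, "lang": ["fr"] * 1_000})

ds.write_dataset(
    table,
    base_dir="shards/",
    format="parquet",
    basename_template="part-{i}.parquet",     # deterministic file names
    max_rows_per_file=500_000,                # tune toward 100-500 MB per file
    max_rows_per_group=100_000,               # tune toward 50-200 MB per group
    existing_data_behavior="overwrite_or_ignore",
)
```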
Does Hugging Face datasets use Parquet or Arrow?
Both. load_dataset() reads Parquet (or other formats), materializes to Arrow on disk in the local cache, and memory-maps the Arrow file during training. You write Parquet; HF handles the Arrow side.
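You can see this handoff directly: a loaded dataset exposes the Arrow files that back it. A small sketch with placeholder paths:

```python
from datasets import load_dataset

ds = load_dataset("parquet", data_files="shards/part-*.parquet", split="train")

# The Parquet you wrote is not what the trainer reads; these are the
# Arrow cache files that datasets memory-maps during training.
print(ds.cache_files)
```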