The choice between Parquet, JSONL, and Arrow looks like a storage detail. It becomes a deployment-blocking constraint the first time your team needs to load 500 GB through a pipeline prototyped on a 200 MB JSONL file.
Key takeaways
- JSONL up to ~500 MB or ~1 M examples — beyond that, switch to Parquet.
- Parquet for canonical storage: schema-typed, columnar, 3–5× compression, splittable.
- Arrow for the trainer cache: memory-mapped, zero-copy, near-instant random access.
- Never pre-tokenize into the storage format — tokenizer changes invalidate the cache.
- 100–500 MB per Parquet file is the sweet spot; one giant file kills splittability.
In this article
This guide walks through each format’s actual behaviour at the volumes that matter, the operations they optimize for and against, and a decision rule that pins down which format to use at which stage of an ML pipeline.
The three formats in one sentence each
- JSONL — one JSON object per line, text-encoded, schema-free, human-readable, slow to parse, uncompressed by default, perfect for under-1 GB datasets and for the input/output boundary of a pipeline.
- Parquet — columnar binary, schema-on-write, heavy compression, fast columnar scans, slow to write, hard to inspect by hand, the canonical storage format for any dataset above a few hundred megabytes.
- Arrow — columnar in-memory representation that maps directly to Parquet on disk. Zero-copy reads, memory-mapped access, the bridge between disk storage and training-loop dataloaders.
None of the three is universally better. They serve different stages of the data lifecycle.
JSONL — when to choose it
JSONL is the lingua franca of small-to-medium dataset exchange. Its advantages are operational, not technical:
- Line-oriented. Streamable end-to-end. `head`, `wc -l`, `grep`, and `jq` work without any specialized library (see the sketch after this list).
- Schema-free. Adding or removing a field per record costs nothing at write time. Useful during dataset construction when the schema is still in flux.
- Git-friendly. Diffs are readable. Small datasets can live in a repo alongside the code that produced them.
- Universally supported. Every training framework — TRL, Axolotl, Unsloth, raw PyTorch — accepts JSONL natively.
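Because each record is one line of plain text, a JSONL file can be streamed and filtered with nothing beyond the standard library. A minimal sketch; the filename and the `lang` and `text` fields are placeholders for whatever your records actually contain:

```python
import json

# Stream the file one record at a time: memory use stays flat no matter
# how large the file is, because it is never fully loaded.
with open("examples.jsonl", encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)
        if record.get("lang") == "fr":          # keep only French records
            print(record["text"][:80])
```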
The trade-offs become punishing at scale:
- No compression by default. A 100 MB JSONL file compresses to 15–25 MB with gzip, but loading then decompressing on every epoch wastes throughput.
- Slow to parse. Python’s `json.loads` is one of the slower hot paths in a typical training pipeline. At 50 K examples/s on a single core, an 80 M-example dataset takes roughly 27 minutes per epoch just for parsing.
- No column projection. Reading one field requires reading the whole record. Reading one record requires scanning until that record’s line break.
- No types. A column that is sometimes a string, sometimes a number, sometimes null is silently accepted on write and explodes on read.
Decision rule: use JSONL up to ~500 MB or ~1 M examples. Beyond that, convert to Parquet for storage and use JSONL only as an export format for sharing samples or feeding small jobs.
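The conversion itself is a few lines with PyArrow. A minimal sketch, assuming a flat schema that PyArrow can infer; filenames are placeholders:

```python
import pyarrow.json as paj
import pyarrow.parquet as pq

# pyarrow.json.read_json expects newline-delimited JSON (i.e. JSONL)
# and infers an Arrow schema from the records.
table = paj.read_json("examples.jsonl")

# Write a typed, zstd-compressed Parquet file from the inferred table.
pq.write_table(table, "examples.parquet", compression="zstd")
print(table.schema)   # check what was inferred before trusting it
```

If the inferred schema is wrong (a column that should be a string came out as an integer, say), declare an explicit `pyarrow.schema` and pass it through `pyarrow.json.ParseOptions(explicit_schema=...)`.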
Parquet — when to choose it
Parquet is the storage format you want once your dataset stops fitting comfortably in RAM. Its design properties translate directly to ML pipeline benefits:
- Columnar layout. Reading the `text` column without touching the `metadata` column is a real I/O saving — often 5–10× faster than row-oriented formats (see the read sketch after this list).
- Type-safe schema. Each column has a declared type (string, int32, float64, list<string>, struct<…>). The schema is stored in the footer; downstream readers cannot silently coerce a string to a number.
- Column-level compression. Each column is compressed independently with the codec best fit to its type. Text columns hit zstd or snappy; numerical columns hit dictionary or delta encoding. Typical compression ratios for text-heavy ML datasets: 3–5× over JSONL.
- Predicate pushdown. Modern readers (PyArrow, DuckDB, Polars) can filter on column values before decompressing — a one-line filter on language code reads only the matching row groups.
- Splittable. Files are organized in row groups (default ~128 MB). Distributed training and Spark-style processing can split files transparently.
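Column projection and predicate pushdown are each one argument away in PyArrow. A minimal sketch; `examples.parquet` and the `text` / `lang` columns are assumptions carried over from the conversion example above:

```python
import pyarrow.parquet as pq

# Column projection: only the "text" column is read from disk.
# Predicate pushdown: row groups whose "lang" statistics cannot match "fr"
# are skipped before any decompression happens.
table = pq.read_table(
    "examples.parquet",
    columns=["text"],
    filters=[("lang", "==", "fr")],
)
print(table.num_rows)
```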
The trade-offs:
- Not designed for append. Parquet files are written once. Updating a single record means rewriting the file (or the partition).
- Schema-on-write. Changing a column type requires rewriting all files. Add columns liberally up-front; they cost nothing if unused.
- Binary. Cannot be inspected with `cat` or `head`. Use `parquet-tools`, `pyarrow`, `duckdb`, or any of the visual Parquet viewers (see the inspection sketch after this list).
- Tooling weight. Reading a 50 GB Parquet dataset in pure Python without PyArrow or Polars is impractical. Add the dependency.
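For the inspection problem, DuckDB is the lightest habit to build: it queries Parquet files in place, with no load step. A sketch with assumed file and column names:

```python
import duckdb

# Peek at the first rows without materializing anything in Python.
duckdb.sql("SELECT * FROM 'examples.parquet' LIMIT 5").show()

# Aggregate directly over the file: row count and average text length.
duckdb.sql("""
    SELECT count(*) AS rows, avg(length(text)) AS avg_chars
    FROM 'examples.parquet'
""").show()
```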
Decision rule: use Parquet for any dataset above 500 MB, for any dataset that will be read more than once, and for any dataset whose schema is stable enough to declare.
Arrow — when to choose it
Arrow is not really a competing storage format — it is the in-memory representation that Parquet, Feather, and most modern columnar engines deserialize into. The distinction matters because Arrow has its own on-disk variants (Feather / Arrow IPC) used in specific places:
- Hugging Face `datasets` cache. When you call `load_dataset()`, the library materializes the dataset as an Arrow file on disk and memory-maps it during training. This gives near-zero-cost random access to billions of examples.
- Inter-process and inter-language data exchange. Arrow IPC is the format Pandas, Polars, R, Julia, and Rust use to pass dataframes around without serialization overhead.
- Streaming RPC. Arrow Flight is the de facto protocol for moving columnar data between services in 2026, replacing gRPC-with-Protobuf for analytics workloads.
For dataset storage, Feather is rarely the right choice over Parquet: it has weaker compression and weaker tooling support outside the Arrow ecosystem. Use Feather only when the file lifetime is short (intermediate cache) or when zero-overhead read latency matters more than disk size.
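For completeness, this is what the short-lived cache case looks like. A minimal sketch, assuming pyarrow; note that memory-mapping only pays off when the file is written uncompressed:

```python
import pyarrow as pa
import pyarrow.feather as feather

table = pa.table({"text": ["bonjour", "salut"], "lang": ["fr", "fr"]})

# Write an uncompressed Arrow/Feather file so it can be memory-mapped.
feather.write_feather(table, "cache.arrow", compression="uncompressed")

# Re-open it memory-mapped: the OS pages data in lazily, so "loading"
# is near-instant regardless of file size.
cached = feather.read_table("cache.arrow", memory_map=True)
print(cached.num_rows)
```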
A canonical pipeline that uses all three
A production ML dataset pipeline in 2026 typically uses each format at the stage where it shines:
- Ingest layer (JSONL): raw extractions from APIs, scrapers, or curated sources land as JSONL. Schema is loose. Files are small enough to inspect by hand and small enough that re-running the extractor is cheap.
- Storage layer (Parquet): after deduplication, language filtering, and schema normalization, the canonical dataset is written as Parquet shards (one shard per source, partitioned by date or split). This is the version that gets versioned, hashed, and referenced in model cards.
- Training layer (Arrow): the trainer loads Parquet, materializes to Arrow in the OS page cache (via `datasets.load_dataset` or PyArrow), and memory-maps it during the actual training loop. Random access is constant time; shuffling is cheap.
- Export layer (JSONL): when a tier of the dataset is released to a downstream consumer (sample for evaluation, premium for a paying customer), the chosen subset is exported back to JSONL for portability. The Parquet master remains the source of truth. (See the end-to-end sketch after this list.)
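Wired together with Hugging Face `datasets`, the storage, training, and export layers look roughly like this. A sketch with placeholder paths; it assumes the Parquet shards from the storage layer already exist:

```python
from datasets import load_dataset

# Storage -> training: point load_dataset at the Parquet shards. The library
# converts them into an Arrow cache on disk and memory-maps that cache.
train = load_dataset(
    "parquet",
    data_files={"train": "shards/part-*.parquet"},
    split="train",
)

# Shuffling is an index permutation over the memory-mapped Arrow file: cheap.
train = train.shuffle(seed=42)

# Training -> export: dump a small JSONL sample for humans or downstream jobs.
train.select(range(1_000)).to_json("sample.jsonl")
```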
This is the pattern we use ourselves for the French regulatory text corpus we publish — JSONL at the source edges, Parquet for the canonical store, Arrow under the trainer.
Common mistakes
Most expensive mistake
Pre-tokenizing into the storage format. Tokenizer version changes (new vocab, special tokens, model update) invalidate the entire cache. Store raw text; tokenize at trainer-time. Compute cost is negligible compared to the rebuild.
- One giant Parquet file. Splittability is a Parquet feature only when there are multiple row groups, ideally across multiple files. Aim for 100–500 MB per file, ~50–200 MB per row group.
- Mixing types in a JSON field. A field that is “string in 95 % of records, list of strings in 5 %” will either fail the Parquet conversion or be silently coerced, depending on the converter. Normalize at ingest, not at the converter.
- Storing model outputs in the same file as inputs. Adds noise to your dataset hash and complicates retraction. Use a sidecar Parquet with a foreign-key column to the input record.
- Ignoring file naming. Files like `data.parquet` and `final_v2_NEW.parquet` are time bombs. Use a deterministic naming scheme: `{source}/{date}/part-{shard:04d}.parquet`.
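The fix for the pre-tokenization mistake called out above is to keep tokenization inside the trainer, applied lazily at access time. A minimal sketch with Hugging Face `datasets` and `transformers`; the model name, paths, and column name are placeholders:

```python
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")   # any tokenizer will do

ds = load_dataset("parquet", data_files="shards/part-*.parquet", split="train")

def tokenize(batch):
    # Runs on each batch as it is accessed; nothing is written back to disk,
    # so swapping the tokenizer later never invalidates the stored dataset.
    return tokenizer(batch["text"], truncation=True, max_length=1024)

ds.set_transform(tokenize)
print(ds[0].keys())   # input_ids, attention_mask, computed on the fly
```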
Quick reference table
| Criterion | JSONL | Parquet | Arrow IPC / Feather |
|---|---|---|---|
| Storage layout | Row-oriented text | Columnar binary | Columnar binary |
| Compression | None (gzip optional) | Built-in, per column | Built-in (lighter) |
| Schema | Implicit | Declared, typed | Declared, typed |
| Best at scale | Below 1 GB | Above 500 MB | In-memory / cache |
| Append support | Yes | No (write-once) | No |
| Human inspection | Trivial | Needs tool | Needs tool |
| Tool dependency | None | pyarrow, polars, duckdb | pyarrow |
| Typical role | Edge / export | Canonical storage | Trainer cache |
Bottom line
Use JSONL at the edges of your pipeline where humans look at the data. Use Parquet in the middle where machines store and version it. Use Arrow at the end where training loops consume it. Forcing one format across all stages produces either an unworkable pile of opaque binary or a pipeline that wastes most of its time parsing text.
For more on how this storage choice interacts with downstream SFT and pre-training workflows, see our companion guides on SFT dataset formats and training data size for LLMs.
See also: LLM corpus deduplication techniques and what makes a corpus retrieval-friendly.
Frequently asked questions
When does JSONL stop working?
Around 500 MB or 1 M examples. Above that, parse time dominates and you get no column projection, no compression, and no schema enforcement. Convert to Parquet for storage and keep JSONL only as an export format.
Parquet or Feather/Arrow IPC?
Parquet for storage. Feather/Arrow IPC only for short-lived caches or inter-language data exchange. Feather has weaker compression and weaker tooling for long-term storage.
Should I gzip my JSONL?
Yes if you’re stuck with JSONL at scale and need to ship over the network. No if you’re training: the decompression on every epoch wastes throughput. Convert to Parquet instead.
How big should each Parquet file be?
100–500 MB per file, with row groups of 50–200 MB inside. Multiple files enable splittability; one giant file means a single reader streams it linearly.
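A minimal sketch of how to hit those targets with `pyarrow.dataset`; the stand-in table and the row-count limits are assumptions you would tune against your average record size:

```python
import pyarrow as pa
import pyarrow.dataset as ds

# Stand-in table; in practice this comes from your cleaning pipeline.
table = pa.table({"text": ["bonjour"] * 1_000, "lang": ["fr"] * 1_000})

ds.write_dataset(
    table,
    base_dir="shards/",
    format="parquet",
    basename_template="part-{i}.parquet",     # deterministic file names
    max_rows_per_file=500_000,                # tune toward 100-500 MB per file
    max_rows_per_group=100_000,               # tune toward 50-200 MB per group
    existing_data_behavior="overwrite_or_ignore",
)
```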
Does Hugging Face datasets use Parquet or Arrow?
Both. load_dataset() reads Parquet (or other formats), materializes to Arrow on disk in the local cache, and memory-maps the Arrow file during training. You write Parquet; HF handles the Arrow side.
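You can see this handoff directly: a loaded dataset exposes the Arrow files that back it. A small sketch with placeholder paths:

```python
from datasets import load_dataset

ds = load_dataset("parquet", data_files="shards/part-*.parquet", split="train")

# The Parquet you wrote is not what the trainer reads; these are the
# Arrow cache files that datasets memory-maps during training.
print(ds.cache_files)
```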