NDJSON I/O (newline-delimited JSON)

Primary: DataFrame[Schema].read_ndjson, write_ndjson, and DataFrameModel methods. Secondary: pydantable.io.

Each line of the file is one JSON object; the scanner infers or aligns columns across lines.
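To make the one-object-per-line layout concrete, here is a minimal standard-library sketch. It is not how the Polars-backed scanner works; `ndjson_to_columns` is a hypothetical helper, and padding a missing key with `None` is an assumption about what "aligns columns" means in practice.

```python
from __future__ import annotations

import json


def ndjson_to_columns(text: str) -> dict[str, list]:
    """Parse NDJSON text into column lists (hypothetical helper, stdlib only)."""
    # One JSON object per non-empty line.
    rows = [json.loads(line) for line in text.splitlines() if line.strip()]
    # Union of keys across all lines, in first-seen order.
    names = list(dict.fromkeys(key for row in rows for key in row))
    # Assumption: a key missing from a line becomes None in that row.
    return {name: [row.get(name) for row in rows] for name in names}


sample = (
    '{"status": 200, "path": "/v1/health"}\n'
    '{"status": 404}\n'
)
columns = ndjson_to_columns(sample)
print(columns)  # {'status': [200, 404], 'path': ['/v1/health', None]}
```

The second line has no `path` key, so the sketch fills that cell with `None` to keep both columns the same length.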

Read (sources)

DataFrame[Schema] and DataFrameModel

  • DataFrame[Schema].read_ndjson(path, *, columns=None, **scan_kwargs)
  • MyModel.read_ndjson(...), await MyModel.aread_ndjson(..., executor=None)
  • materialize_ndjson, await amaterialize_ndjson from pydantable.io, then MyModel(cols)

pydantable.io

  • read_ndjson, aread_ndjson
  • materialize_ndjson, amaterialize_ndjson
  • fetch_ndjson_url — HTTP(S) → temp file → read
  • iter_ndjson, iter_json_lines (alias), aiter_ndjson, aiter_json_lines, write_ndjson_batches — JSON-object lines batched into dict[str, list] (IO_OVERVIEW).
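The `dict[str, list]` batch shape mentioned for the iterator helpers can be sketched with the standard library alone. `iter_column_batches` and its `batch_size` parameter are hypothetical names for illustration; the real `iter_ndjson` signature is documented in IO_OVERVIEW and may differ.

```python
from __future__ import annotations

import io
import json
from typing import Iterator, TextIO


def _rows_to_columns(rows: list[dict]) -> dict[str, list]:
    # Column names are the union of keys in this batch, in first-seen order.
    names = list(dict.fromkeys(key for row in rows for key in row))
    return {name: [row.get(name) for row in rows] for name in names}


def iter_column_batches(lines: TextIO, batch_size: int = 2) -> Iterator[dict[str, list]]:
    """Yield column-oriented batches from a stream of JSON-object lines (sketch)."""
    pending: list[dict] = []
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        pending.append(json.loads(line))
        if len(pending) == batch_size:
            yield _rows_to_columns(pending)
            pending = []
    if pending:
        # Final partial batch.
        yield _rows_to_columns(pending)


stream = io.StringIO('{"a": 1}\n{"a": 2}\n{"a": 3}\n')
batches = list(iter_column_batches(stream, batch_size=2))
print(batches)  # [{'a': [1, 2]}, {'a': [3]}]
```

Only one batch of rows is held in memory at a time, which is the point of iterating instead of materializing the whole file.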

scan_kwargs: low_memory, rechunk, ignore_errors, n_rows, infer_schema_length, glob, include_file_paths, row_index_name, row_index_offset. Unknown keys raise ValueError. See DATA_IO_SOURCES.
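The "unknown keys raise ValueError" rule can be pictured as an allow-list check. The sketch below is not pydantable's code: `check_scan_kwargs` and the error message are hypothetical, and only the key names come from the list above.

```python
from __future__ import annotations

# Keys taken from the scan_kwargs list above.
ALLOWED_SCAN_KWARGS = frozenset({
    "low_memory", "rechunk", "ignore_errors", "n_rows", "infer_schema_length",
    "glob", "include_file_paths", "row_index_name", "row_index_offset",
})


def check_scan_kwargs(**scan_kwargs: object) -> dict[str, object]:
    """Reject any keyword that is not in the allow-list (hypothetical helper)."""
    unknown = sorted(set(scan_kwargs) - ALLOWED_SCAN_KWARGS)
    if unknown:
        raise ValueError(f"unknown scan_kwargs: {unknown}")
    return scan_kwargs


accepted = check_scan_kwargs(n_rows=100, ignore_errors=True)

try:
    check_scan_kwargs(nrows=100)  # typo for n_rows
except ValueError as exc:
    rejected = str(exc)
```

Failing fast on a misspelled key such as `nrows` avoids silently scanning the whole file when a row limit was intended.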

Paths, directories, and glob

Pass glob=True (or omit the argument) when reading a directory or a glob pattern, so the call looks the same as the Parquet / CSV lazy reads. Polars 0.53 builds NDJSON lazy scans with UnifiedScanArgs { glob: true, … } internally, and the LazyJsonLineReader API offers no way to disable glob expansion. For that reason pydantable raises ValueError if you pass glob=False.

Hive-style partitions are disabled for NDJSON in Polars 0.53 (no partition columns from paths). A single glob such as *.jsonl only matches that extension; use another pattern or a second read for .ndjson files. Details: Polars 0.53 vs pydantable scan audit.
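The single-extension behavior of a glob is easy to check with `pathlib`. This sketch uses only the standard library; `find_ndjson_files` is a hypothetical helper that collects paths for both extensions, so you can see which files each pattern would cover before issuing one read per pattern.

```python
from __future__ import annotations

import tempfile
from pathlib import Path


def find_ndjson_files(directory: Path, patterns=("*.jsonl", "*.ndjson")) -> list[Path]:
    """Collect files matching any of the given patterns (hypothetical helper)."""
    found: set[Path] = set()
    for pattern in patterns:
        found.update(directory.glob(pattern))
    return sorted(found)


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for name in ("a.jsonl", "b.ndjson", "c.csv"):
        (root / name).write_text("", encoding="utf-8")
    # A single pattern only sees its own extension.
    only_jsonl = sorted(p.name for p in root.glob("*.jsonl"))
    # Two patterns cover both NDJSON extensions and still skip the CSV.
    both = [p.name for p in find_ndjson_files(root)]

print(only_jsonl)  # ['a.jsonl']
print(both)        # ['a.jsonl', 'b.ndjson']
```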

Write (targets)

DataFrame[Schema] and DataFrameModel

  • df.write_ndjson(path, *, write_kwargs=..., streaming=...)
  • model.write_ndjson(...)

write_kwargs: json_format ("lines" / "json"). See DATA_IO_SOURCES.
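The two json_format values can be illustrated with the standard library. This sketch assumes that "lines" means one object per line and "json" means a single JSON array of row objects; confirm the exact output shape in DATA_IO_SOURCES. `dump_lines` and `dump_array` are hypothetical helpers.

```python
from __future__ import annotations

import json

rows = [
    {"status": 200, "path": "/v1/health"},
    {"status": 404, "path": "/v1/missing"},
]


def dump_lines(records: list[dict]) -> str:
    """One JSON object per line (the NDJSON layout)."""
    return "".join(json.dumps(record) + "\n" for record in records)


def dump_array(records: list[dict]) -> str:
    """Assumed shape of the "json" format: one JSON array holding every row."""
    return json.dumps(records)


lines_text = dump_lines(rows)
array_text = dump_array(rows)

# Both encodings carry the same rows; only the framing differs.
assert [json.loads(line) for line in lines_text.splitlines()] == rows
assert json.loads(array_text) == rows
```

The line-delimited form can be appended to and read incrementally, while the array form must be parsed as one document.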

pydantable.io

  • export_ndjson, aexport_ndjson
  • write_ndjson_batches — stream many batches to one NDJSON file.
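Streaming many batches into one NDJSON file amounts to turning each column-oriented batch back into rows and writing one line per row. The sketch below is a standard-library illustration only; `write_batches` is a hypothetical helper and does not reflect the signature of write_ndjson_batches.

```python
from __future__ import annotations

import json
import tempfile
from pathlib import Path
from typing import Iterable


def write_batches(path: Path, batches: Iterable[dict[str, list]]) -> int:
    """Write dict[str, list] batches to a single NDJSON file; return rows written."""
    written = 0
    with path.open("w", encoding="utf-8") as out:
        for batch in batches:
            names = list(batch)
            # zip(*columns) walks the batch row by row.
            for values in zip(*batch.values()):
                out.write(json.dumps(dict(zip(names, values))) + "\n")
                written += 1
    return written


batches = [
    {"status": [200, 404], "path": ["/v1/health", "/v1/missing"]},
    {"status": [500], "path": ["/v1/checkout"]},
]

with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "events.ndjson"
    row_count = write_batches(target, batches)
    lines = target.read_text(encoding="utf-8").splitlines()

print(row_count)  # 3
```

Because the file stays open across batches, each batch can be discarded as soon as it is written.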

Runnable example

Run conventions: IO_OVERVIEW (Runnable example).

python docs/examples/io/ndjson_roundtrip.py

"""NDJSON: append-only API / audit log → lazy scan; round-trip via write_ndjson.

Each line is one JSON object (common for log shipping and CDC-style exports).

Needs pydantable._core. Run::

    python docs/examples/io/ndjson_roundtrip.py
"""

from __future__ import annotations

import tempfile
from pathlib import Path

from pydantable import DataFrameModel


class ApiAccessEvent(DataFrameModel):
    """One request line from an edge log (NDJSON)."""

    status: int
    path: str


def main() -> None:
    with tempfile.TemporaryDirectory() as logs:
        # Write a two-line NDJSON access log to scan.
        access_log = Path(logs) / "access-20250325.ndjson"
        access_log.write_text(
            '{"status": 200, "path": "/v1/health"}\n'
            '{"status": 404, "path": "/v1/missing"}\n',
            encoding="utf-8",
        )

        # Lazy scan, then collect typed rows.
        df = ApiAccessEvent.read_ndjson(str(access_log))
        rows = df.collect()
        assert [r.status for r in rows] == [200, 404]
        assert [r.path for r in rows] == ["/v1/health", "/v1/missing"]

        # Round-trip: write a one-row frame and read it back.
        replay = Path(logs) / "replay.ndjson"
        ApiAccessEvent({"status": [500], "path": ["/v1/checkout"]}).write_ndjson(
            str(replay)
        )
        got = ApiAccessEvent.read_ndjson(str(replay))
        assert got.to_dict() == {"status": [500], "path": ["/v1/checkout"]}

    print("ndjson_roundtrip: ok")


if __name__ == "__main__":
    main()

Output

ndjson_roundtrip: ok

Large-file patterns (lazy scan; optional iter_ndjson batches in IO_JSON): python docs/examples/io/large_ndjson_patterns.py.

See also

IO_OVERVIEW · IO_HTTP · EXECUTION