NDJSON I/O (newline-delimited JSON)¶
Primary: DataFrame[Schema].read_ndjson, write_ndjson, and DataFrameModel methods. Secondary: pydantable.io.
Each line of the file is one JSON object; the scanner infers or aligns columns across lines.
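To make the line-per-object layout concrete, here is a minimal standard-library sketch that parses NDJSON text and aligns keys across lines into columns. It only illustrates the format and is not pydantable's scanner. The helper name `ndjson_to_columns` and the choice to fill missing keys with `None` are assumptions made for this example.

```python
import json


def ndjson_to_columns(text: str) -> dict[str, list]:
    """Parse NDJSON text into columns (hypothetical helper, not pydantable API)."""
    # One JSON object per non-empty line.
    rows = [json.loads(line) for line in text.splitlines() if line.strip()]

    # Collect column names in first-seen order across all lines.
    names: list[str] = []
    for row in rows:
        for key in row:
            if key not in names:
                names.append(key)

    # Assumption: a key missing from a line becomes None in that column.
    return {name: [row.get(name) for row in rows] for name in names}


sample = (
    '{"status": 200, "path": "/v1/health"}\n'
    '{"status": 404}\n'
)
columns = ndjson_to_columns(sample)
print(columns)  # {'status': [200, 404], 'path': ['/v1/health', None]}
```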
Read (sources)¶
DataFrame[Schema] and DataFrameModel¶
- DataFrame[Schema].read_ndjson(path, *, columns=None, **scan_kwargs)
- MyModel.read_ndjson(...), await MyModel.aread_ndjson(..., executor=None)
- materialize_ndjson, await amaterialize_ndjson from pydantable.io, then MyModel(cols)
pydantable.io¶
- read_ndjson, aread_ndjson
- materialize_ndjson, amaterialize_ndjson
- fetch_ndjson_url: HTTP(S) → temp file → read
- iter_ndjson, iter_json_lines (alias), aiter_ndjson, aiter_json_lines, write_ndjson_batches: JSON-object lines batched into dict[str, list] (IO_OVERVIEW).
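The idea of batching JSON-object lines into `dict[str, list]` can be sketched with the standard library alone. This is not the implementation of `iter_ndjson`. The function name `iter_column_batches` and its `batch_size` parameter are hypothetical, and the sketch assumes every line carries the same keys.

```python
import io
import json
from typing import Iterator, TextIO


def iter_column_batches(lines: TextIO, batch_size: int) -> Iterator[dict[str, list]]:
    """Yield column dicts of at most batch_size rows (hypothetical helper)."""
    batch: dict[str, list] = {}
    count = 0
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        # Assumption: all objects share the same keys, so columns stay aligned.
        for key, value in json.loads(line).items():
            batch.setdefault(key, []).append(value)
        count += 1
        if count == batch_size:
            yield batch
            batch, count = {}, 0
    if count:
        yield batch  # final, possibly shorter batch


stream = io.StringIO('{"id": 1}\n{"id": 2}\n{"id": 3}\n')
batches = list(iter_column_batches(stream, batch_size=2))
print(batches)  # [{'id': [1, 2]}, {'id': [3]}]
```

Only one batch is held in memory at a time, which is the main reason to prefer this shape for large files.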
scan_kwargs: low_memory, rechunk, ignore_errors, n_rows, infer_schema_length, glob, include_file_paths, row_index_name, row_index_offset. Unknown keys raise ValueError. See DATA_IO_SOURCES.
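The "unknown keys raise ValueError" rule can be pictured as an allow-list check. The sketch below is an illustration under assumptions, not pydantable's code: the names `ALLOWED_SCAN_KWARGS` and `check_scan_kwargs` and the error message are invented here, and only the key list comes from the text above.

```python
# Keys copied from the scan_kwargs list above.
ALLOWED_SCAN_KWARGS = frozenset({
    "low_memory", "rechunk", "ignore_errors", "n_rows", "infer_schema_length",
    "glob", "include_file_paths", "row_index_name", "row_index_offset",
})


def check_scan_kwargs(**scan_kwargs: object) -> dict[str, object]:
    """Reject any keyword outside the allow-list (hypothetical helper)."""
    unknown = sorted(set(scan_kwargs) - ALLOWED_SCAN_KWARGS)
    if unknown:
        # Assumption: the real error message may be worded differently.
        raise ValueError(f"unknown scan_kwargs: {unknown}")
    return scan_kwargs


accepted = check_scan_kwargs(n_rows=10, ignore_errors=True)

try:
    check_scan_kwargs(n_row=10)  # typo: should be n_rows
except ValueError as exc:
    rejected_message = str(exc)
```

Failing fast on a misspelled key avoids a silent full-file read when the caller meant to cap rows with `n_rows`.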
Paths, directories, and glob¶
Use glob=True (or omit it) when reading a directory or a glob pattern so your call matches Parquet / CSV lazy reads. Polars 0.53 builds NDJSON lazy scans with UnifiedScanArgs { glob: true, … } internally; glob expansion cannot be disabled from the LazyJsonLineReader API. Passing glob=False raises ValueError from pydantable.
Hive-style partitions are disabled for NDJSON in Polars 0.53 (no partition columns from paths). A single glob such as *.jsonl only matches that extension; use another pattern or a second read for .ndjson files. Details: Polars 0.53 vs pydantable scan audit.
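The extension caveat is easy to check outside pydantable. The sketch below uses `pathlib` globbing rather than Polars' own glob expansion, so it only illustrates how a `*.jsonl` pattern skips `.ndjson` files; the file names are made up for the example.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for name in ("a.jsonl", "b.jsonl", "c.ndjson"):
        (root / name).write_text('{"x": 1}\n', encoding="utf-8")

    # A single pattern only matches its own extension.
    jsonl_only = sorted(p.name for p in root.glob("*.jsonl"))

    # One way to cover both: run one pattern per extension and combine.
    both = sorted(
        p.name for pattern in ("*.jsonl", "*.ndjson") for p in root.glob(pattern)
    )

print(jsonl_only)  # ['a.jsonl', 'b.jsonl']
print(both)        # ['a.jsonl', 'b.jsonl', 'c.ndjson']
```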
Write (targets)¶
DataFrame[Schema] and DataFrameModel¶
- df.write_ndjson(path, *, write_kwargs=..., streaming=...)
- model.write_ndjson(...)
write_kwargs: json_format ("lines" / "json"). See DATA_IO_SOURCES.
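For readers unsure how the two `json_format` values differ on disk, the following standard-library sketch contrasts the two layouts. It assumes that "json" means a single JSON array of row objects. That reading is an assumption for illustration and should be checked against DATA_IO_SOURCES.

```python
import json

rows = [
    {"status": 200, "path": "/v1/health"},
    {"status": 404, "path": "/v1/missing"},
]

# "lines": one JSON object per line (NDJSON).
as_lines = "".join(json.dumps(row) + "\n" for row in rows)

# "json" (assumed layout): one JSON array holding every row.
as_json = json.dumps(rows)

# Both layouts carry the same rows; only the framing differs.
lines_back = [json.loads(line) for line in as_lines.splitlines()]
json_back = json.loads(as_json)
print(lines_back == json_back == rows)  # True
```

The line-delimited layout can be appended to and read incrementally, while a single array generally has to be parsed as a whole.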
pydantable.io¶
- export_ndjson, aexport_ndjson
- write_ndjson_batches: stream many batches to one NDJSON file.
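Streaming many column batches into one NDJSON file can be sketched as follows. This is a standard-library illustration, not the body of `write_ndjson_batches`. The name `write_batches_sketch` is hypothetical, and the sketch assumes each batch is a `dict[str, list]` with equal-length columns.

```python
import json
import tempfile
from pathlib import Path
from typing import Iterable


def write_batches_sketch(path: Path, batches: Iterable[dict[str, list]]) -> int:
    """Write column batches to one NDJSON file; return rows written."""
    written = 0
    with path.open("w", encoding="utf-8") as out:
        for batch in batches:
            names = list(batch)
            # zip(*columns) turns equal-length columns back into rows.
            for values in zip(*(batch[name] for name in names)):
                out.write(json.dumps(dict(zip(names, values))) + "\n")
                written += 1
    return written


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "events.ndjson"
    total = write_batches_sketch(target, [{"id": [1, 2]}, {"id": [3]}])
    lines = target.read_text(encoding="utf-8").splitlines()

print(total)  # 3
print(lines)  # ['{"id": 1}', '{"id": 2}', '{"id": 3}']
```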
Runnable example¶
Run conventions: IO_OVERVIEW (Runnable example).
"""NDJSON: append-only API / audit log → lazy scan; round-trip via write_ndjson.

Each line is one JSON object (common for log shipping and CDC-style exports).

Needs pydantable._core. Run::

    python docs/examples/io/ndjson_roundtrip.py
"""

from __future__ import annotations

import tempfile
from pathlib import Path

from pydantable import DataFrameModel


class ApiAccessEvent(DataFrameModel):
    """One request line from an edge log (NDJSON)."""

    status: int
    path: str


def main() -> None:
    with tempfile.TemporaryDirectory() as logs:
        # Write a two-line NDJSON access log to a temporary directory.
        access_log = Path(logs) / "access-20250325.ndjson"
        access_log.write_text(
            '{"status": 200, "path": "/v1/health"}\n'
            '{"status": 404, "path": "/v1/missing"}\n',
            encoding="utf-8",
        )

        # Read it back and check the rows.
        df = ApiAccessEvent.read_ndjson(str(access_log))
        rows = df.collect()
        assert [r.status for r in rows] == [200, 404]
        assert [r.path for r in rows] == ["/v1/health", "/v1/missing"]

        # Round-trip: write a new frame, then read it back.
        replay = Path(logs) / "replay.ndjson"
        ApiAccessEvent({"status": [500], "path": ["/v1/checkout"]}).write_ndjson(
            str(replay)
        )
        got = ApiAccessEvent.read_ndjson(str(replay))
        assert got.to_dict() == {"status": [500], "path": ["/v1/checkout"]}

    print("ndjson_roundtrip: ok")


if __name__ == "__main__":
    main()
Output¶
ndjson_roundtrip: ok
Large-file patterns (lazy scan; optional iter_ndjson batches in IO_JSON): python docs/examples/io/large_ndjson_patterns.py.