JSON (array and JSON Lines)¶
How JSON values map to typed columns (including nested objects and maps) is summarized in SUPPORTED_TYPES (JSON (RFC 8259) vs column types).
Naming: read_json vs read_ndjson vs materialize_json¶
| API | Role | Layout |
|---|---|---|
DataFrameModel.read_json / DataFrameModel.read_ndjson |
Typed lazy DataFrame / DataFrameModel (same Polars path). |
JSON Lines only — read_json is an alias of read_ndjson. |
materialize_json |
Eager dict[str, list] read (from pydantable import materialize_json). |
Detects array [{...}, ...] or JSON Lines in one file. |
Internals: untyped lazy scans return a raw pydantable_native._core.ScanFileRoot; that layer lives in the pydantable.io implementation package—application code should use DataFrameModel / DataFrame classmethods or import helpers from pydantable.
There is no lazy scan for a single JSON array file in pydantable today — use materialize_json (or pydantable.io.iter_json_array for batched array processing after a full parse; iter_json_array is not on the root package—prefer lazy NDJSON or materialize_json).
Multi-file / directory: lazy read_json / read_ndjson follow the same JSON Lines path semantics (directories, *.jsonl globs, scan_kwargs including glob=True for API parity with read_csv / read_parquet). Polars uses LazyJsonLineReader; details and glob=False behavior: IO_NDJSON. JSON array files and array datasets in directories are not lazily scanned—use per-file materialize_json / iter_json_array, or convert to NDJSON (see Polars 0.53 vs pydantable scan audit).
Large files, memory, and entrypoint choice¶
Prefer lazy JSON Lines for big logs: DataFrameModel.read_ndjson / read_json (or DataFrame classmethods) keep work on a Polars LazyFrame until you DataFrame.collect, DataFrame.to_dict, or DataFrame.write_parquet / write_ndjson / etc. The full file is not loaded as a Python column dict first.
Terminal materialization: DataFrame.collect runs the plan and returns rows (or use collect(as_lists=True) / DataFrame.to_dict for columnar dicts). For very large results, consider DataFrame.head / DataFrame.slice before collect, or write to disk with a lazy write_* sink instead of pulling everything into Python.
Polars streaming engine: on terminal APIs you can pass streaming=True (or set PYDANTABLE_ENGINE_STREAMING) so the Rust engine requests Polars streaming collect where supported — best-effort; some plans fall back to in-memory behavior. See EXECUTION (Streaming / engine collect).
Chunked Python-side reads: iter_ndjson / aiter_ndjson (from pydantable import iter_ndjson, aiter_ndjson) yield dict[str, list] batches of a fixed row count — useful when you want bounded Python memory without building a lazy DataFrame plan (e.g. simple ETL scripts). See IO_OVERVIEW (Batched column dict I/O).
JSON array files ([{...}, {...}]): use materialize_json / amaterialize_json for a full eager column dict, or pydantable.io.iter_json_array / pydantable.io.aiter_json_array — those paths load the entire JSON first, then chunk in Python (advanced; iter_json_array is not re-exported on pydantable).
Runnable example (lazy filter + collect):
"""Large NDJSON: typed lazy read_ndjson → filter → collect().
Transforms stay on the Polars :class:~polars.LazyFrame until collect/to_dict.
Needs pydantable._core. Run::
python docs/examples/io/large_ndjson_patterns.py
"""
from future import annotations
import tempfile from pathlib import Path
from pydantable import DataFrameModel
class LogLine(DataFrameModel): """One JSON object per line."""
level: str
msg: str
def main() -> None: with tempfile.TemporaryDirectory() as tmp: path = Path(tmp) / "app.ndjson" path.write_text( '{"level":"info","msg":"start"}\n' '{"level":"error","msg":"bad"}\n' '{"level":"info","msg":"done"}\n', encoding="utf-8", )
# Lazy: transforms stay on Polars LazyFrame until collect.
df = LogLine.read_ndjson(str(path))
errs = df.filter(df.level == "error")
rows = errs.collect()
assert len(rows) == 1
assert rows[0].msg == "bad"
print("large_ndjson_patterns: ok")
if name == "main": main()
Output¶
NDJSON scan kwargs (presets)¶
Typed and untyped lazy reads accept **scan_kwargs forwarded to Polars (see DATA_IO_SOURCES). read_json uses the same allowlist as read_ndjson. For NDJSON / JSON Lines, allowed keys include:
infer_schema_length— how many lines Polars uses to infer dtypes (orNonefor engine default). Increase if early lines are not representative of the full file (e.g. sparse optional fields appear later).n_rows— cap rows read from the file (sampling / debugging).ignore_errors— skip malformed lines where the engine allows it; can drop data silently — use only for dirty logs when you accept lossy ingest.low_memory,rechunk— Polars memory / chunking hints.
Omitted keys use Polars defaults for LazyJsonLineReader (pydantable pins Polars 0.53; see upstream Polars docs for default numeric values).
Eager column dict¶
Default: materialize_json / amaterialize_json return dict[str, list] (detects array vs JSON Lines); wrap with MyModel(cols, ...) for a typed frame.
Utilities: DataFrameModel.export_json / DataFrameModel.aexport_json write one JSON array of row objects (or call the same shapes via pydantable.io helpers if you are extending the library).
Eager export_json serialization¶
DataFrameModel.export_json uses the standard library json.dump with default=str (same behavior as the pydantable.io implementation). Nested Python dict and list cells serialize as JSON objects and arrays. Values that are not JSON-native (for example datetime, Decimal, UUID) are written as str(value), which is not necessarily ISO-8601 or another stable wire format. For API-stable JSON, build rows with DataFrameModel.to_dicts / Pydantic model_dump(mode="json") after materialization, or normalize scalars before calling export_json.
Lazy NDJSON and eager materialize_json can infer nested shapes differently; map-like JSON objects may round-trip more reliably through eager materialize_json + a typed constructor than through a lazy scan—see tests in tests/test_json_io_phase_a.py.
To emit JSON text from nested model columns inside a frame, use Expr.struct_json_encode on a struct-typed Expr (Polars-backed); see SUPPORTED_TYPES.
To parse JSON strings in-column back into structs or maps (without a separate file read), use Expr.str_json_decode after you have a str column of JSON text (e.g. from struct_json_encode, logging pipelines, or materialize_json on a string field).
See also: IO_NDJSON, IO_DECISION_TREE.