Choosing an I/O API¶
Use this page to pick the right entry point. Execution semantics (lazy collect vs streaming) are in EXECUTION.
Quick reference¶
| Goal | Use | Returns / effect |
|---|---|---|
| Scan a local file without loading full columns into Python | MyModel.read_* or DataFrame[Schema].read_* (default) |
ScanFileRoot → lazy plan |
| Run transforms on a big file, then write without a giant dict | MyModel.read_* → … → DataFrame.write_parquet (or write_*) |
File on disk |
| Load everything eagerly into a typed frame | pydantable.io.materialize_*, fetch_sqlmodel / fetch_sql_raw, await afetch_sqlmodel / await afetch_sql_raw, then MyModel(cols) |
DataFrameModel / DataFrame |
Raw dict[str, list] only (utilities) |
pydantable.io.materialize_*, fetch_sqlmodel / fetch_sql_raw, fetch_*_url, read_from_object_store |
Column dict |
| Persist columns to a file | MyModel.export_* / await MyModel.aexport_* |
File on disk |
| Query or load from a database | fetch_sqlmodel / fetch_sql_raw (async: afetch_*) / iter_sqlmodel / iter_sql_raw + MyModel(...); MyModel.fetch_sqlmodel / write_sqlmodel (deprecated unprefixed SQL names: IO_SQL) |
Typed frame or none |
| Lazy typed transforms on SQL (plans compiled to SQL; swap-in engine) | SqlDataFrame / SqlDataFrameModel with sql_config= / sql_engine= — install pydantable[sql] (SQL_ENGINE) |
Lazy DataFrame / DataFrameModel |
| Lazy typed transforms on a MongoDB collection (swap-in engine) | Sync DB: MongoDataFrame[Schema].from_beanie(Doc, database=sync_db) or from_collection(coll) (MONGO_ENGINE). Async-only lazy (no sync client): from_beanie_async(Doc, …) or from_beanie_async(prebuilt_query) (BEANIE). |
Lazy DataFrame / DataFrameModel |
Eager MongoDB column dict (read / write dict[str, list], no DataFrame) |
Driver-level: fetch_mongo / iter_mongo / write_mongo (sync Collection) or await afetch_mongo / await aiter_mongo / await awrite_mongo (AsyncCollection uses native async — is_async_mongo_collection, MONGO_ENGINE). ODM-aware: afetch_beanie / awrite_beanie (BEANIE). pydantable[mongo] ships PyMongo + Beanie. |
Column dict batches or insert count |
| HTTP(S) download | fetch_*_url (eager dicts) or MyModel.read_parquet_url / await MyModel.aread_parquet_url (lazy temp file — IO_HTTP) |
Varies |
Object-store URIs (s3://, …) |
read_from_object_store ([cloud]) |
Column dict |
| Tier-2 readers (Excel, Delta, …) | pydantable.io.extras |
Column dict or helpers |
Module reference: every symbol also exists on pydantable.io; see IO_OVERVIEW (Module reference).
Multi-file, directories, and globs¶
| Goal | Use | Notes |
|---|---|---|
| Typed lazy pipeline over a directory, glob, or hive-style dataset | MyModel.read_* / DataFrame[Schema].read_* with **scan_kwargs (e.g. glob for Parquet/CSV) |
Preferred for large data; scanning is delegated to Polars—see Polars 0.53 vs pydantable scan audit. |
Eager dict[str, list] from multiple files |
materialize_* per file in a loop, or defer to lazy read_* |
materialize_* is oriented to single sources; multi-file concat is often clearer as a lazy read_* then to_dict(). |
Bounded-memory Python batches without a DataFrame plan |
iter_* / aiter_* over one path per call |
Expand globs in Python, then one iter_* per file (or iter_chain_batches); for many files, lazy read_* is often simpler—see IO_OVERVIEW (Batched column dict I/O → Multi-file paths, globs, and memory) and docs/examples/io/iter_glob_parquet_batches.py. |
Write a hive-style partitioned Parquet dataset (directory of col=value/... shards) |
DataFrame.write_parquet / DataFrameModel.write_parquet with partition_by |
path is the dataset root; see IO_PARQUET (Partitioned (hive-style) Parquet output). For a single Parquet file from dict[str, list] batches, use export_parquet or write_parquet_batches. |
| Parquet files disagree on column names (multi-file / glob scan) | read_parquet(..., allow_missing_columns=True, ...) ( scan_kwargs ) + optional Expr.cast / optional model fields |
Polars unions schemas; missing columns become null when allowed—see IO_PARQUET (Multi-file Parquet: columns, dtypes, and allow_missing_columns). |
Engine selection (materialize_parquet and friends)¶
Set PYDANTABLE_IO_ENGINE to auto (default), rust, or pyarrow where supported.
Parquet (materialize_parquet)¶
auto/rust: local file path andcolumns is None→ Rust fast path when the extension is built.- PyArrow path:
columnsis set, orsourceisbytes/BinaryIO, or Rust fails andautofalls back.
CSV / NDJSON (materialize_csv, materialize_ndjson)¶
auto/rust: try Rust readers for a local path; on failureautofalls back to stdlibcsvor JSON lines parsing in Python.
IPC (materialize_ipc)¶
auto/rust: local on-disk IPC file withas_stream=False; streams and buffers use PyArrow.
See implementation notes in pydantable.io module docstrings for edge cases.
Typed frame first vs pydantable.io only¶
- Default:
DataFrameModel/DataFrame[Schema]classmethods (read_*,materialize_*,fetch_sqlmodel/fetch_sql_raw, …) for validation and typedExprpipelines. pydantable.ioonly when you need untypeddict[str, list],ScanFileRootwithout wrapping, orextras— scripts, tests, notebooks, internal glue.
See DATAFRAMEMODEL and IO_OVERVIEW.