Arrow IPC / Feather file I/O

Primary: DataFrame[Schema].read_ipc, write_ipc, and DataFrameModel methods. Secondary: pydantable.io.

This page covers the Arrow IPC file format (a single .arrow / Feather-style file). It does not cover arbitrary streaming IPC over a socket unless you materialize the stream through PyArrow yourself.

Batch iterators / writers (1.5.0+): iter_ipc and write_ipc_batches take as_stream=. Use the same value on read and write: on-disk IPC file format is as_stream=False; IPC stream bytes are as_stream=True (the write_ipc_batches default). See IO_OVERVIEW (Batched column dict I/O).
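To make the file-versus-stream distinction concrete, here is a minimal stdlib-only sketch that checks whether a byte string carries the Arrow IPC file magic (ARROW1 at both the start and the end, per the Arrow columnar format). The helper names looks_like_ipc_file and guess_as_stream are hypothetical and not part of pydantable. The check is only a heuristic: anything that is not a file is assumed to be a stream, without validating it.

```python
ARROW_FILE_MAGIC = b"ARROW1"


def looks_like_ipc_file(data: bytes) -> bool:
    """Return True when the bytes carry the Arrow IPC *file* magic at both ends."""
    return (
        len(data) >= 2 * len(ARROW_FILE_MAGIC)
        and data.startswith(ARROW_FILE_MAGIC)
        and data.endswith(ARROW_FILE_MAGIC)
    )


def guess_as_stream(data: bytes) -> bool:
    """Hypothetical helper: suggest an as_stream= value for already-written bytes.

    Anything without the file magic is *assumed* to be stream bytes; the
    stream itself is not validated here.
    """
    return not looks_like_ipc_file(data)


# Fabricated byte strings that only mimic the framing, not real Arrow data.
fake_file = b"ARROW1\x00\x00" + b"payload" + b"ARROW1"
fake_stream = b"\xff\xff\xff\xff" + b"payload"

file_flag = guess_as_stream(fake_file)      # file framing: read with as_stream=False
stream_flag = guess_as_stream(fake_stream)  # no file magic: treat as a stream
```

A check like this can be handy when the producer of the bytes is unknown and you need to pick the as_stream value that matches how they were written.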

Read (sources)

DataFrame[Schema] and DataFrameModel

  • DataFrame[Schema].read_ipc(path, *, columns=None, **scan_kwargs)
  • MyModel.read_ipc(...), await MyModel.aread_ipc(..., executor=None)
  • materialize_ipc, await amaterialize_ipc from pydantable.io, then MyModel(cols) (options: as_stream, engine)
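The executor= parameter on the async variants suggests the usual pattern of running a blocking read off the event loop. The sketch below shows that pattern with the standard library only. Whether pydantable implements aread_ipc this way is an assumption; run_blocking and blocking_read are hypothetical stand-ins, not pydantable names.

```python
from __future__ import annotations

import asyncio
from concurrent.futures import Executor, ThreadPoolExecutor
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


async def run_blocking(fn: Callable[[], T], executor: Optional[Executor] = None) -> T:
    """Run a blocking callable off the event loop.

    executor=None lets asyncio use the loop's default thread pool.
    """
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(executor, fn)


def blocking_read() -> dict[str, list[int]]:
    # Stand-in for a blocking call such as MyModel.read_ipc(path).
    return {"sensor_id": [1, 2], "celsius": [21, 22]}


async def demo() -> dict[str, list[int]]:
    with ThreadPoolExecutor(max_workers=1) as pool:
        # Same call shape either way; only the executor differs.
        default_result = await run_blocking(blocking_read)
        pooled_result = await run_blocking(blocking_read, executor=pool)
    assert default_result == pooled_result
    return pooled_result


result = asyncio.run(demo())
```

Passing a dedicated executor is mostly useful when file reads should not compete with other work on the default thread pool.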

pydantable.io

  • read_ipc, aread_ipc
  • materialize_ipc, amaterialize_ipc

scan_kwargs: forwarded to IpcScanOptions (record_batch_statistics) and UnifiedScanArgs (glob, cache, rechunk, n_rows, hive_partitioning, hive_start_idx, try_parse_hive_dates, include_file_paths, row_index_name, row_index_offset). Unknown keys raise ValueError. See DATA_IO_SOURCES.
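The routing described above can be sketched as follows. This is an illustration of the behaviour (split known keys into two option groups, reject everything else with ValueError), not pydantable's actual code. The function name split_scan_kwargs and the two key-set constants are hypothetical; the key names themselves are taken from the list above.

```python
from __future__ import annotations

from typing import Any

# Key names copied from the documentation; the grouping constants are illustrative.
IPC_SCAN_KEYS = {"record_batch_statistics"}
UNIFIED_SCAN_KEYS = {
    "glob", "cache", "rechunk", "n_rows",
    "hive_partitioning", "hive_start_idx", "try_parse_hive_dates",
    "include_file_paths", "row_index_name", "row_index_offset",
}


def split_scan_kwargs(scan_kwargs: dict[str, Any]) -> tuple[dict[str, Any], dict[str, Any]]:
    """Split kwargs into (ipc_options, unified_args); unknown keys raise ValueError."""
    unknown = sorted(set(scan_kwargs) - IPC_SCAN_KEYS - UNIFIED_SCAN_KEYS)
    if unknown:
        raise ValueError(f"unknown scan kwargs: {unknown}")
    ipc = {k: v for k, v in scan_kwargs.items() if k in IPC_SCAN_KEYS}
    unified = {k: v for k, v in scan_kwargs.items() if k in UNIFIED_SCAN_KEYS}
    return ipc, unified


ipc_opts, unified_args = split_scan_kwargs(
    {"n_rows": 10, "glob": False, "record_batch_statistics": True}
)

try:
    split_scan_kwargs({"n_rowz": 10})  # typo in the key name
    typo_rejected = False
except ValueError:
    typo_rejected = True
```

Failing on unknown keys means a misspelled option surfaces immediately instead of being silently ignored.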

Paths, directories, and multi-file

Lazy read_ipc uses Polars LazyFrame::scan_ipc with pydantable-built IpcScanOptions and UnifiedScanArgs (defaults match Polars Default: glob: true, hive options enabled). Tune glob / hive / lineage kwargs like other read_* roots—see Polars 0.53 vs pydantable scan audit.
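For readers unfamiliar with hive-style layouts, the sketch below shows the general convention the hive options refer to: directory segments of the form key=value become partition columns. It is a simplified stdlib illustration, not what Polars does internally; real scans also handle type inference, hive_start_idx, and date parsing. The helper name parse_hive_segments and the example path are hypothetical.

```python
from __future__ import annotations

from pathlib import PurePosixPath


def parse_hive_segments(path: str) -> dict[str, str]:
    """Collect key=value directory segments from a path; the file name is ignored.

    Values are kept as strings; no type or date inference is attempted.
    """
    partitions: dict[str, str] = {}
    for part in PurePosixPath(path).parts[:-1]:
        key, sep, value = part.partition("=")
        if sep and key:
            partitions[key] = value
    return partitions


# Hypothetical layout: one IPC file per year/month partition.
partitions = parse_hive_segments("readings/year=2024/month=01/batch_17.arrow")
no_partitions = parse_hive_segments("readings/batch_17.arrow")
```

With hive options enabled, columns derived this way appear alongside the file's own columns, which is why a schema may need to account for them.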

as_stream=False (default): local file paths can be read by the Rust engine; anything else goes through PyArrow. as_stream=True uses PyArrow stream decoding. Install pydantable[arrow] for any path that goes through PyArrow.

Write (targets)

DataFrame[Schema] and DataFrameModel

  • df.write_ipc(path, *, compression=..., write_kwargs=..., streaming=...)
  • model.write_ipc(...)

IPC sink options are intentionally narrow: use top-level compression=. Non-empty write_kwargs is rejected.

pydantable.io

  • export_ipc, aexport_ipc
  • iter_ipc, aiter_ipc, write_ipc_batches — rectangular dict[str, list] batches (PyArrow); as_stream must match how the bytes were produced.
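As an illustration of what "rectangular dict[str, list] batches" means, the sketch below slices one column dict into fixed-size batches and rejects input whose columns differ in length. It uses only the standard library; is_rectangular and iter_batches are hypothetical helpers, not pydantable functions, and they do not touch IPC bytes at all.

```python
from __future__ import annotations

from typing import Any, Iterator


def is_rectangular(cols: dict[str, list[Any]]) -> bool:
    """True when every column has the same number of rows (or there are no columns)."""
    return len({len(values) for values in cols.values()}) <= 1


def iter_batches(cols: dict[str, list[Any]], batch_size: int) -> Iterator[dict[str, list[Any]]]:
    """Yield consecutive row slices of a column dict, each itself rectangular."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    if not is_rectangular(cols):
        raise ValueError("columns have different lengths")
    n_rows = len(next(iter(cols.values()), []))
    for start in range(0, n_rows, batch_size):
        yield {name: values[start:start + batch_size] for name, values in cols.items()}


batches = list(iter_batches({"sensor_id": [1, 2, 3], "celsius": [21, 22, 23]}, 2))
```

Each yielded dict has the shape a batch writer would expect: the same keys in every batch and equal-length lists within a batch.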

Runnable example

Run conventions: IO_OVERVIEW (Runnable example).

python docs/examples/io/ipc_roundtrip.py

"""Arrow IPC: hand off a columnar batch between processes (Feather/IPC on disk).

Needs pydantable._core. Run::

    python docs/examples/io/ipc_roundtrip.py
"""

from __future__ import annotations

import tempfile
from pathlib import Path

from pydantable import DataFrameModel


class SensorReading(DataFrameModel):
    """Two samples from a batch job writing IPC for a downstream consumer."""

    sensor_id: int
    celsius: int


def main() -> None:
    with tempfile.TemporaryDirectory() as scratch:
        from_worker = Path(scratch) / "batch_17.arrow"
        to_consumer = Path(scratch) / "batch_17_copy.arrow"
        SensorReading({"sensor_id": [1, 2], "celsius": [21, 22]}).write_ipc(
            str(from_worker)
        )

        df = SensorReading.read_ipc(str(from_worker))
        df.write_ipc(str(to_consumer))

        got = SensorReading.read_ipc(str(to_consumer))
        assert [int(x) for x in got.to_dict()["sensor_id"]] == [1, 2]
        assert [int(x) for x in got.to_dict()["celsius"]] == [21, 22]

    print("ipc_roundtrip: ok")


if __name__ == "__main__":
    main()

Output

ipc_roundtrip: ok

See also

IO_OVERVIEW · EXECUTION · SUPPORTED_TYPES (Arrow interchange)