Skip to content

I/O: Parquet over HTTP with automatic temp-file cleanup

Lazy HTTP Parquet reads download to a temp file. Prefer the *_ctx context managers to ensure the temp file is removed when the pipeline is done.

Recipe

from pydantable import DataFrameModel


class User(DataFrameModel):
    id: int
    age: int | None


url = "https://example.com/users.parquet"

with User.read_parquet_url_ctx(url) as df:
    out = df.filter(df.age.is_not_null()).select("id", "age")
    # Materialize inside the context
    _ = out.to_dict()

Pitfalls

  • Do not keep a lazy frame alive after the context exits; the backing file is deleted.
  • For safety limits on HTTP reads, see IO_HTTP (max_bytes).