
Mental model

This page is the “map” of pydantable: the core concepts and how they relate.

If you’re new, read this once, then jump to the Five-minute tour or DataFrameModel.

The four core nouns

Schema (row shape)

Schema is a Pydantic-style row model used to describe the shape and types of a table.

  • Used with DataFrame[RowSchema]
  • Used as an output type for materialized rows (e.g. collect())

See: DataFrameModel (it covers Schema as part of the typed table story).
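To make "row shape" concrete, here is a minimal sketch of a row schema checking a columnar dict and producing typed rows. It uses only the standard library: the dataclass stands in for a Pydantic-style model, and `RowSchema`, its fields, and `rows_from_columns` are hypothetical names chosen for illustration, not pydantable's API.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class RowSchema:
    # Hypothetical row shape: one field per column.
    name: str
    score: float


def rows_from_columns(schema, columns):
    """Check a columnar dict against a schema and return one instance per row."""
    expected = {f.name: f.type for f in fields(schema)}
    if set(columns) != set(expected):
        raise ValueError(f"expected columns {sorted(expected)}, got {sorted(columns)}")
    if len({len(values) for values in columns.values()}) > 1:
        raise ValueError("all columns must have the same length")

    rows = []
    # Walk the columns in schema order, one row at a time.
    for values in zip(*(columns[name] for name in expected)):
        for (name, typ), value in zip(expected.items(), values):
            if not isinstance(value, typ):
                raise TypeError(f"column {name!r}: {value!r} is not {typ.__name__}")
        rows.append(schema(*values))
    return rows


rows = rows_from_columns(RowSchema, {"name": ["a", "b"], "score": [9.5, 7.0]})
```

The same schema plays both roles listed above: it constrains the table going in and types the rows coming out.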

DataFrame[T] (typed table)

DataFrame[T] is a typed table whose columns match the schema T.

You can create a DataFrame from columnar data, from I/O helpers, or from optional engine-specific sources.

Start here:

Expr (typed expressions)

Expr is how you refer to and compute columns (e.g., df.score > 8.0).

Expressions are designed to remain type-aware so transforms can stay typed and composable.
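As a rough illustration of what "type-aware" can mean, the toy below builds deferred column expressions that carry a result type and compose through operators. This is a conceptual sketch in plain Python, not pydantable's implementation; `Expr`, `col`, and the `dtype` and `evaluate` attributes are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Expr:
    """A deferred column computation tagged with its result type."""

    dtype: type
    evaluate: Callable[[dict], list]

    def __gt__(self, other):
        # Comparing a column to a scalar yields a boolean expression.
        return Expr(bool, lambda cols: [v > other for v in self.evaluate(cols)])

    def __and__(self, other):
        # Type information lets us reject invalid combinations before running.
        if self.dtype is not bool or other.dtype is not bool:
            raise TypeError("& needs two boolean expressions")
        return Expr(
            bool,
            lambda cols: [
                a and b for a, b in zip(self.evaluate(cols), other.evaluate(cols))
            ],
        )


def col(name, dtype):
    """Refer to a column by name; nothing is computed until evaluate() is called."""
    return Expr(dtype, lambda cols: cols[name])


columns = {"score": [7.5, 9.0, 8.5], "age": [30, 17, 42]}
mask = (col("score", float) > 8.0) & (col("age", int) > 18)
result = mask.evaluate(columns)
```

Because each expression knows its own type, a mistake such as combining a float column with `&` fails when the expression is built rather than when the data is processed.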

Start here:

DataFrameModel (a “table model”)

DataFrameModel is the higher-level “SQLModel-style” concept: a reusable typed table definition with methods for:

  • ingest/validation rules
  • typed transforms
  • structured I/O entrypoints

If you’re building real pipelines (especially in services), this is usually the best place to start.

Start here: DataFrameModel

Execution: plans and materialization

pydantable is designed so that many operations can remain lazy until you explicitly choose to materialize.

Two key pages:

Materialization outputs (what you get at the end)

Common end states include:

  • Pydantic rows (a list of row models)
  • Columnar dicts (dict[str, list])
  • Engine-native objects (e.g. Polars/Arrow) when the relevant extras are installed

See: Execution (copy/interchange + display) and Materialization (modes).

A common early gotcha: shape is not always “executed rows”

After lazy transforms, df.shape follows root-buffer semantics: it is derived from the underlying root data rather than from executing the plan, so it may not match the number of rows that materialize after execution.
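The toy below shows how such a mismatch can arise. It is a sketch under one assumption about what "root-buffer semantics" means (shape is read from the original data and ignores pending steps); `LazyTable` and its methods are hypothetical and are not pydantable's classes.

```python
class LazyTable:
    """Toy lazy table: a root buffer plus a tuple of pending row filters."""

    def __init__(self, columns, filters=()):
        self._columns = columns
        self._filters = tuple(filters)

    @property
    def shape(self):
        # Reads the root buffer only; pending filters are not applied.
        n_rows = len(next(iter(self._columns.values()), []))
        return (n_rows, len(self._columns))

    def filter(self, predicate):
        # Records the step and returns a new table; nothing runs yet.
        return LazyTable(self._columns, self._filters + (predicate,))

    def collect(self):
        # Materialize: build rows, then apply every recorded filter in order.
        names = list(self._columns)
        rows = [dict(zip(names, values)) for values in zip(*self._columns.values())]
        for predicate in self._filters:
            rows = [row for row in rows if predicate(row)]
        return rows


table = LazyTable({"name": ["a", "b", "c"], "score": [7.5, 9.0, 8.5]})
high = table.filter(lambda row: row["score"] > 8.0)
```

In this toy, `high.shape` still reports three rows while `high.collect()` returns two. When the executed row count matters, counting the materialized result is the safe choice.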

This is documented as part of the compatibility contract:

Engines and backends (what “engine” means here)

There are two related ideas:

1) Execution backend: where the plan runs (the default native engine is Polars-backed inside the Rust extension).
2) Data sources / sinks: how you read/write data (files, HTTP, SQL, etc.).

Default execution

Out of the box, pydantable executes via the native extension.

If you want to understand the runtime and cost model:

Optional swap-in engines

pydantable also supports optional engines that keep the DataFrame API but use different backends:

I/O is a separate story (choose an entrypoint)

Even if you stay on the default execution engine, you still need to choose I/O entrypoints:

Where pydantable fits in the ecosystem

If you’re deciding between tools, start here: