Five-minute tour

This page is the RTD-friendly version of the optional notebook at notebooks/five_minute_tour.ipynb in the repository. It follows the same steps: build a typed DataFrame, inspect it, summarize, filter, and materialize.

Note

Requires a working pydantable installation with the Rust extension (build from source with pip install ., or install a prebuilt wheel).

If you’re deciding between tools

If you’re choosing between pydantable and a general-purpose DataFrame library, start here:

1. Define a table model

from pydantable import DataFrameModel


class Sales(DataFrameModel):
    id: int
    score: float
    label: str


df = Sales({
    "id": [1, 2, 3],
    "score": [10.0, 20.5, 7.0],
    "label": ["a", "b", "a"],
})

Other patterns

  • DataFrame[Schema] — same engine, generic API (Typing guide)
  • Avoid plain pydantic.BaseModel unless you intentionally skip DataFrameModel helpers

2. String repr and HTML (Jupyter)

In a terminal, repr(df) shows the schema and column dtypes (no row count—plans may be lazy).

In Jupyter / VS Code, the last expression in a cell can render as HTML via _repr_html_() (bounded preview; same cost class as head() + to_dict() for the slice). See EXECUTION Jupyter / HTML and Display options.
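
The HTML rendering relies on the rich-display hook that Jupyter looks for on the last expression of a cell. As a rough sketch of that mechanism, the hypothetical class below exposes both a text repr and a bounded HTML preview. The BoundedPreview name, its max_rows parameter, and the markup are assumptions made for illustration; none of this is pydantable's implementation.

```python
from html import escape


class BoundedPreview:
    """Hypothetical stand-in for a table object; not part of pydantable."""

    def __init__(self, columns: dict[str, list], max_rows: int = 2) -> None:
        self.columns = columns
        self.max_rows = max_rows  # assumed preview limit, for illustration only

    def __repr__(self) -> str:
        # Terminal-style repr: column names only, no row data.
        return f"BoundedPreview(columns={list(self.columns)})"

    def _repr_html_(self) -> str:
        # Jupyter calls this hook, when present, to render the cell result.
        names = list(self.columns)
        head = "".join(f"<th>{escape(n)}</th>" for n in names)
        n_rows = min(self.max_rows, len(self.columns[names[0]])) if names else 0
        body = "".join(
            "<tr>"
            + "".join(f"<td>{escape(str(self.columns[n][i]))}</td>" for n in names)
            + "</tr>"
            for i in range(n_rows)
        )
        return f"<table><tr>{head}</tr>{body}</table>"


preview = BoundedPreview({"id": [1, 2, 3], "label": ["a", "b", "a"]})
html_out = preview._repr_html_()  # only the first max_rows rows are rendered
```

Only the previewed slice is touched when the HTML is built, which is why a bounded preview costs roughly as much as fetching a small head of the data.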

3. Discovery helpers

df.columns
df.shape  # root-buffer semantics after lazy transforms—see [INTERFACE_CONTRACT](../semantics/interface-contract/)
df.info()
print(df.describe())
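
To see why shape can report root-buffer dimensions after a lazy transform, here is a toy model in plain Python. ToyLazyTable and all of its methods are hypothetical and only mimic the idea of deferred filters; they say nothing about how pydantable is built internally.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class ToyLazyTable:
    """Hypothetical toy model; not pydantable's implementation."""

    root: dict[str, list]
    predicates: list[Callable[[dict], bool]] = field(default_factory=list)

    @property
    def shape(self) -> tuple[int, int]:
        # Reports the root buffer's dimensions and ignores pending filters.
        n_rows = len(next(iter(self.root.values()), []))
        return (n_rows, len(self.root))

    def filter(self, predicate: Callable[[dict], bool]) -> "ToyLazyTable":
        # Records the filter without executing it.
        return ToyLazyTable(self.root, [*self.predicates, predicate])

    def to_dict(self) -> dict[str, list]:
        # Executes the pending filters and returns surviving rows by column.
        names = list(self.root)
        rows = [dict(zip(names, values)) for values in zip(*self.root.values())]
        kept = [r for r in rows if all(p(r) for p in self.predicates)]
        return {n: [r[n] for r in kept] for n in names}


table = ToyLazyTable({"id": [1, 2, 3], "score": [10.0, 20.5, 7.0]})
filtered = table.filter(lambda row: row["score"] > 8.0)
root_shape = filtered.shape                  # still describes the root buffer
true_rows = len(filtered.to_dict()["id"])    # row count after execution
```

In this toy, the filtered table still reports three root rows while materialization yields two, which is the kind of mismatch the shape comment warns about.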

Runnable script:

python docs/examples/getting_started/quickstart_discovery_helpers.py

from __future__ import annotations

from pydantable import DataFrame
from pydantic import BaseModel


class Row(BaseModel):
    id: int
    score: float
    label: str


def main() -> None:
    # Same sample data as step 1.
    df = DataFrame[Row]({
        "id": [1, 2, 3],
        "score": [10.0, 20.5, 7.0],
        "label": ["a", "b", "a"],
    })

    print("df.columns =", df.columns)
    print("df.shape =", df.shape)
    print()
    print(df.info())
    print()
    print(df.describe())


if __name__ == "__main__":
    main()

Output

df.columns = ['id', 'score', 'label']
df.shape = (3, 3)

DataFrame[Row]
  schema: Row
  columns: 3
  shape (root buffer): 3 x 3
  Note: after lazy transforms (e.g. filter), root row count may not match materialized rows; use to_dict() or collect() for true count.

dtypes:
  id: int
  score: float
  label: str

describe() — one to_dict(); int/float/bool/str/date/datetime columns.

id: count=3 mean=2 std=1 min=1 max=3
score: count=3 mean=12.5 std=7.08872 min=7.0 max=20.5
label: count=3 n_unique=2 min_len=1 max_len=1 null=0
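
The summary figures above can be cross-checked with the standard library alone. The sketch below assumes the sample data from step 1; the observation that std matches the sample standard deviation (n - 1 denominator) is inferred from the printed numbers, not from pydantable's documentation.

```python
import statistics

# Sample columns from step 1.
score = [10.0, 20.5, 7.0]
label = ["a", "b", "a"]

score_mean = statistics.mean(score)
score_std = statistics.stdev(score)  # sample standard deviation (n - 1)

label_unique = len(set(label))
label_lengths = [len(s) for s in label]

print(f"score: count={len(score)} mean={score_mean} std={score_std:.5f}")
print(
    f"label: count={len(label)} n_unique={label_unique} "
    f"min_len={min(label_lengths)} max_len={max(label_lengths)}"
)
```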

4. Filter and materialize

filtered = df.filter(df.score > 8.0)
rows = filtered.collect()  # list[Pydantic row models]
cols = filtered.to_dict()  # dict[str, list]
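
The two materialization calls differ in result layout: one object per row versus one list per column. The plain-Python sketch below illustrates those two layouts for the same filter, using the sample data from step 1. SalesRow is a hypothetical dataclass standing in for a Pydantic row model, and nothing here calls pydantable.

```python
from dataclasses import dataclass


@dataclass
class SalesRow:
    """Hypothetical plain-Python row type standing in for a Pydantic row model."""

    id: int
    score: float
    label: str


data = {"id": [1, 2, 3], "score": [10.0, 20.5, 7.0], "label": ["a", "b", "a"]}

# Row-oriented layout: one object per row that passes score > 8.0.
rows = [
    SalesRow(i, s, l)
    for i, s, l in zip(data["id"], data["score"], data["label"])
    if s > 8.0
]

# Column-oriented layout: one list per column, aligned by position.
cols = {
    "id": [r.id for r in rows],
    "score": [r.score for r in rows],
    "label": [r.label for r in rows],
}
```

The row layout suits per-record handling such as serializing API responses, while the column layout is convenient for handing whole columns to numeric code.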

Use to_polars() / to_arrow() when the optional extras are installed (EXECUTION Copy as / interchange).

5. Join, group, and window (optional)

For a longer analytics walkthrough (join + groupby + window), run:

python docs/examples/core/join_groupby_window.py

See also

  • DataFrameModel — validation, transforms, service patterns
  • PANDAS_UI — optional pydantable.pandas import (assign, merge, cleaning helpers)
  • EXECUTION — materialization cost, async, display limits
  • INTERFACE_CONTRACT — semantics (joins, nulls, shape vs executed rows)
  • IO_DECISION_TREE — pick lazy vs eager I/O; prefer DataFrameModel / DataFrame[Schema] classmethods over raw pydantable.io
  • IO_OVERVIEW — per-format tables (Parquet, CSV, NDJSON, JSON, IPC, HTTP, SQL)
  • MONGO_ENGINE / BEANIE — optional pydantable[mongo] (lazy MongoDataFrame, eager fetch_mongo / afetch_mongo, Beanie ODM helpers)