Lazy SQL DataFrame: SqlDataFrame and SqlDataFrameModel¶
This guide covers the optional SQLAlchemy-backed lazy execution path: typed
DataFrame / DataFrameModel instances whose execution engine comes from the
same optional stack installed with pip install "pydantable[sql]" (SQLAlchemy 2.x connection plumbing plus pydantable’s bridge to ExecutionEngine — see CUSTOM_ENGINE_PACKAGE).
The reason to use a SQL execution engine is to keep transformations on the
database side for as long as possible: the lazy-SQL bridge compiles the typed plan to SQL
so you are not forced to pull entire result sets into Python memory just to
select, sort, join, and similar steps. That matters most when the end of
the pipeline writes back to the same database — there is little benefit in
round-tripping the full dataset through the app when the work can stay in the
server.
Note
Install: pip install "pydantable[sql]" (SQLAlchemy bridge; add a DB-API driver for your DSN — e.g. psycopg / asyncpg for Postgres, aiosqlite or optional rapsqlite for async SQLite). The
core pydantable package does not import the optional SQL bridge at import time;
SqlDataFrame / SqlDataFrameModel are loaded lazily from the root package
(from pydantable import SqlDataFrame) or imported explicitly from
pydantable.sql_dataframe.
SQLite: sync (lazy-SQL bridge) vs async (rapsqlite)¶
The lazy-SQL stack builds a synchronous SQLAlchemy engine from EngineConfig. For SQLite, use a normal URL such as sqlite:///:memory: or sqlite:///path/to.db.
rapsqlite registers the sqlite+rapsqlite dialect for async SQLAlchemy (create_async_engine). That stack is for your async SQLAlchemy application code, not for the current Pydantable wiring described below.
Async drivers and SqlDataFrame (today)¶
The optional SQL bridge ships both sync and async connection helpers (ConnectionManager vs AsyncConnectionManager), and the latter can build an AsyncEngine from an async-style DSN (including normalizing sqlite:// → sqlite+aiosqlite where applicable).
However, the engine pydantable constructs for sql_config= / sql_engine_from_config() is implemented on top of the sync stack only: it uses QueryExecutor with a synchronous ConnectionManager. Its async_execute_plan coroutine offloads the synchronous execute_plan to a worker thread (asyncio.to_thread), so the event loop stays responsive, but database I/O is still sync SQLAlchemy under the hood.
So you cannot point sql_config= at an async-only URL and expect PydanTable to construct create_async_engine today: EngineConfig passed through ConnectionManager must resolve to a sync Engine (see the bridge package’s sql/connection.py).
If you need async SQLAlchemy today¶
- Use sync DSNs with
SqlDataFrame/sql_engine=as documented here. - Use async engines and sessions (e.g.
sqlite+rapsqlite://…,postgresql+asyncpg://…) in FastAPI / SQLAlchemy code paths that do not go through the lazy-SQLExecutionEngine, or use the eager SQL I/O helpers in IO_SQL with your own session/engine.
Roadmap (upstream)¶
Native async driver support for this path would require the SQL bridge to expose an async-first engine so plan execution uses AsyncQueryExecutor + AsyncConnectionManager and async_execute_plan awaits real async I/O. When that exists, PydanTable can add matching constructors (e.g. async sql_config resolution) without duplicating SQL logic here.
Until then, ato_dict / acollect on SqlDataFrame still use the engine’s async_execute_plan entrypoint (thread offload) — see EXECUTION.
When to use this¶
| Goal | Use |
|---|---|
| Default Polars/Rust execution for in-memory or file-backed workflows | DataFrame / DataFrameModel (see DATAFRAMEMODEL, EXECUTION). |
| Eager SQL I/O: load columns from a DB into a frame, or write tables | from pydantable import … — IO_SQL (SQLModel-first: fetch_sqlmodel, write_sqlmodel, …; string SQL: fetch_sql_raw, write_sql_raw, …). |
| Lazy execution with transforms staying in SQL where the engine supports it (plans compiled to SQL; avoid full-table pulls when you only need a terminal write or small materialization) | SqlDataFrame / SqlDataFrameModel with sql_config= or sql_engine=. |
The SQL I/O helpers materialize column dicts in Python; they do not replace
DataFrame._engine. SqlDataFrame wires the lazy-SQL engine so
select, filter, collect, etc. go through the SQL bridge (subject to what
the engine implements today).
Imports¶
# Lazy (optional SQL bridge not imported until accessed)
from pydantable import SqlDataFrame, SqlDataFrameModel
# Explicit
from pydantable.sql_dataframe import (
SqlDataFrame,
SqlDataFrameModel,
sql_engine_from_config,
)
If the optional SQL bridge is missing, constructing these classes raises ImportError
with an install hint (pydantable[sql]).
pandas- or PySpark-shaped SQL frames (same lazy-SQL engine; lazy-loaded so
import pydantable.pandas does not pull the bridge until you access the
names):
from pydantable.pandas import SqlDataFrame, SqlDataFrameModel # pandas-style API
from pydantable.pyspark import SqlDataFrame, SqlDataFrameModel # PySpark-style API
# Or explicitly:
from pydantable.pandas_sql_dataframe import SqlDataFrame, SqlDataFrameModel
from pydantable.pyspark.sql_dataframe import SqlDataFrame, SqlDataFrameModel
EngineConfig and constructing an engine¶
Examples use EngineConfig from the SQLAlchemy bridge module (import moltres_core after installing [sql]) for SQLAlchemy 2.x connection
settings. You must supply exactly one of dsn, engine, or session
(see the bridge package docs shipped with the extra).
from moltres_core import EngineConfig
# Example: in-memory SQLite (DSN)
sql_config = EngineConfig(dsn="sqlite:///:memory:")
Reuse one engine across many frames:
from pydantable.sql_dataframe import sql_engine_from_config
engine = sql_engine_from_config(sql_config)
# Pass sql_engine=engine on each SqlDataFrame / SqlDataFrameModel
SqlDataFrame¶
SqlDataFrame subclasses DataFrame. Use
SqlDataFrame[YourSchema](data, ...) the same way as DataFrame[YourSchema],
but you must provide a SQL execution backend via one of:
engine=— any object satisfyingExecutionEngine(advanced / tests).sql_engine=— a pre-built lazy-SQLExecutionEngineinstance.sql_config=— anEngineConfig; PydanTable buildsConnectionManager+ the bridge engine internally.
Precedence is exactly that order: explicit engine= wins.
Lazy read from a SQL table¶
Use SqlDataFrame[Schema].from_sql_table(table, sql_config=…) (or sql_engine=) so the frame holds a SqlRootData root: no SELECT runs until you call to_dict(), collect(), head(), etc. The table argument is a SQLAlchemy Table / FromClause — with SQLModel, pass YourModel.__table__. Column names must match the Pydantic schema fields on SqlDataFrame. For SqlDataFrameModel, use MyModel.read_sql_table(table, …).
For SQLite in-memory (sqlite:///:memory:), each new SQLAlchemy connection pool is an empty database unless you share one sql_engine for both DDL and the frame. Prefer a file URL (sqlite:///…/app.db) while wiring this up, or build tables using the same ConnectionManager / engine you pass as sql_engine=.
from pathlib import Path
import tempfile
from pydantic import BaseModel
from moltres_core import ConnectionManager, EngineConfig
from sqlmodel import Field, SQLModel, Session
from pydantable.sql_dataframe import SqlDataFrame, sql_engine_from_config
class Row(BaseModel):
"""Pydantable row schema; field names align with the SQL table."""
id: int
name: str
class Item(SQLModel, table=True):
"""Physical table; use ``Item.__table__`` for ``from_sql_table``."""
id: int | None = Field(default=None, primary_key=True)
name: str = Field(max_length=40)
with tempfile.TemporaryDirectory() as td:
db_file = Path(td) / "app.db"
cfg = EngineConfig(dsn=f"sqlite:///{db_file}")
eng = sql_engine_from_config(cfg)
cm = ConnectionManager(cfg)
SQLModel.metadata.create_all(cm.engine)
with Session(cm.engine) as session:
session.add(Item(id=1, name="a"))
session.commit()
df = SqlDataFrame[Row].from_sql_table(Item.__table__, sql_engine=eng) # lazy
cols = df.to_dict() # runs SELECT here
You can define a single SQLModel class and use it both as the table definition and as the dataframe schema type (SqlDataFrame[ThatModel]) when the shapes match; the split above shows how Row (view schema) and Item (DDL) line up by column name.
Eager in-memory columns (no SQL root) — same constructor as DataFrame:
from moltres_core import EngineConfig
from pydantable import SqlDataFrame
cfg = EngineConfig(dsn="sqlite:///:memory:")
df = SqlDataFrame[Row]({"id": [1, 2], "name": ["a", "b"]},
sql_config=cfg,)
Constructor flags (trusted_mode, fill_missing_optional, validation hooks,
etc.) match DataFrame — see DATAFRAMEMODEL and STRICTNESS.
Lazy root from a SQLAlchemy selectable (from_sql)¶
When you already have a SQLAlchemy selectable (a Select, subquery, or other
FromClause-like object), use SqlDataFrame[Schema].from_sql(selectable, ...).
This constructor is type-safe in that it:
- Requires a concrete schema type (
SqlDataFrame[MySchema]). - Validates the selectable exposes the schema’s field names as columns.
- Projects exactly the schema columns at the root (so extra columns on the selectable don’t accidentally become part of the plan’s required root shape).
from sqlalchemy import select
from pydantable.sql_dataframe import SqlDataFrame
# stmt is any SQLAlchemy selectable; field names must align with the schema fields.
stmt = select(Item.__table__.c.id, Item.__table__.c.name)
df = SqlDataFrame[Row].from_sql(stmt, sql_engine=eng)
Typed-safe WHERE pushdown (where)¶
SqlDataFrame.where(whereclause) wraps the lazy root so the SQL bridge emits a
server-side WHERE clause.
- Input is a SQLAlchemy boolean expression (e.g.
table.c.x > 1). - Referenced column names are validated against the schema (unknown columns raise
KeyError).
df = SqlDataFrame[Row].from_sql_table(Item.__table__, sql_engine=eng)
out = df.where(Item.__table__.c.id > 1).select("name").to_dict()
DataFrame transformations (lazy-SQL engine)¶
SqlDataFrame / SqlDataFrameModel use the same method names as DATAFRAMEMODEL / EXECUTION, but the SQL bridge only implements a subset of the full Polars/Rust pipeline.
Generally work (the bridge can execute or fold these into the SQL plan where supported):
- Projection:
select,drop - Row windows:
head,slice - Order:
sort(and related ordering helpers your SQL bridge version exposes) - Terminals:
collect,to_dict,to_dicts(materialize rows or column dicts) - Async terminals:
ato_dict,acollect,ato_dicts, … (SQL work may run on a thread pool; see EXECUTION)
Example (sync):
from pydantic import BaseModel
from moltres_core import EngineConfig
from pydantable.sql_dataframe import SqlDataFrame
class Row(BaseModel):
id: int
name: str
cfg = EngineConfig(dsn="sqlite:///:memory:")
df = SqlDataFrame[Row]({"id": [3, 1, 2], "name": ["c", "a", "b"]},
sql_config=cfg,)
assert df.select("name").to_dict() == {"name": ["c", "a", "b"]}
assert df.head(2).to_dict() == {"id": [3, 1], "name": ["c", "a"]}
assert df.sort("id").to_dict() == {"id": [1, 2, 3], "name": ["a", "b", "c"]}
Not supported today: filter / with_columns / other paths that require the native Expr runtime raise UnsupportedEngineOperationError (see Expressions below). For Expr-heavy work, use the default Polars/Rust engine or materialize with IO_SQL helpers first.
SqlDataFrameModel¶
SqlDataFrameModel subclasses DataFrameModel
and sets the inner frame class to SqlDataFrame. Define columns on a subclass
as usual; pass sql_config=, sql_engine=, or engine= on
construction.
from moltres_core import EngineConfig
from pydantable import SqlDataFrameModel
class Users(SqlDataFrameModel):
id: int
name: str
cfg = EngineConfig(dsn="sqlite:///:memory:")
users = Users({"id": [1], "name": ["Ada"]}, sql_config=cfg)
FastAPI: shared engine and async terminals¶
Install pydantable[fastapi,sql] (or your project’s equivalent). Build one shared lazy-SQL engine per process (or pool) and reuse sql_engine= on frames — do not create a fresh EngineConfig(dsn="sqlite:///:memory:") per request, or each handler would see an empty database.
The pattern below stores the engine on app.state at startup and returns columnar JSON with await …ato_dict() (async terminal). Sync routes can call to_dict() directly. See FASTAPI and EXECUTION for wider FastAPI + pydantable guidance.
from contextlib import asynccontextmanager
from fastapi import FastAPI
from moltres_core import EngineConfig
from pydantable.sql_dataframe import SqlDataFrameModel, sql_engine_from_config
class AppUser(SqlDataFrameModel):
id: int
name: str
@asynccontextmanager
async def lifespan(app: FastAPI):
cfg = EngineConfig(dsn="sqlite:///:memory:") # use a file DSN or pooled URL in production
app.state.sql_engine = sql_engine_from_config(cfg)
yield
app = FastAPI(lifespan=lifespan)
@app.get("/users/names")
async def user_names():
df = AppUser(
{"id": [1, 2], "name": ["Ada", "Bob"]},
sql_engine=app.state.sql_engine,
)
names = df.select("name")
return await names.ato_dict()
Expressions (Expr) and the native runtime¶
Expr-based filter / with_columns / … rely on
pydantable.engine.get_expression_runtime(), which is tied to the native
Rust core when the default engine is NativePolarsEngine. With only the
lazy-SQL engine bound to your frame, expression-heavy APIs may raise
UnsupportedEngineOperationError unless you integrate a compatible expression
runtime — see Expressions in CUSTOM_ENGINE_PACKAGE and ADR-engines.
Prefer operations your SQL bridge version documents as supported, or use the native engine for expression-heavy paths.
File I/O vs execution engine¶
Lazy read_* paths can still use pydantable-native for local files; that
is separate from DataFrame._engine. A custom SQL engine does not
automatically route read_parquet through the database. See File I/O vs
execution engine in CUSTOM_ENGINE_PACKAGE.
Versioning¶
The SQL bridge declares a dependency on pydantable-protocol compatible with your pydantable release line. Keep pydantable, pydantable-protocol, and the bridge package on mutually supported combinations (see VERSIONING).
See also¶
- CUSTOM_ENGINE_PACKAGE — third-party
ExecutionEnginepackages (reference implementations ship alongside the SQL bridge). - IO_SQL — eager SQL I/O (SQLModel-first:
fetch_sqlmodel,write_sqlmodel, …; string SQL:*_rawhelpers). - SQLMODEL_SQL_ROADMAP — SQLModel-first API history and design notes.
- FASTAPI — columnar bodies, NDJSON, and service patterns (works alongside
SqlDataFrameModel). - ADR-engines — engine abstraction and extension points.
- EXECUTION — materialization and engines.
- DATAFRAMEMODEL —
DataFrameModelpatterns shared bySqlDataFrameModel.