Supported data types¶
This page is the authoritative list of the column types that pydantable accepts on
DataFrameModel / DataFrame[Schema] fields, uses in expression typing (Rust AST),
and maps to Rust schema descriptors. The descriptor shapes are: scalar,
{"base": ..., "nullable": ...} plus an optional homogeneous "literals": [...] for
typing.Literal[...] columns; nested struct,
{"kind": "struct", "nullable": ..., "fields": [...]}; and homogeneous list,
{"kind": "list", "nullable": ..., "inner": <descriptor>}.
For behavior contracts (nulls, joins, ordering), see INTERFACE_CONTRACT.md.
JSON (RFC 8259) vs column types¶
When data is JSON on the wire (files, HTTP bodies), use this mapping from JSON value kinds to pydantable fields. For lazy JSON Lines scans vs eager array files, see IO_JSON.
| JSON kind | Model as | Notes |
|---|---|---|
| null | Optional[T] / T \| None | SQL-style null propagation; see Nullability below |
| boolean | bool | |
| string | str, Literal[...], UUID, enums, IP types, Annotated[str, ...], etc. | Same as Scalar base types |
| number | int, float, or Decimal | JSON does not distinguish integers from floats; Pydantic coercion applies on validation |
| array | list[T] (homogeneous) | Heterogeneous arrays (e.g. [1, "a", {}]) are not a single list[T] column; use a str column plus parse/validate elsewhere, or split fields |
| object (nested) | Nested Schema / BaseModel | Struct column; use Expr.struct_field, unnest, etc. |
| object (string-keyed map) | dict[str, T] | Map column (string keys only); see Map-like columns below |
Arbitrary or schemaless JSON: pydantable does not provide a catch-all “JSON cell” dtype. Store str and use Expr.str_json_path_match or parse in Python after collect() / to_dicts(), or normalize upstream into typed columns.
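The mapping above can be explored with the standard library alone. The sketch below is not pydantable code: json_kind and is_homogeneous are hypothetical helpers that name the RFC 8259 kind of a parsed value and test whether an array could plausibly fit a single list[T] column.

```python
import json


def json_kind(value):
    """Return the RFC 8259 kind of a value produced by json.loads."""
    if value is None:
        return "null"
    if isinstance(value, bool):  # check bool before int: bool subclasses int
        return "boolean"
    if isinstance(value, (int, float)):
        return "number"
    if isinstance(value, str):
        return "string"
    if isinstance(value, list):
        return "array"
    return "object"


def is_homogeneous(array):
    """True when all non-null elements share one JSON kind."""
    kinds = {json_kind(v) for v in array if v is not None}
    return len(kinds) <= 1


row = json.loads('{"id": 1, "score": 1.0, "tags": ["a", "b"], "mixed": [1, "a", {}]}')
kinds = {name: json_kind(value) for name, value in row.items()}
```

Python's parser yields int for 1 and float for 1.0 even though both are the single JSON kind number, which is why the choice between int, float, and Decimal belongs to the schema. The homogeneity test is necessary but not sufficient: objects of different shapes all report the same kind.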
Scalar base types (columns + expressions)¶
Each column in your schema is one scalar Python type from this set:
| Python type | Import | Descriptor base (Rust) |
|---|---|---|
| int | built-in | int |
| float | built-in | float |
| bool | built-in | bool |
| str | built-in | str |
| UUID | from uuid import UUID | uuid (stored as Polars Utf8; cells round-trip as uuid.UUID) |
| Decimal | from decimal import Decimal | decimal (Polars Decimal(38, 9); scale fixed at 9) |
| enum.Enum subclass | import enum | enum (Polars Utf8; wire value is the member’s .value when it is a string, otherwise str(member)) |
| datetime | from datetime import datetime | datetime |
| date | from datetime import date | date |
| time | from datetime import time | time (Polars Time; wall clock, distinct from datetime / timedelta) |
| timedelta | from datetime import timedelta | duration |
| bytes | built-in | binary (Polars Binary; limited Expr surface) |
| typing.Literal[...] | from typing import Literal | Same base as the homogeneous value kind (str / int / bool); descriptor includes literals list (all-str, all-int, or all-bool parameters only; no mixing). Stored like plain str / int / bool. filter(col == "x") is rejected at expression build time if "x" is not in the Literal set. |
| IPv4Address | from ipaddress import IPv4Address | ipv4 (Polars Utf8; canonical IPv4 string) |
| IPv6Address | from ipaddress import IPv6Address | ipv6 (Polars Utf8; canonical IPv6 string) |
| WKB | from pydantable.types import WKB | wkb (Polars Binary; bytes subclass with Pydantic validation; same limited Expr surface as bytes) |
Use these types in Pydantic field annotations on DataFrameModel subclasses (or
Schema models for DataFrame[Schema]). Runtime cell values must be instances
compatible with that annotation (validated by Pydantic under default ingest, i.e. trusted_mode="off" or omitted).
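The enum row of the table describes how a member becomes a stored string. A minimal sketch of that rule follows; enum_wire_value is a hypothetical helper written for illustration, not a pydantable function, and the two enum classes are invented examples.

```python
import enum


class Color(enum.Enum):
    RED = "red"
    BLUE = "blue"


class Priority(enum.Enum):
    LOW = 1
    HIGH = 2


def enum_wire_value(member):
    """Hypothetical helper: .value when it is a string, otherwise str(member)."""
    if isinstance(member.value, str):
        return member.value
    return str(member)


wire_values = [enum_wire_value(Color.RED), enum_wire_value(Priority.HIGH)]
```

For a plain enum.Enum with non-string values, str(member) is the qualified member name, so string-valued enums tend to give more readable stored data.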
Nullability¶
Nullable columns use the same base types with None as a cell value:
Optional[T], T | None, or Union[T, None] with T from the table above.
Expression results and filter conditions follow SQL-like null rules; see
INTERFACE_CONTRACT.md.
Typed strings (Annotated[str, ...])¶
Annotated[str, ...] is accepted as a column annotation: metadata is stripped for
the Rust dtype (logical str / Polars Utf8), while Pydantic on
collect() / RowModel still applies your constraints (for example
Annotated[str, pydantic.HttpUrl]). Match other projects’ “newtype string” patterns
without a separate Rust base type.
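Stripping metadata for the storage dtype can be pictured with typing introspection. This is a sketch under stated assumptions: storage_type is a hypothetical helper, and MaxLen is an invented marker standing in for real Pydantic constraint metadata.

```python
from typing import Annotated, get_args, get_origin


class MaxLen:
    """Hypothetical metadata marker standing in for a real Pydantic constraint."""

    def __init__(self, n):
        self.n = n


ShortStr = Annotated[str, MaxLen(10)]


def storage_type(annotation):
    """Drop Annotated metadata and return the underlying type."""
    if get_origin(annotation) is Annotated:
        return get_args(annotation)[0]
    return annotation


base = storage_type(ShortStr)       # the type the engine would store
metadata = get_args(ShortStr)[1:]   # what remains for validation
```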
Custom semantic scalar types (Phase 3)¶
If you want a domain-specific scalar type (e.g. ULID, CountryCode) that behaves
like str/int/bytes but is validated/coerced by Pydantic v2, define a custom
type with __get_pydantic_core_schema__ and register it with pydantable.
See CUSTOM_DTYPES for the full guide.
Practical notes (1.2.0 scalars)¶
typing.Literal[...]
- Parameters must be all str, all int, or all bool (no mixing).
- filter(col == literal) is checked when the expression is built: the constant must appear in the column’s Literal set (same idea for !=).
- Nullable columns use Literal[...] | None (or Optional[...]); None is not a Literal member for those checks.
IP addresses (IPv4Address, IPv6Address)
- Column input may be strings; pydantable normalizes to ipaddress instances under default validation.
- In Expr comparisons, wrap addresses on the RHS with IPv4Address(...) / IPv6Address(...). The Python expression builder types the RHS literal as str otherwise, which does not satisfy the IP column dtype in compare_op (even though the Rust core allows some IP/string combinations in other paths).
WKB
- pydantable.types.WKB is a bytes subclass with Pydantic integration for row models.
- Use WKB(b"...") (or equal WKB cells) on the RHS of == / != for reliable typing.
- Expr.binary_len() is implemented for bytes columns; for WKB, use df.col.cast(bytes).binary_len() today.
Annotated[str, ...]
- For collect() / RowModel, Pydantic enforces your metadata (length, URL, regex, etc.). Invalid cells may only surface at materialization time unless the constructor path validates early; see tests in tests/test_extended_scalar_dtypes_v12.py.
    from __future__ import annotations

    import ipaddress
    from typing import Annotated, Literal

    from pydantic import Field, HttpUrl

    from pydantable import DataFrameModel
    from pydantable.types import WKB


    class Row(DataFrameModel):
        env: Literal["dev", "prod"]  # homogeneous str Literal
        host: ipaddress.IPv4Address
        geom: WKB | None  # nullable WKB column
        url: Annotated[str, HttpUrl]  # stored as str


    # Columnar input: one list per column, one value per row.
    df = Row(
        {
            "env": ["dev"],
            "host": ["192.168.0.1"],
            "geom": [WKB(b"\x01\x02")],
            "url": ["https://example.com"],
        }
    )

    # Wrap the RHS in IPv4Address so it matches the column dtype.
    needle = ipaddress.IPv4Address("192.168.0.1")
    subset = df.filter(df.env == "dev").filter(df.host == needle)
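The Literal and IP notes in this section can be modelled without pydantable. The sketch below is illustrative only: check_literal_constant and literal_base are hypothetical helpers, and the exception types they raise are assumptions rather than the library's documented errors.

```python
import ipaddress
from typing import Literal, get_args

Env = Literal["dev", "prod"]


def check_literal_constant(literal_type, constant):
    """Raise ValueError when constant is not a member of the Literal set."""
    allowed = get_args(literal_type)
    # Compare type as well as value so that True does not match 1.
    if not any(type(a) is type(constant) and a == constant for a in allowed):
        raise ValueError(f"{constant!r} is not in Literal{list(allowed)}")
    return constant


def literal_base(literal_type):
    """Return the single base type of a homogeneous Literal, else raise TypeError."""
    bases = {type(v) for v in get_args(literal_type)}
    if len(bases) != 1 or not bases <= {str, int, bool}:
        raise TypeError("Literal parameters must be all str, all int, or all bool")
    return bases.pop()


ok = check_literal_constant(Env, "dev")
try:
    check_literal_constant(Env, "staging")
    rejected = False
except ValueError:
    rejected = True

# An address compares equal to another address, never to its string form.
host = ipaddress.IPv4Address("192.168.0.1")
same_host = host == ipaddress.IPv4Address("192.168.0.1")
str_mismatch = host == "192.168.0.1"
```

The last comparison is False in plain Python too, which is a useful way to remember why the RHS should be an IPv4Address instance rather than a str.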
Nested Pydantic models (struct columns)¶
A column may use a Schema / BaseModel subclass whose fields are themselves
supported column types (scalars or further nested models). Each cell is one nested
model instance: in columnar Python input, use a list of dicts (or objects that
validate as the nested model).
Rust maps these to struct dtypes and Polars struct columns. Expression
support for whole-struct ops is conservative (no arithmetic on whole structs;
equality is allowed only when struct shapes match). Use Expr.struct_field(...)
for scalar field projection, plus struct helpers below (all Polars-backed;
see INTERFACE_CONTRACT row-wise notes).
Map-like columns (dict[str, T])¶
String-keyed maps are supported as dict[str, T] where T can be scalar,
list, map, struct, or unions with None (JSON-like payloads). Cells
are Python dict values; the engine stores a logical map as Polars
List(Struct{key: str, value: T}). Expression support is intentionally small:
map_len(), map_get(key), map_contains_key(key), map_keys(),
map_values(), and map_entries() (see Expression typing below); not all
Polars map ops are exposed.
map_from_entries() builds map cells from a list of {key, value} entry structs.
If the entry list contains duplicate string keys, the last entry for that key
wins (Polars map semantics); do not rely on raising an error for duplicates.
Arrow-native map columns (0.15.0+): You may pass a PyArrow Array or ChunkedArray typed as map<string|large_string, V> for a dict[str, T] field. Each row is converted to a Python dict (or None if the map cell is null). Non-string Arrow map keys are rejected. With trusted_mode='strict', scalar T is checked against the Arrow value type; nested T (list, struct, map) uses best-effort acceptance, so prefer trusted_mode='off' when you need full Pydantic validation of nested cells. Non-string Python keys (e.g. dict[int, T]) are not supported.
0.17.0 — Map expressions after Arrow ingest: After ingest, the column is a normal dict[str, T] map for planning and Expr: map_get(key) yields null when the key is absent (or the whole map cell is null); map_contains_key(key) is boolean. Same rules apply to maps built from Python dict cells. See tests/test_pyarrow_map_ingest.py (test_arrow_map_ingest_then_map_get_and_contains).
0.18.0 / 0.19.0 / 0.20.0 — Non-string map keys: dict[int, T], other non-string Python keys, and Arrow map types whose keys are not string / large_string remain unsupported. That work is explicitly deferred; see ROADMAP Later. 0.20.0 does not change map dtypes or ingest.
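The duplicate-key and missing-key rules in this section can be modelled in plain Python. The functions below share names with the Expr methods for readability, but they are a stand-in operating on ordinary dicts, not the pydantable or Polars implementation.

```python
def map_from_entries(entries):
    """Build a dict from {key, value} entries; the last duplicate key wins."""
    result = {}
    for entry in entries:
        result[entry["key"]] = entry["value"]
    return result


def map_get(cell, key):
    """Value for key, or None when the key is absent or the whole cell is null."""
    if cell is None:
        return None
    return cell.get(key)


entries = [
    {"key": "a", "value": 1},
    {"key": "b", "value": 2},
    {"key": "a", "value": 3},  # duplicate key: overrides the first entry
]
cell = map_from_entries(entries)
```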
Homogeneous list columns (list[T] / List[T])¶
Use list[T] (or typing.List[T]) where T is any supported column type
(scalar, nested struct, or another list[...]). Each cell is a Python list of
values matching T. Rust uses DTypeDesc::List and Polars list columns.
- explode(...) is supported for list-typed columns: it lowers to Polars explode and sets the column’s dtype to the inner T (nullable, even when the list column was non-nullable). Multi-column explode requires equal list lengths on each row; empty list cells produce no output rows for that row.
- unnest(...) is supported for struct columns (nested models): fields become top-level columns named {struct_column}_{field} (separator _), with dtypes and nullability derived from the nested schema.
- Expressions on list columns (see below): indexing, membership, length, and numeric reductions, not element-wise arithmetic between two list columns. Use explode when you need one row per list element.
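A row-oriented model of the reshaping rules above may help when reasoning about output shapes. It is a sketch, not the engine: explode and unnest here are plain-Python stand-ins working on lists of dicts, and null list cells are deliberately left out of scope.

```python
def explode(rows, column):
    """One output row per list element; empty list cells produce no rows."""
    out = []
    for row in rows:
        for item in row[column]:
            out.append({**row, column: item})
    return out


def unnest(rows, column):
    """Replace a struct (dict) column with top-level {column}_{field} columns."""
    out = []
    for row in rows:
        flat = {k: v for k, v in row.items() if k != column}
        for field, value in row[column].items():
            flat[f"{column}_{field}"] = value
        out.append(flat)
    return out


rows = [
    {"id": 1, "tags": ["a", "b"], "addr": {"street": "Main", "zip": 12345}},
    {"id": 2, "tags": [], "addr": {"street": "Oak", "zip": None}},
]
exploded = explode(rows, "tags")   # row 2 disappears: its list is empty
unnested = unnest(rows, "addr")    # addr becomes addr_street and addr_zip
```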
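Descriptor shape (list)¶
Homogeneous lists use {"kind": "list", "nullable": ..., "inner": <descriptor>}, where inner is a nested descriptor for the element type.
Descriptor shape (Rust ↔ Python)¶
Scalars keep the existing flat form for compatibility: {"base": ..., "nullable": ...}, plus an optional "literals": [...] list for typing.Literal[...] columns.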
Nested models use kind: "struct" with ordered fields (each field has a name
and recursive dtype):
    {
      "kind": "struct",
      "nullable": false,
      "fields": [
        {"name": "street", "dtype": {"base": "str", "nullable": false}},
        {"name": "zip", "dtype": {"base": "int", "nullable": true}}
      ]
    }
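The descriptor shapes on this page are plain nested dictionaries, so they are easy to build and walk recursively. The sketch below assumes only the key names documented here; the scalar, list_of, struct, and render helpers are hypothetical and exist purely for illustration.

```python
def scalar(base, nullable=False, literals=None):
    """Flat scalar descriptor; 'literals' appears only for Literal columns."""
    desc = {"base": base, "nullable": nullable}
    if literals is not None:
        desc["literals"] = list(literals)
    return desc


def list_of(inner, nullable=False):
    """Homogeneous list descriptor wrapping an inner descriptor."""
    return {"kind": "list", "nullable": nullable, "inner": inner}


def struct(fields, nullable=False):
    """Struct descriptor from an ordered mapping of field name -> descriptor."""
    return {
        "kind": "struct",
        "nullable": nullable,
        "fields": [{"name": n, "dtype": d} for n, d in fields.items()],
    }


def render(desc):
    """Hypothetical pretty-printer: turn a descriptor into a type-like string."""
    if "base" in desc:
        text = desc["base"]
    elif desc["kind"] == "list":
        text = f"list[{render(desc['inner'])}]"
    else:
        inner = ", ".join(f"{f['name']}: {render(f['dtype'])}" for f in desc["fields"])
        text = f"struct{{{inner}}}"
    return f"{text} | None" if desc["nullable"] else text


address = struct({"street": scalar("str"), "zip": scalar("int", nullable=True)})
column = list_of(address, nullable=True)
```

Here address reproduces the struct descriptor shown above, and render(column) walks the list wrapper and then the struct fields.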
When the logical plan changes shape, pydantable rebuilds field types from Rust
descriptors. If a column name was already in the current schema and the new
descriptor matches that annotation (via
descriptor_matches_column_annotation in python/pydantable/schema.py), the
original Python type is kept—including your nested Schema classes. New or
renamed columns, or columns whose dtype changed (for example after with_columns
or fill_null), still use anonymous create_model types where no prior
annotation applies.
Expression typing (Rust)¶
The native core builds a typed expression tree for Expr (column references,
literals, arithmetic, comparisons, etc.). Invalid combinations fail when the
expression is built, not only at execution time. Scalar base types match the table
above (including temporal types); struct columns follow the conservative rules
described above.
Type-specific Expr methods (common operations)¶
Beyond generic arithmetic and comparisons, the following are supported (see
Expr in the Python API):
- Numeric: abs(), round(decimals=...), floor(), ceil() on int / float columns.
- String: strip(), upper(), lower(), str_reverse(), str_pad_start(len, fill_char=" "), str_pad_end(len, fill_char=" "), str_zfill(len), str_extract_regex(pattern, group_index=1) (Rust regex dialect; group 0 = whole match), str_json_path_match(path) (Polars JSONPath; result str, null on miss/invalid JSON per engine), str_json_decode(dtype) (Polars str.json_decode: JSON text → struct or dict[str, T]; null text → null; invalid JSON typically errors at collect() in Polars 0.53, see INTERFACE_CONTRACT; map JSON must be an array of {key,value} objects), str_replace(old, new, literal=True) (default: literal substring; literal=False uses Rust regex, not Python re), starts_with, ends_with, str_contains (literal substring), str_contains_pat(pattern, literal=False) (regex when literal=False), str_split(delimiter) → list[str], strip_prefix, strip_suffix, strip_chars, plus substr, char_length, concat.
- Boolean: &, |, ~ for combining boolean-typed expressions.
- Datetime / date / time: dt_year() … dt_day() on date or datetime; dt_weekday() (ISO weekday: Monday = 1 … Sunday = 7, same as Polars) and dt_quarter() (1–4) on date or datetime; dt_week() (ISO 8601 week number 1–53, same as Polars dt.week() and Python date.isocalendar().week); dt_dayofyear() (day of year 1–366; Polars dt.ordinal_day()) on date or datetime; dt_hour() … dt_nanosecond() on datetime or time; dt_date() on datetime (calendar date). strptime(format, to_datetime=...) parses str → date or datetime. unix_timestamp(unit=...) returns epoch int from date / datetime. from_unix_time(unit=...) on int / float columns yields UTC-naive datetime (inverse of unix_timestamp for typical values). datetime ± timedelta and date ± timedelta use typed binary ops (see Rust infer_arith_dtype).
- Homogeneous lists: list_len(), list_get(index) (int index; OOB → null), list_contains(value), list_min() / list_max() / list_sum() / list_mean() on list[int] or list[float] (list_mean result is float; empty list cells yield null); list_join(separator, ignore_nulls=False) for list[str] → str; list_sort(...) and list_unique(stable=False) for lists whose elements are sortable scalars (see below).
- Maps (dict[str, T]): map_len() (number of entries), map_get(key) (value or null), map_contains_key(key) (boolean), map_keys() (list of keys), map_values() (list of values), map_entries() (list of {key, value} structs); physical encoding is List(Struct{key, value}).
- Struct (nested model columns): struct_field(name); struct_json_encode() → str (JSON text per cell via Polars struct.json_encode); struct_json_path_match(path) (JSON-encode then str.json_path_match; same empty-path ValueError and null-on-miss semantics as str_json_path_match); struct_rename_fields([...]) (exactly one new name per existing subfield, unique names); struct_with_fields(...) (keyword args field=Expr to add or replace subfields; at least one update). Nullable outer struct cells propagate nullability to projected str / subfield dtypes like struct_field.
- Binary (bytes): binary_len() (per-row byte length).
- Cast: cast(T) supports the usual primitive conversions plus datetime → date / str and date → str, and str → date / datetime using Polars’ string parsing (ISO-8601-shaped strings; behavior follows Polars). For a fixed format, use strptime(format, ...) instead of cast.
Lazy Parquet vs model validation: A read_parquet directory/glob scan builds a Polars lazy schema (including allow_missing_columns behavior for files that omit columns—see IO_PARQUET). Pydantic validation of cell values still happens when you materialize into a DataFrameModel / DataFrame; align dtypes with cast / strptime in the plan if files disagree on representations.
Temporal part extraction and dt_date() on timezone-aware datetime values follow Polars’ interpretation of the stored dtype.
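The round trip between unix_timestamp and from_unix_time can be pictured with the standard library. The sketch assumes second resolution and assumes that a naive datetime is interpreted as UTC, which is what "UTC-naive" suggests but is not spelled out here; both functions are plain-Python stand-ins, not the Expr methods.

```python
from datetime import datetime, timezone


def unix_timestamp(dt):
    """Epoch seconds for a naive datetime, interpreting it as UTC (an assumption)."""
    return int(dt.replace(tzinfo=timezone.utc).timestamp())


def from_unix_time(seconds):
    """UTC-naive datetime for an epoch value in seconds."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc).replace(tzinfo=None)


moment = datetime(2024, 1, 1, 12, 0, 0)
epoch = unix_timestamp(moment)
round_trip = from_unix_time(epoch)
```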
Semantics: string predicates, regex, and str_split¶
Boolean predicates (starts_with, ends_with, str_contains, str_contains_pat) return a boolean column. Null string cells produce null in the output (SQL-style three-valued logic).
- str_contains(substring) is always a literal substring search (metacharacters such as . are not special).
- str_contains_pat(pattern, *, literal=False) uses the Rust regex crate when literal=False. This is not Python’s re module: escaping and feature flags differ. Use literal=True for a literal substring match with the same API.
- Empty pattern with literal=False is rejected when the expression is built (ValueError). Invalid regex syntax may surface as null per row at execution time (Polars), not necessarily as a raised error; prefer validating patterns in application code when you need strict failures.
- str_replace(..., literal=False) applies the same Rust-regex match semantics for the search pattern; replacement follows Polars replace_all behavior.
str_split(delimiter) returns list[str] per row. The delimiter is a literal string (not regex). An empty delimiter follows Polars split rules (typically per-character splits for non-empty strings; an empty input string often yields an empty list). Null inputs remain null.
Semantics: dt_weekday, dt_quarter, dt_week, dt_dayofyear, and list_mean¶
- dt_weekday() / dt_quarter() / dt_week() / dt_dayofyear() are allowed only on date or datetime (not on time); other dtypes raise TypeError at expression build time. Weekday matches ISO ordering as in Polars: Monday = 1 … Sunday = 7 (aligned with datetime.isoweekday()). Quarter is 1–4 from the calendar month.
- dt_week() is the ISO 8601 week number 1–53 (same as date.isocalendar().week and Polars dt.week()); on datetime, the week is taken from the calendar date of the stored timestamp (wall time, engine timezone rules apply for aware dtypes).
- list_mean() requires list[int] or list[float]; other element types raise TypeError. The result is always float. Empty lists and null list cells yield null in the output column.
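Because the calendar parts are defined to match Python's own ISO helpers, expected values can be checked with the datetime module. In the sketch below, temporal_parts is a hypothetical helper that computes the stdlib equivalents for a single date; it needs Python 3.9+ for isocalendar().week.

```python
from datetime import date


def temporal_parts(d):
    """Stdlib equivalents of dt_weekday / dt_quarter / dt_week / dt_dayofyear."""
    return {
        "weekday": d.isoweekday(),           # Monday = 1 ... Sunday = 7
        "quarter": (d.month - 1) // 3 + 1,   # 1-4 from the calendar month
        "week": d.isocalendar().week,        # ISO 8601 week number, 1-53
        "dayofyear": d.timetuple().tm_yday,  # 1-366
    }


new_year = temporal_parts(date(2024, 1, 1))    # a Monday in ISO week 1
year_edge = temporal_parts(date(2021, 1, 3))   # a Sunday still in ISO week 53 of 2020
leap_end = temporal_parts(date(2024, 12, 31))  # day 366, already ISO week 1 of 2025
```

The two edge dates show why the ISO week number should not be combined with the calendar year without care: the week can belong to the previous or the following ISO year.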
Semantics: list_join, list_sort, list_unique¶
- list_join: only list[str]; otherwise TypeError. An empty list cell yields an empty string. ignore_nulls is passed through to Polars list.join (null elements inside a list may be skipped when True). Multi-codepoint separator strings are allowed (UTF-8).
- list_sort: list elements must be int, float, bool, str, date, datetime, time, enum, or uuid (scalar list cells). descending, nulls_last, and maintain_order map to Polars SortOptions for list.sort.
- list_unique: same element-type restriction as list_sort. stable=True uses Polars unique_stable (keep first-seen order among duplicates).
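The per-cell behaviour of a few list helpers, together with the list_mean rule from the previous section, can be modelled on single Python lists. These functions are illustrative stand-ins, not pydantable or Polars code, and they assume cells contain no null elements.

```python
def list_mean(cell):
    """float mean of a numeric list; None for null or empty cells."""
    if not cell:
        return None
    return float(sum(cell)) / len(cell)


def list_join(cell, separator):
    """Join a list of strings; an empty list yields an empty string."""
    return separator.join(cell)


def list_unique_stable(cell):
    """Drop duplicates, keeping first-seen order (the stable=True behaviour)."""
    return list(dict.fromkeys(cell))


mean_value = list_mean([1, 2])
joined = list_join(["a", "b"], ", ")
unique = list_unique_stable([3, 1, 3, 2, 1])
```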
Semantics: str_reverse, str_pad, str_zfill¶
- str_reverse(): Polars str.reverse (UTF-8 string reversal). Combining characters and other Unicode edge cases follow Polars rules, not necessarily naive per-codepoint reversal. Null string cells stay null.
- str_pad_start / str_pad_end: fill_char must be a single non-empty Unicode scalar (one user-perceived character). Empty fill_char or more than one character raises ValueError when the expression is built. Null inputs stay null.
- str_zfill: same null propagation as other string ops; numeric-string padding follows Polars str.zfill.
Semantics: str_extract_regex, str_json_path_match, and str_json_decode¶
- str_extract_regex: empty pattern raises ValueError when the expression is built. Uses Polars str.extract (Rust regex). group_index 0 is the full match; 1+ are capture groups. Out-of-range group_index or non-matching rows typically yield null at execution (no error per row).
- str_json_path_match: path uses Polars JSONPath ($... style). Empty path raises ValueError. Each cell must hold JSON text; malformed JSON or no match typically yields null (engine-dependent). The output dtype is str (matched fragment as string, e.g. JSON scalars without quotes per Polars). Use a str column whose values are JSON documents (or JSON embedded in a string literal), not a separate JSON dtype.
- str_json_decode: supply a nested model type or dict[str, T] as dtype (same style as cast). Prefer pairing with struct_json_encode for round trips on struct columns. Map values in JSON must use Polars’ physical encoding: a JSON array of objects with key / value fields (not a single JSON object with arbitrary keys). Invalid JSON handling differs from str_json_path_match; see INTERFACE_CONTRACT.
- struct_json_path_match: same JSONPath dialect and typical null behavior, applied after encoding each struct cell as JSON text (see struct_json_encode). Prefer this over nesting struct_json_encode() and str_json_path_match(...) in user code when both are on the same struct column.
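The entries encoding required for map JSON is easy to get wrong, since a plain JSON object is the more natural serialization of a dict. A minimal sketch of converting between the two follows; map_to_entries_json and entries_json_to_map are hypothetical helpers for preparing or inspecting such text, not pydantable functions.

```python
import json


def map_to_entries_json(mapping):
    """Encode a dict as a JSON array of {"key", "value"} objects."""
    return json.dumps([{"key": k, "value": v} for k, v in mapping.items()])


def entries_json_to_map(text):
    """Decode the entries encoding back into a dict; null text stays null."""
    if text is None:
        return None
    return {entry["key"]: entry["value"] for entry in json.loads(text)}


payload = map_to_entries_json({"a": 1, "b": 2})
decoded = entries_json_to_map(payload)
```

A string such as {"a": 1, "b": 2} would therefore need converting to the array form before it is decoded as a dict[str, T] column.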
Not supported as schema column types¶
These are out of scope for the current schema system:
- dict types with non-string keys (dict[int, ...], etc.)
- Arbitrary objects as per-cell values (except nested BaseModel columns, homogeneous list[T], and dict[str, T] maps as documented above)
When unsupported field types fail¶
- DataFrameModel subclasses: each field annotation is validated when the class is defined (in __init_subclass__). Unsupported types (for example bare list without an inner type, dict[int, str] (non-string keys), int | str, or typing.Any) raise TypeError immediately, before RowModel is generated. The message lists supported dtypes and points to this page.
- DataFrame[Schema] with a hand-written Schema subclass: there is no class-time check on the Schema model (unlike DataFrameModel). Unsupported annotations surface when you first construct DataFrame[YourSchema](...) (native plan build from schema_fields()), or from Pydantic during validation. Nested BaseModel fields are supported when annotations match what the Rust dtype layer accepts (nested scalars or further nested models).
Future / planned types (roadmap direction)¶
The following are not implemented today; follow project priorities in ROADMAP.md.
| Planned category | Examples | Notes |
|---|---|---|
| Richer geospatial | e.g. GeoJSON column type, CRS metadata | WKB covers opaque binary geometry today; heavier GIS scope is deferred. |
Already shipped (scalar columns): primitives in the table above, including
typing.Literal[...] (homogeneous str/int/bool), ipaddress.IPv4Address /
IPv6Address, pydantable.types.WKB, uuid.UUID, decimal.Decimal, concrete
enum.Enum, datetime, date, time, timedelta, bytes, plus
homogeneous dict[str, T] map columns, homogeneous list[T] columns, explode(),
list Expr helpers, and unnest() on struct columns (see ROADMAP.md).
Runtime column payloads (Python)¶
For default construction, columns are typically dict[str, list] with one
Python value per row per column; struct columns use a list of dicts (or
compatible row objects). Lists may be plain list, tuple, or
numpy.ndarray (see schema.validate_columns_strict).
0.16.0 — PyArrow Table / RecordBatch: When pyarrow is installed, you may pass a pa.Table or RecordBatch as DataFrame / DataFrameModel input. It is converted to dict[str, list] via to_pylist() per column (copies), then the usual validation runs. materialize_parquet / materialize_ipc (from pydantable import …) produce the same column shape for file/bytes sources. DataFrame.to_arrow() goes the other way: execute the plan as for to_dict(), then pyarrow.Table.from_pydict. Supported cell types are those that already round-trip through list materialization (scalars, JSON-friendly nested shapes per engine limits); exotic Arrow extension types may not map cleanly—validate with trusted_mode='off' when in doubt.
With trusted_mode="shape_only" or "strict", trusted bulk paths may pass NumPy, PyArrow,
or a Polars DataFrame as documented in EXECUTION and PERFORMANCE;
scalar dtypes must still match the schema. trusted_mode on DataFrame /
DataFrameModel selects shape_only vs strict checks (0.11.0; 0.12.0 extends strict to nested list / dict / struct shapes on
Polars and columnar Python paths; 0.13.0 adds strict dtype checks for PyArrow
Array / ChunkedArray columns and accepts concrete Arrow array classes as trusted
buffers). See schema.validate_columns_strict for the low-level API (validate_elements remains a bridge for direct callers).
See DATAFRAMEMODEL (“Trusted ingest”). The legacy validate_data constructor argument was removed in 0.15.0.
0.14.0 — shape_only dtype drift: when trusted_mode="shape_only", pydantable
may emit pydantable.DtypeDriftWarning if a column would be rejected under
strict (e.g. string cells for an int field). Set environment variable
PYDANTABLE_SUPPRESS_SHAPE_ONLY_DRIFT_WARNINGS=1 to silence these warnings in
noisy pipelines.
See also¶
- DATAFRAMEMODEL: DataFrameModel and row vs column inputs
- INTERFACE_CONTRACT: null semantics, joins, reshape constraints
- pydantable-core/src/dtype.rs: mapping from Python annotations to internal dtypes