PySpark API parity (pydantable.pyspark)¶

This matrix compares Apache Spark pyspark.sql concepts to pydantable’s facade. It is not a guarantee of behavioral identity with Spark.

For how to import and use the PySpark-style DataFrame and sql package, see PYSPARK_UI.

Spark API area	PydanTable status	Notes
`SparkSession`, `spark.sql(...)`	Out of scope	No distributed engine or SQL parser in pydantable.
`DataFrame.select`, `filter`, `where`	Supported	Typed `Expr`; `where` mirrors Spark.
`DataFrame.withColumn`	Supported
`DataFrame.join`	Supported	Suffix/collision rules per `INTERFACE_CONTRACT.md`.
`DataFrame.join(..., how=\"left_semi\"\|\"left_anti\")`	Supported (1.9.0+)	Spark-ish `left_semi`/`left_anti` map to core `semi`/`anti` joins; output is left-only columns.
`DataFrame.join(..., how=\"right_semi\"\|\"right_anti\")`	Supported (1.9.0+)	Spark-ish `right_*` aliases implemented by swapping join sides; output is right-only columns.
`DataFrame.join(left_on=..., right_on=...)`	Supported (1.9.0+)	Join on differently named keys; list/tuple and `ColumnRef` keys supported in the PySpark facade.
`DataFrame.join(validate=\"1:1\"\|\"1:m\"\|\"m:1\"\|\"m:m\")`	Supported (1.9.0+)	Shorthands forwarded to core join validation.
`DataFrame.groupBy` / `.agg`	Supported (1.9.0+)	CamelCase `groupBy` returns a PySpark grouped wrapper; tuple `agg` specs (not Spark `agg(expr)` only).
`GroupedDataFrame.agg({col: op(s)})`	Supported (1.9.0+)	Dict-form `agg`: `{\"v\":\"sum\"}` → `v_sum`; multi-op lists supported; common Spark op synonyms mapped (e.g. `avg` → `mean`).
`GroupedDataFrame.pivot(...).agg(...)`	Supported (1.9.0+)	Spark-style `groupBy(...).pivot(...).agg(...)` lowers to core `group_by(...).agg(...)` + `pivot(...)` (typed, in-process).
`GroupedDataFrame.pivot(...).agg({col: op(s)})`	Supported (1.9.0+)	Dict-form pivot `agg` with Spark-ish naming: `<pivot_value>_<col_op>` (e.g. `x_v_sum`).
`GroupedDataFrame.pivot(...).count/sum/avg/min/max`	Supported (1.9.0+)	Convenience wrappers over `pivot(...).agg(...)`; `count()` counts rows per group+pivot cell.
`GroupedData.count()` (no args)	Supported (1.9.0+)	Per-group row count (core `len` / synthetic sum).
`DataFrame.orderBy` / `sort`	Supported	Column names + ascending flags; global sort only (not `sortWithinPartitions`).
`DataFrame.crossJoin`	Supported (1.9.0+)	`join(how="cross")`.
`DataFrame.count()` (action)	Supported (1.9.0+)	Returns `int` via `global_row_count()` in the plan.
`DataFrame.unionByName`	Supported (1.9.0+)	Name-aligned concat; optional `allowMissingColumns` null-fill.
`DataFrame.intersect` / `subtract` / `except`	Partial (1.9.0+)	Distinct-set semantics: `intersect` ≈ inner join on all columns + `distinct`; `subtract`/`except` ≈ anti join on all columns + `distinct`. (`except` is a runtime alias of `except_` in Python.)
`DataFrame.exceptAll` / `intersectAll`	Supported (1.9.0+)	Multiset semantics: `exceptAll` yields `max(count_left-count_right,0)`; `intersectAll` yields `min(count_left,count_right)`.
`DataFrame.fillna` / `dropna` / `na`	Supported (1.9.0+)	Map to `fill_null` / `drop_nulls`; unsupported kw combinations raise clearly.
`DataFrame.printSchema` / `explain`	Supported (1.9.0+)	Readable schema tree; printed logical plan.
`DataFrame.toPandas`	Supported (1.9.0+)	Eager via `to_dict()`; requires pandas.
`DataFrame.limit`	Supported
`DataFrame.show`	Supported (0.20.0)	Prints a bounded text preview (`head`-like); not distributed Spark.
`DataFrame.summary`	Partial (0.20.0)	Returns the same string as core `describe()` (int/float/bool/str/date/datetime summaries; one string, not a stats `DataFrame`)—not Spark’s full `summary` column set.
`DataFrame.drop`	Supported	Drop by column name(s).
`DataFrame.distinct`	Supported	All-column distinct rows; optional `subset=` matches core `distinct`.
`DataFrame.withColumnRenamed`	Supported	Single rename per call (or use `rename` with dict).
`DataFrame.union` / `unionAll`	Supported	Vertical concat via core `concat(..., how="vertical")`.
`DataFrame.explode` / `explode_outer`	Supported	List columns only; implemented as a frame reshape (Polars `explode`). `outer=True` / `explode_outer` uses Spark-ish null/empty handling via Polars `ExplodeOptions` (`empty_as_null: true`, `keep_nulls: true`).
`DataFrame.posexplode` / `posexplode_outer`	Supported	One list column at a time; adds a 0-based position column plus the element column (name configurable).
`DataFrame.unnest` / `unnest_all`	Supported	Struct flattening to top-level fields (often what Spark users mean by “struct explode”).
`DataFrame.explode_all`	Supported	Schema-driven: explode every list-typed column.
`functions.explode`	Raises `TypeError`	pydantable has no `select(explode(col))` generator expressions; use `DataFrame.explode` / `posexplode`.
`functions.lit`, `functions.col`	Supported	`col` requires `dtype=` or use `df.col()`.
`functions.isnull`, `isnotnull`, `coalesce`	Supported	Via Rust `ExprNode`.
`functions.when` / `otherwise`	Supported	`CaseWhen` in Rust; chain `.when(...).otherwise(...)`.
`functions.cast`, `between`, `isin`, `concat`, `substring`, `length`	Supported	Base types only; `substring` is 1-based (Spark-style).
`functions.str_replace`, `regexp_replace`, `strip_prefix`, `strip_suffix`, `strip_chars`, `strptime`, `binary_len`, `list_len`, `list_get`, `list_contains`, `list_min`, `list_max`, `list_sum`	Supported (0.17.0)	Thin wrappers over core `Expr` methods (same Rust lowering). `regexp_replace` is an alias for literal substring replace, not full regex.
`functions.rlike` / `regexp_like` / `regexp_substr`	Supported (1.9.0+)	Regex predicates and substring extract via Rust `regex` dialect; requires Polars-backed execution for regex.
`functions.year` … `unix_timestamp`, `dayofyear`, `from_unixtime`, …	Supported	See Phase B; epoch conversions are UTC naive; ISO week / weekday where noted.
`functions.sum`, `avg`, `count`, `min`, `max`, … as column exprs	Supported (global)	Global `sum`/`avg`/`mean`/`count`/`min`/`max` on a typed `Expr` in `DataFrame.select(...)` (single-row). `count()` with no argument → row count (0.8.0). Grouped paths use `group_by().agg`.
`Column.cast`, `isin`, `between`, `substr`/`char_length`	Supported	On `Expr` / `Column`; includes `str` → `date` / `datetime` via Polars parsing (use `strptime` for fixed formats).
`Window`, window functions	Partial	`Window.partitionBy().orderBy(..., nulls_last=...)` (NULLS FIRST/LAST); `row_number`, `rank`, `dense_rank`, `window_sum`, `window_avg`, `window_min`, `window_max`, `lag`, `lead` + core `Expr` lowering. Framing support includes `rowsBetween` for all named window ops and `rangeBetween` for numeric/temporal aggregates: first `orderBy` column must be numeric, `date`, `datetime`, or `duration`; additional `orderBy` columns are sort tie-breakers only (`WINDOW_SQL_SEMANTICS.md`). Unframed multi-key windows: only the first key’s `nulls_last` is passed to Polars `SortOptions`.
`functions.map_len`, `map_get`, `map_contains_key`, `map_keys`, `map_values`, `map_entries`, `map_from_entries`, `element_at`	Supported	Per-row map cardinality, lookup, membership, key/value lists, entry structs, and entry-to-map reconstruction on `dict[str, T]` columns; `element_at` is a map lookup alias.
`types` (Array, Map, nested Struct, Decimal, Timestamp)	Partial	Engine supports nested structs/lists, `Decimal`, `datetime`/`date`, homogeneous `dict[str, T]` maps, `bytes`, and `time`; PySpark `types` mirrors annotations for docs/schema views.
`Row`, encoders, streaming	Out of scope

For execution, the PySpark UI uses the same Rust/Polars path as the default export.

0.18.0: The parity matrix above is unchanged—no new sql.functions wrappers this release.

0.19.0: Matrix unchanged—documentation and 0.x versioning policy only; see ROADMAP.md Shipped in 0.19.0.

0.20.0: DataFrame.show() / summary() rows above; core discovery helpers are shared with the default DataFrame. See ROADMAP.md Shipped in 0.20.0.

1.9.0: PySpark-shaped groupBy, row-count count(), crossJoin, unionByName, join-layer set ops, fillna / dropna / .na, printSchema, explain, toPandas, and matching DataFrameModel methods — see table rows marked 1.9.0+ above. Automated tests: tests/test_pyspark_dataframe_coverage.py, tests/test_pyspark_interface_surface.py

Phase B status (expression surface)¶

Delivered in-tree: IsNull, IsNotNull, Coalesce, CaseWhen (when / otherwise), Cast, InList, Between, StringConcat, Substring, StringLength — Rust ExprNode, Polars lowering, and pydantable.pyspark.sql.functions / Expr methods.

Also delivered — date/datetime (functions + Expr): year, month, day / dayofmonth, dayofweek, quarter, weekofyear, dayofyear, hour, minute, second, nanosecond, to_date (optional format= for strings), strptime, unix_timestamp (and from_unixtime for numeric epoch → datetime, UTC naive). Week semantics: weekofyear / core dt_week use ISO 8601 week number (Polars dt.week()); dayofweek is ISO weekday (Monday = 1 … Sunday = 7). Compare Spark’s weekofyear / dayofweek definitions if you need exact JVM parity.

String / numeric: lower, upper, trim (core Expr.strip); abs, round, floor, ceil.

Global row count: pydantable.expressions.global_row_count() or functions.count() with no argument.

Deferred: current_date / current_timestamp as lazy plan literals (no clock node today). Spark’s optional from_unixtime format string is not modeled — use parsing helpers on string columns instead.

Phase D — Aggregates as `functions.sum(Column)`¶

Global aggregates: functions.sum(F.col("x", dtype=int)) / avg / mean build ExprNode::GlobalAgg nodes. Use them in DataFrame.select(...) (positional or keyword) to get a single-row frame, matching Spark’s select(F.sum(...)) without a groupBy. Grouped aggregations remain group_by(...).agg(...).

Row count without a column: global_row_count() or functions.count() with no argument in global select.

Phase E — Windows¶

Delivered: Rust ExprNode::Window with Polars .over(...) lowering; Python row_number(), rank(), dense_rank(), window_sum(), window_mean(), lag(), lead() finished with Window.partitionBy(...).orderBy(...) / .spec() (see pydantable.window_spec and pydantable.pyspark.sql.window). row_number requires order_by in the window spec (Spark-style ordering). lag / lead require order_by. Framed status: - rowsBetween: supported for row_number, rank, dense_rank, window_sum, window_avg, window_min, window_max, lag, and lead. - rangeBetween: supported for window_sum, window_avg, window_min, and window_max with multi-column orderBy: range offsets use the first key only; see WINDOW_SQL_SEMANTICS.md. - Unsupported framed combinations raise typed errors.

Phases F–G — Nested types and real Spark¶

F: ArrayType / MapType / nested structs imply a v2 columnar schema contract in Rust.
G: A PySparkBackend that translates logical plans to pyspark.sql.DataFrame is a separate product track from façade completeness.

PySpark API parity (pydantable.pyspark)¶

Phase B status (expression surface)¶

Phase D — Aggregates as functions.sum(Column)¶

Phase E — Windows¶

Phases F–G — Nested types and real Spark¶

Phase D — Aggregates as `functions.sum(Column)`¶