PySpark UI (pydantable.pyspark)¶
The PySpark UI is an optional import surface that adds Apache Spark–style names (withColumn, where, orderBy, …) and a small pyspark.sql-like submodule on top of pydantable’s typed logical DataFrame. It is not the Apache Spark DataFrame; there is no JVM, no SparkSession, and no runtime dependency on the pyspark package.
Execution uses pydantable’s Rust/Polars core (see Execution).
Install: base package vs pydantable[spark]¶
import pydantable.pyspark(this façade —DataFrame,DataFrameModel,pydantable.pyspark.sql, …) works with only the corepip install pydantabledependencies. It does not require PySpark, SparkDantic, or raikou-core at import time.pydantable.pyspark.sparkdantic(JVM schema helpers from SparkDantic) is loaded lazily and needspip install "pydantable[spark]"(pulls insparkdantic,pyspark,raikou-core, …). Accessingpydantable.pyspark.sparkdanticwithout those packages installed raisesModuleNotFoundErrorfor the missing optional dependency.- Real PySpark execution (
SparkDataFrame,SparkDataFrameModelover apyspark.sql.DataFrame) is documented in Spark engine and requires the same[spark]stack plus a JVM.
Release context (1.8.0 vs 1.9.0)¶
- 1.8.0 focused on core ergonomics (selectors, joins,
drop_nulls, reshape parity, etc.)—the same engine every import style uses; see the CHANGELOG 1.8.0 section. - 1.9.0 adds the Spark-shaped
DataFrame/DataFrameModelsurface in this document:groupBy, framecount(),crossJoin,unionByName, set-style helpers,fillna/dropna/.na,printSchema,explain,toPandas,F.dayofyear/F.from_unixtime, and related typing/stubs. Coredescribe()(and PySparksummary()) includedate/datetimesummary lines when those columns exist. Behavior and limitations are summarized in PySpark parity and Interface contract.
Tests¶
CI and local runs exercise the PySpark UI via:
tests/test_pyspark_dataframe_coverage.py— method coverage, error contracts,DataFrame/DataFrameModel, grouped handles,unionByName, set ops, NA helpers,explain/printSchema.tests/test_pyspark_interface_surface.py— end-to-end pipelines (joins,groupBy().agg, melt/pivot, windows, temporal filters).
When adding Spark-named wrappers, extend those files (or add focused tests next to them) so regressions are caught on all platforms.
Semantic differences vs Apache Spark¶
- No cluster: all methods lower to the in-process Rust/Polars plan;
count()is a logical row count, not a distributed action across executors. subtract: implemented as an anti join on all columns (distinct-set semantics), not Spark multisetEXCEPT ALL. UseexceptAllfor multiset semantics.sort/orderBy: global sort only; there is nosortWithinPartitions.summary(): same string as coredescribe()(int/float/bool/str/date/datetime lines where those columns exist), not Spark’s full multi-columnsummaryDataFrameunless/until a future release adds a table-shaped stats path.
When to use it¶
- You want Spark-flavored method names and
pydantable.pyspark.sql.functions as Fwhile keeping schema-safeExprandDataFrameModel. - You are migrating or comparing code mentally against PySpark and need familiar entry points.
from pydantable.pyspark import DataFrame, DataFrameModel
from pydantable.pyspark.sql import functions as F
class User(DataFrameModel):
id: int
name: str
df = User({"id": [1], "name": ["Ada"]})
out = df.withColumn("greeting", F.concat(F.col("name", dtype=str), F.lit("!")))
print(out.to_dict())
Output (one run):
Imports¶
| Symbol | Role |
|---|---|
DataFrame |
Core logical DataFrame + Spark-like methods (same Rust engine as default). |
DataFrameModel |
Pydantic model; inner frame is the PySpark UI DataFrame. |
Expr, Schema |
Re-exported from pydantable core. |
sql |
Package with functions, types, Column, etc. |
Implementation for the DataFrame wrappers is in python/pydantable/pyspark/dataframe.py.
DataFrame (PySpark UI)¶
Core operations (collect, join, group_by, typed filter, …) behave like the default pydantable DataFrame. Spark-named aliases:
| Method | Maps to |
|---|---|
withColumn(name, col) |
with_columns(**{name: col}) |
where(condition) / filter(condition) |
filter(condition) |
select(*cols) |
Core select |
groupBy(...) / group_by(...) |
Core group_by; returns PySparkGroupedDataFrame so .agg() stays Spark-flavored (1.9.0+). |
groupBy(...).pivot(pivot_col, values=[...]).agg(...) |
Group at (keys + pivot_col) then core pivot to wide columns (1.9.0+). |
groupBy(...).agg({\"v\": \"sum\", \"w\": \"max\"}) |
Dict-form agg (Spark-ish) auto-names outputs as v_sum, w_max; multi-op per column supported via lists (1.9.0+). |
groupBy(...).pivot(...).agg({\"v\": [\"sum\", \"max\"]}) |
Dict-form pivot agg auto-names per pivot value: e.g. x_v_sum, x_v_max (1.9.0+). |
groupBy(...).pivot(...).count() / .sum(...) / .avg(...) / .min(...) / .max(...) |
Convenience wrappers over grouped-pivot .agg(...) (Spark-shaped), including count() as rows per (keys, pivot_value) cell (1.9.0+). |
orderBy(*columns, ascending=...) / sort(...) |
Core sort / order_by (global sort only; not Spark sortWithinPartitions) |
crossJoin(other) |
join(other, how="cross") (1.9.0+) |
join(other, on=[...], how="left_semi"|"left_anti") |
Core join(how="semi"|"anti") with Spark-ish names; output is left-only columns (1.9.0+). |
join(other, on=[...], how="right_semi"|"right_anti") |
Spark-ish aliases implemented by swapping join sides; output is right-only columns (1.9.0+). |
join(..., on=[...]) key handling |
Spark-ish USING behavior: duplicate right join keys are dropped by default; set keepRightJoinKeys=True to opt out (1.9.0+). |
join(left_on=[...], right_on=[...]) |
Join on differently named keys (Spark-style). Keys accept str or ColumnRef and list/tuple forms (1.9.0+). |
join(validate="1:1"|"1:m"|"m:1"|"m:m") |
Join cardinality validation shorthands (forwarded to core join) (1.9.0+). |
count() |
Row count as int via global_row_count() in the plan (1.9.0+); distinct from grouped GroupedDataFrame.count(*cols) |
unionByName(other, allowMissingColumns=False) |
Reorder other by name, then vertical concat; optional null-padding for missing columns (1.9.0+) |
intersect / subtract / except |
Typed join + distinct / anti-join (1.9.0+). Note: except is exposed at runtime as df.except(...) but is implemented as except_ in Python/stubs since except is a keyword. |
intersectAll / exceptAll |
Multiset set ops (1.9.0+); exceptAll keeps duplicates per max(left-right,0) counts, intersectAll keeps min(left,right) counts. |
fillna / dropna / na.drop / na.fill |
fill_null / drop_nulls with Spark-shaped kwargs (1.9.0+) |
printSchema() |
Text tree from df.schema (1.9.0+) |
explain(...) |
Prints core logical plan string (1.9.0+) |
toPandas() |
to_dict() → pandas.DataFrame (eager; requires pandas) (1.9.0+) |
limit(num) |
limit(num) |
sample(withReplacement=False, fraction=..., seed=None) |
Core sample(fraction=..., seed=..., with_replacement=...) (eager; materializes via to_dict()) |
explode(column(s), outer=False) |
Core list reshape (Polars explode); outer=True maps to Spark-ish explode_outer null/empty handling (see parity doc). |
explode_outer(...) |
Same as explode(..., outer=True). |
explode_all() |
Explode every list-typed column (schema-driven). |
posexplode(column, pos='pos', value=None, outer=False) |
One list column → synchronized position (0-based) + element columns; value defaults to the list column name. |
posexplode_outer(...) |
posexplode(..., outer=True). |
unnest(column(s)) / unnest_all() |
Core struct flattening (Spark users often want this for nested structs, not explode). |
drop(*cols) |
drop(*cols) |
distinct() |
All-column distinct rows |
withColumnRenamed(existing, new) |
with_column_renamed |
dropDuplicates(subset=None) |
Core distinct(keep="first") (engine-dependent “first” unless ordered); subset= maps to distinct(subset=..., keep="first") |
union / unionAll |
Core vertical concat (same schema required) |
show(n=20, truncate=True, vertical=False) |
0.20.0+ — prints a bounded text table (head-like sample). |
summary() |
0.20.0+ — see summary vs Spark below. |
summary vs Apache Spark¶
Spark’s DataFrame.summary() returns another DataFrame of statistics (and accepts optional stat names). In pydantable, PySpark summary() is deliberately a thin alias of core describe(): it returns a single multi-line string, not a table-shaped frame.
What describe() / summary() include (after one to_dict() materialization): int and float (count, mean, std, min, max; skew/kurtosis/sem when numpy is available and count ≥ 4), bool (true/false/null counts), str (count, n_unique, min/max length), date / datetime (non-null count, min, max, nulls). Other column types are omitted from the report.
For distributional stats Spark’s summary column set, use Polars or pandas on a materialized export, or build explicit select aggregates.
Naming map (core ↔ pandas ↔ PySpark)¶
See PANDAS_UI Naming map for with_columns / assign / withColumn, filter, joins, and sorts—same Rust engine for all three import styles.
Schema and columns¶
columns— list of logical column names (same idea as Spark’sdf.columns).schema— a lightweightStructTypebuilt from pydantable annotations viaannotation_to_data_type(seepyspark/sql/types.py). These types are facade objects, not JVM SparkDataTypeinstances.
__getitem__¶
df["col"]→Exprfor that column.df[["a", "b"]]→select("a", "b").
DataFrameModel (PySpark UI)¶
Delegates Spark-like methods to the inner PySpark UI DataFrame and re-wraps as the same model class. schema and columns follow the inner frame. 1.9.0+ adds the same groupBy, sort, crossJoin, count(), unionByName, set-style helpers, fillna / dropna / .na, printSchema, explain, and toPandas surface as on DataFrame. explode, posexplode, unnest, and explode_all / unnest_all are documented on both DataFrame and DataFrameModel.
List explosion vs functions.explode¶
Apache Spark allows select(explode(col)) because explode is a generator expression. pydantable does not: there is no table-generating Expr for explosion in select. Importing pydantable.pyspark.sql.functions as F and calling F.explode(...) raises TypeError with directions to use DataFrame.explode(...) or DataFrame.posexplode(...) instead.
pydantable.pyspark.sql¶
Mirrors common import paths only—not binary or behavioral parity with Spark.
This block only checks that imports resolve; it has no printed output.
functions—lit, typedcol(..., dtype=...),isnull/isnotnull,coalesce,when/otherwise,cast,between,isin,concat,substring,length,explode(raisesTypeError; useDataFrame.explode/posexplode), date/time:year,month,day,dayofmonth,dayofweek,quarter,weekofyear,dayofyear,hour,minute,second,nanosecond,to_date,strptime,unix_timestamp,from_unixtime(see PySpark parity Phase B for semantics), globalsum/avg/mean/count/min/maxforselect, and window helpers.Column— type alias for pydantableExpr.types— simpleIntegerType,StringType,StructField,StructType, … for documentation andschemaviews.
Full coverage vs Spark is summarized in PySpark parity matrix.
What is intentionally out of scope¶
SparkSession,spark.sql("..."), streaming, catalogs.- SQL window frames (
rowsBetween/rangeBetween): partition + order viaWindow/WindowSpecare supported (seepydantable.pyspark.sql.window); frames execute on the Polars-backed core perINTERFACE_CONTRACT.md(includingrangeBetweenmulti-key rules inWINDOW_SQL_SEMANTICS.md). - Untyped
F.col("x")withoutdtype=(pydantable requires static types at build time). - Interop with a real
pyspark.sql.DataFrameunless a dedicated integration is added later.