Replies: 1 comment
Thanks @nssalian for the write-up! Let me reply to some of the points. From the start, it wasn't the goal to have an engine inside of PyIceberg, but this grew as an example and to bootstrap the project. The ideal situation would be for PyIceberg to be part of PyArrow and hidden behind the Dataset API, rather than the Table API. With PyIceberg, I don't think we want to build another layer on top of an execution engine, but rather an easy way to go through Iceberg metadata.

I don't think this should be a goal for PyIceberg itself, since it would make more sense for DuckDB to rely on the C++ library, similar to DataFusion, which has dropped Iceberg-Java in favor of Iceberg-Rust.

I think another good example is Daft, which collects the tasks and then does a distributed read. Daft is Rust-based as well and also looking into Iceberg-Rust, but there were still some features missing, IIRC.

If we want to be engine-agnostic, we shouldn't bind to PyArrow, even though it's the de facto standard for most Python-based data processing frameworks.

In Java this is all encapsulated in the FileIO, which opens a stream to the object store. The FileIO encapsulates all the credentials that come from the catalog or config, and should also take care of refreshing credentials.
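For readers less familiar with that design, the pattern can be sketched in a few lines of Python. Everything below is illustrative, not PyIceberg's or Iceberg-Java's actual API: a single FileIO-style object owns the catalog-provided credentials and is the only component that opens streams.

```python
from abc import ABC, abstractmethod
from typing import BinaryIO, Dict


class FileIO(ABC):
    """Illustrative stand-in for the FileIO concept: the single place that
    holds catalog-provided credentials and opens streams to storage."""

    def __init__(self, properties: Dict[str, str]) -> None:
        # Credentials and config from the catalog live here, hidden from
        # everything downstream; refreshing them would also happen here.
        self.properties = properties

    @abstractmethod
    def new_input(self, location: str) -> BinaryIO:
        """Open a readable stream for the given location."""


class LocalFileIO(FileIO):
    """Trivial implementation reading from the local filesystem; a real one
    would open object-store streams using self.properties."""

    def new_input(self, location: str) -> BinaryIO:
        return open(location, "rb")
```

The point of the pattern is that engines and readers consume opaque streams and never see the credentials themselves.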
I want to pull together some related threads and frame a bigger picture.
I've been working on the File Format API (#3100, PR #3119) to decouple format handling from the write path.
While tracing through both the read and write paths I noted that PyArrow is hardwired as the only execution engine, and that bottleneck shows up in multiple places:

- `ArrowScan.to_record_batches` with multithreaded workers
- `scan_iceberg()` already bypasses PyArrow for file reads

### Context
Right now every output method goes through `to_arrow()`. DuckDB has its own highly optimized C++ Parquet reader. DataFusion and Polars have Rust-native ones. None of them can use their own readers today; they all get a pre-materialized `pa.Table` from PyArrow.

### Currently, in the code
If you look at the read path, there's a clean split point: `plan_files()` returns `Iterable[FileScanTask]` — file paths, delete files, partition info — all engine-agnostic. The engine-specific part starts when `ArrowScan` takes those tasks and actually reads the Parquet files. Polars already does this using `scan_iceberg()`, which calls PyIceberg for the plan, then reads the files with its own Rust engine. The idea is to formalize that handoff.
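To make the handoff concrete, here is a minimal self-contained sketch. The `FileScanTask` fields and the `plan_files()` body below are simplified stand-ins for illustration, not PyIceberg's actual implementation:

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class FileScanTask:
    """Engine-agnostic unit of work, loosely modeled on the real
    FileScanTask: pure metadata, no I/O happens here."""
    data_file: str
    delete_files: List[str] = field(default_factory=list)


def plan_files() -> List[FileScanTask]:
    # Stand-in for DataScan.plan_files(); in PyIceberg this walks the
    # snapshot's manifests, here we just return canned tasks.
    return [
        FileScanTask("s3://bucket/data/a.parquet"),
        FileScanTask("s3://bucket/data/b.parquet"),
    ]


def engine_read(tasks: List[FileScanTask]) -> List[str]:
    # An external engine (Polars, DuckDB, ...) only needs the plan; the
    # actual Parquet reading would be done by its own native reader.
    return [task.data_file for task in tasks]
```

Everything up to `plan_files()` is pure metadata work; everything after it is the engine's business.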
### What this could look like
I propose something like an `IcebergEngine` ABC following the same pattern as `Catalog` (ABC + factory + registration):

- `PyArrowEngine` would wrap the existing `ArrowScan` with zero behavior change.
- `DataScan.to_arrow()` would delegate to the engine.
- Something like a table property `pyiceberg.engine` would select the engine (default `"pyarrow"`).
- The File Format APIs I'm building in #3100 (`FileFormatWriter`, `FileFormatReadBuilder`, `FileFormatModel`, `FileFormatFactory`) would stay internal to the PyArrow engine; DuckDB/DataFusion/Polars use their own readers.
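As a rough illustration of the ABC + factory + registration pattern borrowed from `Catalog`, the proposal could look like the sketch below. All names, signatures, and the registry mechanics here are hypothetical, not an existing PyIceberg API:

```python
from abc import ABC, abstractmethod
from typing import Any, Callable, Dict, List


class IcebergEngine(ABC):
    """Sketch of the proposed ABC; method names are placeholders."""

    @abstractmethod
    def scan(self, tasks: List[Any]) -> Any:
        """Execute the read for a list of file scan tasks."""


# Registry keyed by engine name, mirroring how Catalog implementations
# are registered and then looked up by a factory function.
_ENGINES: Dict[str, Callable[[], IcebergEngine]] = {}


def register_engine(name: str) -> Callable:
    def wrap(factory: Callable[[], IcebergEngine]) -> Callable[[], IcebergEngine]:
        _ENGINES[name] = factory
        return factory
    return wrap


def load_engine(properties: Dict[str, str]) -> IcebergEngine:
    # A table property selects the engine, defaulting to PyArrow so that
    # existing behavior is unchanged.
    name = properties.get("pyiceberg.engine", "pyarrow")
    return _ENGINES[name]()


@register_engine("pyarrow")
class PyArrowEngine(IcebergEngine):
    # Would wrap the existing ArrowScan; stubbed out for the sketch.
    def scan(self, tasks: List[Any]) -> str:
        return f"pyarrow scan of {len(tasks)} task(s)"
```

A DuckDB or DataFusion engine would then be one more `@register_engine(...)` class, with no changes to the scan-planning code.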
### How this relates to #3100
#3100 separates Parquet/ORC/Avro handling inside the PyArrow engine as part of the file format layer. The engine abstraction proposed here would be a layer above it: which engine runs the scan. The idea is `FileIO` for storage, `FileFormatModel` for format, `IcebergEngine` for execution.

### Rollout
I'd like to land #3100 first since it establishes the pattern. After that, the engine work could start with just the ABC plus a `PyArrowEngine` that wraps `ArrowScan`. Prototyping with the preferred next engine (DuckDB/DataFusion) would be next, then the rest.

### Questions I'd like input on
- Naming: `IcebergEngine` vs `ExecutionBackend` vs something else? Java uses module separation, so there's no precedent.
- Return type: a `pa.Table` directly, or a lazy `RecordBatchIterator` with a `to_arrow()` method?
- Credentials live in `FileIO.properties`. What's the cleanest way to propagate them to engines?

Happy to write up more detail on any of this or prototype a piece of it.
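On the lazy-return-type question, one option is a generator-backed handle that yields batches on demand but can still materialize everything eagerly. This sketch uses plain Python lists in place of Arrow record batches; the class name and methods are hypothetical:

```python
from typing import Callable, Iterator, List

Batch = List[int]  # toy stand-in for a pa.RecordBatch


class RecordBatchIterator:
    """Hypothetical lazy result type: batches are produced only when the
    consumer pulls them, but to_arrow() still materializes eagerly."""

    def __init__(self, batch_factories: List[Callable[[], Batch]]) -> None:
        self._factories = batch_factories

    def __iter__(self) -> Iterator[Batch]:
        for make_batch in self._factories:
            # Work happens here, one batch at a time, on demand.
            yield make_batch()

    def to_arrow(self) -> Batch:
        # Stand-in for pa.Table.from_batches(): concatenate all batches.
        rows: Batch = []
        for batch in self:
            rows.extend(batch)
        return rows
```

With a shape like this, `DataScan.to_arrow()` keeps its current eager contract while streaming consumers iterate without full materialization.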
Mainly looking for feedback on whether this direction makes sense and what the priorities should be.