Replies: 1 comment
Thanks @nssalian for the write-up! Let me reply to some of the points. From the start, it wasn't the goal to have an engine inside of PyIceberg, but this grew as an example and to bootstrap the project. The ideal situation would be for PyIceberg to be part of PyArrow and hidden behind the Dataset API, rather than the Table API. With PyIceberg, I don't think we want to build another layer on top of an execution engine, but rather an easy way to go through Iceberg metadata.

I don't think this should be a goal for PyIceberg itself, since it would make more sense for DuckDB to rely on the C++ library, similar to DataFusion, which has dropped Iceberg-Java in favor of Iceberg-Rust.

I think another good example is Daft, which collects the tasks and then does a distributed read. Daft is Rust-based as well and also looking into Iceberg-Rust, but there were still some features missing, IIRC.

If we want to be engine-agnostic, we shouldn't bind to PyArrow, even though it's the de facto standard for most Python-based data processing frameworks.

In Java this is all encapsulated in the FileIO, which opens a stream to the object store. The FileIO encapsulates all the credentials that come from the catalog or config, and should also take care of refreshing credentials.
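For readers less familiar with that design, the pattern can be sketched in a few lines of Python. Everything below is illustrative, not PyIceberg's or Iceberg-Java's actual API: a single FileIO-style object owns the catalog-provided credentials and is the only component that opens streams.

```python
from abc import ABC, abstractmethod
from typing import BinaryIO, Dict


class FileIO(ABC):
    """Illustrative stand-in for the FileIO concept: the single place that
    holds catalog-provided credentials and opens streams to storage."""

    def __init__(self, properties: Dict[str, str]) -> None:
        # Credentials and config from the catalog live here, hidden from
        # everything downstream; refreshing them would also happen here.
        self.properties = properties

    @abstractmethod
    def new_input(self, location: str) -> BinaryIO:
        """Open a readable stream for the given location."""


class LocalFileIO(FileIO):
    """Trivial implementation reading from the local filesystem; a real one
    would open object-store streams using self.properties."""

    def new_input(self, location: str) -> BinaryIO:
        return open(location, "rb")
```

The point of the pattern is that engines and readers consume opaque streams and never see the credentials themselves.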
I want to pull together some related threads and frame a bigger picture.
I've been working on the File Format API (#3100, PR #3119) to decouple format handling from the write path.
While tracing through both the read and write paths I noted that PyArrow is hardwired as the only execution engine, and that bottleneck shows up in multiple places:

- `ArrowScan.to_record_batches` with multithreaded workers
- `scan_iceberg()` already bypasses PyArrow for file reads

### Context
Right now every output method goes through `to_arrow()`. DuckDB has its own highly optimized C++ Parquet reader. DataFusion and Polars have Rust-native ones. None of them can use their own readers today; they all get a pre-materialized `pa.Table` from PyArrow.

### Currently, in the code
If you look at the read path, there's a clean split point: `plan_files()` returns `Iterable[FileScanTask]` — file paths, delete files, partition info — all engine-agnostic. The engine-specific part starts when `ArrowScan` takes those tasks and actually reads the Parquet files. Polars already does this using `scan_iceberg()`, which calls PyIceberg for the plan, then reads the files with its own Rust engine. The idea is to formalize that handoff.
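To make the handoff concrete, here is a minimal self-contained sketch. The `FileScanTask` fields and the `plan_files()` body below are simplified stand-ins for illustration, not PyIceberg's actual implementation:

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class FileScanTask:
    """Engine-agnostic unit of work, loosely modeled on the real
    FileScanTask: pure metadata, no I/O happens here."""
    data_file: str
    delete_files: List[str] = field(default_factory=list)


def plan_files() -> List[FileScanTask]:
    # Stand-in for DataScan.plan_files(); in PyIceberg this walks the
    # snapshot's manifests, here we just return canned tasks.
    return [
        FileScanTask("s3://bucket/data/a.parquet"),
        FileScanTask("s3://bucket/data/b.parquet"),
    ]


def engine_read(tasks: List[FileScanTask]) -> List[str]:
    # An external engine (Polars, DuckDB, ...) only needs the plan; the
    # actual Parquet reading would be done by its own native reader.
    return [task.data_file for task in tasks]
```

Everything up to `plan_files()` is pure metadata work; everything after it is the engine's business.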
### What this could look like
I propose something like an `IcebergEngine` ABC following the same pattern as `Catalog` (ABC + factory + registration):

- `PyArrowEngine` would wrap the existing `ArrowScan` with zero behavior change.
- `DataScan.to_arrow()` would delegate to the engine.
- Something like a table property `pyiceberg.engine` would select the engine (default `"pyarrow"`).
- The File Format APIs I'm building in #3100 (`FileFormatWriter`, `FileFormatReadBuilder`, `FileFormatModel`, `FileFormatFactory`) would stay internal to the PyArrow engine; DuckDB/DataFusion/Polars use their own readers.
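As a rough illustration of the ABC + factory + registration pattern borrowed from `Catalog`, the proposal could look like the sketch below. All names, signatures, and the registry mechanics here are hypothetical, not an existing PyIceberg API:

```python
from abc import ABC, abstractmethod
from typing import Any, Callable, Dict, List


class IcebergEngine(ABC):
    """Sketch of the proposed ABC; method names are placeholders."""

    @abstractmethod
    def scan(self, tasks: List[Any]) -> Any:
        """Execute the read for a list of file scan tasks."""


# Registry keyed by engine name, mirroring how Catalog implementations
# are registered and then looked up by a factory function.
_ENGINES: Dict[str, Callable[[], IcebergEngine]] = {}


def register_engine(name: str) -> Callable:
    def wrap(factory: Callable[[], IcebergEngine]) -> Callable[[], IcebergEngine]:
        _ENGINES[name] = factory
        return factory
    return wrap


def load_engine(properties: Dict[str, str]) -> IcebergEngine:
    # A table property selects the engine, defaulting to PyArrow so that
    # existing behavior is unchanged.
    name = properties.get("pyiceberg.engine", "pyarrow")
    return _ENGINES[name]()


@register_engine("pyarrow")
class PyArrowEngine(IcebergEngine):
    # Would wrap the existing ArrowScan; stubbed out for the sketch.
    def scan(self, tasks: List[Any]) -> str:
        return f"pyarrow scan of {len(tasks)} task(s)"
```

A DuckDB or DataFusion engine would then be one more `@register_engine(...)` class, with no changes to the scan-planning code.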
### How this relates to #3100
#3100 separates Parquet/ORC/Avro handling inside the PyArrow engine as part of the file format layer. The engine abstraction proposed here would be a layer above it: which engine runs the scan. The idea is `FileIO` for storage, `FileFormatModel` for format, `IcebergEngine` for execution.

### Rollout
I'd like to land #3100 first since it establishes the pattern. After that, the engine work could start with just the ABC plus a `PyArrowEngine` that wraps `ArrowScan`. Prototyping with the preferred next engine (DuckDB/DataFusion) would be next, then the rest.

### Questions I'd like input on
- Naming: `IcebergEngine` vs `ExecutionBackend` vs something else? Java uses module separation, so there's no precedent.
- Return type: a `pa.Table` directly, or a lazy `RecordBatchIterator` with a `to_arrow()` method?
- Credentials live in `FileIO.properties`. What's the cleanest way to propagate them to engines?

Happy to write up more detail on any of this or prototype a piece of it.
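On the lazy-return-type question, one option is a generator-backed handle that yields batches on demand but can still materialize everything eagerly. This sketch uses plain Python lists in place of Arrow record batches; the class name and methods are hypothetical:

```python
from typing import Callable, Iterator, List

Batch = List[int]  # toy stand-in for a pa.RecordBatch


class RecordBatchIterator:
    """Hypothetical lazy result type: batches are produced only when the
    consumer pulls them, but to_arrow() still materializes eagerly."""

    def __init__(self, batch_factories: List[Callable[[], Batch]]) -> None:
        self._factories = batch_factories

    def __iter__(self) -> Iterator[Batch]:
        for make_batch in self._factories:
            # Work happens here, one batch at a time, on demand.
            yield make_batch()

    def to_arrow(self) -> Batch:
        # Stand-in for pa.Table.from_batches(): concatenate all batches.
        rows: Batch = []
        for batch in self:
            rows.extend(batch)
        return rows
```

With a shape like this, `DataScan.to_arrow()` keeps its current eager contract while streaming consumers iterate without full materialization.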
Mainly looking for feedback on whether this direction makes sense and what the priorities should be.