Exploring Enhanced Compaction Support in Rust

This discussion is related to issues #624 and #607. I have been investigating the compaction process in the Rust library, specifically comparing it to the Java implementation using Spark. During this investigation, I noticed a difference in how the `FileScanTask` class is handled between the two implementations.

In the Java version, the `FileScanTask` includes:
1. A `DataFile` object, which provides crucial information about partitions, `specId`, and `content`. This information is necessary for the rewrite process in compaction. However, I am aware that @sdd previously raised a valid concern about including this data in the `FileScanTask` (see this comment: https://github.com/apache/iceberg-rust/pull/607#issuecomment-2334603319).
2. A `List<DeleteFile>`, which is used to remove the deleted rows from existing files during the rewrite.
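
To illustrate why the delete files matter for a rewrite, here is a minimal sketch of dropping rows that a position delete marks as removed. `PositionDelete` and `apply_position_deletes` are hypothetical names made up for this example; they are not part of iceberg-rust or the Java API, and the sketch ignores equality deletes entirely.

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for one entry of a position delete file:
/// it names a data file and the ordinal position of a deleted row in it.
pub struct PositionDelete {
    pub file_path: String,
    pub pos: u64,
}

/// Returns the rows of `data_file_path` that survive the given deletes.
/// Deletes that target other data files are ignored.
pub fn apply_position_deletes<T: Clone>(
    data_file_path: &str,
    rows: &[T],
    deletes: &[PositionDelete],
) -> Vec<T> {
    // Collect the deleted positions that apply to this data file.
    let deleted: HashSet<u64> = deletes
        .iter()
        .filter(|d| d.file_path == data_file_path)
        .map(|d| d.pos)
        .collect();

    // Keep every row whose position is not marked as deleted.
    rows.iter()
        .enumerate()
        .filter(|(i, _)| !deleted.contains(&(*i as u64)))
        .map(|(_, row)| row.clone())
        .collect()
}
```

A compaction rewrite would write only the surviving rows into the new data file, which is why the task needs to know which delete files apply to it.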

I would like to explore the preferred approach for adding the necessary data to facilitate the implementation of compaction in the Rust library. Here are a few potential options I am considering:

1. Add the fields `DataFile` and `List<DeleteFile>` to `FileScanTask`.
2. Propose a new API that returns a more informative version of `FileScanTask` (perhaps `FileScanPlan`?), which includes the required data but is not serializable.
3. Other possible solutions? I am open to suggestions on alternative approaches.
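
To make option 2 a bit more concrete, here is a rough sketch of what a richer, non-serializable plan type could look like next to a lightweight task. Everything here is an assumption for discussion: `DataFileInfo`, `DeleteFileInfo`, `ScanTask`, the fields on `FileScanPlan`, and `group_by_partition` are simplified, hypothetical stand-ins and do not mirror the actual iceberg-rust types.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for the data file metadata compaction needs.
#[derive(Debug, Clone, PartialEq)]
pub struct DataFileInfo {
    pub file_path: String,
    pub spec_id: i32,
    pub partition: Vec<String>,
    pub file_size_bytes: u64,
}

/// Hypothetical stand-in for a delete file reference.
#[derive(Debug, Clone, PartialEq)]
pub struct DeleteFileInfo {
    pub file_path: String,
}

/// Hypothetical lightweight task, standing in for a serializable scan task.
#[derive(Debug, Clone, PartialEq)]
pub struct ScanTask {
    pub data_file_path: String,
    pub start: u64,
    pub length: u64,
}

/// Option 2: a richer plan that carries the full metadata and is not
/// meant to be serialized or shipped to workers.
#[derive(Debug, Clone)]
pub struct FileScanPlan {
    pub data_file: DataFileInfo,
    pub delete_files: Vec<DeleteFileInfo>,
}

impl FileScanPlan {
    /// Derive the lightweight task from the plan, covering the whole file.
    pub fn to_scan_task(&self) -> ScanTask {
        ScanTask {
            data_file_path: self.data_file.file_path.clone(),
            start: 0,
            length: self.data_file.file_size_bytes,
        }
    }
}

/// Group plans by (spec id, partition), as a compaction planner might do
/// before deciding which files to rewrite together.
pub fn group_by_partition(
    plans: &[FileScanPlan],
) -> BTreeMap<(i32, Vec<String>), Vec<&FileScanPlan>> {
    let mut groups: BTreeMap<(i32, Vec<String>), Vec<&FileScanPlan>> = BTreeMap::new();
    for plan in plans {
        let key = (plan.data_file.spec_id, plan.data_file.partition.clone());
        groups.entry(key).or_default().push(plan);
    }
    groups
}
```

The idea behind this shape is that the existing task stays small and serializable, while the planner-side type keeps the partition and delete file information that the rewrite needs.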

I also tried to map the logic of the Java + Spark implementation (see the diagram below) to help us understand the flow, in the hope that we can do the same with Rust and DataFusion, and maybe Comet.

I would love to get your input, @sdd, @Xuanwo, and @ZENOTME.


![compaction-RewriteDataFilesSparkAction Diagram drawio (1)](https://github.com/user-attachments/assets/ca2b9c15-9063-4d19-9acd-758eb45f543f)





Exploring Enhanced Compaction Support in Rust #657
