This discussion is related to issues #624 and #607. I have been investigating the compaction process in the Rust library, specifically comparing it to the Java implementation using Spark. During this investigation, I noticed a difference in how the FileScanTask class is handled between the two implementations.
In the Java version, the FileScanTask includes:
A DataFile object, which provides crucial information such as the partition, specId, and content. This information is necessary for the rewrite process in compaction. However, I am aware that @sdd previously raised a valid concern about including this data in the FileScanTask (see the comment on #607, "refactor: Store DataFile in FileScanTask instead").
A List, which is used to remove the necessary rows from existing files.
I would like to explore the preferred approach for adding the necessary data to facilitate the implementation of compaction in the Rust library. Here are a few potential options I am considering:
Add the fields DataFile & List to FileScanTask.
Propose a new API that returns a more informative version of FileScanTask (perhaps FileScanPlan?), which includes the required data but is not serializable.
Other possible solutions? - I am open to suggestions on alternative approaches.
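To make the second option a bit more concrete, here is a rough sketch of what it could look like. Everything in it is hypothetical: `DataFileInfo`, `ContentKind`, `FileScanPlan::new`, `into_task`, and `group_for_rewrite` are simplified stand-ins I made up for illustration, not the existing iceberg-rust types or API. The idea is that the lightweight task stays as it is, while a richer planner-side wrapper carries the DataFile metadata and the delete files that compaction needs.

```rust
use std::collections::BTreeMap;

/// Hypothetical, simplified stand-in for a file's content kind.
#[derive(Debug, Clone, PartialEq)]
pub enum ContentKind {
    Data,
    PositionDeletes,
    EqualityDeletes,
}

/// Hypothetical, simplified stand-in for the DataFile metadata.
#[derive(Debug, Clone)]
pub struct DataFileInfo {
    pub file_path: String,
    pub content: ContentKind,
    pub spec_id: i32,
    /// Partition values rendered as strings to keep the sketch small.
    pub partition: Vec<String>,
}

/// Lightweight task: only what a reader needs, so it stays cheap to ship around.
#[derive(Debug, Clone)]
pub struct FileScanTask {
    pub data_file_path: String,
    pub start: u64,
    pub length: u64,
}

/// Richer planner-side view (option 2). Deliberately not serializable.
#[derive(Debug, Clone)]
pub struct FileScanPlan {
    pub task: FileScanTask,
    pub data_file: DataFileInfo,
    /// Delete files that apply to `data_file` (assumed shape).
    pub deletes: Vec<DataFileInfo>,
}

impl FileScanPlan {
    /// Build a plan covering the whole file, starting at offset 0.
    pub fn new(data_file: DataFileInfo, length: u64, deletes: Vec<DataFileInfo>) -> Self {
        let task = FileScanTask {
            data_file_path: data_file.file_path.clone(),
            start: 0,
            length,
        };
        Self { task, data_file, deletes }
    }

    /// Drop the extra metadata when only the lightweight task is needed.
    pub fn into_task(self) -> FileScanTask {
        self.task
    }
}

/// Group plans by (spec id, partition) so each group can be rewritten together.
pub fn group_for_rewrite(
    plans: Vec<FileScanPlan>,
) -> BTreeMap<(i32, Vec<String>), Vec<FileScanPlan>> {
    let mut groups = BTreeMap::new();
    for plan in plans {
        let key = (plan.data_file.spec_id, plan.data_file.partition.clone());
        groups.entry(key).or_insert_with(Vec::new).push(plan);
    }
    groups
}
```

With something along these lines, a compaction job could plan with FileScanPlan, group files per partition for the rewrite, and call `into_task` wherever only the existing task shape is wanted, which would leave the concern about bloating FileScanTask untouched.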
I also tried to map the logic of the Java + Spark implementation to help us understand the flow, in the hope that we can do the same with Rust and DataFusion, and maybe Comet.
Would love to get your input @sdd @Xuanwo & @ZENOTME