[SYSTEMDS-3902] Accelerated data transfer Python <--> JVM#2296
[SYSTEMDS-3902] Accelerated data transfer Python <--> JVM#2296e-strauss wants to merge 1 commit into
Conversation
|
@e-strauss. The speedups look good. Can you also mention the size of the inputs beside #elements? Size is easier to comprehend? [1] https://docs.databricks.com/aws/en/pandas/pyspark-pandas-conversion |
|
@phaniarnab afaik Apache Arrow gives us just an unified format, but not per-se transport way for the data transfer. In this case, we are transferring dense arrays, where we wouldn't gain anything from the Arrow format and have the overhead for the transformation from pandas to arrow and from arrow to our internal matrix format, I guess. When it comes to transferring frames, we can use the arrow representation as format, and send it over our FIFO pipe |
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #2296 +/- ##
============================================
+ Coverage 72.61% 72.65% +0.03%
- Complexity 46267 46320 +53
============================================
Files 1490 1491 +1
Lines 174390 174590 +200
Branches 34210 34236 +26
============================================
+ Hits 126634 126845 +211
+ Misses 38194 38173 -21
- Partials 9562 9572 +10 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
|
@phaniarnab I added a new experiment for spark with arrow tables as comparison. For the experiment, I created an arrow table in python and transferred it to spark by triggering the computation using count action. The experiment can be found here. For larger data sizes, the runtime goes down significantly, since spark switches from a LocalRelation in createDataFrame to a RDD-based createDataFrame. Both with Arrow optimization. |
0afe221 to
563b70b
Compare
563b70b to
25f7bad
Compare
Introduced a new data transfer mechanism on Unix systems using FIFO (named) pipes as a faster alternative to py4j-based communication. - Supports multiple value types (uint8, int32, fp32, fp64) for dense matrix exchange. - Adds experimental support for partitioned matrix transfer from Python to Java via multiple concurrent pipes (disabled by default due to limited performance improvement). - Significantly reduces overhead compared to py4j for large matrix transfers in supported scenarios Closes apache#2296.
Benchmark results
I did experiments on two different systems:
Experiment 1: Python --> Java:
Experiment-Setup:
Note:
Hardware: M2 w/ 16GB memory
Hardware: Ryzen 5 w/ 64GB memory
For the spark experiment, I used reference implementation with pyspark and arrow. I created a single column arrow table with the numeric data, created a spark df from the table and and triggered the transfer via a simple count action.
Runtime in s