[SYSTEMDS-3902] Accelerated data transfer Python <--> JVM by e-strauss · Pull Request #2296 · apache/systemds

e-strauss · 2025-07-22T13:34:07Z

Benchmark results

I did experiments on two different systems:

Mac w/ M2 and 16GB memory
Linux ThinkPad w/ Ryzen 5 and 64GB memory

Experiment 1: Python --> Java:

Experiment-Setup:

allocate np array in python (time excluded)
start systemds context w/ pipe setup (time included)
transfer the array to systemds 5x (time included)

Note:

for arrays, that are larger than ~152.588 MB, the py4j implementation had a runtime exception (allocation of an array with negative size)

Hardware: M2 w/ 16GB memory

Array Size	py4j runtime in s	unix pipe runtime in s
0.763 MB	0.186	0.103
3.815 MB	0.3071	0.119
7.629 MB	0.421	0.145
38.147 MB	1.357	0.332
76.294 MB	2.52	0.607
152.588 MB	7.096	1.073
381.470 MB	ERR	2.39
0.745 GB	ERR	4.555
1.49 GB	ERR	13.4067

Hardware: Ryzen 5 w/ 64GB memory

For the spark experiment, I used reference implementation with pyspark and arrow. I created a single column arrow table with the numeric data, created a spark df from the table and and triggered the transfer via a simple count action.

Runtime in s

#elements	Size	py4j	single unix pipe	multi unix pipes	mmap	spark
100000	781.25 KB	0.337	0.119	0.197	0.148	1.196
500000	3.81 MB	0.646	0.161	0.261	0.215	3.496
1000000	7.63 MB	1.657	0.194	0.324	0.277	6.349
5000000	38.15 MB	3.793	0.454	0.560	0.798	36.516
10000000	76.29 MB	7.487	0.767	0.915	1.434	4.426
20000000	152.59 MB	14.030	1.426	1.574	2.673	5.617
50000000	381.47 MB	ERR	3.397	3.579	6.275	12.467
100000000	762.94 MB	ERR	6.682	7.354	12.909	25.100
200000000	1.49 GB	ERR	11.801	12.470	27.549	47.890

phaniarnab · 2025-07-22T13:41:48Z

@e-strauss. The speedups look good. Can you also mention the size of the inputs beside #elements? Size is easier to comprehend?
Wondering why we are not using Arrow format for JVM <-> Python data transfer like Spark does [1]?

[1] https://docs.databricks.com/aws/en/pandas/pyspark-pandas-conversion

e-strauss · 2025-07-22T14:20:55Z

@phaniarnab afaik Apache Arrow gives us just an unified format, but not per-se transport way for the data transfer. In this case, we are transferring dense arrays, where we wouldn't gain anything from the Arrow format and have the overhead for the transformation from pandas to arrow and from arrow to our internal matrix format, I guess.

When it comes to transferring frames, we can use the arrow representation as format, and send it over our FIFO pipe

codecov · 2025-07-23T08:48:51Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 72.65%. Comparing base (480e9a0) to head (25f7bad).
⚠️ Report is 2 commits behind head on main.

Additional details and impacted files

@@             Coverage Diff              @@
##               main    #2296      +/-   ##
============================================
+ Coverage     72.61%   72.65%   +0.03%     
- Complexity    46267    46320      +53     
============================================
  Files          1490     1491       +1     
  Lines        174390   174590     +200     
  Branches      34210    34236      +26     
============================================
+ Hits         126634   126845     +211     
+ Misses        38194    38173      -21     
- Partials       9562     9572      +10

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

e-strauss · 2025-07-25T12:44:59Z

@phaniarnab I added a new experiment for spark with arrow tables as comparison. For the experiment, I created an arrow table in python and transferred it to spark by triggering the computation using count action. The experiment can be found here.

For larger data sizes, the runtime goes down significantly, since spark switches from a LocalRelation in createDataFrame to a RDD-based createDataFrame. Both with Arrow optimization.

Baunsgaard

LGTM

Introduced a new data transfer mechanism on Unix systems using FIFO (named) pipes as a faster alternative to py4j-based communication. - Supports multiple value types (uint8, int32, fp32, fp64) for dense matrix exchange. - Adds experimental support for partitioned matrix transfer from Python to Java via multiple concurrent pipes (disabled by default due to limited performance improvement). - Significantly reduces overhead compared to py4j for large matrix transfers in supported scenarios Closes apache#2296.

github-project-automation Bot added this to SystemDS PR Queue Jul 22, 2025

github-project-automation Bot moved this to In Progress in SystemDS PR Queue Jul 22, 2025

e-strauss force-pushed the data_transfer_PR branch 3 times, most recently from 0afe221 to 563b70b Compare July 30, 2025 12:37

[SYSTEMDS-3902] Accelerated data transfer Python <--> JVM

25f7bad

e-strauss force-pushed the data_transfer_PR branch from 563b70b to 25f7bad Compare July 30, 2025 21:39

e-strauss changed the title ~~[WIP] Data transfer Python <--> Java~~ [SYSTEMDS-3902] Accelerated data transfer Python <--> JVM Jul 30, 2025

e-strauss requested a review from Baunsgaard July 30, 2025 21:58

Baunsgaard reviewed Jul 31, 2025

View reviewed changes

Comment thread src/main/java/org/apache/sysds/runtime/util/UnixPipeUtils.java

e-strauss closed this in 264bdcd Jul 31, 2025

github-project-automation Bot moved this from In Progress to Done in SystemDS PR Queue Jul 31, 2025

e-strauss deleted the data_transfer_PR branch July 31, 2025 17:25

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[SYSTEMDS-3902] Accelerated data transfer Python <--> JVM#2296

[SYSTEMDS-3902] Accelerated data transfer Python <--> JVM#2296
e-strauss wants to merge 1 commit into
apache:mainfrom
e-strauss:data_transfer_PR

e-strauss commented Jul 22, 2025 •

edited

Loading

Uh oh!

phaniarnab commented Jul 22, 2025

Uh oh!

e-strauss commented Jul 22, 2025

Uh oh!

codecov Bot commented Jul 23, 2025 •

edited

Loading

Uh oh!

e-strauss commented Jul 25, 2025

Uh oh!

Baunsgaard left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Uh oh!

Conversation

e-strauss commented Jul 22, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Benchmark results

Experiment 1: Python --> Java:

Uh oh!

phaniarnab commented Jul 22, 2025

Uh oh!

e-strauss commented Jul 22, 2025

Uh oh!

codecov Bot commented Jul 23, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

e-strauss commented Jul 25, 2025

Uh oh!

Baunsgaard left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

e-strauss commented Jul 22, 2025 •

edited

Loading

codecov Bot commented Jul 23, 2025 •

edited

Loading