Skip to content

Core: Use zero-copy wrapper for equalityFieldIds#13668

Merged
amogh-jahagirdar merged 1 commit into
apache:mainfrom
bvolpato:equalityfieldids-zerocopy
Jul 25, 2025
Merged

Core: Use zero-copy wrapper for equalityFieldIds#13668
amogh-jahagirdar merged 1 commit into
apache:mainfrom
bvolpato:equalityfieldids-zerocopy

Conversation

@bvolpato
Copy link
Copy Markdown
Contributor

In one of our Trino -> Iceberg use case, which relies on equality deletes, we observed that a lot of time and allocations are being spent in BaseFile.equalityFieldIds() (~34% of our overall allocs).

Checking the implementation, it seems that the current implementation based on streams has to copy and box the data, which is very inefficient.

image

There is already prior art (#8336) to use Guava 0-copy wrappers for longs / split offsets in the same class, I'm doing that same change.

Drafted a quick JMH to show the difference, and it clearly makes a huge difference:

Benchmark                              (arraySize)  Mode  Cnt        Score         Error  Units
IntListBenchmark.guavaImplementation            10    ss   10     1266.800 ±     199.548  ns/op
IntListBenchmark.guavaImplementation          1000    ss   10     1387.200 ±     531.909  ns/op
IntListBenchmark.guavaImplementation        100000    ss   10     1262.500 ±     188.043  ns/op
IntListBenchmark.guavaImplementation       1000000    ss   10     1600.000 ±    1509.810  ns/op
IntListBenchmark.streamImplementation           10    ss   10     8883.500 ±    1858.339  ns/op
IntListBenchmark.streamImplementation         1000    ss   10    46229.000 ±   25265.952  ns/op
IntListBenchmark.streamImplementation       100000    ss   10   736904.100 ±  299912.894  ns/op
IntListBenchmark.streamImplementation      1000000    ss   10  7321966.800 ± 7146920.166  ns/op

@github-actions github-actions Bot added the core label Jul 25, 2025
@bvolpato bvolpato force-pushed the equalityfieldids-zerocopy branch from 171652f to dafcb3d Compare July 25, 2025 01:09
@bvolpato bvolpato force-pushed the equalityfieldids-zerocopy branch from dafcb3d to a70413b Compare July 25, 2025 01:10
Copy link
Copy Markdown
Contributor

@amogh-jahagirdar amogh-jahagirdar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a great find @bvolpato, the improvement makes a lot of sense to me!

@amogh-jahagirdar
Copy link
Copy Markdown
Contributor

Thanks for the PR @bvolpato and thank you @mrcnc @pvary for reviewing. I'll go ahead and merge

@amogh-jahagirdar amogh-jahagirdar merged commit fb81fcf into apache:main Jul 25, 2025
42 checks passed
@findinpath
Copy link
Copy Markdown
Contributor

Could you pls share the code of IntListBenchmark ?


@Override
public List<Integer> equalityFieldIds() {
return ArrayUtil.toIntList(equalityIds);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is worth considering rewriting org.apache.iceberg.util.ArrayUtil#toIntList as well even though it is used at the moment only in tests.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just found this and I agree.

@wendigo
Copy link
Copy Markdown
Contributor

wendigo commented Jul 30, 2025

@bvolpato off topic: what’s the app from the screenshot?

@bvolpato
Copy link
Copy Markdown
Contributor Author

Could you pls share the code of IntListBenchmark ?

@findinpath Sorry, it was a bit of throwaway code, but I still had it saved here, pushed to https://github.com/bvolpato/guava-temp-benchmark/blob/main/src/jmh/java/com/benchmark/IntListBenchmark.java

@bvolpato off topic: what’s the app from the screenshot?

@wendigo (Disclaimer: I work at Datadog) It's from Datadog continuous profiling https://docs.datadoghq.com/profiler/, it has awesome instrumentation/tooling to proactively sample applications with very little overhead, which allows us to just go back in time for a particular time that had CPU/memory pressure and look at the flamegraphs - which was the case here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants