Benchmarks

JMH microbenchmarks comparing zstd-java against the common JVM zstd options, in both heap (byte[]) and zero-copy (MemorySegment) modes. Source lives in the benchmark/ module; see its README to reproduce.

Contestant	Binding	Modes
zstd-java (this project)	FFM (no JNI)	`byte[]` and zero-copy `MemorySegment`
zstd-jni (`com.github.luben`)	JNI	`byte[]` and zero-copy direct `ByteBuffer`
aircompressor (`io.airlift:aircompressor-v3`)	pure Java	caller-buffer `byte[]`

TL;DR

Best vs best (our MemorySegment vs zstd-jni's own zero-copy direct ByteBuffer, same zstd 1.5.7 both sides): allocation is a tie — both are ~0 B/op. The throughput edge is per-call overhead: clearest on small, call-overhead-dominated payloads (+10–23%) and converging to a tie when compute or bandwidth dominates (large decompress, 64 MiB). This is the honest FFM-vs-JNI shape — biggest where the payload is smallest. See Golden corpus: best vs best.
vs the allocating byte[] APIs: the MemorySegment path is allocation-free (flat ~0 B/op at any size) while byte[] / JNI-byte[] allocate ~the output size every call (67–79 MB/op at 64 MiB). Real, but it compares our zero-copy path against their heap API — not their zero-copy one. The allocation win is over the convenient API, not over JNI per se.

The honest headline: against zstd-jni's best path we match on allocation and lead modestly on call overhead; against the convenient byte[] APIs the zero-copy path additionally eliminates per-op heap allocation.

Environment

Apple M5, 32 GB. P-core L2 16 MiB (Apple Silicon: shared SLC, no classic L3). The 64 MiB payload is the cache-busting case.
JDK 25 (Azul). zstd-jni 1.5.7-11 (bundles zstd 1.5.7, matching our build), aircompressor-v3 3.6, JMH 1.37.
Level 3 (zstd default).
Golden corpus run: 3 forks × 3 warmup × 5 measurement, -prof gc, error bars are 99.9% CIs.
Synthetic tables below: deterministic ~3x-compressible text, and a quick, low-iteration run (1 fork, 2 warmup, 3 measurement) — directional, not publication-grade; the 64 MiB rows have wide intervals. Rerun with JMH defaults before quoting those.

Golden corpus: best vs best

The fairest comparison: our best zero-copy path against zstd-jni's best zero-copy path, both reusing a context and off-heap buffers, neither allocating per call — our MemorySegment (compressJavaSegment / decompressJavaSegment) vs zstd-jni's direct-ByteBuffer API (compressJniByteBuffer / decompressJniByteBuffer). Inputs are real fixtures from zstd's own golden corpus, not synthetic text, so the small/structured cases exercise per-call boundary overhead — exactly where FFM-vs-JNI differs. Both sides link the same zstd 1.5.7, so any gap is binding overhead, not codec version.

This run is publication-grade for the cut shown (3 forks × 3 warmup × 5 measurement, -prof gc), on the environment below.

Throughput (ops/ms, higher is better)

file (size)	JavaSegment	JniByteBuffer	edge
compress `http` (1.2 KiB)	353.6 ±3.0	322.1 ±22.9	+9.8%
compress `large-literal` (200 KiB)	46.1 ±1.4	42.2 ±0.3	+9.4%
decompress `http`	922.7 ±5.9	750.8 ±0.9	+22.9%
decompress `large-literal`	56.1 ±0.7	55.6 ±0.4	+0.9% (tie)

Allocation (`gc.alloc.rate.norm`, B/op)

	JavaSegment	JniByteBuffer
every case	~0.00	~0.00

Reading it: we lead ~+9–10% on compress and +23% on small decompress (the call-overhead-dominated cases), tie on large decompress (bandwidth-bound), and match exactly on allocation — both genuinely zero-copy. The earlier "allocation-free vs JNI" claim only held against JNI's byte[] API; against its zero-copy path the allocation advantage is gone, and the speed edge is the expected FFM call-overhead margin, largest at the smallest payloads.

Throughput (ops/ms, higher is better)

The tables below use the original synthetic payloads and compare against zstd-jni's allocating byte[] API (zstdJni), not its zero-copy path. They show the MemorySegment-vs-byte[] allocation story; for the fair zero-copy-vs-zero-copy comparison see Golden corpus: best vs best above.

Compress

size	zstdJavaSegment	zstdJavaBytes	zstdJni	aircompressor
1 KiB	287.7	278.2	238.3	148.3
64 KiB	14.34	13.60	13.46	8.00
1 MiB	0.927	0.889	0.906	0.562
64 MiB	0.014	0.014	0.013	0.007

Decompress

size	zstdJavaSegment	zstdJavaBytes	zstdJni	aircompressor
1 KiB	590.7	560.1	460.6	477.0
64 KiB	32.3	29.7	26.6	26.8
1 MiB	2.15	2.02	1.82	1.74
64 MiB	0.029	0.029	0.028	0.016

At 64 MiB the segment/byte[]/jni columns are statistically indistinguishable — all bound by memory bandwidth, not API overhead.

Allocation (`gc.alloc.rate.norm`, bytes/op, lower is better)

The decisive chart. Note the segment column is flat near zero while the others scale linearly with payload size.

Compress

size	zstdJavaSegment	zstdJavaBytes	zstdJni	aircompressor
1 KiB	0.07	408	1,464	35,464
64 KiB	6.8	11,694	77,473	724,034
1 MiB	44	182,069	1,234,501	1,449,282
64 MiB	1,296	11,621,915	78,992,794	2,971,072

Decompress

size	zstdJavaSegment	zstdJavaBytes	zstdJni	aircompressor
1 KiB	0.03	1,136	1,072	48
64 KiB	13	65,651	65,585	49
1 MiB	16	1,048,785	1,048,634	58
64 MiB	1,969	67,111,319	67,109,554	4,799

Notes

Segment is allocation-free both directions — the caller owns the off-heap source and destination, so nothing touches the heap.
byte[]/JNI allocate ~the output size every call (their APIs return a fresh array). At 64 MiB that is 11-67 MB/op.
JNI compress allocates ~79 MB/op at 64 MiB (~7x the others): its compress(byte[], level) allocates a worst-case compressBound buffer and then a trimmed copy — double buffering.
aircompressor decompress is also ~0 B/op here, because the benchmark reuses a caller-supplied destination array. A caller-buffer API kills decode allocation regardless of binding; the segment path additionally avoids the off-heap↔heap copy. aircompressor compress still allocates internal hash tables per call.

Where the work goes (async-profiler, 64 MiB)

Profiling the cache-busting 64 MiB case with async-profiler corroborates the allocation counters from two angles — allocation flamegraph (what hits the heap) and itimer/CPU flamegraph (what burns cycles).

Allocation flamegraph — dominant heap-allocation site per benchmark:

benchmark	dominant alloc site
Compress `zstdJavaSegment`	none — no heap allocation sampled
Compress `zstdJavaBytes`	`byte[]` in `Zstd.copyOut` (the returned frame array)
Compress `zstdJni`	`byte[]` inside `luben…Zstd.compress`
Compress `aircompressor`	`byte[]` + `BlockCompressionState.<init>` (internal tables)
Decompress `zstdJavaSegment`	none — no heap allocation sampled
Decompress `zstdJavaBytes` / `zstdJni`	`byte[]` (the returned output array)

CPU (itimer) flamegraph — the heap paths additionally pay a memcpy bounce and GC work that the segment path does not:

zstdJavaBytes: MemorySegment.copy → ScopedMemoryAccess.copyMemory → Unsafe.copyMemory (heap↔native in/out), plus G1ParCopyClosure::do_oop_work (GC triggered by the output allocation).
zstdJni: byte_disjoint_arraycopy + Arrays.copyOfRange (the JNI copy and the trim copy).
zstdJavaSegment: neither — no copy frames, no GC frames. Only codec work (ZSTD_*, FSE_buildCTable_wksp, encodeSequences, FSE_readNCount).

So the segment API does strictly less work: no per-op heap allocation (hence no GC) and no memcpy bounce. At 64 MiB this overhead is a small fraction of total codec time — which is why throughput ties — but it is pure waste the zero-copy path removes entirely, and it dominates under sustained, allocation-sensitive load.

Reproduce

./mvnw -q -pl benchmark -am package -DskipTests

# golden corpus, best vs best (our MemorySegment vs zstd-jni direct ByteBuffer)
java --enable-native-access=ALL-UNNAMED -jar benchmark/target/benchmarks.jar \
  "GoldenCorpusBenchmark.*(Segment|JniByteBuffer)" \
  -p file=http,large-literal-and-match-lengths -f 3 -wi 3 -i 5 -prof gc

# synthetic throughput + allocation, all sizes
java -jar benchmark/target/benchmarks.jar CompressBenchmark DecompressBenchmark -prof gc

# single size
java -jar benchmark/target/benchmarks.jar -prof gc -p size=67108864

# async-profiler flamegraphs for the 64 MiB case (macOS: itimer; Linux: cpu)
LIB=/opt/homebrew/lib/libasyncProfiler.dylib
java -jar benchmark/target/benchmarks.jar -p size=67108864 \
  -prof "async:libPath=$LIB;output=flamegraph;event=alloc;dir=benchmark/target/async-alloc"
java -jar benchmark/target/benchmarks.jar -p size=67108864 \
  -prof "async:libPath=$LIB;output=flamegraph;event=itimer;dir=benchmark/target/async-cpu"

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Benchmarks

TL;DR

Environment

Golden corpus: best vs best

Throughput (ops/ms, higher is better)

Allocation (`gc.alloc.rate.norm`, B/op)

Throughput (ops/ms, higher is better)

Compress

Decompress

Allocation (`gc.alloc.rate.norm`, bytes/op, lower is better)

Compress

Decompress

Notes

Where the work goes (async-profiler, 64 MiB)

Reproduce

FilesExpand file tree

benchmarks.md

Latest commit

History

benchmarks.md

File metadata and controls

Benchmarks

TL;DR

Environment

Golden corpus: best vs best

Throughput (ops/ms, higher is better)

Allocation (gc.alloc.rate.norm, B/op)

Throughput (ops/ms, higher is better)

Compress

Decompress

Allocation (gc.alloc.rate.norm, bytes/op, lower is better)

Compress

Decompress

Notes

Where the work goes (async-profiler, 64 MiB)

Reproduce

Allocation (`gc.alloc.rate.norm`, B/op)

Allocation (`gc.alloc.rate.norm`, bytes/op, lower is better)