
feat(server): aligned buffer memory pool #2921

Merged
hubcio merged 27 commits into apache:master from tungtose:aligned-buffer-memory-pool on Mar 24, 2026

Conversation

@tungtose (Contributor) commented Mar 12, 2026

pooled_buffer

Prepare the memory pool and buffer infrastructure for O_DIRECT I/O. Direct I/O requires buffers to be aligned to the underlying block size (commonly 4096 bytes). This allows the kernel to bypass the page cache, reducing double buffering and giving more predictable I/O latency.

Known Trade-offs:

  • Minimum allocation size is now 4096 bytes, meaning small utility buffers (e.g. put_u32_le, put_u64_le) now consume more memory than before
  • make_mutable in the HTTP path now copies buffers due to alignment incompatibility with Bytes
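To illustrate the first trade-off, here is a minimal sketch of an aligned allocation in Rust using only `std::alloc`. The `AlignedBuf` type and its methods are hypothetical stand-ins, not the PR's actual `PooledBuffer` API; the point is that rounding every request up to the alignment means even a tiny buffer occupies a full 4096-byte block.

```rust
use std::alloc::{alloc, dealloc, Layout};

/// Illustrative aligned buffer; names are hypothetical, not the PR's API.
struct AlignedBuf {
    ptr: *mut u8,
    layout: Layout,
}

impl AlignedBuf {
    /// Allocate `size` bytes rounded up to a multiple of `align`
    /// (e.g. 4096 for O_DIRECT on a typical block device).
    fn new(size: usize, align: usize) -> Self {
        // Round the requested size up to a multiple of the alignment,
        // which is why even tiny buffers now occupy at least 4096 bytes.
        let rounded = (size + align - 1) / align * align;
        let layout = Layout::from_size_align(rounded, align).expect("bad layout");
        let ptr = unsafe { alloc(layout) };
        assert!(!ptr.is_null(), "allocation failed");
        Self { ptr, layout }
    }

    fn len(&self) -> usize {
        self.layout.size()
    }

    fn is_aligned(&self) -> bool {
        (self.ptr as usize) % self.layout.align() == 0
    }
}

impl Drop for AlignedBuf {
    fn drop(&mut self) {
        // Must free with the exact Layout used for allocation.
        unsafe { dealloc(self.ptr, self.layout) }
    }
}

fn main() {
    // A 10-byte request (e.g. a put_u32_le scratch buffer) still
    // consumes one full 4096-byte block.
    let buf = AlignedBuf::new(10, 4096);
    assert_eq!(buf.len(), 4096);
    assert!(buf.is_aligned());
    println!("len={} aligned={}", buf.len(), buf.is_aligned());
}
```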

@codecov codecov bot commented Mar 12, 2026

Codecov Report

❌ Patch coverage is 44.33962% with 59 lines in your changes missing coverage. Please review.
✅ Project coverage is 72.04%. Comparing base (7ed29dc) to head (b606e29).
⚠️ Report is 1 commit behind head on master.

Files with missing lines Patch % Lines
core/common/src/alloc/buffer.rs 26.66% 33 Missing ⚠️
core/common/src/alloc/memory_pool.rs 52.38% 10 Missing ⚠️
...ore/common/src/types/message/messages_batch_mut.rs 0.00% 7 Missing ⚠️
core/partitions/src/lib.rs 0.00% 6 Missing ⚠️
...e/common/src/types/segment_storage/index_reader.rs 90.00% 0 Missing and 1 partial ⚠️
...ommon/src/types/segment_storage/messages_reader.rs 90.00% 0 Missing and 1 partial ⚠️
core/partitions/src/iggy_partition.rs 0.00% 1 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##             master    #2921      +/-   ##
============================================
- Coverage     72.09%   72.04%   -0.06%     
  Complexity      930      930              
============================================
  Files          1124     1124              
  Lines         93832    93856      +24     
  Branches      71181    71213      +32     
============================================
- Hits          67649    67616      -33     
- Misses        23612    23646      +34     
- Partials       2571     2594      +23     
Flag Coverage Δ
csharp 67.43% <ø> (-0.19%) ⬇️
go 38.68% <ø> (ø)
java 62.08% <ø> (ø)
node 91.28% <ø> (-0.26%) ⬇️
python 81.43% <ø> (ø)
rust 72.75% <44.33%> (-0.05%) ⬇️

Flags with carried forward coverage won't be shown.

Files with missing lines Coverage Δ
.../binary/handlers/messages/send_messages_handler.rs 100.00% <ø> (ø)
core/server/src/http/messages.rs 85.00% <100.00%> (+0.38%) ⬆️
...e/common/src/types/segment_storage/index_reader.rs 72.81% <90.00%> (-0.43%) ⬇️
...ommon/src/types/segment_storage/messages_reader.rs 81.94% <90.00%> (+4.72%) ⬆️
core/partitions/src/iggy_partition.rs 0.00% <0.00%> (ø)
core/partitions/src/lib.rs 0.00% <0.00%> (ø)
...ore/common/src/types/message/messages_batch_mut.rs 49.89% <0.00%> (-0.31%) ⬇️
core/common/src/alloc/memory_pool.rs 70.00% <52.38%> (-0.87%) ⬇️
core/common/src/alloc/buffer.rs 67.40% <26.66%> (-12.00%) ⬇️

... and 31 files with indirect coverage changes


@tungtose tungtose force-pushed the aligned-buffer-memory-pool branch from d83f2f6 to f55e808 Compare March 13, 2026 14:50
@tungtose tungtose force-pushed the aligned-buffer-memory-pool branch from 260a3c2 to cd3560e Compare March 13, 2026 15:23
@hubcio (Contributor) commented Mar 15, 2026

> make_mutable in the HTTP path now copies buffers due to alignment incompatibility with Bytes

can you elaborate?

@tungtose (Contributor, Author)

> make_mutable in the HTTP path now copies buffers due to alignment incompatibility with Bytes
>
> can you elaborate?

Correct me if I'm wrong: to make the make_mutable function in the HTTP path zero-copy, we would need to change IggyMessagesBatch to store PooledBuffer instead of Bytes. However, IggyMessagesBatch is a core type with dependencies throughout the codebase, so this could be refactored in a separate PR if needed.

Other than that, make_mutable is not on the hot path. If a user relies on the HTTP API for high-throughput message sending, they already have bigger problems than this one copy.
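The copy described above can be sketched with only the standard library. `make_mutable_copy` is a hypothetical stand-in, not the PR's actual function: `Bytes`/`Vec` expect memory that was allocated by the global allocator with the default layout, so adopting a 4096-aligned pointer directly (e.g. via `Vec::from_raw_parts`) would later free it with the wrong `Layout`, which is undefined behavior. The safe route is a full copy.

```rust
use std::alloc::{alloc, dealloc, Layout};

// Hypothetical stand-in for converting a pooled, 4096-aligned buffer
// into ordinary owned bytes; the PR's real PooledBuffer API may differ.
fn make_mutable_copy(aligned: &[u8]) -> Vec<u8> {
    // Vec (and bytes::Bytes built from one) frees with the default
    // unaligned Layout, so an aligned allocation cannot be adopted
    // in place. Copying sidesteps the Layout mismatch.
    aligned.to_vec()
}

fn main() {
    // Simulate a 4096-aligned pool buffer filled with payload bytes.
    let layout = Layout::from_size_align(4096, 4096).unwrap();
    let ptr = unsafe { alloc(layout) };
    assert!(!ptr.is_null());
    unsafe { ptr.write_bytes(0xAB, 4096) };

    let slice = unsafe { std::slice::from_raw_parts(ptr, 4096) };
    let owned = make_mutable_copy(slice); // one full copy, as in the HTTP path
    assert_eq!(owned.len(), 4096);
    assert!(owned.iter().all(|&b| b == 0xAB));

    unsafe { dealloc(ptr, layout) };
}
```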

@tungtose tungtose force-pushed the aligned-buffer-memory-pool branch from 2d54146 to 56fc8ec Compare March 16, 2026 17:59
@tungtose (Contributor, Author)

Since this PR deeply affects the core, could you also please take a look, @numinnex @spetz?

@hubcio (Contributor) commented Mar 18, 2026

@tungtose did you check the performance? is there any difference?

@tungtose (Contributor, Author)

> @tungtose did you check the performance? is there any difference?

@hubcio here is the bench result from running run-benches.sh:

on main:

2026-03-18T13:40:40.123993Z INFO bench_report::prints: Producers Results: Total throughput: 3854.07 MB/s, 3854076 messages/s, average throughput per Producer: 481.76 MB/s, p50 latency: 1.30 ms, p90 latency: 2.75 ms, p95 latency: 7.14 ms, p99 latency: 18.61 ms, p999 latency: 44.92 ms, p9999 latency: 91.50 ms, average latency: 2.12 ms, median latency: 1.30 ms, min: 0.51 ms, max: 209.82 ms, std dev: 2.34 ms, total time: 2.50 s

Running iggy-bench pinned-consumer tcp...
Poll results:
2026-03-18T13:40:44.577717Z INFO bench_report::prints: Consumers Results: Total throughput: 2748.06 MB/s, 2748056 messages/s, average throughput per Consumer: 343.51 MB/s, p50 latency: 2.73 ms, p90 latency: 4.15 ms, p95 latency: 4.62 ms, p99 latency: 5.60 ms, p999 latency: 6.58 ms, p9999 latency: 7.35 ms, average latency: 2.84 ms, median latency: 2.73 ms, min: 1.22 ms, max: 8.01 ms, std dev: 0.44 ms, total time: 3.17 s

on this branch:

Send results:
2026-03-18T11:49:46.897632Z INFO bench_report::prints: Producers Results: Total throughput: 4115.45 MB/s, 4115452 messages/s, average throughput per Producer: 514.43 MB/s, p50 latency: 1.18 ms, p90 latency: 2.44 ms, p95 latency: 3.30 ms, p99 latency: 18.39 ms, p999 latency: 61.13 ms, p9999 latency: 115.81 ms, average latency: 1.94 ms, median latency: 1.18 ms, min: 0.38 ms, max: 213.28 ms, std dev: 2.38 ms, total time: 2.22 s

Running iggy-bench pinned-consumer tcp...
Poll results:
2026-03-18T11:49:51.299828Z INFO bench_report::prints: Consumers Results: Total throughput: 2920.55 MB/s, 2920547 messages/s, average throughput per Consumer: 365.07 MB/s, p50 latency: 2.54 ms, p90 latency: 4.05 ms, p95 latency: 4.44 ms, p99 latency: 5.72 ms, p999 latency: 9.07 ms, p9999 latency: 11.36 ms, average latency: 2.68 ms, median latency: 2.54 ms, min: 0.73 ms, max: 19.75 ms, std dev: 0.66 ms, total time: 3.10 s

@hubcio (Contributor) commented Mar 19, 2026

could you please run benchmarks with rate limit, 4 producers / consumers, total rate limit equal to 500MB? this way we'll see the p50. run it for like 30s or so

@tungtose (Contributor, Author)

> could you please run benchmarks with rate limit, 4 producers / consumers, total rate limit equal to 500MB? this way we'll see the p50. run it for like 30s or so

@hubcio here is the bench result with this bench command: `target/release/iggy-bench --rate-limit 500MB --warmup-time 3s --total-data 10GB pinned-producer-and-consumer --producers 4 --consumers 4 tcp`

on this branch:

Benchmark: Pinned Producer And Consumer, 4 producers, 4 consumers, 4 streams, 1 topic per stream, 1 partitions per topic, 20000000 messages, 1000 messages per batch, 20000 message batches, 1000 bytes per message, 20GB of data processed

2026-03-20T10:05:19.783194Z INFO bench_report::prints: Producers Results: Total throughput: 249.96 MB/s, 249960 messages/s, average throughput per Producer: 62.49 MB/s, p50 latency: 1.14 ms, p90 latency: 1.99 ms, p95 latency: 2.30 ms, p99 latency: 2.97 ms, p999 latency: 6.18 ms, p9999 latency: 13.36 ms, average latency: 1.30 ms, median latency: 1.14 ms, min: 0.51 ms, max: 13.47 ms, std dev: 0.26 ms, total time: 39.98 s
2026-03-20T10:05:19.783203Z INFO bench_report::prints: Consumers Results: Total throughput: 251.13 MB/s, 251131 messages/s, average throughput per Consumer: 62.78 MB/s, p50 latency: 1.79 ms, p90 latency: 17.87 ms, p95 latency: 76.84 ms, p99 latency: 147.42 ms, p999 latency: 168.97 ms, p9999 latency: 171.45 ms, average latency: 10.04 ms, median latency: 1.79 ms, min: 0.83 ms, max: 286.22 ms, std dev: 22.96 ms, total time: 39.92 s
2026-03-20T10:05:19.783213Z INFO bench_report::prints: Aggregate Results: Total throughput: 501.09 MB/s, 501091 messages/s, average throughput per Actor: 62.64 MB/s, p50 latency: 1.46 ms, p90 latency: 9.93 ms, p95 latency: 39.57 ms, p99 latency: 75.19 ms, p999 latency: 87.57 ms, p9999 latency: 92.40 ms, average latency: 5.67 ms, median latency: 1.46 ms, min: 0.51 ms, max: 286.22 ms, std dev: 14.20 ms, total time: 39.98 s

on master branch:

Benchmark: Pinned Producer And Consumer, 4 producers, 4 consumers, 4 streams, 1 topic per stream, 1 partitions per topic, 20000000 messages, 1000 messages per batch, 20000 message batches, 1000 bytes per message, 20GB of data processed

2026-03-20T09:50:53.541140Z INFO bench_report::prints: Producers Results: Total throughput: 249.96 MB/s, 249962 messages/s, average throughput per Producer: 62.49 MB/s, p50 latency: 0.91 ms, p90 latency: 1.53 ms, p95 latency: 1.85 ms, p99 latency: 2.89 ms, p999 latency: 11.57 ms, p9999 latency: 12.51 ms, average latency: 1.06 ms, median latency: 0.91 ms, min: 0.42 ms, max: 13.61 ms, std dev: 0.25 ms, total time: 39.98 s
2026-03-20T09:50:53.541151Z INFO bench_report::prints: Consumers Results: Total throughput: 252.47 MB/s, 252466 messages/s, average throughput per Consumer: 63.12 MB/s, p50 latency: 1.31 ms, p90 latency: 152.23 ms, p95 latency: 247.42 ms, p99 latency: 340.48 ms, p999 latency: 362.19 ms, p9999 latency: 364.22 ms, average latency: 34.25 ms, median latency: 1.31 ms, min: 0.70 ms, max: 526.35 ms, std dev: 64.63 ms, total time: 39.82 s
2026-03-20T09:50:53.541162Z INFO bench_report::prints: Aggregate Results: Total throughput: 502.43 MB/s, 502428 messages/s, average throughput per Actor: 62.80 MB/s, p50 latency: 1.11 ms, p90 latency: 76.88 ms, p95 latency: 124.63 ms, p99 latency: 171.68 ms, p999 latency: 186.88 ms, p9999 latency: 188.36 ms, average latency: 17.66 ms, median latency: 1.11 ms, min: 0.42 ms, max: 526.35 ms, std dev: 38.38 ms, total time: 39.98 s

@hubcio (Contributor) commented Mar 20, 2026

so the results for producers:

| Metric | Master | PR | Delta |
| --- | --- | --- | --- |
| p50 | 0.91 ms | 1.14 ms | +25% |
| p90 | 1.53 ms | 1.99 ms | +30% |
| p99 | 2.89 ms | 2.97 ms | +3% |

consumers:

| Metric | Master | PR | Delta |
| --- | --- | --- | --- |
| p50 | 1.31 ms | 1.79 ms | +37% |
| p90 | 152.23 ms | 17.87 ms | -88% |
| p99 | 340.48 ms | 147.42 ms | -57% |
| max | 526.35 ms | 286.22 ms | -46% |
| std dev | 64.63 ms | 22.96 ms | -64% |

p50 regressed ~25-37%, any clue why? On the other hand, tail latencies (p90+) improved 57-88%, especially on the consumer side.

@tungtose tungtose force-pushed the aligned-buffer-memory-pool branch from ece90bf to 1a19a7d Compare March 20, 2026 13:23
@tungtose (Contributor, Author) commented Mar 20, 2026

@hubcio Here is a benchmark update using the command line below. I believe the slowdown comes from the freeze() function (converting from AVec back to Bytes); the current implementation is temporary. It will be improved in an upcoming PR that integrates DirectIOFile and a proper implementation of freeze().

sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'
target/release/iggy-bench --rate-limit 500MB --warmup-time 3s --total-data 10GB pinned-producer-and-consumer --producers 4 --consumers 4 tcp
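For context on why a reworked freeze() could avoid the copy entirely, here is a minimal std-only sketch of one possible direction: sharing the backing storage behind an `Arc` so that "frozen" views and sub-slices are just reference-count bumps. `FrozenBuf` and its methods are purely hypothetical; the upcoming PR's DirectIOFile-based design may look entirely different.

```rust
use std::sync::Arc;

// Hypothetical frozen view: cheap to clone, no data copy. This is only
// an illustration of the zero-copy idea, not the PR's planned freeze().
#[derive(Clone)]
struct FrozenBuf {
    data: Arc<Vec<u8>>, // stand-in for the aligned backing storage
    start: usize,
    end: usize,
}

impl FrozenBuf {
    /// "Freeze" an owned buffer into an immutable, shareable view.
    fn freeze(data: Vec<u8>) -> Self {
        let end = data.len();
        Self { data: Arc::new(data), start: 0, end }
    }

    /// Sub-slice without copying: only the Arc refcount is bumped.
    fn slice(&self, start: usize, end: usize) -> Self {
        assert!(start <= end && self.start + end <= self.end);
        Self {
            data: Arc::clone(&self.data),
            start: self.start + start,
            end: self.start + end,
        }
    }

    fn as_slice(&self) -> &[u8] {
        &self.data[self.start..self.end]
    }
}

fn main() {
    let frozen = FrozenBuf::freeze(vec![1, 2, 3, 4]);
    let view = frozen.slice(1, 3); // no bytes copied
    assert_eq!(view.as_slice(), &[2, 3]);
    // Both views share one allocation.
    assert_eq!(Arc::strong_count(&frozen.data), 2);
}
```

This mirrors how bytes::Bytes achieves cheap clones and slices; the extra wrinkle for the pool is that the shared storage must eventually be returned to the aligned allocator rather than the global one.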

master:

bench_report::prints: Producers Results: Total throughput: 249.96 MB/s, 249964 messages/s, average throughput per Producer: 62.49 MB/s, p50 latency: 1.02 ms, p90 latency: 1.81 ms, p95 latency: 2.02 ms, p99 latency: 2.68 ms, p999 latency: 8.90 ms, p9999 latency: 11.91 ms, average latency: 1.17 ms, median latency: 1.02 ms, min: 0.46 ms, max: 13.07 ms, std dev: 0.13 ms, total time: 39.98 s
2026-03-20T12:14:11.068760Z INFO bench_report::prints: Consumers Results: Total throughput: 249.97 MB/s, 249971 messages/s, average throughput per Consumer: 62.49 MB/s, p50 latency: 1.47 ms, p90 latency: 2.31 ms, p95 latency: 2.58 ms, p99 latency: 3.34 ms, p999 latency: 10.22 ms, p9999 latency: 13.16 ms, average latency: 1.62 ms, median latency: 1.47 ms, min: 0.67 ms, max: 14.89 ms, std dev: 0.43 ms, total time: 40.08 s

PR:

Producers Results: Total throughput: 249.97 MB/s, 249968 messages/s, average throughput per Producer: 62.49 MB/s, p50 latency: 0.90 ms, p90 latency: 1.39 ms, p95 latency: 1.74 ms, p99 latency: 2.39 ms, p999 latency: 9.49 ms, p9999 latency: 16.87 ms, average latency: 1.02 ms, median latency: 0.90 ms, min: 0.43 ms, max: 11.01 ms, std dev: 0.13 ms, total time: 39.98 s
2026-03-20T13:15:34.355418Z INFO bench_report::prints: Consumers Results: Total throughput: 249.92 MB/s, 249916 messages/s, average throughput per Consumer: 62.48 MB/s, p50 latency: 1.27 ms, p90 latency: 1.97 ms, p95 latency: 2.34 ms, p99 latency: 3.16 ms, p999 latency: 10.42 ms, p9999 latency: 17.44 ms, average latency: 1.42 ms, median latency: 1.27 ms, min: 0.65 ms, max: 20.51 ms, std dev: 0.30 ms, total time: 40.04 s
2026-03-20T13:15:34.355422Z INFO bench_report::prints: Aggregate Results: Total throughput: 499.89 MB/s, 499885 messages/s, average throughput per Actor: 62.49 MB/s, p50 latency: 1.08 ms, p90 latency: 1.68 ms, p95 latency: 2.04 ms, p99 latency: 2.77 ms, p999 latency: 9.95 ms, p9999 latency: 17.16 ms, average latency: 1.22 ms, median latency: 1.08 ms, min: 0.43 ms, max: 20.51 ms, std dev: 0.14 ms, total time: 40.04 s

Producers

| Metric | Master | PR | Delta |
| --- | --- | --- | --- |
| p50 latency | 1.02 ms | 0.90 ms | -11.8% |
| p90 latency | 1.81 ms | 1.39 ms | -23.2% |
| p95 latency | 2.02 ms | 1.74 ms | -13.9% |
| p99 latency | 2.68 ms | 2.39 ms | -10.8% |
| p999 latency | 8.90 ms | 9.49 ms | +6.6% |
| p9999 latency | 11.91 ms | 16.87 ms | +41.6% |
| Std dev | 0.13 ms | 0.13 ms | 0% |

Consumers

| Metric | Master | PR | Delta |
| --- | --- | --- | --- |
| p50 latency | 1.47 ms | 1.27 ms | -13.6% |
| p90 latency | 2.31 ms | 1.97 ms | -14.7% |
| p95 latency | 2.58 ms | 2.34 ms | -9.3% |
| p99 latency | 3.34 ms | 3.16 ms | -5.4% |
| p999 latency | 10.22 ms | 10.42 ms | +2.0% |
| p9999 latency | 13.16 ms | 17.44 ms | +32.5% |
| Std dev | 0.43 ms | 0.30 ms | -30.2% |

@hubcio (Contributor) commented Mar 24, 2026

@numinnex what are we doing with this one?
EDIT: we decided to merge this now. @tungtose this code will be moved to the iobuf crate, see #3020.

@hubcio hubcio merged commit a81bcc6 into apache:master Mar 24, 2026
79 checks passed