feat(server): aligned buffer memory pool #2921
Conversation
Codecov Report ❌
@@ Coverage Diff @@
## master #2921 +/- ##
============================================
- Coverage 72.09% 72.04% -0.06%
Complexity 930 930
============================================
Files 1124 1124
Lines 93832 93856 +24
Branches 71181 71213 +32
============================================
- Hits 67649 67616 -33
- Misses 23612 23646 +34
- Partials 2571 2594 +23
can you elaborate?
Correct me if I'm wrong: to make the make_mutable function in the HTTP path zero-copy, we would need to change … Other than that, make_mutable is not in the hot path. If a user is using the HTTP API for high-throughput message sending, they already have bigger problems than this one copy.
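The "one copy" being discussed can be sketched with stdlib types. This is a hypothetical stand-in for make_mutable, not the actual server code: turning a shared, immutable buffer into an owned mutable one is zero-copy only when we hold the sole reference, and costs exactly one full copy otherwise (the same shape as bytes::Bytes::try_into_mut).

```rust
use std::rc::Rc;

// Hypothetical stand-in for make_mutable (illustration only, not the iggy
// implementation). Rc<Vec<u8>> plays the role of a shared immutable buffer.
fn make_mutable(buf: Rc<Vec<u8>>) -> Vec<u8> {
    match Rc::try_unwrap(buf) {
        // Sole owner: reuse the existing allocation, no copy.
        Ok(owned) => owned,
        // Buffer is still shared elsewhere: one full copy of the bytes.
        Err(shared) => shared.to_vec(),
    }
}

fn main() {
    // Unique ownership: zero-copy path.
    let unique = Rc::new(vec![1u8, 2, 3]);
    assert_eq!(make_mutable(unique), vec![1, 2, 3]);

    // Shared ownership: the copy path runs, the other handle stays valid.
    let shared = Rc::new(vec![4u8, 5]);
    let other_ref = Rc::clone(&shared);
    assert_eq!(make_mutable(shared), vec![4, 5]);
    assert_eq!(*other_ref, vec![4, 5]);
}
```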
@tungtose did you check the performance? is there any difference?
@hubcio here is the bench result.

running on main:
2026-03-18T13:40:40.123993Z INFO bench_report::prints: Producers Results: Total throughput: 3854.07 MB/s, 3854076 messages/s, average throughput per Producer: 481.76 MB/s, p50 latency: 1.30 ms, p90 latency: 2.75 ms, p95 latency: 7.14 ms, p99 latency: 18.61 ms, p999 latency: 44.92 ms, p9999 latency: 91.50 ms, average latency: 2.12 ms, median latency: 1.30 ms, min: 0.51 ms, max: 209.82 ms, std dev: 2.34 ms, total time: 2.50 s
Running iggy-bench pinned-consumer tcp...

on this branch:
Send results:
Running iggy-bench pinned-consumer tcp...
could you please run the benchmarks with a rate limit: 4 producers / consumers, total rate limit equal to 500MB? that way we'll see the p50. run it for like 30s or so
@hubcio here is the bench result with this bench cmd:

on this branch:
Benchmark: Pinned Producer And Consumer, 4 producers, 4 consumers, 4 streams, 1 topic per stream, 1 partitions per topic, 20000000 messages, 1000 messages per batch, 20000 message batches, 1000 bytes per message, 20GB of data processed

on master branch:
Benchmark: Pinned Producer And Consumer, 4 producers, 4 consumers, 4 streams, 1 topic per stream, 1 partitions per topic, 20000000 messages, 1000 messages per batch, 20000 message batches, 1000 bytes per message, 20GB of data processed
2026-03-20T09:50:53.541140Z INFO bench_report::prints: Producers Results: Total throughput: 249.96 MB/s, 249962 messages/s, average throughput per Producer: 62.49 MB/s, p50 latency: 0.91 ms, p90 latency: 1.53 ms, p95 latency: 1.85 ms, p99 latency: 2.89 ms, p999 latency: 11.57 ms, p9999 latency: 12.51 ms, average latency: 1.06 ms, median latency: 0.91 ms, min: 0.42 ms, max: 13.61 ms, std dev: 0.25 ms, total time: 39.98 s
2026-03-20T09:50:53.541151Z INFO bench_report::prints: Consumers Results: Total throughput: 252.47 MB/s, 252466 messages/s, average throughput per Consumer: 63.12 MB/s, p50 latency: 1.31 ms, p90 latency: 152.23 ms, p95 latency: 247.42 ms, p99 latency: 340.48 ms, p999 latency: 362.19 ms, p9999 latency: 364.22 ms, average latency: 34.25 ms, median latency: 1.31 ms, min: 0.70 ms, max: 526.35 ms, std dev: 64.63 ms, total time: 39.82 s
2026-03-20T09:50:53.541162Z INFO bench_report::prints: Aggregate Results: Total throughput: 502.43 MB/s, 502428 messages/s, average throughput per Actor: 62.80 MB/s, p50 latency: 1.11 ms, p90 latency: 76.88 ms, p95 latency: 124.63 ms, p99 latency: 171.68 ms, p999 latency: 186.88 ms, p9999 latency: 188.36 ms, average latency: 17.66 ms, median latency: 1.11 ms, min: 0.42 ms, max: 526.35 ms, std dev: 38.38 ms, total time: 39.98 s
so the results for producers:
consumers:
p50 regressed ~25-37%, any clue why? on the other hand, tail latencies (p90+) improved 57-88%, especially on the consumer side.
@hubcio Here is a benchmark update using the command line below. I believe the slow performance comes from the freeze() function (converting from AVec back to Bytes); the current implementation is temporary. It will be improved in an upcoming PR that integrates DirectIOFile and a proper implementation of freeze().

master:
bench_report::prints: Producers Results: Total throughput: 249.96 MB/s, 249964 messages/s, average throughput per Producer: 62.49 MB/s, p50 latency: 1.02 ms, p90 latency: 1.81 ms, p95 latency: 2.02 ms, p99 latency: 2.68 ms, p999 latency: 8.90 ms, p9999 latency: 11.91 ms, average latency: 1.17 ms, median latency: 1.02 ms, min: 0.46 ms, max: 13.07 ms, std dev: 0.13 ms, total time: 39.98 s
2026-03-20T12:14:11.068760Z INFO bench_report::prints: Consumers Results: Total throughput: 249.97 MB/s, 249971 messages/s, average throughput per Consumer: 62.49 MB/s, p50 latency: 1.47 ms, p90 latency: 2.31 ms, p95 latency: 2.58 ms, p99 latency: 3.34 ms, p999 latency: 10.22 ms, p9999 latency: 13.16 ms, average latency: 1.62 ms, median latency: 1.47 ms, min: 0.67 ms, max: 14.89 ms, std dev: 0.43 ms, total time: 40.08 s

PR:
Producers Results: Total throughput: 249.97 MB/s, 249968 messages/s, average throughput per Producer: 62.49 MB/s, p50 latency: 0.90 ms, p90 latency: 1.39 ms, p95 latency: 1.74 ms, p99 latency: 2.39 ms, p999 latency: 9.49 ms, p9999 latency: 16.87 ms, average latency: 1.02 ms, median latency: 0.90 ms, min: 0.43 ms, max: 11.01 ms, std dev: 0.13 ms, total time: 39.98 s
2026-03-20T13:15:34.355418Z INFO bench_report::prints: Consumers Results: Total throughput: 249.92 MB/s, 249916 messages/s, average throughput per Consumer: 62.48 MB/s, p50 latency: 1.27 ms, p90 latency: 1.97 ms, p95 latency: 2.34 ms, p99 latency: 3.16 ms, p999 latency: 10.42 ms, p9999 latency: 17.44 ms, average latency: 1.42 ms, median latency: 1.27 ms, min: 0.65 ms, max: 20.51 ms, std dev: 0.30 ms, total time: 40.04 s
2026-03-20T13:15:34.355422Z INFO bench_report::prints: Aggregate Results: Total throughput: 499.89 MB/s, 499885 messages/s, average throughput per Actor: 62.49 MB/s, p50 latency: 1.08 ms, p90 latency: 1.68 ms, p95 latency: 2.04 ms, p99 latency: 2.77 ms, p999 latency: 9.95 ms, p9999 latency: 17.16 ms, average latency: 1.22 ms, median latency: 1.08 ms, min: 0.43 ms, max: 20.51 ms, std dev: 0.14 ms, total time: 40.04 s

Producers
Consumers
Prepare the memory pool and buffer infrastructure for O_DIRECT I/O. Direct I/O requires buffers to be aligned to the underlying block size (commonly 4096 bytes). This allows the kernel to bypass the page cache, reducing double buffering and giving more predictable I/O latency.
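The alignment requirement above can be sketched in a few lines. This is illustrative only, not the pool implementation from this PR; BLOCK_SIZE and alloc_aligned are hypothetical names. An O_DIRECT-ready buffer must start at an address that is a multiple of the block size, with its length rounded up to a block multiple.

```rust
use std::alloc::{alloc, dealloc, Layout};

// Hypothetical block size for illustration; O_DIRECT typically requires the
// buffer address, transfer length, and file offset to all be multiples of
// the underlying block size (commonly 4096 bytes).
const BLOCK_SIZE: usize = 4096;

// Allocate a buffer whose address is block-aligned and whose size is the
// requested length rounded up to the next block multiple.
fn alloc_aligned(len: usize) -> (*mut u8, Layout) {
    let size = (len + BLOCK_SIZE - 1) / BLOCK_SIZE * BLOCK_SIZE;
    let layout = Layout::from_size_align(size, BLOCK_SIZE).expect("invalid layout");
    let ptr = unsafe { alloc(layout) };
    assert!(!ptr.is_null(), "allocation failed");
    (ptr, layout)
}

fn main() {
    let (ptr, layout) = alloc_aligned(1000);
    assert_eq!(ptr as usize % BLOCK_SIZE, 0); // address is block-aligned
    assert_eq!(layout.size(), 4096);          // 1000 bytes rounds up to one block
    unsafe { dealloc(ptr, layout) };
}
```

A pool built on top of this would hand out and recycle such buffers instead of allocating per write, which is what makes the page-cache bypass cheap to use on the hot path.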
Known Trade-offs: