[Metal] Batched command dispatch and staging buffer pool #18877
tqchen merged 4 commits into apache:main
Conversation
Benchmark results (Metal, M4 Max, MLC-LLM serve, temperature=0):

256 decode tokens:
Qwen2.5-0.5B-Instruct-q4f16_1: 238 t/s -> 466 t/s (1.95x)
Qwen2.5-1.5B-Instruct-q4f16_1: 177 t/s -> 239 t/s (1.35x)
Qwen2.5-3B-Instruct-q4f16_1: 114 t/s -> 139 t/s (1.21x)
Llama-3.1-8B-Instruct-q4f16_1: 76 t/s -> 89 t/s (1.18x)

1024 decode tokens:
Qwen2.5-0.5B-Instruct-q4f16_1: 239 t/s -> 398 t/s (1.67x)
Qwen2.5-1.5B-Instruct-q4f16_1: 137 t/s -> 190 t/s (1.38x)
Qwen2.5-3B-Instruct-q4f16_1: 92 t/s -> 115 t/s (1.25x)
Llama-3.1-8B-Instruct-q4f16_1: 70 t/s -> 80 t/s (1.14x)

Baseline and optimized use the same MLC-LLM, same compiled models, only the TVM Metal runtime differs. Servers run sequentially (not parallel) to avoid GPU contention. Each run preceded by 2 warmup requests.

The speedup is larger on smaller models because they are dispatch-bound (262 dispatches/token for 0.5B vs 394 for 8B). Larger models spend more time in actual compute, so the per-dispatch overhead is a smaller fraction. At 1024 tokens the 0.5B speedup drops from 1.95x to 1.67x because KV cache growth increases per-token compute, shifting the bottleneck toward memory bandwidth.

What changed:

1. Batched compute dispatch. Kernel dispatches are accumulated in a single MTLCommandBuffer via a shared MTLComputeCommandEncoder. Previously each dispatch created its own command buffer and committed immediately. The pending encoder is flushed on GPU->CPU readback, buffer deallocation, or stream sync. (A simplified sketch of this pattern follows the list.)

2. Inline blit encoders for copies. CPU->GPU and GPU->GPU copies now use blit encoders on the same pending command buffer instead of creating a separate command buffer per copy. Metal guarantees sequential ordering of encoders within a command buffer, so no explicit sync is needed between compute and copy operations.

3. Staging buffer pool for CPU->GPU copies. Each inlined CPU->GPU copy needs its own staging buffer because the GPU reads them asynchronously from the deferred command buffer. A per-device StagingBufferPool hands out shared-mode buffers and recycles them after flush/sync.

4. Conditional sync in FreeDataSpace. Instead of always calling StreamSync, we check HasPendingWork() first. When the GPU->CPU readback path has already flushed and waited, FreeDataSpace can skip the redundant sync.
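For illustration only, here is a minimal sketch of the batching pattern in item 1 and the HasPendingWork() check in item 4, assuming ARC and a simplified stream class; the class name BatchedStream and all member names are placeholders, not the PR's actual code.

#import <Metal/Metal.h>

class BatchedStream {  // hypothetical, simplified
 public:
  explicit BatchedStream(id<MTLDevice> dev) { queue_ = [dev newCommandQueue]; }

  // All kernel dispatches encode into one shared compute encoder on one
  // pending command buffer, instead of one command buffer per dispatch.
  id<MTLComputeCommandEncoder> GetPendingComputeEncoder() {
    if (pending_cb_ == nil) pending_cb_ = [queue_ commandBuffer];
    if (compute_encoder_ == nil) compute_encoder_ = [pending_cb_ computeCommandEncoder];
    return compute_encoder_;
  }

  bool HasPendingWork() const { return pending_cb_ != nil || last_cb_ != nil; }

  // Flush: end encoding and commit, but do not wait. Called on GPU->CPU
  // readback, buffer deallocation, or stream sync.
  void FlushCommandBuffer() {
    if (compute_encoder_ != nil) {
      [compute_encoder_ endEncoding];
      compute_encoder_ = nil;
    }
    if (pending_cb_ != nil) {
      [pending_cb_ commit];
      last_cb_ = pending_cb_;
      pending_cb_ = nil;
    }
  }

  // Synchronize: flush, then block until the GPU has drained the batch.
  void Synchronize() {
    FlushCommandBuffer();
    if (last_cb_ != nil) {
      [last_cb_ waitUntilCompleted];
      last_cb_ = nil;
    }
  }

 private:
  id<MTLCommandQueue> queue_ = nil;
  id<MTLCommandBuffer> pending_cb_ = nil;
  id<MTLCommandBuffer> last_cb_ = nil;
  id<MTLComputeCommandEncoder> compute_encoder_ = nil;
};

Under this shape, FreeDataSpace only needs to sync when HasPendingWork() is true, which is the conditional-sync behavior described in item 4.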
Summary of Changes: Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly enhances Metal runtime performance by optimizing command dispatch and memory copy operations. By batching compute kernels and inlining blit operations into a single command buffer, it reduces overhead associated with frequent command buffer creation and submission. The introduction of a staging buffer pool further streamlines CPU-to-GPU transfers, while conditional synchronization prevents unnecessary GPU stalls. These changes result in substantial speedups, particularly for smaller models that are dispatch-bound, as demonstrated by the provided benchmarks.
Code Review
The pull request introduces significant changes to the Metal backend's command buffer management, aiming to improve performance by batching compute dispatches and blit operations into a single MTLCommandBuffer. This involves refactoring the Stream class with new methods like GetPendingComputeEncoder(), GetBlitEncoderOnPendingBuffer(), FlushCommandBuffer(), and Synchronize(), along with internal state to manage pending command buffers and encoders. Copy operations (CopyDataFromTo) are updated to utilize this new batching mechanism, with CPU→GPU copies now employing a StagingBufferPool for efficient handling of multiple inlined transfers. Profiling counters have also been added to track various GPU operations. However, the review comments identify several critical issues: a Use-After-Free vulnerability in the Stream destructor because FlushCommandBuffer() does not wait for completion, a potential process crash if TVM_FFI_ICHECK fails within a Metal background thread's completion handler, a risk of null pointer dereference in StagingBufferPool::GetOrCreate if memory allocation fails, and a memory leak concern due to the StagingBufferPool's unbounded growth if StreamSync is not called periodically.
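To illustrate the GetBlitEncoderOnPendingBuffer() part of that flow: only one encoder can be active on a Metal command buffer at a time, so the pending compute encoder has to be ended before a blit encoder is opened on the same buffer. The sketch below is written as a member of the hypothetical BatchedStream from the earlier sketch and is not the PR's exact implementation.

// Sketch: open a blit encoder on the same pending command buffer so copies
// stay ordered after the compute work already encoded (illustrative only).
id<MTLBlitCommandEncoder> GetBlitEncoderOnPendingBuffer() {
  if (pending_cb_ == nil) pending_cb_ = [queue_ commandBuffer];
  if (compute_encoder_ != nil) {
    // Close the shared compute encoder; encoders on one command buffer
    // execute in encode order, so no explicit fence is needed.
    [compute_encoder_ endEncoding];
    compute_encoder_ = nil;
  }
  return [pending_cb_ blitCommandEncoder];  // caller ends encoding after the copy
}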
~Stream() {
  FlushCommandBuffer();
  [queue_ release];
}
A critical use-after-free (UAF) vulnerability exists in the Stream destructor: it flushes the pending command buffer but does not wait for its completion. This allows the addCompletedHandler callback, which captures this, to access a deleted Stream object if the stream is destroyed while work is pending, leading to a process crash. To prevent this UAF, the destructor must ensure all pending work has completed before destruction. Replacing FlushCommandBuffer() with Synchronize() will resolve this.
~Stream() {
  Synchronize();
  [queue_ release];
}
The Stream destructor is only called during MetalWorkspace teardown (process exit or ReinitializeDefaultStreams). At that point no GPU work is in flight. FlushCommandBuffer commits the CB but we do not need to wait for completion because the process is tearing down and the completion handler captures this which is about to be freed regardless. Using Synchronize() here would block the main thread on GPU idle for no reason. The original code (before this PR) also did not wait on completion during destruction.
would be useful to document this via comment for context. This indeed places an implicit requirement that Stream has to be destructed in teardown
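A comment along these lines in the destructor would capture that implicit teardown-only requirement (wording is illustrative, not the PR's final comment):

~Stream() {
  // NOTE: Stream is only destructed during MetalWorkspace teardown (process
  // exit or ReinitializeDefaultStreams), when no GPU work is in flight, so we
  // only flush the pending command buffer and do not wait for completion.
  // Calling Synchronize() here would block on GPU idle for no benefit.
  FlushCommandBuffer();
  [queue_ release];
}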
pending_command_buffer_.label = @"TVMBatched";
[pending_command_buffer_ addCompletedHandler:^(id<MTLCommandBuffer> buffer) {
  if (buffer.status == MTLCommandBufferStatusError) {
    TVM_FFI_ICHECK(buffer.error != nil);
The completedHandler uses TVM_FFI_ICHECK, which throws an exception if the condition fails. Because this handler is executed on a background thread owned by the Metal runtime, an unhandled exception will likely cause the entire process to terminate abruptly (via std::terminate), as there is typically no exception handling logic on the stack of these background threads.
TVM_FFI_ICHECK is the standard error reporting mechanism used throughout the TVM Metal runtime. The pre-existing code (GetCommandBuffer, which this PR does not change) already uses TVM_FFI_ICHECK in the same completion handler pattern. This is not a regression introduced by this PR.
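For context, the shared completion-handler pattern being discussed looks roughly like this; the body after the first ICHECK is a reconstruction for illustration, not the literal file contents:

[pending_command_buffer_ addCompletedHandler:^(id<MTLCommandBuffer> buffer) {
  // Runs on a Metal-owned background thread once the batch has executed.
  if (buffer.status == MTLCommandBufferStatusError) {
    TVM_FFI_ICHECK(buffer.error != nil);
    // A failing check here throws on the Metal worker thread, which is the
    // behavior the review comment flags; the reply above notes the existing
    // GetCommandBuffer path uses the same pattern.
    TVM_FFI_ICHECK(false) << "Metal command buffer error: "
                          << [[buffer.error localizedDescription] UTF8String];
  }
}];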
pool[next_index].buffer = [dev newBufferWithLength:nbytes options:MTLStorageModeShared];
pool[next_index].size = nbytes;
return pool[next_index++].buffer;
In StagingBufferPool::GetOrCreate, the call to newBufferWithLength:options: can return nil if the GPU memory allocation fails (e.g., due to resource exhaustion). The code does not check for nil before returning the buffer. The caller in metal_device_api.mm immediately uses the returned buffer in a memcpy operation: memcpy([staging contents], ...). If staging is nil, [staging contents] will return NULL, leading to a null pointer dereference and a crash.
id<MTLBuffer> buf = [dev newBufferWithLength:nbytes options:MTLStorageModeShared];
TVM_FFI_ICHECK(buf != nil) << "Failed to allocate staging buffer of size " << nbytes;
pool[next_index].buffer = buf;
Fair point. Will add a nil check with ICHECK. Note that AllocDataSpace (the main allocation path) also does not check for nil beyond a simple ICHECK, so this is consistent with the existing codebase pattern, but adding the check is cheap and reasonable.
struct StagingBufferPool {
  struct Entry {
    id<MTLBuffer> buffer = nil;
    size_t size = 0;
  };
  std::vector<Entry> pool;
  size_t next_index = 0;  // round-robin within current batch
The StagingBufferPool implementation allows unbounded memory growth. next_index is only incremented and never wraps around, and it is only reset to 0 in StreamSync. If StreamSync is not called periodically, the pool vector will grow indefinitely with each CPU->GPU copy, leading to a memory leak.
Additionally, the comment on line 366, // round-robin within current batch, is misleading, as the allocation is linear, not round-robin.
A more robust implementation might use a fixed-size ring buffer and a mechanism (like fences or command buffer completion handlers) to recycle buffers, or block if no buffers are available.
The pool does not grow unboundedly. next_index is reset to 0 on every StreamSync call, which happens at least once per token (GPU->CPU readback). In practice the pool stabilizes at ~20 entries (one per CPU->GPU copy between syncs) and never grows beyond that. The "round-robin" comment is imprecise, will fix to "sequential within current batch".
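To make that lifecycle concrete, here is a rough sketch of the pool behavior as described in this thread (grow on demand, hand out the next entry, reset the index once the batch is drained), assuming ARC and using MTLResourceStorageModeShared as the storage option. Names other than pool, next_index, and GetOrCreate are assumptions; this is not the PR's exact code.

// Members of StagingBufferPool (see the struct excerpt above).
id<MTLBuffer> GetOrCreate(id<MTLDevice> dev, size_t nbytes) {
  if (next_index == pool.size()) pool.push_back(Entry());
  Entry& e = pool[next_index];
  // Reuse the pooled buffer if it is large enough; otherwise (re)allocate.
  if (e.buffer == nil || e.size < nbytes) {
    e.buffer = [dev newBufferWithLength:nbytes options:MTLResourceStorageModeShared];
    TVM_FFI_ICHECK(e.buffer != nil) << "Failed to allocate staging buffer of size " << nbytes;
    e.size = nbytes;
  }
  return pool[next_index++].buffer;
}

// Called once the GPU has drained the batch (e.g. from StreamSync), after
// which every staging buffer handed out since the last sync is reusable.
void Reset() { next_index = 0; }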
@tqchen give it a check, let me know
Thanks @mitiskuma, did a round of review. I am curious about the benefit of the StagingBufferPool in this case. I know that it saves the CPU-to-GPU blit and allows these blits to be fused into the compute (mainly thinking about the benefit and extra complexity here). What happens if we just fuse the compute kernels and leave out the copy ones?
Compute-only batching (no staging pool) vs baseline:
Without the staging pool, each CPU->GPU copy flushes the pending compute encoder, so the batching benefit is lost on models with many interleaved copies. The staging pool is what makes batching effective. That said, I have not tested a middle ground where CPU->GPU copies skip the blit encoder entirely and memcpy directly into a shared-mode staging buffer that the compute kernel reads from, which would avoid the pool but require changes to how buffer storage modes are chosen. lmk about the rest
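For reference, the inlined CPU->GPU copy path described above looks roughly like this, reusing the hypothetical BatchedStream and StagingBufferPool sketches from earlier; it is a sketch of the described behavior, not the literal diff.

#include <cstring>

// Sketch: one inlined CPU->GPU copy recorded onto the pending command buffer.
void CopyHostToGPU(const void* src, id<MTLBuffer> dst, size_t dst_offset, size_t nbytes,
                   id<MTLDevice> dev, StagingBufferPool* pool, BatchedStream* stream) {
  // Each copy gets its own shared-mode staging buffer, because the GPU reads
  // it later, when the deferred command buffer actually executes.
  id<MTLBuffer> staging = pool->GetOrCreate(dev, nbytes);
  std::memcpy([staging contents], src, nbytes);
  // Record the device-side copy as a blit on the same pending command buffer;
  // it runs after the compute work encoded so far, with no explicit sync.
  id<MTLBlitCommandEncoder> blit = stream->GetBlitEncoderOnPendingBuffer();
  [blit copyFromBuffer:staging
          sourceOffset:0
              toBuffer:dst
     destinationOffset:dst_offset
                  size:nbytes];
  [blit endEncoding];
}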
Thanks @mitiskuma, I think the only thing left is to guard the maximum number of staging buffers in use and be able to flush, to avoid unbounded use cases. Indeed, most LLM decoding use cases won't have such an issue, but we want the runtime implementation to generally be robust to all possible use cases while optimized for the use cases we have in mind.
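One possible shape for that guard, sketched as an additional StagingBufferPool member under the assumption that Synchronize() drains the pending batch so all pooled buffers become reusable; the cap value and names are illustrative, not from the PR.

// Hypothetical cap on live staging buffers per batch. If a batch would exceed
// it, drain the GPU first so earlier staging buffers can be recycled.
static constexpr size_t kMaxStagingBuffers = 64;  // illustrative cap

id<MTLBuffer> GetOrCreateBounded(id<MTLDevice> dev, size_t nbytes, BatchedStream* stream) {
  if (next_index >= kMaxStagingBuffers) {
    stream->Synchronize();  // flush + wait; the pending batch has now executed
    Reset();                // every pooled staging buffer is reusable again
  }
  return GetOrCreate(dev, nbytes);
}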
@tqchen please let me do some review before a final approval, moving to draft.
ready for final review.
on iPhone 15 Pro:
@tvm-bot rerun
Failed to re-run CI in https://github.com/apache/tvm/actions/runs/22799327431