[bugfix] fix mamba slot leak when scheduling fails with radix cache (#15840) #16067
Conversation
Summary of Changes
This pull request resolves a critical GPU memory leak identified when serving Mamba models with radix cache, particularly affecting Qwen3-Next. The leak occurred because Mamba slots allocated through Copy-On-Write were not being released if a subsequent scheduling attempt failed. The changes introduce a mechanism to ensure these slots are properly deallocated in such failure paths, accompanied by a new regression test to validate the fix and prevent future regressions.
Code Review
This pull request effectively addresses a GPU memory leak in Mamba-based models that occurred when scheduling failed after a Copy-On-Write (COW) operation in the radix cache. The fix is implemented by ensuring that any Mamba slot allocated during prefix matching is properly released if the request subsequently fails to be scheduled. The logic is sound and correctly placed within the scheduler's prefill logic. Additionally, a new regression test has been added to specifically cover this cleanup path, verifying that the allocated slot is indeed returned to the pool. The changes are clear, targeted, and improve the robustness of the memory management for hybrid SSM models.
/tag-and-rerun-ci
Looks like one of the tests failed — I’m looking into it now and will push a fix soon.
Let me think: without this PR it will not cause an actual mamba slot leak, but it will cause the memory check to fail. Do I understand correctly?
Yes. Without this PR, it’s not necessarily a real long-term “slot leak” in the sense of “lost forever,” but it does create an untracked allocated mamba slot in the failure path. That’s why the fix explicitly frees the mamba slot when scheduling fails. The request stays in the waiting_queue and will re-allocate the slot in the next scheduling round.
I see, I think it will not cause a real leak then.
Yes.
Thanks for the clarification. You’re right — #15840 refers to a runtime GPU memory leak, and this PR does not directly address that issue. The only relationship is that this bug was found while investigating #15840 locally. This PR fixes a separate scheduler failure-path problem: when scheduling fails, a COW-allocated mamba slot is left temporarily untracked, which can cause check_memory() to report a leak and crash the server. If it would be clearer, I’m happy to update the PR title and/or Motivation to remove the direct reference to #15840 and instead describe it as “discovered during investigation of #15840.” Please let me know what you prefer.
Hi @kuafou, thanks for your insight. Did your local test crash in the check_memory() function with the report "token_to_kv_pool_allocator memory leak", or did it crash with "req_to_token_pool memory leak detected!"?
Hi, I have the logs from the test and will post them shortly. In the meantime, could you share exactly what the original error message/report was on your side?
Hi, here are my test scripts and error log. Hope this helps!

Server Side Scripts
Benchmark Scripts
Error Log
hanming-lu left a comment
This is a mitigation for a case where the server incorrectly thinks it is idle, triggering the memory check. From the server logs, it looks to happen during retractions.
Please update the description, thanks.
Done! I've updated the description and title to reflect the correct cause. Thanks for the review!
Hi @yizhang2077, the CI failures appear unrelated to my changes:
Could you please re-run? Thanks!
Hi @kuafou, for the issue you mentioned in this PR, I also noticed it in my local tests. Thanks for fixing it. The exact issue I hit is when the qwen3next model receives requests from a coding dataset or in a structured format. I have updated the original issue #15840 with dummy inputs; you can observe the GPU memory increasing with those inputs.
/rerun-failed-ci |
[bugfix] fix mamba slot leak when scheduling fails with radix cache (sgl-project#15840) When add_one_req fails after init_next_round_input allocates a mamba slot via COW (copy-on-write) during match_prefix, the slot was not released, causing a memory leak. This fix releases the mamba slot when scheduling fails.
COW (copy-on-write) mamba slot allocation only happens in MambaRadixCache. Add isinstance check to ensure slot cleanup only runs when using MambaRadixCache, following the pattern in scheduler_runtime_checker_mixin.py.
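The guarded cleanup described in the two commit messages above can be sketched as follows. This is a minimal illustration with stand-in names (RadixCache, MambaRadixCache, alloc_slot, free_slot, and release_slot_on_schedule_failure are hypothetical here, not sglang's actual APIs):

```python
class RadixCache:
    """Plain radix cache: no mamba slots, nothing to clean up."""

class MambaRadixCache(RadixCache):
    """Hybrid cache whose match_prefix may COW-allocate a mamba slot."""
    def __init__(self):
        self.free_slots = [0, 1, 2, 3]

    def alloc_slot(self):
        return self.free_slots.pop()

    def free_slot(self, idx):
        self.free_slots.append(idx)

def release_slot_on_schedule_failure(tree_cache, slot):
    # Mirror the isinstance pattern the commit message describes: only
    # MambaRadixCache owns mamba slots, so only then does cleanup run.
    if isinstance(tree_cache, MambaRadixCache) and slot is not None:
        tree_cache.free_slot(slot)
        return True
    return False

cache = MambaRadixCache()
slot = cache.alloc_slot()  # COW allocation during match_prefix
# ... add_one_req returns NO_TOKEN: scheduling failed for this round ...
released = release_slot_on_schedule_failure(cache, slot)
print(released, sorted(cache.free_slots))  # True [0, 1, 2, 3]
```

The isinstance check keeps the failure path a no-op for the plain radix cache, which never allocates mamba slots.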
/rerun-failed-ci |
[bugfix] fix mamba slot leak when scheduling fails with radix cache (sgl-project#15840) (sgl-project#16067) Co-authored-by: yizhang2077 <1109276519@qq.com>
Motivation
This PR fixes a server crash caused by a false positive "memory leak detected" error when serving Mamba models with radix-cache enabled.

The issue was discovered during the investigation of #15840 but is a distinct issue regarding the scheduler's failure-path cleanup. It occurs when the scheduler is temporarily resource-blocked (e.g., during request retraction), causing check_memory() to flag a temporarily untracked Mamba slot as leaked.

Modifications

- Release the COW-allocated Mamba slot when add_one_req fails (returns NO_TOKEN or IDLE) during the scheduling phase.
- Added a regression test, test_mamba_slot_release_after_match_prefix_cow, to verify the cleanup logic.

Accuracy Tests
N/A — resource cleanup only, no model output changes.
Benchmarking and Profiling
N/A — failure-path fix, no performance impact.
Details
The Problem:

1. During init_next_round_input(), a Mamba slot is allocated via Copy-On-Write (COW) in match_prefix.
2. add_one_req() is called but returns NO_TOKEN (e.g., because the KV cache pool is full).
3. The scheduler goes idle and runs self_check_during_idle() -> check_memory().
4. check_memory() fails because the slot allocated in step 1 is:
   - not in free_slots (it was allocated), and
   - not tracked by the tree_cache (the request didn't finish scheduling).
5. The server crashes with ValueError: token_to_kv_pool_allocator memory leak detected!.

The Fix:
When scheduling fails, we now explicitly free the temporary Mamba slot. The request remains in the waiting_queue and will safely re-allocate (and copy) the Mamba state in the next scheduling round.

Error Log
The server crashes with the following error during high load/retraction:
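The failure sequence and the fix can be reproduced in a minimal, self-contained simulation. All of the state here (TOTAL_SLOTS, free_slots, tree_cache_slots, this check_memory) is a stand-in for the real scheduler internals, not sglang's actual code:

```python
# Stand-in scheduler state: 4 mamba slots total.
TOTAL_SLOTS = 4
free_slots = list(range(TOTAL_SLOTS))
tree_cache_slots = set()  # slots owned by requests tracked in the tree cache

def check_memory():
    # Invariant: every slot is either free or tracked by the tree cache.
    # A slot that is neither looks like a leak and crashes the server.
    if len(free_slots) + len(tree_cache_slots) != TOTAL_SLOTS:
        raise ValueError("token_to_kv_pool_allocator memory leak detected!")

# Step 1: match_prefix COW-allocates a slot for the incoming request.
slot = free_slots.pop()
# Step 2: add_one_req returns NO_TOKEN (KV pool full); the request is not
# scheduled, so the slot never becomes tracked by the tree cache.
# Step 3: the scheduler looks idle and runs check_memory().
try:
    check_memory()
except ValueError as err:
    print("without fix:", err)

# The fix: explicitly free the slot on the scheduling-failure path; the
# request stays in the waiting queue and re-allocates next round.
free_slots.append(slot)
check_memory()  # invariant holds again
print("with fix: ok")
```

Running this prints the "memory leak detected" error for the unfixed path and then passes the check once the slot is freed, which matches the false-positive crash described above.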