[Fix] data race in req_to_token pool #17850
Merged
merrymercy merged 6 commits into main on Feb 2, 2026
Collaborator
Author
|
/tag-run-ci-label |
|
So if a prefill batch contains two or more requests or request chunks, is the mamba state for these requests inaccurate? |
Collaborator
Author
|
Mamba state is correct, but full attention can be wrong.
|
Collaborator
|
Hi @cctry, |
Collaborator
|
@cctry, Oh, I see. Since the kernel is launched asynchronously, even with overlap scheduling disabled, there could still be a race condition between |
nvcastet
added a commit
to nvcastet/sglang
that referenced
this pull request
Feb 13, 2026
…vent cross-stream data race
In overlap scheduling (MTPv2), `process_batch_result(N-1)` runs on the default stream concurrently with `forward(N)` on the forward stream. When a request finishes, `release_kv_cache` immediately returns its `req_pool_idx` to the free list. A new request can then recycle that pool index and `prepare_for_decode` overwrites the `req_to_token` row on the default stream while `forward(N)` still reads it, causing an "index out of bounds" assertion in IndexKernel.cu.
Fix: defer the pool-index free by one overlap iteration.
- `ReqToTokenPool.deferred_free(req)`: withholds the pool index from the free list (the slot cannot be reallocated).
- `ReqToTokenPool.flush_deferred_frees()`: moves deferred slots back to the free list once the forward that read them has completed.
- `release_kv_cache(..., defer_pool_free=True)`: used in the decode result-processing path when overlap is enabled.
- `process_batch_result_decode`: flushes deferred frees right after `copy_done.synchronize()`, which guarantees the previous forward has finished reading `req_to_token`.
This is the overlap-scheduling counterpart of PR sgl-project#17850, which fixed the same class of race for chunked prefill.
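The deferred-free scheme described in the commit message above can be sketched roughly as follows. This is a toy model, not the actual SGLang code: `ToyReqToTokenPool` is a hypothetical class, its methods take a plain slot index rather than a request object, and the point where a real scheduler would synchronize with the forward stream is only marked by a comment.

```python
class ToyReqToTokenPool:
    """Hypothetical, simplified pool; only tracks which slots are free."""

    def __init__(self, size):
        self.free_slots = list(range(size))
        self.deferred = []  # slots withheld while a forward pass may still read them

    def alloc(self):
        # Returns None when no slot is available.
        return self.free_slots.pop(0) if self.free_slots else None

    def deferred_free(self, idx):
        # Park the slot instead of returning it to the free list.
        self.deferred.append(idx)

    def flush_deferred_frees(self):
        # Make parked slots allocatable again.
        self.free_slots.extend(self.deferred)
        self.deferred.clear()


pool = ToyReqToTokenPool(size=1)
slot = pool.alloc()

# The request finishes while forward(N) may still be reading its row.
pool.deferred_free(slot)
during_forward = pool.alloc()  # the slot cannot be recycled yet

# Assumption: in a real scheduler this call would follow a stream/event sync.
pool.flush_deferred_frees()
after_sync = pool.alloc()
```

In this sketch the slot only becomes allocatable after the flush, so a new request cannot overwrite a row that an in-flight forward pass might still be reading.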
Motivation
A chunked prefill request frees its slot in `req_to_token_pool` and is allocated a slot again when preparing for its next prefill batch. As a result, if a prefill batch contains multiple requests and `req_to_token_pool` is at capacity, the write of matched KV indices for another request will overwrite the slot of the chunked request, which is still being read in the forward stream.
Example
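As a rough illustration of the hazard (not the example from the original PR), the slot reuse can be modeled in plain Python. `ToyReqToTokenPool` is a hypothetical stand-in for the real pool, and the "forward stream" is represented only by a reference to the row that an in-flight kernel would be reading.

```python
class ToyReqToTokenPool:
    """Hypothetical toy pool: one row of token indices per request slot."""

    def __init__(self, size, max_context_len):
        self.req_to_token = [[0] * max_context_len for _ in range(size)]
        self.free_slots = list(range(size))

    def alloc(self):
        # Returns None when the pool is full.
        return self.free_slots.pop(0) if self.free_slots else None

    def free(self, idx):
        self.free_slots.append(idx)

    def write(self, idx, kv_indices):
        # In-place write, like a tensor row update.
        self.req_to_token[idx][: len(kv_indices)] = kv_indices


pool = ToyReqToTokenPool(size=1, max_context_len=4)

# Chunk 1 of request A takes the only slot and records its KV indices.
slot_a = pool.alloc()
pool.write(slot_a, [10, 11])
in_flight_view = pool.req_to_token[slot_a]  # what the forward pass would read

# Old behaviour: the chunked request frees its slot between chunks.
pool.free(slot_a)

# Request B in the next batch grabs the same slot and writes its matched prefix.
slot_b = pool.alloc()
pool.write(slot_b, [90, 91, 92])

slot_reused = slot_a == slot_b
row_seen_by_forward = list(in_flight_view)  # now holds B's indices, not A's
```

Because the pool is at capacity, the freed slot is the only one available, so request B's write lands on the row that request A's forward pass would still be reading.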
Modifications
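A minimal sketch of the slot-reuse idea behind the `alloc` change in this PR, under stated assumptions: `Req` and `ToyReqToTokenPool` are hypothetical simplifications, and the real signatures and return values in SGLang may differ. The sketch only shows that a request which already holds `req_pool_idx` keeps it across `alloc` calls.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Req:
    # Hypothetical stand-in for the scheduler's request object.
    rid: str
    req_pool_idx: Optional[int] = None


class ToyReqToTokenPool:
    def __init__(self, size: int):
        self.free_slots = list(range(size))

    def alloc(self, reqs: List[Req]) -> Optional[List[int]]:
        # Only requests without a slot need a new one.
        need = [r for r in reqs if r.req_pool_idx is None]
        if len(need) > len(self.free_slots):
            return None  # not enough room for this batch
        for r in need:
            r.req_pool_idx = self.free_slots.pop(0)
        return [r.req_pool_idx for r in reqs]

    def free(self, req: Req) -> None:
        assert req.req_pool_idx is not None
        self.free_slots.append(req.req_pool_idx)
        req.req_pool_idx = None


pool = ToyReqToTokenPool(size=2)
chunked = Req("A")
first = pool.alloc([chunked])          # chunk 1: A gets a slot
other = Req("B")
second = pool.alloc([chunked, other])  # chunk 2: A keeps its slot, B gets a new one
```

Since the chunked request never returns its slot between chunks in this sketch, no other request in the batch can be handed the row it is still using.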
- `alloc(reqs: list[Req])` - Now takes a request list, sets `req.req_pool_idx` directly, and reuses the slot if already set. cc @hnyls2002
- Replaced `free()` with `free_mamba_cache(req, ...)` in `HybridReqToTokenPool` - Only frees the mamba state, not the req slot. cc @hanming-lu @yizhang2077
- `release_kv_cache()` - Now calls `free(req)` at the end; handles the early mamba-only free case
- `free()` in `process_prefill_chunk` and `cache_finished_req`
Accuracy Tests
Benchmarking and Profiling
Checklist