[spec v2]Fix torch gc of future indices#18958

Merged
hnyls2002 merged 3 commits into main from lsyin/fix-spec-v2-data-race
Feb 19, 2026
Conversation

hnyls2002 (Collaborator) commented Feb 18, 2026

How to reproduce this on small models (llama3-8b on H200): insert an artificial stall on the forward stream to widen the race window.

diff --git a/python/sglang/srt/managers/scheduler.py b/python/sglang/srt/managers/scheduler.py
index 3435fcaef..526bf04df 100644
--- a/python/sglang/srt/managers/scheduler.py
+++ b/python/sglang/srt/managers/scheduler.py
@@ -2332,6 +2332,7 @@ class Scheduler(
 
                 with self.forward_stream_ctx:
                     self.forward_stream.wait_stream(self.default_stream)
+                    torch.cuda._sleep(1_000_000_000)
                     self.future_map.resolve_future(model_worker_batch)
                     with self.record_forward_metrics(batch):
                         batch_result = self.model_worker.forward_batch_generation(

fix #18744
close #18803

I thought carefully about @nvcastet's and @trevor-m's analysis of the data races, and I concluded that there are actually no data races in the traditional sense. Even though prepare_for_decode and _draft_extend_for_decode can access the shared buffer req_to_token at the same time, there are no conflicts between these two phases, and even if there were, they wouldn't cause out-of-bounds errors or illegal memory access (IMA).
So I started thinking about whether some tensors could be garbage-collected because they were not recorded across streams. Some tensors are created on the scheduler (default) stream but used on the forward_stream, and their reference counts drop to zero during forwarding.
I originally thought the problematic tensor would be a direct field of ScheduleBatch, so as an ablation I fully cloned all GPU tensors in ModelWorkerBatch. Even after that, the IMA still occurred. Trevor gave me a very useful hint: the indices inside FutureMap held bad values that didn't make sense. That pointed me to the root cause: future_indices.indices is allocated on the default stream, used on the forward_stream, and its Python references are dropped (when model_worker_batch.spec_info and batch.spec_info are replaced) before the GPU finishes reading it. Once the reference count hits zero, the caching allocator can hand the same memory to a new allocation, which clobbers the indices mid-read. The fix is a record_stream call on the indices tensor, which tells the allocator to defer reuse until the forward_stream's pending work completes.
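The lifetime hazard and the record_stream fix can be sketched with a framework-free analogy. Everything below (Stream, ToyAllocator, run_scenario) is an illustrative stand-in for CUDA's stream-ordered caching allocator, not PyTorch or sglang internals:

```python
class Stream:
    """Stands in for a CUDA stream with queued, unfinished work."""
    def __init__(self):
        self.busy = True


class ToyAllocator:
    """Stands in for the CUDA caching allocator: freed blocks are pooled and
    handed back to the very next allocation, unless a pending use on another
    stream was recorded (the record_stream escape hatch)."""
    def __init__(self):
        self.pool = []       # blocks ready for immediate reuse
        self.deferred = []   # blocks freed while another stream still reads them

    def alloc(self, fill):
        block = self.pool.pop() if self.pool else {"data": None}
        block["data"] = list(fill)   # reusing a block overwrites its old contents
        return block

    def free(self, block, recorded_stream=None):
        if recorded_stream is not None and recorded_stream.busy:
            self.deferred.append(block)   # hold the block until the reader drains
        else:
            self.pool.append(block)       # eligible for reuse immediately


def run_scenario(use_record_stream):
    alloc = ToyAllocator()
    forward_stream = Stream()
    # "future indices" allocated on the default (scheduler) stream:
    indices = alloc.alloc(fill=[0, 1, 2, 3])
    pending_read = indices  # the forward stream will read this block later
    # The Python reference is dropped (spec_info replaced) before the
    # forward stream has actually performed its read:
    alloc.free(indices, forward_stream if use_record_stream else None)
    # The scheduler allocates something new; the caching allocator may hand
    # back the very same block and clobber it:
    alloc.alloc(fill=[9, 9, 9, 9])
    # Only now does the forward stream finally read the indices:
    return pending_read["data"]
```

Without the recorded stream, the late read observes the clobbered contents `[9, 9, 9, 9]`; with it, the block is quarantined until the stream drains and the read sees the original `[0, 1, 2, 3]`. This mirrors why a record_stream call on future_indices.indices resolves the IMA.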

hnyls2002 (Collaborator, Author):

/tag-and-rerun-ci

trevor-m (Collaborator):

Confirmed that this fixed the issue on our wideep disagg repro.

@hnyls2002 hnyls2002 merged commit 5ff5aa6 into main Feb 19, 2026
362 of 391 checks passed
@hnyls2002 hnyls2002 deleted the lsyin/fix-spec-v2-data-race branch February 19, 2026 19:38
trevor-m pushed a commit to trevor-m/sglang that referenced this pull request Mar 6, 2026
Successfully merging this pull request may close these issues.

[Bug] index out of bounds error with Spec V2