common/sampling: reset reasoning budget sampler state between generations#21594
common/sampling: reset reasoning budget sampler state between generations#21594cnsiva wants to merge 4 commits into
Conversation
…ions This ensures multi-turn conversations and parallel inference don't leak reasoning budget state from previous generations. Without this, the budget sampler remains in COUNTING or FORCING state from the prior request, affecting subsequent responses when using --cont-batching with --parallel > 1. The reasoning budget sampler is a state machine tracking thinking tokens. Each generation should start with a clean state (IDLE), but the reset() function was only resetting the main sampling chain, not the rbudget sampler. This is especially important for: - Multi-turn conversations with --parallel 3 --cont-batching - Models with reasoning budget enabled (e.g., Gemma-4 with thinking)
This ensures that grammar state is cleared between generations, preventing state leakage in multi-turn conversations and parallel inference. This follows the same logic as the previous fix for the reasoning budget sampler.
|
@aldehir might wanna look at this as well. |
|
Gemma 4 doesn't even use the reasoning sampler... will look anyway |
|
Everything now uses the reasoning sampler after your changes to the reasoning-aware grammar, remember? :) |
|
Witt this changes, It works everything as expected. Is there anything else I can do to help you get this PR approved? |
aldehir
left a comment
There was a problem hiding this comment.
Yes, it's missing. However, the rest of this report is incorrect. The sampler is initialized every time when one is needed,
https://github.com/ggml-org/llama.cpp/blob/master/tools/server/server-context.cpp#L1184
I'm inclined not to approve these inaccurate claims out of principle, but I'll let this slide considering it's two lines.
pwilkin
left a comment
There was a problem hiding this comment.
There's a regression in the test_chat_completion server test case which suggests the reset doesn't reinitialize the state of the samplers correctly, this needs to be addressed.
This commit ensures that resetting samplers between generations restores them to their intended baseline state rather than a completely empty state. This fixes a regression in the server's chat completion tests. Changes: - common/sampling: Store and re-apply 'prefill_tokens' during reset to maintain grammar state for assistant prefixes (e.g. <|im_start|>assistant). - common/reasoning-budget: Track 'initial_state' to ensure the sampler correctly reverts to IDLE or COUNTING after a reset. - common/reasoning-budget: Refactor cloning to use the copy constructor, ensuring clones are perfectly consistent with the original state. These changes prevent state leakage in multi-turn conversations and continuous batching while maintaining compatibility with chat templates that require prefix pre-filling.
@pwilkin I have updated the PR |
aldehir
left a comment
There was a problem hiding this comment.
Too much risk for something that doesn't matter, IMO.
- Use explicit initialization in clone for better consistency - Add Test 6 to verify that reset() correctly restores initial_state
|
Summary
Fixes a bug where reasoning budget sampler state leaks between generations when using continuous batching (
--cont-batching) with multiple parallel slots (--parallel > 1).Problem
The reasoning budget sampler is a state machine that tracks thinking tokens. Without resetting its state between generations, it remains in the previous generation's state (IDLE, COUNTING, FORCING, or DONE), causing subsequent generations to incorrectly handle reasoning budget.
This is especially problematic with:
--parallel 3 --cont-batchingExample scenario:
<think>tokens: sampler is in DONE state (passthrough mode)Solution
Add
llama_sampler_reset(rbudget)to the sampler's reset() function, ensuring the reasoning budget sampler's state is reset to IDLE at the start of each generation.Testing
Tested with:
--reasoning-budget -1--parallel 3 --cont-batching✅ All 3 parallel requests now properly track thinking tokens
✅ No state leakage between generations