Skip to content

common/sampling: reset reasoning budget sampler state between generations#21594

Open
cnsiva wants to merge 4 commits into
ggml-org:masterfrom
saas-home:fix/reasoning-budget-sampler-reset
Open

common/sampling: reset reasoning budget sampler state between generations#21594
cnsiva wants to merge 4 commits into
ggml-org:masterfrom
saas-home:fix/reasoning-budget-sampler-reset

Conversation

@cnsiva
Copy link
Copy Markdown

@cnsiva cnsiva commented Apr 8, 2026

Summary

Fixes a bug where reasoning budget sampler state leaks between generations when using continuous batching (--cont-batching) with multiple parallel slots (--parallel > 1).

Problem

The reasoning budget sampler is a state machine that tracks thinking tokens. Without resetting its state between generations, it remains in the previous generation's state (IDLE, COUNTING, FORCING, or DONE), causing subsequent generations to incorrectly handle reasoning budget.

This is especially problematic with:

  • Multi-turn conversations with --parallel 3 --cont-batching
  • Models with reasoning budget enabled (e.g., Gemma-4 with thinking)

Example scenario:

  1. Request 1 completes: rbudget state = DONE
  2. Request 2 starts: sampler.reset() called but rbudget NOT reset
  3. Request 2's <think> tokens: sampler is in DONE state (passthrough mode)
  4. Result: Thinking budget not properly tracked for Request 2

Solution

Add llama_sampler_reset(rbudget) to the sampler's reset() function, ensuring the reasoning budget sampler's state is reset to IDLE at the start of each generation.

Testing

Tested with:

  • Gemma-4-26B-A4B with --reasoning-budget -1
  • --parallel 3 --cont-batching
  • Multi-turn conversations

✅ All 3 parallel requests now properly track thinking tokens
✅ No state leakage between generations

…ions

This ensures multi-turn conversations and parallel inference don't leak reasoning
budget state from previous generations. Without this, the budget sampler remains
in COUNTING or FORCING state from the prior request, affecting subsequent
responses when using --cont-batching with --parallel > 1.

The reasoning budget sampler is a state machine tracking thinking tokens. Each
generation should start with a clean state (IDLE), but the reset() function
was only resetting the main sampling chain, not the rbudget sampler.

This is especially important for:
- Multi-turn conversations with --parallel 3 --cont-batching
- Models with reasoning budget enabled (e.g., Gemma-4 with thinking)
@cnsiva cnsiva requested a review from a team as a code owner April 8, 2026 00:01
This ensures that grammar state is cleared between generations,
preventing state leakage in multi-turn conversations and parallel
inference. This follows the same logic as the previous fix for the
reasoning budget sampler.
@pwilkin
Copy link
Copy Markdown
Member

pwilkin commented Apr 8, 2026

@aldehir might wanna look at this as well.

@aldehir
Copy link
Copy Markdown
Contributor

aldehir commented Apr 8, 2026

Gemma 4 doesn't even use the reasoning sampler... will look anyway

@pwilkin
Copy link
Copy Markdown
Member

pwilkin commented Apr 8, 2026

Everything now uses the reasoning sampler after your changes to the reasoning-aware grammar, remember? :)

@cnsiva
Copy link
Copy Markdown
Author

cnsiva commented Apr 8, 2026

Witt this changes, It works everything as expected. Is there anything else I can do to help you get this PR approved?

@pwilkin
Copy link
Copy Markdown
Member

pwilkin commented Apr 8, 2026

@cnsiva waiting for @aldehir to take a look first.

Copy link
Copy Markdown
Contributor

@aldehir aldehir left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it's missing. However, the rest of this report is incorrect. The sampler is initialized every time when one is needed,

https://github.com/ggml-org/llama.cpp/blob/master/tools/server/server-context.cpp#L1184

I'm inclined not to approve these inaccurate claims out of principle, but I'll let this slide considering it's two lines.

Copy link
Copy Markdown
Member

@pwilkin pwilkin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a regression in the test_chat_completion server test case which suggests the reset doesn't reinitialize the state of the samplers correctly, this needs to be addressed.

This commit ensures that resetting samplers between generations restores them
to their intended baseline state rather than a completely empty state. This
fixes a regression in the server's chat completion tests.

Changes:
- common/sampling: Store and re-apply 'prefill_tokens' during reset to
  maintain grammar state for assistant prefixes (e.g. <|im_start|>assistant).
- common/reasoning-budget: Track 'initial_state' to ensure the sampler
  correctly reverts to IDLE or COUNTING after a reset.
- common/reasoning-budget: Refactor cloning to use the copy constructor,
  ensuring clones are perfectly consistent with the original state.

These changes prevent state leakage in multi-turn conversations and
continuous batching while maintaining compatibility with chat templates
that require prefix pre-filling.
@cnsiva
Copy link
Copy Markdown
Author

cnsiva commented Apr 9, 2026

@pwilkin I have updated the PR

common : fix sampler reset regression and prevent state leakage
This commit ensures that resetting samplers between generations restores them
to their intended baseline state rather than a completely empty state. This
fixes a regression in the server's chat completion tests.

Changes:
- common/sampling: Store and re-apply 'prefill_tokens' during reset to
  maintain grammar state for assistant prefixes (e.g. <|im_start|>assistant).
- common/reasoning-budget: Track 'initial_state' to ensure the sampler
  correctly reverts to IDLE or COUNTING after a reset.
- common/reasoning-budget: Refactor cloning to use the copy constructor,
  ensuring clones are perfectly consistent with the original state.

These changes prevent state leakage in multi-turn conversations and
continuous batching while maintaining compatibility with chat templates
that require prefix pre-filling.

Copy link
Copy Markdown
Contributor

@aldehir aldehir left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Too much risk for something that doesn't matter, IMO.

- Use explicit initialization in clone for better consistency
- Add Test 6 to verify that reset() correctly restores initial_state
@cnsiva cnsiva requested a review from ggerganov as a code owner April 14, 2026 04:08
@github-actions github-actions Bot added the testing Everything test related label Apr 14, 2026
@cnsiva
Copy link
Copy Markdown
Author

cnsiva commented Apr 14, 2026

Thanks for the feedback. I've updated the PR to address the concerns about risk and idiomatic style, and to provide clearer evidence of the bug.

Key Updates in this Revision:

  1. Refactored Cloning: I’ve updated the clone() logic in common/reasoning-budget.cpp to use the explicit initialization helper instead of a copy
    constructor, matching the project's established patterns.
  2. New Unit Test (Test 6): I’ve added a test case to tests/test-reasoning-budget.cpp that specifically targets llama_sampler_reset. It proves that
    without this PR, the sampler fails to restore its prompt-derived state (e.g., staying in COUNTING if the prompt ends with an open tag) and
    incorrectly defaults to IDLE.

Why this matters for the Server:
The concern that the sampler is always "freshly initialized" doesn't hold in several critical paths. Specifically, in tools/server/server-context.cpp,
the init_sampler() function relies on common_sampler_reset().

In the current master branch, common_sampler_reset() only resets the sampling chain. It leaves the Grammar and Reasoning Budget samplers in their
previous states. This leads to state leakage across requests in continuous batching and causes desyncs whenever a slot state is copied or reset after
prompt processing.

The addition of initial_state is necessary because a "reset" for these samplers isn't a return to a generic empty state—it must return to the state
determined by the prompt prefill. This PR ensures reset() behaves as a true "deep reset" that restores the sampler to its baseline.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

testing Everything test related

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants