common/sampling: reset reasoning budget sampler state between generations by cnsiva · Pull Request #21594 · ggml-org/llama.cpp

cnsiva · 2026-04-08T00:01:13Z

Summary

Fixes a bug where reasoning budget sampler state leaks between generations when using continuous batching (--cont-batching) with multiple parallel slots (--parallel > 1).

Problem

The reasoning budget sampler is a state machine that tracks thinking tokens. Without resetting its state between generations, it remains in the previous generation's state (IDLE, COUNTING, FORCING, or DONE), causing subsequent generations to incorrectly handle reasoning budget.

This is especially problematic with:

Multi-turn conversations with --parallel 3 --cont-batching
Models with reasoning budget enabled (e.g., Gemma-4 with thinking)

Example scenario:

Request 1 completes: rbudget state = DONE
Request 2 starts: sampler.reset() called but rbudget NOT reset
Request 2's <think> tokens: sampler is in DONE state (passthrough mode)
Result: Thinking budget not properly tracked for Request 2

Solution

Add llama_sampler_reset(rbudget) to the sampler's reset() function, ensuring the reasoning budget sampler's state is reset to IDLE at the start of each generation.

Testing

Tested with:

Gemma-4-26B-A4B with --reasoning-budget -1
--parallel 3 --cont-batching
Multi-turn conversations

✅ All 3 parallel requests now properly track thinking tokens
✅ No state leakage between generations

…ions This ensures multi-turn conversations and parallel inference don't leak reasoning budget state from previous generations. Without this, the budget sampler remains in COUNTING or FORCING state from the prior request, affecting subsequent responses when using --cont-batching with --parallel > 1. The reasoning budget sampler is a state machine tracking thinking tokens. Each generation should start with a clean state (IDLE), but the reset() function was only resetting the main sampling chain, not the rbudget sampler. This is especially important for: - Multi-turn conversations with --parallel 3 --cont-batching - Models with reasoning budget enabled (e.g., Gemma-4 with thinking)

This ensures that grammar state is cleared between generations, preventing state leakage in multi-turn conversations and parallel inference. This follows the same logic as the previous fix for the reasoning budget sampler.

pwilkin · 2026-04-08T14:37:13Z

@aldehir might wanna look at this as well.

aldehir · 2026-04-08T14:46:20Z

Gemma 4 doesn't even use the reasoning sampler... will look anyway

pwilkin · 2026-04-08T15:03:17Z

Everything now uses the reasoning sampler after your changes to the reasoning-aware grammar, remember? :)

cnsiva · 2026-04-08T16:29:10Z

Witt this changes, It works everything as expected. Is there anything else I can do to help you get this PR approved?

pwilkin · 2026-04-08T16:33:21Z

@cnsiva waiting for @aldehir to take a look first.

aldehir

Yes, it's missing. However, the rest of this report is incorrect. The sampler is initialized every time when one is needed,

https://github.com/ggml-org/llama.cpp/blob/master/tools/server/server-context.cpp#L1184

I'm inclined not to approve these inaccurate claims out of principle, but I'll let this slide considering it's two lines.

pwilkin

There's a regression in the test_chat_completion server test case which suggests the reset doesn't reinitialize the state of the samplers correctly, this needs to be addressed.

This commit ensures that resetting samplers between generations restores them to their intended baseline state rather than a completely empty state. This fixes a regression in the server's chat completion tests. Changes: - common/sampling: Store and re-apply 'prefill_tokens' during reset to maintain grammar state for assistant prefixes (e.g. <|im_start|>assistant). - common/reasoning-budget: Track 'initial_state' to ensure the sampler correctly reverts to IDLE or COUNTING after a reset. - common/reasoning-budget: Refactor cloning to use the copy constructor, ensuring clones are perfectly consistent with the original state. These changes prevent state leakage in multi-turn conversations and continuous batching while maintaining compatibility with chat templates that require prefix pre-filling.

cnsiva · 2026-04-09T02:57:55Z

@pwilkin I have updated the PR

common : fix sampler reset regression and prevent state leakage
This commit ensures that resetting samplers between generations restores them
to their intended baseline state rather than a completely empty state. This
fixes a regression in the server's chat completion tests.

Changes:
- common/sampling: Store and re-apply 'prefill_tokens' during reset to
  maintain grammar state for assistant prefixes (e.g. <|im_start|>assistant).
- common/reasoning-budget: Track 'initial_state' to ensure the sampler
  correctly reverts to IDLE or COUNTING after a reset.
- common/reasoning-budget: Refactor cloning to use the copy constructor,
  ensuring clones are perfectly consistent with the original state.

These changes prevent state leakage in multi-turn conversations and
continuous batching while maintaining compatibility with chat templates
that require prefix pre-filling.

aldehir

Too much risk for something that doesn't matter, IMO.

- Use explicit initialization in clone for better consistency - Add Test 6 to verify that reset() correctly restores initial_state

cnsiva · 2026-04-14T04:12:03Z

Thanks for the feedback. I've updated the PR to address the concerns about risk and idiomatic style, and to provide clearer evidence of the bug.

Key Updates in this Revision:

Refactored Cloning: I’ve updated the clone() logic in common/reasoning-budget.cpp to use the explicit initialization helper instead of a copy
constructor, matching the project's established patterns.

New Unit Test (Test 6): I’ve added a test case to tests/test-reasoning-budget.cpp that specifically targets llama_sampler_reset. It proves that
without this PR, the sampler fails to restore its prompt-derived state (e.g., staying in COUNTING if the prompt ends with an open tag) and
incorrectly defaults to IDLE.

Why this matters for the Server:
The concern that the sampler is always "freshly initialized" doesn't hold in several critical paths. Specifically, in tools/server/server-context.cpp,
the init_sampler() function relies on common_sampler_reset().

In the current master branch, common_sampler_reset() only resets the sampling chain. It leaves the Grammar and Reasoning Budget samplers in their
previous states. This leads to state leakage across requests in continuous batching and causes desyncs whenever a slot state is copied or reset after
prompt processing.

The addition of initial_state is necessary because a "reset" for these samplers isn't a return to a generic empty state—it must return to the state
determined by the prompt prefill. This PR ensures reset() behaves as a true "deep reset" that restores the sampler to its baseline.

cnsiva requested a review from a team as a code owner April 8, 2026 00:01

pwilkin approved these changes Apr 8, 2026

View reviewed changes

aldehir approved these changes Apr 8, 2026

View reviewed changes

pwilkin requested changes Apr 8, 2026

View reviewed changes

aldehir requested changes Apr 9, 2026

View reviewed changes

common : refactor reasoning-budget clone and add reset unit test

90907de

- Use explicit initialization in clone for better consistency - Add Test 6 to verify that reset() correctly restores initial_state

cnsiva requested a review from ggerganov as a code owner April 14, 2026 04:08

github-actions Bot added the testing Everything test related label Apr 14, 2026

BruceJillis mentioned this pull request Apr 24, 2026

common : re-arm reasoning budget after DONE on new <think> #22323

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

common/sampling: reset reasoning budget sampler state between generations#21594

common/sampling: reset reasoning budget sampler state between generations#21594
cnsiva wants to merge 4 commits into
ggml-org:masterfrom
saas-home:fix/reasoning-budget-sampler-reset

cnsiva commented Apr 8, 2026

Uh oh!

pwilkin commented Apr 8, 2026

Uh oh!

aldehir commented Apr 8, 2026

Uh oh!

pwilkin commented Apr 8, 2026

Uh oh!

cnsiva commented Apr 8, 2026

Uh oh!

pwilkin commented Apr 8, 2026

Uh oh!

aldehir left a comment

Uh oh!

pwilkin left a comment

Uh oh!

cnsiva commented Apr 9, 2026

Uh oh!

aldehir left a comment

Uh oh!

cnsiva commented Apr 14, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

cnsiva commented Apr 8, 2026

Summary

Problem

Solution

Testing

Uh oh!

pwilkin commented Apr 8, 2026

Uh oh!

aldehir commented Apr 8, 2026

Uh oh!

pwilkin commented Apr 8, 2026

Uh oh!

cnsiva commented Apr 8, 2026

Uh oh!

pwilkin commented Apr 8, 2026

Uh oh!

aldehir left a comment

Choose a reason for hiding this comment

Uh oh!

pwilkin left a comment

Choose a reason for hiding this comment

Uh oh!

cnsiva commented Apr 9, 2026

@pwilkin I have updated the PR

Uh oh!

aldehir left a comment

Choose a reason for hiding this comment

Uh oh!

cnsiva commented Apr 14, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants