server-context: fall back to full seq clear when partial KV eviction is refused #23280
Merged
ggerganov merged 3 commits on May 19, 2026
…is refused

The startup probe in common_context_can_seq_rm only tests a 2-token tail removal on seq 0, so it cannot guarantee that every partial eviction will succeed at any position on any live seq. The previous code aborted the process via GGML_ABORT in common_context_seq_rm whenever the backend refused the partial removal, taking down the server on a recoverable condition. On refusal we now clear the whole seq on both the target and draft contexts, reset the prompt cache counters, and let update_slots reprefill from zero on the current iteration. The server stays alive; the slot loses its prefix cache and pays a single reprefill instead of crashing.
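The fallback described in this commit message can be illustrated with a minimal sketch. `ToyMemory` and `evict_or_reset` are hypothetical names invented for this illustration and are not part of llama.cpp; the toy models a single sequence on a single context, whereas the commit clears both the target and draft contexts.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for a backend memory holding one sequence.
// When supports_partial is false it behaves like a recurrent state:
// it can drop the whole sequence but refuses to drop only a tail.
struct ToyMemory {
    std::vector<int> tokens;        // cached tokens of the sequence
    bool supports_partial = false;  // assumption: decided by the backend

    // Remove cached positions [p0, end). Returns false when refused.
    bool seq_rm(int p0) {
        assert(p0 >= 0);
        if (p0 >= (int) tokens.size()) {
            return true;   // nothing to remove
        }
        if (p0 > 0 && !supports_partial) {
            return false;  // partial removal refused
        }
        tokens.resize(p0);
        return true;
    }
};

// Try to keep the first n_keep cached tokens. If the backend refuses,
// clear the whole sequence instead of aborting. Returns the number of
// cached tokens that remain reusable (0 means a full reprefill).
int evict_or_reset(ToyMemory & mem, int n_keep) {
    if (mem.seq_rm(n_keep)) {
        return (int) mem.tokens.size();
    }
    const bool cleared = mem.seq_rm(0);  // full clear is always accepted here
    assert(cleared);
    (void) cleared;
    return 0;
}
```

In this sketch a refusal costs the caller its cached prefix but never terminates the process, which is the trade-off the commit message describes.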
Member
Could you try the following patch:

diff --git a/tools/server/server-context.cpp b/tools/server/server-context.cpp
index 0f3fb9efa..7b801eac0 100644
--- a/tools/server/server-context.cpp
+++ b/tools/server/server-context.cpp
@@ -2583,9 +2583,9 @@ private:
     llama_pos pos_next = slot.prompt.tokens.pos_next(n_past);
 
     // the largest pos_min required for a checkpoint to be useful
-    const auto pos_min_thold = std::max(0, pos_next - n_swa);
+    const auto pos_min_thold = std::max(0, pos_next - n_swa - 1);
 
-    if (n_past > 0 && n_past < slot.prompt.n_tokens()) {
+    if (n_past > 0 && n_past <= slot.prompt.n_tokens()) {
         const auto pos_min = llama_memory_seq_pos_min(llama_get_memory(ctx_tgt), slot.id);
         if (pos_min == -1) {
             SLT_ERR(slot, "n_past = %d, slot.prompt.tokens.size() = %d, seq_id = %d, pos_min = %d\n", n_past, (int) slot.prompt.tokens.size(), slot.id, pos_min);

I think this will fix both this problem and also #23223.
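To make the effect of the two changed lines concrete, here is a minimal sketch that evaluates only the guard and the threshold, before and after the patch. `CheckPlan` and `plan_check` are hypothetical names introduced for illustration, and plain `int` stands in for `llama_pos`; the sketch does not model what the server does inside the branch.

```cpp
#include <algorithm>

// Hypothetical summary of the two expressions touched by the patch.
struct CheckPlan {
    bool run_check;      // does the pos_min check branch execute?
    int  pos_min_thold;  // largest pos_min for which a checkpoint is useful
};

// patched == false reproduces the old expressions, true the new ones.
CheckPlan plan_check(int n_past, int n_prompt, int pos_next, int n_swa, bool patched) {
    const int thold = std::max(0, pos_next - n_swa - (patched ? 1 : 0));
    const bool run  = n_past > 0 &&
                      (patched ? n_past <= n_prompt : n_past < n_prompt);
    return { run, thold };
}
```

With a fully matched prompt (`n_past == n_prompt`), the old guard skips the branch and the new one enters it, so `plan_check(10, 10, 10, 4, false).run_check` is `false` while the patched variant yields `true`. The threshold drops by one (6 versus 5 in this example) and is still clamped at zero.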
Contributor
Author
Looks better, I'll try it now!
Contributor
Author
Successfully tested on master plus my sse-replay-buffer PR. It fixes the problem at its source, much better than my fallback. Thanks!
…viction is refused" This reverts commit fa9770c.
Reproduces on master with hybrid models by asking the assistant to continue its last reply in a multi-turn conversation: the LCP match is perfect, the deep partial seq_rm is refused by the recurrent backend, and common_context_seq_rm aborts the process via GGML_ABORT. The patch by @ggerganov routes the n_past == slot.prompt.n_tokens() case through the existing do_reset path.
Member
OK, let's also do some testing with non-recurrent models to make sure I am not overlooking something, and then we can merge.
Contributor
Author
Tested with Qwen3 30B A3B, GPT-OSS, and Llama 3.3 (all pure transformers): multi-turn continuation works as expected, with no regression.
kgrama pushed a commit to kgrama/llama.cpp that referenced this pull request on May 19, 2026
xxmustafacooTR pushed a commit to xxPlayground/llama-cpp-turboquant that referenced this pull request on May 19, 2026
rsenthilkumar6 pushed a commit to rsenthilkumar6/llama.cpp that referenced this pull request on May 19, 2026
ArberSephirotheca pushed a commit to ArberSephirotheca/llama.cpp that referenced this pull request on May 19, 2026
fhnmor21 pushed a commit to fhnmor21/llama-cpp-turboquant that referenced this pull request on May 19, 2026
dbrain pushed a commit to dbrain/hbd-llama-cpp-turboquant that referenced this pull request on May 21, 2026
baramofme pushed a commit to baramofme/llama-cpp-turboquant that referenced this pull request on May 23, 2026
Jcfunk added a commit to Jcfunk/llama.cpp that referenced this pull request on May 23, 2026
* upstream/HEAD:
  ci : install server kleidiai runner dependencies (ggml-org#23259)
  server-context: guarantee there is at least 1 token to decode (ggml-org#23280)
  server : print graphs reused in slot timings (ggml-org#23279)
  save-load-state : refactor tests and improve readability (ggml-org#23196)
  llama-eval : add per-task summary stats (ggml-org#23151)
  ggml-webgpu : extend GDN for K>1 (ggml-org#23299)
  [SCYL] add chapter for performance reference in SYCL.md (ggml-org#23315)
  convert : filter lora tensor names (ggml-org#23077)
  sycl: add GGML_SYCL_USE_ASYNC_MEM_OP env toggle (ggml-org#22153)
  rpc : keep last_graph_uid in the device context (ggml-org#23273)
srossitto79 pushed a commit to srossitto79/llama.cpp that referenced this pull request on May 23, 2026
jimbothigpen added a commit to jimbothigpen/llama.cpp that referenced this pull request on May 29, 2026
Reverts mainline commit ccee426 (PR ggml-org#23280) in tools/server/server-context.cpp which we picked up via 2026-05-25 forward-sync. The change introduced a KV cache reuse regression on Qwen3.6-35B-A3B (and likely Qwen3.5-35B-A3B-MTP) where a full batch of cached tokens is dropped per turn on multi-turn requests.

Mainline issue: ggml-org#23589
RC + reproducer: orangeswim 2026-05-24

§-RISK: This is a naked revert per the issue author's mainline test; it may reintroduce the hybrid-attention crash that ggml-org#23280 was fixing. Build + smoke verify gated on GPU-lockout-clear; follow-up worker required before FF-merge.

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
fewtarius pushed a commit to fewtarius/llama.cpp that referenced this pull request on May 30, 2026
Overview
To reproduce on master, run llama-server with a recent hybrid attention model such as Qwen3.6-MoE, fill the KV cache with a few conversation turns, then click "Continue" at the end of an assistant reply and watch the server abort on a partial seq_rm refusal.
Additional information
Fix this
Requirements