server: respect per-request enable_thinking toggle via extra_body by pju-hoge · Pull Request #22336 · ggml-org/llama.cpp

pju-hoge · 2026-04-24T21:20:31Z

Overview

Fixes enable_thinking being ignored when set per-request via extra_body or chat_template_kwargs. Previously the server only checked chat_template_kwargs, but OpenAI-compatible clients (and tools like Opencode) send it in extra_body. This caused enable_thinking: false to be silently ignored across all shells.

Changes

`tools/server/server-common.cpp`

Read enable_thinking from extra_body first (OpenAI-compatible path), then fall back to chat_template_kwargs. Pass the value to the slot so it can override the server default.

`common/chat.cpp`

supports_thinking now requires both template-level reasoning detection AND enable_thinking == true. Prevents thinking tags from being injected when the user explicitly toggles thinking off.

`common/chat-auto-parser-generator.cpp`

extract_reasoning also gated by enable_thinking, ensuring the PEG parser does not extract reasoning block markers when thinking is disabled.

Fixes

Misc. bug: enable_thinking param cannot turn off thinking for qwen3.5 #20182 — enable_thinking param cannot turn off thinking for qwen3.5
disabling reasoning does not work anymore on certain models #20196 — disabling reasoning does not work anymore on certain models
Eval bug: Bug: Qwen3.5 enable_thinking=false via --chat-template-kwargs is ignored across all shells (PowerShell/Bash) #20409 — enable_thinking=false ignored across all shells
Misc. bug: Qwen3.5 with Opencode: Assistant response prefill is incompatible with enable_thinking. #20861 — Qwen3.5 with Opencode assistant prefill incompatible

Testing

Build verified (Release, Linux, cmake --build . -j$(nproc))
Existing test-chat and test-chat-peg-parser pass
No new server integration tests added (change is minimal and targeted)

AI Usage Disclosure

YES — AI (OpenCode with Qwen3.6-35B) assisted with code formatting, searching related issues on GitHub, verifying CI compliance against .github/workflows/server.yml, and drafting this PR description. All code changes, logic design, and review were performed by the contributor.

Fix enable_thinking being ignored in llama.cpp server requests. The issue was in three places: - server-common.cpp: read enable_thinking from extra_body directly (not just chat_template_kwargs), and propagate it to chat_template_kwargs for template access - common/chat.cpp: supports_thinking = template_supports_thinking && params.enable_thinking - common/chat-auto-parser-generator.cpp: extract_reasoning depends on inputs.enable_thinking API usage: - reasoning_format='auto' + extra_body.enable_thinking=true -> thinking on - reasoning_format='auto' + extra_body.enable_thinking=false -> thinking off

ggml-gh-bot · 2026-04-24T21:24:36Z

Hi @pju-hoge, thanks for your contribution!

Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:

AI-generated content: This project does not accept PRs, descriptions or commit messages that are fully or predominantly AI-generated. If you have used AI to assist you in writing code, please make sure to disclose that explicitly.

Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below.

vanmilleru · 2026-04-25T04:07:07Z

#22323
#22162

isaac-mcfadyen · 2026-04-26T14:31:17Z

As far as I'm aware, extra_body on the OpenAI clients literally adds the fields to the body. There is not a dedicated field called extra_body actually sent in the request. See:

https://github.com/openai/openai-python/blob/e507a4ebeea4c3f93cd48986014a3e2ca79230c2/src/openai/_base_client.py#L2007-L2045

https://github.com/openai/openai-python/blob/e507a4ebeea4c3f93cd48986014a3e2ca79230c2/src/openai/_base_client.py#L502-L509

https://github.com/openai/openai-python/blob/e507a4ebeea4c3f93cd48986014a3e2ca79230c2/src/openai/_base_client.py#L2183-L2192

Also, the general way to "disable" thinking with reasoning models is to add empty <think></think> tags. I suspect that your change in chat.cpp will not add these tags because the model is no longer marked as supporting thinking (and will severely degrade performance as a result).

pju-hoge requested review from a team as code owners April 24, 2026 21:20

github-actions Bot added examples server labels Apr 24, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

server: respect per-request enable_thinking toggle via extra_body#22336

server: respect per-request enable_thinking toggle via extra_body#22336
pju-hoge wants to merge 1 commit into
ggml-org:masterfrom
pju-hoge:feat/thinking-toggle

pju-hoge commented Apr 24, 2026 •

edited

Loading

Uh oh!

ggml-gh-bot Bot commented Apr 24, 2026

Uh oh!

vanmilleru commented Apr 25, 2026

Uh oh!

isaac-mcfadyen commented Apr 26, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

pju-hoge commented Apr 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Overview

Changes

tools/server/server-common.cpp

common/chat.cpp

common/chat-auto-parser-generator.cpp

Fixes

Testing

AI Usage Disclosure

Uh oh!

ggml-gh-bot Bot commented Apr 24, 2026

Uh oh!

vanmilleru commented Apr 25, 2026

Uh oh!

isaac-mcfadyen commented Apr 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

pju-hoge commented Apr 24, 2026 •

edited

Loading

`tools/server/server-common.cpp`

`common/chat.cpp`

`common/chat-auto-parser-generator.cpp`

isaac-mcfadyen commented Apr 26, 2026 •

edited

Loading