Skip to content

server: respect per-request enable_thinking toggle via extra_body#22336

Open
pju-hoge wants to merge 1 commit into
ggml-org:masterfrom
pju-hoge:feat/thinking-toggle
Open

server: respect per-request enable_thinking toggle via extra_body#22336
pju-hoge wants to merge 1 commit into
ggml-org:masterfrom
pju-hoge:feat/thinking-toggle

Conversation

@pju-hoge
Copy link
Copy Markdown

@pju-hoge pju-hoge commented Apr 24, 2026

Overview

Fixes enable_thinking being ignored when set per-request via extra_body or chat_template_kwargs. Previously the server only checked chat_template_kwargs, but OpenAI-compatible clients (and tools like Opencode) send it in extra_body. This caused enable_thinking: false to be silently ignored across all shells.

Changes

tools/server/server-common.cpp

Read enable_thinking from extra_body first (OpenAI-compatible path), then fall back to chat_template_kwargs. Pass the value to the slot so it can override the server default.

common/chat.cpp

supports_thinking now requires both template-level reasoning detection AND enable_thinking == true. Prevents thinking tags from being injected when the user explicitly toggles thinking off.

common/chat-auto-parser-generator.cpp

extract_reasoning also gated by enable_thinking, ensuring the PEG parser does not extract reasoning block markers when thinking is disabled.

Fixes

Testing

  • Build verified (Release, Linux, cmake --build . -j$(nproc))
  • Existing test-chat and test-chat-peg-parser pass
  • No new server integration tests added (change is minimal and targeted)

AI Usage Disclosure

YES — AI (OpenCode with Qwen3.6-35B) assisted with code formatting, searching related issues on GitHub, verifying CI compliance against .github/workflows/server.yml, and drafting this PR description. All code changes, logic design, and review were performed by the contributor.

Fix enable_thinking being ignored in llama.cpp server requests.

The issue was in three places:
- server-common.cpp: read enable_thinking from extra_body directly (not just
  chat_template_kwargs), and propagate it to chat_template_kwargs for template
  access
- common/chat.cpp: supports_thinking = template_supports_thinking &&
  params.enable_thinking
- common/chat-auto-parser-generator.cpp: extract_reasoning depends on
  inputs.enable_thinking

API usage:
- reasoning_format='auto' + extra_body.enable_thinking=true  -> thinking on
- reasoning_format='auto' + extra_body.enable_thinking=false -> thinking off
@pju-hoge pju-hoge requested review from a team as code owners April 24, 2026 21:20
@ggml-gh-bot
Copy link
Copy Markdown

ggml-gh-bot Bot commented Apr 24, 2026

Hi @pju-hoge, thanks for your contribution!

Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:

  • AI-generated content: This project does not accept PRs, descriptions or commit messages that are fully or predominantly AI-generated. If you have used AI to assist you in writing code, please make sure to disclose that explicitly.

Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below.

@vanmilleru
Copy link
Copy Markdown

#22323
#22162

@isaac-mcfadyen
Copy link
Copy Markdown
Contributor

isaac-mcfadyen commented Apr 26, 2026

As far as I'm aware, extra_body on the OpenAI clients literally adds the fields to the body. There is not a dedicated field called extra_body actually sent in the request. See:

https://github.com/openai/openai-python/blob/e507a4ebeea4c3f93cd48986014a3e2ca79230c2/src/openai/_base_client.py#L2007-L2045

https://github.com/openai/openai-python/blob/e507a4ebeea4c3f93cd48986014a3e2ca79230c2/src/openai/_base_client.py#L502-L509

https://github.com/openai/openai-python/blob/e507a4ebeea4c3f93cd48986014a3e2ca79230c2/src/openai/_base_client.py#L2183-L2192

Also, the general way to "disable" thinking with reasoning models is to add empty <think></think> tags. I suspect that your change in chat.cpp will not add these tags because the model is no longer marked as supporting thinking (and will severely degrade performance as a result).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants