Test harness for validating preserved thinking across clients, SDKs, and backends for GLM-4.7+ models.
GLM-4.7+ models support "preserved thinking" - the ability to maintain reasoning context (reasoning_content) across multi-turn conversations. However, many clients and SDKs fail to properly pass this field through, breaking the model's ability to reference its prior reasoning.
This validator helps identify which component in the stack is failing:
Client (OpenCode, Claude Code) → SDK (Vercel AI, OpenAI) → Backend (llama.cpp, vLLM)
GLM-4.7+ models output their chain-of-thought reasoning in a separate reasoning_content field (OpenAI API) or thinking content blocks (Anthropic API). For the model to reference its prior reasoning in multi-turn conversations, clients must echo this content back.
Client → Server
{
"messages": [{"role": "user", "content": "What is 2+2?"}],
"chat_template_kwargs": {"enable_thinking": true, "clear_thinking": false}
}
Server → Client
{
"message": {
"role": "assistant",
"content": "The answer is 4.",
"reasoning_content": "Let me add 2+2. 2+2=4." ← Model's thinking
}
}
The client MUST include reasoning_content from the prior assistant message:
Client → Server
{
"messages": [
{"role": "user", "content": "What is 2+2?"},
{
"role": "assistant",
"content": "The answer is 4.",
"reasoning_content": "Let me add 2+2. 2+2=4." ← PRESERVED
},
{"role": "user", "content": "Are you sure?"}
],
"chat_template_kwargs": {"enable_thinking": true, "clear_thinking": false}
}
The server's Jinja template renders this as:
<|user|>What is 2+2?
<|assistant|><think>Let me add 2+2. 2+2=4.</think>The answer is 4.
<|user|>Are you sure?
<|assistant|><think>
Now the model can see its prior reasoning and respond coherently.
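The rendering step can be approximated in Python. This is an illustrative sketch of the template behavior described above, not the model's actual Jinja template:

```python
def render_prompt(messages: list[dict]) -> str:
    """Approximate the chat template's handling of reasoning_content."""
    parts = []
    for m in messages:
        if m["role"] == "user":
            parts.append(f"<|user|>{m['content']}")
        elif m["role"] == "assistant":
            # Preserved reasoning is folded back in as a <think> block
            think = m.get("reasoning_content", "")
            parts.append(f"<|assistant|><think>{think}</think>{m['content']}")
    parts.append("<|assistant|><think>")  # generation prompt for the next turn
    return "\n".join(parts)

history = [
    {"role": "user", "content": "What is 2+2?"},
    {"role": "assistant", "content": "The answer is 4.",
     "reasoning_content": "Let me add 2+2. 2+2=4."},
    {"role": "user", "content": "Are you sure?"},
]
print(render_prompt(history))
```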
Many clients strip reasoning_content, breaking the chain:
Client → Server
{
"messages": [
{"role": "user", "content": "What is 2+2?"},
{
"role": "assistant",
"content": "The answer is 4."
← reasoning_content MISSING
},
{"role": "user", "content": "Are you sure?"}
]
}
The model cannot see its prior thinking, leading to inconsistent responses.
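A quick client-side audit can catch this before blaming the backend. A minimal sketch (the function name is ours, not part of the harness):

```python
def find_stripped_turns(messages: list[dict]) -> list[int]:
    """Indices of assistant turns whose reasoning_content was dropped."""
    return [i for i, m in enumerate(messages)
            if m.get("role") == "assistant" and "reasoning_content" not in m]

broken = [
    {"role": "user", "content": "What is 2+2?"},
    {"role": "assistant", "content": "The answer is 4."},  # thinking stripped
    {"role": "user", "content": "Are you sure?"},
]
print(find_stripped_turns(broken))  # [1]
```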
┌─────────────────────────────────────────────────────────────────────────────┐
│ TEST SCENARIOS │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ Test 1: Real Client → Stub Server (Anthropic API) │
│ ┌──────────────┐ ┌──────────────────────────────┐ │
│ │ Claude Code │ ───► │ Stub Anthropic Server │ │
│ │ (real) │ ◄─── │ Generates: [THINK-ANT-T1-*] │ │
│ └──────────────┘ │ Validates: tokens returned │ │
│ └──────────────────────────────┘ │
│ │
│ Test 2: Real Client → Stub Server (OpenAI API) │
│ ┌──────────────┐ ┌──────────────────────────────┐ │
│ │ OpenCode │ ───► │ Stub OpenAI Server │ │
│ │ (real) │ ◄─── │ Generates: [THINK-OAI-T1-*] │ │
│ └──────────────┘ │ Validates: tokens returned │ │
│ └──────────────────────────────┘ │
│ │
│ Test 3: Stub Client → Real Server (llama.cpp) │
│ ┌──────────────┐ ┌──────────────┐ │
│ │ Stub Client │ ───► │ llama.cpp │ ───► Model │
│ │ │ ◄─── │ (real) │ │
│ └──────────────┘ └──────────────┘ │
│ │
│ Test 4: Stub Client → Stub Server (Harness Validation) │
│ ┌──────────────┐ ┌──────────────────────────────┐ │
│ │ Stub Client │ ───► │ Stub Server │ │
│ │ (controlled) │ ◄─── │ (controlled) │ │
│ └──────────────┘ │ Validates: 100% preservation │ │
│ └──────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
git clone https://github.com/pascual-family/glm-4.7-thinking-validator.git
cd glm-4.7-thinking-validator
uv sync --extra dev
# For automated client testing, install expect
# macOS:
brew install expect
# Ubuntu/Debian:
sudo apt-get install expect
# Also ensure bun is installed for running JS clients
curl -fsSL https://bun.sh/install | bash
The harness embeds unique traceable tokens in responses that MUST be echoed back:
[CATEGORY-API-TURN-UUID8]
Examples:
[THINK-OAI-T1-a1b2c3d4] # Thinking from turn 1, OpenAI API
[CONTENT-ANT-T2-e5f6g7h8] # Content from turn 2, Anthropic API
[TOOL_IN-OAI-T1-i9j0k1l2] # Tool input from turn 1
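Generating such a token is straightforward. A sketch of the scheme (the function name is illustrative, not the harness's actual API):

```python
import re
import uuid

def make_token(category: str, api: str, turn: int) -> str:
    """Build a traceable token in the form [CATEGORY-API-TURN-UUID8]."""
    return f"[{category}-{api}-T{turn}-{uuid.uuid4().hex[:8]}]"

token = make_token("THINK", "OAI", 1)
print(token)
assert re.fullmatch(r"\[THINK-OAI-T1-[0-9a-f]{8}\]", token)
```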
| Category | OpenAI Location | Anthropic Location | Purpose |
|---|---|---|---|
| THINK | message.reasoning_content | content[type="thinking"] | Validates thinking preservation |
| CONTENT | message.content | content[type="text"] | Validates content preservation |
| TOOL_ID | tool_calls[].id | content[type="tool_use"].id | Validates tool call ID |
| TOOL_IN | tool_calls[].function.arguments | content[type="tool_use"].input | Validates tool input |
| TOOL_OUT | Tool result content | content[type="tool_result"] | Validates tool output |
Test whether an OpenAI-compatible server correctly handles reasoning_content:
# Test against any OpenAI-compatible endpoint
uv run python -m glm_thinking_validator.stubs.stub_openai_client \
--url http://localhost:8080/v1 \
--turns 3
# Output shows which tokens were preserved/lost
# PASS: All tokens preserved
# FAIL: Missing tokens: [THINK-OAI-T1-abc12345]
The stub client:
- Sends requests with chat_template_kwargs: {enable_thinking: true, clear_thinking: false}
- Includes reasoning_content for prior assistant turns
- Validates that tokens from previous turns appear in subsequent requests
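The per-turn validation boils down to a substring scan of the serialized request: every token the server issued in earlier responses must reappear somewhere in the next request body. A sketch of that check (names are ours):

```python
import json

def missing_tokens(issued: set[str], request_body: dict) -> set[str]:
    """Tokens the server handed out that the next request failed to echo."""
    blob = json.dumps(request_body)
    return {t for t in issued if t not in blob}

issued = {"[THINK-OAI-T1-aabbccdd]", "[CONTENT-OAI-T1-11223344]"}
request = {"messages": [
    {"role": "user", "content": "Hi"},
    {"role": "assistant", "content": "[CONTENT-OAI-T1-11223344] Hello!"},
    # reasoning_content was stripped, so the THINK token never comes back
    {"role": "user", "content": "Continue"},
]}
print(missing_tokens(issued, request))  # {'[THINK-OAI-T1-aabbccdd]'}
```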
# Start OpenAI-compatible stub server
uv run glm-validator-openai --port 8090
# Or Anthropic-compatible stub server
uv run glm-validator-anthropic --port 8092
The server prints its URL to stdout:
Stub OpenAI server running at http://127.0.0.1:8090/v1
Validation report: http://127.0.0.1:8090/v1/validation_report
For OpenCode:
Create/edit opencode.json:
{
"providers": {
"test": {
"type": "openai",
"baseURL": "http://127.0.0.1:8090/v1",
"apiKey": "test-key"
}
},
"model": "test/stub-model"
}
Then run OpenCode:
opencode
# Send a few messages, then check the validation report
curl http://127.0.0.1:8090/v1/validation_report
For Claude Code:
# Set environment to point to stub server
ANTHROPIC_BASE_URL=http://127.0.0.1:8092 claude
# Send messages, then check validation
curl http://127.0.0.1:8092/v1/validation_report
curl http://127.0.0.1:8090/v1/validation_report | jq
Output:
{
"total": 6,
"returned": 6,
"missing": [],
"missing_count": 0,
"assessment": "PASS: All expected tokens were returned",
"by_category": {
"THINK": {"tokens": ["[THINK-OAI-T1-abc12345]", "[THINK-OAI-T2-def67890]"], "missing": []},
"CONTENT": {"tokens": ["[CONTENT-OAI-T1-111]", "[CONTENT-OAI-T2-222]"], "missing": []}
}
}
Test a client binary automatically:
# Test OpenCode binary for compliance
uv run python -m glm_thinking_validator.compliance \
--client opencode \
--binary /usr/local/bin/opencode \
--config /path/to/opencode.json
# Test Claude Code binary
uv run python -m glm_thinking_validator.compliance \
--client claude-code \
--binary /usr/local/bin/claude
# Output:
# Starting stub server on port 8090...
# Launching client: /usr/local/bin/opencode
# Sending test prompts...
# Collecting validation report...
#
# === COMPLIANCE REPORT ===
# Client: opencode v1.0.0
# Total tokens: 6
# Preserved: 2 (33%)
# Missing: 4 (67%)
# - [THINK-OAI-T1-abc12345] (reasoning_content not returned)
# - [THINK-OAI-T2-def67890] (reasoning_content not returned)
#
# VERDICT: FAIL - Client does not preserve reasoning_content
| Endpoint | Method | Description |
|---|---|---|
| /v1/chat/completions | POST | Chat completions with token tracking |
| /v1/validation_report | GET | Get token preservation report |
| /v1/reset | POST | Reset token registry for new test |
| /health | GET | Health check |
| Endpoint | Method | Description |
|---|---|---|
| /v1/messages | POST | Messages API with thinking blocks |
| /v1/validation_report | GET | Get token preservation report |
| /v1/reset | POST | Reset token registry for new test |
| /health | GET | Health check |
For preserved thinking to work correctly, clients MUST:
{
"chat_template_kwargs": {
"enable_thinking": true,
"clear_thinking": false
}
}
{
"messages": [
{"role": "user", "content": "Hello"},
{
"role": "assistant",
"content": "Hi there!",
"reasoning_content": "The user greeted me, I should respond friendly."
},
{"role": "user", "content": "What did you think about?"}
]
}
{
"messages": [
{"role": "user", "content": "Hello"},
{
"role": "assistant",
"content": [
{"type": "thinking", "thinking": "User greeted me..."},
{"type": "text", "text": "Hi there!"}
]
},
{"role": "user", "content": "Continue"}
]
}
# Run all tests (harness validation + unit tests)
uv run pytest -v
# Run only harness validation tests
uv run pytest tests/test_harness_validation.py -v
# Run llama.cpp passthrough tests (requires running llama-server)
LLAMACPP_URL=http://localhost:8080 uv run pytest tests/test_llamacpp_oai_passthrough.py -v
| Test | Expected | Notes |
|---|---|---|
| Harness validation | PASS | Validates test infrastructure |
| OpenCode preservation | FAIL | Known issue: Vercel AI SDK drops reasoning_content |
| Claude Code preservation | PASS | Should preserve thinking blocks |
| llama.cpp OAI passthrough | PASS | Server correctly handles reasoning_content |
This section documents how to test specific published versions of clients and backends, and tracks which versions are known to be broken.
| Component | Pinned Version | Latest | Preserved Thinking | Notes |
|---|---|---|---|---|
| OpenCode | 0.3.0 | 0.3.x | BROKEN | Vercel AI SDK strips reasoning_content |
| Claude Code | 1.0.0 | 1.x | UNTESTED | Should work (native Anthropic client) |
| Droid | - | 0.x | UNTESTED | Uses Vercel AI SDK (likely broken) |
| llama.cpp | b4712 | latest | PARTIAL | OAI API works, Anthropic API broken |
| vLLM | 0.6.0 | 0.6.x | UNTESTED | Needs reasoning_content passthrough test |
| MLX | 0.21.0 | 0.21.x | UNTESTED | Needs reasoning_content passthrough test |
# Run automated test with expect scripting
./scripts/expect/run-opencode-test.sh 0.3.0
# Or test latest version
./scripts/expect/run-opencode-test.sh latest
# Start stub server
uv run glm-validator-openai --port 8090 &
# Create test config
cat > /tmp/opencode-test.json << 'EOF'
{
"providers": {
"stub": {
"type": "openai",
"baseURL": "http://127.0.0.1:8090/v1",
"apiKey": "test"
}
},
"model": "stub/stub-model"
}
EOF
# Run OpenCode with test config
OPENCODE_CONFIG=/tmp/opencode-test.json bunx opencode-ai@0.3.0
# After sending 2+ messages, check validation
curl -s http://127.0.0.1:8090/v1/validation_report | jq '.assessment'
# Expected: "FAIL: X tokens missing"
# Run automated test with expect scripting
./scripts/expect/run-claude-code-test.sh 1.0.0
# Or test latest version
./scripts/expect/run-claude-code-test.sh latest
# Start stub Anthropic server
uv run glm-validator-anthropic --port 8092 &
# Run Claude Code against stub server
ANTHROPIC_BASE_URL=http://127.0.0.1:8092 bunx @anthropic-ai/claude-code@1.0.0
# After sending 2+ messages, check validation
curl -s http://127.0.0.1:8092/v1/validation_report | jq '.assessment'
# Expected: "PASS: All expected tokens were returned"
# Clone and build specific version
git clone https://github.com/anthropics/droid.git
cd droid && git checkout v0.1.0
bun install && bun run build
# Start stub server
uv run glm-validator-openai --port 8090 &
# Configure droid to use stub server (check droid docs for config format)
# Run with: bun run start
# Then validate
curl -s http://127.0.0.1:8090/v1/validation_report | jq '.assessment'
# Build specific version
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp && git checkout b4712
cmake -B build && cmake --build build --config Release
# Start server with GLM model
./build/bin/llama-server \
--model /path/to/glm-4.7-flash.gguf \
--host 127.0.0.1 \
--port 8080 \
--jinja
# Test with stub client
uv run python -c "
from glm_thinking_validator.stubs.stub_openai_client import StubOpenAIClient
with StubOpenAIClient('http://127.0.0.1:8080/v1') as client:
client.reset_server() # Won't work on real server, that's OK
# Turn 1
resp1 = client.chat('What is 2+2?')
print(f'Turn 1 reasoning: {resp1.reasoning_content[:50]}...')
# Turn 2 - check if reasoning was preserved in template
resp2 = client.chat('Are you sure?')
print(f'Turn 2 reasoning: {resp2.reasoning_content[:50]}...')
"
# For deeper inspection, check llama.cpp logs for rendered prompt
# Install specific version with uv
uv pip install vllm==0.6.0
# Start vLLM server with GLM model
uv run python -m vllm.entrypoints.openai.api_server \
--model /path/to/glm-4.7-flash \
--host 127.0.0.1 \
--port 8080 \
--trust-remote-code
# Test with stub client
uv run python -c "
from glm_thinking_validator.stubs.stub_openai_client import StubOpenAIClient
with StubOpenAIClient('http://127.0.0.1:8080/v1') as client:
resp1 = client.chat('Hello')
print(f'Has reasoning_content: {bool(resp1.reasoning_content)}')
resp2 = client.chat('Continue')
# Check if vLLM passed reasoning_content to template
"
# Install specific version with uvx (one-off)
uvx --from mlx-lm==0.21.0 mlx_lm.server \
--model mlx-community/glm-4-9b-chat-4bit \
--port 8080
# Or install into project
uv pip install mlx-lm==0.21.0
uv run mlx_lm.server --model mlx-community/glm-4-9b-chat-4bit --port 8080
# Test with stub client
uv run python -c "
from glm_thinking_validator.stubs.stub_openai_client import StubOpenAIClient
with StubOpenAIClient('http://127.0.0.1:8080/v1') as client:
resp1 = client.chat('Hello')
print(f'Has reasoning_content: {bool(resp1.reasoning_content)}')
"
Affected versions: All current versions (0.3.x)
The Vercel AI SDK's @ai-sdk/openai provider strips reasoning_content from messages before sending them to the server. This breaks preserved thinking for any client using this SDK.
Root cause: @ai-sdk/openai uses a strict message schema that doesn't include reasoning_content:
// ai-sdk/packages/openai/src/openai-chat-language-model.ts
// Message type only includes: role, content, name, tool_calls, tool_call_id
// reasoning_content is silently dropped
- Issue: https://github.com/vercel/ai/issues/XXXX
- Workaround: Use raw HTTP requests or patch the SDK
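A sketch of the raw-HTTP workaround: build the request body by hand so reasoning_content survives serialization. The base URL and model name are placeholders, and this assumes any OpenAI-compatible endpoint:

```python
import json
import urllib.request

def build_body(messages: list[dict]) -> dict:
    """Assemble the request body the SDK refuses to send intact."""
    return {
        "model": "glm-4.7-flash",  # placeholder model name
        "messages": messages,      # reasoning_content passes through untouched
        "chat_template_kwargs": {"enable_thinking": True,
                                 "clear_thinking": False},
    }

def chat_raw(base_url: str, messages: list[dict]) -> dict:
    """POST directly to /chat/completions, bypassing the SDK's schema."""
    req = urllib.request.Request(
        base_url + "/chat/completions",
        data=json.dumps(build_body(messages)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```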
Affected versions: b4712 and earlier
The convert_anthropic_to_oai() function in llama.cpp does not convert Anthropic thinking content blocks to OpenAI reasoning_content. Thinking blocks are silently discarded.
Root cause: common/chat.cpp line ~450:
// Only converts "text" and "tool_use" blocks
// "thinking" blocks are ignored
- PR needed: Add thinking block → reasoning_content conversion
- Workaround: Use OpenAI API format directly
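The missing conversion is simple to state. An illustrative Python sketch of what it should do (the real fix would land in llama.cpp's C++ code in common/chat.cpp):

```python
def anthropic_to_oai(message: dict) -> dict:
    """Fold Anthropic content blocks into an OpenAI-style message,
    keeping thinking blocks as reasoning_content (the step llama.cpp skips)."""
    out = {"role": message["role"], "content": ""}
    for block in message.get("content", []):
        if block["type"] == "text":
            out["content"] += block["text"]
        elif block["type"] == "thinking":
            out["reasoning_content"] = block["thinking"]
    return out

msg = {"role": "assistant", "content": [
    {"type": "thinking", "thinking": "2+2=4."},
    {"type": "text", "text": "The answer is 4."},
]}
print(anthropic_to_oai(msg))
```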
Status: Needs testing
vLLM's OpenAI-compatible API may not pass reasoning_content through to the chat template. Testing required.
Status: Needs testing
MLX's server implementation needs verification for reasoning_content support.
- Create a test file in tests/
- Use TokenRegistry from glm_thinking_validator.stubs.token_registry
- Use StubOpenAIClient or StubAnthropicClient for controlled testing
src/glm_thinking_validator/
├── __init__.py
├── stubs/
│ ├── token_registry.py # Token generation and tracking
│ ├── stub_openai_server.py # OpenAI-compatible stub server
│ ├── stub_openai_client.py # OpenAI stub clients (correct + broken)
│ ├── stub_anthropic_server.py # Anthropic-compatible stub server
│ ├── stub_anthropic_client.py # Anthropic stub clients
│ └── llamacpp_inspector.py # llama.cpp template inspection
tests/
├── test_harness_validation.py # Validates test infrastructure
├── test_llamacpp_oai_passthrough.py # llama.cpp passthrough tests
└── conftest.py # Pytest fixtures
MIT