Test harness for validating preserved thinking across clients, SDKs, and backends for GLM-4.7+ models.
GLM-4.7+ models support "preserved thinking" - the ability to maintain reasoning context (reasoning_content) across multi-turn conversations. However, many clients and SDKs fail to properly pass this field through, breaking the model's ability to reference its prior reasoning.
This validator helps identify which component in the stack is failing:
Client (OpenCode, Claude Code) → SDK (Vercel AI, OpenAI) → Backend (llama.cpp, vLLM)
GLM-4.7+ models output their chain-of-thought reasoning in a separate reasoning_content field (OpenAI API) or thinking content blocks (Anthropic API). For the model to reference its prior reasoning in multi-turn conversations, clients must echo this content back.
Client → Server
{
"messages": [{"role": "user", "content": "What is 2+2?"}],
"chat_template_kwargs": {"enable_thinking": true, "clear_thinking": false}
}
Server → Client
{
"message": {
"role": "assistant",
"content": "The answer is 4.",
"reasoning_content": "Let me add 2+2. 2+2=4." ← Model's thinking
}
}
The client MUST include reasoning_content from the prior assistant message:
Client → Server
{
"messages": [
{"role": "user", "content": "What is 2+2?"},
{
"role": "assistant",
"content": "The answer is 4.",
"reasoning_content": "Let me add 2+2. 2+2=4." ← PRESERVED
},
{"role": "user", "content": "Are you sure?"}
],
"chat_template_kwargs": {"enable_thinking": true, "clear_thinking": false}
}
The server's Jinja template renders this as:
<|user|>What is 2+2?
<|assistant|><think>Let me add 2+2. 2+2=4.</think>The answer is 4.
<|user|>Are you sure?
<|assistant|><think>
Now the model can see its prior reasoning and respond coherently.
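The rendering step can be approximated in Python. This is an illustrative sketch of the template behavior described above, not the model's actual Jinja template:

```python
def render_prompt(messages: list[dict]) -> str:
    """Approximate the chat template's handling of reasoning_content."""
    parts = []
    for m in messages:
        if m["role"] == "user":
            parts.append(f"<|user|>{m['content']}")
        elif m["role"] == "assistant":
            # Preserved reasoning is folded back in as a <think> block
            think = m.get("reasoning_content", "")
            parts.append(f"<|assistant|><think>{think}</think>{m['content']}")
    parts.append("<|assistant|><think>")  # generation prompt for the next turn
    return "\n".join(parts)

history = [
    {"role": "user", "content": "What is 2+2?"},
    {"role": "assistant", "content": "The answer is 4.",
     "reasoning_content": "Let me add 2+2. 2+2=4."},
    {"role": "user", "content": "Are you sure?"},
]
print(render_prompt(history))
```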
Many clients strip reasoning_content, breaking the chain:
Client → Server
{
"messages": [
{"role": "user", "content": "What is 2+2?"},
{
"role": "assistant",
"content": "The answer is 4."
← reasoning_content MISSING
},
{"role": "user", "content": "Are you sure?"}
]
}
The model cannot see its prior thinking, leading to inconsistent responses.
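A quick client-side audit can catch this before blaming the backend. A minimal sketch (the function name is ours, not part of the harness):

```python
def find_stripped_turns(messages: list[dict]) -> list[int]:
    """Indices of assistant turns whose reasoning_content was dropped."""
    return [i for i, m in enumerate(messages)
            if m.get("role") == "assistant" and "reasoning_content" not in m]

broken = [
    {"role": "user", "content": "What is 2+2?"},
    {"role": "assistant", "content": "The answer is 4."},  # thinking stripped
    {"role": "user", "content": "Are you sure?"},
]
print(find_stripped_turns(broken))  # [1]
```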
┌─────────────────────────────────────────────────────────────────────────────┐
│ TEST SCENARIOS │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ Test 1: Real Client → Stub Server (Anthropic API) │
│ ┌──────────────┐ ┌──────────────────────────────┐ │
│ │ Claude Code │ ───► │ Stub Anthropic Server │ │
│ │ (real) │ ◄─── │ Generates: [THINK-ANT-T1-*] │ │
│ └──────────────┘ │ Validates: tokens returned │ │
│ └──────────────────────────────┘ │
│ │
│ Test 2: Real Client → Stub Server (OpenAI API) │
│ ┌──────────────┐ ┌──────────────────────────────┐ │
│ │ OpenCode │ ───► │ Stub OpenAI Server │ │
│ │ (real) │ ◄─── │ Generates: [THINK-OAI-T1-*] │ │
│ └──────────────┘ │ Validates: tokens returned │ │
│ └──────────────────────────────┘ │
│ │
│ Test 3: Stub Client → Real Server (llama.cpp) │
│ ┌──────────────┐ ┌──────────────┐ │
│ │ Stub Client │ ───► │ llama.cpp │ ───► Model │
│ │ │ ◄─── │ (real) │ │
│ └──────────────┘ └──────────────┘ │
│ │
│ Test 4: Stub Client → Stub Server (Harness Validation) │
│ ┌──────────────┐ ┌──────────────────────────────┐ │
│ │ Stub Client │ ───► │ Stub Server │ │
│ │ (controlled) │ ◄─── │ (controlled) │ │
│ └──────────────┘ │ Validates: 100% preservation │ │
│ └──────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
git clone https://github.com/pascual-family/glm-4.7-thinking-validator.git
cd glm-4.7-thinking-validator
uv sync --extra dev
# For automated client testing, install expect
# macOS:
brew install expect
# Ubuntu/Debian:
sudo apt-get install expect
# Also ensure bun is installed for running JS clients
curl -fsSL https://bun.sh/install | bash
The harness embeds unique traceable tokens in responses that MUST be echoed back:
[CATEGORY-API-TURN-UUID8]
Examples:
[THINK-OAI-T1-a1b2c3d4] # Thinking from turn 1, OpenAI API
[CONTENT-ANT-T2-e5f6g7h8] # Content from turn 2, Anthropic API
[TOOL_IN-OAI-T1-i9j0k1l2] # Tool input from turn 1
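Generating such a token is straightforward. A sketch of the scheme (the function name is illustrative, not the harness's actual API):

```python
import re
import uuid

def make_token(category: str, api: str, turn: int) -> str:
    """Build a traceable token in the form [CATEGORY-API-TURN-UUID8]."""
    return f"[{category}-{api}-T{turn}-{uuid.uuid4().hex[:8]}]"

token = make_token("THINK", "OAI", 1)
print(token)
assert re.fullmatch(r"\[THINK-OAI-T1-[0-9a-f]{8}\]", token)
```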
| Category | OpenAI Location | Anthropic Location | Purpose |
|---|---|---|---|
| THINK | message.reasoning_content | content[type="thinking"] | Validates thinking preservation |
| CONTENT | message.content | content[type="text"] | Validates content preservation |
| TOOL_ID | tool_calls[].id | content[type="tool_use"].id | Validates tool call ID |
| TOOL_IN | tool_calls[].function.arguments | content[type="tool_use"].input | Validates tool input |
| TOOL_OUT | Tool result content | content[type="tool_result"] | Validates tool output |
Test whether an OpenAI-compatible server correctly handles reasoning_content:
# Test against any OpenAI-compatible endpoint
uv run python -m glm_thinking_validator.stubs.stub_openai_client \
--url http://localhost:8080/v1 \
--turns 3
# Output shows which tokens were preserved/lost
# PASS: All tokens preserved
# FAIL: Missing tokens: [THINK-OAI-T1-abc12345]
The stub client:
- Sends requests with chat_template_kwargs: {enable_thinking: true, clear_thinking: false}
- Includes reasoning_content for prior assistant turns
- Validates that tokens from previous turns appear in subsequent requests
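The per-turn validation boils down to a substring scan of the serialized request: every token the server issued in earlier responses must reappear somewhere in the next request body. A sketch of that check (names are ours):

```python
import json

def missing_tokens(issued: set[str], request_body: dict) -> set[str]:
    """Tokens the server handed out that the next request failed to echo."""
    blob = json.dumps(request_body)
    return {t for t in issued if t not in blob}

issued = {"[THINK-OAI-T1-aabbccdd]", "[CONTENT-OAI-T1-11223344]"}
request = {"messages": [
    {"role": "user", "content": "Hi"},
    {"role": "assistant", "content": "[CONTENT-OAI-T1-11223344] Hello!"},
    # reasoning_content was stripped, so the THINK token never comes back
    {"role": "user", "content": "Continue"},
]}
print(missing_tokens(issued, request))  # {'[THINK-OAI-T1-aabbccdd]'}
```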
# Start OpenAI-compatible stub server
uv run glm-validator-openai --port 8090
# Or Anthropic-compatible stub server
uv run glm-validator-anthropic --port 8092
The server prints its URL to stdout:
Stub OpenAI server running at http://127.0.0.1:8090/v1
Validation report: http://127.0.0.1:8090/v1/validation_report
For OpenCode:
Create/edit opencode.json:
{
"providers": {
"test": {
"type": "openai",
"baseURL": "http://127.0.0.1:8090/v1",
"apiKey": "test-key"
}
},
"model": "test/stub-model"
}
Then run OpenCode:
opencode
# Send a few messages, then check the validation report
curl http://127.0.0.1:8090/v1/validation_report
For Claude Code:
# Set environment to point to stub server
ANTHROPIC_BASE_URL=http://127.0.0.1:8092 claude
# Send messages, then check validation
curl http://127.0.0.1:8092/v1/validation_report
curl http://127.0.0.1:8090/v1/validation_report | jq
Output:
{
"total": 6,
"returned": 6,
"missing": [],
"missing_count": 0,
"assessment": "PASS: All expected tokens were returned",
"by_category": {
"THINK": {"tokens": ["[THINK-OAI-T1-abc12345]", "[THINK-OAI-T2-def67890]"], "missing": []},
"CONTENT": {"tokens": ["[CONTENT-OAI-T1-111]", "[CONTENT-OAI-T2-222]"], "missing": []}
}
}
Test a client binary automatically:
# Test OpenCode binary for compliance
uv run python -m glm_thinking_validator.compliance \
--client opencode \
--binary /usr/local/bin/opencode \
--config /path/to/opencode.json
# Test Claude Code binary
uv run python -m glm_thinking_validator.compliance \
--client claude-code \
--binary /usr/local/bin/claude
# Output:
# Starting stub server on port 8090...
# Launching client: /usr/local/bin/opencode
# Sending test prompts...
# Collecting validation report...
#
# === COMPLIANCE REPORT ===
# Client: opencode v1.0.0
# Total tokens: 6
# Preserved: 2 (33%)
# Missing: 4 (67%)
# - [THINK-OAI-T1-abc12345] (reasoning_content not returned)
# - [THINK-OAI-T2-def67890] (reasoning_content not returned)
#
# VERDICT: FAIL - Client does not preserve reasoning_content
| Endpoint | Method | Description |
|---|---|---|
| /v1/chat/completions | POST | Chat completions with token tracking |
| /v1/validation_report | GET | Get token preservation report |
| /v1/reset | POST | Reset token registry for new test |
| /health | GET | Health check |
| Endpoint | Method | Description |
|---|---|---|
| /v1/messages | POST | Messages API with thinking blocks |
| /v1/validation_report | GET | Get token preservation report |
| /v1/reset | POST | Reset token registry for new test |
| /health | GET | Health check |
For preserved thinking to work correctly, clients MUST:
{
"chat_template_kwargs": {
"enable_thinking": true,
"clear_thinking": false
}
}
{
"messages": [
{"role": "user", "content": "Hello"},
{
"role": "assistant",
"content": "Hi there!",
"reasoning_content": "The user greeted me, I should respond friendly."
},
{"role": "user", "content": "What did you think about?"}
]
}
{
"messages": [
{"role": "user", "content": "Hello"},
{
"role": "assistant",
"content": [
{"type": "thinking", "thinking": "User greeted me..."},
{"type": "text", "text": "Hi there!"}
]
},
{"role": "user", "content": "Continue"}
]
}
# Run all tests (harness validation + unit tests)
uv run pytest -v
# Run only harness validation tests
uv run pytest tests/test_harness_validation.py -v
# Run llama.cpp passthrough tests (requires running llama-server)
LLAMACPP_URL=http://localhost:8080 uv run pytest tests/test_llamacpp_oai_passthrough.py -v
| Test | Expected | Notes |
|---|---|---|
| Harness validation | PASS | Validates test infrastructure |
| OpenCode preservation | FAIL | Known issue: Vercel AI SDK drops reasoning_content |
| Claude Code preservation | PASS | Should preserve thinking blocks |
| llama.cpp OAI passthrough | PASS | Server correctly handles reasoning_content |
This section documents how to test specific published versions of clients and backends, and tracks which versions are known to be broken.
| Component | Pinned Version | Latest | Preserved Thinking | Notes |
|---|---|---|---|---|
| OpenCode | 0.3.0 | 0.3.x | BROKEN | Vercel AI SDK strips reasoning_content |
| Claude Code | 1.0.0 | 1.x | UNTESTED | Should work (native Anthropic client) |
| Droid | - | 0.x | UNTESTED | Uses Vercel AI SDK (likely broken) |
| llama.cpp | b4712 | latest | PARTIAL | OAI API works, Anthropic API broken |
| vLLM | 0.6.0 | 0.6.x | UNTESTED | Needs reasoning_content passthrough test |
| MLX | 0.21.0 | 0.21.x | UNTESTED | Needs reasoning_content passthrough test |
# Run automated test with expect scripting
./scripts/expect/run-opencode-test.sh 0.3.0
# Or test latest version
./scripts/expect/run-opencode-test.sh latest
# Start stub server
uv run glm-validator-openai --port 8090 &
# Create test config
cat > /tmp/opencode-test.json << 'EOF'
{
"providers": {
"stub": {
"type": "openai",
"baseURL": "http://127.0.0.1:8090/v1",
"apiKey": "test"
}
},
"model": "stub/stub-model"
}
EOF
# Run OpenCode with test config
OPENCODE_CONFIG=/tmp/opencode-test.json bunx opencode-ai@0.3.0
# After sending 2+ messages, check validation
curl -s http://127.0.0.1:8090/v1/validation_report | jq '.assessment'
# Expected: "FAIL: X tokens missing"
# Run automated test with expect scripting
./scripts/expect/run-claude-code-test.sh 1.0.0
# Or test latest version
./scripts/expect/run-claude-code-test.sh latest
# Start stub Anthropic server
uv run glm-validator-anthropic --port 8092 &
# Run Claude Code against stub server
ANTHROPIC_BASE_URL=http://127.0.0.1:8092 bunx @anthropic-ai/claude-code@1.0.0
# After sending 2+ messages, check validation
curl -s http://127.0.0.1:8092/v1/validation_report | jq '.assessment'
# Expected: "PASS: All expected tokens were returned"
# Clone and build specific version
git clone https://github.com/anthropics/droid.git
cd droid && git checkout v0.1.0
bun install && bun run build
# Start stub server
uv run glm-validator-openai --port 8090 &
# Configure droid to use stub server (check droid docs for config format)
# Run with: bun run start
# Then validate
curl -s http://127.0.0.1:8090/v1/validation_report | jq '.assessment'
# Build specific version
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp && git checkout b4712
cmake -B build && cmake --build build --config Release
# Start server with GLM model
./build/bin/llama-server \
--model /path/to/glm-4.7-flash.gguf \
--host 127.0.0.1 \
--port 8080 \
--jinja
# Test with stub client
uv run python -c "
from glm_thinking_validator.stubs.stub_openai_client import StubOpenAIClient
with StubOpenAIClient('http://127.0.0.1:8080/v1') as client:
client.reset_server() # Won't work on real server, that's OK
# Turn 1
resp1 = client.chat('What is 2+2?')
print(f'Turn 1 reasoning: {resp1.reasoning_content[:50]}...')
# Turn 2 - check if reasoning was preserved in template
resp2 = client.chat('Are you sure?')
print(f'Turn 2 reasoning: {resp2.reasoning_content[:50]}...')
"
# For deeper inspection, check llama.cpp logs for rendered prompt
# Install specific version with uv
uv pip install vllm==0.6.0
# Start vLLM server with GLM model
uv run python -m vllm.entrypoints.openai.api_server \
--model /path/to/glm-4.7-flash \
--host 127.0.0.1 \
--port 8080 \
--trust-remote-code
# Test with stub client
uv run python -c "
from glm_thinking_validator.stubs.stub_openai_client import StubOpenAIClient
with StubOpenAIClient('http://127.0.0.1:8080/v1') as client:
resp1 = client.chat('Hello')
print(f'Has reasoning_content: {bool(resp1.reasoning_content)}')
resp2 = client.chat('Continue')
# Check if vLLM passed reasoning_content to template
"
# Install specific version with uvx (one-off)
uvx --from mlx-lm==0.21.0 mlx_lm.server \
--model mlx-community/glm-4-9b-chat-4bit \
--port 8080
# Or install into project
uv pip install mlx-lm==0.21.0
uv run mlx_lm.server --model mlx-community/glm-4-9b-chat-4bit --port 8080
# Test with stub client
uv run python -c "
from glm_thinking_validator.stubs.stub_openai_client import StubOpenAIClient
with StubOpenAIClient('http://127.0.0.1:8080/v1') as client:
resp1 = client.chat('Hello')
print(f'Has reasoning_content: {bool(resp1.reasoning_content)}')
"
Affected versions: All current versions (0.3.x)
The Vercel AI SDK's @ai-sdk/openai provider strips reasoning_content from messages before sending them to the server. This breaks preserved thinking for any client using this SDK.
Root cause: @ai-sdk/openai uses a strict message schema that doesn't include reasoning_content:
// ai-sdk/packages/openai/src/openai-chat-language-model.ts
// Message type only includes: role, content, name, tool_calls, tool_call_id
// reasoning_content is silently dropped
- Issue: https://github.com/vercel/ai/issues/XXXX
- Workaround: Use raw HTTP requests or patch the SDK
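A sketch of the raw-HTTP workaround: build the request body by hand so reasoning_content survives serialization. The base URL and model name are placeholders, and this assumes any OpenAI-compatible endpoint:

```python
import json
import urllib.request

def build_body(messages: list[dict]) -> dict:
    """Assemble the request body the SDK refuses to send intact."""
    return {
        "model": "glm-4.7-flash",  # placeholder model name
        "messages": messages,      # reasoning_content passes through untouched
        "chat_template_kwargs": {"enable_thinking": True,
                                 "clear_thinking": False},
    }

def chat_raw(base_url: str, messages: list[dict]) -> dict:
    """POST directly to /chat/completions, bypassing the SDK's schema."""
    req = urllib.request.Request(
        base_url + "/chat/completions",
        data=json.dumps(build_body(messages)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```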
Affected versions: b4712 and earlier
The convert_anthropic_to_oai() function in llama.cpp does not convert Anthropic thinking content blocks to OpenAI reasoning_content. Thinking blocks are silently discarded.
Root cause: common/chat.cpp line ~450:
// Only converts "text" and "tool_use" blocks
// "thinking" blocks are ignored
- PR needed: Add thinking block → reasoning_content conversion
- Workaround: Use OpenAI API format directly
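The missing conversion is simple to state. An illustrative Python sketch of what it should do (the real fix would land in llama.cpp's C++ code in common/chat.cpp):

```python
def anthropic_to_oai(message: dict) -> dict:
    """Fold Anthropic content blocks into an OpenAI-style message,
    keeping thinking blocks as reasoning_content (the step llama.cpp skips)."""
    out = {"role": message["role"], "content": ""}
    for block in message.get("content", []):
        if block["type"] == "text":
            out["content"] += block["text"]
        elif block["type"] == "thinking":
            out["reasoning_content"] = block["thinking"]
    return out

msg = {"role": "assistant", "content": [
    {"type": "thinking", "thinking": "2+2=4."},
    {"type": "text", "text": "The answer is 4."},
]}
print(anthropic_to_oai(msg))
```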
Status: Needs testing
vLLM's OpenAI-compatible API may not pass reasoning_content through to the chat template. Testing required.
Status: Needs testing
MLX's server implementation needs verification for reasoning_content support.
- Create a test file in tests/
- Use TokenRegistry from glm_thinking_validator.stubs.token_registry
- Use StubOpenAIClient or StubAnthropicClient for controlled testing
src/glm_thinking_validator/
├── __init__.py
├── stubs/
│ ├── token_registry.py # Token generation and tracking
│ ├── stub_openai_server.py # OpenAI-compatible stub server
│ ├── stub_openai_client.py # OpenAI stub clients (correct + broken)
│ ├── stub_anthropic_server.py # Anthropic-compatible stub server
│ ├── stub_anthropic_client.py # Anthropic stub clients
│ └── llamacpp_inspector.py # llama.cpp template inspection
tests/
├── test_harness_validation.py # Validates test infrastructure
├── test_llamacpp_oai_passthrough.py # llama.cpp passthrough tests
└── conftest.py # Pytest fixtures
MIT