GLM-4.7 Thinking Validator

Test harness for validating preserved thinking across clients, SDKs, and backends for GLM-4.7+ models.

The Problem

GLM-4.7+ models support "preserved thinking" - the ability to maintain reasoning context (reasoning_content) across multi-turn conversations. However, many clients and SDKs fail to properly pass this field through, breaking the model's ability to reference its prior reasoning.

This validator helps identify which component in the stack is failing:

Client (OpenCode, Claude Code) → SDK (Vercel AI, OpenAI) → Backend (llama.cpp, vLLM)

GLM-4.7 Preserved Thinking Message Flow

GLM-4.7+ models output their chain-of-thought reasoning in a separate reasoning_content field (OpenAI API) or thinking content blocks (Anthropic API). For the model to reference its prior reasoning in multi-turn conversations, clients must echo this content back.

Turn 1: Initial Request

Client → Server
{
  "messages": [{"role": "user", "content": "What is 2+2?"}],
  "chat_template_kwargs": {"enable_thinking": true, "clear_thinking": false}
}

Server → Client
{
  "message": {
    "role": "assistant",
    "content": "The answer is 4.",
    "reasoning_content": "Let me add 2+2. 2+2=4."   ← Model's thinking
  }
}

Turn 2: Follow-up (CORRECT)

The client MUST include reasoning_content from the prior assistant message:

Client → Server
{
  "messages": [
    {"role": "user", "content": "What is 2+2?"},
    {
      "role": "assistant",
      "content": "The answer is 4.",
      "reasoning_content": "Let me add 2+2. 2+2=4."   ← PRESERVED
    },
    {"role": "user", "content": "Are you sure?"}
  ],
  "chat_template_kwargs": {"enable_thinking": true, "clear_thinking": false}
}

The server's Jinja template renders this as:

<|user|>What is 2+2?
<|assistant|><think>Let me add 2+2. 2+2=4.</think>The answer is 4.
<|user|>Are you sure?
<|assistant|><think>

Now the model can see its prior reasoning and respond coherently.
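That rendering can be simulated in a few lines of Python (a sketch based on the template fragment above; the model's actual Jinja chat template is authoritative):

```python
def render_glm_prompt(messages):
    """Render messages into the GLM-style prompt shown above.

    Assistant turns that carry reasoning_content get a <think>...</think>
    block prepended so the model can see its prior reasoning.
    """
    parts = []
    for msg in messages:
        if msg["role"] == "user":
            parts.append(f"<|user|>{msg['content']}")
        elif msg["role"] == "assistant":
            think = msg.get("reasoning_content")
            prefix = f"<think>{think}</think>" if think else ""
            parts.append(f"<|assistant|>{prefix}{msg['content']}")
    parts.append("<|assistant|><think>")  # generation prompt for the next turn
    return "\n".join(parts)

messages = [
    {"role": "user", "content": "What is 2+2?"},
    {"role": "assistant", "content": "The answer is 4.",
     "reasoning_content": "Let me add 2+2. 2+2=4."},
    {"role": "user", "content": "Are you sure?"},
]
print(render_glm_prompt(messages))
```

If reasoning_content is stripped from the history, the `<think>` block simply never appears, which is exactly the broken case shown next.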

Turn 2: Follow-up (BROKEN - OpenCode behavior)

Many clients strip reasoning_content, breaking the chain:

Client → Server
{
  "messages": [
    {"role": "user", "content": "What is 2+2?"},
    {
      "role": "assistant",
      "content": "The answer is 4."
                                                      ← reasoning_content MISSING
    },
    {"role": "user", "content": "Are you sure?"}
  ]
}

The model cannot see its prior thinking, leading to inconsistent responses.

Architecture Overview

┌─────────────────────────────────────────────────────────────────────────────┐
│                           TEST SCENARIOS                                     │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  Test 1: Real Client → Stub Server (Anthropic API)                          │
│  ┌──────────────┐      ┌──────────────────────────────┐                     │
│  │ Claude Code  │ ───► │ Stub Anthropic Server        │                     │
│  │ (real)       │ ◄─── │ Generates: [THINK-ANT-T1-*]  │                     │
│  └──────────────┘      │ Validates: tokens returned   │                     │
│                        └──────────────────────────────┘                     │
│                                                                              │
│  Test 2: Real Client → Stub Server (OpenAI API)                             │
│  ┌──────────────┐      ┌──────────────────────────────┐                     │
│  │ OpenCode     │ ───► │ Stub OpenAI Server           │                     │
│  │ (real)       │ ◄─── │ Generates: [THINK-OAI-T1-*]  │                     │
│  └──────────────┘      │ Validates: tokens returned   │                     │
│                        └──────────────────────────────┘                     │
│                                                                              │
│  Test 3: Stub Client → Real Server (llama.cpp)                              │
│  ┌──────────────┐      ┌──────────────┐                                     │
│  │ Stub Client  │ ───► │ llama.cpp    │ ───► Model                          │
│  │              │ ◄─── │ (real)       │                                     │
│  └──────────────┘      └──────────────┘                                     │
│                                                                              │
│  Test 4: Stub Client → Stub Server (Harness Validation)                     │
│  ┌──────────────┐      ┌──────────────────────────────┐                     │
│  │ Stub Client  │ ───► │ Stub Server                  │                     │
│  │ (controlled) │ ◄─── │ (controlled)                 │                     │
│  └──────────────┘      │ Validates: 100% preservation │                     │
│                        └──────────────────────────────┘                     │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

Installation

git clone https://github.com/pascual-family/glm-4.7-thinking-validator.git
cd glm-4.7-thinking-validator
uv sync --extra dev

# For automated client testing, install expect
# macOS:
brew install expect

# Ubuntu/Debian:
sudo apt-get install expect

# Also ensure bun is installed for running JS clients
curl -fsSL https://bun.sh/install | bash

Traceable Token System

The harness embeds unique traceable tokens in responses that MUST be echoed back:

[CATEGORY-API-TURN-UUID8]

Examples:
[THINK-OAI-T1-a1b2c3d4]      # Thinking from turn 1, OpenAI API
[CONTENT-ANT-T2-e5f6g7h8]    # Content from turn 2, Anthropic API
[TOOL_IN-OAI-T1-i9j0k1l2]    # Tool input from turn 1
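Tokens of this shape are easy to generate and parse (a sketch; the harness's own TokenRegistry implementation may differ in detail):

```python
import re
import uuid

# [CATEGORY-API-TURN-UUID8], e.g. [THINK-OAI-T1-a1b2c3d4]
TOKEN_RE = re.compile(r"\[([A-Z_]+)-(OAI|ANT)-T(\d+)-([0-9a-z]{8})\]")

def make_token(category, api, turn):
    """Build a traceable token with a random 8-character suffix."""
    return f"[{category}-{api}-T{turn}-{uuid.uuid4().hex[:8]}]"

def parse_token(token):
    """Return (category, api, turn, uuid8), or None if it doesn't match."""
    m = TOKEN_RE.fullmatch(token)
    if not m:
        return None
    return m.group(1), m.group(2), int(m.group(3)), m.group(4)
```

Because each suffix is unique, a simple substring search over subsequent request bodies is enough to tell whether a token survived the round trip.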

Token Categories

| Category | OpenAI Location | Anthropic Location | Purpose |
|----------|-----------------|--------------------|---------|
| THINK | message.reasoning_content | content[type="thinking"] | Validates thinking preservation |
| CONTENT | message.content | content[type="text"] | Validates content preservation |
| TOOL_ID | tool_calls[].id | content[type="tool_use"].id | Validates tool call ID |
| TOOL_IN | tool_calls[].function.arguments | content[type="tool_use"].input | Validates tool input |
| TOOL_OUT | Tool result content | content[type="tool_result"] | Validates tool output |

Usage

1. Test a Backend Server (Stub Client → Your Server)

Test whether an OpenAI-compatible server correctly handles reasoning_content:

# Test against any OpenAI-compatible endpoint
uv run python -m glm_thinking_validator.stubs.stub_openai_client \
    --url http://localhost:8080/v1 \
    --turns 3

# Output shows which tokens were preserved/lost
# PASS: All tokens preserved
# FAIL: Missing tokens: [THINK-OAI-T1-abc12345]

The stub client:

  1. Sends requests with chat_template_kwargs: {enable_thinking: true, clear_thinking: false}
  2. Includes reasoning_content for prior assistant turns
  3. Validates tokens from previous turns appear in subsequent requests
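Echoing reasoning_content back (step 2) amounts to keeping the server's assistant message verbatim in the conversation history. A minimal stdlib-only sketch of a compliant client turn (build_payload and chat_turn are illustrative names, not part of the package):

```python
import json
import urllib.request

def build_payload(history, user_text):
    """Build a request body that echoes prior reasoning_content verbatim."""
    return {
        "model": "stub-model",
        "messages": history + [{"role": "user", "content": user_text}],
        "chat_template_kwargs": {"enable_thinking": True, "clear_thinking": False},
    }

def chat_turn(url, history, user_text):
    """Send one turn, keeping the full assistant message in history."""
    body = build_payload(history, user_text)
    req = urllib.request.Request(
        f"{url}/chat/completions",
        data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        msg = json.load(resp)["choices"][0]["message"]
    # Store the assistant message as-is -- reasoning_content included --
    # so the next request echoes the model's prior thinking.
    history.extend([body["messages"][-1], msg])
    return msg
```

The key point is that `history` stores whatever the server returned, rather than a sanitized `{role, content}` projection of it.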

2. Test a Client (Your Client → Stub Server)

Step 2a: Start the Stub Server

# Start OpenAI-compatible stub server
uv run glm-validator-openai --port 8090

# Or Anthropic-compatible stub server
uv run glm-validator-anthropic --port 8092

The server prints its URL to stdout:

Stub OpenAI server running at http://127.0.0.1:8090/v1
Validation report: http://127.0.0.1:8090/v1/validation_report

Step 2b: Configure Your Client

For OpenCode: Create/edit opencode.json:

{
  "providers": {
    "test": {
      "type": "openai",
      "baseURL": "http://127.0.0.1:8090/v1",
      "apiKey": "test-key"
    }
  },
  "model": "test/stub-model"
}

Then run OpenCode:

opencode
# Send a few messages, then check the validation report
curl http://127.0.0.1:8090/v1/validation_report

For Claude Code:

# Set environment to point to stub server
ANTHROPIC_BASE_URL=http://127.0.0.1:8092 claude

# Send messages, then check validation
curl http://127.0.0.1:8092/v1/validation_report

Step 2c: Check the Validation Report

curl http://127.0.0.1:8090/v1/validation_report | jq

Output:

{
  "total": 6,
  "returned": 6,
  "missing": [],
  "missing_count": 0,
  "assessment": "PASS: All expected tokens were returned",
  "by_category": {
    "THINK": {"tokens": ["[THINK-OAI-T1-abc12345]", "[THINK-OAI-T2-def67890]"], "missing": []},
    "CONTENT": {"tokens": ["[CONTENT-OAI-T1-11111111]", "[CONTENT-OAI-T2-22222222]"], "missing": []}
  }
}
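For scripted checks, the report can be evaluated programmatically. A small sketch using the field names shown above:

```python
def summarize_report(report):
    """Return (passed, categories_with_missing_tokens) for a report dict."""
    passed = report["missing_count"] == 0
    lost = [cat for cat, info in report.get("by_category", {}).items()
            if info["missing"]]
    return passed, lost

# Example: a failing report in the same shape as the output above
report = {
    "total": 6,
    "returned": 4,
    "missing": ["[THINK-OAI-T1-abc12345]", "[THINK-OAI-T2-def67890]"],
    "missing_count": 2,
    "by_category": {
        "THINK": {
            "tokens": ["[THINK-OAI-T1-abc12345]", "[THINK-OAI-T2-def67890]"],
            "missing": ["[THINK-OAI-T1-abc12345]", "[THINK-OAI-T2-def67890]"],
        },
        "CONTENT": {"tokens": ["[CONTENT-OAI-T1-11111111]"], "missing": []},
    },
}
print(summarize_report(report))
```

A report where only THINK tokens go missing is the signature of a client that returns content but strips reasoning_content.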

3. Automated Compliance Testing

Test a client binary automatically:

# Test OpenCode binary for compliance
uv run python -m glm_thinking_validator.compliance \
    --client opencode \
    --binary /usr/local/bin/opencode \
    --config /path/to/opencode.json

# Test Claude Code binary
uv run python -m glm_thinking_validator.compliance \
    --client claude-code \
    --binary /usr/local/bin/claude

# Output:
# Starting stub server on port 8090...
# Launching client: /usr/local/bin/opencode
# Sending test prompts...
# Collecting validation report...
#
# === COMPLIANCE REPORT ===
# Client: opencode v1.0.0
# Total tokens: 6
# Preserved: 2 (33%)
# Missing: 4 (67%)
#   - [THINK-OAI-T1-abc12345] (reasoning_content not returned)
#   - [THINK-OAI-T2-def67890] (reasoning_content not returned)
#
# VERDICT: FAIL - Client does not preserve reasoning_content

API Endpoints

Stub OpenAI Server (default port 8090)

| Endpoint | Method | Description |
|----------|--------|-------------|
| /v1/chat/completions | POST | Chat completions with token tracking |
| /v1/validation_report | GET | Get token preservation report |
| /v1/reset | POST | Reset token registry for new test |
| /health | GET | Health check |

Stub Anthropic Server (default port 8092)

| Endpoint | Method | Description |
|----------|--------|-------------|
| /v1/messages | POST | Messages API with thinking blocks |
| /v1/validation_report | GET | Get token preservation report |
| /v1/reset | POST | Reset token registry for new test |
| /health | GET | Health check |

Preserved Thinking Requirements

For preserved thinking to work correctly, clients MUST:

1. Send chat_template_kwargs in requests

{
  "chat_template_kwargs": {
    "enable_thinking": true,
    "clear_thinking": false
  }
}

2. Include reasoning_content for assistant messages

{
  "messages": [
    {"role": "user", "content": "Hello"},
    {
      "role": "assistant",
      "content": "Hi there!",
      "reasoning_content": "The user greeted me, I should respond friendly."
    },
    {"role": "user", "content": "What did you think about?"}
  ]
}

3. For Anthropic API, preserve thinking blocks

{
  "messages": [
    {"role": "user", "content": "Hello"},
    {
      "role": "assistant",
      "content": [
        {"type": "thinking", "thinking": "User greeted me..."},
        {"type": "text", "text": "Hi there!"}
      ]
    },
    {"role": "user", "content": "Continue"}
  ]
}
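Taken together, requirements 2 and 3 can be checked before a request leaves the client. A Python sketch (check_preserved_thinking is an illustrative helper, not part of the harness API):

```python
def check_preserved_thinking(messages, api="openai"):
    """Return indices of assistant messages that lost their thinking.

    For the OpenAI format, an assistant message must carry
    reasoning_content; for the Anthropic format, its content list
    must include at least one "thinking" block.
    """
    bad = []
    for i, msg in enumerate(messages):
        if msg.get("role") != "assistant":
            continue
        if api == "openai":
            ok = bool(msg.get("reasoning_content"))
        else:  # anthropic: content is a list of typed blocks
            blocks = msg.get("content", [])
            ok = isinstance(blocks, list) and any(
                b.get("type") == "thinking" for b in blocks
            )
        if not ok:
            bad.append(i)
    return bad
```

An empty result means every assistant turn still carries its reasoning; any returned index pinpoints the turn where the chain was broken.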

Running Tests

# Run all tests (harness validation + unit tests)
uv run pytest -v

# Run only harness validation tests
uv run pytest tests/test_harness_validation.py -v

# Run llama.cpp passthrough tests (requires running llama-server)
LLAMACPP_URL=http://localhost:8080 uv run pytest tests/test_llamacpp_oai_passthrough.py -v

Expected Results

| Test | Expected | Notes |
|------|----------|-------|
| Harness validation | PASS | Validates test infrastructure |
| OpenCode preservation | FAIL | Known issue: Vercel AI SDK drops reasoning_content |
| Claude Code preservation | PASS | Should preserve thinking blocks |
| llama.cpp OAI passthrough | PASS | Server correctly handles reasoning_content |

Testing Specific Versions

This section documents how to test specific published versions of clients and backends, and tracks which versions are known to be broken.

Version Compatibility Matrix (as of 2026-01-24)

| Component | Pinned Version | Latest | Preserved Thinking | Notes |
|-----------|----------------|--------|--------------------|-------|
| OpenCode | 0.3.0 | 0.3.x | BROKEN | Vercel AI SDK strips reasoning_content |
| Claude Code | 1.0.0 | 1.x | UNTESTED | Should work (native Anthropic client) |
| Droid | - | 0.x | UNTESTED | Uses Vercel AI SDK (likely broken) |
| llama.cpp | b4712 | latest | PARTIAL | OAI API works, Anthropic API broken |
| vLLM | 0.6.0 | 0.6.x | UNTESTED | Needs reasoning_content passthrough test |
| MLX | 0.21.0 | 0.21.x | UNTESTED | Needs reasoning_content passthrough test |

Testing OpenCode

Automated (using expect)

# Run automated test with expect scripting
./scripts/expect/run-opencode-test.sh 0.3.0

# Or test latest version
./scripts/expect/run-opencode-test.sh latest

Manual

# Start stub server
uv run glm-validator-openai --port 8090 &

# Create test config
cat > /tmp/opencode-test.json << 'EOF'
{
  "providers": {
    "stub": {
      "type": "openai",
      "baseURL": "http://127.0.0.1:8090/v1",
      "apiKey": "test"
    }
  },
  "model": "stub/stub-model"
}
EOF

# Run OpenCode with test config
OPENCODE_CONFIG=/tmp/opencode-test.json bunx opencode-ai@0.3.0

# After sending 2+ messages, check validation
curl -s http://127.0.0.1:8090/v1/validation_report | jq '.assessment'
# Expected: "FAIL: X tokens missing"

Testing Claude Code

Automated (using expect)

# Run automated test with expect scripting
./scripts/expect/run-claude-code-test.sh 1.0.0

# Or test latest version
./scripts/expect/run-claude-code-test.sh latest

Manual

# Start stub Anthropic server
uv run glm-validator-anthropic --port 8092 &

# Run Claude Code against stub server
ANTHROPIC_BASE_URL=http://127.0.0.1:8092 bunx @anthropic-ai/claude-code@1.0.0

# After sending 2+ messages, check validation
curl -s http://127.0.0.1:8092/v1/validation_report | jq '.assessment'
# Expected: "PASS: All expected tokens were returned"

Testing Droid

# Clone and build specific version
git clone https://github.com/anthropics/droid.git
cd droid && git checkout v0.1.0
bun install && bun run build

# Start stub server
uv run glm-validator-openai --port 8090 &

# Configure droid to use stub server (check droid docs for config format)
# Run with: bun run start
# Then validate
curl -s http://127.0.0.1:8090/v1/validation_report | jq '.assessment'

Testing llama.cpp

# Build specific version
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp && git checkout b4712
cmake -B build && cmake --build build --config Release

# Start server with GLM model
./build/bin/llama-server \
    --model /path/to/glm-4.7-flash.gguf \
    --host 127.0.0.1 \
    --port 8080 \
    --jinja

# Test with stub client
uv run python -c "
from glm_thinking_validator.stubs.stub_openai_client import StubOpenAIClient

with StubOpenAIClient('http://127.0.0.1:8080/v1') as client:
    try:
        client.reset_server()  # Stub-only endpoint; a real server rejects this
    except Exception:
        pass

    # Turn 1
    resp1 = client.chat('What is 2+2?')
    print('Turn 1 reasoning:', (resp1.reasoning_content or '')[:50])

    # Turn 2 - check whether the prior reasoning was preserved in the template
    resp2 = client.chat('Are you sure?')
    print('Turn 2 reasoning:', (resp2.reasoning_content or '')[:50])
"

# For deeper inspection, check llama.cpp logs for rendered prompt

Testing vLLM

# Install specific version with uv
uv pip install vllm==0.6.0

# Start vLLM server with GLM model
uv run python -m vllm.entrypoints.openai.api_server \
    --model /path/to/glm-4.7-flash \
    --host 127.0.0.1 \
    --port 8080 \
    --trust-remote-code

# Test with stub client
uv run python -c "
from glm_thinking_validator.stubs.stub_openai_client import StubOpenAIClient

with StubOpenAIClient('http://127.0.0.1:8080/v1') as client:
    resp1 = client.chat('Hello')
    print(f'Has reasoning_content: {bool(resp1.reasoning_content)}')

    resp2 = client.chat('Continue')
    print('Turn 2 has reasoning_content:', bool(resp2.reasoning_content))
    # Check server logs to confirm reasoning_content reached the chat template
"

Testing MLX (Apple Silicon)

# Install specific version with uvx (one-off)
uvx --from mlx-lm==0.21.0 mlx_lm.server \
    --model mlx-community/glm-4-9b-chat-4bit \
    --port 8080

# Or install into project
uv pip install mlx-lm==0.21.0
uv run mlx_lm.server --model mlx-community/glm-4-9b-chat-4bit --port 8080

# Test with stub client
uv run python -c "
from glm_thinking_validator.stubs.stub_openai_client import StubOpenAIClient

with StubOpenAIClient('http://127.0.0.1:8080/v1') as client:
    resp1 = client.chat('Hello')
    print(f'Has reasoning_content: {bool(resp1.reasoning_content)}')
"

Known Issues

OpenCode / Vercel AI SDK (BROKEN as of 2026-01-24)

Affected versions: All current versions (0.3.x)

The Vercel AI SDK's @ai-sdk/openai provider strips reasoning_content from messages before sending them to the server. This breaks preserved thinking for any client using this SDK.

Root cause: @ai-sdk/openai uses a strict message schema that doesn't include reasoning_content:

// ai-sdk/packages/openai/src/openai-chat-language-model.ts
// Message type only includes: role, content, name, tool_calls, tool_call_id
// reasoning_content is silently dropped

llama.cpp Anthropic API (BROKEN as of 2026-01-24)

Affected versions: b4712 and earlier

The convert_anthropic_to_oai() function in llama.cpp does not convert Anthropic thinking content blocks to OpenAI reasoning_content. Thinking blocks are silently discarded.

Root cause: common/chat.cpp line ~450:

// Only converts "text" and "tool_use" blocks
// "thinking" blocks are ignored

  • PR needed: Add thinking block → reasoning_content conversion
  • Workaround: Use the OpenAI API format directly
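In Python terms, the missing conversion maps Anthropic "thinking" blocks onto OpenAI reasoning_content. A sketch of the intended transformation (illustrative only, not llama.cpp's actual C++ code):

```python
def anthropic_to_oai_message(msg):
    """Convert one Anthropic-format message to OpenAI chat format,
    carrying thinking blocks over as reasoning_content."""
    content = msg["content"]
    # Anthropic content may be a plain string or a list of typed blocks
    blocks = content if isinstance(content, list) else [{"type": "text", "text": content}]
    thinking = [b["thinking"] for b in blocks if b.get("type") == "thinking"]
    texts = [b["text"] for b in blocks if b.get("type") == "text"]
    out = {"role": msg["role"]}
    if thinking:
        out["reasoning_content"] = "\n".join(thinking)
    out["content"] = "".join(texts)
    return out
```

The current llama.cpp behavior corresponds to dropping the `thinking` list entirely; the fix is the two lines that fold it into reasoning_content.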

vLLM (UNTESTED as of 2026-01-24)

Status: Needs testing

vLLM's OpenAI-compatible API may not pass reasoning_content through to the chat template. Testing required.

MLX (UNTESTED as of 2026-01-24)

Status: Needs testing

MLX's server implementation needs verification for reasoning_content support.


Development

Adding a New Test

  1. Create test file in tests/
  2. Use TokenRegistry from glm_thinking_validator.stubs.token_registry
  3. Use StubOpenAIClient or StubAnthropicClient for controlled testing

Project Structure

src/glm_thinking_validator/
├── __init__.py
├── stubs/
│   ├── token_registry.py         # Token generation and tracking
│   ├── stub_openai_server.py     # OpenAI-compatible stub server
│   ├── stub_openai_client.py     # OpenAI stub clients (correct + broken)
│   ├── stub_anthropic_server.py  # Anthropic-compatible stub server
│   ├── stub_anthropic_client.py  # Anthropic stub clients
│   └── llamacpp_inspector.py     # llama.cpp template inspection

tests/
├── test_harness_validation.py    # Validates test infrastructure
├── test_llamacpp_oai_passthrough.py  # llama.cpp passthrough tests
└── conftest.py                   # Pytest fixtures

License

MIT
