Step verification monitors improve reasoning accuracy by checking the model's intermediate steps in real time and providing corrective feedback when errors are detected. Unlike early stopping monitors (which only decide when to stop), these monitors actively verify what the model is producing and steer it back on track when it goes wrong.
Each monitor uses domain-specific verifiers — arithmetic checks for Game of 24, grid navigation for Maze, and Z3 constraint solving for SpatialMap — to catch mistakes as they happen rather than after the full response has been generated.
All step verification monitors are used via stream_completion:
from interwhen import stream_completion
from interwhen.monitors import StepVerifierGame24Monitor, StepVerifierMazeMonitor, StepVerifierSpatialMapMonitor
answer = await stream_completion(
prompt,
llm_server=llm_server,
monitors=(your_monitor,),
add_delay=False,
termination_requires_validation=False,
async_execution=True
)Verifies Game of 24 solutions step by step. The model is prompted to output a structured format where each step declares available numbers, a suggested arithmetic operation, and the remaining numbers. The monitor verifies that each operation is mathematically correct, uses valid available numbers, and produces correct remaining numbers. On error, it appends feedback prompting the model to retry that step.
| Parameter | Type | Default | Description |
|---|---|---|---|
name |
str | — | Unique identifier for this monitor |
answer_start_token |
str | — | Token marking the transition from reasoning to answer (typically "</think>") |
original_numbers |
list | — | The four input numbers for the Game of 24 problem |
max_corrections |
int | 5 |
Maximum number of correction attempts before giving up |
async_execution |
bool | True |
Whether to run verification asynchronously |
StepVerifierGame24Monitor(
name="game24_step_verifier",
answer_start_token="</think>",
original_numbers=[1, 2, 6, 8],
max_corrections=5,
)The model is prompted to produce structured steps using a meta prompt:
>Step1
available numbers: [1, 2, 6, 8]
suggested operation: 8 / 2 = 4
remaining numbers: [4, 1, 6]
>Step2
available numbers: [4, 1, 6]
suggested operation: 6 * 4 = 24
remaining numbers: [24, 1]
Verifies maze navigation step by step against the actual maze grid. The model outputs structured steps declaring movement direction, from/to positions, turn type, and running turn counts. The monitor checks all of these against the ground-truth maze grid — verifying that moves are valid, positions are walkable, turn classifications are correct, and running counts are accurate.
| Parameter | Type | Default | Description |
|---|---|---|---|
name |
str | — | Unique identifier for this monitor |
answer_start_token |
str | — | Token marking the transition from reasoning to answer |
grid |
list | — | The maze grid (2D list of characters) |
start_pos |
tuple | — | Starting position (row, col) |
exit_pos |
tuple | — | Exit position (row, col) |
max_corrections |
int | 5 |
Maximum correction attempts per example |
question_type |
str | "right_turns" |
One of "right_turns", "total_turns", or "relative_position" |
async_execution |
bool | True |
Whether to run verification asynchronously |
You can also create the monitor directly from a prompt using the factory method:
monitor = StepVerifierMazeMonitor.from_prompt(
prompt_text=user_prompt,
max_corrections=5,
name="maze_step_verifier"
)This automatically parses the maze grid, start/exit positions, and auto-detects the question type from the prompt text.
from interwhen.utils.maze_verifier import parse_maze_from_prompt
grid, start_pos, exit_pos = parse_maze_from_prompt(user_prompt)
question_type = StepVerifierMazeMonitor.detect_question_type(user_prompt)
StepVerifierMazeMonitor(
name="maze_step_verifier",
answer_start_token="</think>",
grid=grid,
start_pos=start_pos,
exit_pos=exit_pos,
max_corrections=5,
question_type=question_type,
)| Type | What it verifies |
|---|---|
"right_turns" |
Full step-by-step navigation: direction, positions, walkability, turn type, and right-turn count |
"total_turns" |
Same as above, but tracks total turns (left + right) |
"relative_position" |
Only verifies the LOCATE section where the model identifies S and E positions |
Verifies spatial/directional reasoning using Z3 constraint solving. The model reasons about objects on a map and their relative positions (e.g., "A is northwest of B"). The monitor initializes Z3 constraints from the problem description, then incrementally verifies each directional claim the model makes. Valid claims strengthen the constraint set; contradictory claims trigger feedback.
| Parameter | Type | Default | Description |
|---|---|---|---|
name |
str | — | Unique identifier for this monitor |
answer_start_token |
str | — | Token marking the transition from reasoning to answer |
problem_text |
str | — | The spatial reasoning problem description |
max_corrections |
int | 5 |
Maximum correction attempts per example |
async_execution |
bool | True |
Whether to run verification asynchronously |
monitor = StepVerifierSpatialMapMonitor.from_prompt(
problem_text=user_prompt,
max_corrections=5,
name="spatialmap_step_verifier"
)This parses the spatial relationships from the problem text and initializes the Z3 solver automatically.
The Z3 solver handles diagonal directions (Northwest, Northeast, Southwest, Southeast) and cardinal directions (North, South, East, West), including transitivity (if A is NW of B and B is NW of C, then A is NW of C) and reversibility (if A is NW of B, then B is SE of A).
Each script runs a full evaluation: loading a dataset, building structured prompts, running inference with step verification, and computing accuracy/token statistics.
# Game of 24 with step verification
python ./examples/TTSwithVerification/game24_stepverifier.py -n 1
# Maze with step verification
python ./examples/TTSwithVerification/maze_stepverifier.py -n 1
# SpatialMap with step verification
python ./examples/TTSwithVerification/spatialmap_stepverifier.py -n 1| Argument | Description | Default |
|---|---|---|
-n, --num_examples |
Number of examples to run | varies by task |
--max_corrections |
Max correction attempts per example | 5 |
-d, --debug |
Enable debug logging | False |
--model / --main_model |
Main generation model | Qwen/Qwen3-30B-A3B-Thinking-2507 |
--port |
vLLM server port | 8000 |