Verifier-guided Reasoning in Three Lines

Running verifier-guided inference requires only a few lines of code: just specify the list of monitors to be used with a target LLM. Each monitor requires specifying the kind of verifier, when it should be invoked (e.g., each step or after a reflection token like 'Wait'), and the text pattern to intervene with.

Set up target LLM server

python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen3-30B-A3B-Thinking-2507 \
  --port 8000 \
  --tensor-parallel-size 4
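Loading a model of this size can take a while, so it may help to confirm that the server is answering before streaming any completions. The following is a minimal sketch, assuming the server launched above exposes the OpenAI-compatible `/v1/models` route on `localhost:8000`; the helper names `models_url` and `list_served_models` are hypothetical and not part of this repository.

```python
import json
import urllib.request


def models_url(host: str = "localhost", port: int = 8000) -> str:
    """Build the URL of the OpenAI-compatible model listing route."""
    return f"http://{host}:{port}/v1/models"


def list_served_models(url: str, timeout: float = 5.0) -> list:
    """Return the model ids reported by the server, or [] if it is not reachable yet."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            payload = json.load(response)
    except (OSError, ValueError):
        # Connection refused, timeout, or a non-JSON reply: treat as "not ready".
        return []
    return [entry["id"] for entry in payload.get("data", [])]


# Example usage (needs the server from the previous step to be running):
# print(list_served_models(models_url()))
```

An empty list means the server is not ready or not reachable; once the model is loaded, the list should contain the name passed to `--model`.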

Generate an answer with the given monitors enabled

llm_server = init_llm_server("Qwen/Qwen3-30B-A3B-Thinking-2507", max_tokens=32768, port=8000)
stream_completion(
    prompt,
    llm_server=llm_server,
    monitors=(SimpleTextReplaceMonitor("IsCheck", "</think>", async_execution=True),),
    async_execution=True
)

The above code implements a simple monitor that watches the model's output stream and replaces all occurrences of "is" with "isn't". You can swap it for your own custom monitor, e.g., one that checks logical correctness or domain-specific constraints. To run the full example:

python ./examples/text_replacement_example.py
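To illustrate what a text-replacement monitor has to handle, here is a standalone sketch of replacing "is" with "isn't" over a chunked stream. It does not use this library's `SimpleTextReplaceMonitor`; the function `replace_in_stream` is hypothetical, and whole-word matching is an assumption made for the example. The point it shows is that a word can be split across two chunks, so the trailing partial word has to be held back until the next chunk arrives.

```python
import re
from typing import Iterable, Iterator


def replace_in_stream(
    chunks: Iterable[str], target: str = "is", replacement: str = "isn't"
) -> Iterator[str]:
    """Yield the streamed text with whole-word `target` replaced by `replacement`."""
    pattern = re.compile(rf"\b{re.escape(target)}\b")
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        # Hold back the trailing run of word characters: the word may
        # continue in the next chunk, so it cannot be judged yet.
        cut = re.search(r"\w*\Z", buffer).start()
        ready, buffer = buffer[:cut], buffer[cut:]
        if ready:
            yield pattern.sub(lambda _: replacement, ready)
    if buffer:
        # End of stream: whatever is left is a complete word.
        yield pattern.sub(lambda _: replacement, buffer)


# "This" is left alone, while the "is" split across two chunks is replaced.
print("".join(replace_in_stream(["Th", "is i", "s a test."])))
```

Buffering the trailing word delays emission by at most one word, which is one reason an active monitor can add latency to the stream.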

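The table below shows the latency impact of the monitor. When the stream contains the target word ("is"), the monitor activates and performs the replacement, which adds about 4.6 s on average in these measurements (12.97 s vs. 8.36 s). When the target word is absent, the monitor has negligible impact on latency: the difference between 7.31 s and 7.35 s is well within the reported variation.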

| Stream content | Monitor | Latency (s) |
| --- | --- | --- |
| Contains "is" (monitor activates) | enabled | 12.97 ± 2.97 |
| Contains "is" (monitor activates) | disabled | 8.36 ± 0.01 |
| Does not contain "is" (monitor idle) | enabled | 7.31 ± 1.16 |
| Does not contain "is" (monitor idle) | disabled | 7.35 ± 1.17 |
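The overhead implied by these numbers can be worked out directly. The snippet below is a small sketch that only uses the mean latencies reported above; the dictionary layout and the `overhead` helper are illustrative, and the reported spreads are not propagated.

```python
# Mean latencies in seconds, copied from the table above.
latency_mean_s = {
    ("contains", "enabled"): 12.97,
    ("contains", "disabled"): 8.36,
    ("absent", "enabled"): 7.31,
    ("absent", "disabled"): 7.35,
}


def overhead(case: str) -> tuple:
    """Return (absolute overhead in s, relative overhead in %) of enabling the monitor."""
    enabled = latency_mean_s[(case, "enabled")]
    disabled = latency_mean_s[(case, "disabled")]
    delta = enabled - disabled
    return round(delta, 2), round(100 * delta / disabled, 1)


print(overhead("contains"))  # (4.61, 55.1)
print(overhead("absent"))    # (-0.04, -0.5)
```

The small negative value for the idle case is much smaller than the reported spread of about 1.2 s, so it is best read as measurement noise rather than a speedup.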