Reimplement guided decoding with xgrammar for PyTorch Engine #4028
Merged
lvhan028 merged 5 commits into InternLM:main on Oct 15, 2025
Conversation
grimoire reviewed Oct 14, 2025
Motivation
The original outlines-based guided decoding in the PyTorch engine has three major problems:
1. Token-level mismatch & poor performance
   Outlines works on characters, not on tokens.
2. Outdated & incompatible dependency
   The engine pins `outlines<0.1.0`, which is more than one year old and hard-coupled to numpy 1.x.
3. Life-cycle bug

xgrammar is a token-level, GPU-native grammar engine that avoids these problems.
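For context, here is a minimal, illustrative sketch (not the code in this PR) of the token-level flow xgrammar exposes: compile a grammar once, then mask the logits with a token bitmask at every decoding step. The model name is only a placeholder.

```python
import json

import torch
import xgrammar as xgr
from transformers import AutoTokenizer

# Placeholder model; any HF tokenizer works the same way.
tokenizer = AutoTokenizer.from_pretrained("internlm/internlm2_5-7b-chat")
info = xgr.TokenizerInfo.from_huggingface(tokenizer)
compiler = xgr.GrammarCompiler(info)

# Compile once; the compiled grammar is reusable across requests.
schema = json.dumps({"type": "object", "properties": {"name": {"type": "string"}}})
matcher = xgr.GrammarMatcher(compiler.compile_json_schema(schema))

# Per decoding step: mask the logits so only grammar-legal tokens survive.
logits = torch.randn(1, info.vocab_size, device="cuda")
bitmask = xgr.allocate_token_bitmask(1, info.vocab_size)
matcher.fill_next_token_bitmask(bitmask)
xgr.apply_token_bitmask_inplace(logits, bitmask.to(logits.device))

next_token = int(torch.argmax(logits, dim=-1))
matcher.accept_token(next_token)  # advance the grammar state token by token
```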
Modification
guided_process.py
- Add `GuidedDecodingManager`, which wraps xgrammar's `compile_json_schema()` / `compile_regex_grammar()` and `allocate_token_bitmask()` / `apply_token_bitmask_inplace()`, and keeps a processor cache keyed by `session_id + seq_id`.
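A rough sketch of what such a manager could look like; the class and method names below are illustrative, not the PR's actual API, and error handling is omitted.

```python
from typing import Dict, Tuple

import xgrammar as xgr


class GuidedDecodingManagerSketch:
    """Illustrative stand-in for the manager described above: compile grammars
    with xgrammar and cache one matcher per (session_id, seq_id)."""

    def __init__(self, tokenizer_info: xgr.TokenizerInfo):
        self.compiler = xgr.GrammarCompiler(tokenizer_info)
        self.vocab_size = tokenizer_info.vocab_size
        self._matchers: Dict[Tuple[int, int], xgr.GrammarMatcher] = {}

    def get_matcher(self, session_id: int, seq_id: int, schema: str) -> xgr.GrammarMatcher:
        """Return the cached matcher for this sequence, compiling on first use."""
        key = (session_id, seq_id)
        if key not in self._matchers:
            compiled = self.compiler.compile_json_schema(schema)
            self._matchers[key] = xgr.GrammarMatcher(compiled)
        return self._matchers[key]

    def remove_session(self, session_id: int) -> None:
        """Drop every cached matcher that belongs to a finished session."""
        for key in [k for k in self._matchers if k[0] == session_id]:
            del self._matchers[key]
```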
logits_process.py
- Remove the old guided-decoding fields (`_guided_sampling`, `guided_input_ids`, …).
- `FusedLogitsProcessor` receives a `GuidedDecodingManager` instance.
- `forward()`:
  - batch-allocates one bitmask tensor
  - fills it for every guided sequence
  - applies it in-place on GPU
- `accept_token()` advances each matcher.
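A hedged sketch of that batched path: one bitmask for the whole batch, filled only for guided rows, then applied in-place on the GPU logits. The function name and the row-to-matcher mapping are hypothetical, and the `indices` argument is assumed to restrict the in-place mask to the guided rows.

```python
from typing import Dict

import torch
import xgrammar as xgr


def apply_guided_masks(logits: torch.Tensor,
                       matchers: Dict[int, xgr.GrammarMatcher],
                       vocab_size: int) -> torch.Tensor:
    """Mask the logits of guided sequences only; other rows stay untouched."""
    batch_size = logits.shape[0]
    bitmask = xgr.allocate_token_bitmask(batch_size, vocab_size)

    guided_rows = []
    for row, matcher in matchers.items():  # row index in the batch -> its matcher
        matcher.fill_next_token_bitmask(bitmask, index=row)
        guided_rows.append(row)

    if guided_rows:
        xgr.apply_token_bitmask_inplace(logits, bitmask.to(logits.device),
                                        indices=guided_rows)
    return logits
```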
model_agent.py / sampling.py
- `model_agent` keeps the singleton `GuidedDecodingManager`.
- `ARSamplingStrategy` builds `session_ctx` (session/seq IDs) and a `session_to_cleanup` list.
- `SamplingInputs` carries the two new fields instead of `guided_input_ids`.
engine.py
- `end_session()` now calls `sampling_strategy.on_session_end()` → `session_to_cleanup` → the next forward pass deletes the processors, guaranteeing immediate release.
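The clean-up wiring could look roughly like this; apart from `on_session_end` and `session_to_cleanup`, which come from the description above, the hook names are hypothetical, and `remove_session` refers to the manager sketch earlier.

```python
from typing import List


class ARSamplingStrategySketch:
    """Illustrative wiring of the life-cycle fix: end_session() only marks the
    session, and the next forward pass actually releases its processors."""

    def __init__(self, guided_manager):
        self.guided_manager = guided_manager      # the singleton GuidedDecodingManager
        self.session_to_cleanup: List[int] = []

    def on_session_end(self, session_id: int) -> None:
        # Called from engine.end_session(); defer deletion to the next forward.
        self.session_to_cleanup.append(session_id)

    def before_forward(self) -> None:
        # Hypothetical hook run at the start of the next forward pass.
        for session_id in self.session_to_cleanup:
            self.guided_manager.remove_session(session_id)
        self.session_to_cleanup.clear()
```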
requirements
- Remove `outlines<0.1.0`; add `xgrammar` for all backends (cuda/rocm/ascend/camb/maca).

tests