Guided decoding with xgrammar for TurboMind#3965

Merged
lvhan028 merged 18 commits intoInternLM:mainfrom
windreamer:guided_decoding_with_xgrammar
Oct 13, 2025

Conversation

@windreamer
Collaborator

@windreamer windreamer commented Sep 12, 2025

Motivation

LMDeploy’s TurboMind backend is the fastest inference stack in the ecosystem, yet it still lacks Guided Decoding – a feature that is already available in the PyTorch backend and heavily requested by the community.
This PR closes the gap by bringing token-level, C++-native Guided Decoding to TurboMind while keeping the API 100% compatible with the existing PyTorch backend.
The implementation is built on xGrammar (Apache-2.0), a high-performance C++ library that compiles JSON / Choice / Regex grammars into token-level FSMs and applies them with negligible overhead.

Modification

  1. Build-system

    • Add xgrammar as a header-only dependency via CMake FetchContent (CUDA & Python bindings disabled).
    • Export xgrammar::tokenizer_info and xgrammar::grammar_compiler symbols under lmdeploy::xgrammar.
  2. Core C++ changes

    • DynamicDecodeLayer pipeline extended with two new layers:
      • GuidedDecodeMaskLayer: in setup() compiles / reuses grammar → builds per-request token bitmask; in forward() launches a light CUDA kernel to mask disallowed logits to -INF.
      • GuidedDecodeUpdateLayer: in forward() calls matcher->AcceptToken(output_id) to advance the FSM.
    • Grammar compiler cache (LRU, keyed by schema hash) shared across all sessions to avoid re-compilation.
  3. Python frontend

    • Re-use existing guided_decoding utilities from PyTorch backend; no new API surface.
    • turbo.TurboMindEngine now accepts the same response_format= / guided_json= / guided_choice= arguments.
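As a rough illustration of the two decode layers described in item 2, here is a plain-Python sketch of the mask/update cycle. This is conceptual only: `ToyMatcher`, `apply_token_bitmask`, and `fill_bitmask` are hypothetical stand-ins, not the actual xgrammar or lmdeploy API; the real mask step runs as a CUDA kernel over a packed per-request bitmask.

```python
# Conceptual sketch (plain Python) of the two guided-decoding hooks.
# All names here are illustrative, not the real C++/xgrammar symbols.
NEG_INF = float("-inf")

def apply_token_bitmask(logits, bitmask):
    """Mask-layer step: set logits of grammar-disallowed tokens to -INF.
    `bitmask` packs one bit per vocab entry (1 = allowed)."""
    return [
        logit if (bitmask[i // 32] >> (i % 32)) & 1 else NEG_INF
        for i, logit in enumerate(logits)
    ]

class ToyMatcher:
    """Stands in for a grammar FSM: only token 2, then token 5, is legal."""
    def __init__(self):
        self.allowed = [{2}, {5}]
        self.pos = 0
    def fill_bitmask(self, vocab_size):
        words = [0] * ((vocab_size + 31) // 32)
        for t in self.allowed[self.pos]:
            words[t // 32] |= 1 << (t % 32)
        return words
    def accept_token(self, token_id):
        # Update-layer step: advance the FSM with the sampled token.
        assert token_id in self.allowed[self.pos]
        self.pos += 1

matcher = ToyMatcher()
logits = [0.1] * 8
masked = apply_token_bitmask(logits, matcher.fill_bitmask(8))
best = max(range(8), key=lambda i: masked[i])  # only token 2 survives
matcher.accept_token(best)
```

With all other logits forced to -INF, any sampling strategy can only pick a grammar-legal token, which is why the mask/update pair is sufficient to constrain generation.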

Checklist

  • Pre-commit hooks (clang-format, flake8, mypy) passed.
  • Document updated

@windreamer windreamer changed the title Guided decoding with xgrammar [WIP] Guided decoding with xgrammar Sep 12, 2025
@windreamer windreamer force-pushed the guided_decoding_with_xgrammar branch 3 times, most recently from 8b3e766 to 8fd6d05 Compare September 12, 2025 09:44
@shell-nlp
Contributor

good job!

@windreamer windreamer force-pushed the guided_decoding_with_xgrammar branch 25 times, most recently from 0362250 to 8bcbfff Compare September 22, 2025 12:41
@windreamer windreamer force-pushed the guided_decoding_with_xgrammar branch from 9817089 to 4516ac7 Compare October 9, 2025 07:35
@windreamer windreamer changed the title Guided decoding with xgrammar Guided decoding with xgrammar for TurboMind Oct 9, 2025
@windreamer windreamer marked this pull request as ready for review October 9, 2025 07:36
@windreamer
Collaborator Author

Could we split this PR into two separate ones? One for the TurboMind engine and another for the PyTorch engine.

Done

@windreamer
Collaborator Author

I don't know much about guided decoding, but I think there are bugs in the PyTorch implementation (on the main branch).
The matcher is maintained in instances of RegexLogitsProcessor or JSONLogitsProcessor, and _get_guided_logits_processor only caches 32 instances. Different requests with the same guide would get the same processor, and old processors would be evicted once more than 32 guided requests come in.

You are right! It is a bit tough...

Should be tentatively resolved in #4028
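To illustrate the safe caching pattern (and the compiler cache this PR adds on the C++ side), here is a hedged Python sketch: only the stateless compilation result is cached, under an LRU keyed by schema hash, while every request gets its own stateful matcher. All class names are illustrative assumptions, not the actual lmdeploy API.

```python
# Sketch of the fix direction for the processor-sharing bug: share the
# compiled grammar (stateless), never the matcher (stateful).
from collections import OrderedDict
from hashlib import sha256

class CompiledGrammar:          # stateless compile artifact, safe to share
    def __init__(self, schema: str):
        self.schema = schema

class Matcher:                  # stateful FSM cursor, one per request
    def __init__(self, grammar: CompiledGrammar):
        self.grammar = grammar
        self.accepted = []
    def accept_token(self, tok: int):
        self.accepted.append(tok)

class GrammarCache:
    """LRU cache keyed by schema hash; evicts the least recently used."""
    def __init__(self, capacity: int = 32):
        self.capacity = capacity
        self._cache: "OrderedDict[str, CompiledGrammar]" = OrderedDict()
    def get(self, schema: str) -> CompiledGrammar:
        key = sha256(schema.encode()).hexdigest()
        if key in self._cache:
            self._cache.move_to_end(key)                # LRU touch
        else:
            self._cache[key] = CompiledGrammar(schema)  # compile once
            if len(self._cache) > self.capacity:
                self._cache.popitem(last=False)         # evict oldest
        return self._cache[key]

cache = GrammarCache()
g1 = cache.get('{"type": "object"}')
g2 = cache.get('{"type": "object"}')   # cache hit: same compiled grammar
m1, m2 = Matcher(g1), Matcher(g2)      # but independent per-request state
m1.accept_token(7)                     # does not touch m2's FSM position
```

Because eviction only drops a compile artifact (which can be recompiled on a miss), cache pressure no longer corrupts in-flight requests the way evicting a live, stateful processor would.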

@windreamer windreamer requested a review from lzhangzz October 13, 2025 09:01
@lvhan028
Collaborator

Could you update "structed_output.md"?

@windreamer
Collaborator Author

Could you update "structed_output.md"?

Done

@lvhan028 lvhan028 merged commit aef6363 into InternLM:main Oct 13, 2025
9 checks passed
@windreamer windreamer deleted the guided_decoding_with_xgrammar branch October 13, 2025 11:39
Skyseaee pushed a commit to Skyseaee/lmdeploy that referenced this pull request Jan 4, 2026
* feat(turbomind): bring xGrammar into build

* feat(turbomind): add skeleton for guided decoding layers

* feat(turbomind): add implementation for naive bitmap mask with a loop

* add ModelRequest support for xgrammar

* feat: enable grammar init in turbomind

* fix: fix some bug and add initial tests

* feat: restructure the interface

* feat: speedup with cuda inplace kernel

* fix: fix test case

* fix: use stream from context instead of the default stream

* test: add matrix grammar test

* fix: simplify the bitmap apply kernel

* feat: move tensor allocation to ctor

* test: temporarily disable pytorch engine tests as it is faulty

* test: move timm to test requirements

* fix: enable openai guided decoding function for turbomind

* fix: fix `schema` not found issue by enforce pydantic serialize_by_alias

* docs: modify docs for structured output
Skyseaee pushed a commit to Skyseaee/lmdeploy that referenced this pull request Jan 4, 2026
Guided decoding with xgrammar for TurboMind (InternLM#3965)

See merge request shopee/MLP/aip/llm/generater/lmdeploy!110

Labels

enhancement New feature or request

Development

Successfully merging this pull request may close these issues.

[Feature] Will the turbomind backend support guided_decoding?

5 participants