Model: Support IBM Granite (Dense/Mamba + MoE) #18040

Merged
Kangyan-Zhou merged 5 commits into sgl-project:main from blazingbhavneek:feature/support-granite-hybrid
Feb 15, 2026

Conversation

@blazingbhavneek
Contributor

Motivation

Add support for ibm-granite/granite-4.0-h-micro and its dense variant (granite-4.0-micro).

When I tried to run this model, I got the following error:

❯ python3 -m sglang.bench_one_batch --correct --model ibm-granite/granite-4.0-h-micro

...
...

[rank0]:   File "/run/media/blazingbhavneek/Common/Code/sglang/python/sglang/srt/model_loader/utils.py", line 74, in resolve_transformers_arch
[rank0]:     raise ValueError(
[rank0]: ValueError: GraniteMoeHybridForCausalLM has no SGlang implementation and the Transformers implementation is not compatible with SGLang.

This PR ports the IBM Granite model from its vLLM implementation.
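
For context on what resolving that error involves, here is a skeleton of how a model implementation is exposed to SGLang's loader. This is an illustrative sketch following the conventions of other SGLang model files, not this PR's actual diff; the class layout and forward() signature are assumptions.

```python
# Illustrative skeleton only -- class layout and the forward() signature are
# assumptions based on SGLang's model-file conventions, not this PR's diff.
from torch import nn


class GraniteMoeHybridForCausalLM(nn.Module):
    """Architecture class matching the `architectures` string in the HF config."""

    def __init__(self, config, quant_config=None, prefix: str = ""):
        super().__init__()
        self.config = config
        # The real implementation builds the hybrid decoder stack here:
        # Mamba and attention layers interleaved, plus MoE blocks.

    def forward(self, input_ids, positions, forward_batch):
        # The real implementation runs the decoder stack and returns logits.
        raise NotImplementedError


# SGLang's loader scans model modules for this symbol; exporting it is what
# makes the "no SGLang implementation" error above go away.
EntryClass = GraniteMoeHybridForCausalLM
```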

Modifications

Accuracy Tests

Command for SGLang Server:
python -m sglang.launch_server --model-path ibm-granite/granite-4.0-h-micro --port 30000

Command for vLLM Server:
python3 -m vllm.entrypoints.api_server --tokenizer-mode auto --model ibm-granite/granite-4.0-h-micro --disable-log-requests --port 21000
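
Once the SGLang server is up, a quick smoke test against its native /generate endpoint looks like this (a minimal sketch; the prompt and sampling parameters are arbitrary examples, and port 30000 matches the --port flag above):

```python
# Quick smoke test against the SGLang server launched above.
import requests

resp = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": "The capital of France is",
        "sampling_params": {"temperature": 0, "max_new_tokens": 32},
    },
)
print(resp.json()["text"])
```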

| Benchmark | Model               | Engine | Accuracy |
|-----------|---------------------|--------|----------|
| MMLU      | granite-4.0-h-micro | SGLang | 0.636    |
| MMLU      | granite-4.0-h-micro | vLLM   | 0.636    |
| MMLU      | granite-4.0-micro   | SGLang | 0.618    |
| MMLU      | granite-4.0-micro   | vLLM   | 0.625    |
| GSM8K     | granite-4.0-h-micro | SGLang | 0.805    |
| GSM8K     | granite-4.0-h-micro | vLLM   | 0.790    |
| GSM8K     | granite-4.0-micro   | SGLang | 0.800    |
| GSM8K     | granite-4.0-micro   | vLLM   | 0.815    |

MMLU

granite-4.0-h-micro

vLLM
❯ python3 bench_other.py --nsub 10 --backend vllm

Average accuracy 0.510, latency 69.94, #q: 100 - abstract_algebra
Average accuracy 0.578, latency 7.04, #q: 135 - anatomy
Average accuracy 0.770, latency 13.12, #q: 152 - astronomy
Average accuracy 0.620, latency 8.53, #q: 100 - business_ethics
Average accuracy 0.702, latency 15.63, #q: 265 - clinical_knowledge
Average accuracy 0.757, latency 10.11, #q: 144 - college_biology
Average accuracy 0.450, latency 8.17, #q: 100 - college_chemistry
Average accuracy 0.610, latency 12.45, #q: 100 - college_computer_science
Average accuracy 0.530, latency 8.92, #q: 100 - college_mathematics
Average accuracy 0.630, latency 12.90, #q: 173 - college_medicine
Total latency: 166.797
Average accuracy: 0.636
SGLang
❯ python3 bench_sglang.py --nsub 10
100%|1369/1369 [01:07<00:00, 20.26it/s]
subject: abstract_algebra, #q:100, acc: 0.480
subject: anatomy, #q:135, acc: 0.593
subject: astronomy, #q:152, acc: 0.757
subject: business_ethics, #q:100, acc: 0.620
subject: clinical_knowledge, #q:265, acc: 0.717
subject: college_biology, #q:144, acc: 0.729
subject: college_chemistry, #q:100, acc: 0.480
subject: college_computer_science, #q:100, acc: 0.570
subject: college_mathematics, #q:100, acc: 0.520
subject: college_medicine, #q:173, acc: 0.653
Total latency: 67.571
Average accuracy: 0.636

granite-4.0-micro

vLLM
❯ python3 bench_other.py --nsub 10 --backend vllm
Average accuracy 0.380, latency 1.23, #q: 100 - abstract_algebra
Average accuracy 0.541, latency 1.45, #q: 135 - anatomy
Average accuracy 0.770, latency 2.03, #q: 152 - astronomy
Average accuracy 0.680, latency 1.41, #q: 100 - business_ethics
Average accuracy 0.672, latency 2.70, #q: 265 - clinical_knowledge
Average accuracy 0.778, latency 2.06, #q: 144 - college_biology
Average accuracy 0.430, latency 1.40, #q: 100 - college_chemistry
Average accuracy 0.630, latency 1.80, #q: 100 - college_computer_science
Average accuracy 0.460, latency 1.39, #q: 100 - college_mathematics
Average accuracy 0.676, latency 2.73, #q: 173 - college_medicine
Total latency: 18.226
Average accuracy: 0.625
SGLang
❯ python3 bench_sglang.py --nsub 10
100%|1369/1369 [00:19<00:00, 71.07it/s]
subject: abstract_algebra, #q:100, acc: 0.340
subject: anatomy, #q:135, acc: 0.541
subject: astronomy, #q:152, acc: 0.757
subject: business_ethics, #q:100, acc: 0.680
subject: clinical_knowledge, #q:265, acc: 0.668
subject: college_biology, #q:144, acc: 0.771
subject: college_chemistry, #q:100, acc: 0.430
subject: college_computer_science, #q:100, acc: 0.610
subject: college_mathematics, #q:100, acc: 0.470
subject: college_medicine, #q:173, acc: 0.676
Total latency: 19.266
Average accuracy: 0.618

GSM8K

granite-4.0-h-micro

vLLM
❯ python3 bench_other.py --num-questions 200 --backend vllm
Downloading from https://raw.githubusercontent.com/openai/grade-school-math/master/grade_school_math/data/test.jsonl to /tmp/test.jsonl
/tmp/test.jsonl: 732kB [00:00, 2.65MB/s]                                                                                                                      
100%|200/200 [00:46<00:00,  4.33it/s]
Accuracy: 0.790
Invalid: 0.000
Latency: 46.257 s
SGLang
❯ python3 bench_sglang.py --num-questions 200
100%|200/200 [00:42<00:00,  4.65it/s]
Accuracy: 0.805
Invalid: 0.000
Latency: 43.329 s
Output throughput: 431.677 token/s

granite-4.0-micro

vLLM
❯ python3 bench_other.py --num-questions 200 --backend vllm
100%|200/200 [00:18<00:00, 10.76it/s]
Accuracy: 0.815
Invalid: 0.000
Latency: 18.629 s
SGLang
❯ python3 bench_sglang.py --num-questions 200
100%|200/200 [00:17<00:00, 11.49it/s]
Accuracy: 0.800

Benchmarking and Profiling

❯ python -m sglang.bench_one_batch --model-path ibm-granite/granite-4.0-h-micro --batch 4
<frozen importlib._bootstrap_external>:1241: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead.
<frozen importlib._bootstrap_external>:1241: FutureWarning: The cuda.nvrtc module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.nvrtc module instead.
[2026-02-01 00:05:01 TP0] Init torch distributed begin.
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[2026-02-01 00:05:01 TP0] Init torch distributed ends. elapsed=0.12 s, mem usage=0.02 GB
[2026-02-01 00:05:01 TP0] Ignore import error when loading sglang.srt.models.glm_ocr: No module named 'transformers.models.glm_ocr'
[2026-02-01 00:05:01 TP0] Ignore import error when loading sglang.srt.models.glm_ocr_nextn: No module named 'transformers.models.glm_ocr'
[2026-02-01 00:05:01 TP0] Ignore import error when loading sglang.srt.models.glmasr: cannot import name 'GlmAsrConfig' from 'transformers' (/home/blazingbhavneek/miniconda3/envs/sglang/lib/python3.11/site-packages/transformers/__init__.py)
[2026-02-01 00:05:02 TP0] Load weight begin. avail mem=15.24 GB
[2026-02-01 00:05:02 TP0] Beginning to load weights
Loading safetensors checkpoint shards:   0% Completed | 0/2 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  50% Completed | 1/2 [00:00<00:00,  1.28it/s]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:01<00:00,  1.84it/s]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:01<00:00,  1.72it/s]

[2026-02-01 00:05:03 TP0] Loading weights took 1.21 seconds
[2026-02-01 00:05:03 TP0] Load weight end. elapsed=1.25 s, type=GraniteMoeHybridForCausalLM, dtype=torch.bfloat16, avail mem=9.26 GB, mem usage=5.98 GB.
[2026-02-01 00:05:03 TP0] Using KV cache dtype: torch.bfloat16
[2026-02-01 00:05:03 TP0] Mamba Cache is allocated. max_mamba_cache_size: 38, conv_state size: 0.03GB, ssm_state size: 2.74GB 
[2026-02-01 00:05:03 TP0] KV Cache is allocated. #tokens: 399740, K size: 1.52 GB, V size: 1.52 GB
[2026-02-01 00:05:03 TP0] Memory pool end. avail mem=3.42 GB
[2026-02-01 00:05:03 TP0] Init attention backend begin.
[2026-02-01 00:05:03 TP0] Init attention backend end. elapsed=0.02 s
[2026-02-01 00:05:03 TP0] Capture cuda graph begin. This can take up to several minutes. avail mem=2.98 GB
[2026-02-01 00:05:03 TP0] Capture cuda graph bs [1, 2, 4, 8]
Capturing batches (bs=1 avail_mem=2.91 GB): 100%|4/4 [00:00<00:00,  5.37it/s]
[2026-02-01 00:05:04 TP0] Capture cuda graph end. Time elapsed: 1.24 s. mem usage=0.09 GB. avail mem=2.89 GB.
max_total_num_tokens=399740
Warmup ...
[2026-02-01 00:05:04 TP0] Reset HybridReqToTokenPool
Prefill. latency: 1.02554 s, throughput:   3994.01 token/s
Decode 0. Batch size: 4, latency: 0.19385 s, throughput:     20.63 token/s
Decode 1. Batch size: 4, latency: 0.01851 s, throughput:    216.12 token/s
Decode 2. Batch size: 4, latency: 0.01837 s, throughput:    217.70 token/s
Decode 3. Batch size: 4, latency: 0.01846 s, throughput:    216.68 token/s
Decode 4. Batch size: 4, latency: 0.01839 s, throughput:    217.48 token/s
Decode.  median latency: 0.01837 s, median throughput:    217.70 token/s
Total. latency:  1.477 s, throughput:   2817.07 token/s
Benchmark ...
[2026-02-01 00:05:06 TP0] Reset HybridReqToTokenPool
Prefill. latency: 0.72352 s, throughput:   5661.19 token/s
Decode 0. Batch size: 4, latency: 0.01841 s, throughput:    217.27 token/s
Decode 1. Batch size: 4, latency: 0.01824 s, throughput:    219.30 token/s
Decode 2. Batch size: 4, latency: 0.01823 s, throughput:    219.43 token/s
Decode 3. Batch size: 4, latency: 0.01800 s, throughput:    222.25 token/s
Decode 4. Batch size: 4, latency: 0.01815 s, throughput:    220.42 token/s
Decode.  median latency: 0.01821 s, median throughput:    219.63 token/s
Total. latency:  0.996 s, throughput:   4177.08 token/s
[rank0]:[W201 00:05:08.318920119 ProcessGroupNCCL.cpp:1524] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
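
As a back-of-the-envelope check on the decode numbers above (not part of the PR), per-step decode throughput is just batch size divided by step latency:

```python
# Sanity check of the decode throughput reported in the log above:
# throughput = batch_size / per-step latency.
batch_size = 4
step_latency_s = 0.01837  # median decode latency from the warmup run
print(f"{batch_size / step_latency_s:.2f} token/s")  # ~217.7, matching the log
```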

Checklist

Review Process

  1. Ping Merge Oncalls to start the PR flow. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • /tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci
  4. After green CI and required approvals, ask Merge Oncalls to merge.

github-actions bot added the documentation label on Jan 31, 2026.
@gemini-code-assist
Contributor

Summary of Changes

Hello @blazingbhavneek, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly expands SGLang's model compatibility by integrating IBM Granite 4.0 models, which previously lacked native support. The core of the work involves defining the model's hybrid architecture, which combines Mamba and attention layers with Mixture-of-Experts, and adapting SGLang's execution pipeline to efficiently handle these complex models. The provided benchmarks validate the functional correctness and performance parity with existing implementations, making these powerful models accessible to SGLang users.

Highlights

  • IBM Granite 4.0 Model Support: This pull request introduces comprehensive support for IBM Granite 4.0 models, including both the hybrid Mamba-MoE (h-micro) and dense (micro) variants, enabling them to run within the SGLang framework.
  • New Model Configuration and Implementation: A new configuration class, GraniteMoeHybridConfig, has been added, along with the implementation of GraniteMoeHybridModel and GraniteMoeHybridForCausalLM. This includes specialized decoder layers for both Mamba and attention mechanisms, and handling for Mixture-of-Experts (MoE) components (a minimal sketch of the per-layer dispatch follows this list).
  • Performance and Accuracy Benchmarks: Extensive accuracy tests on MMLU and GSM8K benchmarks demonstrate that SGLang's implementation of IBM Granite 4.0 models achieves comparable results to vLLM. Detailed benchmarking and profiling logs are provided for both prefill and decode operations.
  • Integration with SGLang Infrastructure: The changes integrate the new model into SGLang's model loading, server argument handling, and utility functions, ensuring proper recognition and configuration of the hybrid architecture, including Mamba-Radix cache management.
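
For readers unfamiliar with hybrid stacks, here is an illustrative sketch of the per-layer dispatch described above. It is an assumption based on the Hugging Face GraniteMoeHybrid config's layer_types field, not this PR's exact diff; the layer classes are placeholder stand-ins.

```python
# Illustrative only -- the layer classes below are placeholders; the real
# implementation lives in this PR's model file. Dispatch keys off
# config.layer_types, which the HF GraniteMoeHybrid config uses to mark each
# layer as "mamba" or "attention".
from torch import nn


class MambaDecoderLayer(nn.Module):  # placeholder stand-in
    def __init__(self, config, layer_idx):
        super().__init__()
        self.layer_idx = layer_idx


class AttentionDecoderLayer(nn.Module):  # placeholder stand-in
    def __init__(self, config, layer_idx):
        super().__init__()
        self.layer_idx = layer_idx


def build_decoder_layers(config) -> nn.ModuleList:
    """Build the hybrid stack: one Mamba or attention layer per entry in
    config.layer_types (e.g. ["mamba", "mamba", "attention", ...])."""
    layer_cls = {"mamba": MambaDecoderLayer, "attention": AttentionDecoderLayer}
    return nn.ModuleList(
        layer_cls[t](config, i) for i, t in enumerate(config.layer_types)
    )
```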


@gemini-code-assist bot left a comment

Code Review

This pull request adds support for IBM Granite models, including both dense and hybrid Mamba-MoE variants. The changes are comprehensive, covering model configuration, implementation, server arguments, and documentation. The code is well-structured and appears to be a correct port from the vLLM implementation. I have one suggestion for refactoring duplicated code in the model implementation to improve maintainability.

@Kangyan-Zhou
Collaborator

/tag-and-rerun-ci

github-actions bot added the run-ci label on Feb 3, 2026.
@Kangyan-Zhou
Collaborator

@blazingbhavneek can you resolve the conflict? I can help bypass CI and merge the PR.

@blazingbhavneek force-pushed the feature/support-granite-hybrid branch from c98f144 to 097924f on February 15, 2026 00:39.
@blazingbhavneek
Contributor (Author)

@Kangyan-Zhou Done! Rebased on the recommendation of Alison Shao.

@Kangyan-Zhou Kangyan-Zhou merged commit 1ce3420 into sgl-project:main Feb 15, 2026
87 of 98 checks passed
