Add Qwen3.5-2B contrib model#141

Open
jimburtoft wants to merge 1 commit into aws-neuron:main from jimburtoft:contrib/qwen3.5-2b-pr

Conversation

@jimburtoft
Contributor

Note: The template below includes items meant for model contributions only. For other contributions, such as bug fixes or features, fill out only the relevant portions of the form.

Description

Adds Qwen3.5-2B, a 2B parameter dense hybrid DeltaNet + GQA decoder from Alibaba Cloud, to the contrib directory. This model features 18 DeltaNet linear recurrent attention layers and 6 standard GQA layers in a [3 DeltaNet + 1 GQA] x 6 pattern, requiring custom NKI kernels for the DeltaNet forward passes on Neuron.
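The [3 DeltaNet + 1 GQA] x 6 layout can be sketched as a simple index rule. This is an illustrative sketch, assuming the GQA layer closes each 4-layer group; the actual placement is defined in `modeling_qwen35.py`:

```python
def layer_kind(layer_idx: int) -> str:
    # Assumed layout: every 4th layer (indices 3, 7, 11, ...) is standard
    # GQA; the other three in each group use the DeltaNet linear recurrence.
    return "gqa" if layer_idx % 4 == 3 else "deltanet"

pattern = [layer_kind(i) for i in range(24)]
# 18 DeltaNet layers and 6 GQA layers over the 24-layer stack
assert pattern.count("deltanet") == 18 and pattern.count("gqa") == 6
```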

Key implementation details:

  • Fused NKI kernel for DeltaNet context encoding (CTE) — single-kernel chunked forward
  • Per-token NKI recurrent kernel for DeltaNet token generation (TKG)
  • Standard GQA attention for the 6 non-DeltaNet layers
  • Tied embeddings support (tie_word_embeddings=true)
  • Partial RoPE position encoding (25% of head_dim)
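For orientation, the per-token recurrence that a DeltaNet TKG kernel evaluates follows the delta rule: the state is a (d_k x d_v) matrix updated with a rank-1 correction per token, then read out against the query. The pure-Python reference below is a hedged sketch of that math, not the NKI implementation:

```python
def deltanet_step(S, q, k, v, beta):
    """One delta-rule step: S <- S + beta * outer(k, v - S^T k); o = S^T q.

    S is a (d_k x d_v) state matrix as nested lists; q, k, v are vectors;
    beta is the per-token write strength. Sketch only -- the PR's kernels
    run this recurrence (chunked for CTE, per-token for TKG) on Neuron.
    """
    d_k, d_v = len(S), len(S[0])
    # value currently predicted for key k: pred = S^T k
    pred = [sum(S[i][j] * k[i] for i in range(d_k)) for j in range(d_v)]
    # rank-1 delta update replaces the stale value stored under k
    for i in range(d_k):
        for j in range(d_v):
            S[i][j] += beta * k[i] * (v[j] - pred[j])
    # read out against the query: o = S^T q
    return [sum(S[i][j] * q[i] for i in range(d_k)) for j in range(d_v)]
```

With beta=1 and a unit key, the update stores v exactly, and a later write under the same key overwrites it rather than accumulating, which is the defining difference from plain linear attention.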

Model Information

Model Name: Qwen3.5-2B

Model Architecture: Decoder-only hybrid DeltaNet/GQA transformer (24 layers: 18 DeltaNet + 6 GQA), dense SwiGLU MLP, 2048 hidden size, 248K vocabulary
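The dense SwiGLU MLP mentioned above is the standard gated form down(silu(gate(x)) * up(x)). A minimal dependency-free sketch (plain nested-list weights; the real layers are sharded linear projections):

```python
import math

def swiglu_mlp(x, w_gate, w_up, w_down):
    """Dense SwiGLU MLP: down( silu(gate(x)) * up(x) ). Illustrative only."""
    def matvec(W, v):
        return [sum(wij * vj for wij, vj in zip(row, v)) for row in W]
    def silu(v):
        # silu(a) = a * sigmoid(a)
        return [a / (1.0 + math.exp(-a)) for a in v]
    gate = silu(matvec(w_gate, x))
    up = matvec(w_up, x)
    hidden = [g * u for g, u in zip(gate, up)]
    return matvec(w_down, hidden)
```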

Purpose: Text generation (chat model with <|im_start|>/<|im_end|> format)
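The partial RoPE noted in the implementation details rotates only 25% of each head's dimensions and passes the rest through unchanged. A minimal sketch, assuming pair-wise rotation over the first head_dim//4 dims (the exact frequency layout is an assumption; the PR's code is authoritative):

```python
import math

def partial_rope(x, pos, head_dim, rope_fraction=0.25, theta=10000.0):
    """Apply rotary embedding to the first rope_fraction of head_dim;
    leave the remaining dims untouched. x is one head's vector."""
    rot_dim = int(head_dim * rope_fraction)   # 25% of head_dim is rotated
    out = list(x)
    for i in range(0, rot_dim, 2):
        freq = theta ** (-i / rot_dim)        # assumed frequency schedule
        angle = pos * freq
        c, s = math.cos(angle), math.sin(angle)
        out[i]     = x[i] * c - x[i + 1] * s  # 2D rotation of each pair
        out[i + 1] = x[i] * s + x[i + 1] * c
    return out
```

Rotation preserves the norm of each rotated pair, and everything past `rot_dim` is position-independent.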

Checklist

Please ensure your PR includes the following items. Refer to the contrib/CONTRIBUTING.md for detailed guidelines.

Required Components

  • Accuracy Test (ex. test/integration/test_model.py)

    • 9 integration tests validating model accuracy on Neuron
    • First-token logit validation against pre-computed CPU BF16 reference logits (cosine similarity 0.9156, top-1 match, top-5 overlap 4/5)
    • Multi-prompt coherence tests (factual Q&A, code generation, knowledge, list generation)
    • Test can compile and run the model on Neuron
  • README.md with the following sections:

    • Usage Example: Clear code example showing how to use the model
    • Compatibility Matrix: Table showing tested Neuron SDK versions and instance types
    • Example Checkpoints: Links to compatible model checkpoints (HuggingFace Hub)
    • Testing Instructions: Commands to run unit and integration test suites, including CPU reference logit generation
  • Source Code (src/)

    • modeling_qwen35.py — Main text decoder with NKI DeltaNet kernels
    • modeling_qwen35_vision.py — Vision encoder (for future VL support)
    • modeling_qwen35_vl.py — VL orchestrator (for future VL support)
    • nki_kernels/ — DeltaNet NKI kernel implementations (fused CTE, per-token TKG, chunked)

Optional Components

  • Unit Tests (CPU or Neuron-based)
    • 42 unit tests for config parsing, weight conversion, and architecture validation
    • Located in test/unit/ directory

Folder Structure

Confirm your contribution follows this structure:

/contrib/models/Qwen3.5-2B/
  README.md
  /src
    __init__.py
    modeling_qwen35.py
    modeling_qwen35_vision.py
    modeling_qwen35_vl.py
    /nki_kernels
      __init__.py
      nki_deltanet.py
      nki_deltanet_chunked.py
      nki_deltanet_fused.py
  /test
    __init__.py
    /unit
      __init__.py
      test_config.py
      test_weight_conversion.py
    /integration
      __init__.py
      test_model.py

Testing

How did you test this change?

All tests were run on a trn2.3xlarge instance with TP=4, LNC=2, SDK 2.29 (NKI 0.3.0, PyTorch 2.9, neuronx-distributed-inference). The model was compiled from Qwen/Qwen3.5-2B HuggingFace weights.

Test Results:

42 unit tests PASSED (CPU)
9/9 integration tests PASSED (Neuron, trn2.3xlarge TP=4):
  - test_model_loads: PASS
  - test_model_generates: PASS (20 tokens generated)
  - test_output_coherence: PASS
  - test_top_token_valid: PASS
  - test_capital_of_france: PASS ("Paris" in output)
  - test_logit_accuracy: PASS (cosine=0.9156, top-1 match, top-5 overlap 4/5)
  - test_performance_ttft: PASS (157.8 ms)
  - test_performance_throughput: PASS (114.5 tok/s)
  - test_multi_prompt_generation: PASS (4/4 prompts coherent)

Benchmark results (BF16, seq_len=128):

| Batch Size | TTFT (ms) | Throughput (tok/s) |
|-----------:|----------:|-------------------:|
| 1          | 157.8     | 114.5              |
| 2          | 72.0      | 233.1              |
| 4          | 104.4     | 329.6              |
| 8          | 185.6     | 409.5              |

Note on logit validation approach: DeltaNet layers (18 of 24) use NKI linear recurrent kernels that produce higher BF16 numerical divergence than standard GQA. Autoregressive sequences diverge after the first generated token, making multi-token logit_validation() inapplicable. The first-token logits are validated where CPU and Neuron process identical input prefixes. The model outputs TP-sharded logits (vocab/tp_degree) because ModelWrapper does not call _gather_along_dim, so comparison uses the TP shard 0 slice.
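The shard-0 comparison described above can be sketched as follows: cut the full CPU reference logits to the first vocab/tp_degree entries, then compute cosine similarity against the Neuron shard. A hedged, dependency-free sketch (names are illustrative, not the test's actual helpers):

```python
import math

def shard0_cosine(cpu_logits, neuron_shard0, tp_degree=4):
    """Cosine similarity between the TP rank-0 slice of the CPU reference
    logits and the TP-sharded Neuron output (vocab / tp_degree entries)."""
    shard = len(cpu_logits) // tp_degree
    ref = cpu_logits[:shard]                      # shard-0 slice of the reference
    dot = sum(a * b for a, b in zip(ref, neuron_shard0))
    na = math.sqrt(sum(a * a for a in ref))
    nb = math.sqrt(sum(b * b for b in neuron_shard0))
    return dot / (na * nb)
```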

Compatibility

Tested with:

  • Neuron SDK Version(s): 2.29 (neuronx-cc 2.24, NKI 0.3.0)
  • Instance Type(s): trn2.3xlarge (TP=4, LNC=2)
  • PyTorch Version: 2.9.0
  • Python Version: 3.12

Additional Information

  • SDK 2.29+ is required due to NKI 0.3.0 API requirements for the DeltaNet kernels
  • The PyTorch _chunk_forward path creates 5D tensors that trigger a neuronx-cc codegen crash (NCC_INLA001); the fused NKI kernel is the default and required CTE path
  • No mini model test is possible because DeltaNet layers require NKI kernels that only execute on Neuron devices
  • Qwen3.5-2B is a chat model; raw text prompts produce echoey output — tokenizer.apply_chat_template() is required for quality output
  • The qwen3_5 model type requires transformers>=5.0 for CPU reference generation, but the NxDI-pinned transformers==4.57.* works for Neuron inference since the model is loaded via manual config.json parsing
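Since raw prompts produce echoey output, tokenizer.apply_chat_template() should be used; for Qwen-style chat models it renders a ChatML-style string. The dependency-free sketch below approximates that format for illustration (the authoritative template lives in the tokenizer config):

```python
def chatml_prompt(messages):
    """Roughly what apply_chat_template produces for Qwen-style chat models:
    each turn wrapped in <|im_start|>role ... <|im_end|>, then a cue for the
    assistant to generate. Approximation only."""
    parts = [f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages]
    parts.append("<|im_start|>assistant\n")     # generation prompt
    return "".join(parts)
```

In practice, prefer `tokenizer.apply_chat_template(messages, add_generation_prompt=True)` so the tokenizer's own template is used.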

Related Issues

N/A

vLLM Integration

  • This model/feature is intended for use with vLLM
  • Documentation includes vLLM registration instructions

By submitting this PR, I confirm that:

  • I have read and followed the contributing guidelines
  • This is a community contribution and may have limited testing compared to officially-supported models
  • The code follows best practices and is well-documented
  • All required components listed above are included

Hybrid DeltaNet + GQA decoder with custom NKI kernels for Neuron.

- 24 layers: 18 DeltaNet (linear recurrent) + 6 standard GQA
- Custom NKI fused kernel for context encoding (CTE)
- Custom NKI per-token kernel for token generation (TKG)
- First-token logit validation against CPU BF16 reference
- 42 unit tests (CPU) + 9 integration tests (Neuron)
- Validated on trn2.3xlarge TP=4 LNC=2 SDK 2.29
- BS=1-8, seq_len=128-4096, all configurations pass
- 114.5 tok/s BS=1, up to 409.5 tok/s BS=8
