Add Qwen3.5-2B contrib model#141

Open
jimburtoft wants to merge 1 commit into aws-neuron:main from jimburtoft:contrib/qwen3.5-2b-pr

Conversation

@jimburtoft
Contributor

Note: The template below includes items meant for model contributions only. For other contributions, such as bug fixes or features, fill out only the relevant portions of the form.

Description

Adds Qwen3.5-2B, a 2B parameter dense hybrid DeltaNet + GQA decoder from Alibaba Cloud, to the contrib directory. This model features 18 DeltaNet linear recurrent attention layers and 6 standard GQA layers in a [3 DeltaNet + 1 GQA] x 6 pattern, requiring custom NKI kernels for the DeltaNet forward passes on Neuron.
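The [3 DeltaNet + 1 GQA] x 6 layout can be sketched as a simple index rule. This is an illustrative sketch, assuming the GQA layer closes each 4-layer group; the actual placement is defined in `modeling_qwen35.py`:

```python
def layer_kind(layer_idx: int) -> str:
    # Assumed layout: every 4th layer (indices 3, 7, 11, ...) is standard
    # GQA; the other three in each group use the DeltaNet linear recurrence.
    return "gqa" if layer_idx % 4 == 3 else "deltanet"

pattern = [layer_kind(i) for i in range(24)]
# 18 DeltaNet layers and 6 GQA layers over the 24-layer stack
assert pattern.count("deltanet") == 18 and pattern.count("gqa") == 6
```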

Key implementation details:

  • Fused NKI kernel for DeltaNet context encoding (CTE) — single-kernel chunked forward
  • Per-token NKI recurrent kernel for DeltaNet token generation (TKG)
  • Standard GQA attention for the 6 non-DeltaNet layers
  • Tied embeddings support (tie_word_embeddings=true)
  • Partial RoPE position encoding (25% of head_dim)
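For orientation, the per-token recurrence that a DeltaNet TKG kernel evaluates follows the delta rule: the state is a (d_k x d_v) matrix updated with a rank-1 correction per token, then read out against the query. The pure-Python reference below is a hedged sketch of that math, not the NKI implementation:

```python
def deltanet_step(S, q, k, v, beta):
    """One delta-rule step: S <- S + beta * outer(k, v - S^T k); o = S^T q.

    S is a (d_k x d_v) state matrix as nested lists; q, k, v are vectors;
    beta is the per-token write strength. Sketch only -- the PR's kernels
    run this recurrence (chunked for CTE, per-token for TKG) on Neuron.
    """
    d_k, d_v = len(S), len(S[0])
    # value currently predicted for key k: pred = S^T k
    pred = [sum(S[i][j] * k[i] for i in range(d_k)) for j in range(d_v)]
    # rank-1 delta update replaces the stale value stored under k
    for i in range(d_k):
        for j in range(d_v):
            S[i][j] += beta * k[i] * (v[j] - pred[j])
    # read out against the query: o = S^T q
    return [sum(S[i][j] * q[i] for i in range(d_k)) for j in range(d_v)]
```

With beta=1 and a unit key, the update stores v exactly, and a later write under the same key overwrites it rather than accumulating, which is the defining difference from plain linear attention.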

Model Information

Model Name: Qwen3.5-2B

Model Architecture: Decoder-only hybrid DeltaNet/GQA transformer (24 layers: 18 DeltaNet + 6 GQA), dense SwiGLU MLP, 2048 hidden size, 248K vocabulary
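The dense SwiGLU MLP mentioned above is the standard gated form down(silu(gate(x)) * up(x)). A minimal dependency-free sketch (plain nested-list weights; the real layers are sharded linear projections):

```python
import math

def swiglu_mlp(x, w_gate, w_up, w_down):
    """Dense SwiGLU MLP: down( silu(gate(x)) * up(x) ). Illustrative only."""
    def matvec(W, v):
        return [sum(wij * vj for wij, vj in zip(row, v)) for row in W]
    def silu(v):
        # silu(a) = a * sigmoid(a)
        return [a / (1.0 + math.exp(-a)) for a in v]
    gate = silu(matvec(w_gate, x))
    up = matvec(w_up, x)
    hidden = [g * u for g, u in zip(gate, up)]
    return matvec(w_down, hidden)
```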

Purpose: Text generation (chat model with <|im_start|>/<|im_end|> format)
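The partial RoPE noted in the implementation details rotates only 25% of each head's dimensions and passes the rest through unchanged. A minimal sketch, assuming pair-wise rotation over the first head_dim//4 dims (the exact frequency layout is an assumption; the PR's code is authoritative):

```python
import math

def partial_rope(x, pos, head_dim, rope_fraction=0.25, theta=10000.0):
    """Apply rotary embedding to the first rope_fraction of head_dim;
    leave the remaining dims untouched. x is one head's vector."""
    rot_dim = int(head_dim * rope_fraction)   # 25% of head_dim is rotated
    out = list(x)
    for i in range(0, rot_dim, 2):
        freq = theta ** (-i / rot_dim)        # assumed frequency schedule
        angle = pos * freq
        c, s = math.cos(angle), math.sin(angle)
        out[i]     = x[i] * c - x[i + 1] * s  # 2D rotation of each pair
        out[i + 1] = x[i] * s + x[i + 1] * c
    return out
```

Rotation preserves the norm of each rotated pair, and everything past `rot_dim` is position-independent.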

Checklist

Please ensure your PR includes the following items. Refer to the contrib/CONTRIBUTING.md for detailed guidelines.

Required Components

  • Accuracy Test (ex. test/integration/test_model.py)

    • 9 integration tests validating model accuracy on Neuron
    • First-token logit validation against pre-computed CPU BF16 reference logits (cosine similarity 0.9156, top-1 match, top-5 overlap 4/5)
    • Multi-prompt coherence tests (factual Q&A, code generation, knowledge, list generation)
    • Test can compile and run the model on Neuron
  • README.md with the following sections:

    • Usage Example: Clear code example showing how to use the model
    • Compatibility Matrix: Table showing tested Neuron SDK versions and instance types
    • Example Checkpoints: Links to compatible model checkpoints (HuggingFace Hub)
    • Testing Instructions: Commands to run unit and integration test suites, including CPU reference logit generation
  • Source Code (src/)

    • modeling_qwen35.py — Main text decoder with NKI DeltaNet kernels
    • modeling_qwen35_vision.py — Vision encoder (for future VL support)
    • modeling_qwen35_vl.py — VL orchestrator (for future VL support)
    • nki_kernels/ — DeltaNet NKI kernel implementations (fused CTE, per-token TKG, chunked)

Optional Components

  • Unit Tests (CPU or Neuron-based)
    • 42 unit tests for config parsing, weight conversion, and architecture validation
    • Located in test/unit/ directory

Folder Structure

Confirm your contribution follows this structure:

/contrib/models/Qwen3.5-2B/
  README.md
  /src
    __init__.py
    modeling_qwen35.py
    modeling_qwen35_vision.py
    modeling_qwen35_vl.py
    /nki_kernels
      __init__.py
      nki_deltanet.py
      nki_deltanet_chunked.py
      nki_deltanet_fused.py
  /test
    __init__.py
    /unit
      __init__.py
      test_config.py
      test_weight_conversion.py
    /integration
      __init__.py
      test_model.py

Testing

How did you test this change?

All tests were run on a trn2.3xlarge instance with TP=4, LNC=2, SDK 2.29 (NKI 0.3.0, PyTorch 2.9, neuronx-distributed-inference). The model was compiled from Qwen/Qwen3.5-2B HuggingFace weights.

Test Results:

42 unit tests PASSED (CPU)
9/9 integration tests PASSED (Neuron, trn2.3xlarge TP=4):
  - test_model_loads: PASS
  - test_model_generates: PASS (20 tokens generated)
  - test_output_coherence: PASS
  - test_top_token_valid: PASS
  - test_capital_of_france: PASS ("Paris" in output)
  - test_logit_accuracy: PASS (cosine=0.9156, top-1 match, top-5 overlap 4/5)
  - test_performance_ttft: PASS (157.8 ms)
  - test_performance_throughput: PASS (114.5 tok/s)
  - test_multi_prompt_generation: PASS (4/4 prompts coherent)

Benchmark results (BF16, seq_len=128):

| Batch Size | TTFT (ms) | Throughput (tok/s) |
|-----------:|----------:|-------------------:|
| 1          | 157.8     | 114.5              |
| 2          | 72.0      | 233.1              |
| 4          | 104.4     | 329.6              |
| 8          | 185.6     | 409.5              |

Note on logit validation approach: DeltaNet layers (18 of 24) use NKI linear recurrent kernels that produce higher BF16 numerical divergence than standard GQA. Autoregressive sequences diverge after the first generated token, making multi-token logit_validation() inapplicable. The first-token logits are validated where CPU and Neuron process identical input prefixes. The model outputs TP-sharded logits (vocab/tp_degree) because ModelWrapper does not call _gather_along_dim, so comparison uses the TP shard 0 slice.
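The shard-0 comparison described above can be sketched as follows: cut the full CPU reference logits to the first vocab/tp_degree entries, then compute cosine similarity against the Neuron shard. A hedged, dependency-free sketch (names are illustrative, not the test's actual helpers):

```python
import math

def shard0_cosine(cpu_logits, neuron_shard0, tp_degree=4):
    """Cosine similarity between the TP rank-0 slice of the CPU reference
    logits and the TP-sharded Neuron output (vocab / tp_degree entries)."""
    shard = len(cpu_logits) // tp_degree
    ref = cpu_logits[:shard]                      # shard-0 slice of the reference
    dot = sum(a * b for a, b in zip(ref, neuron_shard0))
    na = math.sqrt(sum(a * a for a in ref))
    nb = math.sqrt(sum(b * b for b in neuron_shard0))
    return dot / (na * nb)
```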

Compatibility

Tested with:

  • Neuron SDK Version(s): 2.29 (neuronx-cc 2.24, NKI 0.3.0)
  • Instance Type(s): trn2.3xlarge (TP=4, LNC=2)
  • PyTorch Version: 2.9.0
  • Python Version: 3.12

Additional Information

  • SDK 2.29+ is required due to NKI 0.3.0 API requirements for the DeltaNet kernels
  • The PyTorch _chunk_forward path creates 5D tensors that trigger a neuronx-cc codegen crash (NCC_INLA001); the fused NKI kernel is the default and required CTE path
  • No mini model test is possible because DeltaNet layers require NKI kernels that only execute on Neuron devices
  • Qwen3.5-2B is a chat model; raw text prompts produce echoey output — tokenizer.apply_chat_template() is required for quality output
  • The qwen3_5 model type requires transformers>=5.0 for CPU reference generation, but the NxDI-pinned transformers==4.57.* works for Neuron inference since the model is loaded via manual config.json parsing
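Since raw prompts produce echoey output, tokenizer.apply_chat_template() should be used; for Qwen-style chat models it renders a ChatML-style string. The dependency-free sketch below approximates that format for illustration (the authoritative template lives in the tokenizer config):

```python
def chatml_prompt(messages):
    """Roughly what apply_chat_template produces for Qwen-style chat models:
    each turn wrapped in <|im_start|>role ... <|im_end|>, then a cue for the
    assistant to generate. Approximation only."""
    parts = [f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages]
    parts.append("<|im_start|>assistant\n")     # generation prompt
    return "".join(parts)
```

In practice, prefer `tokenizer.apply_chat_template(messages, add_generation_prompt=True)` so the tokenizer's own template is used.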

Related Issues

N/A

vLLM Integration

  • This model/feature is intended for use with vLLM
  • Documentation includes vLLM registration instructions

By submitting this PR, I confirm that:

  • I have read and followed the contributing guidelines
  • This is a community contribution and may have limited testing compared to officially-supported models
  • The code follows best practices and is well-documented
  • All required components listed above are included

Hybrid DeltaNet + GQA decoder with custom NKI kernels for Neuron.

- 24 layers: 18 DeltaNet (linear recurrent) + 6 standard GQA
- Custom NKI fused kernel for context encoding (CTE)
- Custom NKI per-token kernel for token generation (TKG)
- First-token logit validation against CPU BF16 reference
- 42 unit tests (CPU) + 9 integration tests (Neuron)
- Validated on trn2.3xlarge TP=4 LNC=2 SDK 2.29
- BS=1-8, seq_len=128-4096, all configurations pass
- 114.5 tok/s BS=1, up to 409.5 tok/s BS=8
