
Add sarvam-m contrib model (Mistral head_dim fix) #139

Open
jimburtoft wants to merge 1 commit into aws-neuron:main from
jimburtoft:contrib/sarvam-m

Conversation

@jimburtoft
Contributor

Note: The below template includes items meant for model contributions only. For other contributions such as bug fixes, features, etc., only fill out the relevant portions of the form.

Description

Adds contrib support for sarvam-m, a 24B Mistral-architecture decoder-only LLM optimized for Indian languages and English.

This model exposes a general issue in NeuronMixtralAttention: it hardcodes head_dim = config.hidden_size // config.num_attention_heads, ignoring the explicit head_dim in the config. For sarvam-m, this computes head_dim=160 when the actual value is 128, causing XLA shape mismatches. The fix applies the same getattr(config, "head_dim", ...) pattern already used by NeuronLlamaAttention, NeuronQwen3Attention, NeuronGemma3Attention, and others.
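The mismatch is easy to reproduce in isolation. A minimal sketch, using sarvam-m's config values from this PR (the variable names are illustrative, not the actual NeuronMixtralAttention code):

```python
from types import SimpleNamespace

# sarvam-m config values as described in this PR
config = SimpleNamespace(hidden_size=5120, num_attention_heads=32, head_dim=128)

# Current NeuronMixtralAttention behavior: derive head_dim, ignoring the config
hardcoded = config.hidden_size // config.num_attention_heads
print(hardcoded)  # 160 -> XLA shape mismatch against the 128-dim checkpoint

# The getattr pattern used by NeuronLlamaAttention, NeuronQwen3Attention, etc.:
# fall back to the derived value only when the config has no explicit head_dim
fixed = getattr(config, "head_dim", config.hidden_size // config.num_attention_heads)
print(fixed)  # 128 -> matches the checkpoint
```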

The contrib includes:

  • src/setup_patches.py: Applies head_dim fix to modeling_mixtral.py + NKI eps guards for QKV CTE kernels
  • test/integration/test_model.py: Comprehensive integration tests (English/Hindi generation, greedy determinism, throughput)
  • README.md: Full documentation with benchmarks, patch explanations, and usage instructions

Model Information

Model Name: sarvam-m (sarvamai/sarvam-m)

Model Architecture: Decoder-only transformer (MistralForCausalLM) — GQA 32Q/8KV, head_dim=128, hidden_size=5120, 40 layers, vocab 131K, 32K context

Purpose: Text generation in English and Indian languages (Hindi, Tamil, Telugu, etc.)

Checklist

Required Components

  • Accuracy Test (test/integration/test_model.py)

    • Validates English and Hindi generation accuracy
    • Greedy determinism verification
    • Token-level match rate against CPU reference
    • Compiles and runs model on Neuron via vLLM
  • README.md with the following sections:

    • Usage Example: Full setup including patch application, LNC configuration, vLLM launch, and curl query
    • Compatibility Matrix: SDK 2.29 on trn2.3xlarge (TP=8 LNC=1 and TP=4 LNC=2)
    • Example Checkpoints: Link to sarvamai/sarvam-m on HuggingFace
    • Testing Instructions: pytest and standalone runner with environment variable overrides
  • Source Code (src/)

    • setup_patches.py: Applies 3 patches (Mistral head_dim, nkilib eps, neuronxcc eps)
    • Properly structured in contrib folder hierarchy
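For readers unfamiliar with the setup_patches mechanism, the head_dim patch can be sketched as a monkeypatch on a stand-in class (the class and config names below are hypothetical; the real setup_patches.py edits modeling_mixtral.py and also applies the NKI eps guards, which are omitted here):

```python
class FakeConfig:
    """Stand-in for sarvam-m's config with an explicit head_dim."""
    hidden_size = 5120
    num_attention_heads = 32
    head_dim = 128

class FakeAttention:
    """Stand-in for NeuronMixtralAttention's hardcoded derivation."""
    def __init__(self, config):
        self.head_dim = config.hidden_size // config.num_attention_heads

def apply_head_dim_patch(cls):
    original_init = cls.__init__
    def patched_init(self, config):
        original_init(self, config)
        # Prefer an explicit head_dim when the config provides one
        self.head_dim = getattr(config, "head_dim", self.head_dim)
    cls.__init__ = patched_init

apply_head_dim_patch(FakeAttention)
print(FakeAttention(FakeConfig()).head_dim)  # 128 after the patch, not 160
```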

Optional Components

  • Unit Tests — Not applicable (patches are integration-level)

Folder Structure

/contrib/models/sarvam-m/
  README.md
  /src
    __init__.py
    setup_patches.py
  /test
    __init__.py
    /integration
      __init__.py
      test_model.py
    /unit
      __init__.py

Testing

How did you test this change?

Tested on trn2.3xlarge with SDK 2.29 (DLAMI 20260410). Applied patches, launched vLLM with TP=8 (LNC=1) and TP=4 (LNC=2), validated English and Hindi generation, greedy determinism, and throughput benchmarks across 4 workloads (128/128, 128/512, 2048/128, 2048/512 input/output lengths) at concurrency 1 and 4.

Test Results:

| Config | Single-stream | Peak (conc=4) | TPOT |
|---|---|---|---|
| TP=8, LNC=1 | 42.3 tok/s | 160.1 tok/s | 23.6 ms |
| TP=4, LNC=2 | 36.1 tok/s | 132.8 tok/s | 27.7 ms |
  • Greedy determinism: PASS
  • English accuracy: PASS
  • Hindi accuracy: PASS
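As a quick internal consistency check on the numbers above: in single-stream decoding, TPOT is the reciprocal of throughput, so the reported columns should (and do) agree:

```python
# Single-stream decode emits one token per TPOT, so TPOT (ms) ~= 1000 / (tok/s).
results = {
    "TP=8, LNC=1": (42.3, 23.6),  # (single-stream tok/s, reported TPOT ms)
    "TP=4, LNC=2": (36.1, 27.7),
}
for name, (toks_per_s, tpot_ms) in results.items():
    assert abs(1000 / toks_per_s - tpot_ms) < 0.1, name
print("reported TPOT consistent with single-stream throughput")
```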

Compatibility

Tested with:

  • Neuron SDK Version(s): 2.29
  • Instance Type(s): trn2.3xlarge
  • PyTorch Version: 2.9
  • Python Version: 3.12

Additional Information

The head_dim fix is a general improvement that benefits any Mistral-family model where head_dim != hidden_size // num_attention_heads. Consider upstreaming this fix to NeuronMixtralAttention directly (1-line change to use getattr), which would eliminate the need for this patch.

The NKI eps guards are also model-agnostic and could be upstreamed to nkilib/neuronxcc.

Related Issues

None

vLLM Integration

  • This model/feature is intended for use with vLLM
  • Documentation includes vLLM registration instructions

The model works with standard vLLM serving (no custom registration needed) — it uses the existing NeuronMixtralForCausalLM code path after patches are applied.


By submitting this PR, I confirm that:

  • I have read and followed the contributing guidelines
  • This is a community contribution and may have limited testing compared to officially-supported models
  • The code follows best practices and is well-documented
  • All required components listed above are included
