
Add sarvam-m contrib model (Mistral head_dim fix) #139

Open
jimburtoft wants to merge 1 commit into aws-neuron:main from
jimburtoft:contrib/sarvam-m

Conversation

@jimburtoft
Contributor

Note: The below template includes items meant for model contributions only. For other contributions such as bug fixes, features, etc., only fill out the relevant portions of the form.

Description

Adds contrib support for sarvam-m, a 24B Mistral-architecture decoder-only LLM optimized for Indian languages and English.

This model exposes a general issue in NeuronMixtralAttention: it hardcodes head_dim = config.hidden_size // config.num_attention_heads, ignoring the explicit head_dim in the config. For sarvam-m, this computes head_dim=160 when the actual value is 128, causing XLA shape mismatches. The fix applies the same getattr(config, "head_dim", ...) pattern already used by NeuronLlamaAttention, NeuronQwen3Attention, NeuronGemma3Attention, and others.
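The mismatch is easy to reproduce in isolation. A minimal sketch, using sarvam-m's config values from this PR (the variable names are illustrative, not the actual NeuronMixtralAttention code):

```python
from types import SimpleNamespace

# sarvam-m config values as described in this PR
config = SimpleNamespace(hidden_size=5120, num_attention_heads=32, head_dim=128)

# Current NeuronMixtralAttention behavior: derive head_dim, ignoring the config
hardcoded = config.hidden_size // config.num_attention_heads
print(hardcoded)  # 160 -> XLA shape mismatch against the 128-dim checkpoint

# The getattr pattern used by NeuronLlamaAttention, NeuronQwen3Attention, etc.:
# fall back to the derived value only when the config has no explicit head_dim
fixed = getattr(config, "head_dim", config.hidden_size // config.num_attention_heads)
print(fixed)  # 128 -> matches the checkpoint
```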

The contrib includes:

  • src/setup_patches.py: Applies head_dim fix to modeling_mixtral.py + NKI eps guards for QKV CTE kernels
  • test/integration/test_model.py: Comprehensive integration tests (English/Hindi generation, greedy determinism, throughput)
  • README.md: Full documentation with benchmarks, patch explanations, and usage instructions

Model Information

Model Name: sarvam-m (sarvamai/sarvam-m)

Model Architecture: Decoder-only transformer (MistralForCausalLM) — GQA 32Q/8KV, head_dim=128, hidden_size=5120, 40 layers, vocab 131K, 32K context

Purpose: Text generation in English and Indian languages (Hindi, Tamil, Telugu, etc.)

Checklist

Required Components

  • Accuracy Test (test/integration/test_model.py)

    • Validates English and Hindi generation accuracy
    • Greedy determinism verification
    • Token-level match rate against CPU reference
    • Compiles and runs model on Neuron via vLLM
  • README.md with the following sections:

    • Usage Example: Full setup including patch application, LNC configuration, vLLM launch, and curl query
    • Compatibility Matrix: SDK 2.29 on trn2.3xlarge (TP=8 LNC=1 and TP=4 LNC=2)
    • Example Checkpoints: Link to sarvamai/sarvam-m on HuggingFace
    • Testing Instructions: pytest and standalone runner with environment variable overrides
  • Source Code (src/)

    • setup_patches.py: Applies 3 patches (Mistral head_dim, nkilib eps, neuronxcc eps)
    • Properly structured in contrib folder hierarchy
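For readers unfamiliar with the setup_patches mechanism, the head_dim patch can be sketched as a monkeypatch on a stand-in class (the class and config names below are hypothetical; the real setup_patches.py edits modeling_mixtral.py and also applies the NKI eps guards, which are omitted here):

```python
class FakeConfig:
    """Stand-in for sarvam-m's config with an explicit head_dim."""
    hidden_size = 5120
    num_attention_heads = 32
    head_dim = 128

class FakeAttention:
    """Stand-in for NeuronMixtralAttention's hardcoded derivation."""
    def __init__(self, config):
        self.head_dim = config.hidden_size // config.num_attention_heads

def apply_head_dim_patch(cls):
    original_init = cls.__init__
    def patched_init(self, config):
        original_init(self, config)
        # Prefer an explicit head_dim when the config provides one
        self.head_dim = getattr(config, "head_dim", self.head_dim)
    cls.__init__ = patched_init

apply_head_dim_patch(FakeAttention)
print(FakeAttention(FakeConfig()).head_dim)  # 128 after the patch, not 160
```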

Optional Components

  • Unit Tests — Not applicable (patches are integration-level)

Folder Structure

/contrib/models/sarvam-m/
  README.md
  /src
    __init__.py
    setup_patches.py
  /test
    __init__.py
    /integration
      __init__.py
      test_model.py
    /unit
      __init__.py

Testing

How did you test this change?

Tested on trn2.3xlarge with SDK 2.29 (DLAMI 20260410). Applied patches, launched vLLM with TP=8 (LNC=1) and TP=4 (LNC=2), validated English and Hindi generation, greedy determinism, and throughput benchmarks across 4 workloads (128/128, 128/512, 2048/128, 2048/512 input/output lengths) at concurrency 1 and 4.

Test Results:

| Config | Single-stream | Peak (conc=4) | TPOT |
|---|---|---|---|
| TP=8, LNC=1 | 42.3 tok/s | 160.1 tok/s | 23.6 ms |
| TP=4, LNC=2 | 36.1 tok/s | 132.8 tok/s | 27.7 ms |
  • Greedy determinism: PASS
  • English accuracy: PASS
  • Hindi accuracy: PASS
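As a quick internal consistency check on the numbers above: in single-stream decoding, TPOT is the reciprocal of throughput, so the reported columns should (and do) agree:

```python
# Single-stream decode emits one token per TPOT, so TPOT (ms) ~= 1000 / (tok/s).
results = {
    "TP=8, LNC=1": (42.3, 23.6),  # (single-stream tok/s, reported TPOT ms)
    "TP=4, LNC=2": (36.1, 27.7),
}
for name, (toks_per_s, tpot_ms) in results.items():
    assert abs(1000 / toks_per_s - tpot_ms) < 0.1, name
print("reported TPOT consistent with single-stream throughput")
```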

Compatibility

Tested with:

  • Neuron SDK Version(s): 2.29
  • Instance Type(s): trn2.3xlarge
  • PyTorch Version: 2.9
  • Python Version: 3.12

Additional Information

The head_dim fix is a general improvement that benefits any Mistral-family model where head_dim != hidden_size // num_attention_heads. Consider upstreaming this fix to NeuronMixtralAttention directly (1-line change to use getattr), which would eliminate the need for this patch.

The NKI eps guards are also model-agnostic and could be upstreamed to nkilib/neuronxcc.

Related Issues

None

vLLM Integration

  • This model/feature is intended for use with vLLM
  • Documentation includes vLLM registration instructions

The model works with standard vLLM serving (no custom registration needed) — it uses the existing NeuronMixtralForCausalLM code path after patches are applied.


By submitting this PR, I confirm that:

  • I have read and followed the contributing guidelines
  • This is a community contribution and may have limited testing compared to officially-supported models
  • The code follows best practices and is well-documented
  • All required components listed above are included
