[NV] Add MiniMax-M2.5 FP4 GB200 Dynamo vLLM recipes#1642
Conversation
|
Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes is similar to the official vLLM recipes and/or the SGLang cookbook If it is not, please create a PR first before we can merge your single node PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you
PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow As a rule of thumb, generally, PR authors should request a review & get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers. If additional help is needed, PR authors can reach out to core maintainers over Slack. |
| elif [[ $FRAMEWORK == "dynamo-vllm" && $MODEL_PREFIX == "minimaxm2.5" ]]; then | ||
| git clone https://github.com/NVIDIA/srt-slurm.git "$SRT_REPO_DIR" | ||
| cd "$SRT_REPO_DIR" | ||
| git checkout main | ||
| mkdir -p recipes/vllm/minimax-m2.5-gb200 | ||
| cp -rT "$GITHUB_WORKSPACE/benchmarks/multi_node/srt-slurm-recipes/vllm/minimax-m2.5-gb200" recipes/vllm/minimax-m2.5-gb200 |
There was a problem hiding this comment.
🔴 The new minimaxm2.5 elif branch runs git checkout main against NVIDIA/srt-slurm, while every other branch in this same chain pins to a stable ref (aflowers/vllm-gb200-v0.20.0, sa-submission-q2-2026). Tracking upstream main makes this benchmark non-reproducible — any merge to NVIDIA/srt-slurm's main can change srtctl semantics, container aliases, or recipe schema and silently break the run. Pin to a specific commit/tag/branch like the surrounding code.
Extended reasoning...
What the bug is
At runners/launch_gb200-nv.sh:241-246, the new minimaxm2.5 elif branch added by this PR is:
elif [[ $FRAMEWORK == "dynamo-vllm" && $MODEL_PREFIX == "minimaxm2.5" ]]; then
git clone https://github.com/NVIDIA/srt-slurm.git "$SRT_REPO_DIR"
cd "$SRT_REPO_DIR"
git checkout main
mkdir -p recipes/vllm/minimax-m2.5-gb200
cp -rT "$GITHUB_WORKSPACE/benchmarks/multi_node/srt-slurm-recipes/vllm/minimax-m2.5-gb200" recipes/vllm/minimax-m2.5-gb200
fiThe git checkout main line tracks the moving tip of NVIDIA/srt-slurm's default branch rather than a stable ref.
Why this is inconsistent with the surrounding code
Every other branch of this same elif-chain explicitly pins to a stable named ref:
dsv4+dynamo-vllm→git checkout aflowers/vllm-gb200-v0.20.0dsv4+dynamo-sglang→git checkout sa-submission-q2-2026- generic
dynamo-vllmfallback →git checkout sa-submission-q2-2026 kimik2.5/dynamo-trt→git checkout sa-submission-q2-2026IS_AGENTIC/else→--branch cam/sa-submission-q2-2026 --single-branch
The minimax branch is the only one in this file that points at main. The pinning convention is clearly the project's deliberate design choice — it's how the benchmark numbers in perf-changelog.yaml are kept reproducible.
Step-by-step proof of how this breaks reproducibility
- Day 1: PR [NV] Add MiniMax-M2.5 FP4 GB200 Dynamo vLLM recipes #1642 merges. A sweep is run, recipes overlay onto whatever
NVIDIA/srt-slurm@mainlooked like that day. Numbers get added toperf-changelog.yamlkeyed byminimaxm2.5-fp4-gb200-dynamo-vllm. - Day 30: someone (anyone) merges to
NVIDIA/srt-slurm's main — e.g., renames a flag insrtctl apply, changes thecontainers:schema insrtslurm.yaml, or refactors theMakefilesetup target. - Day 31: an InferenceX rerun of the same
minimaxm2.5-fp4-gb200-dynamo-vllmconfig now picks up the new upstream tip. One of the following happens:- The recipe overlay copy still lands fine, but
srtctl apply -f "$CONFIG_FILE"fails at runtime with a new flag/key/schema error. make setup ARCH=aarch64(launch_gb200-nv.sh:323) fails or behaves differently.- The container alias keys in
srtslurm.yaml(dynamo-trtllm,dynamo-sglang, etc.) no longer match what upstream expects. - Worse: nothing fails, but the benchmark silently runs under a different srtctl behaviour, producing different numbers from the same config — invalidating the perf-changelog entry without anyone noticing.
- The recipe overlay copy still lands fine, but
Failure mode (3d) is the most concerning: silent drift in benchmark numbers undermines the entire purpose of perf-changelog.yaml.
Why existing code doesn't prevent it
Nothing else in the launcher constrains the srt-slurm tree state — make setup, srtctl apply, and the recipe overlay all consume whatever was checked out. The only line that determines reproducibility is the git checkout itself.
How to fix
Replace git checkout main with a pinned ref, mirroring the dsv4 branch. Either:
git checkout sa-submission-q2-2026(matches the genericdynamo-vllmanddsv4-sglangbranches), orgit checkout <specific-commit-sha>if a particular commit is known to work, or- create a
minimax-*branch onNVIDIA/srt-slurmand pin to that.
If the author intentionally wants to track main (e.g., minimax depends on an upstream feature not yet on sa-submission-q2-2026), an inline comment explaining why would resolve the concern — but at minimum a pin to a specific commit SHA on main would preserve reproducibility.
| for cand in \ | ||
| /mnt/lustre01/users/slurm-shared/squash \ | ||
| /mnt/lustre01/users-public/slurm-shared/squash \ | ||
| /mnt/lustre01/groups/slurm-shared/squash \ | ||
| /mnt/lustre01/users-public/sa-shared \ | ||
| /nfs/slurm-shared/squash \ | ||
| /home/slurm-shared/gharunners/squash | ||
| do | ||
| if mkdir -p "$cand" 2>/dev/null && touch "$cand/.write-probe.$$" 2>/dev/null; then | ||
| rm -f "$cand/.write-probe.$$" 2>/dev/null | ||
| SQUASH_DIR="$cand" | ||
| echo "Selected SQUASH_DIR=$SQUASH_DIR (first writable candidate)" | ||
| break | ||
| else | ||
| echo " not writable: $cand" | ||
| fi | ||
| done | ||
| if [ -z "$SQUASH_DIR" ]; then | ||
| echo "Error: no writable squash dir candidate found on this cluster" >&2 | ||
| exit 1 | ||
| fi |
There was a problem hiding this comment.
🟡 The SQUASH_DIR probe at lines 103-119 selects the first head-node-writable candidate, but the list includes /home/slurm-shared/gharunners/squash as a final fallback — directly contradicting the comment block at lines 198-204 which states '/home/slurm-shared isn't cross-mounted to compute nodes' (the very motivation for this probe). If all Lustre/NFS candidates fail the writability test, the probe silently selects the /home path and downstream slurm jobs will fail with an obscure 'image not found' error rather than the explicit 'no writable squash dir found' exit at line 121. Drop the /home candidate or add a compute-side visibility check.
Extended reasoning...
What the bug is
The squash directory probe at runners/launch_gb200-nv.sh:103-119 iterates through six candidate paths and selects the first one that is writable from the launcher (head node). The probe uses mkdir -p + touch to verify writability — but writability from the head node is necessary but not sufficient for the actual requirement, which is that the squash file must be readable by compute nodes at job-launch time.
The probe list ends with /home/slurm-shared/gharunners/squash. This is contradicted by the same diff at lines 198-204 (the SRT_REPO_DIR block):
On the watchtower (Oracle) gb200 cluster, /home/slurm-shared isn't cross-mounted to compute nodes — slurm jobs cd'ing into WorkDir and opening StdOut on /home/... crash with ExitCode=1:0 in 0s and the log file never gets created.
The whole reason this probe exists is that /home/ paths fail on watchtower. Yet the probe keeps a /home/ candidate as the fallback.
How it manifests
Step-by-step proof:
- The launcher runs on the head node, which has access to
/home/slurm-shared/gharunners/squash(it's the GitHub Actions runner's home FS). - Assume a fresh or misconfigured cluster where the four Lustre candidates fail the probe — e.g.,
/mnt/lustre01/users/slurm-shared/squashdoesn't exist and the user lacksmkdirperms on/mnt/lustre01/groups, and/nfs/slurm-shared/squashisn't present. - The loop reaches
/home/slurm-shared/gharunners/squash.mkdir -psucceeds (head-node /home is writable).touchsucceeds. The loop selects it and breaks. SQUASH_FILE=/home/slurm-shared/gharunners/squash/...sqshis set, andenroot importwrites the squash there successfully (head-node-local).- Slurm submits the job; compute nodes try to read
SQUASH_FILEvia pyxis. The path doesn't exist on compute. Job fails with 'image not found' or similar obscure error, far from the actual root cause.
The user then has to dig through compute-side logs, correlate with the squash-import success on the head node, and re-derive the cross-mount limitation already documented in the comment block 80 lines below.
Why existing code doesn't prevent it
The probe only does mkdir -p + touch from the head node (lines 112-113). There is no compute-side check — no srun --nodes=1 test -d "$cand" or equivalent. And the explicit failure exit at line 121 (no writable squash dir found) only triggers if every candidate fails head-node writability, so the /home fallback masks the very situation the probe was designed to surface.
Impact
Limited blast radius — this code only triggers for MODEL_PREFIX == minimaxm2.5 (gated at line 78), and in practice one of the four Lustre candidates is likely to succeed on watchtower. But the contradiction with the script's own stated design intent is real, and if it does trigger, the failure mode is exactly the obscure downstream error the probe was built to avoid.
How to fix it
Either:
- Drop the /home candidate — remove
/home/slurm-shared/gharunners/squashfrom the probe list. The explicitexit 1at line 121 will then correctly surface the 'no writable squash dir' condition. - Add compute-side visibility check — after selecting a candidate, do
srun --partition="$SLURM_PARTITION" --account="$SLURM_ACCOUNT" -N1 --time=1 test -d "$cand"(or similar) before committing to it, and skip on failure.
Option 1 is the simpler preemptive fix given that this is diagnostic / transitional code.
| - minimaxm2.5-fp4-gb200-dynamo-vllm | ||
| description: | ||
| - "Add MiniMax-M2.5 NVFP4 GB200 disaggregated multinode vLLM benchmarks via Dynamo" | ||
| pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/TBD |
There was a problem hiding this comment.
🟡 The new minimaxm2.5-fp4-gb200-dynamo-vllm entry has pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/TBD, which is a placeholder that renders as a dead 404 link in the changelog. All other entries in this file use real PR numbers (1626, 1630, 1631, 1588, 1627, etc.). Please replace TBD with 1642 before merging.
Extended reasoning...
What the bug is
perf-changelog.yaml is a user-facing log mapping config-key additions/changes to the PR that introduced them. Each entry has a pr-link pointing to the GitHub PR. The new minimaxm2.5-fp4-gb200-dynamo-vllm entry added by this PR has:
- config-keys:
- minimaxm2.5-fp4-gb200-dynamo-vllm
description:
- "Add MiniMax-M2.5 NVFP4 GB200 disaggregated multinode vLLM benchmarks via Dynamo"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/TBDThe TBD is a placeholder the author intended to replace once the PR was opened and given a number, but it was left as-is.
The convention is established
Every other entry in the surrounding context uses a real numeric PR id, e.g. lines 3373/3379/3385/3391/3397 reference 1626, 1630, 1631, 1588, 1627. None use a placeholder. The same convention holds throughout the file.
Impact
Clicking the link in the rendered changelog will hit https://github.com/SemiAnalysisAI/InferenceX/pull/TBD which is a 404. Anyone reading the changelog to find the originating PR for this minimax recipe is left without a working pointer to the discussion, review history, and context. It's purely a documentation/metadata defect — no runtime behaviour is affected — but the changelog is the primary discoverability surface for performance changes.
Fix
Replace TBD with 1642 (this PR's number):
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1642Step-by-step proof
- Open
perf-changelog.yamlat line 3403 → see literal stringpr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/TBD. - The URL https://github.com/SemiAnalysisAI/InferenceX/pull/TBD is not a valid GitHub PR URL (PR ids are numeric), so the GitHub web UI returns 404.
- The current PR's number (from preloaded metadata) is 1642, which is the value that should have been substituted.
- Compare to the immediately preceding entry in the diff:
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1627— same key, real number. The pattern is unambiguous.
Severity rationale
Nit — it's a metadata/link typo rather than a code defect, easy to fix in a one-character change, but worth catching pre-merge because once merged the broken link ships to anyone consuming the changelog.
|
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26842044655 |
|
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26843662304 |
|
/reuse-sweep-run |
Summary
Add GB200 MiniMax-M2.5 FP4 Dynamo vLLM recipes.
Note
Low Risk
Benchmark and CI launcher configuration only; minimax-specific launcher logic is gated and does not change other models’ paths.
Overview
Adds MiniMax-M2.5 NVFP4 disaggregated Dynamo + vLLM GB200 benchmarking: a new
minimaxm2.5-fp4-gb200-dynamo-vllmentry innvidia-master.yamlwithfixed-seq-lensweeps at 1k/1k and 8k/1k, each search point wiring prefill/decode worker counts and parallelism to aCONFIG_FILEunder the new recipe tree.Introduces srt-slurm recipe YAMLs under
benchmarks/multi_node/srt-slurm-recipes/vllm/minimax-m2.5-gb200/(TP4/TP4+EP and data-parallel decode variants, rate-matched prefill/decode scaling, Nixl KV transfer,sa-benchconcurrencies). Documents the addition inperf-changelog.yaml.Extends
launch_gb200-nv.shforminimaxm2.5only: Lustre model path and alias, clone NVIDIA/srt-slurmmainwith recipe overlay, and watchtower/multi-cluster handling (diagnostics, auto-pick writable squash and shared run dirs, shared-FSsrt-slurm+ rsync’dINFMAX_WORKSPACE,/usr/bin/python3venv). TightensCONFIG_FILEhandling with existence checks; other dynamo models are unchanged aside from shared path variables when minimax isn’t selected.Reviewed by Cursor Bugbot for commit 9f1003e. Bugbot is set up for automated code reviews on this repo. Configure here.