[NV] Add MiniMax-M2.5 FP4 GB200 Dynamo vLLM recipes by jasonlizhengjian · Pull Request #1642 · SemiAnalysisAI/InferenceX

jasonlizhengjian · 2026-06-02T18:58:37Z

Summary

Add GB200 MiniMax-M2.5 FP4 Dynamo vLLM recipes.

Note

Low Risk
Benchmark and CI launcher configuration only; minimax-specific launcher logic is gated and does not change other models’ paths.

Overview
Adds MiniMax-M2.5 NVFP4 disaggregated Dynamo + vLLM GB200 benchmarking: a new minimaxm2.5-fp4-gb200-dynamo-vllm entry in nvidia-master.yaml with fixed-seq-len sweeps at 1k/1k and 8k/1k, each search point wiring prefill/decode worker counts and parallelism to a CONFIG_FILE under the new recipe tree.

Introduces srt-slurm recipe YAMLs under benchmarks/multi_node/srt-slurm-recipes/vllm/minimax-m2.5-gb200/ (TP4/TP4+EP and data-parallel decode variants, rate-matched prefill/decode scaling, Nixl KV transfer, sa-bench concurrencies). Documents the addition in perf-changelog.yaml.

Extends launch_gb200-nv.sh for minimaxm2.5 only: Lustre model path and alias, clone NVIDIA/srt-slurm main with recipe overlay, and watchtower/multi-cluster handling (diagnostics, auto-pick writable squash and shared run dirs, shared-FS srt-slurm + rsync’d INFMAX_WORKSPACE, /usr/bin/python3 venv). Tightens CONFIG_FILE handling with existence checks; other dynamo models are unchanged aside from shared path variables when minimax isn’t selected.

^{Reviewed by Cursor Bugbot for commit 9f1003e. Bugbot is set up for automated code reviews on this repo. Configure here.}

github-actions · 2026-06-02T18:58:47Z

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes is similar to the official vLLM recipes and/or the SGLang cookbook

If it is not, please create a PR first before we can merge your single node PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, generally, PR authors should request a review & get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.

claude · 2026-06-02T19:07:00Z

+elif [[ $FRAMEWORK == "dynamo-vllm" && $MODEL_PREFIX == "minimaxm2.5" ]]; then
+    git clone https://github.com/NVIDIA/srt-slurm.git "$SRT_REPO_DIR"
+    cd "$SRT_REPO_DIR"
+    git checkout main
+    mkdir -p recipes/vllm/minimax-m2.5-gb200
+    cp -rT "$GITHUB_WORKSPACE/benchmarks/multi_node/srt-slurm-recipes/vllm/minimax-m2.5-gb200" recipes/vllm/minimax-m2.5-gb200


🔴 The new minimaxm2.5 elif branch runs git checkout main against NVIDIA/srt-slurm, while every other branch in this same chain pins to a stable ref (aflowers/vllm-gb200-v0.20.0, sa-submission-q2-2026). Tracking upstream main makes this benchmark non-reproducible — any merge to NVIDIA/srt-slurm's main can change srtctl semantics, container aliases, or recipe schema and silently break the run. Pin to a specific commit/tag/branch like the surrounding code.

Extended reasoning...

What the bug is

At runners/launch_gb200-nv.sh:241-246, the new minimaxm2.5 elif branch added by this PR is:

elif [[ $FRAMEWORK == "dynamo-vllm" && $MODEL_PREFIX == "minimaxm2.5" ]]; then git clone https://github.com/NVIDIA/srt-slurm.git "$SRT_REPO_DIR" cd "$SRT_REPO_DIR" git checkout main mkdir -p recipes/vllm/minimax-m2.5-gb200 cp -rT "$GITHUB_WORKSPACE/benchmarks/multi_node/srt-slurm-recipes/vllm/minimax-m2.5-gb200" recipes/vllm/minimax-m2.5-gb200 fi

The git checkout main line tracks the moving tip of NVIDIA/srt-slurm's default branch rather than a stable ref.

Why this is inconsistent with the surrounding code

Every other branch of this same elif-chain explicitly pins to a stable named ref:

dsv4 + dynamo-vllm → git checkout aflowers/vllm-gb200-v0.20.0

dsv4 + dynamo-sglang → git checkout sa-submission-q2-2026

generic dynamo-vllm fallback → git checkout sa-submission-q2-2026

kimik2.5 / dynamo-trt → git checkout sa-submission-q2-2026

IS_AGENTIC / else → --branch cam/sa-submission-q2-2026 --single-branch

The minimax branch is the only one in this file that points at main. The pinning convention is clearly the project's deliberate design choice — it's how the benchmark numbers in perf-changelog.yaml are kept reproducible.

Step-by-step proof of how this breaks reproducibility

Day 1: PR [NV] Add MiniMax-M2.5 FP4 GB200 Dynamo vLLM recipes #1642 merges. A sweep is run, recipes overlay onto whatever NVIDIA/srt-slurm@main looked like that day. Numbers get added to perf-changelog.yaml keyed by minimaxm2.5-fp4-gb200-dynamo-vllm.

Day 30: someone (anyone) merges to NVIDIA/srt-slurm's main — e.g., renames a flag in srtctl apply, changes the containers: schema in srtslurm.yaml, or refactors the Makefile setup target.

Day 31: an InferenceX rerun of the same minimaxm2.5-fp4-gb200-dynamo-vllm config now picks up the new upstream tip. One of the following happens:

The recipe overlay copy still lands fine, but srtctl apply -f "$CONFIG_FILE" fails at runtime with a new flag/key/schema error.

make setup ARCH=aarch64 (launch_gb200-nv.sh:323) fails or behaves differently.

The container alias keys in srtslurm.yaml (dynamo-trtllm, dynamo-sglang, etc.) no longer match what upstream expects.

Worse: nothing fails, but the benchmark silently runs under a different srtctl behaviour, producing different numbers from the same config — invalidating the perf-changelog entry without anyone noticing.

Failure mode (3d) is the most concerning: silent drift in benchmark numbers undermines the entire purpose of perf-changelog.yaml.

Why existing code doesn't prevent it

Nothing else in the launcher constrains the srt-slurm tree state — make setup, srtctl apply, and the recipe overlay all consume whatever was checked out. The only line that determines reproducibility is the git checkout itself.

How to fix

Replace git checkout main with a pinned ref, mirroring the dsv4 branch. Either:

git checkout sa-submission-q2-2026 (matches the generic dynamo-vllm and dsv4-sglang branches), or

git checkout <specific-commit-sha> if a particular commit is known to work, or

create a minimax-* branch on NVIDIA/srt-slurm and pin to that.

If the author intentionally wants to track main (e.g., minimax depends on an upstream feature not yet on sa-submission-q2-2026), an inline comment explaining why would resolve the concern — but at minimum a pin to a specific commit SHA on main would preserve reproducibility.

claude · 2026-06-02T19:07:00Z

+    for cand in \
+        /mnt/lustre01/users/slurm-shared/squash \
+        /mnt/lustre01/users-public/slurm-shared/squash \
+        /mnt/lustre01/groups/slurm-shared/squash \
+        /mnt/lustre01/users-public/sa-shared \
+        /nfs/slurm-shared/squash \
+        /home/slurm-shared/gharunners/squash
+    do
+        if mkdir -p "$cand" 2>/dev/null && touch "$cand/.write-probe.$$" 2>/dev/null; then
+            rm -f "$cand/.write-probe.$$" 2>/dev/null
+            SQUASH_DIR="$cand"
+            echo "Selected SQUASH_DIR=$SQUASH_DIR (first writable candidate)"
+            break
+        else
+            echo "  not writable: $cand"
+        fi
+    done
+    if [ -z "$SQUASH_DIR" ]; then
+        echo "Error: no writable squash dir candidate found on this cluster" >&2
+        exit 1
+    fi


🟡 The SQUASH_DIR probe at lines 103-119 selects the first head-node-writable candidate, but the list includes /home/slurm-shared/gharunners/squash as a final fallback — directly contradicting the comment block at lines 198-204 which states '/home/slurm-shared isn't cross-mounted to compute nodes' (the very motivation for this probe). If all Lustre/NFS candidates fail the writability test, the probe silently selects the /home path and downstream slurm jobs will fail with an obscure 'image not found' error rather than the explicit 'no writable squash dir found' exit at line 121. Drop the /home candidate or add a compute-side visibility check.

Extended reasoning...

What the bug is

The squash directory probe at runners/launch_gb200-nv.sh:103-119 iterates through six candidate paths and selects the first one that is writable from the launcher (head node). The probe uses mkdir -p + touch to verify writability — but writability from the head node is necessary but not sufficient for the actual requirement, which is that the squash file must be readable by compute nodes at job-launch time.

The probe list ends with /home/slurm-shared/gharunners/squash. This is contradicted by the same diff at lines 198-204 (the SRT_REPO_DIR block):

On the watchtower (Oracle) gb200 cluster, /home/slurm-shared isn't cross-mounted to compute nodes — slurm jobs cd'ing into WorkDir and opening StdOut on /home/... crash with ExitCode=1:0 in 0s and the log file never gets created.

The whole reason this probe exists is that /home/ paths fail on watchtower. Yet the probe keeps a /home/ candidate as the fallback.

How it manifests

Step-by-step proof:

The launcher runs on the head node, which has access to /home/slurm-shared/gharunners/squash (it's the GitHub Actions runner's home FS).

Assume a fresh or misconfigured cluster where the four Lustre candidates fail the probe — e.g., /mnt/lustre01/users/slurm-shared/squash doesn't exist and the user lacks mkdir perms on /mnt/lustre01/groups, and /nfs/slurm-shared/squash isn't present.

The loop reaches /home/slurm-shared/gharunners/squash. mkdir -p succeeds (head-node /home is writable). touch succeeds. The loop selects it and breaks.

SQUASH_FILE=/home/slurm-shared/gharunners/squash/...sqsh is set, and enroot import writes the squash there successfully (head-node-local).

Slurm submits the job; compute nodes try to read SQUASH_FILE via pyxis. The path doesn't exist on compute. Job fails with 'image not found' or similar obscure error, far from the actual root cause.

The user then has to dig through compute-side logs, correlate with the squash-import success on the head node, and re-derive the cross-mount limitation already documented in the comment block 80 lines below.

Why existing code doesn't prevent it

The probe only does mkdir -p + touch from the head node (lines 112-113). There is no compute-side check — no srun --nodes=1 test -d "$cand" or equivalent. And the explicit failure exit at line 121 (no writable squash dir found) only triggers if every candidate fails head-node writability, so the /home fallback masks the very situation the probe was designed to surface.

Impact

Limited blast radius — this code only triggers for MODEL_PREFIX == minimaxm2.5 (gated at line 78), and in practice one of the four Lustre candidates is likely to succeed on watchtower. But the contradiction with the script's own stated design intent is real, and if it does trigger, the failure mode is exactly the obscure downstream error the probe was built to avoid.

How to fix it

Either:

Drop the /home candidate — remove /home/slurm-shared/gharunners/squash from the probe list. The explicit exit 1 at line 121 will then correctly surface the 'no writable squash dir' condition.

Add compute-side visibility check — after selecting a candidate, do srun --partition="$SLURM_PARTITION" --account="$SLURM_ACCOUNT" -N1 --time=1 test -d "$cand" (or similar) before committing to it, and skip on failure.

Option 1 is the simpler preemptive fix given that this is diagnostic / transitional code.

claude · 2026-06-02T19:07:00Z

+    - minimaxm2.5-fp4-gb200-dynamo-vllm
+  description:
+    - "Add MiniMax-M2.5 NVFP4 GB200 disaggregated multinode vLLM benchmarks via Dynamo"
+  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/TBD


🟡 The new minimaxm2.5-fp4-gb200-dynamo-vllm entry has pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/TBD, which is a placeholder that renders as a dead 404 link in the changelog. All other entries in this file use real PR numbers (1626, 1630, 1631, 1588, 1627, etc.). Please replace TBD with 1642 before merging.

Extended reasoning...

What the bug is

perf-changelog.yaml is a user-facing log mapping config-key additions/changes to the PR that introduced them. Each entry has a pr-link pointing to the GitHub PR. The new minimaxm2.5-fp4-gb200-dynamo-vllm entry added by this PR has:

- config-keys: - minimaxm2.5-fp4-gb200-dynamo-vllm description: - "Add MiniMax-M2.5 NVFP4 GB200 disaggregated multinode vLLM benchmarks via Dynamo" pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/TBD

The TBD is a placeholder the author intended to replace once the PR was opened and given a number, but it was left as-is.

The convention is established

Every other entry in the surrounding context uses a real numeric PR id, e.g. lines 3373/3379/3385/3391/3397 reference 1626, 1630, 1631, 1588, 1627. None use a placeholder. The same convention holds throughout the file.

Impact

Clicking the link in the rendered changelog will hit https://github.com/SemiAnalysisAI/InferenceX/pull/TBD which is a 404. Anyone reading the changelog to find the originating PR for this minimax recipe is left without a working pointer to the discussion, review history, and context. It's purely a documentation/metadata defect — no runtime behaviour is affected — but the changelog is the primary discoverability surface for performance changes.

Fix

Replace TBD with 1642 (this PR's number):

pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1642

Step-by-step proof

Open perf-changelog.yaml at line 3403 → see literal string pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/TBD.

The URL https://github.com/SemiAnalysisAI/InferenceX/pull/TBD is not a valid GitHub PR URL (PR ids are numeric), so the GitHub web UI returns 404.

The current PR's number (from preloaded metadata) is 1642, which is the value that should have been substituted.

Compare to the immediately preceding entry in the diff: pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1627 — same key, real number. The pattern is unambiguous.

Severity rationale

Nit — it's a metadata/link typo rather than a code defect, easy to fix in a one-character change, but worth catching pre-merge because once merged the broken link ships to anyone consuming the changelog.

github-actions · 2026-06-02T19:18:13Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26842044655
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=26842044655

github-actions · 2026-06-02T23:17:56Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26843662304
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=26843662304

functionstackx · 2026-06-02T23:19:34Z

/reuse-sweep-run

Add MiniMax-M2.5 FP4 GB200 Dynamo vLLM recipes

90868c1

jasonlizhengjian requested a review from a team June 2, 2026 18:58

jasonlizhengjian requested review from jgangani and kedarpotdar-nv as code owners June 2, 2026 18:58

github-project-automation Bot added this to InferenceMAX Board Jun 2, 2026

Update GB200 MiniMax changelog PR link

6024d75

claude Bot reviewed Jun 2, 2026

View reviewed changes

jasonlizhengjian added the full-sweep-enabled label Jun 2, 2026

Fix GB200 MiniMax shared workspace setup

9f1003e

functionstackx merged commit 0fc4f86 into main Jun 2, 2026
41 checks passed

functionstackx deleted the nv/jasonli/minimaxm2.5-fp4-gb200-dynamo-vllm branch June 2, 2026 23:19

github-project-automation Bot moved this to Done in InferenceMAX Board Jun 2, 2026

claude Bot mentioned this pull request Jun 3, 2026

[B300][vLLM] Add MiniMax-M2.5 FP4 disagg Dynamo configs #1652

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[NV] Add MiniMax-M2.5 FP4 GB200 Dynamo vLLM recipes#1642

[NV] Add MiniMax-M2.5 FP4 GB200 Dynamo vLLM recipes#1642
functionstackx merged 3 commits into
mainfrom
nv/jasonli/minimaxm2.5-fp4-gb200-dynamo-vllm

jasonlizhengjian commented Jun 2, 2026 •

edited by cursor Bot

Loading

Uh oh!

github-actions Bot commented Jun 2, 2026

Uh oh!

claude Bot Jun 2, 2026

Uh oh!

claude Bot Jun 2, 2026

Uh oh!

claude Bot Jun 2, 2026

Uh oh!

github-actions Bot commented Jun 2, 2026

Uh oh!

github-actions Bot commented Jun 2, 2026

Uh oh!

functionstackx commented Jun 2, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

jasonlizhengjian commented Jun 2, 2026 • edited by cursor Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Uh oh!

github-actions Bot commented Jun 2, 2026

Uh oh!

claude Bot Jun 2, 2026

Choose a reason for hiding this comment

What the bug is

Why this is inconsistent with the surrounding code

Step-by-step proof of how this breaks reproducibility

Why existing code doesn't prevent it

How to fix

Uh oh!

claude Bot Jun 2, 2026

Choose a reason for hiding this comment

What the bug is

How it manifests

Why existing code doesn't prevent it

Impact

How to fix it

Uh oh!

claude Bot Jun 2, 2026

Choose a reason for hiding this comment

Uh oh!

github-actions Bot commented Jun 2, 2026

Uh oh!

github-actions Bot commented Jun 2, 2026

Uh oh!

functionstackx commented Jun 2, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

jasonlizhengjian commented Jun 2, 2026 •

edited by cursor Bot

Loading