Skip to content

[NV] Add MiniMax-M2.5 FP4 GB200 Dynamo vLLM recipes#1642

Merged
functionstackx merged 3 commits into
mainfrom
nv/jasonli/minimaxm2.5-fp4-gb200-dynamo-vllm
Jun 2, 2026
Merged

[NV] Add MiniMax-M2.5 FP4 GB200 Dynamo vLLM recipes#1642
functionstackx merged 3 commits into
mainfrom
nv/jasonli/minimaxm2.5-fp4-gb200-dynamo-vllm

Conversation

@jasonlizhengjian
Copy link
Copy Markdown
Collaborator

@jasonlizhengjian jasonlizhengjian commented Jun 2, 2026

Summary

Add GB200 MiniMax-M2.5 FP4 Dynamo vLLM recipes.


Note

Low Risk
Benchmark and CI launcher configuration only; minimax-specific launcher logic is gated and does not change other models’ paths.

Overview
Adds MiniMax-M2.5 NVFP4 disaggregated Dynamo + vLLM GB200 benchmarking: a new minimaxm2.5-fp4-gb200-dynamo-vllm entry in nvidia-master.yaml with fixed-seq-len sweeps at 1k/1k and 8k/1k, each search point wiring prefill/decode worker counts and parallelism to a CONFIG_FILE under the new recipe tree.

Introduces srt-slurm recipe YAMLs under benchmarks/multi_node/srt-slurm-recipes/vllm/minimax-m2.5-gb200/ (TP4/TP4+EP and data-parallel decode variants, rate-matched prefill/decode scaling, Nixl KV transfer, sa-bench concurrencies). Documents the addition in perf-changelog.yaml.

Extends launch_gb200-nv.sh for minimaxm2.5 only: Lustre model path and alias, clone NVIDIA/srt-slurm main with recipe overlay, and watchtower/multi-cluster handling (diagnostics, auto-pick writable squash and shared run dirs, shared-FS srt-slurm + rsync’d INFMAX_WORKSPACE, /usr/bin/python3 venv). Tightens CONFIG_FILE handling with existence checks; other dynamo models are unchanged aside from shared path variables when minimax isn’t selected.

Reviewed by Cursor Bugbot for commit 9f1003e. Bugbot is set up for automated code reviews on this repo. Configure here.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Jun 2, 2026

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes is similar to the official vLLM recipes and/or the SGLang cookbook

If it is not, please create a PR first before we can merge your single node PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, generally, PR authors should request a review & get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.

Comment thread runners/launch_gb200-nv.sh Outdated
Comment on lines +241 to +246
elif [[ $FRAMEWORK == "dynamo-vllm" && $MODEL_PREFIX == "minimaxm2.5" ]]; then
git clone https://github.com/NVIDIA/srt-slurm.git "$SRT_REPO_DIR"
cd "$SRT_REPO_DIR"
git checkout main
mkdir -p recipes/vllm/minimax-m2.5-gb200
cp -rT "$GITHUB_WORKSPACE/benchmarks/multi_node/srt-slurm-recipes/vllm/minimax-m2.5-gb200" recipes/vllm/minimax-m2.5-gb200
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 The new minimaxm2.5 elif branch runs git checkout main against NVIDIA/srt-slurm, while every other branch in this same chain pins to a stable ref (aflowers/vllm-gb200-v0.20.0, sa-submission-q2-2026). Tracking upstream main makes this benchmark non-reproducible — any merge to NVIDIA/srt-slurm's main can change srtctl semantics, container aliases, or recipe schema and silently break the run. Pin to a specific commit/tag/branch like the surrounding code.

Extended reasoning...

What the bug is

At runners/launch_gb200-nv.sh:241-246, the new minimaxm2.5 elif branch added by this PR is:

elif [[ $FRAMEWORK == "dynamo-vllm" && $MODEL_PREFIX == "minimaxm2.5" ]]; then
    git clone https://github.com/NVIDIA/srt-slurm.git "$SRT_REPO_DIR"
    cd "$SRT_REPO_DIR"
    git checkout main
    mkdir -p recipes/vllm/minimax-m2.5-gb200
    cp -rT "$GITHUB_WORKSPACE/benchmarks/multi_node/srt-slurm-recipes/vllm/minimax-m2.5-gb200" recipes/vllm/minimax-m2.5-gb200
fi

The git checkout main line tracks the moving tip of NVIDIA/srt-slurm's default branch rather than a stable ref.

Why this is inconsistent with the surrounding code

Every other branch of this same elif-chain explicitly pins to a stable named ref:

  • dsv4 + dynamo-vllmgit checkout aflowers/vllm-gb200-v0.20.0
  • dsv4 + dynamo-sglanggit checkout sa-submission-q2-2026
  • generic dynamo-vllm fallback → git checkout sa-submission-q2-2026
  • kimik2.5 / dynamo-trtgit checkout sa-submission-q2-2026
  • IS_AGENTIC / else--branch cam/sa-submission-q2-2026 --single-branch

The minimax branch is the only one in this file that points at main. The pinning convention is clearly the project's deliberate design choice — it's how the benchmark numbers in perf-changelog.yaml are kept reproducible.

Step-by-step proof of how this breaks reproducibility

  1. Day 1: PR [NV] Add MiniMax-M2.5 FP4 GB200 Dynamo vLLM recipes #1642 merges. A sweep is run, recipes overlay onto whatever NVIDIA/srt-slurm@main looked like that day. Numbers get added to perf-changelog.yaml keyed by minimaxm2.5-fp4-gb200-dynamo-vllm.
  2. Day 30: someone (anyone) merges to NVIDIA/srt-slurm's main — e.g., renames a flag in srtctl apply, changes the containers: schema in srtslurm.yaml, or refactors the Makefile setup target.
  3. Day 31: an InferenceX rerun of the same minimaxm2.5-fp4-gb200-dynamo-vllm config now picks up the new upstream tip. One of the following happens:
    • The recipe overlay copy still lands fine, but srtctl apply -f "$CONFIG_FILE" fails at runtime with a new flag/key/schema error.
    • make setup ARCH=aarch64 (launch_gb200-nv.sh:323) fails or behaves differently.
    • The container alias keys in srtslurm.yaml (dynamo-trtllm, dynamo-sglang, etc.) no longer match what upstream expects.
    • Worse: nothing fails, but the benchmark silently runs under a different srtctl behaviour, producing different numbers from the same config — invalidating the perf-changelog entry without anyone noticing.

Failure mode (3d) is the most concerning: silent drift in benchmark numbers undermines the entire purpose of perf-changelog.yaml.

Why existing code doesn't prevent it

Nothing else in the launcher constrains the srt-slurm tree state — make setup, srtctl apply, and the recipe overlay all consume whatever was checked out. The only line that determines reproducibility is the git checkout itself.

How to fix

Replace git checkout main with a pinned ref, mirroring the dsv4 branch. Either:

  • git checkout sa-submission-q2-2026 (matches the generic dynamo-vllm and dsv4-sglang branches), or
  • git checkout <specific-commit-sha> if a particular commit is known to work, or
  • create a minimax-* branch on NVIDIA/srt-slurm and pin to that.

If the author intentionally wants to track main (e.g., minimax depends on an upstream feature not yet on sa-submission-q2-2026), an inline comment explaining why would resolve the concern — but at minimum a pin to a specific commit SHA on main would preserve reproducibility.

Comment on lines +103 to +123
for cand in \
/mnt/lustre01/users/slurm-shared/squash \
/mnt/lustre01/users-public/slurm-shared/squash \
/mnt/lustre01/groups/slurm-shared/squash \
/mnt/lustre01/users-public/sa-shared \
/nfs/slurm-shared/squash \
/home/slurm-shared/gharunners/squash
do
if mkdir -p "$cand" 2>/dev/null && touch "$cand/.write-probe.$$" 2>/dev/null; then
rm -f "$cand/.write-probe.$$" 2>/dev/null
SQUASH_DIR="$cand"
echo "Selected SQUASH_DIR=$SQUASH_DIR (first writable candidate)"
break
else
echo " not writable: $cand"
fi
done
if [ -z "$SQUASH_DIR" ]; then
echo "Error: no writable squash dir candidate found on this cluster" >&2
exit 1
fi
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 The SQUASH_DIR probe at lines 103-119 selects the first head-node-writable candidate, but the list includes /home/slurm-shared/gharunners/squash as a final fallback — directly contradicting the comment block at lines 198-204 which states '/home/slurm-shared isn't cross-mounted to compute nodes' (the very motivation for this probe). If all Lustre/NFS candidates fail the writability test, the probe silently selects the /home path and downstream slurm jobs will fail with an obscure 'image not found' error rather than the explicit 'no writable squash dir found' exit at line 121. Drop the /home candidate or add a compute-side visibility check.

Extended reasoning...

What the bug is

The squash directory probe at runners/launch_gb200-nv.sh:103-119 iterates through six candidate paths and selects the first one that is writable from the launcher (head node). The probe uses mkdir -p + touch to verify writability — but writability from the head node is necessary but not sufficient for the actual requirement, which is that the squash file must be readable by compute nodes at job-launch time.

The probe list ends with /home/slurm-shared/gharunners/squash. This is contradicted by the same diff at lines 198-204 (the SRT_REPO_DIR block):

On the watchtower (Oracle) gb200 cluster, /home/slurm-shared isn't cross-mounted to compute nodes — slurm jobs cd'ing into WorkDir and opening StdOut on /home/... crash with ExitCode=1:0 in 0s and the log file never gets created.

The whole reason this probe exists is that /home/ paths fail on watchtower. Yet the probe keeps a /home/ candidate as the fallback.

How it manifests

Step-by-step proof:

  1. The launcher runs on the head node, which has access to /home/slurm-shared/gharunners/squash (it's the GitHub Actions runner's home FS).
  2. Assume a fresh or misconfigured cluster where the four Lustre candidates fail the probe — e.g., /mnt/lustre01/users/slurm-shared/squash doesn't exist and the user lacks mkdir perms on /mnt/lustre01/groups, and /nfs/slurm-shared/squash isn't present.
  3. The loop reaches /home/slurm-shared/gharunners/squash. mkdir -p succeeds (head-node /home is writable). touch succeeds. The loop selects it and breaks.
  4. SQUASH_FILE=/home/slurm-shared/gharunners/squash/...sqsh is set, and enroot import writes the squash there successfully (head-node-local).
  5. Slurm submits the job; compute nodes try to read SQUASH_FILE via pyxis. The path doesn't exist on compute. Job fails with 'image not found' or similar obscure error, far from the actual root cause.

The user then has to dig through compute-side logs, correlate with the squash-import success on the head node, and re-derive the cross-mount limitation already documented in the comment block 80 lines below.

Why existing code doesn't prevent it

The probe only does mkdir -p + touch from the head node (lines 112-113). There is no compute-side check — no srun --nodes=1 test -d "$cand" or equivalent. And the explicit failure exit at line 121 (no writable squash dir found) only triggers if every candidate fails head-node writability, so the /home fallback masks the very situation the probe was designed to surface.

Impact

Limited blast radius — this code only triggers for MODEL_PREFIX == minimaxm2.5 (gated at line 78), and in practice one of the four Lustre candidates is likely to succeed on watchtower. But the contradiction with the script's own stated design intent is real, and if it does trigger, the failure mode is exactly the obscure downstream error the probe was built to avoid.

How to fix it

Either:

  1. Drop the /home candidate — remove /home/slurm-shared/gharunners/squash from the probe list. The explicit exit 1 at line 121 will then correctly surface the 'no writable squash dir' condition.
  2. Add compute-side visibility check — after selecting a candidate, do srun --partition="$SLURM_PARTITION" --account="$SLURM_ACCOUNT" -N1 --time=1 test -d "$cand" (or similar) before committing to it, and skip on failure.

Option 1 is the simpler preemptive fix given that this is diagnostic / transitional code.

Comment thread perf-changelog.yaml Outdated
- minimaxm2.5-fp4-gb200-dynamo-vllm
description:
- "Add MiniMax-M2.5 NVFP4 GB200 disaggregated multinode vLLM benchmarks via Dynamo"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/TBD
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 The new minimaxm2.5-fp4-gb200-dynamo-vllm entry has pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/TBD, which is a placeholder that renders as a dead 404 link in the changelog. All other entries in this file use real PR numbers (1626, 1630, 1631, 1588, 1627, etc.). Please replace TBD with 1642 before merging.

Extended reasoning...

What the bug is

perf-changelog.yaml is a user-facing log mapping config-key additions/changes to the PR that introduced them. Each entry has a pr-link pointing to the GitHub PR. The new minimaxm2.5-fp4-gb200-dynamo-vllm entry added by this PR has:

- config-keys:
    - minimaxm2.5-fp4-gb200-dynamo-vllm
  description:
    - "Add MiniMax-M2.5 NVFP4 GB200 disaggregated multinode vLLM benchmarks via Dynamo"
  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/TBD

The TBD is a placeholder the author intended to replace once the PR was opened and given a number, but it was left as-is.

The convention is established

Every other entry in the surrounding context uses a real numeric PR id, e.g. lines 3373/3379/3385/3391/3397 reference 1626, 1630, 1631, 1588, 1627. None use a placeholder. The same convention holds throughout the file.

Impact

Clicking the link in the rendered changelog will hit https://github.com/SemiAnalysisAI/InferenceX/pull/TBD which is a 404. Anyone reading the changelog to find the originating PR for this minimax recipe is left without a working pointer to the discussion, review history, and context. It's purely a documentation/metadata defect — no runtime behaviour is affected — but the changelog is the primary discoverability surface for performance changes.

Fix

Replace TBD with 1642 (this PR's number):

  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1642

Step-by-step proof

  1. Open perf-changelog.yaml at line 3403 → see literal string pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/TBD.
  2. The URL https://github.com/SemiAnalysisAI/InferenceX/pull/TBD is not a valid GitHub PR URL (PR ids are numeric), so the GitHub web UI returns 404.
  3. The current PR's number (from preloaded metadata) is 1642, which is the value that should have been substituted.
  4. Compare to the immediately preceding entry in the diff: pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1627 — same key, real number. The pattern is unambiguous.

Severity rationale

Nit — it's a metadata/link typo rather than a code defect, easy to fix in a one-character change, but worth catching pre-merge because once merged the broken link ships to anyone consuming the changelog.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Jun 2, 2026

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Jun 2, 2026

@functionstackx
Copy link
Copy Markdown
Collaborator

/reuse-sweep-run

@functionstackx functionstackx merged commit 0fc4f86 into main Jun 2, 2026
41 checks passed
@functionstackx functionstackx deleted the nv/jasonli/minimaxm2.5-fp4-gb200-dynamo-vllm branch June 2, 2026 23:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

Development

Successfully merging this pull request may close these issues.

2 participants