Use official TRT-LLM image (1.3.0rc15.post1) for DSv4 B300 TRT (non-MTP + MTP)#1636
Oseltamivir wants to merge 12 commits into
…03e6 Bumps the TensorRT-LLM DeepSeek-V4-Pro image for dsv4-fp4-b200-trt and dsv4-fp4-b300-trt to ghcr.io/semianalysisai/trtllm-deepseek-v4:feat-deepseek_v4-2dd03e6. The -mtp variants are intentionally left on feat-deepseek_v4-9aa3715. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes are similar to the official vLLM recipes and/or the SGLang cookbook. If they are not, please create a PR first before we can merge your single-node PR into the master branch. Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work! Thank you.
PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. Failures are often just flakes, and re-running the failed jobs will fix them. If a re-run is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow As a rule of thumb, PR authors should request a review and get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers. If additional help is needed, PR authors can reach out to core maintainers over Slack.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26783090679
- config-keys:
    - dsv4-fp4-b200-trt
    - dsv4-fp4-b300-trt
  description:
    - "Update the TensorRT-LLM DeepSeek-V4-Pro image to ghcr.io/semianalysisai/trtllm-deepseek-v4:feat-deepseek_v4-2dd03e6"
  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXXX
🟡 The new perf-changelog.yaml entry leaves the pr-link as the unfilled template placeholder https://github.com/SemiAnalysisAI/InferenceX/pull/XXXX. It should be pull/1636 to match the actual PR number and the convention of every preceding entry, otherwise the pr-link is broken for both humans and any tooling that consumes the changelog.
Extended reasoning...
What the bug is. The diff appends a new entry to perf-changelog.yaml (lines 3358-3363) for the DSv4 TRT image bump on dsv4-fp4-b200-trt and dsv4-fp4-b300-trt. The last field of that entry is pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXXX. XXXX is the literal template placeholder — it was never filled in with the actual PR number.
Why this is a real issue. Every other recent entry in the same file follows the convention of using the real PR number — the five entries immediately above this one link to pull/1602, pull/1624, pull/1616, pull/1626, and pull/1630 respectively. The PR metadata for this change shows it is PR #1636, so the value should be https://github.com/SemiAnalysisAI/InferenceX/pull/1636. With XXXX left in place, the link does not resolve to any PR, breaking the traceability that the pr-link field exists to provide.
Impact. This does not affect the actual image bump or any sweep behavior — the runtime is unchanged. The damage is to the changelog's documentation/audit value: anyone trying to find the originating PR for these two config-key changes from the changelog hits a dead 404, and any tooling that parses pr-link (e.g., to cross-link sweep results back to PRs, generate release notes, or validate entries) will either fail or produce a broken link.
Why existing checks didn't prevent it. There appears to be no schema validation that rejects XXXX as a PR number; the YAML is valid syntactically and the link is a syntactically valid URL — it just points nowhere meaningful. The placeholder is the kind of thing only a reviewer or a numeric-PR-id linter would catch.
Fix. Replace the placeholder with the real PR number:
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1636
Step-by-step proof.
- Open perf-changelog.yaml at line 3363.
- Observe the literal line: pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXXX.
- Compare to the five entries immediately above (lines ending around 3332, 3338, 3344, 3350, 3356), which read pull/1602, pull/1624, pull/1616, pull/1626, pull/1630, all real PR numbers.
- Check the PR metadata in this review: PR number is 1636.
- Click (or curl) https://github.com/SemiAnalysisAI/InferenceX/pull/XXXX: it does not resolve to a PR. Click https://github.com/SemiAnalysisAI/InferenceX/pull/1636: it resolves to this PR. The placeholder thus makes the field useless for its stated purpose.
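The review above observes that only a reviewer or a numeric-PR-id linter would catch the placeholder. A minimal sketch of such a check follows, assuming the changelog keeps each pr-link field on its own line in the form quoted above. The function name `find_bad_pr_links`, the line-based scan (instead of a YAML parser), and the sample text are illustrative and are not taken from the repository.

```python
import re

# Accept only links that end in a numeric PR id, e.g. .../pull/1636.
PR_LINK_RE = re.compile(
    r"^https://github\.com/SemiAnalysisAI/InferenceX/pull/\d+$"
)


def find_bad_pr_links(text):
    """Return (line_number, value) for every pr-link that is not numeric."""
    bad = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        stripped = line.strip()
        if not stripped.startswith("pr-link:"):
            continue
        value = stripped[len("pr-link:"):].strip()
        if not PR_LINK_RE.match(value):
            bad.append((lineno, value))
    return bad


# Illustrative sample only; this is not the real file content.
sample = """\
- config-keys:
    - example-config-a
  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1630
- config-keys:
    - example-config-b
  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXXX
"""

for lineno, value in find_bad_pr_links(sample):
    print(f"line {lineno}: placeholder or malformed pr-link: {value}")
```

A check like this could run in CI before merge; whether the repository already has a hook where it would fit is not stated in this thread.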
dsv4-fp4-b200-trt:
-  image: ghcr.io/semianalysisai/trtllm-deepseek-v4:feat-deepseek_v4-9aa3715
+  image: ghcr.io/semianalysisai/trtllm-deepseek-v4:feat-deepseek_v4-2dd03e6
is there any official nvidia RC that works...
Image is from dsv4 branch: https://github.com/NVIDIA/TensorRT-LLM/tree/feat/deepseek_v4
Main dsv4 failing DPA: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/26786937394
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26783097365
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26786056973
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26786107993
… (non-MTP) Swap dsv4-fp4-b200-trt and dsv4-fp4-b300-trt from the custom ghcr.io semianalysis feat/deepseek_v4 build to the official nvcr.io/nvidia/tensorrt-llm/release:1.3.0rc15.post1 to test whether the official RC can serve DeepSeek-V4-Pro. The -mtp variants are unchanged. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26786937394
…on-MTP) The official nvcr.io tensorrt-llm/release:1.3.0rc15.post1 loads DSv4-Pro but its DP-attention path deadlocks/crashes under concurrent load (every dpa=true job hung or failed; only pure-TP conc-1 points passed). Revert to the stable custom build until upstream fixes DSv4 + attention-DP (NVIDIA/TensorRT-LLM#13431). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit 4bc5592.
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26803566770
Bump dsv4-fp4-b200-trt and dsv4-fp4-b300-trt to ghcr.io/semianalysisai/trtllm-deepseek-v4:fix-dsv4-swa-scratch-revert-shrink-c914d6d (TRT-LLM feat/deepseek_v4 @ 084cf2ba + kv_cache_manager_v2 fix). This resolves the engine crash on attention-DP context/generation reverts at high concurrency (the b300 8k1k conc>=512 "LLM is shutting down" hang). The -mtp variants stay on feat-deepseek_v4-9aa3715. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26811531104
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26811681728
…reuse The c914d6d image's kv_cache_manager_v2 patch was wrong: freeing SWA scratch slots on the attention-DP revert->resize(shrink) path hits finish_event=None (a deferred request never forwarded), crashing every dpa=true job and hanging the engine. Root cause is a V2-scheduler / SWA-scratch-reuse conflict: the V2 scheduler grows a context request's KV cache (incl. SWA scratch) before delay batching can defer it, so revert_allocate_context -> resize(shrink) must release scratch slots that have no finish_event. Revert both non-MTP images to feat-deepseek_v4-2dd03e6 and set TRTLLM_DSV4_ENABLE_SWA_SCRATCH_REUSE=0 in the launchers so no scratch slots are allocated and the revert shrinks cleanly. MTP configs untouched (9aa3715). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
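The launcher-side workaround in this commit amounts to setting one environment variable before the server process starts. A minimal sketch, assuming a Python wrapper prepares the environment: `launch_env` is a hypothetical helper, and only the variable name and the value 0 come from the commit message. Treating 1 as the enabled value is an assumption.

```python
import os


def launch_env(base_env=None, swa_scratch_reuse=False):
    """Return a copy of the environment with the DSv4 SWA scratch toggle set.

    The variable name and the value "0" come from the commit message;
    using "1" for the enabled case is an assumption.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["TRTLLM_DSV4_ENABLE_SWA_SCRATCH_REUSE"] = "1" if swa_scratch_reuse else "0"
    return env


env = launch_env({"PATH": "/usr/bin"})
print(env["TRTLLM_DSV4_ENABLE_SWA_SCRATCH_REUSE"])
```

The returned mapping can be passed as `env=` to `subprocess.run`; the actual server command used by the launchers is not shown in this thread.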
force-pushed from 1f70cac to e23a541
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26843313476
B200 reverts to feat-deepseek_v4-9aa3715: the 2dd03e6 image OOMs on B200's smaller HBM at conc-256 once SWA scratch reuse is disabled. Only B300 moves to 2dd03e6 + TRTLLM_DSV4_ENABLE_SWA_SCRATCH_REUSE=0 in its launcher. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26912996470
…TP + MTP) Point dsv4-fp4-b300-trt and dsv4-fp4-b300-trt-mtp at the official nvcr.io/nvidia/tensorrt-llm/release:1.3.0rc15.post1 (from the custom feat/deepseek_v4 builds 2dd03e6 / 9aa3715) and drop the TRTLLM_DSV4_ENABLE_SWA_SCRATCH_REUSE=0 launcher workaround so the official image runs with native behavior. B200 TRT unchanged (9aa3715). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26914210927
Points both B300 DSv4 TRT configs at the official NVIDIA release image and adds the MTP sibling to the sweep:
- dsv4-fp4-b300-trt (non-MTP): feat-deepseek_v4-2dd03e6 → nvcr.io/nvidia/tensorrt-llm/release:1.3.0rc15.post1
- dsv4-fp4-b300-trt-mtp (MTP): feat-deepseek_v4-9aa3715 → nvcr.io/nvidia/tensorrt-llm/release:1.3.0rc15.post1
This drops the custom ghcr.io semianalysis feat/deepseek_v4 builds in favor of the official RC, to evaluate whether the official image can serve DeepSeek-V4-Pro (non-MTP and MTP). The non-MTP launcher's TRTLLM_DSV4_ENABLE_SWA_SCRATCH_REUSE=0 workaround (specific to the custom build) is removed so the official image runs with its native behavior, matching the MTP launcher, which never had it.
Known risk
A prior run of 1.3.0rc15.post1 with attention-DP (dpa=true) served a couple of iterations and then crashed with CUDA_ERROR_ILLEGAL_ADDRESS in kv_cache_manager.free_resources (run 26786937394), a different failure from the custom build's SWA-scratch-revert crash. So dpa=true jobs may still fail on the official image; the pure-TP (dpa=false) cases are more likely to pass. MTP on the official RC is untested. This sweep is what tells us where it stands.
Scope
B200 TRT is unchanged (stays on feat-deepseek_v4-9aa3715); its OOM follow-up is tracked separately.
🤖 Generated with Claude Code
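The end state described in this PR description can be summarized as a small lookup that separates configs on the official release image from those still on a custom build. This is only a sketch: the dictionary restates the three config keys and image references named in this thread, while `OFFICIAL_PREFIX`, `is_official`, and `split_by_source` are hypothetical names that do not exist in the repository.

```python
# Prefix of the official NVIDIA release image named in the PR description.
OFFICIAL_PREFIX = "nvcr.io/nvidia/tensorrt-llm/release:"

# Image per config key, as stated in this PR (not read from the real config file).
IMAGES = {
    "dsv4-fp4-b300-trt": "nvcr.io/nvidia/tensorrt-llm/release:1.3.0rc15.post1",
    "dsv4-fp4-b300-trt-mtp": "nvcr.io/nvidia/tensorrt-llm/release:1.3.0rc15.post1",
    "dsv4-fp4-b200-trt": "ghcr.io/semianalysisai/trtllm-deepseek-v4:feat-deepseek_v4-9aa3715",
}


def is_official(image):
    """True when the image reference points at the official release repository."""
    return image.startswith(OFFICIAL_PREFIX)


def split_by_source(images):
    """Return (official_keys, custom_keys), each sorted by config key."""
    official = sorted(k for k, v in images.items() if is_official(v))
    custom = sorted(k for k, v in images.items() if not is_official(v))
    return official, custom


official, custom = split_by_source(IMAGES)
print("official:", official)
print("custom:", custom)
```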
Note
Medium Risk
Benchmark-only image swap for a large model on TRT-LLM; official RC may still hit known CUDA failures with dp-attn, affecting sweep stability rather than production services.
Overview
Switches B300 DeepSeek-V4-Pro FP4 TensorRT-LLM benchmark configs dsv4-fp4-b300-trt and dsv4-fp4-b300-trt-mtp from custom ghcr.io/semianalysisai/trtllm-deepseek-v4 feature builds to the official NVIDIA image nvcr.io/nvidia/tensorrt-llm/release:1.3.0rc15.post1, so sweeps evaluate the official RC for both standard and MTP runs.
Documents the change in perf-changelog.yaml (including dropping the custom-build-only TRTLLM_DSV4_ENABLE_SWA_SCRATCH_REUSE=0 launcher workaround for the non-MTP path). B200 TRT configs are not changed in this diff.
Reviewed by Cursor Bugbot for commit ad529fb. Bugbot is set up for automated code reviews on this repo.