Skip to content

[https://nvbugs/6111076][fix] ulysses+sage#13440

Merged
zhenhuaw-me merged 1 commit intoNVIDIA:mainfrom
xrq-phys:ruqingx/visual_gen/sage_attn
Apr 30, 2026
Merged

[https://nvbugs/6111076][fix] ulysses+sage#13440
zhenhuaw-me merged 1 commit intoNVIDIA:mainfrom
xrq-phys:ruqingx/visual_gen/sage_attn

Conversation

@xrq-phys
Copy link
Copy Markdown
Collaborator

@xrq-phys xrq-phys commented Apr 24, 2026

Summary by CodeRabbit

  • Tests
    • Enabled previously skipped multi-GPU attention test cases
    • Updated tests to initialize and properly supply attention metadata state during test execution

Description

Fix ulysses+sage test due to metadata mis-configuration. The failure slipped by while working on #12937 .

Will undo #13432

Test Coverage

ulysses+sage tests

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

@xrq-phys xrq-phys requested a review from a team as a code owner April 24, 2026 15:46
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Apr 24, 2026

📝 Walkthrough

Walkthrough

These changes update Sage Attention multi-GPU tests by removing skip directives from the test waiver list and updating test code to initialize and supply attention metadata state objects to the attention module during test execution.

Changes

Cohort / File(s) Summary
Test Configuration
tests/integration/test_lists/waives.txt
Removed 5 SKIP directives for Ulysses Sage Attention multi-GPU test cases (test_sage_ulysses_forward and test_sage_ulysses_vs_reference variants), enabling previously skipped tests to run.
Test Implementation
tests/unittest/_torch/visual_gen/multi_gpu/test_ulysses_sage_attention.py
Added initialization of attention metadata state using create_attention_metadata_state() stored in thread-local storage; both test paths now pass this metadata state to TrtllmAttention during construction.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately reflects the main change: fixing a ulysses+sage test configuration issue by adding metadata state initialization.
Description check ✅ Passed The description includes key sections (Description, Test Coverage, Checklist), clearly explains the issue and references related PRs, though it relies on an AI summary that hasn't been fully expanded.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
tests/unittest/_torch/visual_gen/multi_gpu/test_ulysses_sage_attention.py (1)

230-271: QA list update is unnecessary for this PR scope.

This change is limited to tests/unittest/... and does not add/alter integration test definitions.

As per coding guidelines "If the PR only touches unittest/ or narrow unit scope, say explicitly whether QA list updates are unnecessary or optional."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unittest/_torch/visual_gen/multi_gpu/test_ulysses_sage_attention.py`
around lines 230 - 271, The PR touches only unit tests
(tests/unittest/_torch/visual_gen/multi_gpu/test_ulysses_sage_attention.py) so
explicitly state in the PR description that QA list updates are unnecessary;
update the PR body or the commit message to include a short note like "QA list
update unnecessary — changes limited to tests/unittest/ only" and ensure any PR
template checkbox about QA lists is marked or commented accordingly so reviewers
know no integration-test or release QA changes are required.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@tests/unittest/_torch/visual_gen/multi_gpu/test_ulysses_sage_attention.py`:
- Around line 230-271: The PR touches only unit tests
(tests/unittest/_torch/visual_gen/multi_gpu/test_ulysses_sage_attention.py) so
explicitly state in the PR description that QA list updates are unnecessary;
update the PR body or the commit message to include a short note like "QA list
update unnecessary — changes limited to tests/unittest/ only" and ensure any PR
template checkbox about QA lists is marked or commented accordingly so reviewers
know no integration-test or release QA changes are required.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 37138d8c-9109-46d5-aebb-046ac320d406

📥 Commits

Reviewing files that changed from the base of the PR and between 5fd0a6c and 963204c.

📒 Files selected for processing (2)
  • tests/integration/test_lists/waives.txt
  • tests/unittest/_torch/visual_gen/multi_gpu/test_ulysses_sage_attention.py
💤 Files with no reviewable changes (1)
  • tests/integration/test_lists/waives.txt

@xrq-phys
Copy link
Copy Markdown
Collaborator Author

/bot run --add-multi-gpu-test

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #45412 [ run ] triggered by Bot. Commit: 963204c Link to invocation

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #45412 [ run ] completed with state FAILURE. Commit: 963204c
/LLM/main/L0_MergeRequest_PR pipeline #35649 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@xrq-phys xrq-phys force-pushed the ruqingx/visual_gen/sage_attn branch from 963204c to 6343c0a Compare April 24, 2026 17:11
@xrq-phys
Copy link
Copy Markdown
Collaborator Author

/bot run --add-multi-gpu-test

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #45424 [ run ] triggered by Bot. Commit: 6343c0a Link to invocation

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #45424 [ run ] completed with state SUCCESS. Commit: 6343c0a
/LLM/main/L0_MergeRequest_PR pipeline #35658 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@chang-l
Copy link
Copy Markdown
Collaborator

chang-l commented Apr 24, 2026

/bot run --add-multi-gpu-test

@chang-l chang-l enabled auto-merge (squash) April 24, 2026 21:37
@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #45449 [ run ] triggered by Bot. Commit: 6343c0a Link to invocation

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #45449 [ run ] completed with state SUCCESS. Commit: 6343c0a
/LLM/main/L0_MergeRequest_PR pipeline #35681 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@xrq-phys
Copy link
Copy Markdown
Collaborator Author

/bot run --add-multi-gpu-test

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #45476 [ run ] triggered by Bot. Commit: 6343c0a Link to invocation

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #45476 [ run ] completed with state FAILURE. Commit: 6343c0a
/LLM/main/L0_MergeRequest_PR pipeline #35707 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@xrq-phys
Copy link
Copy Markdown
Collaborator Author

/bot run --add-multi-gpu-test --disable-fail-fast

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #45489 [ run ] triggered by Bot. Commit: 6343c0a Link to invocation

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #45489 [ run ] completed with state SUCCESS. Commit: 6343c0a
/LLM/main/L0_MergeRequest_PR pipeline #35717 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@chang-l
Copy link
Copy Markdown
Collaborator

chang-l commented Apr 26, 2026

/bot run --add-multi-gpu-test --disable-fail-fast

@zhenhuaw-me
Copy link
Copy Markdown
Member

/bot run --add-multi-gpu-test

@zhenhuaw-me zhenhuaw-me requested a review from o-stoner April 27, 2026 02:44
auto-merge was automatically disabled April 27, 2026 06:16

Head branch was pushed to by a user without write access

@xrq-phys xrq-phys force-pushed the ruqingx/visual_gen/sage_attn branch from 6343c0a to c4faee6 Compare April 27, 2026 06:16
@xrq-phys
Copy link
Copy Markdown
Collaborator Author

/bot kill

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #45669 [ kill ] triggered by Bot. Commit: c4faee6 Link to invocation

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #45669 [ kill ] completed with state SUCCESS. Commit: c4faee6
Successfully killed previous jobs for commit c4faee6

Link to invocation

Signed-off-by: Ruqing Xu <7891482+xrq-phys@users.noreply.github.com>
@xrq-phys xrq-phys force-pushed the ruqingx/visual_gen/sage_attn branch from c4faee6 to 8bb5c3e Compare April 27, 2026 12:44
@xrq-phys
Copy link
Copy Markdown
Collaborator Author

/bot run --add-multi-gpu-test --disable-fail-fast

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #45741 [ run ] triggered by Bot. Commit: 8bb5c3e Link to invocation

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #45741 [ run ] completed with state FAILURE. Commit: 8bb5c3e
/LLM/main/L0_MergeRequest_PR pipeline #35936 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@xrq-phys
Copy link
Copy Markdown
Collaborator Author

/bot run --add-multi-gpu-test

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #45963 [ run ] triggered by Bot. Commit: 8bb5c3e Link to invocation

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #45963 [ run ] completed with state SUCCESS. Commit: 8bb5c3e
/LLM/main/L0_MergeRequest_PR pipeline #36117 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@xrq-phys
Copy link
Copy Markdown
Collaborator Author

/bot run --add-multi-gpu-test

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #46047 [ run ] triggered by Bot. Commit: 8bb5c3e Link to invocation

@xrq-phys
Copy link
Copy Markdown
Collaborator Author

/bot skip --comment "All related tests (single and multi-gpu) have passed. GB multi-gpu can't pass due to node availability (c.f. status=PENDING log items)."

@xrq-phys
Copy link
Copy Markdown
Collaborator Author

@zhenhuaw-me we can merge this item now.

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #46181 Bot args parsing error: Failed to parse bot args

Link to invocation

@xrq-phys
Copy link
Copy Markdown
Collaborator Author

/bot skip --comment "All related tests, single and multi-gpu, have passed. GB multi-gpu cannot pass due to node availability. c.f. status PENDING log items."

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #46195 [ skip ] triggered by Bot. Commit: 8bb5c3e Link to invocation

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #46195 [ skip ] completed with state SUCCESS. Commit: 8bb5c3e
Skipping testing for commit 8bb5c3e

Link to invocation

@xrq-phys
Copy link
Copy Markdown
Collaborator Author

PR_Github #46047 has all tests passing except for SBSA multi-GPU cases. SBSA multi-GPU tests are failing / cancelled because they couldn't get nodes allocated:

[2026-04-29T07:17:06.520Z] ++ sacct -j ******* --format=State -Pn --allocations
[2026-04-29T07:17:06.520Z] + STATUS=PENDING
[2026-04-29T07:17:06.520Z] + [[ -z PENDING ]]
[2026-04-29T07:17:06.520Z] + [[ PENDING == \R\U\N\N\I\N\G ]]
[2026-04-29T07:17:06.520Z] + [[ PENDING == \P\E\N\D\I\N\G ]]
[2026-04-29T07:17:06.520Z] Slurm job ******* is still running
[2026-04-29T07:17:06.520Z] + echo 'Slurm job ******* is still running'
[2026-04-29T07:17:06.520Z] + sleep 300
...
[2026-04-29T17:53:28.510Z] Slurm job ******* is still running
[2026-04-29T17:53:28.510Z] + STATUS=PENDING # <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
[2026-04-29T17:53:28.510Z] + [[ -z PENDING ]]
[2026-04-29T17:53:28.510Z] + [[ PENDING == \R\U\N\N\I\N\G ]]
[2026-04-29T17:53:28.510Z] + [[ PENDING == \P\E\N\D\I\N\G ]]
[2026-04-29T17:53:28.510Z] + echo 'Slurm job ******* is still running'
[2026-04-29T17:53:28.510Z] + sleep 300

@zhenhuaw-me zhenhuaw-me merged commit 2bc8f7f into NVIDIA:main Apr 30, 2026
5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants