
[NVRTC] Add NVSHMEM support to NVRTC compilation path#18681

Merged
spectrometerHBH merged 5 commits into apache:main from Kathryn-cat:nvrtc-nvshmem-ptr on Jan 24, 2026

Conversation

@Kathryn-cat (Contributor) commented on Jan 23, 2026

This PR adds NVSHMEM support to the NVRTC path in python/tvm/contrib/nvcc.py.

This is implemented as a separate compile stage and link stage for NVSHMEM programs. Tested locally via tests/python/disco/test_nvshmem.py.

Results show compilation is about 10-35% faster than the NVCC path; kernel performance is unchanged.

@gemini-code-assist (Contributor) commented:

Summary of Changes

Hello @Kathryn-cat, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances TVM's CUDA compilation capabilities by integrating NVSHMEM support into the NVRTC backend. This allows users to compile and run CUDA kernels that leverage NVSHMEM features using the NVRTC compiler, offering an alternative to the traditional NVCC path. The changes involve adapting the NVRTC compilation process to include necessary linking steps, handle specific CUDA header requirements, and manage different binary output formats, ensuring seamless operation for distributed GPU programming with NVSHMEM.

Highlights

  • NVRTC NVSHMEM Support: Enabled NVSHMEM compilation and linking within the NVRTC (NVIDIA Runtime Compilation) path, which was previously only supported by NVCC.
  • Compilation and Linking Stages: Implemented distinct compilation and linking stages for NVSHMEM programs using NVRTC, including handling cubin output and leveraging the CUDA driver API for linking.
  • CUDA Header Compatibility: Addressed compatibility issues by mapping cuda::std type traits to the std namespace and including CCCL paths for NVSHMEM headers, ensuring proper compilation.
  • Flexible Binary Loading: Updated the external kernel builder to handle both PTX and cubin binary formats, allowing for NVSHMEM-specific compilation outputs.
  • New NVSHMEM Kernel Tests: Introduced new tests to verify the correct compilation and execution of NVSHMEM kernels using both NVCC and NVRTC compilation modes.
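The PTX-vs-cubin handling described in the highlights above can be sketched with a simple format probe. `classify_cuda_binary` is a hypothetical helper (not from the PR) illustrating how a loader could tell the two formats apart: a cubin is an ELF object and carries the ELF magic bytes, while NVRTC-emitted PTX is plain text starting with comment lines and directives.

```python
def classify_cuda_binary(data: bytes) -> str:
    """Guess whether a compiler output blob is a cubin or PTX.

    A cubin is an ELF object, so it starts with the 4-byte ELF magic;
    PTX is ASCII text (comment lines and directives such as `.version`).
    """
    if data[:4] == b"\x7fELF":
        return "cubin"
    text = data.decode("utf-8", errors="ignore").lstrip()
    if text.startswith("//") or text.startswith(".version"):
        return "ptx"
    return "unknown"


# A loader can then branch on the detected format instead of assuming PTX:
print(classify_cuda_binary(b"\x7fELF\x02\x01\x01" + b"\x00" * 9))          # cubin
print(classify_cuda_binary(b"//\n// Generated by NVRTC\n.version 8.3\n"))  # ptx
```

This mirrors the kind of dispatch the "Flexible Binary Loading" change needs: the NVSHMEM path produces a linked cubin, while the plain NVRTC path produces PTX, and the external kernel builder must accept both.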


@gemini-code-assist (Bot) left a comment


Code Review

This pull request introduces NVSHMEM support for the NVRTC compilation path, which is a valuable addition. The implementation is well-structured, including necessary workarounds for NVRTC and comprehensive end-to-end tests for both nvcc and nvrtc paths. My main concern is a potential resource leak related to CUDA context management, which I've detailed in a specific comment.

Comment thread: python/tvm/contrib/nvcc.py (Outdated), lines +525 to +533
# Check if there's already a CUDA context; create one if not
result, context = cu.cuCtxGetCurrent()
if result != cu.CUresult.CUDA_SUCCESS or context is None or int(context) == 0:
    result, device = cu.cuDeviceGet(0)
    if result != cu.CUresult.CUDA_SUCCESS:
        raise RuntimeError(f"Failed to get CUDA device: {result}")
    result, context = cu.cuCtxCreate(None, 0, device)
    if result != cu.CUresult.CUDA_SUCCESS:
        raise RuntimeError(f"Failed to create CUDA context: {result}")

Severity: high

The CUDA context created here if one doesn't already exist is not destroyed. This can lead to a resource leak, as CUDA contexts hold significant GPU resources. It would be safer to ensure that if a context is created within this function, it is also destroyed before the function returns.

A try...finally block should be used to manage the context's lifecycle:

context_created = False
context = None
try:
    # get or create context
    # ...
    # linking logic
    # ...
finally:
    if context_created and context:
        cu.cuCtxDestroy(context)

This would ensure that any context created specifically for this compilation is properly cleaned up, preventing resource leaks. Additionally, the condition context is None or int(context) == 0 can be simplified to not context since the cuda-python CUcontext object has a __bool__ method that handles this check.
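The get-or-create-then-clean-up ownership pattern the review asks for can be shown generically. The sketch below uses stand-in `get_current`/`create`/`destroy` callables rather than the real cuda-python API, purely to illustrate that only a context this code created gets destroyed:

```python
import contextlib


@contextlib.contextmanager
def ensure_context(get_current, create, destroy):
    """Yield an existing context if there is one; otherwise create a
    temporary one and destroy it on exit. The three arguments are
    stand-ins for cuCtxGetCurrent / cuCtxCreate / cuCtxDestroy."""
    context = get_current()
    created = False
    if not context:  # mirrors the `not context` simplification above
        context = create()
        created = True
    try:
        yield context
    finally:
        # Only tear down a context this function itself created.
        if created and context:
            destroy(context)


# No pre-existing context: a temporary one is created, then destroyed.
destroyed = []
with ensure_context(lambda: None, lambda: "ctx", destroyed.append) as ctx:
    assert ctx == "ctx"
assert destroyed == ["ctx"]

# A context already exists: it is reused and never destroyed.
destroyed = []
with ensure_context(lambda: "existing", lambda: "ctx", destroyed.append) as ctx:
    assert ctx == "existing"
assert destroyed == []
```

The `finally` clause guarantees cleanup even if the linking logic inside the `with` body raises, which is exactly the leak the reviewer flagged.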

@Kathryn-cat (Author) replied: addressed

@Kathryn-cat (Author) commented:

@tvm-bot rerun

Comment thread: python/tvm/contrib/nvcc.py (Outdated)
return bytearray(binary_buf)
# link stage for NVSHMEM
if use_nvshmem:
    import ctypes  # pylint: disable=import-outside-toplevel
A reviewer (Member) commented: consider moving this into a separate function

@Kathryn-cat (Author) replied: addressed

@Kathryn-cat (Author) commented:

@tvm-bot re-run

1 similar comment from @Kathryn-cat

@spectrometerHBH spectrometerHBH merged commit 2004a8b into apache:main Jan 24, 2026
13 checks passed

Labels: None yet
Projects: None yet
3 participants