[Feature] Add cubin launcher utility as an extra header #283
yaoyaoding merged 35 commits into apache:main from
Conversation
Summary of Changes (Gemini Code Assist): This pull request introduces a significant new feature by adding a CUBIN launcher utility to the TVM-FFI framework. The utility simplifies the process of loading and executing pre-compiled CUDA kernels, offering greater flexibility for advanced GPU programming. It enables developers to dynamically load CUBIN files or embed them directly into libraries, facilitating integration with external GPU code-generation tools and custom kernel-execution workflows.
Code Review
This PR introduces a useful header-only utility for loading and launching CUBINs, along with comprehensive examples. The code is well-structured, particularly the cubin_launcher.h header which correctly uses RAII for resource management. I've found a critical resource leak in one of the C++ examples and a few typos in filenames and documentation. My review includes suggestions to fix the leak, improve thread-safety, and correct the typos.
The Triton kernel:
# Define the kernel dynamically
@triton.jit
def square_kernel(X_ptr, Y_ptr, n, BLOCK: tl.constexpr = 1024): # noqa
pid = tl.program_id(0)
start = pid * BLOCK
offsets = start + tl.arange(0, BLOCK)
mask = offsets < n
x = tl.load(X_ptr + offsets, mask=mask, other=0.0)
y = x * x
tl.store(Y_ptr + offsets, y, mask=mask)
# Trigger kernel compilation by doing a dummy call
x_dummy = torch.ones(1024, dtype=torch.float32, device="cuda")
y_dummy = torch.empty(1024, dtype=torch.float32, device="cuda")
square_kernel[1, 1](x_dummy, y_dummy, 1024)
In the generated PTX, the two extra parameters are not used; they may support features that this simple kernel does not exercise.
// External symbols for embedded CUBIN data (linked via objcopy)
extern "C" const char __cubin_data[];
extern "C" const char __cubin_data_end[];
One thing to consider is how to avoid symbol conflicts when we link multiple such embedded files.
Here is what Gemini suggests, done in the following steps. Maybe we can provide a tool via Python, e.g. tvm.ffi.cpp.embed_cubin(output_object, input_object, cubin, key="env"), to do that. We can also provide a CMake macro for those who prefer the tvm-ffi CMake integration.
- Step 1: Compile the C++ Source (source.o) Compile your C++ code normally. Ensure you declare the symbols as extern so the compiler creates an "undefined reference" (a hole to be filled later). g++ -c source.cc -o source.o
- Step 2: Convert Binary to Object (blob_raw.o) Use objcopy to wrap the raw binary file into a linkable object file. This creates Global symbols by default. objcopy -I binary -O elf64-x86-64 kernel.cubin blob_raw.o
- Step 3: Rename Symbols (blob_renamed.o) Change the auto-generated names (e.g., _binary_kernel_start) to the specific names your C++ code expects (tvm_ffi...). objcopy --redefine-sym old_name=new_name blob_raw.o blob_renamed.o
- Step 4: Partial Link / Merge (merged.o) Use ld -r to fuse the code (source.o) and the data (blob_renamed.o) together. This resolves the "undefined reference." ld -r source.o blob_renamed.o -o merged.o
- Step 5: Localize Symbols (final.o) Crucial Last Step: Now that the code and data are in the same file, use objcopy to change the symbols from Global to Local. This hides them from the outside world (Internal Linkage). objcopy --localize-symbol=__tvm_ffi__cubin_data merged.o final.o
# ==========================================
# Configuration
# ==========================================
# Files
BINARY_FILE := kernel.cubin
SOURCE_FILE := source.cc
OUTPUT_OBJ := final_module.o
# The symbol names your C++ code uses (extern "C")
SYM_NAME := __tvm_ffi__cubin_data
SYM_NAME_END := __tvm_ffi__cubin_data_end
# Compiler settings
CXX := g++
CXXFLAGS := -O2 -Wall -fPIC
LD := ld
OBJCOPY := objcopy
# ------------------------------------------
# Internal Calculation for objcopy default names
# objcopy converts "kernel.cubin" -> "_binary_kernel_cubin_start"
# We replace dots and slashes with underscores to match objcopy's behavior.
# ------------------------------------------
BINARY_FLAT := $(subst /,_,$(subst .,_,$(BINARY_FILE)))
DEFAULT_START := _binary_$(BINARY_FLAT)_start
DEFAULT_END := _binary_$(BINARY_FLAT)_end
DEFAULT_SIZE := _binary_$(BINARY_FLAT)_size
# ==========================================
# Rules
# ==========================================
.PHONY: all clean check
all: $(OUTPUT_OBJ)
# 1. Compile the C++ source into an object file.
# (Contains undefined references to the symbols)
source.o: $(SOURCE_FILE)
@echo "[1/5] Compiling Source..."
$(CXX) $(CXXFLAGS) -c $< -o $@
# 2. Convert the raw binary into an ELF object file.
# (Symbols are Global and named _binary_kernel_cubin_start)
blob_raw.o: $(BINARY_FILE)
@echo "[2/5] Converting Binary to Object..."
$(OBJCOPY) -I binary -O elf64-x86-64 $< $@
# 3. Rename the symbols to match your C++ declaration.
# (Still Global, but names match __tvm_ffi__...)
blob_renamed.o: blob_raw.o
@echo "[3/5] Renaming Symbols..."
$(OBJCOPY) \
--redefine-sym $(DEFAULT_START)=$(SYM_NAME) \
--redefine-sym $(DEFAULT_END)=$(SYM_NAME_END) \
--strip-symbol=$(DEFAULT_SIZE) \
$< $@
# 4. Partial Link (Merge).
# (Fuses source.o and blob.o. Code can now see Data.)
merged.o: source.o blob_renamed.o
@echo "[4/5] Linking (Partial Merge)..."
$(LD) -r source.o blob_renamed.o -o $@
# 5. Localize Symbols.
# (Hides the symbols from the outside world. Global D -> Local d)
$(OUTPUT_OBJ): merged.o
@echo "[5/5] Finalizing: Hiding Symbols..."
$(OBJCOPY) \
--localize-symbol=$(SYM_NAME) \
--localize-symbol=$(SYM_NAME_END) \
$< $@
@echo "Success! Created $(OUTPUT_OBJ)"
# ==========================================
# Utilities
# ==========================================
# Helper to prove the symbols are local
check: $(OUTPUT_OBJ)
@echo "Checking symbol visibility in $(OUTPUT_OBJ)..."
@echo "Look for lowercase 'd' (local data) or 'r' (local read-only):"
@nm $(OUTPUT_OBJ) | grep __tvm_ffi__
clean:
rm -f *.o
Claude Sonnet suggests slightly different steps. See the documentation of the macro TVM_FFI_EMBED_CUBIN.
I've added a Python script to embed a cubin into an object file:
usage: python -m tvm_ffi.utils.embed_cubin [-h] --output-obj PATH --input-obj PATH --cubin PATH --name NAME [-v]
Embed CUBIN data into existing object files that use TVM_FFI_EMBED_CUBIN macro
options:
-h, --help show this help message and exit
--output-obj PATH Output object file path (e.g., new.o)
--input-obj PATH Input object file path containing TVM_FFI_EMBED_CUBIN usage (e.g., old.o)
--cubin PATH Input CUBIN file path (e.g., kernel.cubin)
--name NAME Name used in TVM_FFI_EMBED_CUBIN macro (e.g., my_kernels)
-v, --verbose Print detailed command output
Examples:
# Basic usage
python -m tvm_ffi.utils.embed_cubin \
--output-obj new.o \
--input-obj old.o \
--cubin kernel.cubin \
--name my_kernels
# With verbose output
python -m tvm_ffi.utils.embed_cubin \
--output-obj new.o \
--input-obj old.o \
--cubin kernel.cubin \
--name my_kernels \
--verbose
Workflow:
1. Compile C++ code that uses TVM_FFI_EMBED_CUBIN to create old.o
2. Compile CUDA kernel to CUBIN (e.g., using nvcc or NVRTC)
3. Use this tool to merge them into new.o
4. Link new.o into your final shared library
Usage in C++ code (source compiled to old.o):
TVM_FFI_EMBED_CUBIN(my_kernels);
auto kernel = TVM_FFI_EMBED_CUBIN_GET_KERNEL(my_kernels, "kernel_name");
Requirements:
- GNU binutils (ld and objcopy) must be available in PATH
- Linux/Unix platform (Windows uses different embedding mechanisms)
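For reference, here is the workflow above as concrete shell commands; the file names and the sm_80 architecture are illustrative assumptions, not part of the tool:
# 1. Compile C++ code that uses TVM_FFI_EMBED_CUBIN
g++ -O2 -fPIC -c my_module.cc -o old.o
# 2. Compile the CUDA kernel to a CUBIN
nvcc -cubin -arch=sm_80 kernel.cu -o kernel.cubin
# 3. Merge the CUBIN into the object file
python -m tvm_ffi.utils.embed_cubin --output-obj new.o --input-obj old.o \
    --cubin kernel.cubin --name my_kernels
# 4. Link the merged object into the final shared library
g++ -shared new.o -o libmy_module.so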
For CMake, we provide two utility functions in cmake/Utils/EmbedCubin.cmake:
tvm_ffi_generate_cubin(
OUTPUT <output_cubin_file>
SOURCE <cuda_source_file>
[ARCH <architecture>]
[OPTIONS <extra_nvcc_options>...]
[DEPENDS <additional_dependencies>...]
)
tvm_ffi_embed_cubin(
OUTPUT <output_object_file>
SOURCE <source_file>
CUBIN <cubin_file>
NAME <symbol_name>
[DEPENDS <additional_dependencies>...]
)
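A minimal usage sketch combining the two functions; the paths, the my_kernels name, and the ARCH value are illustrative assumptions:
# Compile kernel.cu into a CUBIN for an assumed architecture
tvm_ffi_generate_cubin(
  OUTPUT ${CMAKE_CURRENT_BINARY_DIR}/kernel.cubin
  SOURCE ${CMAKE_CURRENT_SOURCE_DIR}/kernel.cu
  ARCH 80
)
# Embed the CUBIN into the object built from my_module.cc, which
# is assumed to contain TVM_FFI_EMBED_CUBIN(my_kernels) in its source
tvm_ffi_embed_cubin(
  OUTPUT ${CMAKE_CURRENT_BINARY_DIR}/my_module_embedded.o
  SOURCE ${CMAKE_CURRENT_SOURCE_DIR}/my_module.cc
  CUBIN ${CMAKE_CURRENT_BINARY_DIR}/kernel.cubin
  NAME my_kernels
)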
API-wise, I think we can streamline it a bit further:
cpp_source = """
#include <tvm/ffi/extra/cuda/cubin_launcher.h>
TVM_FFI_EMBED_CUBIN(env);
void AddTwo(TensorView a, TensorView b) {
static ffi::CubinKernel kernel = TVM_FFI_EMBED_CUBIN_GET_KERNEL(env, "add_two");
kernel.launch(...);
}
"""
cubin: bytes = compile_cubin_from_nvrtc(cuda_source)
cubin: bytes = compile_cubin_from_triton()
tvm_ffi.cpp.load_inline(cpp_source, embed_cubin={"env": cubin})
The last two args of the Triton kernel are scratch memory; they need to be allocated using the sizes in the kernel metadata. See https://github.com/triton-lang/triton/blob/main/third_party/nvidia/backend/driver.py#L700-L714
Cool, thanks for the information!
Hi @tqchen @junrushao @Ubospica, the PR is ready for review. Could you take a pass?
tqchen left a comment
looking great, some final comments
for (int device_id = 0; device_id < device_count; ++device_id) {
  // Query device's maximum shared memory per block
  cudaDeviceProp prop;
  err = cudaGetDeviceProperties(&prop, device_id);
use cudaDeviceGetAttribute (https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__DEVICE.html#group__CUDART__DEVICE_1gb22e8256592b836df9a9cc36c9db7151) instead, via cudaDevAttrMaxSharedMemoryPerBlockOptin
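A minimal sketch of the suggested replacement inside the same loop (error handling elided; device_id as in the quoted code):
// Query the opt-in maximum shared memory per block directly,
// avoiding the full cudaGetDeviceProperties call.
int max_smem_optin = 0;
err = cudaDeviceGetAttribute(
    &max_smem_optin, cudaDevAttrMaxSharedMemoryPerBlockOptin, device_id);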
// Set max dynamic shared memory for all devices during initialization
// This allows the kernel to use maximum available shared memory when needed
int device_count = 0;
Move this into a function kernel.SetMaxDynamicSharedMemory(size_t static_mem_size, int64_t dynamic_smem_max = -1); consider making it private and a friend of CubinModule,
where -1 deduces the maximum from the device's maximum value minus static_mem_size.
Add CubinModule.GetKernelWithMaxDynamicSharedMemory(name, static_mem_size, dynamic_smem_max);
this is an advanced mode, since not all kernels need it.
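A minimal sketch of what such a helper could look like; the func_ member, the CUDA_CALL/CUDA_DRIVER_CALL error-checking macros, and the assumption that the kernel wraps a driver-API CUfunction handle are all illustrative, not the PR's actual code:
// Hypothetical sketch of the proposed API.
void SetMaxDynamicSharedMemory(size_t static_mem_size,
                               int64_t dynamic_smem_max = -1) {
  if (dynamic_smem_max < 0) {
    // Deduce the maximum: device opt-in limit minus the static usage.
    int device = 0, max_optin = 0;
    CUDA_CALL(cudaGetDevice(&device));
    CUDA_CALL(cudaDeviceGetAttribute(
        &max_optin, cudaDevAttrMaxSharedMemoryPerBlockOptin, device));
    dynamic_smem_max = static_cast<int64_t>(max_optin) -
                       static_cast<int64_t>(static_mem_size);
  }
  CUDA_DRIVER_CALL(cuFuncSetAttribute(
      func_, CU_FUNC_ATTRIBUTE_MAX_DYNAMIC_SHARED_SIZE_BYTES,
      static_cast<int>(dynamic_smem_max)));
}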
/gemini review
Code Review
This is an excellent pull request that introduces a powerful cubin_launcher utility. The feature is well-designed with a clean C++ API, comprehensive build system integration via CMake and Python utilities, and thorough documentation and examples. The code quality is high, and the addition of NVRTC and Triton integration examples is particularly valuable. I have a couple of minor suggestions for improving the documentation clarity, but overall this is a fantastic contribution.
/gemini review
Code Review
This PR introduces a powerful CUBIN launcher utility, enabling efficient execution of pre-compiled CUDA kernels. The changes are comprehensive, including a new C++ header, CMake and Python utilities for embedding CUBINs, NVRTC integration for runtime compilation, and extensive examples and documentation. The implementation is well-structured and the new features are a great addition. I've found a couple of minor issues: one correctness issue in the C++ header and a potential caching issue in the Python build extension. After addressing these, this PR will be in excellent shape.
/gemini review
Code Review
This pull request introduces a powerful new feature for launching pre-compiled CUDA kernels from CUBIN files. The implementation is comprehensive, including a C++ header-only library, CMake and Python build system integrations, runtime compilation utilities via NVRTC, and extensive examples and documentation. The code is well-structured and robust. I have one suggestion to improve the reliability of the Python build extension in different environments.
This PR adds include/tvm/ffi/extra/cubin_launcher.h, a header-only utility that loads a cubin from a byte buffer or a file and launches it. The utility is based on the CUDA Runtime API. It is not compiled into libtvm_ffi.so, but is shipped with the package.
Usage
C++ Core Usage
See docs/guides/cubin_launcher.rst and examples/cubin_launcher.
Benchmark
The benchmark script is at examples/cubin_launcher/benchmark_overhead.py.
Note: we did not check the dtype and shape constraints in Triton's launch case, so the number above is a lower bound on a typical Triton launch overhead. When the constraints are checked, Triton's launch overhead will be higher. TVM-FFI, on the other hand, checks the constraints in C++ and remains fast.
About Triton Example
The Triton example is tricky, since the cubin's kernel parameters are not aligned with the kernel definition of the Triton kernel in Python. I checked the generated PTX: there are 5 parameters while there are only 3 in the kernel definition, and the last two are not used. We need a Triton expert to write the host-side code that prepares the kernel parameters. To make this widely usable, we need better documentation on Triton's calling convention for the generated cubin. But this is out of the scope of this PR.