Cutensor bindings#38
Conversation
… that doesn't work for TBLIS right now
…ementation into cutensor_bindings
There was a problem hiding this comment.
prepared by claude, edited by me
PR #38: Cutensor Bindings — Review
+8,974 / -910 across 41 files | CI: All checks pass
Summary
This PR adds cuTENSOR bindings for the TAPP API, refactors the CMake build system (pushing test/example targets into subdirectories), adds a TAPP_handle parameter to TAPP_create_tensor_info (API-breaking change), renames TAPP_REFERENCE_ENABLE_TBLIS to TAPP_REFERENCE_USE_TBLIS, adds dynamic-loading test infrastructure, and removes some deprecated files.
High-level concerns
-
API-breaking change to
TAPP_create_tensor_info— AddingTAPP_handleas a new parameter changes the public API. The reference implementation (reference_implementation/src/tensor.c) accepts the parameter but ignores it. This is the right design (the handle is needed by cuTENSOR but not by the reference impl), but consider whether this needs a version bump or changelog entry. -
Negative strides and
negative_strtest disabled indemo.c— The negative stride test is commented out indemo.cwith a cuTENSOR-specific comment, butdemo.clinks againsttapp::reference, nottapp::cutensor. Disabling it here penalizes the reference implementation's test coverage for a cuTENSOR limitation. Consider keeping it enabled for the reference demo and only disabling it in cuTENSOR-specific tests. -
Massive code duplication:
test_dynamic.cpp(4,079 lines) — This is essentially a copy-paste oftest.cppwith all calls going through astruct impfunction-pointer table. Same fordemo_dynamic.cvsdemo.c. This creates a significant maintenance burden — any future test change must be made in both places. Consider using macros or templates to share the test logic.
Specific issues
Bugs / correctness
-
product.cpp:952— Wrong handle cast:plan_struct->handle = ((cutensorHandle_t*) handle); struct handle* handle_struct = (struct handle*) plan_struct->handle;
handleis aTAPP_handle(i.e.,intptr_t) that actually points to astruct handle. First it's cast tocutensorHandle_t*and stored, then the storedcutensorHandle_t*is cast tostruct handle*. This only works by accident because thecutensorHandle_t* libhandleis the first member ofstruct handle. This is fragile and incorrect —plan_struct->handleshould be typed asstruct handle*or at minimum the first cast should be(struct handle*). -
attributes.cpp:575—memcpyto/fromintptr_tas pointer:memcpy((void*)handle_struct->attributes[0], value, sizeof(bool));
attributes[0]is anintptr_tholding abool*. The cast(void*)handle_struct->attributes[0]is correct, but the design is fragile — theintptr_t*array is a poor man's type-erased attribute store. Consider at minimum documenting the ownership model. -
error.cpp:754— Extracting TAPP field then switching onerrorinstead oftappVal:uint64_t tappVal = code & TAPP_FIELD_MASK; if (tappVal != 0) { switch (error) // <-- should be switch(tappVal)
If both TAPP and cuTENSOR errors are packed,
errorwill include the cuTENSOR bits and never match cases 1-15. -
cutensor_demo.cpp:2678— Wrong copy size inconjugate()test:cudaMemcpy((void*)D, (void*)D_d, 9 * sizeof(float), cudaMemcpyDeviceToHost);
Disstd::complex<float>[9], so this should be9 * sizeof(std::complex<float>). Only half the data is copied back. -
error.cpp:853— CUDA error packing clears TAPP+cuTENSOR fields:uint64_t cleared_val = val & (~LOW_FIELDS_MASK); return static_cast<int>(cleared_val | new_cuda_val);
This discards any previously packed TAPP/cuTENSOR errors. The other
pack_erroroverloads preserve other fields, but this one doesn't. Inconsistent behavior.
Memory safety
-
execute_productinproduct.cpp— Early returns leak GPU memory. Everyif (cerr != cudaSuccess) return pack_error(0, cerr)betweencudaMallocAsynccalls will leak all previously allocated device buffers (A_d,B_d,C_d,D_d,E_d,contraction_work). Consider using RAII or a goto-cleanup pattern. -
create_tensor_productinproduct.cpp— Early returns leakplan_structand partial state. If any cuTENSOR call fails afternew product_plan, theplan_structand its dynamically allocated members are leaked. -
execute_product—perm_scalar_ptrusesmallocbut is never freed on error path (line ~1216 returns beforefree(perm_scalar_ptr)ifcutensorPermutefails).
Style / quality
-
Missing newlines at end of file in essentially all new headers and source files under
cutensor_bindings/. Most tools and compilers warn about this. -
Unreachable
breakstatements afterreturnin switch cases throughoutdatatype.cppandproduct.cpp(translate_operator,translate_datatype, etc.). Harmless but noisy. -
VLA usage (
int64_t sorted_strides_D[TAPP_get_nmodes(D)]inproduct.cpp,int64_t section_coordinates_D[...]inexecute_product). VLAs are not standard C++ and are a compiler extension. Consider usingstd::vectorornew[]. -
Magic number
15for "invalid key" inattributes.cpp. This should use a named constant or the error enum. -
cmake_minimum_required(VERSION 3.17)insideCMakeLists.txtat line 198 —cmake_minimum_requiredshould only be called once at the top of the project. This is a policy change mid-file. Useif(CMAKE_VERSION VERSION_LESS 3.17)/message(FATAL_ERROR ...)instead, or bump the top-level requirement. -
cutensor_bindings/CMakeLists.txt:338-341—target_link_libraries(cutensor::cutensor INTERFACE CUDA::cudart)modifies an IMPORTED target's link interface. This is a surprising side effect — it means anyone finding cuTENSOR through this build gets CUDA::cudart added transitively, even if they didn't want it. Consider linking CUDA::cudart totapp-cutensordirectly instead (which is already done on line 370).
CMake
-
examples/CMakeLists.txt:1565—tapp-reference-exercise_tucker_answerslinks againsttapp-reference(old target name) instead oftapp::reference. Inconsistent with the rest of the migration. -
test/CMakeLists.txt— The dynamic test/demo targets are only built whenTAPP_CUTENSORis enabled, but theydlopenshared libraries at runtime and don't actually depend on cuTENSOR at compile time. Could they be useful without cuTENSOR too (e.g., testing two reference implementations)?
Test infrastructure
-
test_dynamic.h—pathAandpathBare hardcoded as"./libtapp-reference.so"and"./libtapp-cutensor.so". This won't work on macOS (.dylib) or if the build output is in a different directory. These should be configurable, e.g., via CMakeconfigure_fileor command-line arguments. -
test_dynamic.cppline 7257 — Syntax error in commented-out code:str(test_mixed_strides(impA, impB)has mismatched parens.
Minor / positive notes
- The CMake refactoring (pushing test/example targets into subdirectories) is a good cleanup
TAPP_REFERENCE_ENABLE_TBLIS->TAPP_REFERENCE_USE_TBLISrename is more descriptive- The
printf("%s", message_buff)fix (fromprintf(message_buff)) is a correct format-string vulnerability fix reduce_isolated_indicesrename fromcontract_unique_idxis clearer- The conditional cleanup fix in
run_tblis_mult(checkingtblis_A_reduced != &tblis_Abefore freeing) fixes a real bug - The
rand()change from-max()tomin()avoids UB with signed overflow
|
I am going through and fixing these things. But I don't really understand 1., 3. I am also unsure about. I don't really know what's a good idea to fix that. Once I tried putting the functions that are the same into a helper file, but the template ones need to know how they are used to know which types to compile for. I could specify which types to template for, but it didn't seem like a good idea. I could do it in extreme and have #ifdef and just use one file. Also, I am not used to working with bit @janbrandejs do you understand 8.? |
we don't have either documentation or changelog, so this is probably not applicable, but if we did the API change would need to be noted somewhere.
This suggests to deal with duplication of tests in |
|
8.: I thijk here the AI says that we drop the information about subsequent errors after the first one occurs, instead of accumulatig them (if there is both error for cuda and for cutensor for instance). If this is the case, then it's fine, we agreed on working group meeting that only the first error foud will be reported. |
…o one file separated by compile definition
…o one file separated by compile definition
evaleev
left a comment
There was a problem hiding this comment.
Updated review after the 7 new commits (85c59d3..b06a4aa). prepared by claude, edited by me
✅ Fixed by the new commits
| Commit | Prior finding |
|---|---|
85c59d3 |
Wrong handle cast in product.cpp — now plan_struct->handle = handle; |
132d365 |
switch(error) → switch(tappVal) in error.cpp |
349749d |
Workspace std::max(..., 128 MiB) minimum now applied |
5e60aef |
tapp-reference → tapp::reference in examples/CMakeLists.txt |
7335ccd |
9*sizeof(float) → 9*sizeof(std::complex<float>) in cutensor_demo.cpp::conjugate() |
9b6952e |
Massive duplication: test_dynamic.{cpp,h} deleted (-5352 lines); single file gated by TAPP_DYNAMIC_LAUNCH |
b06a4aa |
Same for demo_dynamic.c (-1382 lines) |
The duplication cleanup is the big one — ~6,700 lines of copy-pasted code collapsed. The result is still noisy (30+ #ifdef TAPP_DYNAMIC_LAUNCH blocks in test.h), but it's correct and maintainable now.
⚠️ Previous findings still unaddressed
pack_error(int, cudaError_t)overload inerror.cpp:847-854still clears both TAPP and cuTENSOR fields (val & ~LOW_FIELDS_MASK) — asymmetric with the other two overloads. AcudaErrorafter a packed TAPP/cuTENSOR error silently loses the earlier context.- GPU/host memory leaks on every early-return error path in
TAPP_create_tensor_productandTAPP_execute_product—contraction_desc,permutation_desc,plan_pref, plans, and allcudaMallocAsyncbuffers (A_d,B_d,C_d,D_d,E_d,contraction_work) leak on failure. Not addressed. cmake_minimum_required(VERSION 3.17)still appears mid-file in topCMakeLists.txt.cutensor_bindings/CMakeLists.txtstill callstarget_link_libraries(cutensor::cutensor INTERFACE CUDA::cudart)on the IMPORTED target — interface pollution; should be moved totapp-cutensor.test/test.h:20-21still hardcodes./reference_implementation/libtapp-reference.soand./cutensor_bindings/libtapp-cutensor.so. Will break on macOS (.dylib) and Windows (.dll). Pass viaargvorconfigure_filewith$<TARGET_FILE:...>.- Missing newlines at EOF, unreachable
breakafterreturn, VLAs in C++, magicreturn 15;— all still present. negative_strtest still commented out indemo.ceven thoughdemo.clinkstapp::reference, not cuTENSOR.
🆕 New findings
-
Major — no stream synchronization anywhere.
grep cudaStreamSynchronize cutensor_bindings/returns nothing.TAPP_execute_productissuescudaMemcpyAsync(... DeviceToHost ...)forD(product.cpp:298) andcudaFreeAsyncfor the scratch buffers, then returns. Caller must sync the executor's stream themselves before readingD. If that's the intended contract it needs to be documented; otherwisecudaMemcpy(sync) or an explicitcudaStreamSynchronizeat the function tail is needed. Tests that pass today are likely passing by luck (small inputs finishing before host read). -
Major —
TAPP_attr_getwrites into the wrong indirection level (attributes.cpp:23):TAPP_error TAPP_attr_get(TAPP_attr attr, TAPP_key key, void** value) { ... memcpy(value, (void*)handle_struct->attributes[0], sizeof(bool));
The parameter is
void**, but the code writes 1 byte into the pointer slot itself, not into*value. Either the signature should bevoid*(matchingTAPP_attr_set) or the body shouldmemcpy(*value, ...). -
Major — return values ignored on cuTENSOR cleanup/estimate paths.
cutensorEstimateWorkspaceSize(product.cppnear 234) andcutensorDestroyPlanPreferenceresults are not checked; a failure feeds garbage intocutensorCreatePlanor silently leaks. -
Major — plan and handle not thread-safe. A single
product_plancarriescutensorPlan_t+cutensorHandle_treferences and is reused acrossTAPP_execute_productcalls. cuTENSOR plans are not safe for concurrent use; no locking. Worth documenting at minimum. -
Minor — assertions in production paths.
assert(uintptr_t(contraction_work) % 128 == 0)(product.cpp:240) and similar in the host-memory path will be compiled out underNDEBUG. If the alignment matters, use a real check.
Verdict
Real progress — the latest 7 commits resolve most of the cited bugs (handle cast, error switch, memcpy size, link name, and the massive duplication). The remaining items aren't deal-breakers individually, but the GPU memory-leak-on-error-path and the missing stream synchronization are correctness issues that should be addressed before merge. Hardcoded .so paths still block portability to non-Linux runners.
Suggested before merging:
- Add
cudaStreamSynchronize(*(cudaStream_t*)exec);at the tail ofTAPP_execute_product(or document async-result semantics). - Wrap the failure paths in
create_tensor_product/execute_productwith a goto-cleanup or RAII so descriptors/plans/device buffers are freed on error. - Replace
.soliterals intest/test.hwith values passed via CMake ($<TARGET_FILE:tapp-reference>etc.) — also makes the test usable on macOS/Windows. - Normalize
pack_error(cudaError_t)to preserve the other fields like the TAPP and cuTENSOR overloads do. - Fix
TAPP_attr_getsignature/indirection.
Bindings to cutensor. This adds a handle to create_tensor_info. Setters for the tensor_info are not implemented because of complications. This code also includes a version of the test that loads implementations dynamically and a version of the demo that does the same. It also includes a cutensor-specific demo.
I also removed some deprecated code on this branch.
The code that is run on CUDA doesn't get automatically tested because standard GitHub runners only use CPUs.
The code uses an attribute to allow the use of on-device memory or not.