Cutensor bindings by SpaceyLake · Pull Request #38 · TAPPorg/reference-implementation

SpaceyLake · 2026-02-09T16:10:37Z

Bindings to cutensor. This adds a handle to create_tensor_info. Setters for the tensor_info are not implemented because of complications. This code also includes a version of the test that loads implementations dynamically and a version of the demo that does the same. It also includes a cutensor-specific demo.
I also removed some deprecated code on this branch.
The code that is run on CUDA doesn't get automatically tested because standard GitHub runners only use CPUs.
The code uses an attribute to select whether on-device memory is used.

evaleev

prepared by claude, edited by me

PR #38: Cutensor Bindings — Review

+8,974 / -910 across 41 files | CI: All checks pass

Summary

This PR adds cuTENSOR bindings for the TAPP API, refactors the CMake build system (pushing test/example targets into subdirectories), adds a TAPP_handle parameter to TAPP_create_tensor_info (API-breaking change), renames TAPP_REFERENCE_ENABLE_TBLIS to TAPP_REFERENCE_USE_TBLIS, adds dynamic-loading test infrastructure, and removes some deprecated files.

High-level concerns

API-breaking change to TAPP_create_tensor_info — Adding TAPP_handle as a new parameter changes the public API. The reference implementation (reference_implementation/src/tensor.c) accepts the parameter but ignores it. This is the right design (the handle is needed by cuTENSOR but not by the reference impl), but consider whether this needs a version bump or changelog entry.
Negative strides and negative_str test disabled in demo.c — The negative stride test is commented out in demo.c with a cuTENSOR-specific comment, but demo.c links against tapp::reference, not tapp::cutensor. Disabling it here penalizes the reference implementation's test coverage for a cuTENSOR limitation. Consider keeping it enabled for the reference demo and only disabling it in cuTENSOR-specific tests.
Massive code duplication: test_dynamic.cpp (4,079 lines) — This is essentially a copy-paste of test.cpp with all calls going through a struct imp function-pointer table. Same for demo_dynamic.c vs demo.c. This creates a significant maintenance burden — any future test change must be made in both places. Consider using macros or templates to share the test logic.

Specific issues

Bugs / correctness

product.cpp:952 — Wrong handle cast:
```
plan_struct->handle = ((cutensorHandle_t*) handle);
struct handle* handle_struct = (struct handle*) plan_struct->handle;
```
handle is a TAPP_handle (i.e., intptr_t) that actually points to a struct handle. First it's cast to cutensorHandle_t* and stored, then the stored cutensorHandle_t* is cast to struct handle*. This only works by accident because the cutensorHandle_t* libhandle is the first member of struct handle. This is fragile and incorrect — plan_struct->handle should be typed as struct handle* or at minimum the first cast should be (struct handle*).
attributes.cpp:575 — memcpy to/from intptr_t as pointer:
```
memcpy((void*)handle_struct->attributes[0], value, sizeof(bool));
```
attributes[0] is an intptr_t holding a bool*. The cast (void*)handle_struct->attributes[0] is correct, but the design is fragile — the intptr_t* array is a poor man's type-erased attribute store. Consider at minimum documenting the ownership model.
error.cpp:754 — Extracting TAPP field then switching on error instead of tappVal:
```
uint64_t tappVal = code & TAPP_FIELD_MASK;
if (tappVal != 0) {
    switch (error)  // <-- should be switch(tappVal)
```
If both TAPP and cuTENSOR errors are packed, error will include the cuTENSOR bits and never match cases 1-15.
cutensor_demo.cpp:2678 — Wrong copy size in conjugate() test:
```
cudaMemcpy((void*)D, (void*)D_d, 9 * sizeof(float), cudaMemcpyDeviceToHost);
```
D is std::complex<float>[9], so this should be 9 * sizeof(std::complex<float>). Only half the data is copied back.
error.cpp:853 — CUDA error packing clears TAPP+cuTENSOR fields:
```
uint64_t cleared_val = val & (~LOW_FIELDS_MASK);
return static_cast<int>(cleared_val | new_cuda_val);
```
This discards any previously packed TAPP/cuTENSOR errors. The other pack_error overloads preserve other fields, but this one doesn't. Inconsistent behavior.
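The field-preservation point can be illustrated with a small sketch. The bit layout, mask names, and helper functions here are assumptions made for illustration only; the review does not show the actual definitions of TAPP_FIELD_MASK or LOW_FIELDS_MASK. The idea is that writing one field clears only that field's bits and leaves the others as they were.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical layout (NOT the PR's actual masks): three fields in one code.
//   bits 0..7    TAPP error
//   bits 8..15   cuTENSOR status
//   bits 16..31  CUDA error
constexpr std::uint32_t kTappShift = 0, kCutensorShift = 8, kCudaShift = 16;
constexpr std::uint32_t kTappMask = 0xFFu << kTappShift;
constexpr std::uint32_t kCutensorMask = 0xFFu << kCutensorShift;
constexpr std::uint32_t kCudaMask = 0xFFFFu << kCudaShift;

// Replace exactly one field; every bit outside `mask` is left untouched.
constexpr std::uint32_t set_field(std::uint32_t code, std::uint32_t mask,
                                  std::uint32_t shift, std::uint32_t value) {
    return (code & ~mask) | ((value << shift) & mask);
}

// Read one field back out of a packed code.
constexpr std::uint32_t get_field(std::uint32_t code, std::uint32_t mask,
                                  std::uint32_t shift) {
    return (code & mask) >> shift;
}

// CUDA overload written like the other two: it clears only the CUDA field.
constexpr std::uint32_t pack_cuda(std::uint32_t code, std::uint32_t cuda_err) {
    return set_field(code, kCudaMask, kCudaShift, cuda_err);
}
```

With this shape, a code that already holds a TAPP or cuTENSOR value keeps it after `pack_cuda` is called. Whether that is the desired policy is a separate design decision.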

Memory safety

execute_product in product.cpp — Early returns leak GPU memory. Every if (cerr != cudaSuccess) return pack_error(0, cerr) between cudaMallocAsync calls will leak all previously allocated device buffers (A_d, B_d, C_d, D_d, E_d, contraction_work). Consider using RAII or a goto-cleanup pattern.
create_tensor_product in product.cpp — Early returns leak plan_struct and partial state. If any cuTENSOR call fails after new product_plan, the plan_struct and its dynamically allocated members are leaked.
execute_product — perm_scalar_ptr uses malloc but is never freed on error path (line ~1216 returns before free(perm_scalar_ptr) if cutensorPermute fails).
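A minimal sketch of the RAII option for the leaks listed above. It deliberately avoids CUDA so that it runs anywhere: `fake_device_alloc` and `FakeDeviceFree` are hypothetical stand-ins for cudaMallocAsync and cudaFreeAsync, and `execute_sketch` is not the PR's function. The point is only that each buffer is owned by a `std::unique_ptr` with a custom deleter, so an early return releases everything allocated before it.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <memory>

// Counts allocations that have not been released yet (for demonstration only).
static int live_buffers = 0;

// Stand-in for cudaMallocAsync; `fail` simulates an allocation error.
static void* fake_device_alloc(std::size_t bytes, bool fail) {
    if (fail) return nullptr;
    ++live_buffers;
    return std::malloc(bytes);
}

// Stand-in for cudaFreeAsync, used as the unique_ptr deleter.
struct FakeDeviceFree {
    void operator()(void* p) const {
        std::free(p);
        --live_buffers;
    }
};

using DeviceBuffer = std::unique_ptr<void, FakeDeviceFree>;

// Returns 0 on success and a non-zero value on failure. No explicit cleanup
// is needed on the error paths: the owners go out of scope and free the data.
static int execute_sketch(bool fail_second_alloc) {
    DeviceBuffer a(fake_device_alloc(64, false));
    if (!a) return 1;
    DeviceBuffer b(fake_device_alloc(64, fail_second_alloc));
    if (!b) return 2;  // `a` is released automatically here
    // ... work that uses a.get() and b.get() would go here ...
    return 0;
}
```

In the real bindings the deleter would presumably also need the stream that the stream-ordered free is issued on; that detail is left out of the sketch.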

Style / quality

Missing newlines at end of file in essentially all new headers and source files under cutensor_bindings/. Most tools and compilers warn about this.
Unreachable break statements after return in switch cases throughout datatype.cpp and product.cpp (translate_operator, translate_datatype, etc.). Harmless but noisy.
VLA usage (int64_t sorted_strides_D[TAPP_get_nmodes(D)] in product.cpp, int64_t section_coordinates_D[...] in execute_product). VLAs are not standard C++ and are a compiler extension. Consider using std::vector or new[].
Magic number 15 for "invalid key" in attributes.cpp. This should use a named constant or the error enum.
cmake_minimum_required(VERSION 3.17) inside CMakeLists.txt at line 198 — cmake_minimum_required should only be called once at the top of the project. This is a policy change mid-file. Use if(CMAKE_VERSION VERSION_LESS 3.17) / message(FATAL_ERROR ...) instead, or bump the top-level requirement.
cutensor_bindings/CMakeLists.txt:338-341 — target_link_libraries(cutensor::cutensor INTERFACE CUDA::cudart) modifies an IMPORTED target's link interface. This is a surprising side effect — it means anyone finding cuTENSOR through this build gets CUDA::cudart added transitively, even if they didn't want it. Consider linking CUDA::cudart to tapp-cutensor directly instead (which is already done on line 370).
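For the VLA item above, a sketch of the `std::vector` replacement. The function name and what it computes are assumptions; the review only shows the declaration `int64_t sorted_strides_D[TAPP_get_nmodes(D)]`, not how the array is used.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Returns a sorted copy of the strides. nmodes is only known at run time,
// which is why the original code reached for a VLA (a compiler extension in
// C++). std::vector provides the same run-time sizing in standard C++.
std::vector<std::int64_t> sorted_strides(const std::int64_t* strides, int nmodes) {
    std::vector<std::int64_t> sorted(strides, strides + nmodes);
    std::sort(sorted.begin(), sorted.end());
    return sorted;
}
```

Code that previously passed the array to a C API can pass `sorted.data()` instead.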

CMake

examples/CMakeLists.txt:1565 — tapp-reference-exercise_tucker_answers links against tapp-reference (old target name) instead of tapp::reference. Inconsistent with the rest of the migration.
test/CMakeLists.txt — The dynamic test/demo targets are only built when TAPP_CUTENSOR is enabled, but they dlopen shared libraries at runtime and don't actually depend on cuTENSOR at compile time. Could they be useful without cuTENSOR too (e.g., testing two reference implementations)?

Test infrastructure

test_dynamic.h — pathA and pathB are hardcoded as "./libtapp-reference.so" and "./libtapp-cutensor.so". This won't work on macOS (.dylib) or if the build output is in a different directory. These should be configurable, e.g., via CMake configure_file or command-line arguments.
test_dynamic.cpp line 7257 — Syntax error in commented-out code: str(test_mixed_strides(impA, impB) has mismatched parens.
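One way to make the library paths configurable, sketched under assumptions: the macro names `TAPP_TEST_LIB_A` and `TAPP_TEST_LIB_B` and the helper `library_path` are hypothetical. A build system could supply the macros as compile definitions, and a command-line argument overrides them at run time.

```cpp
#include <cassert>
#include <string>

// Hypothetical macros: a CMake build could define these per target instead of
// relying on the hardcoded defaults below.
#ifndef TAPP_TEST_LIB_A
#define TAPP_TEST_LIB_A "./libtapp-reference.so"
#endif
#ifndef TAPP_TEST_LIB_B
#define TAPP_TEST_LIB_B "./libtapp-cutensor.so"
#endif

// argv[index] wins when it is present and non-empty; otherwise the
// compile-time fallback is used.
std::string library_path(int argc, const char* const* argv, int index,
                         const char* fallback) {
    if (index >= 0 && index < argc && argv[index] != nullptr &&
        argv[index][0] != '\0') {
        return argv[index];
    }
    return fallback;
}
```

The test's `main` would then call `library_path(argc, argv, 1, TAPP_TEST_LIB_A)` and `library_path(argc, argv, 2, TAPP_TEST_LIB_B)` before loading the libraries.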

Minor / positive notes

The CMake refactoring (pushing test/example targets into subdirectories) is a good cleanup
TAPP_REFERENCE_ENABLE_TBLIS -> TAPP_REFERENCE_USE_TBLIS rename is more descriptive
The printf("%s", message_buff) fix (from printf(message_buff)) is a correct format-string vulnerability fix
reduce_isolated_indices rename from contract_unique_idx is clearer
The conditional cleanup fix in run_tblis_mult (checking tblis_A_reduced != &tblis_A before freeing) fixes a real bug
The rand() change from -max() to min() avoids UB with signed overflow

SpaceyLake · 2026-02-25T16:34:23Z

I am going through and fixing these things, but I don't really understand 1. I am also unsure about 3.: I don't really know what a good fix would be. I once tried putting the functions that are the same into a helper file, but the templated ones need to know how they are used in order to know which types to compile for. I could specify which types to instantiate the templates for, but that didn't seem like a good idea. I could go to the extreme and use #ifdef to keep everything in one file. Also, I am not used to working with bits. @janbrandejs, do you understand 8.?

evaleev · 2026-02-25T18:20:11Z

> I am going through and fixing these things. But I don't really understand 1.

We have neither documentation nor a changelog, so this is probably not applicable, but if we did, the API change would need to be noted somewhere.

> I am also unsure about. I don't really know what's a good idea to fix that. Once I tried putting the functions that are

This suggests dealing with the duplication of tests in {X,X_dynamic}.cpp. We should be able to have a single source file for each X that generates 2 executables, one of which has the TAPP_DYNAMIC_LAUNCH macro defined ... something like that.
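A rough sketch of the single-source idea, assuming a `CALL` macro and a stand-in function `add_impl`, both hypothetical. Only the `TAPP_DYNAMIC_LAUNCH` macro name and the `struct imp` function-pointer table come from the discussion. The same test body compiles into a static executable or a dynamic one depending on whether the macro is defined.

```cpp
#include <cassert>

// Stand-in for one API function; in the real tests this would be a TAPP call.
static int add_impl(int a, int b) { return a + b; }

#ifdef TAPP_DYNAMIC_LAUNCH
// Dynamic build: calls go through a table of function pointers. The real test
// would fill the table from dlsym() results after dlopen().
struct imp {
    int (*add)(int, int);
};
static imp load_imp() { return imp{&add_impl}; }
#define CALL(impl, fn, ...) ((impl).fn(__VA_ARGS__))
#else
// Static build: the table is an empty tag and calls resolve at link time.
struct imp {};
static imp load_imp() { return imp{}; }
#define CALL(impl, fn, ...) (fn##_impl(__VA_ARGS__))
#endif

// Shared test body, written once and compiled for both executables.
static bool test_add(imp& impl) {
    (void)impl;  // unused in the static build
    return CALL(impl, add, 2, 3) == 5;
}
```

The build would then compile this file twice, once with `-DTAPP_DYNAMIC_LAUNCH` and once without, producing the two executables.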

janbrandejs · 2026-02-26T11:00:08Z

8.: I think the AI is saying here that we drop the information about subsequent errors after the first one occurs, instead of accumulating them (for instance, if there is an error for both CUDA and cuTENSOR). If that is the case, then it's fine: we agreed at the working group meeting that only the first error found will be reported.


evaleev

Updated review after the 7 new commits (85c59d3..b06a4aa). prepared by claude, edited by me

✅ Fixed by the new commits

| Commit | Prior finding |
| --- | --- |
| `85c59d3` | Wrong handle cast in `product.cpp`; now `plan_struct->handle = handle;` |
| `132d365` | `switch(error)` → `switch(tappVal)` in `error.cpp` |
| `349749d` | Workspace `std::max(..., 128 MiB)` minimum now applied |
| `5e60aef` | `tapp-reference` → `tapp::reference` in `examples/CMakeLists.txt` |
| `7335ccd` | `9 * sizeof(float)` → `9 * sizeof(std::complex<float>)` in `cutensor_demo.cpp::conjugate()` |
| `9b6952e` | Massive duplication: `test_dynamic.{cpp,h}` deleted (-5352 lines); single file gated by `TAPP_DYNAMIC_LAUNCH` |
| `b06a4aa` | Same for `demo_dynamic.c` (-1382 lines) |

The duplication cleanup is the big one — ~6,700 lines of copy-pasted code collapsed. The result is still noisy (30+ #ifdef TAPP_DYNAMIC_LAUNCH blocks in test.h), but it's correct and maintainable now.

⚠️ Previous findings still unaddressed

pack_error(int, cudaError_t) overload in error.cpp:847-854 still clears both TAPP and cuTENSOR fields (val & ~LOW_FIELDS_MASK) — asymmetric with the other two overloads. A cudaError after a packed TAPP/cuTENSOR error silently loses the earlier context.
GPU/host memory leaks on every early-return error path in TAPP_create_tensor_product and TAPP_execute_product — contraction_desc, permutation_desc, plan_pref, plans, and all cudaMallocAsync buffers (A_d, B_d, C_d, D_d, E_d, contraction_work) leak on failure. Not addressed.
cmake_minimum_required(VERSION 3.17) still appears mid-file in top CMakeLists.txt.
cutensor_bindings/CMakeLists.txt still calls target_link_libraries(cutensor::cutensor INTERFACE CUDA::cudart) on the IMPORTED target — interface pollution; should be moved to tapp-cutensor.
test/test.h:20-21 still hardcodes ./reference_implementation/libtapp-reference.so and ./cutensor_bindings/libtapp-cutensor.so. Will break on macOS (.dylib) and Windows (.dll). Pass via argv or configure_file with $<TARGET_FILE:...>.
Missing newlines at EOF, unreachable break after return, VLAs in C++, magic return 15; — all still present.
negative_str test still commented out in demo.c even though demo.c links tapp::reference, not cuTENSOR.

🆕 New findings

Major — no stream synchronization anywhere. grep cudaStreamSynchronize cutensor_bindings/ returns nothing. TAPP_execute_product issues cudaMemcpyAsync(... DeviceToHost ...) for D (product.cpp:298) and cudaFreeAsync for the scratch buffers, then returns. Caller must sync the executor's stream themselves before reading D. If that's the intended contract it needs to be documented; otherwise cudaMemcpy (sync) or an explicit cudaStreamSynchronize at the function tail is needed. Tests that pass today are likely passing by luck (small inputs finishing before host read).
Major — TAPP_attr_get writes into the wrong indirection level (attributes.cpp:23):
```
TAPP_error TAPP_attr_get(TAPP_attr attr, TAPP_key key, void** value)
{
    ...
    memcpy(value, (void*)handle_struct->attributes[0], sizeof(bool));
```
The parameter is void**, but the code writes 1 byte into the pointer slot itself, not into *value. Either the signature should be void* (matching TAPP_attr_set) or the body should memcpy(*value, ...).
Major — return values ignored on cuTENSOR cleanup/estimate paths. cutensorEstimateWorkspaceSize (product.cpp near 234) and cutensorDestroyPlanPreference results are not checked; a failure feeds garbage into cutensorCreatePlan or silently leaks.
Major — plan and handle not thread-safe. A single product_plan carries cutensorPlan_t + cutensorHandle_t references and is reused across TAPP_execute_product calls. cuTENSOR plans are not safe for concurrent use; no locking. Worth documenting at minimum.
Minor — assertions in production paths. assert(uintptr_t(contraction_work) % 128 == 0) (product.cpp:240) and similar in the host-memory path will be compiled out under NDEBUG. If the alignment matters, use a real check.
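For the alignment item, a sketch of a check that survives `-DNDEBUG`. The helper names and the error code are hypothetical; only the 128-byte requirement comes from the assertion quoted above.

```cpp
#include <cassert>
#include <cstdint>

// True when `ptr` is a multiple of `alignment` bytes. Unlike assert(), this
// is an ordinary function and is not compiled out in release builds.
inline bool is_aligned(const void* ptr, std::uintptr_t alignment) {
    return reinterpret_cast<std::uintptr_t>(ptr) % alignment == 0;
}

// Hypothetical error code used only in this sketch.
constexpr int kErrMisalignedWorkspace = 1;

// Returns 0 when the workspace meets the 128-byte requirement, otherwise an
// error code that the caller can propagate.
inline int check_workspace(const void* workspace) {
    if (!is_aligned(workspace, 128)) return kErrMisalignedWorkspace;
    return 0;
}
```

The caller would return the error through the usual error path instead of aborting.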

Verdict

Real progress — the latest 7 commits resolve most of the cited bugs (handle cast, error switch, memcpy size, link name, and the massive duplication). The remaining items aren't deal-breakers individually, but the GPU memory-leak-on-error-path and the missing stream synchronization are correctness issues that should be addressed before merge. Hardcoded .so paths still block portability to non-Linux runners.

Suggested before merging:

Add cudaStreamSynchronize(*(cudaStream_t*)exec); at the tail of TAPP_execute_product (or document async-result semantics).
Wrap the failure paths in create_tensor_product / execute_product with a goto-cleanup or RAII so descriptors/plans/device buffers are freed on error.
Replace .so literals in test/test.h with values passed via CMake ($<TARGET_FILE:tapp-reference> etc.) — also makes the test usable on macOS/Windows.
Normalize pack_error(cudaError_t) to preserve the other fields like the TAPP and cuTENSOR overloads do.
Fix TAPP_attr_get signature/indirection.
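The two ways of fixing the TAPP_attr_get indirection can be sketched as follows. `attr_store` and both function names are simplified stand-ins, not the PR's types or signatures; the sketch only shows where the copy should land in each variant.

```cpp
#include <cassert>
#include <cstring>

// Simplified stand-in for the attribute store; not the PR's actual layout.
struct attr_store {
    bool use_device_memory;
};

// Option A: the getter takes void*, mirroring the setter, and copies the
// stored value into the caller's object.
int attr_get(const attr_store* attr, void* value) {
    std::memcpy(value, &attr->use_device_memory, sizeof(bool));
    return 0;
}

// Option B: keep void**, but write through *value, so that the caller's
// object is updated and the pointer slot itself is left alone.
int attr_get_indirect(const attr_store* attr, void** value) {
    std::memcpy(*value, &attr->use_device_memory, sizeof(bool));
    return 0;
}
```

Option A seems the simpler contract if the setter already takes `void*`, but either is consistent as long as the header and the body agree.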

SpaceyLake and others added 30 commits February 2, 2026 16:17

First stage of cutensor wrapper, only works with basic strides

b3da13a

Added the use of handle

362962c

Updated bindings allowing for non-contigous output tensor.

f2ed80f

Modified to work with current CuTensor bindings

933fba4

Added functionality for elemental operation on D

a2d46d3

Fixed function name

00e90e5

Fixed precision type

439d5cf

Small sectioning optimization

e8f86f0

Fixed scalar for permute D

412f1fe

Fixed sectioning

f584e7d

Created a demo version that loads libraries dynamically

2b2ecec

Created a test version that loads libraries dynamically

29230cb

Simple exapmle of using CuTensor

aa69f9a

Made cuda stream a part of TAPP_executor

f407841

Algorithm correction

4ca108b

Added cutensor handle to TAPP_handle

a917783

Corrected copying of memory

d80d06f

cutensor error handling

f8e70fb

can compile with cmake

87cdea5

Fixed typo

3353f35

Added the handle to create tensor info

31b44ba

Added handle when creating tensor info in old files

0d67763

Uncommented code

7dbaf36

Made test use tblis instead of cutensor

81e8234

Added the use of attributes to decide if input is on host or device

c6d6737

Added demo for cutensor with on device input

9f361ad

Dynamic demo running on cutensor with attribute to telling use of host memory

2a466f3

Updated error handling

7f061fa

Updated function calls with create executor and handle as part of the api

d701639

Added define statement

f6838a0

SpaceyLake and others added 18 commits February 24, 2026 14:52

Workaround, only doing reductions when necessary, avoiding some cases that doesn't work for TBLIS right now

ca12525

Put alpha and beta to more appropriate values

b64966a

[cutensor] slim down cmake harness + no need for CUDA

edf664a

[cutensor] cleanup CMake yet more, missing/misnamed headers

2ef1368

[cmake] push down tests/examples CMake code into the respective subdirs

922c7b2

[cutensor] tapp-reference-cutensor -> tapp-cutensor

8d589d5

Fixed alpha, beta range for dynamic test

589be46

Moved includes to header files

c60462a

Added missed semicolon

5a520d9

include cutensor.h instead of cutensor/types.h to inject cuda_runtime.h

c8a2d36

Corrected paths for the dynamically loaded libs

4917a73

Removed accidental character

03a03fd

Removed cuda from languages

923e2b1

Merge branch 'cutensor_bindings' of github.com:TAPPorg/reference-implementation into cutensor_bindings

447e382

Removed old, unused file

b2ee699

Changed random seed because seed 0 generates cases that doesn't agree with TBLIS

e2f1262

Fixed directories when testing

5eded62

Further directory fix when for tests

53089b9

evaleev reviewed Feb 24, 2026

SpaceyLake added 7 commits February 27, 2026 17:15

Fixed handle in cutensor plan struct

85c59d3

Fixed value used for giving error description

132d365

Added the recommended minimum workspace

349749d

Fixed linking name

5e60aef

Fixed size of copied memory

7335ccd

Combined tests with dynamically loaded and statically loaded libs into one file separated by compile definition

9b6952e

Combined demos with dynamically loaded and statically loaded libs into one file separated by compile definition

b06a4aa

evaleev reviewed May 13, 2026

Cutensor bindings #38

SpaceyLake wants to merge 202 commits into main from cutensor_bindings

4 participants
