Cutensor bindings #38

Open
SpaceyLake wants to merge 202 commits into main from cutensor_bindings

@SpaceyLake
Collaborator

Bindings to cutensor. This adds a handle to create_tensor_info. Setters for the tensor_info are not implemented because of complications. This code also includes a version of the test that loads implementations dynamically and a version of the demo that does the same. It also includes a cutensor-specific demo.
I also removed some deprecated code on this branch.
The code that is run on CUDA doesn't get automatically tested because standard GitHub runners only use CPUs.
The code uses an attribute to allow the use of on-device memory or not.

Contributor

@evaleev evaleev left a comment

prepared by claude, edited by me

PR #38: Cutensor Bindings — Review

+8,974 / -910 across 41 files | CI: All checks pass

Summary

This PR adds cuTENSOR bindings for the TAPP API, refactors the CMake build system (pushing test/example targets into subdirectories), adds a TAPP_handle parameter to TAPP_create_tensor_info (API-breaking change), renames TAPP_REFERENCE_ENABLE_TBLIS to TAPP_REFERENCE_USE_TBLIS, adds dynamic-loading test infrastructure, and removes some deprecated files.


High-level concerns

  1. API-breaking change to TAPP_create_tensor_info — Adding TAPP_handle as a new parameter changes the public API. The reference implementation (reference_implementation/src/tensor.c) accepts the parameter but ignores it. This is the right design (the handle is needed by cuTENSOR but not by the reference impl), but consider whether this needs a version bump or changelog entry.

  2. Negative strides and negative_str test disabled in demo.c — The negative stride test is commented out in demo.c with a cuTENSOR-specific comment, but demo.c links against tapp::reference, not tapp::cutensor. Disabling it here penalizes the reference implementation's test coverage for a cuTENSOR limitation. Consider keeping it enabled for the reference demo and only disabling it in cuTENSOR-specific tests.

  3. Massive code duplication: test_dynamic.cpp (4,079 lines) — This is essentially a copy-paste of test.cpp with all calls going through a struct imp function-pointer table. Same for demo_dynamic.c vs demo.c. This creates a significant maintenance burden — any future test change must be made in both places. Consider using macros or templates to share the test logic.


Specific issues

Bugs / correctness

  1. product.cpp:952 — Wrong handle cast:

    plan_struct->handle = ((cutensorHandle_t*) handle);
    struct handle* handle_struct = (struct handle*) plan_struct->handle;

    handle is a TAPP_handle (i.e., intptr_t) that actually points to a struct handle. First it's cast to cutensorHandle_t* and stored, then the stored cutensorHandle_t* is cast to struct handle*. This only works by accident because the cutensorHandle_t* libhandle is the first member of struct handle. This is fragile and incorrect — plan_struct->handle should be typed as struct handle* or at minimum the first cast should be (struct handle*).

  2. attributes.cpp:575 - memcpy to/from intptr_t as pointer:

    memcpy((void*)handle_struct->attributes[0], value, sizeof(bool));

    attributes[0] is an intptr_t holding a bool*. The cast (void*)handle_struct->attributes[0] is correct, but the design is fragile — the intptr_t* array is a poor man's type-erased attribute store. Consider at minimum documenting the ownership model.

  3. error.cpp:754 — Extracting TAPP field then switching on error instead of tappVal:

    uint64_t tappVal = code & TAPP_FIELD_MASK;
    if (tappVal != 0) {
        switch (error)  // <-- should be switch(tappVal)

    If both TAPP and cuTENSOR errors are packed, error will include the cuTENSOR bits and never match cases 1-15.

  4. cutensor_demo.cpp:2678 — Wrong copy size in conjugate() test:

    cudaMemcpy((void*)D, (void*)D_d, 9 * sizeof(float), cudaMemcpyDeviceToHost);

    D is std::complex<float>[9], so this should be 9 * sizeof(std::complex<float>). Only half the data is copied back.

  5. error.cpp:853 — CUDA error packing clears TAPP+cuTENSOR fields:

    uint64_t cleared_val = val & (~LOW_FIELDS_MASK);
    return static_cast<int>(cleared_val | new_cuda_val);

    This discards any previously packed TAPP/cuTENSOR errors. The other pack_error overloads preserve other fields, but this one doesn't. Inconsistent behavior.
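The two error.cpp items above both come down to treating the packed code as independent bit fields. A minimal sketch, assuming a made-up layout of three 8-bit fields (the real masks and widths in error.cpp may differ, and all names below are hypothetical): writing one field leaves the others alone, and the switch is on the extracted field rather than the whole code.

```cpp
#include <cstdint>

// Assumed layout, for illustration only: three 8-bit fields in one code.
constexpr std::uint64_t kFieldMask     = 0xFF;
constexpr unsigned      kTappShift     = 0;
constexpr unsigned      kCutensorShift = 8;
constexpr unsigned      kCudaShift     = 16;

// Overwrite one field and leave the other two untouched.
inline std::uint64_t set_field(std::uint64_t code, unsigned shift, std::uint64_t value) {
    const std::uint64_t mask = kFieldMask << shift;
    return (code & ~mask) | ((value << shift) & mask);
}

// Read one field back out.
inline std::uint64_t get_field(std::uint64_t code, unsigned shift) {
    return (code >> shift) & kFieldMask;
}

// Classify the TAPP part of a packed code. The switch is on the extracted
// field; switching on the full code would miss every case as soon as a
// cuTENSOR or CUDA field is also set.
inline int classify_tapp(std::uint64_t code) {
    const std::uint64_t tappVal = get_field(code, kTappShift);
    switch (tappVal) {
        case 0:  return 0;  // no TAPP error
        case 1:  return 1;  // hypothetical: first TAPP error kind
        default: return 2;  // hypothetical: any other TAPP error
    }
}
```

Whether a later CUDA error should preserve or replace earlier fields is a design decision for the project; the sketch only shows that all three overloads can follow the same rule.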

Memory safety

  1. execute_product in product.cpp — Early returns leak GPU memory. Every if (cerr != cudaSuccess) return pack_error(0, cerr) between cudaMallocAsync calls will leak all previously allocated device buffers (A_d, B_d, C_d, D_d, E_d, contraction_work). Consider using RAII or a goto-cleanup pattern.

  2. create_tensor_product in product.cpp — Early returns leak plan_struct and partial state. If any cuTENSOR call fails after new product_plan, the plan_struct and its dynamically allocated members are leaked.

  3. execute_product - perm_scalar_ptr uses malloc but is never freed on the error path (line ~1216 returns before free(perm_scalar_ptr) if cutensorPermute fails).
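One possible shape for the RAII option is sketched below. It is not the PR's code: fake_device_alloc and fake_device_free are hypothetical stand-ins for cudaMallocAsync and cudaFreeAsync so the example runs without a GPU, and a real deleter would also need to carry the stream.

```cpp
#include <cstddef>
#include <cstdlib>
#include <memory>

// Counts buffers that are currently allocated, so leaks are observable.
static int live_buffers = 0;

// Hypothetical stand-ins for cudaMallocAsync / cudaFreeAsync.
inline void* fake_device_alloc(std::size_t bytes) {
    void* p = std::malloc(bytes);
    if (p) ++live_buffers;
    return p;
}
inline void fake_device_free(void* p) {
    if (p) { std::free(p); --live_buffers; }
}

// Deleter that releases a buffer when its owner goes out of scope.
struct DeviceFree {
    void operator()(void* p) const { fake_device_free(p); }
};
using DeviceBuffer = std::unique_ptr<void, DeviceFree>;

// Returns 0 on success and nonzero on failure. Every return path,
// including the early ones, frees whatever was allocated before it.
inline int execute_sketch(bool fail_midway) {
    DeviceBuffer a(fake_device_alloc(64));
    if (!a) return 1;
    DeviceBuffer b(fake_device_alloc(64));
    if (!b) return 1;              // a is released here
    if (fail_midway) return 2;     // a and b are released here
    DeviceBuffer work(fake_device_alloc(128));
    if (!work) return 1;           // a and b are released here
    return 0;                      // a, b and work are released here
}
```

The same guard idea applies to plan_struct and perm_scalar_ptr on the host side; a goto-cleanup block achieves the same result if RAII types are not wanted in this code.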

Style / quality

  1. Missing newlines at end of file in essentially all new headers and source files under cutensor_bindings/. Most tools and compilers warn about this.

  2. Unreachable break statements after return in switch cases throughout datatype.cpp and product.cpp (translate_operator, translate_datatype, etc.). Harmless but noisy.

  3. VLA usage (int64_t sorted_strides_D[TAPP_get_nmodes(D)] in product.cpp, int64_t section_coordinates_D[...] in execute_product). VLAs are not standard C++ and are a compiler extension. Consider using std::vector or new[].

  4. Magic number 15 for "invalid key" in attributes.cpp. This should use a named constant or the error enum.

  5. cmake_minimum_required(VERSION 3.17) inside CMakeLists.txt at line 198 — cmake_minimum_required should only be called once at the top of the project. This is a policy change mid-file. Use if(CMAKE_VERSION VERSION_LESS 3.17) / message(FATAL_ERROR ...) instead, or bump the top-level requirement.

  6. cutensor_bindings/CMakeLists.txt:338-341 - target_link_libraries(cutensor::cutensor INTERFACE CUDA::cudart) modifies an IMPORTED target's link interface. This is a surprising side effect: anyone finding cuTENSOR through this build gets CUDA::cudart added transitively, even if they didn't want it. Consider linking CUDA::cudart to tapp-cutensor directly instead (which is already done on line 370).
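For the VLA item, a minimal sketch of the std::vector replacement, assuming the array only needs a run-time length and contiguous storage. The helper name sorted_strides and its signature are hypothetical, not taken from product.cpp.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical helper: returns a sorted copy of a strides array whose
// length is only known at run time.
inline std::vector<std::int64_t> sorted_strides(const std::int64_t* strides, int nmodes) {
    // Replaces a declaration like `int64_t sorted[nmodes]`, which is a
    // compiler extension in C++.
    std::vector<std::int64_t> sorted(strides, strides + nmodes);
    std::sort(sorted.begin(), sorted.end());
    return sorted;
}
```

Where a raw pointer is needed for a C API call, sorted.data() provides it.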

CMake

  1. examples/CMakeLists.txt:1565 - tapp-reference-exercise_tucker_answers links against tapp-reference (old target name) instead of tapp::reference. Inconsistent with the rest of the migration.

  2. test/CMakeLists.txt — The dynamic test/demo targets are only built when TAPP_CUTENSOR is enabled, but they dlopen shared libraries at runtime and don't actually depend on cuTENSOR at compile time. Could they be useful without cuTENSOR too (e.g., testing two reference implementations)?

Test infrastructure

  1. test_dynamic.h - pathA and pathB are hardcoded as "./libtapp-reference.so" and "./libtapp-cutensor.so". This won't work on macOS (.dylib) or if the build output is in a different directory. These should be configurable, e.g., via CMake configure_file or command-line arguments.

  2. test_dynamic.cpp line 7257 — Syntax error in commented-out code: str(test_mixed_strides(impA, impB) has mismatched parens.
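One way to make the library paths configurable from the command line is sketched below. The helper names and the argument position are hypothetical, and the "lib" prefix in the fallback is an assumption that does not hold for typical Windows builds.

```cpp
#include <string>

// Shared-library suffix for the platform this file is compiled on.
inline const char* shared_lib_suffix() {
#if defined(_WIN32)
    return ".dll";
#elif defined(__APPLE__)
    return ".dylib";
#else
    return ".so";
#endif
}

// Hypothetical helper: use argv[index] as the library path when it is
// present and non-empty; otherwise fall back to <dir>/lib<name><suffix>.
inline std::string resolve_library_path(int argc, const char* const* argv, int index,
                                        const std::string& default_dir,
                                        const std::string& name) {
    if (index < argc && argv[index] != nullptr && argv[index][0] != '\0') {
        return argv[index];
    }
    return default_dir + "/lib" + name + shared_lib_suffix();
}
```

The test would then pass the resolved string to dlopen instead of a literal, and the build system can supply the actual output locations as arguments.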


Minor / positive notes

  • The CMake refactoring (pushing test/example targets into subdirectories) is a good cleanup
  • TAPP_REFERENCE_ENABLE_TBLIS -> TAPP_REFERENCE_USE_TBLIS rename is more descriptive
  • The printf("%s", message_buff) fix (from printf(message_buff)) is a correct format-string vulnerability fix
  • reduce_isolated_indices rename from contract_unique_idx is clearer
  • The conditional cleanup fix in run_tblis_mult (checking tblis_A_reduced != &tblis_A before freeing) fixes a real bug
  • The rand() change from -max() to min() avoids UB with signed overflow

@SpaceyLake
Collaborator Author

I am going through and fixing these things, but I don't really understand 1. I am also unsure about 3.; I don't know what a good fix would be. I once tried putting the functions that are the same into a helper file, but the templated ones need to know how they are used in order to know which types to compile for. I could specify which types to instantiate the templates for, but that didn't seem like a good idea. At the extreme, I could use #ifdef and keep everything in one file. Also, I am not used to working with bits. @janbrandejs, do you understand 8.?

@evaleev
Contributor

evaleev commented Feb 25, 2026

> I am going through and fixing these things. But I don't really understand 1.

we don't have either documentation or changelog, so this is probably not applicable, but if we did the API change would need to be noted somewhere.

> 3. I am also unsure about. I don't really know what's a good idea to fix that. Once I tried putting the functions that are

This suggests dealing with the duplication of tests in {X,X_dynamic}.cpp. We should be able to have a single source file for each X that generates 2 executables, one of which has the TAPP_DYNAMIC_LAUNCH macro defined ... something like that.
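A rough sketch of what such a single source file could look like. Only TAPP_DYNAMIC_LAUNCH and the struct name imp come from this thread; the TAPP_CALL macro, tapp_add and make_imp are hypothetical placeholders, and in a real dynamic build the table would be filled via dlsym rather than from a local function.

```cpp
// Stand-in for a function exported by the implementation under test.
inline int tapp_add(int a, int b) { return a + b; }

#ifdef TAPP_DYNAMIC_LAUNCH
// Dynamic build: calls go through a function-pointer table.
struct imp { int (*add)(int, int); };
#define TAPP_CALL(impl, fn, ...) ((impl).fn(__VA_ARGS__))
#else
// Static build: the table is empty and calls resolve at link time.
struct imp {};
#define TAPP_CALL(impl, fn, ...) (tapp_##fn(__VA_ARGS__))
#endif

// Builds the table for whichever mode this file is compiled in.
inline imp make_imp() {
#ifdef TAPP_DYNAMIC_LAUNCH
    return imp{&tapp_add};
#else
    return imp{};
#endif
}

// The test body is written once and compiles in both modes.
inline bool test_add(imp& impl) {
    (void)impl;  // unused in the static build
    return TAPP_CALL(impl, add, 2, 3) == 5;
}
```

The build would then compile this file twice, once with -DTAPP_DYNAMIC_LAUNCH and once without, producing the two executables.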

@janbrandejs
Contributor

8.: I think the AI is saying here that we drop the information about subsequent errors after the first one occurs, instead of accumulating them (for instance, if there is an error for both CUDA and cuTENSOR). If this is the case, then it's fine: we agreed at the working group meeting that only the first error found will be reported.

Contributor

@evaleev evaleev left a comment

Updated review after the 7 new commits (85c59d3..b06a4aa). prepared by claude, edited by me

✅ Fixed by the new commits

| Commit | Prior finding |
| --- | --- |
| 85c59d3 | Wrong handle cast in product.cpp; now plan_struct->handle = handle; |
| 132d365 | switch(error) -> switch(tappVal) in error.cpp |
| 349749d | Workspace std::max(..., 128 MiB) minimum now applied |
| 5e60aef | tapp-reference -> tapp::reference in examples/CMakeLists.txt |
| 7335ccd | 9*sizeof(float) -> 9*sizeof(std::complex<float>) in cutensor_demo.cpp::conjugate() |
| 9b6952e | Massive duplication: test_dynamic.{cpp,h} deleted (-5352 lines); single file gated by TAPP_DYNAMIC_LAUNCH |
| b06a4aa | Same for demo_dynamic.c (-1382 lines) |

The duplication cleanup is the big one — ~6,700 lines of copy-pasted code collapsed. The result is still noisy (30+ #ifdef TAPP_DYNAMIC_LAUNCH blocks in test.h), but it's correct and maintainable now.

⚠️ Previous findings still unaddressed

  • pack_error(int, cudaError_t) overload in error.cpp:847-854 still clears both TAPP and cuTENSOR fields (val & ~LOW_FIELDS_MASK) — asymmetric with the other two overloads. A cudaError after a packed TAPP/cuTENSOR error silently loses the earlier context.
  • GPU/host memory leaks on every early-return error path in TAPP_create_tensor_product and TAPP_execute_product: contraction_desc, permutation_desc, plan_pref, plans, and all cudaMallocAsync buffers (A_d, B_d, C_d, D_d, E_d, contraction_work) leak on failure. Not addressed.
  • cmake_minimum_required(VERSION 3.17) still appears mid-file in top CMakeLists.txt.
  • cutensor_bindings/CMakeLists.txt still calls target_link_libraries(cutensor::cutensor INTERFACE CUDA::cudart) on the IMPORTED target — interface pollution; should be moved to tapp-cutensor.
  • test/test.h:20-21 still hardcodes ./reference_implementation/libtapp-reference.so and ./cutensor_bindings/libtapp-cutensor.so. Will break on macOS (.dylib) and Windows (.dll). Pass via argv or configure_file with $<TARGET_FILE:...>.
  • Missing newlines at EOF, unreachable break after return, VLAs in C++, magic return 15; — all still present.
  • negative_str test still commented out in demo.c even though demo.c links tapp::reference, not cuTENSOR.

🆕 New findings

  1. Major — no stream synchronization anywhere. grep cudaStreamSynchronize cutensor_bindings/ returns nothing. TAPP_execute_product issues cudaMemcpyAsync(... DeviceToHost ...) for D (product.cpp:298) and cudaFreeAsync for the scratch buffers, then returns. Caller must sync the executor's stream themselves before reading D. If that's the intended contract it needs to be documented; otherwise cudaMemcpy (sync) or an explicit cudaStreamSynchronize at the function tail is needed. Tests that pass today are likely passing by luck (small inputs finishing before host read).

  2. Major — TAPP_attr_get writes into the wrong indirection level (attributes.cpp:23):

    TAPP_error TAPP_attr_get(TAPP_attr attr, TAPP_key key, void** value)
    {
        ...
        memcpy(value, (void*)handle_struct->attributes[0], sizeof(bool));

    The parameter is void**, but the code writes 1 byte into the pointer slot itself, not into *value. Either the signature should be void* (matching TAPP_attr_set) or the body should memcpy(*value, ...).

  3. Major — return values ignored on cuTENSOR cleanup/estimate paths. cutensorEstimateWorkspaceSize (product.cpp near 234) and cutensorDestroyPlanPreference results are not checked; a failure feeds garbage into cutensorCreatePlan or silently leaks.

  4. Major — plan and handle not thread-safe. A single product_plan carries cutensorPlan_t + cutensorHandle_t references and is reused across TAPP_execute_product calls. cuTENSOR plans are not safe for concurrent use; no locking. Worth documenting at minimum.

  5. Minor — assertions in production paths. assert(uintptr_t(contraction_work) % 128 == 0) (product.cpp:240) and similar in the host-memory path will be compiled out under NDEBUG. If the alignment matters, use a real check.
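For the assertion item, a minimal sketch of a check that survives NDEBUG builds. The 128-byte requirement is taken from the assert quoted above; the error code and function names are hypothetical.

```cpp
#include <cstdint>

// Hypothetical error code for a workspace pointer that is not aligned.
constexpr int kMisalignedWorkspace = 1;

// True when p is a multiple of `alignment` bytes.
inline bool is_aligned(const void* p, std::uintptr_t alignment) {
    return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}

// Unlike assert(), this check is still compiled in when NDEBUG is defined.
inline int check_workspace(const void* work) {
    if (!is_aligned(work, 128)) return kMisalignedWorkspace;
    return 0;
}
```

The caller can fold the returned value into the normal error path instead of aborting.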

Verdict

Real progress — the latest 7 commits resolve most of the cited bugs (handle cast, error switch, memcpy size, link name, and the massive duplication). The remaining items aren't deal-breakers individually, but the GPU memory-leak-on-error-path and the missing stream synchronization are correctness issues that should be addressed before merge. Hardcoded .so paths still block portability to non-Linux runners.

Suggested before merging:

  1. Add cudaStreamSynchronize(*(cudaStream_t*)exec); at the tail of TAPP_execute_product (or document async-result semantics).
  2. Wrap the failure paths in create_tensor_product / execute_product with a goto-cleanup or RAII so descriptors/plans/device buffers are freed on error.
  3. Replace .so literals in test/test.h with values passed via CMake ($<TARGET_FILE:tapp-reference> etc.) — also makes the test usable on macOS/Windows.
  4. Normalize pack_error(cudaError_t) to preserve the other fields like the TAPP and cuTENSOR overloads do.
  5. Fix TAPP_attr_get signature/indirection.
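For the last item, the two possible fixes look roughly like this. It is a simplified sketch: attr_store stands in for the real handle struct, key 0 is the only attribute, and the return values are placeholders rather than TAPP error codes.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical, simplified store: slot 0 holds a bool* kept as intptr_t.
struct attr_store { std::intptr_t attributes[1]; };

// Option A: take void* like the setter and copy into the caller's storage.
inline int attr_get(const attr_store& s, int key, void* value) {
    if (key != 0 || value == nullptr) return 1;  // placeholder "invalid key"
    std::memcpy(value, reinterpret_cast<const void*>(s.attributes[0]), sizeof(bool));
    return 0;
}

// Option B: keep void**, but write through *value instead of into the
// pointer slot itself.
inline int attr_get_indirect(const attr_store& s, int key, void** value) {
    if (key != 0 || value == nullptr || *value == nullptr) return 1;
    std::memcpy(*value, reinterpret_cast<const void*>(s.attributes[0]), sizeof(bool));
    return 0;
}
```

Option A keeps the getter symmetric with TAPP_attr_set, which is probably the less surprising choice for callers.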
