
C++ wilcoxon #587

Open

Intron7 wants to merge 17 commits into main from c++-wilcoxon

Conversation

@Intron7
Member

@Intron7 Intron7 commented Feb 25, 2026

No description provided.

@codecov-commenter

codecov-commenter commented Feb 25, 2026

Codecov Report

❌ Patch coverage is 81.25000% with 39 lines in your changes missing coverage. Please review.
✅ Project coverage is 87.63%. Comparing base (046bdb3) to head (88bcdd5).

Files with missing lines Patch % Lines
...s_singlecell/tools/_rank_genes_groups/_wilcoxon.py 83.41% 33 Missing ⚠️
...pids_singlecell/tools/_rank_genes_groups/_utils.py 0.00% 4 Missing ⚠️
src/rapids_singlecell/_cuda/__init__.py 60.00% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #587      +/-   ##
==========================================
- Coverage   88.04%   87.63%   -0.41%     
==========================================
  Files          96       95       -1     
  Lines        7032     7071      +39     
==========================================
+ Hits         6191     6197       +6     
- Misses        841      874      +33     
Files with missing lines Coverage Δ
...ds_singlecell/tools/_rank_genes_groups/__init__.py 82.45% <ø> (ø)
...apids_singlecell/tools/_rank_genes_groups/_core.py 72.76% <ø> (-18.73%) ⬇️
src/rapids_singlecell/_cuda/__init__.py 66.66% <60.00%> (-3.34%) ⬇️
...pids_singlecell/tools/_rank_genes_groups/_utils.py 67.16% <0.00%> (-18.76%) ⬇️
...s_singlecell/tools/_rank_genes_groups/_wilcoxon.py 85.26% <83.41%> (-12.36%) ⬇️

... and 1 file with indirect coverage changes

@Intron7 Intron7 marked this pull request as draft February 26, 2026 10:31
@Intron7 Intron7 closed this Apr 9, 2026
@Intron7 Intron7 reopened this Apr 9, 2026
@coderabbitai

coderabbitai bot commented Apr 14, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@coderabbitai

coderabbitai bot commented Apr 14, 2026

📝 Walkthrough

Summary by CodeRabbit

  • Dependencies

    • Optional RAPIDS CUDA extras now include librmm for cu12/cu13 to support new GPU build/runtime paths.
  • New Features

    • Reworked GPU Wilcoxon with streaming OVO/OVR pipelines, host/device streaming, tiered dispatch, fused kernels, and optional tie-correction/statistics.
  • Refactor

    • Replaced the previous GPU backend with a streaming/vectorized pipeline; chunk_size removed from the Wilcoxon API (still used for wilcoxon_binned).
  • Tests

    • Low-level kernel tests removed; expanded high-level behavioral tests across formats, options, and edge cases.
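For context, the "tie-correction" the fused kernels optionally compute is the standard Wilcoxon/Mann–Whitney adjustment for tied values. A minimal host-side sketch (purely illustrative, not the kernel code):

```python
import numpy as np

def tie_correction(values: np.ndarray) -> float:
    """Standard Wilcoxon tie correction: 1 - sum(t^3 - t) / (n^3 - n),
    where t ranges over the sizes of tied groups."""
    n = values.size
    _, counts = np.unique(values, return_counts=True)
    counts = counts.astype(np.float64)
    return 1.0 - (counts**3 - counts).sum() / (n**3 - n)

# No ties: the factor is exactly 1.0; ties shrink it toward 0.
assert tie_correction(np.array([1.0, 2.0, 3.0, 4.0])) == 1.0
assert abs(tie_correction(np.array([1.0, 1.0, 2.0, 3.0])) - 0.9) < 1e-12
```

This matches SciPy's `scipy.stats.tiecorrect`; the GPU kernels compute the same quantity in parallel over sorted segments.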

Walkthrough

Replaces the legacy Wilcoxon CUDA backend with two new streaming nanobind modules (OVR and OVO), adds fused rank-sum/tie-correction kernels and shared CUDA utilities, updates build/packaging to locate/preload RMM, refactors Python Wilcoxon pipeline to call streaming kernels, and updates tests accordingly.

Changes

Cohort / File(s): Summary

  • Build & CI (.github/workflows/publish.yml, pyproject.toml, CMakeLists.txt): CI wheel build now locates/preloads librmm and rapids_logger, symlinks lib64 into /usr/local/lib, adjusts the LD/auditwheel steps, adds librmm-cu12/cu13>=25.10 optional dependencies, and extends the CMake search hints to find rmm before find_package(rmm CONFIG REQUIRED).
  • CUDA module exports & init (src/rapids_singlecell/_cuda/__init__.py): Pre-loads librmm on import (swallowing ImportError/OSError) and replaces the _wilcoxon_cuda export with _wilcoxon_ovo_cuda and _wilcoxon_ovr_cuda in __all__.
  • Nanobind host types (src/rapids_singlecell/_cuda/nb_types.h): Adds host_array<T> and host_array_2d<T> nanobind (NumPy-device) type aliases (1D and 2D).
  • Common CUDA utilities (src/rapids_singlecell/_cuda/wilcoxon/wilcoxon_common.cuh): New common header with CUDA exec constants, an RmmPool RAII allocator, warp/round helpers, an upload helper, and a templated CSR→dense extraction kernel.
  • Fused/generic kernels, OVR (src/rapids_singlecell/_cuda/wilcoxon/kernels_wilcoxon.cuh, src/rapids_singlecell/_cuda/wilcoxon/wilcoxon_ovr.cu): Replaces the two-pass tie/average-rank kernels with a fused rank_sums_from_sorted_kernel and a sparse-aware rank_sums_sparse_ovr_kernel; adds the wilcoxon_ovr nanobind module implementing streaming OVR pipelines (dense/CSR/CSC), sub-batching, CUB segmented sort, multi-stream RMM-backed buffers, and host-streaming paths.
  • New OVO kernels & module (src/rapids_singlecell/_cuda/wilcoxon/kernels_wilcoxon_ovo.cuh, src/rapids_singlecell/_cuda/wilcoxon/wilcoxon_ovo.cu): Adds OVO-specific device helpers (block reductions, parallel tie correction), presorted and fused shared-memory kernels, batched presorted ranking, and the wilcoxon_ovo nanobind module with device and host streaming entrypoints.
  • Removed legacy CUDA file (src/rapids_singlecell/_cuda/wilcoxon/wilcoxon.cu): Deletes the old _wilcoxon_cuda implementation and its nanobind bindings (tie_correction, average_rank).
  • Python rank-genes refactor (src/rapids_singlecell/tools/_rank_genes_groups/_wilcoxon.py, _utils.py, _core.py, __init__.py): Refactors the Python Wilcoxon paths to use the streaming kernels (OVR/OVO), adds GPU data-prep and segmented-sort helpers, removes chunk_size for wilcoxon, introduces STREAMING_SUB_BATCH = 64, and stops forcing float64 casts in sparse→dense conversions.
  • Tests (tests/test_rank_genes_groups_wilcoxon.py, tests/test_rank_genes_groups_wilcoxon_binned.py): Removes kernel-level unit tests tied to deleted entrypoints; adds/expands higher-level behavioral tests across SciPy/CuPy/CSR/CSC formats, continuity/tie-correction coverage, mask/key handling, and new wilcoxon_binned tests.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

  • update rank_genes_groups #570: Refactors Wilcoxon rank_genes_groups implementation and group handling (group_codes/group_sizes), directly overlapping Python-side Wilcoxon changes.
  • Check for kernel errors #619: Prior Wilcoxon CUDA edits and CUDA utilities that overlap with the new CUDA backend additions and header changes.
🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Description check (⚠️ Warning): No pull request description was provided by the author, making it impossible to evaluate relatedness to the changeset. A description is needed to understand intent and context. Resolution: add a clear pull request description explaining the purpose, scope, and key changes of this C++ Wilcoxon refactoring.
✅ Passed checks (2 passed)
  • Title check (✅ Passed): The title "C++ wilcoxon" is concise and directly refers to a major component of the changeset, but lacks specificity about what aspect or change is being introduced.
  • Docstring Coverage (✅ Passed): Docstring coverage is 91.94%, above the required 80.00% threshold.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.




@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 10

🧹 Nitpick comments (4)
src/rapids_singlecell/tools/_rank_genes_groups/_wilcoxon.py (2)

44-56: Inconsistent int32 casting for indptr arrays.

indices is cast to int32 but indptr is not. The CUDA kernels expect int32 for all sparse array components. If the input is large enough that SciPy promotes the index arrays to int64 (possible when nnz exceeds 2^31, though rare), this could cause dtype mismatches at the kernel boundary.

For consistency and safety:

Proposed fix
         return cpsp.csc_matrix(
             (
                 cp.asarray(csc.data),
                 cp.asarray(csc.indices.astype(np.int32, copy=False)),
-                cp.asarray(csc.indptr),
+                cp.asarray(csc.indptr.astype(np.int32, copy=False)),
             ),
             shape=(n_rows, n_cols),
         )
     csr = X.tocsr() if X.format != "csr" else X
     return cpsp.csr_matrix(
         (
             cp.asarray(csr.data),
             cp.asarray(csr.indices.astype(np.int32, copy=False)),
-            cp.asarray(csr.indptr),
+            cp.asarray(csr.indptr.astype(np.int32, copy=False)),
         ),
         shape=(n_rows, n_cols),
     )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/rapids_singlecell/tools/_rank_genes_groups/_wilcoxon.py` around lines 44
- 56, The indptr arrays are not being cast to int32 while indices are, causing a
potential dtype mismatch for CUDA kernels; update both sparse-construction sites
(the csc -> cpsp.csc_matrix block and the csr -> cpsp.csr_matrix block around
the variables csr, csc, indices, and indptr) to cast indptr to np.int32 before
moving to the GPU (i.e., use indptr.astype(np.int32, copy=False) wrapped with
cp.asarray just like indices) so all sparse-array components use int32
consistently for the kernels.

39-41: Redundant format check and ternary.

Line 40 re-checks sp.issparse(X) even though the outer isinstance already guarantees a sparse input, and line 41's ternary always takes its first branch because the enclosing if has just established X.format == "csc".

Simplified version
     if isinstance(X, sp.spmatrix | sp.sparray):
-        if sp.issparse(X) and X.format == "csc":
-            csc = X if X.format == "csc" else X.tocsc()
+        if X.format == "csc":
+            csc = X
             return cpsp.csc_matrix(
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/rapids_singlecell/tools/_rank_genes_groups/_wilcoxon.py` around lines 39
- 41, The nested check and ternary around X.format are redundant; in the block
handling sparse inputs (check using sp.issparse and the existing isinstance
check for sp.spmatrix|sp.sparray), simplify the logic in _wilcoxon.py by
removing the repeated "X.format == 'csc'" test and assign csc using a single
conversion path (e.g., in the sp.issparse branch set csc = X.tocsc() which is
safe whether X is already csc or not, and for non-sparse fall back to
constructing or converting to a csc matrix), updating the code around the
variable X and the csc assignment accordingly.
src/rapids_singlecell/_cuda/wilcoxon/wilcoxon_ovo.cu (1)

706-712: Host memory registration is correct but consider error handling.

The cudaHostRegister/cudaHostUnregister pattern is correct for async transfers. The const_cast is necessary since the CUDA API lacks a const variant. Consider wrapping in RAII or checking return values for robustness in production.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/rapids_singlecell/_cuda/wilcoxon/wilcoxon_ovo.cu` around lines 706 - 712,
The cudaHostRegister calls for h_data, h_indices, h_rank_sums and h_tie_corr
currently ignore CUDA errors and rely on manual cudaHostUnregister elsewhere;
update this to add error handling and/or an RAII wrapper: check and handle the
return value of cudaHostRegister for each buffer (e.g., propagate or log the
cudaError_t from cudaHostRegister) and ensure paired cudaHostUnregister is
called in destructor/finalizer (create a small HostRegisterGuard class that
registers in its constructor and unregisters in its destructor and replace the
direct cudaHostRegister calls with instances of that guard for h_data,
h_indices, h_rank_sums and h_tie_corr).
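The suggested HostRegisterGuard is a C++ RAII pattern, but the paired register/unregister discipline it enforces is language-agnostic. A minimal Python sketch with stand-in callables (the register/unregister functions here are placeholders, not the CUDA runtime API):

```python
from contextlib import contextmanager

@contextmanager
def host_register(register, unregister, buf):
    """Mirror of an RAII guard: register on entry, always unregister on exit,
    and surface a nonzero error code from registration instead of ignoring it."""
    err = register(buf)
    if err != 0:
        raise RuntimeError(f"hostRegister failed with code {err}")
    try:
        yield buf
    finally:
        unregister(buf)  # runs even if the body raises

calls = []
with host_register(lambda b: calls.append(("reg", b)) or 0,
                   lambda b: calls.append(("unreg", b)), "h_data"):
    pass
assert calls == [("reg", "h_data"), ("unreg", "h_data")]
```

The C++ version would do the same in a constructor/destructor pair so every `cudaHostRegister` is matched by `cudaHostUnregister` on all exit paths.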
src/rapids_singlecell/_cuda/wilcoxon/kernels_wilcoxon.cuh (1)

125-126: Clarify warp buffer reuse after shared memory arrays.

warp_buf = smem + n_groups overlaps with s_sum when compute_stats is true. While this works because s_sum is no longer needed after line 119, a comment explaining this intentional reuse would improve maintainability.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/rapids_singlecell/_cuda/wilcoxon/kernels_wilcoxon.cuh` around lines 125 -
126, The assignment warp_buf = smem + n_groups inside the compute_tie_corr
branch reuses the same shared-memory region previously used for s_sum when
compute_stats was true; add an inline comment next to that assignment explaining
that this overlap is intentional because s_sum is no longer needed by that point
(reference symbols: warp_buf, smem, n_groups, s_sum, compute_stats,
compute_tie_corr) so future readers won't treat it as a bug and will understand
the shared-memory reuse constraints.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.github/workflows/publish.yml:
- Line 136: CIBW_REPAIR_WHEEL_COMMAND currently runs only the auditwheel repair
step and omits the abi3audit check; update the environment variable value to run
abi3audit after auditwheel by appending the abi3audit invocation (e.g., && pipx
run abi3audit --strict --report {wheel}) so the full sequence includes both
auditwheel repair (keeping the existing --exclude flags and {dest_dir} {wheel}
placeholders) and the pipx run abi3audit --strict --report {wheel} step to
validate Limited API wheels.
- Around line 125-133: Current workflow derives .so paths from SITE=$(python -c
"import sysconfig;print(sysconfig.get_path('purelib'))") and hard-codes lib64,
which is brittle; instead import the packages and read their actual package dirs
(use librmm.__path__[0] and rapids_logger.__path__[0]) to locate .so files
dynamically and create symlinks from those resolved locations (update the steps
that write /tmp/.librmm_dir accordingly); also stop overriding
CIBW_REPAIR_WHEEL_COMMAND with only auditwheel or restore the abi3audit step so
the abi3audit validation defined in pyproject.toml is preserved (keep both
abi3audit and auditwheel in the repair command or remove the env override so
pyproject.toml controls it).

In `@CMakeLists.txt`:
- Around line 23-25: The explicit override appends ENV{LIBRMM_DIR}/.. which
moves one level up and breaks the environment-root based discovery; update the
CMakeLists logic that appends to _env_roots so it pushes "$ENV{LIBRMM_DIR}" (not
"$ENV{LIBRMM_DIR}/..") so the subsequent glob checks that look for
${_root}/lib/cmake/rmm and ${_root}/lib/python*/site-packages/librmm/... will
find the correct paths.

In `@src/rapids_singlecell/_cuda/nb_types.h`:
- Around line 45-50: The host_array_2d alias currently uses nb::ndarray<T,
nb::numpy> which allows any rank; change it to enforce 2D by specifying
nb::ndim<2> so only rank-2 NumPy arrays bind. Update the alias host_array_2d to
use nb::ndarray<T, nb::numpy, nb::ndim<2>> (keeping host_array unchanged) so
mismatched ranks fail at binding rather than later in matrix code.

In `@src/rapids_singlecell/_cuda/wilcoxon/kernels_wilcoxon.cuh`:
- Around line 29-33: The kernel currently allocates shared memory arrays (smem
-> grp_sums, s_sum, s_sq, s_nnz) sized by n_groups which can overflow GPU shared
memory for large n_groups; change the launch/init logic so you detect the
required shared_bytes = (4 * n_groups + 32) * sizeof(double) and if it exceeds
the device maxSharedMemPerBlock (or a safe threshold) switch to a fallback path:
either use global-memory accumulators with atomicAdd on
grp_sums/s_sum/s_sq/s_nnz, or process groups in tiles that fit into shared
memory (loop over group tiles, load tile into smem, reduce, and write back) —
implement the detection and choose between the existing shared-memory code path
and the fallback (global atomics or tiled processing) based on that check,
referencing smem, grp_sums, s_sum, s_sq, s_nnz and n_groups to locate the code
to change.

In `@src/rapids_singlecell/_cuda/wilcoxon/wilcoxon_ovo.cu`:
- Around line 149-166: Add post-launch error checks after each CUDA kernel
invocation (e.g., after build_seg_offsets_kernel<<<...>>> and
batched_rank_sums_presorted_kernel<<<...>>>): call cudaGetLastError() and
handle/report non-success results (use the existing error-handling macro or
CUDA_CHECK_LAST_ERROR pattern used elsewhere in the project) so
configuration/launch failures are detected; apply the same fix for other kernel
launches in this file that currently lack a cudaGetLastError() check.

In `@src/rapids_singlecell/_cuda/wilcoxon/wilcoxon_ovr.cu`:
- Around line 137-152: The kernel/CUB calls in this block lack post-launch error
checks; after the launch of fill_row_indices_kernel, after the
cub::DeviceSegmentedRadixSort::SortPairs call, and after the launch of
rank_sums_from_sorted_kernel add CUDA error checks by calling cudaGetLastError()
(and optionally cudaDeviceSynchronize() or checking the returned cudaError_t)
and log/return on error using the existing logging/return pattern (refer to
fill_row_indices_kernel, cub::DeviceSegmentedRadixSort::SortPairs,
rank_sums_from_sorted_kernel and the buf/stream variables to locate the calls);
ensure each check captures and reports the cuda error string so failures are not
silent.
- Around line 122-123: The shared memory calc for rank_sums_from_sorted_kernel
(smem_rank = (4 * n_groups + 32) * sizeof(double)) can exceed
device.sharedMemPerBlock for large n_groups; add a runtime check using
cudaGetDeviceProperties to compare smem_rank to prop.sharedMemPerBlock and if it
exceeds the limit either choose the fallback path used elsewhere (e.g., the
alternative kernel/path in kernels_wilcoxon.cuh) or return/throw an error so you
do not launch rank_sums_from_sorted_kernel with too-large shared memory; update
the code around tpb_rank, smem_rank and the kernel launch to perform this guard
and select the safe code path.

In `@src/rapids_singlecell/tools/_rank_genes_groups/_wilcoxon.py`:
- Around line 417-431: The loop building test_group_indices/all_grp_rows can
produce an empty all_grp_rows causing np.concatenate(all_grp_rows) to raise; add
a guard after the loop that checks if not all_grp_rows (or n_test == 0) and
return an appropriate empty result (e.g., [] or the function's expected empty
structure) to skip downstream GPU ops; update the code near variables
test_group_indices, all_grp_rows, all_grp_row_ids_np, grp_offsets_gpu to
early-return when there are no test groups to compare.
- Around line 329-342: The sparse CSR/CSC components passed into the CUDA
kernels (_ovr.ovr_streaming_csr and its CSC counterpart) should be forced to
C-contiguous arrays to match kernel expectations; wrap X_gpu.data,
X_gpu.indices, and X_gpu.indptr (and the CSC equivalents) with
cp.ascontiguousarray after casting (e.g., X_gpu.data.astype(..., copy=False))
before passing them into the kernel calls so the arrays are guaranteed
contiguous and safe for GPU kernels.
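The empty-batch guard suggested above for _wilcoxon.py is cheap to sketch on the host; `np.concatenate` raises ValueError on an empty list, so the guard must come first (the helper name is hypothetical):

```python
import numpy as np

def concat_or_empty(chunks: list, dtype=np.int64) -> np.ndarray:
    """Concatenate row-index chunks, returning an empty array when there are
    no test groups instead of letting np.concatenate raise ValueError."""
    if not chunks:
        return np.empty(0, dtype=dtype)
    return np.concatenate(chunks)

assert concat_or_empty([]).size == 0
assert concat_or_empty([np.array([1, 2]), np.array([3])]).tolist() == [1, 2, 3]
```

In the actual function, hitting the empty case would also mean skipping the downstream GPU calls and returning the function's expected empty result structure.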


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: eba4718b-34f0-4ad7-aa18-dbeb80fa18a2

📥 Commits

Reviewing files that changed from the base of the PR and between 046bdb3 and 1b926c0.

📒 Files selected for processing (14)
  • .github/workflows/publish.yml
  • CMakeLists.txt
  • pyproject.toml
  • src/rapids_singlecell/_cuda/__init__.py
  • src/rapids_singlecell/_cuda/nb_types.h
  • src/rapids_singlecell/_cuda/wilcoxon/kernels_wilcoxon.cuh
  • src/rapids_singlecell/_cuda/wilcoxon/kernels_wilcoxon_ovo.cuh
  • src/rapids_singlecell/_cuda/wilcoxon/wilcoxon.cu
  • src/rapids_singlecell/_cuda/wilcoxon/wilcoxon_common.cuh
  • src/rapids_singlecell/_cuda/wilcoxon/wilcoxon_ovo.cu
  • src/rapids_singlecell/_cuda/wilcoxon/wilcoxon_ovr.cu
  • src/rapids_singlecell/tools/_rank_genes_groups/_utils.py
  • src/rapids_singlecell/tools/_rank_genes_groups/_wilcoxon.py
  • tests/test_rank_genes_groups_wilcoxon.py
💤 Files with no reviewable changes (1)
  • src/rapids_singlecell/_cuda/wilcoxon/wilcoxon.cu

Comment thread .github/workflows/publish.yml Outdated
Comment on lines +125 to +133
SITE=$(python -c "import sysconfig;print(sysconfig.get_path('purelib'))") &&
echo "[rsc-build] site-packages=$SITE" &&
echo "[rsc-build] librmm=$(ls $SITE/librmm/lib64/*.so 2>/dev/null)" &&
echo "[rsc-build] rapids_logger=$(ls $SITE/rapids_logger/lib64/*.so 2>/dev/null)" &&
ln -sf "$SITE/librmm/lib64/librmm.so" /usr/local/lib/librmm.so &&
ln -sf "$SITE/rapids_logger/lib64/librapids_logger.so" /usr/local/lib/librapids_logger.so &&
ldconfig &&
python -c "import librmm;print(librmm.__path__[0])" > /tmp/.librmm_dir &&
echo "[rsc-build] marker=$(cat /tmp/.librmm_dir)"


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Scripts executed (repository: scverse/rapids-singlecell):

    git ls-files | grep -E "publish\.yml"
    cat -n .github/workflows/publish.yml | sed -n '120,140p'
    head -20 .github/workflows/publish.yml 2>/dev/null || echo "File not found"
    pwd && find . -maxdepth 3 -type d -name ".github" 2>/dev/null
    git ls-files --others --exclude-standard | head -20
    git ls-files | head -50
    git ls-files | grep -i publish
    find . -type f \( -name "*.yml" -o -name "*.yaml" \) | head -20
    cat pyproject.toml | grep -A 20 -B 5 "abi3\|cibuildwheel\|CIBW_REPAIR"
    cat pyproject.toml 2>/dev/null | head -200
    cat pyproject.toml | grep -A 30 "\[tool.cibuildwheel\]"
    rg "abi3\|CIBW_REPAIR" pyproject.toml
    python3 << 'EOF'
    import tomllib
    with open('pyproject.toml', 'rb') as f:
        data = tomllib.load(f)
        if 'tool' in data and 'cibuildwheel' in data['tool']:
            import json
            print(json.dumps(data['tool']['cibuildwheel'], indent=2))
        else:
            print("No [tool.cibuildwheel] section found")
    EOF
    grep -n "cibuildwheel\|abi3" pyproject.toml
    cat pyproject.toml
    cat pyproject.toml | tail -200
    ast-grep --pattern $'[tool.cibuildwheel]'

Resolve the binary package roots from the modules, not purelib/lib64, and restore abi3audit validation.

Lines 129-130 derive symlink paths from sysconfig.get_path('purelib') with hard-coded lib64 subdirectories. This is brittle: binary packages should use platlib (platform-specific), not purelib, and lib64 is not guaranteed—some systems use lib. When layout assumptions fail, symlinks silently break and /tmp/.librmm_dir points to the wrong location.

Additionally, line 136 overrides CIBW_REPAIR_WHEEL_COMMAND with only the auditwheel step, removing the abi3audit validation defined in pyproject.toml. This is a regression that bypasses quality checks for Limited API wheels.

Fix by: (1) importing the modules first, then dynamically locating their .so files from their actual package paths via librmm.__path__[0] and rapids_logger.__path__[0] instead of assuming a fixed layout; (2) preserving the abi3audit step in the repair command or removing the environment variable override to use the pyproject.toml configuration.

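A hedged sketch of the dynamic lookup the comment asks for, resolving a binary package's own directory via `__path__` instead of assuming purelib plus a lib64 subdirectory (the `librmm` usage is illustrative; the runnable demo below uses a stdlib package so the snippet is self-contained):

```python
import importlib
from pathlib import Path

def package_libs(module_name: str, pattern: str = "*.so*") -> list[Path]:
    """Locate files inside an installed package's actual directory, wherever
    the wheel put it (lib/ vs lib64/, purelib vs platlib no longer matter)."""
    mod = importlib.import_module(module_name)
    root = Path(mod.__path__[0])
    return sorted(root.rglob(pattern))

# In the workflow this would be e.g. package_libs("librmm") to find librmm.so;
# demonstrated here with a stdlib package:
assert len(package_libs("email", "*.py")) > 0
```

The symlink step would then use the first match rather than a hard-coded `$SITE/librmm/lib64/librmm.so` path.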

Comment thread .github/workflows/publish.yml Outdated
Comment thread CMakeLists.txt Outdated
Comment thread src/rapids_singlecell/_cuda/nb_types.h Outdated
Comment thread src/rapids_singlecell/_cuda/wilcoxon/kernels_wilcoxon.cuh Outdated
Comment thread src/rapids_singlecell/_cuda/wilcoxon/wilcoxon_ovo.cu
Comment thread src/rapids_singlecell/_cuda/wilcoxon/wilcoxon_ovr.cu Outdated
Comment on lines +137 to +152
fill_row_indices_kernel<<<sb_cols, 256, 0, stream>>>(buf.vals_in,
n_rows, sb_cols);

// Sort: keys = block columns [col, col+sb_cols), already F-order
const float* keys_in = block + (long long)col * n_rows;
size_t temp = cub_temp_bytes;
cub::DeviceSegmentedRadixSort::SortPairs(
buf.cub_temp, temp, keys_in, buf.keys_out, buf.vals_in,
buf.vals_out, sb_items, sb_cols, buf.seg_offsets,
buf.seg_offsets + 1, BEGIN_BIT, END_BIT, stream);

// Fused rank sums into sub-batch buffer
rank_sums_from_sorted_kernel<<<sb_cols, tpb_rank, smem_rank, stream>>>(
buf.keys_out, buf.vals_out, group_codes, buf.sub_rank_sums,
buf.sub_tie_corr, nullptr, nullptr, nullptr, n_rows, sb_cols,
n_groups, compute_tie_corr, false);


⚠️ Potential issue | 🟠 Major

Missing cudaGetLastError() after kernel launches.

Same issue as in wilcoxon_ovo.cu: kernel launches at lines 137, 143-146 (CUB), and 149 lack error checking. Silent failures could corrupt results or cause hard-to-debug issues.


Comment on lines +329 to +342
if cpsp.isspmatrix_csr(X_gpu):
_ovr.ovr_streaming_csr(
X_gpu.data.astype(cp.float32, copy=False),
X_gpu.indices.astype(cp.int32, copy=False),
X_gpu.indptr.astype(cp.int32, copy=False),
group_codes_gpu,
rank_sums,
tie_corr,
n_rows=n_cells,
n_cols=n_total_genes,
n_groups=n_groups,
compute_tie_corr=tie_correct,
sub_batch_cols=STREAMING_SUB_BATCH,
)


⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check if CuPy sparse matrix components are guaranteed to be contiguous
rg -n "\.data\b|\.indices\b|\.indptr\b" --type py -C 2 src/rapids_singlecell/ | head -50

Repository: scverse/rapids-singlecell

Length of output: 3462


🏁 Script executed:

# Read the full context of _wilcoxon.py around lines 329-342
cat -n src/rapids_singlecell/tools/_rank_genes_groups/_wilcoxon.py | sed -n '320,350p'

Repository: scverse/rapids-singlecell

Length of output: 1511


🏁 Script executed:

# Search for kernel definition and any contiguity-related patterns
rg -n "ovr_streaming_csr|gpu_array_c|ascontiguousarray" --type py src/rapids_singlecell/ | head -30

Repository: scverse/rapids-singlecell

Length of output: 1262


🏁 Script executed:

# Check CuPy sparse matrix properties and contiguity guarantees
# Look for any documentation or comments about sparse array handling
rg -n "sparse|CSR|contiguous" --type py src/rapids_singlecell/tools/_rank_genes_groups/_wilcoxon.py | head -40



🏁 Script executed:

# Search for the _ovr module/kernel implementation
find src -name "*ovr*" -o -name "*wilcoxon*" | head -20



🏁 Script executed:

# Look for nanobind/CUDA kernel definitions related to ovr_streaming
rg -n "ovr_streaming|def ovr" --type py --type cpp --type cu src/rapids_singlecell/ | head -30



🏁 Script executed:

# Check if there's documentation about kernel array requirements
rg -n "contiguous|C-order|F-order" --type cpp --type cu src/rapids_singlecell/ | head -30



🏁 Script executed:

# Read the wilcoxon_ovr.cu file to see the kernel signature and requirements
cat src/rapids_singlecell/_cuda/wilcoxon/wilcoxon_ovr.cu | head -150



🏁 Script executed:

# Check the header files for any contiguity or memory layout documentation
cat src/rapids_singlecell/_cuda/wilcoxon/wilcoxon_common.cuh | head -100



🏁 Script executed:

# Search for nanobind binding definitions for ovr_streaming_csr
rg -n "PYBIND11_MODULE|def_.*ovr_streaming|ovr_streaming_csr" --type py --type cpp src/rapids_singlecell/ | head -40



🏁 Script executed:

# Look for any .pyi stub files or type hints that show kernel signatures
find src -name "*.pyi" -o -name "*binding*" -o -name "*pybind*" | head -20



🏁 Script executed:

# Check if CuPy sparse matrix .data, .indices, .indptr are guaranteed contiguous
# by searching for documentation or tests
rg -n "\.data.*contiguous|\.indices.*contiguous|\.indptr.*contiguous" --type py src/



🏁 Script executed:

# Check CuPy documentation and source patterns - look for how sparse matrices are created
rg -n "csr_matrix|csc_matrix|issparse" --type py src/rapids_singlecell/tools/_rank_genes_groups/_wilcoxon.py -B 3 -A 3 | head -60



🏁 Script executed:

# Look at the _move_to_gpu function which creates/transfers sparse matrices
cat -n src/rapids_singlecell/tools/_rank_genes_groups/_wilcoxon.py | sed -n '25,50p'



🏁 Script executed:

# Check actual usage patterns in similar code - do other parts of codebase use astype on sparse components?
rg -n "\.data\.astype|\.indices\.astype|\.indptr\.astype" --type py src/rapids_singlecell/ -B 2 -A 2



🏁 Script executed:

# Verify what dtype X_gpu.data actually is to confirm if astype conversion happens
rg -n "X_gpu.*dtype|\.data.*dtype|\.indices.*dtype" --type py src/rapids_singlecell/tools/_rank_genes_groups/_wilcoxon.py | head -20



🏁 Script executed:

# Check if there are any reported issues or TODOs about contiguity
rg -n "TODO|FIXME|BUG|contiguous|contiguity" --type py src/rapids_singlecell/tools/_rank_genes_groups/_wilcoxon.py



Ensure C-contiguity of sparse matrix components passed to CUDA kernels.

While CuPy sparse array components (.data, .indices, .indptr) are typically C-contiguous by default, explicitly wrapping them with cp.ascontiguousarray() provides defensive safety and consistency. Line 167 in this same file uses this pattern:

cp.ascontiguousarray(data.astype(cp.float32).ravel(order="F"))

Applying the same safeguard to lines 331–333 (and the CSC variant at 345–347) ensures arrays meet the kernel's contiguity expectations:

Suggested fix
_ovr.ovr_streaming_csr(
    cp.ascontiguousarray(X_gpu.data.astype(cp.float32, copy=False)),
    cp.ascontiguousarray(X_gpu.indices.astype(cp.int32, copy=False)),
    cp.ascontiguousarray(X_gpu.indptr.astype(cp.int32, copy=False)),
    ...
)
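To illustrate why the defensive wrapper is cheap, here is a small CPU sketch using NumPy as a stand-in for CuPy (both expose the same `ascontiguousarray` semantics; the array below is a hypothetical stand-in for `X_gpu.data`):

```python
import numpy as np

# Hypothetical stand-in for the buffer backing X_gpu.data.
data = np.arange(6, dtype=np.float32)

# Already C-contiguous: astype(copy=False) plus ascontiguousarray return the
# same buffer, so the defensive call adds no copy in the common case.
safe = np.ascontiguousarray(data.astype(np.float32, copy=False))
print(safe is data)  # True

# A strided view is not C-contiguous; here the wrapper does copy,
# which is exactly the safety net the review asks for.
strided = data[::2]
fixed = np.ascontiguousarray(strided)
print(strided.flags["C_CONTIGUOUS"], fixed.flags["C_CONTIGUOUS"])  # False True
```

Since the wrapper is a no-op for arrays that are already contiguous, applying it uniformly costs nothing on the hot path.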
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/rapids_singlecell/tools/_rank_genes_groups/_wilcoxon.py` around lines 329
- 342, The sparse CSR/CSC components passed into the CUDA kernels
(_ovr.ovr_streaming_csr and its CSC counterpart) should be forced to
C-contiguous arrays to match kernel expectations; wrap X_gpu.data,
X_gpu.indices, and X_gpu.indptr (and the CSC equivalents) with
cp.ascontiguousarray after casting (e.g., X_gpu.data.astype(..., copy=False))
before passing them into the kernel calls so the arrays are guaranteed
contiguous and safe for GPU kernels.

Comment thread src/rapids_singlecell/tools/_rank_genes_groups/_wilcoxon.py
@Intron7
Member Author

Intron7 commented Apr 14, 2026

@coderabbitai review

@coderabbitai

coderabbitai bot commented Apr 14, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

♻️ Duplicate comments (3)
src/rapids_singlecell/_cuda/wilcoxon/wilcoxon_ovr.cu (1)

160-180: ⚠️ Potential issue | 🟠 Major

Missing cudaGetLastError() after kernel launches.

Per coding guidelines, every kernel launch must be followed by cudaGetLastError() to detect launch failures immediately. The current code only checks errors at stream synchronization (lines 201-207), which delays error detection and makes debugging harder.

This pattern repeats throughout all *_impl functions: kernel launches at lines 80, 83-84, 160-161, 177-180, 289-290, 305-308, 408-409, 413-414, 429-432, 581-583, 587-588, 603-606, 725-726, 741-744.

Example fix pattern for one kernel launch
         rank_sums_from_sorted_kernel<<<sb_cols, tpb_rank, smem_rank, stream>>>(
             buf.keys_out, buf.vals_out, group_codes, buf.sub_rank_sums,
             buf.sub_tie_corr, nullptr, nullptr, nullptr, n_rows, sb_cols,
             n_groups, compute_tie_corr, false, use_gmem);
+        cudaError_t err = cudaGetLastError();
+        if (err != cudaSuccess)
+            throw std::runtime_error(
+                std::string("rank_sums_from_sorted_kernel launch failed: ") +
+                cudaGetErrorString(err));
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/rapids_singlecell/_cuda/wilcoxon/wilcoxon_ovr.cu` around lines 160 - 180,
After each device kernel launch and any CUB launch in wilcoxon_ovr.cu (e.g.,
after fill_row_indices_kernel<<<...>>>, after the
cub::DeviceSegmentedRadixSort::SortPairs(...) call, and after
rank_sums_from_sorted_kernel<<<...>>>), insert a call to cudaGetLastError() and
check its return (and log or propagate the error) immediately to detect launch
failures early; repeat this pattern for all similar kernel/CUB launches in the
file (the reviewer listed examples at the start of each *_impl: kernels around
symbols fill_row_indices_kernel, rank_sums_from_sorted_kernel and other kernels
noted in the comment) so each launch is followed by an immediate
cudaGetLastError() check rather than relying solely on stream synchronization.
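The control flow the review asks for — querying the sticky error state right after each launch instead of only at synchronization — can be sketched in plain Python. The `FakeRuntime` below is a toy stand-in for the CUDA runtime (not real CUDA); like `cudaGetLastError`, reading the error also clears it:

```python
class FakeRuntime:
    """Toy stand-in for the CUDA runtime's last-error state (not real CUDA)."""

    def __init__(self):
        self._last_error = None

    def launch(self, kernel_name, ok):
        # A bad launch configuration records an error but does not raise.
        if not ok:
            self._last_error = f"{kernel_name} launch failed"

    def get_last_error(self):
        # Like cudaGetLastError: returns and clears the recorded error.
        err, self._last_error = self._last_error, None
        return err


def checked_launch(rt, kernel_name, ok):
    rt.launch(kernel_name, ok)
    err = rt.get_last_error()  # check immediately, not at a later sync
    if err is not None:
        raise RuntimeError(err)


rt = FakeRuntime()
checked_launch(rt, "fill_row_indices_kernel", ok=True)  # passes silently
try:
    checked_launch(rt, "rank_sums_from_sorted_kernel", ok=False)
except RuntimeError as e:
    print(e)  # rank_sums_from_sorted_kernel launch failed
```

The payoff of the fail-fast check is that the error carries the name of the launch that actually failed, instead of surfacing later at a synchronization point far from the cause.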
.github/workflows/publish.yml (2)

136-136: ⚠️ Potential issue | 🟠 Major

Still bypassing abi3audit on Linux.

Line 136 overrides the Linux repair sequence from pyproject.toml with an auditwheel-only command, so the Limited API check never runs for the published wheels. Either append the same abi3audit step here or move the new exclude list into pyproject.toml and let that table stay the source of truth.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/publish.yml at line 136, The CI overrides the Linux repair
step by setting CIBW_REPAIR_WHEEL_COMMAND to an auditwheel-only command, which
bypasses the abi3 audit in pyproject.toml; restore the Limited API check by
either appending the abi3audit step to CIBW_REPAIR_WHEEL_COMMAND (run abi3audit
after auditwheel repair) or move the custom exclude list into the existing
repair/table in pyproject.toml so the original abi3audit invocation is
preserved; update the CIBW_REPAIR_WHEEL_COMMAND or pyproject.toml accordingly
and ensure the repair sequence still invokes abi3audit for Linux builds.

124-133: ⚠️ Potential issue | 🟠 Major

Still assuming lib64 in the RAPIDS wheels.

Line 129 and Line 130 still hard-code lib64, but CMakeLists.txt:18-75 already treats the installed layout as lib*. If either wheel lands under lib instead, these symlinks break before the extensions link. Resolve the actual .so from lib*/ rather than composing a lib64 path.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/publish.yml around lines 124 - 133, The symlink creation
assumes a lib64 layout and should instead resolve the actual .so under any lib*
directory; update the ln -sf steps that reference RMM_ROOT and LOG_ROOT so they
search for the shared objects (e.g., under "$RMM_ROOT"/lib*/librmm.so and
"$LOG_ROOT"/lib*/librapids_logger.so or by using a find/glob to locate the first
matching .so) and create symlinks to those resolved paths rather than
hard-coding lib64, keeping the same RMM_ROOT and LOG_ROOT variables and the two
ln -sf operations (for librmm and librapids_logger).
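A minimal sketch of the suggested glob-based resolution (Python here for portability; the temp directory is a stand-in for the installed wheel root, and the `librmm.so` name is taken from the workflow):

```python
import glob
import os
import tempfile

# Simulate a wheel that installs under lib/ rather than lib64/.
rmm_root = tempfile.mkdtemp()  # stands in for $RMM_ROOT
os.makedirs(os.path.join(rmm_root, "lib"))
open(os.path.join(rmm_root, "lib", "librmm.so"), "w").close()

# Resolve the actual .so under any lib*/ instead of composing a lib64 path.
matches = sorted(glob.glob(os.path.join(rmm_root, "lib*", "librmm.so")))
target = matches[0] if matches else None
print(os.path.basename(target))  # librmm.so
```

The same pattern works in the workflow's shell with `"$RMM_ROOT"/lib*/librmm.so`, so the symlink step keeps working whichever layout the wheel ships.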
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pyproject.toml`:
- Around line 201-208: The pyproject.toml currently declares librmm under
[tool.uv.extra-build-dependencies], which is ignored by PEP 517 build isolation;
move the librmm requirement into the PEP 517 build requirements by adding
"librmm-cu12>=25.10" to [build-system].requires in pyproject.toml so source
builds (pip install --no-binary) can find headers required by CMakeLists.txt for
the Wilcoxon CUDA extensions, or alternatively update the project installation
docs to explicitly state that librmm (librmm-cu12>=25.10) must be preinstalled
and show using RSC_BUILD_EXTENSIONS=OFF to skip CUDA builds when librmm is not
available.

In `@src/rapids_singlecell/_cuda/wilcoxon/kernels_wilcoxon.cuh`:
- Around line 154-156: The warp buffer offset is wrong when use_gmem is false:
in the compute_tie_corr branch the code sets double* warp_buf = use_gmem ? smem
: smem + n_groups but the shared-memory layout places grp_sums, s_sum, s_sq,
s_nnz (4 arrays of size n_groups) before the warp buffer; update the offset
calculation in the compute_tie_corr block to use smem + 4 * n_groups (keeping
the use_gmem ? smem logic) so warp_buf points after those four arrays (refer to
variables compute_tie_corr, warp_buf, use_gmem, smem, n_groups).

In `@src/rapids_singlecell/_cuda/wilcoxon/wilcoxon_ovr.cu`:
- Around line 540-546: The cudaHostRegister calls for h_data, h_indices,
h_rank_sums, and h_tie_corr are missing error checks; wrap each
cudaHostRegister(...) invocation with a check of the returned cudaError_t and
handle failures (e.g., log the error using cudaGetErrorString(err) and
return/throw an error code) so failures aren’t silent; update the block where
cudaHostRegister is called (references: cudaHostRegister, h_data, h_indices,
h_rank_sums, h_tie_corr) to verify the return value, clean up any previously
pinned memory on failure, and propagate an error up from the function.

---

Duplicate comments:
In @.github/workflows/publish.yml:
- Line 136: The CI overrides the Linux repair step by setting
CIBW_REPAIR_WHEEL_COMMAND to an auditwheel-only command, which bypasses the abi3
audit in pyproject.toml; restore the Limited API check by either appending the
abi3audit step to CIBW_REPAIR_WHEEL_COMMAND (run abi3audit after auditwheel
repair) or move the custom exclude list into the existing repair/table in
pyproject.toml so the original abi3audit invocation is preserved; update the
CIBW_REPAIR_WHEEL_COMMAND or pyproject.toml accordingly and ensure the repair
sequence still invokes abi3audit for Linux builds.
- Around line 124-133: The symlink creation assumes a lib64 layout and should
instead resolve the actual .so under any lib* directory; update the ln -sf steps
that reference RMM_ROOT and LOG_ROOT so they search for the shared objects
(e.g., under "$RMM_ROOT"/lib*/librmm.so and "$LOG_ROOT"/lib*/librapids_logger.so
or by using a find/glob to locate the first matching .so) and create symlinks to
those resolved paths rather than hard-coding lib64, keeping the same RMM_ROOT
and LOG_ROOT variables and the two ln -sf operations (for librmm and
librapids_logger).

In `@src/rapids_singlecell/_cuda/wilcoxon/wilcoxon_ovr.cu`:
- Around line 160-180: After each device kernel launch and any CUB launch in
wilcoxon_ovr.cu (e.g., after fill_row_indices_kernel<<<...>>>, after the
cub::DeviceSegmentedRadixSort::SortPairs(...) call, and after
rank_sums_from_sorted_kernel<<<...>>>), insert a call to cudaGetLastError() and
check its return (and log or propagate the error) immediately to detect launch
failures early; repeat this pattern for all similar kernel/CUB launches in the
file (the reviewer listed examples at the start of each *_impl: kernels around
symbols fill_row_indices_kernel, rank_sums_from_sorted_kernel and other kernels
noted in the comment) so each launch is followed by an immediate
cudaGetLastError() check rather than relying solely on stream synchronization.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 548a08c7-d10f-413f-b6b1-1d702b116125

📥 Commits

Reviewing files that changed from the base of the PR and between 1b926c0 and 68ad73c.

📒 Files selected for processing (8)
  • .github/workflows/publish.yml
  • CMakeLists.txt
  • pyproject.toml
  • src/rapids_singlecell/_cuda/nb_types.h
  • src/rapids_singlecell/_cuda/wilcoxon/kernels_wilcoxon.cuh
  • src/rapids_singlecell/_cuda/wilcoxon/wilcoxon_ovo.cu
  • src/rapids_singlecell/_cuda/wilcoxon/wilcoxon_ovr.cu
  • src/rapids_singlecell/tools/_rank_genes_groups/_wilcoxon.py
✅ Files skipped from review due to trivial changes (2)
  • src/rapids_singlecell/_cuda/nb_types.h
  • src/rapids_singlecell/tools/_rank_genes_groups/_wilcoxon.py
🚧 Files skipped from review as they are similar to previous changes (2)
  • CMakeLists.txt
  • src/rapids_singlecell/_cuda/wilcoxon/wilcoxon_ovo.cu

Comment thread pyproject.toml
Comment on lines +201 to +208
[tool.uv]
# librmm headers are needed at build time for the wilcoxon CUDA kernels.
# The headers are identical across cu12/cu13 — only the .so differs (loaded
# at runtime via librmm.load_library()). cu12 is used here as the build-time
# provider; cu13 envs get the same headers.

[tool.uv.extra-build-dependencies]
rapids-singlecell = [ "librmm-cu12>=25.10" ]

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, let's check the pyproject.toml file at the specified lines
wc -l pyproject.toml



🏁 Script executed:

# Read the relevant section of pyproject.toml
head -n 220 pyproject.toml | tail -n 30



🏁 Script executed:

# Check CMakeLists.txt for the rmm requirements at the specified lines
if [ -f CMakeLists.txt ]; then
  head -n 25 CMakeLists.txt | tail -n 15
fi



🏁 Script executed:

# Check if CMakeLists.txt mentions librmm or rmm
if [ -f CMakeLists.txt ]; then
  rg -n "rmm|librmm" CMakeLists.txt | head -20
fi



🏁 Script executed:

# Check the build-system configuration in pyproject.toml
rg -A 5 "\[build-system\]" pyproject.toml



🏁 Script executed:

# Search for any documentation or comments about rmm requirements
rg -i "rmm|wilcoxon" --type=py --type=toml --type=md | head -30



🏁 Script executed:

# Check for package configuration and publishing info
rg -i "packages|distribution|sdist" pyproject.toml | head -20



🏁 Script executed:

# Look for documentation about build requirements
fd -e md -e rst -e txt | xargs rg -l "librmm|rmm|build.*require" 2>/dev/null | head -10



🏁 Script executed:

# Check README and installation docs
if [ -f README.md ]; then
  head -100 README.md
fi



🏁 Script executed:

# Search for RSC_BUILD_EXTENSIONS usage and documentation
rg -n "RSC_BUILD_EXTENSIONS|BUILD_EXTENSIONS" --type=py --type=toml --type=md



🏁 Script executed:

# Check pyproject.toml for wheel/sdist configuration
cat pyproject.toml | grep -A 10 -B 2 "wheel\|sdist\|package"



Move librmm to a build-system-agnostic configuration or document the requirement explicitly.

The [tool.uv.extra-build-dependencies] entry is ignored by pip and generic PEP 517 build isolation. While the project publishes an sdist and CMakeLists.txt (lines 177–180) requires rmm for the Wilcoxon CUDA extensions, source builds via pip install --no-binary will fail to locate librmm headers during compilation unless the package is manually provided. Either add librmm-cu12>=25.10 to [build-system].requires in pyproject.toml, or clearly document this requirement in installation documentation along with guidance on using the RSC_BUILD_EXTENSIONS=OFF flag to skip CUDA extension builds when librmm is unavailable.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pyproject.toml` around lines 201 - 208, The pyproject.toml currently declares
librmm under [tool.uv.extra-build-dependencies], which is ignored by PEP 517
build isolation; move the librmm requirement into the PEP 517 build requirements
by adding "librmm-cu12>=25.10" to [build-system].requires in pyproject.toml so
source builds (pip install --no-binary) can find headers required by
CMakeLists.txt for the Wilcoxon CUDA extensions, or alternatively update the
project installation docs to explicitly state that librmm (librmm-cu12>=25.10)
must be preinstalled and show using RSC_BUILD_EXTENSIONS=OFF to skip CUDA builds
when librmm is not available.

Comment on lines +154 to +156
if (compute_tie_corr) {
    // Warp buf always in shared memory (32 doubles — always fits)
    double* warp_buf = use_gmem ? smem : smem + n_groups;

⚠️ Potential issue | 🔴 Critical

Incorrect warp buffer offset in shared memory path.

When use_gmem=false, the shared memory layout allocates 4 arrays of size n_groups (grp_sums, s_sum, s_sq, s_nnz) before the warp buffer. The offset calculation should be smem + 4 * n_groups, not smem + n_groups.

🐛 Proposed fix
     if (compute_tie_corr) {
         // Warp buf always in shared memory (32 doubles — always fits)
-        double* warp_buf = use_gmem ? smem : smem + n_groups;
+        double* warp_buf = use_gmem ? smem : smem + 4 * n_groups;
 #pragma unroll
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/rapids_singlecell/_cuda/wilcoxon/kernels_wilcoxon.cuh` around lines 154 -
156, The warp buffer offset is wrong when use_gmem is false: in the
compute_tie_corr branch the code sets double* warp_buf = use_gmem ? smem : smem
+ n_groups but the shared-memory layout places grp_sums, s_sum, s_sq, s_nnz (4
arrays of size n_groups) before the warp buffer; update the offset calculation
in the compute_tie_corr block to use smem + 4 * n_groups (keeping the use_gmem ?
smem logic) so warp_buf points after those four arrays (refer to variables
compute_tie_corr, warp_buf, use_gmem, smem, n_groups).
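The fix is easiest to check by writing out the shared-memory map. Here is a small Python sketch (offsets in doubles; array names taken from the kernel) of the layout described above, where four `n_groups`-sized arrays precede the warp buffer:

```python
def smem_offsets(n_groups: int) -> dict:
    """Offsets (in doubles) into the dynamic shared-memory block when
    use_gmem is false: four n_groups-sized arrays precede the warp buffer."""
    offsets, cursor = {}, 0
    for name in ("grp_sums", "s_sum", "s_sq", "s_nnz"):
        offsets[name] = cursor
        cursor += n_groups
    offsets["warp_buf"] = cursor  # 4 * n_groups — not n_groups
    return offsets


layout = smem_offsets(8)
print(layout["warp_buf"])  # 32, i.e. 4 * n_groups
```

With the buggy `smem + n_groups` offset, the warp buffer would alias `s_sum`, silently corrupting the per-group accumulators.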

Comment thread src/rapids_singlecell/_cuda/wilcoxon/wilcoxon_ovr.cu
Comment on lines +540 to +546
cudaHostRegister(const_cast<float*>(h_data),
                 (size_t)h_indptr[n_cols] * sizeof(float), 0);
cudaHostRegister(const_cast<int*>(h_indices),
                 (size_t)h_indptr[n_cols] * sizeof(int), 0);
cudaHostRegister(h_rank_sums, (size_t)n_groups * n_cols * sizeof(double),
                 0);
cudaHostRegister(h_tie_corr, n_cols * sizeof(double), 0);

⚠️ Potential issue | 🟡 Minor

Missing error checks for cudaHostRegister.

cudaHostRegister can fail (e.g., if memory is already pinned, or the system doesn't support pinning). Without error checking, failures are silent and async transfers may not work as expected.

Suggested fix
     // Pin host memory for async transfers
-    cudaHostRegister(const_cast<float*>(h_data),
-                     (size_t)h_indptr[n_cols] * sizeof(float), 0);
-    cudaHostRegister(const_cast<int*>(h_indices),
-                     (size_t)h_indptr[n_cols] * sizeof(int), 0);
-    cudaHostRegister(h_rank_sums, (size_t)n_groups * n_cols * sizeof(double),
-                     0);
-    cudaHostRegister(h_tie_corr, n_cols * sizeof(double), 0);
+    cudaError_t pin_err;
+    pin_err = cudaHostRegister(const_cast<float*>(h_data),
+                     (size_t)h_indptr[n_cols] * sizeof(float), 0);
+    // Note: cudaErrorHostMemoryAlreadyRegistered is acceptable
+    if (pin_err != cudaSuccess && pin_err != cudaErrorHostMemoryAlreadyRegistered)
+        throw std::runtime_error("Failed to pin h_data: " + 
+            std::string(cudaGetErrorString(pin_err)));
+    // ... similar for other registers
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/rapids_singlecell/_cuda/wilcoxon/wilcoxon_ovr.cu` around lines 540 - 546,
The cudaHostRegister calls for h_data, h_indices, h_rank_sums, and h_tie_corr
are missing error checks; wrap each cudaHostRegister(...) invocation with a
check of the returned cudaError_t and handle failures (e.g., log the error using
cudaGetErrorString(err) and return/throw an error code) so failures aren’t
silent; update the block where cudaHostRegister is called (references:
cudaHostRegister, h_data, h_indices, h_rank_sums, h_tie_corr) to verify the
return value, clean up any previously pinned memory on failure, and propagate an
error up from the function.

@Intron7
Member Author

Intron7 commented Apr 15, 2026

@coderabbitai review

@coderabbitai

coderabbitai bot commented Apr 15, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.


@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (2)
src/rapids_singlecell/_cuda/wilcoxon/wilcoxon_ovr.cu (2)

686-692: CSR histogram kernel lacks cudaGetLastError() check.

The csr_col_histogram_kernel launch at lines 690-691 doesn't check for launch errors. If the kernel fails, the subsequent cudaMemcpy to h_col_counts would silently read uninitialized or incorrect data.

Suggested fix
         csr_col_histogram_kernel<<<blocks, tpb>>>(csr_indices, csr_indptr,
                                                   d_col_counts, n_rows, n_cols);
+        cudaError_t err = cudaGetLastError();
+        if (err != cudaSuccess)
+            throw std::runtime_error(
+                std::string("csr_col_histogram_kernel launch failed: ") +
+                cudaGetErrorString(err));
     }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/rapids_singlecell/_cuda/wilcoxon/wilcoxon_ovr.cu` around lines 686 - 692,
After launching csr_col_histogram_kernel(...) add a cudaGetLastError() (and
optionally cudaDeviceSynchronize()) check immediately after the kernel launch
and before the subsequent cudaMemcpy to h_col_counts; if an error is returned,
log/return the CUDA error (or propagate it) rather than proceeding to
cudaMemcpy. Ensure the check references csr_col_histogram_kernel launch and the
device buffer d_col_counts (and host h_col_counts) so failures are detected and
handled before copying results back to host.

182-203: Consider adding cudaGetLastError() after kernel launches.

While stream synchronization at lines 225-230 will surface async errors, adding cudaGetLastError() immediately after the kernel launches (lines 183-184, 200-203) would catch launch-configuration errors (e.g., an invalid shared-memory size) earlier, making debugging easier.

This is a minor improvement since errors are eventually caught at sync.

Suggested improvement
         fill_row_indices_kernel<<<sb_cols, 256, 0, stream>>>(buf.vals_in,
                                                              n_rows, sb_cols);
+        if (cudaGetLastError() != cudaSuccess)
+            throw std::runtime_error("fill_row_indices_kernel launch failed");
 
         // Sort: keys = block columns [col, col+sb_cols), already F-order
         const float* keys_in = block + (long long)col * n_rows;
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/rapids_singlecell/_cuda/wilcoxon/wilcoxon_ovr.cu` around lines 182 - 203,
After the launches of fill_row_indices_kernel and rank_sums_from_sorted_kernel
(and optionally after the cub::DeviceSegmentedRadixSort::SortPairs call),
immediately call cudaGetLastError() and handle/report any non-success result so
launch-time failures (e.g., invalid <<<...>>> or shared memory size) are caught
earlier; locate the kernel launches by the symbols fill_row_indices_kernel and
rank_sums_from_sorted_kernel in wilcoxon_ovr.cu and add a cudaGetLastError()
check right after each launch (or use your project's CUDA error-check macro) to
log/propagate the launch error before later stream synchronization.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@src/rapids_singlecell/_cuda/wilcoxon/wilcoxon_ovr.cu`:
- Around line 686-692: After launching csr_col_histogram_kernel(...) add a
cudaGetLastError() (and optionally cudaDeviceSynchronize()) check immediately
after the kernel launch and before the subsequent cudaMemcpy to h_col_counts; if
an error is returned, log/return the CUDA error (or propagate it) rather than
proceeding to cudaMemcpy. Ensure the check references csr_col_histogram_kernel
launch and the device buffer d_col_counts (and host h_col_counts) so failures
are detected and handled before copying results back to host.
- Around line 182-203: After the launches of fill_row_indices_kernel and
rank_sums_from_sorted_kernel (and optionally after the
cub::DeviceSegmentedRadixSort::SortPairs call), immediately call
cudaGetLastError() and handle/report any non-success result so launch-time
failures (e.g., invalid <<<...>>> or shared memory size) are caught earlier;
locate the kernel launches by the symbols fill_row_indices_kernel and
rank_sums_from_sorted_kernel in wilcoxon_ovr.cu and add a cudaGetLastError()
check right after each launch (or use your project's CUDA error-check macro) to
log/propagate the launch error before later stream synchronization.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: a64a3c4e-ea59-4657-95d3-8688ccbed35a

📥 Commits

Reviewing files that changed from the base of the PR and between 68ad73c and 7559778.

📒 Files selected for processing (12)
  • .github/workflows/publish.yml
  • pyproject.toml
  • src/rapids_singlecell/_cuda/__init__.py
  • src/rapids_singlecell/_cuda/wilcoxon/kernels_wilcoxon.cuh
  • src/rapids_singlecell/_cuda/wilcoxon/wilcoxon_common.cuh
  • src/rapids_singlecell/_cuda/wilcoxon/wilcoxon_ovr.cu
  • src/rapids_singlecell/tools/_rank_genes_groups/__init__.py
  • src/rapids_singlecell/tools/_rank_genes_groups/_core.py
  • src/rapids_singlecell/tools/_rank_genes_groups/_utils.py
  • src/rapids_singlecell/tools/_rank_genes_groups/_wilcoxon.py
  • tests/test_rank_genes_groups_wilcoxon.py
  • tests/test_rank_genes_groups_wilcoxon_binned.py
💤 Files with no reviewable changes (1)
  • src/rapids_singlecell/tools/_rank_genes_groups/_core.py
✅ Files skipped from review due to trivial changes (1)
  • src/rapids_singlecell/tools/_rank_genes_groups/__init__.py
🚧 Files skipped from review as they are similar to previous changes (4)
  • src/rapids_singlecell/_cuda/__init__.py
  • .github/workflows/publish.yml
  • pyproject.toml
  • src/rapids_singlecell/tools/_rank_genes_groups/_utils.py

