
C++ wilcoxon #587

Open

Intron7 wants to merge 17 commits into main from c++-wilcoxon

Conversation

@Intron7
Member

@Intron7 Intron7 commented Feb 25, 2026

No description provided.

@codecov-commenter

codecov-commenter commented Feb 25, 2026

Codecov Report

❌ Patch coverage is 81.25000% with 39 lines in your changes missing coverage. Please review.
✅ Project coverage is 87.63%. Comparing base (046bdb3) to head (88bcdd5).

Files with missing lines Patch % Lines
...s_singlecell/tools/_rank_genes_groups/_wilcoxon.py 83.41% 33 Missing ⚠️
...pids_singlecell/tools/_rank_genes_groups/_utils.py 0.00% 4 Missing ⚠️
src/rapids_singlecell/_cuda/__init__.py 60.00% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #587      +/-   ##
==========================================
- Coverage   88.04%   87.63%   -0.41%     
==========================================
  Files          96       95       -1     
  Lines        7032     7071      +39     
==========================================
+ Hits         6191     6197       +6     
- Misses        841      874      +33     
Files with missing lines Coverage Δ
...ds_singlecell/tools/_rank_genes_groups/__init__.py 82.45% <ø> (ø)
...apids_singlecell/tools/_rank_genes_groups/_core.py 72.76% <ø> (-18.73%) ⬇️
src/rapids_singlecell/_cuda/__init__.py 66.66% <60.00%> (-3.34%) ⬇️
...pids_singlecell/tools/_rank_genes_groups/_utils.py 67.16% <0.00%> (-18.76%) ⬇️
...s_singlecell/tools/_rank_genes_groups/_wilcoxon.py 85.26% <83.41%> (-12.36%) ⬇️

... and 1 file with indirect coverage changes

@Intron7 Intron7 marked this pull request as draft February 26, 2026 10:31
@Intron7 Intron7 closed this Apr 9, 2026
@Intron7 Intron7 reopened this Apr 9, 2026
@coderabbitai

coderabbitai bot commented Apr 14, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@coderabbitai

coderabbitai bot commented Apr 14, 2026

📝 Walkthrough

Summary by CodeRabbit

  • Dependencies

    • Optional RAPIDS CUDA extras now include librmm for cu12/cu13 to support new GPU build/runtime paths.
  • New Features

    • Reworked GPU Wilcoxon with streaming OVO/OVR pipelines, host/device streaming, tiered dispatch, fused kernels, and optional tie-correction/statistics.
  • Refactor

    • Replaced the previous GPU backend with a streaming/vectorized pipeline; chunk_size removed from the Wilcoxon API (still used for wilcoxon_binned).
  • Tests

    • Low-level kernel tests removed; expanded high-level behavioral tests across formats, options, and edge cases.
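For context, the "tie-correction" the fused kernels optionally compute is the standard Wilcoxon/Mann–Whitney adjustment for tied values. A minimal host-side sketch (purely illustrative, not the kernel code):

```python
import numpy as np

def tie_correction(values: np.ndarray) -> float:
    """Standard Wilcoxon tie correction: 1 - sum(t^3 - t) / (n^3 - n),
    where t ranges over the sizes of tied groups."""
    n = values.size
    _, counts = np.unique(values, return_counts=True)
    counts = counts.astype(np.float64)
    return 1.0 - (counts**3 - counts).sum() / (n**3 - n)

# No ties: the factor is exactly 1.0; ties shrink it toward 0.
assert tie_correction(np.array([1.0, 2.0, 3.0, 4.0])) == 1.0
assert abs(tie_correction(np.array([1.0, 1.0, 2.0, 3.0])) - 0.9) < 1e-12
```

This matches SciPy's `scipy.stats.tiecorrect`; the GPU kernels compute the same quantity in parallel over sorted segments.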

Walkthrough

Replaces the legacy Wilcoxon CUDA backend with two new streaming nanobind modules (OVR and OVO), adds fused rank-sum/tie-correction kernels and shared CUDA utilities, updates build/packaging to locate/preload RMM, refactors Python Wilcoxon pipeline to call streaming kernels, and updates tests accordingly.

Changes

Cohort / File(s): Summary

  • Build & CI (.github/workflows/publish.yml, pyproject.toml, CMakeLists.txt): CI wheel build now locates/preloads librmm and rapids_logger, symlinks lib64 into /usr/local/lib, adjusts the LD/auditwheel steps, adds librmm-cu12/cu13>=25.10 optional dependencies, and extends the CMake search hints to find rmm before find_package(rmm CONFIG REQUIRED).
  • CUDA module exports & init (src/rapids_singlecell/_cuda/__init__.py): Pre-loads librmm on import (swallowing ImportError/OSError) and replaces the _wilcoxon_cuda export with _wilcoxon_ovo_cuda and _wilcoxon_ovr_cuda in __all__.
  • Nanobind host types (src/rapids_singlecell/_cuda/nb_types.h): Adds host_array<T> and host_array_2d<T> nanobind (NumPy-device) type aliases (1D and 2D).
  • Common CUDA utilities (src/rapids_singlecell/_cuda/wilcoxon/wilcoxon_common.cuh): New common header with CUDA exec constants, an RmmPool RAII allocator, warp/round helpers, an upload helper, and a templated CSR→dense extraction kernel.
  • Fused/generic kernels, OVR (src/rapids_singlecell/_cuda/wilcoxon/kernels_wilcoxon.cuh, src/rapids_singlecell/_cuda/wilcoxon/wilcoxon_ovr.cu): Replaces the two-pass tie/average-rank kernels with a fused rank_sums_from_sorted_kernel and a sparse-aware rank_sums_sparse_ovr_kernel; adds the wilcoxon_ovr nanobind module implementing streaming OVR pipelines (dense/CSR/CSC), sub-batching, CUB segmented sort, multi-stream RMM-backed buffers, and host-streaming paths.
  • New OVO kernels & module (src/rapids_singlecell/_cuda/wilcoxon/kernels_wilcoxon_ovo.cuh, src/rapids_singlecell/_cuda/wilcoxon/wilcoxon_ovo.cu): Adds OVO-specific device helpers (block reductions, parallel tie correction), presorted and fused shared-memory kernels, batched presorted ranking, and the wilcoxon_ovo nanobind module with device and host streaming entrypoints.
  • Removed legacy CUDA file (src/rapids_singlecell/_cuda/wilcoxon/wilcoxon.cu): Deletes the old _wilcoxon_cuda implementation and its nanobind bindings (tie_correction, average_rank).
  • Python rank-genes refactor (src/rapids_singlecell/tools/_rank_genes_groups/_wilcoxon.py, _utils.py, _core.py, __init__.py): Refactors the Python Wilcoxon paths to use the streaming kernels (OVR/OVO), adds GPU data-prep and segmented-sort helpers, removes chunk_size for wilcoxon, introduces STREAMING_SUB_BATCH = 64, and stops forcing float64 casts in sparse→dense conversions.
  • Tests (tests/test_rank_genes_groups_wilcoxon.py, tests/test_rank_genes_groups_wilcoxon_binned.py): Removes kernel-level unit tests tied to deleted entrypoints; adds/expands higher-level behavioral tests across SciPy/CuPy/CSR/CSC formats, continuity/tie-correction coverage, mask/key handling, and new wilcoxon_binned tests.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

  • update rank_genes_groups #570: Refactors Wilcoxon rank_genes_groups implementation and group handling (group_codes/group_sizes), directly overlapping Python-side Wilcoxon changes.
  • Check for kernel errors #619: Prior Wilcoxon CUDA edits and CUDA utilities that overlap with the new CUDA backend additions and header changes.
🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Description check (⚠️ Warning): No pull request description was provided by the author, making it impossible to evaluate relatedness to the changeset. A description is needed to understand intent and context. Resolution: add a clear pull request description explaining the purpose, scope, and key changes of this C++ Wilcoxon refactoring.
✅ Passed checks (2 passed)
  • Title check (✅ Passed): The title "C++ wilcoxon" is concise and directly refers to a major component of the changeset, but lacks specificity about what aspect or change is being introduced.
  • Docstring Coverage (✅ Passed): Docstring coverage is 91.94%, above the required 80.00% threshold.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.




@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 10

🧹 Nitpick comments (4)
src/rapids_singlecell/tools/_rank_genes_groups/_wilcoxon.py (2)

44-56: Inconsistent int32 casting for indptr arrays.

indices is cast to int32 but indptr is not. The CUDA kernels expect int32 for all sparse array components. If the input is large enough that SciPy promotes the index arrays to int64 (possible when nnz exceeds 2^31, though rare), this could cause dtype mismatches at the kernel boundary.

For consistency and safety:

Proposed fix
         return cpsp.csc_matrix(
             (
                 cp.asarray(csc.data),
                 cp.asarray(csc.indices.astype(np.int32, copy=False)),
-                cp.asarray(csc.indptr),
+                cp.asarray(csc.indptr.astype(np.int32, copy=False)),
             ),
             shape=(n_rows, n_cols),
         )
     csr = X.tocsr() if X.format != "csr" else X
     return cpsp.csr_matrix(
         (
             cp.asarray(csr.data),
             cp.asarray(csr.indices.astype(np.int32, copy=False)),
-            cp.asarray(csr.indptr),
+            cp.asarray(csr.indptr.astype(np.int32, copy=False)),
         ),
         shape=(n_rows, n_cols),
     )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/rapids_singlecell/tools/_rank_genes_groups/_wilcoxon.py` around lines 44
- 56, The indptr arrays are not being cast to int32 while indices are, causing a
potential dtype mismatch for CUDA kernels; update both sparse-construction sites
(the csc -> cpsp.csc_matrix block and the csr -> cpsp.csr_matrix block around
the variables csr, csc, indices, and indptr) to cast indptr to np.int32 before
moving to the GPU (i.e., use indptr.astype(np.int32, copy=False) wrapped with
cp.asarray just like indices) so all sparse-array components use int32
consistently for the kernels.

39-41: Redundant format check and ternary.

Line 40 re-checks sp.issparse(X) even though the outer isinstance already guarantees a sparse input, and line 41's ternary always takes its first branch because the enclosing if has just established X.format == "csc".

Simplified version
     if isinstance(X, sp.spmatrix | sp.sparray):
-        if sp.issparse(X) and X.format == "csc":
-            csc = X if X.format == "csc" else X.tocsc()
+        if X.format == "csc":
+            csc = X
             return cpsp.csc_matrix(
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/rapids_singlecell/tools/_rank_genes_groups/_wilcoxon.py` around lines 39
- 41, The nested check and ternary around X.format are redundant; in the block
handling sparse inputs (check using sp.issparse and the existing isinstance
check for sp.spmatrix|sp.sparray), simplify the logic in _wilcoxon.py by
removing the repeated "X.format == 'csc'" test and assign csc using a single
conversion path (e.g., in the sp.issparse branch set csc = X.tocsc() which is
safe whether X is already csc or not, and for non-sparse fall back to
constructing or converting to a csc matrix), updating the code around the
variable X and the csc assignment accordingly.
src/rapids_singlecell/_cuda/wilcoxon/wilcoxon_ovo.cu (1)

706-712: Host memory registration is correct but consider error handling.

The cudaHostRegister/cudaHostUnregister pattern is correct for async transfers. The const_cast is necessary since the CUDA API lacks a const variant. Consider wrapping in RAII or checking return values for robustness in production.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/rapids_singlecell/_cuda/wilcoxon/wilcoxon_ovo.cu` around lines 706 - 712,
The cudaHostRegister calls for h_data, h_indices, h_rank_sums and h_tie_corr
currently ignore CUDA errors and rely on manual cudaHostUnregister elsewhere;
update this to add error handling and/or an RAII wrapper: check and handle the
return value of cudaHostRegister for each buffer (e.g., propagate or log the
cudaError_t from cudaHostRegister) and ensure paired cudaHostUnregister is
called in destructor/finalizer (create a small HostRegisterGuard class that
registers in its constructor and unregisters in its destructor and replace the
direct cudaHostRegister calls with instances of that guard for h_data,
h_indices, h_rank_sums and h_tie_corr).
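The suggested HostRegisterGuard is a C++ RAII pattern, but the paired register/unregister discipline it enforces is language-agnostic. A minimal Python sketch with stand-in callables (the register/unregister functions here are placeholders, not the CUDA runtime API):

```python
from contextlib import contextmanager

@contextmanager
def host_register(register, unregister, buf):
    """Mirror of an RAII guard: register on entry, always unregister on exit,
    and surface a nonzero error code from registration instead of ignoring it."""
    err = register(buf)
    if err != 0:
        raise RuntimeError(f"hostRegister failed with code {err}")
    try:
        yield buf
    finally:
        unregister(buf)  # runs even if the body raises

calls = []
with host_register(lambda b: calls.append(("reg", b)) or 0,
                   lambda b: calls.append(("unreg", b)), "h_data"):
    pass
assert calls == [("reg", "h_data"), ("unreg", "h_data")]
```

The C++ version would do the same in a constructor/destructor pair so every `cudaHostRegister` is matched by `cudaHostUnregister` on all exit paths.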
src/rapids_singlecell/_cuda/wilcoxon/kernels_wilcoxon.cuh (1)

125-126: Clarify warp buffer reuse after shared memory arrays.

warp_buf = smem + n_groups overlaps with s_sum when compute_stats is true. While this works because s_sum is no longer needed after line 119, a comment explaining this intentional reuse would improve maintainability.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/rapids_singlecell/_cuda/wilcoxon/kernels_wilcoxon.cuh` around lines 125 -
126, The assignment warp_buf = smem + n_groups inside the compute_tie_corr
branch reuses the same shared-memory region previously used for s_sum when
compute_stats was true; add an inline comment next to that assignment explaining
that this overlap is intentional because s_sum is no longer needed by that point
(reference symbols: warp_buf, smem, n_groups, s_sum, compute_stats,
compute_tie_corr) so future readers won't treat it as a bug and will understand
the shared-memory reuse constraints.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.github/workflows/publish.yml:
- Line 136: CIBW_REPAIR_WHEEL_COMMAND currently runs only the auditwheel repair
step and omits the abi3audit check; update the environment variable value to run
abi3audit after auditwheel by appending the abi3audit invocation (e.g., && pipx
run abi3audit --strict --report {wheel}) so the full sequence includes both
auditwheel repair (keeping the existing --exclude flags and {dest_dir} {wheel}
placeholders) and the pipx run abi3audit --strict --report {wheel} step to
validate Limited API wheels.
- Around line 125-133: Current workflow derives .so paths from SITE=$(python -c
"import sysconfig;print(sysconfig.get_path('purelib'))") and hard-codes lib64,
which is brittle; instead import the packages and read their actual package dirs
(use librmm.__path__[0] and rapids_logger.__path__[0]) to locate .so files
dynamically and create symlinks from those resolved locations (update the steps
that write /tmp/.librmm_dir accordingly); also stop overriding
CIBW_REPAIR_WHEEL_COMMAND with only auditwheel or restore the abi3audit step so
the abi3audit validation defined in pyproject.toml is preserved (keep both
abi3audit and auditwheel in the repair command or remove the env override so
pyproject.toml controls it).

In `@CMakeLists.txt`:
- Around line 23-25: The explicit override appends ENV{LIBRMM_DIR}/.. which
moves one level up and breaks the environment-root based discovery; update the
CMakeLists logic that appends to _env_roots so it pushes "$ENV{LIBRMM_DIR}" (not
"$ENV{LIBRMM_DIR}/..") so the subsequent glob checks that look for
${_root}/lib/cmake/rmm and ${_root}/lib/python*/site-packages/librmm/... will
find the correct paths.

In `@src/rapids_singlecell/_cuda/nb_types.h`:
- Around line 45-50: The host_array_2d alias currently uses nb::ndarray<T,
nb::numpy> which allows any rank; change it to enforce 2D by specifying
nb::ndim<2> so only rank-2 NumPy arrays bind. Update the alias host_array_2d to
use nb::ndarray<T, nb::numpy, nb::ndim<2>> (keeping host_array unchanged) so
mismatched ranks fail at binding rather than later in matrix code.

In `@src/rapids_singlecell/_cuda/wilcoxon/kernels_wilcoxon.cuh`:
- Around line 29-33: The kernel currently allocates shared memory arrays (smem
-> grp_sums, s_sum, s_sq, s_nnz) sized by n_groups which can overflow GPU shared
memory for large n_groups; change the launch/init logic so you detect the
required shared_bytes = (4 * n_groups + 32) * sizeof(double) and if it exceeds
the device maxSharedMemPerBlock (or a safe threshold) switch to a fallback path:
either use global-memory accumulators with atomicAdd on
grp_sums/s_sum/s_sq/s_nnz, or process groups in tiles that fit into shared
memory (loop over group tiles, load tile into smem, reduce, and write back) —
implement the detection and choose between the existing shared-memory code path
and the fallback (global atomics or tiled processing) based on that check,
referencing smem, grp_sums, s_sum, s_sq, s_nnz and n_groups to locate the code
to change.

In `@src/rapids_singlecell/_cuda/wilcoxon/wilcoxon_ovo.cu`:
- Around line 149-166: Add post-launch error checks after each CUDA kernel
invocation (e.g., after build_seg_offsets_kernel<<<...>>> and
batched_rank_sums_presorted_kernel<<<...>>>): call cudaGetLastError() and
handle/report non-success results (use the existing error-handling macro or
CUDA_CHECK_LAST_ERROR pattern used elsewhere in the project) so
configuration/launch failures are detected; apply the same fix for other kernel
launches in this file that currently lack a cudaGetLastError() check.

In `@src/rapids_singlecell/_cuda/wilcoxon/wilcoxon_ovr.cu`:
- Around line 137-152: The kernel/CUB calls in this block lack post-launch error
checks; after the launch of fill_row_indices_kernel, after the
cub::DeviceSegmentedRadixSort::SortPairs call, and after the launch of
rank_sums_from_sorted_kernel add CUDA error checks by calling cudaGetLastError()
(and optionally cudaDeviceSynchronize() or checking the returned cudaError_t)
and log/return on error using the existing logging/return pattern (refer to
fill_row_indices_kernel, cub::DeviceSegmentedRadixSort::SortPairs,
rank_sums_from_sorted_kernel and the buf/stream variables to locate the calls);
ensure each check captures and reports the cuda error string so failures are not
silent.
- Around line 122-123: The shared memory calc for rank_sums_from_sorted_kernel
(smem_rank = (4 * n_groups + 32) * sizeof(double)) can exceed
device.sharedMemPerBlock for large n_groups; add a runtime check using
cudaGetDeviceProperties to compare smem_rank to prop.sharedMemPerBlock and if it
exceeds the limit either choose the fallback path used elsewhere (e.g., the
alternative kernel/path in kernels_wilcoxon.cuh) or return/throw an error so you
do not launch rank_sums_from_sorted_kernel with too-large shared memory; update
the code around tpb_rank, smem_rank and the kernel launch to perform this guard
and select the safe code path.

In `@src/rapids_singlecell/tools/_rank_genes_groups/_wilcoxon.py`:
- Around line 417-431: The loop building test_group_indices/all_grp_rows can
produce an empty all_grp_rows causing np.concatenate(all_grp_rows) to raise; add
a guard after the loop that checks if not all_grp_rows (or n_test == 0) and
return an appropriate empty result (e.g., [] or the function's expected empty
structure) to skip downstream GPU ops; update the code near variables
test_group_indices, all_grp_rows, all_grp_row_ids_np, grp_offsets_gpu to
early-return when there are no test groups to compare.
- Around line 329-342: The sparse CSR/CSC components passed into the CUDA
kernels (_ovr.ovr_streaming_csr and its CSC counterpart) should be forced to
C-contiguous arrays to match kernel expectations; wrap X_gpu.data,
X_gpu.indices, and X_gpu.indptr (and the CSC equivalents) with
cp.ascontiguousarray after casting (e.g., X_gpu.data.astype(..., copy=False))
before passing them into the kernel calls so the arrays are guaranteed
contiguous and safe for GPU kernels.
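The empty-batch guard suggested above for _wilcoxon.py is cheap to sketch on the host; `np.concatenate` raises ValueError on an empty list, so the guard must come first (the helper name is hypothetical):

```python
import numpy as np

def concat_or_empty(chunks: list, dtype=np.int64) -> np.ndarray:
    """Concatenate row-index chunks, returning an empty array when there are
    no test groups instead of letting np.concatenate raise ValueError."""
    if not chunks:
        return np.empty(0, dtype=dtype)
    return np.concatenate(chunks)

assert concat_or_empty([]).size == 0
assert concat_or_empty([np.array([1, 2]), np.array([3])]).tolist() == [1, 2, 3]
```

In the actual function, hitting the empty case would also mean skipping the downstream GPU calls and returning the function's expected empty result structure.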


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: eba4718b-34f0-4ad7-aa18-dbeb80fa18a2

📥 Commits

Reviewing files that changed from the base of the PR and between 046bdb3 and 1b926c0.

📒 Files selected for processing (14)
  • .github/workflows/publish.yml
  • CMakeLists.txt
  • pyproject.toml
  • src/rapids_singlecell/_cuda/__init__.py
  • src/rapids_singlecell/_cuda/nb_types.h
  • src/rapids_singlecell/_cuda/wilcoxon/kernels_wilcoxon.cuh
  • src/rapids_singlecell/_cuda/wilcoxon/kernels_wilcoxon_ovo.cuh
  • src/rapids_singlecell/_cuda/wilcoxon/wilcoxon.cu
  • src/rapids_singlecell/_cuda/wilcoxon/wilcoxon_common.cuh
  • src/rapids_singlecell/_cuda/wilcoxon/wilcoxon_ovo.cu
  • src/rapids_singlecell/_cuda/wilcoxon/wilcoxon_ovr.cu
  • src/rapids_singlecell/tools/_rank_genes_groups/_utils.py
  • src/rapids_singlecell/tools/_rank_genes_groups/_wilcoxon.py
  • tests/test_rank_genes_groups_wilcoxon.py
💤 Files with no reviewable changes (1)
  • src/rapids_singlecell/_cuda/wilcoxon/wilcoxon.cu

Comment thread .github/workflows/publish.yml Outdated
Comment on lines +125 to +133
SITE=$(python -c "import sysconfig;print(sysconfig.get_path('purelib'))") &&
echo "[rsc-build] site-packages=$SITE" &&
echo "[rsc-build] librmm=$(ls $SITE/librmm/lib64/*.so 2>/dev/null)" &&
echo "[rsc-build] rapids_logger=$(ls $SITE/rapids_logger/lib64/*.so 2>/dev/null)" &&
ln -sf "$SITE/librmm/lib64/librmm.so" /usr/local/lib/librmm.so &&
ln -sf "$SITE/rapids_logger/lib64/librapids_logger.so" /usr/local/lib/librapids_logger.so &&
ldconfig &&
python -c "import librmm;print(librmm.__path__[0])" > /tmp/.librmm_dir &&
echo "[rsc-build] marker=$(cat /tmp/.librmm_dir)"


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Scripts executed (repository: scverse/rapids-singlecell):

    git ls-files | grep -E "publish\.yml"
    cat -n .github/workflows/publish.yml | sed -n '120,140p'
    head -20 .github/workflows/publish.yml 2>/dev/null || echo "File not found"
    pwd && find . -maxdepth 3 -type d -name ".github" 2>/dev/null
    git ls-files --others --exclude-standard | head -20
    git ls-files | head -50
    git ls-files | grep -i publish
    find . -type f \( -name "*.yml" -o -name "*.yaml" \) | head -20
    cat pyproject.toml | grep -A 20 -B 5 "abi3\|cibuildwheel\|CIBW_REPAIR"
    cat pyproject.toml 2>/dev/null | head -200
    cat pyproject.toml | grep -A 30 "\[tool.cibuildwheel\]"
    rg "abi3\|CIBW_REPAIR" pyproject.toml
    python3 << 'EOF'
    import tomllib
    with open('pyproject.toml', 'rb') as f:
        data = tomllib.load(f)
        if 'tool' in data and 'cibuildwheel' in data['tool']:
            import json
            print(json.dumps(data['tool']['cibuildwheel'], indent=2))
        else:
            print("No [tool.cibuildwheel] section found")
    EOF
    grep -n "cibuildwheel\|abi3" pyproject.toml
    cat pyproject.toml
    cat pyproject.toml | tail -200
    ast-grep --pattern $'[tool.cibuildwheel]'

Resolve the binary package roots from the modules, not purelib/lib64, and restore abi3audit validation.

Lines 129-130 derive symlink paths from sysconfig.get_path('purelib') with hard-coded lib64 subdirectories. This is brittle: binary packages should use platlib (platform-specific), not purelib, and lib64 is not guaranteed—some systems use lib. When layout assumptions fail, symlinks silently break and /tmp/.librmm_dir points to the wrong location.

Additionally, line 136 overrides CIBW_REPAIR_WHEEL_COMMAND with only the auditwheel step, removing the abi3audit validation defined in pyproject.toml. This is a regression that bypasses quality checks for Limited API wheels.

Fix by: (1) importing the modules first, then dynamically locating their .so files from their actual package paths via librmm.__path__[0] and rapids_logger.__path__[0] instead of assuming a fixed layout; (2) preserving the abi3audit step in the repair command or removing the environment variable override to use the pyproject.toml configuration.

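A hedged sketch of the dynamic lookup the comment asks for, resolving a binary package's own directory via `__path__` instead of assuming purelib plus a lib64 subdirectory (the `librmm` usage is illustrative; the runnable demo below uses a stdlib package so the snippet is self-contained):

```python
import importlib
from pathlib import Path

def package_libs(module_name: str, pattern: str = "*.so*") -> list[Path]:
    """Locate files inside an installed package's actual directory, wherever
    the wheel put it (lib/ vs lib64/, purelib vs platlib no longer matter)."""
    mod = importlib.import_module(module_name)
    root = Path(mod.__path__[0])
    return sorted(root.rglob(pattern))

# In the workflow this would be e.g. package_libs("librmm") to find librmm.so;
# demonstrated here with a stdlib package:
assert len(package_libs("email", "*.py")) > 0
```

The symlink step would then use the first match rather than a hard-coded `$SITE/librmm/lib64/librmm.so` path.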

Comment thread .github/workflows/publish.yml Outdated
Comment thread CMakeLists.txt Outdated
Comment thread src/rapids_singlecell/_cuda/nb_types.h Outdated
Comment thread src/rapids_singlecell/_cuda/wilcoxon/kernels_wilcoxon.cuh Outdated
Comment thread src/rapids_singlecell/_cuda/wilcoxon/wilcoxon_ovo.cu
Comment thread src/rapids_singlecell/_cuda/wilcoxon/wilcoxon_ovr.cu Outdated
Comment on lines +137 to +152
fill_row_indices_kernel<<<sb_cols, 256, 0, stream>>>(buf.vals_in,
n_rows, sb_cols);

// Sort: keys = block columns [col, col+sb_cols), already F-order
const float* keys_in = block + (long long)col * n_rows;
size_t temp = cub_temp_bytes;
cub::DeviceSegmentedRadixSort::SortPairs(
buf.cub_temp, temp, keys_in, buf.keys_out, buf.vals_in,
buf.vals_out, sb_items, sb_cols, buf.seg_offsets,
buf.seg_offsets + 1, BEGIN_BIT, END_BIT, stream);

// Fused rank sums into sub-batch buffer
rank_sums_from_sorted_kernel<<<sb_cols, tpb_rank, smem_rank, stream>>>(
buf.keys_out, buf.vals_out, group_codes, buf.sub_rank_sums,
buf.sub_tie_corr, nullptr, nullptr, nullptr, n_rows, sb_cols,
n_groups, compute_tie_corr, false);


⚠️ Potential issue | 🟠 Major

Missing cudaGetLastError() after kernel launches.

Same issue as in wilcoxon_ovo.cu: kernel launches at lines 137, 143-146 (CUB), and 149 lack error checking. Silent failures could corrupt results or cause hard-to-debug issues.


Comment on lines +329 to +342
if cpsp.isspmatrix_csr(X_gpu):
_ovr.ovr_streaming_csr(
X_gpu.data.astype(cp.float32, copy=False),
X_gpu.indices.astype(cp.int32, copy=False),
X_gpu.indptr.astype(cp.int32, copy=False),
group_codes_gpu,
rank_sums,
tie_corr,
n_rows=n_cells,
n_cols=n_total_genes,
n_groups=n_groups,
compute_tie_corr=tie_correct,
sub_batch_cols=STREAMING_SUB_BATCH,
)


⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check if CuPy sparse matrix components are guaranteed to be contiguous
rg -n "\.data\b|\.indices\b|\.indptr\b" --type py -C 2 src/rapids_singlecell/ | head -50

Repository: scverse/rapids-singlecell

Length of output: 3462


🏁 Script executed:

# Read the full context of _wilcoxon.py around lines 329-342
cat -n src/rapids_singlecell/tools/_rank_genes_groups/_wilcoxon.py | sed -n '320,350p'

Repository: scverse/rapids-singlecell

Length of output: 1511


🏁 Script executed:

# Search for kernel definition and any contiguity-related patterns
rg -n "ovr_streaming_csr|gpu_array_c|ascontiguousarray" --type py src/rapids_singlecell/ | head -30

Repository: scverse/rapids-singlecell

Length of output: 1262


🏁 Script executed:

# Check CuPy sparse matrix properties and contiguity guarantees
# Look for any documentation or comments about sparse array handling
rg -n "sparse|CSR|contiguous" --type py src/rapids_singlecell/tools/_rank_genes_groups/_wilcoxon.py | head -40



🏁 Script executed:

# Search for the _ovr module/kernel implementation
find src -name "*ovr*" -o -name "*wilcoxon*" | head -20



🏁 Script executed:

# Look for nanobind/CUDA kernel definitions related to ovr_streaming
rg -n "ovr_streaming|def ovr" --type py --type cpp --type cu src/rapids_singlecell/ | head -30



🏁 Script executed:

# Check if there's documentation about kernel array requirements
rg -n "contiguous|C-order|F-order" --type cpp --type cu src/rapids_singlecell/ | head -30



🏁 Script executed:

# Read the wilcoxon_ovr.cu file to see the kernel signature and requirements
cat src/rapids_singlecell/_cuda/wilcoxon/wilcoxon_ovr.cu | head -150



🏁 Script executed:

# Check the header files for any contiguity or memory layout documentation
cat src/rapids_singlecell/_cuda/wilcoxon/wilcoxon_common.cuh | head -100



🏁 Script executed:

# Search for nanobind binding definitions for ovr_streaming_csr
rg -n "PYBIND11_MODULE|def_.*ovr_streaming|ovr_streaming_csr" --type py --type cpp src/rapids_singlecell/ | head -40



🏁 Script executed:

# Look for any .pyi stub files or type hints that show kernel signatures
find src -name "*.pyi" -o -name "*binding*" -o -name "*pybind*" | head -20



🏁 Script executed:

# Check if CuPy sparse matrix .data, .indices, .indptr are guaranteed contiguous
# by searching for documentation or tests
rg -n "\.data.*contiguous|\.indices.*contiguous|\.indptr.*contiguous" --type py src/



🏁 Script executed:

# Check CuPy documentation and source patterns - look for how sparse matrices are created
rg -n "csr_matrix|csc_matrix|issparse" --type py src/rapids_singlecell/tools/_rank_genes_groups/_wilcoxon.py -B 3 -A 3 | head -60



🏁 Script executed:

# Look at the _move_to_gpu function which creates/transfers sparse matrices
cat -n src/rapids_singlecell/tools/_rank_genes_groups/_wilcoxon.py | sed -n '25,50p'



🏁 Script executed:

# Check actual usage patterns in similar code - do other parts of codebase use astype on sparse components?
rg -n "\.data\.astype|\.indices\.astype|\.indptr\.astype" --type py src/rapids_singlecell/ -B 2 -A 2



🏁 Script executed:

# Verify what dtype X_gpu.data actually is to confirm if astype conversion happens
rg -n "X_gpu.*dtype|\.data.*dtype|\.indices.*dtype" --type py src/rapids_singlecell/tools/_rank_genes_groups/_wilcoxon.py | head -20



🏁 Script executed:

# Check if there are any reported issues or TODOs about contiguity
rg -n "TODO|FIXME|BUG|contiguous|contiguity" --type py src/rapids_singlecell/tools/_rank_genes_groups/_wilcoxon.py



Ensure C-contiguity of sparse matrix components passed to CUDA kernels.

While CuPy sparse array components (.data, .indices, .indptr) are typically C-contiguous by default, explicitly wrapping them with cp.ascontiguousarray() provides defensive safety and consistency. Line 167 in this same file uses this pattern:

cp.ascontiguousarray(data.astype(cp.float32).ravel(order="F"))

Applying the same safeguard to lines 331–333 (and the CSC variant at 345–347) ensures arrays meet the kernel's contiguity expectations:

Suggested fix
_ovr.ovr_streaming_csr(
    cp.ascontiguousarray(X_gpu.data.astype(cp.float32, copy=False)),
    cp.ascontiguousarray(X_gpu.indices.astype(cp.int32, copy=False)),
    cp.ascontiguousarray(X_gpu.indptr.astype(cp.int32, copy=False)),
    ...
)
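To illustrate why the defensive wrapper is cheap, here is a small CPU sketch using NumPy as a stand-in for CuPy (both expose the same `ascontiguousarray` semantics; the array below is a hypothetical stand-in for `X_gpu.data`):

```python
import numpy as np

# Hypothetical stand-in for the buffer backing X_gpu.data.
data = np.arange(6, dtype=np.float32)

# Already C-contiguous: astype(copy=False) plus ascontiguousarray return the
# same buffer, so the defensive call adds no copy in the common case.
safe = np.ascontiguousarray(data.astype(np.float32, copy=False))
print(safe is data)  # True

# A strided view is not C-contiguous; here the wrapper does copy,
# which is exactly the safety net the review asks for.
strided = data[::2]
fixed = np.ascontiguousarray(strided)
print(strided.flags["C_CONTIGUOUS"], fixed.flags["C_CONTIGUOUS"])  # False True
```

Since the wrapper is a no-op for arrays that are already contiguous, applying it uniformly costs nothing on the hot path.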
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/rapids_singlecell/tools/_rank_genes_groups/_wilcoxon.py` around lines 329
- 342, The sparse CSR/CSC components passed into the CUDA kernels
(_ovr.ovr_streaming_csr and its CSC counterpart) should be forced to
C-contiguous arrays to match kernel expectations; wrap X_gpu.data,
X_gpu.indices, and X_gpu.indptr (and the CSC equivalents) with
cp.ascontiguousarray after casting (e.g., X_gpu.data.astype(..., copy=False))
before passing them into the kernel calls so the arrays are guaranteed
contiguous and safe for GPU kernels.

Comment thread src/rapids_singlecell/tools/_rank_genes_groups/_wilcoxon.py
@Intron7
Member Author

Intron7 commented Apr 14, 2026

@coderabbitai review

@coderabbitai

coderabbitai bot commented Apr 14, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

♻️ Duplicate comments (3)
src/rapids_singlecell/_cuda/wilcoxon/wilcoxon_ovr.cu (1)

160-180: ⚠️ Potential issue | 🟠 Major

Missing cudaGetLastError() after kernel launches.

Per coding guidelines, every kernel launch must be followed by cudaGetLastError() to detect launch failures immediately. The current code only checks errors at stream synchronization (lines 201-207), which delays error detection and makes debugging harder.

This pattern repeats throughout all *_impl functions: kernel launches at lines 80, 83-84, 160-161, 177-180, 289-290, 305-308, 408-409, 413-414, 429-432, 581-583, 587-588, 603-606, 725-726, 741-744.

Example fix pattern for one kernel launch
         rank_sums_from_sorted_kernel<<<sb_cols, tpb_rank, smem_rank, stream>>>(
             buf.keys_out, buf.vals_out, group_codes, buf.sub_rank_sums,
             buf.sub_tie_corr, nullptr, nullptr, nullptr, n_rows, sb_cols,
             n_groups, compute_tie_corr, false, use_gmem);
+        cudaError_t err = cudaGetLastError();
+        if (err != cudaSuccess)
+            throw std::runtime_error(
+                std::string("rank_sums_from_sorted_kernel launch failed: ") +
+                cudaGetErrorString(err));
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/rapids_singlecell/_cuda/wilcoxon/wilcoxon_ovr.cu` around lines 160 - 180,
After each device kernel launch and any CUB launch in wilcoxon_ovr.cu (e.g.,
after fill_row_indices_kernel<<<...>>>, after the
cub::DeviceSegmentedRadixSort::SortPairs(...) call, and after
rank_sums_from_sorted_kernel<<<...>>>), insert a call to cudaGetLastError() and
check its return (and log or propagate the error) immediately to detect launch
failures early; repeat this pattern for all similar kernel/CUB launches in the
file (the reviewer listed examples at the start of each *_impl: kernels around
symbols fill_row_indices_kernel, rank_sums_from_sorted_kernel and other kernels
noted in the comment) so each launch is followed by an immediate
cudaGetLastError() check rather than relying solely on stream synchronization.
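The control flow the review asks for — querying the sticky error state right after each launch instead of only at synchronization — can be sketched in plain Python. The `FakeRuntime` below is a toy stand-in for the CUDA runtime (not real CUDA); like `cudaGetLastError`, reading the error also clears it:

```python
class FakeRuntime:
    """Toy stand-in for the CUDA runtime's last-error state (not real CUDA)."""

    def __init__(self):
        self._last_error = None

    def launch(self, kernel_name, ok):
        # A bad launch configuration records an error but does not raise.
        if not ok:
            self._last_error = f"{kernel_name} launch failed"

    def get_last_error(self):
        # Like cudaGetLastError: returns and clears the recorded error.
        err, self._last_error = self._last_error, None
        return err


def checked_launch(rt, kernel_name, ok):
    rt.launch(kernel_name, ok)
    err = rt.get_last_error()  # check immediately, not at a later sync
    if err is not None:
        raise RuntimeError(err)


rt = FakeRuntime()
checked_launch(rt, "fill_row_indices_kernel", ok=True)  # passes silently
try:
    checked_launch(rt, "rank_sums_from_sorted_kernel", ok=False)
except RuntimeError as e:
    print(e)  # rank_sums_from_sorted_kernel launch failed
```

The payoff of the fail-fast check is that the error carries the name of the launch that actually failed, instead of surfacing later at a synchronization point far from the cause.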
.github/workflows/publish.yml (2)

136-136: ⚠️ Potential issue | 🟠 Major

Still bypassing abi3audit on Linux.

Line 136 overrides the Linux repair sequence from pyproject.toml with an auditwheel-only command, so the Limited API check never runs for the published wheels. Either append the same abi3audit step here or move the new exclude list into pyproject.toml and let that table stay the source of truth.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/publish.yml at line 136, The CI overrides the Linux repair
step by setting CIBW_REPAIR_WHEEL_COMMAND to an auditwheel-only command, which
bypasses the abi3 audit in pyproject.toml; restore the Limited API check by
either appending the abi3audit step to CIBW_REPAIR_WHEEL_COMMAND (run abi3audit
after auditwheel repair) or move the custom exclude list into the existing
repair/table in pyproject.toml so the original abi3audit invocation is
preserved; update the CIBW_REPAIR_WHEEL_COMMAND or pyproject.toml accordingly
and ensure the repair sequence still invokes abi3audit for Linux builds.

124-133: ⚠️ Potential issue | 🟠 Major

Still assuming lib64 in the RAPIDS wheels.

Line 129 and Line 130 still hard-code lib64, but CMakeLists.txt:18-75 already treats the installed layout as lib*. If either wheel lands under lib instead, these symlinks break before the extensions link. Resolve the actual .so from lib*/ rather than composing a lib64 path.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/publish.yml around lines 124 - 133, The symlink creation
assumes a lib64 layout and should instead resolve the actual .so under any lib*
directory; update the ln -sf steps that reference RMM_ROOT and LOG_ROOT so they
search for the shared objects (e.g., under "$RMM_ROOT"/lib*/librmm.so and
"$LOG_ROOT"/lib*/librapids_logger.so or by using a find/glob to locate the first
matching .so) and create symlinks to those resolved paths rather than
hard-coding lib64, keeping the same RMM_ROOT and LOG_ROOT variables and the two
ln -sf operations (for librmm and librapids_logger).
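A minimal sketch of the suggested glob-based resolution (Python here for portability; the temp directory is a stand-in for the installed wheel root, and the `librmm.so` name is taken from the workflow):

```python
import glob
import os
import tempfile

# Simulate a wheel that installs under lib/ rather than lib64/.
rmm_root = tempfile.mkdtemp()  # stands in for $RMM_ROOT
os.makedirs(os.path.join(rmm_root, "lib"))
open(os.path.join(rmm_root, "lib", "librmm.so"), "w").close()

# Resolve the actual .so under any lib*/ instead of composing a lib64 path.
matches = sorted(glob.glob(os.path.join(rmm_root, "lib*", "librmm.so")))
target = matches[0] if matches else None
print(os.path.basename(target))  # librmm.so
```

The same pattern works in the workflow's shell with `"$RMM_ROOT"/lib*/librmm.so`, so the symlink step keeps working whichever layout the wheel ships.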
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pyproject.toml`:
- Around line 201-208: The pyproject.toml currently declares librmm under
[tool.uv.extra-build-dependencies], which is ignored by PEP 517 build isolation;
move the librmm requirement into the PEP 517 build requirements by adding
"librmm-cu12>=25.10" to [build-system].requires in pyproject.toml so source
builds (pip install --no-binary) can find headers required by CMakeLists.txt for
the Wilcoxon CUDA extensions, or alternatively update the project installation
docs to explicitly state that librmm (librmm-cu12>=25.10) must be preinstalled
and show using RSC_BUILD_EXTENSIONS=OFF to skip CUDA builds when librmm is not
available.

In `@src/rapids_singlecell/_cuda/wilcoxon/kernels_wilcoxon.cuh`:
- Around line 154-156: The warp buffer offset is wrong when use_gmem is false:
in the compute_tie_corr branch the code sets double* warp_buf = use_gmem ? smem
: smem + n_groups but the shared-memory layout places grp_sums, s_sum, s_sq,
s_nnz (4 arrays of size n_groups) before the warp buffer; update the offset
calculation in the compute_tie_corr block to use smem + 4 * n_groups (keeping
the use_gmem ? smem logic) so warp_buf points after those four arrays (refer to
variables compute_tie_corr, warp_buf, use_gmem, smem, n_groups).

In `@src/rapids_singlecell/_cuda/wilcoxon/wilcoxon_ovr.cu`:
- Around line 540-546: The cudaHostRegister calls for h_data, h_indices,
h_rank_sums, and h_tie_corr are missing error checks; wrap each
cudaHostRegister(...) invocation with a check of the returned cudaError_t and
handle failures (e.g., log the error using cudaGetErrorString(err) and
return/throw an error code) so failures aren’t silent; update the block where
cudaHostRegister is called (references: cudaHostRegister, h_data, h_indices,
h_rank_sums, h_tie_corr) to verify the return value, clean up any previously
pinned memory on failure, and propagate an error up from the function.

---

Duplicate comments:
In @.github/workflows/publish.yml:
- Line 136: The CI overrides the Linux repair step by setting
CIBW_REPAIR_WHEEL_COMMAND to an auditwheel-only command, which bypasses the abi3
audit in pyproject.toml; restore the Limited API check by either appending the
abi3audit step to CIBW_REPAIR_WHEEL_COMMAND (run abi3audit after auditwheel
repair) or move the custom exclude list into the existing repair/table in
pyproject.toml so the original abi3audit invocation is preserved; update the
CIBW_REPAIR_WHEEL_COMMAND or pyproject.toml accordingly and ensure the repair
sequence still invokes abi3audit for Linux builds.
- Around line 124-133: The symlink creation assumes a lib64 layout and should
instead resolve the actual .so under any lib* directory; update the ln -sf steps
that reference RMM_ROOT and LOG_ROOT so they search for the shared objects
(e.g., under "$RMM_ROOT"/lib*/librmm.so and "$LOG_ROOT"/lib*/librapids_logger.so
or by using a find/glob to locate the first matching .so) and create symlinks to
those resolved paths rather than hard-coding lib64, keeping the same RMM_ROOT
and LOG_ROOT variables and the two ln -sf operations (for librmm and
librapids_logger).

In `@src/rapids_singlecell/_cuda/wilcoxon/wilcoxon_ovr.cu`:
- Around line 160-180: After each device kernel launch and any CUB launch in
wilcoxon_ovr.cu (e.g., after fill_row_indices_kernel<<<...>>>, after the
cub::DeviceSegmentedRadixSort::SortPairs(...) call, and after
rank_sums_from_sorted_kernel<<<...>>>), insert a call to cudaGetLastError() and
check its return (and log or propagate the error) immediately to detect launch
failures early; repeat this pattern for all similar kernel/CUB launches in the
file (the reviewer listed examples at the start of each *_impl: kernels around
symbols fill_row_indices_kernel, rank_sums_from_sorted_kernel and other kernels
noted in the comment) so each launch is followed by an immediate
cudaGetLastError() check rather than relying solely on stream synchronization.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 548a08c7-d10f-413f-b6b1-1d702b116125

📥 Commits

Reviewing files that changed from the base of the PR and between 1b926c0 and 68ad73c.

📒 Files selected for processing (8)
  • .github/workflows/publish.yml
  • CMakeLists.txt
  • pyproject.toml
  • src/rapids_singlecell/_cuda/nb_types.h
  • src/rapids_singlecell/_cuda/wilcoxon/kernels_wilcoxon.cuh
  • src/rapids_singlecell/_cuda/wilcoxon/wilcoxon_ovo.cu
  • src/rapids_singlecell/_cuda/wilcoxon/wilcoxon_ovr.cu
  • src/rapids_singlecell/tools/_rank_genes_groups/_wilcoxon.py
✅ Files skipped from review due to trivial changes (2)
  • src/rapids_singlecell/_cuda/nb_types.h
  • src/rapids_singlecell/tools/_rank_genes_groups/_wilcoxon.py
🚧 Files skipped from review as they are similar to previous changes (2)
  • CMakeLists.txt
  • src/rapids_singlecell/_cuda/wilcoxon/wilcoxon_ovo.cu

Comment thread pyproject.toml
Comment on lines +201 to +208
[tool.uv]
# librmm headers are needed at build time for the wilcoxon CUDA kernels.
# The headers are identical across cu12/cu13 — only the .so differs (loaded
# at runtime via librmm.load_library()). cu12 is used here as the build-time
# provider; cu13 envs get the same headers.

[tool.uv.extra-build-dependencies]
rapids-singlecell = [ "librmm-cu12>=25.10" ]

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, let's check the pyproject.toml file at the specified lines
wc -l pyproject.toml



🏁 Script executed:

# Read the relevant section of pyproject.toml
head -n 220 pyproject.toml | tail -n 30



🏁 Script executed:

# Check CMakeLists.txt for the rmm requirements at the specified lines
if [ -f CMakeLists.txt ]; then
  head -n 25 CMakeLists.txt | tail -n 15
fi



🏁 Script executed:

# Check if CMakeLists.txt mentions librmm or rmm
if [ -f CMakeLists.txt ]; then
  rg -n "rmm|librmm" CMakeLists.txt | head -20
fi



🏁 Script executed:

# Check the build-system configuration in pyproject.toml
rg -A 5 "\[build-system\]" pyproject.toml



🏁 Script executed:

# Search for any documentation or comments about rmm requirements
rg -i "rmm|wilcoxon" --type=py --type=toml --type=md | head -30



🏁 Script executed:

# Check for package configuration and publishing info
rg -i "packages|distribution|sdist" pyproject.toml | head -20



🏁 Script executed:

# Look for documentation about build requirements
fd -e md -e rst -e txt | xargs rg -l "librmm|rmm|build.*require" 2>/dev/null | head -10



🏁 Script executed:

# Check README and installation docs
if [ -f README.md ]; then
  head -100 README.md
fi



🏁 Script executed:

# Search for RSC_BUILD_EXTENSIONS usage and documentation
rg -n "RSC_BUILD_EXTENSIONS|BUILD_EXTENSIONS" --type=py --type=toml --type=md



🏁 Script executed:

# Check pyproject.toml for wheel/sdist configuration
cat pyproject.toml | grep -A 10 -B 2 "wheel\|sdist\|package"



Move librmm to a build-system-agnostic configuration or document the requirement explicitly.

The [tool.uv.extra-build-dependencies] entry is ignored by pip and generic PEP 517 build isolation. While the project publishes an sdist and CMakeLists.txt (lines 177–180) requires rmm for the Wilcoxon CUDA extensions, source builds via pip install --no-binary will fail to locate librmm headers during compilation unless the package is manually provided. Either add librmm-cu12>=25.10 to [build-system].requires in pyproject.toml, or clearly document this requirement in installation documentation along with guidance on using the RSC_BUILD_EXTENSIONS=OFF flag to skip CUDA extension builds when librmm is unavailable.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pyproject.toml` around lines 201 - 208, The pyproject.toml currently declares
librmm under [tool.uv.extra-build-dependencies], which is ignored by PEP 517
build isolation; move the librmm requirement into the PEP 517 build requirements
by adding "librmm-cu12>=25.10" to [build-system].requires in pyproject.toml so
source builds (pip install --no-binary) can find headers required by
CMakeLists.txt for the Wilcoxon CUDA extensions, or alternatively update the
project installation docs to explicitly state that librmm (librmm-cu12>=25.10)
must be preinstalled and show using RSC_BUILD_EXTENSIONS=OFF to skip CUDA builds
when librmm is not available.

Comment on lines +154 to +156
if (compute_tie_corr) {
    // Warp buf always in shared memory (32 doubles — always fits)
    double* warp_buf = use_gmem ? smem : smem + n_groups;

⚠️ Potential issue | 🔴 Critical

Incorrect warp buffer offset in shared memory path.

When use_gmem=false, the shared memory layout allocates 4 arrays of size n_groups (grp_sums, s_sum, s_sq, s_nnz) before the warp buffer. The offset calculation should be smem + 4 * n_groups, not smem + n_groups.

🐛 Proposed fix
     if (compute_tie_corr) {
         // Warp buf always in shared memory (32 doubles — always fits)
-        double* warp_buf = use_gmem ? smem : smem + n_groups;
+        double* warp_buf = use_gmem ? smem : smem + 4 * n_groups;
 #pragma unroll
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/rapids_singlecell/_cuda/wilcoxon/kernels_wilcoxon.cuh` around lines 154 -
156, The warp buffer offset is wrong when use_gmem is false: in the
compute_tie_corr branch the code sets double* warp_buf = use_gmem ? smem : smem
+ n_groups but the shared-memory layout places grp_sums, s_sum, s_sq, s_nnz (4
arrays of size n_groups) before the warp buffer; update the offset calculation
in the compute_tie_corr block to use smem + 4 * n_groups (keeping the use_gmem ?
smem logic) so warp_buf points after those four arrays (refer to variables
compute_tie_corr, warp_buf, use_gmem, smem, n_groups).
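The fix is easiest to check by writing out the shared-memory map. Here is a small Python sketch (offsets in doubles; array names taken from the kernel) of the layout described above, where four `n_groups`-sized arrays precede the warp buffer:

```python
def smem_offsets(n_groups: int) -> dict:
    """Offsets (in doubles) into the dynamic shared-memory block when
    use_gmem is false: four n_groups-sized arrays precede the warp buffer."""
    offsets, cursor = {}, 0
    for name in ("grp_sums", "s_sum", "s_sq", "s_nnz"):
        offsets[name] = cursor
        cursor += n_groups
    offsets["warp_buf"] = cursor  # 4 * n_groups — not n_groups
    return offsets


layout = smem_offsets(8)
print(layout["warp_buf"])  # 32, i.e. 4 * n_groups
```

With the buggy `smem + n_groups` offset, the warp buffer would alias `s_sum`, silently corrupting the per-group accumulators.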

Comment thread src/rapids_singlecell/_cuda/wilcoxon/wilcoxon_ovr.cu
Comment on lines +540 to +546
cudaHostRegister(const_cast<float*>(h_data),
                 (size_t)h_indptr[n_cols] * sizeof(float), 0);
cudaHostRegister(const_cast<int*>(h_indices),
                 (size_t)h_indptr[n_cols] * sizeof(int), 0);
cudaHostRegister(h_rank_sums, (size_t)n_groups * n_cols * sizeof(double),
                 0);
cudaHostRegister(h_tie_corr, n_cols * sizeof(double), 0);

⚠️ Potential issue | 🟡 Minor

Missing error checks for cudaHostRegister.

cudaHostRegister can fail (e.g., if memory is already pinned, or the system doesn't support pinning). Without error checking, failures are silent and async transfers may not work as expected.

Suggested fix
     // Pin host memory for async transfers
-    cudaHostRegister(const_cast<float*>(h_data),
-                     (size_t)h_indptr[n_cols] * sizeof(float), 0);
-    cudaHostRegister(const_cast<int*>(h_indices),
-                     (size_t)h_indptr[n_cols] * sizeof(int), 0);
-    cudaHostRegister(h_rank_sums, (size_t)n_groups * n_cols * sizeof(double),
-                     0);
-    cudaHostRegister(h_tie_corr, n_cols * sizeof(double), 0);
+    cudaError_t pin_err;
+    pin_err = cudaHostRegister(const_cast<float*>(h_data),
+                     (size_t)h_indptr[n_cols] * sizeof(float), 0);
+    // Note: cudaErrorHostMemoryAlreadyRegistered is acceptable
+    if (pin_err != cudaSuccess && pin_err != cudaErrorHostMemoryAlreadyRegistered)
+        throw std::runtime_error("Failed to pin h_data: " + 
+            std::string(cudaGetErrorString(pin_err)));
+    // ... similar for other registers
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/rapids_singlecell/_cuda/wilcoxon/wilcoxon_ovr.cu` around lines 540 - 546,
The cudaHostRegister calls for h_data, h_indices, h_rank_sums, and h_tie_corr
are missing error checks; wrap each cudaHostRegister(...) invocation with a
check of the returned cudaError_t and handle failures (e.g., log the error using
cudaGetErrorString(err) and return/throw an error code) so failures aren’t
silent; update the block where cudaHostRegister is called (references:
cudaHostRegister, h_data, h_indices, h_rank_sums, h_tie_corr) to verify the
return value, clean up any previously pinned memory on failure, and propagate an
error up from the function.

@Intron7
Member Author

Intron7 commented Apr 15, 2026

@coderabbitai review

@coderabbitai

coderabbitai bot commented Apr 15, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.


@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (2)
src/rapids_singlecell/_cuda/wilcoxon/wilcoxon_ovr.cu (2)

686-692: CSR histogram kernel lacks cudaGetLastError() check.

The csr_col_histogram_kernel launch at lines 690-691 doesn't check for launch errors. If the kernel fails, the subsequent cudaMemcpy to h_col_counts would silently read uninitialized or incorrect data.

Suggested fix
         csr_col_histogram_kernel<<<blocks, tpb>>>(csr_indices, csr_indptr,
                                                   d_col_counts, n_rows, n_cols);
+        cudaError_t err = cudaGetLastError();
+        if (err != cudaSuccess)
+            throw std::runtime_error(
+                std::string("csr_col_histogram_kernel launch failed: ") +
+                cudaGetErrorString(err));
     }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/rapids_singlecell/_cuda/wilcoxon/wilcoxon_ovr.cu` around lines 686 - 692,
After launching csr_col_histogram_kernel(...) add a cudaGetLastError() (and
optionally cudaDeviceSynchronize()) check immediately after the kernel launch
and before the subsequent cudaMemcpy to h_col_counts; if an error is returned,
log/return the CUDA error (or propagate it) rather than proceeding to
cudaMemcpy. Ensure the check references csr_col_histogram_kernel launch and the
device buffer d_col_counts (and host h_col_counts) so failures are detected and
handled before copying results back to host.

182-203: Consider adding cudaGetLastError() after kernel launches.

While stream synchronization at lines 225-230 will surface async errors, adding cudaGetLastError() immediately after the kernel launches (lines 183-184, 200-203) would catch launch-configuration errors (e.g., an invalid shared-memory size) earlier, making debugging easier.

This is a minor improvement since errors are eventually caught at sync.

Suggested improvement
         fill_row_indices_kernel<<<sb_cols, 256, 0, stream>>>(buf.vals_in,
                                                              n_rows, sb_cols);
+        if (cudaGetLastError() != cudaSuccess)
+            throw std::runtime_error("fill_row_indices_kernel launch failed");
 
         // Sort: keys = block columns [col, col+sb_cols), already F-order
         const float* keys_in = block + (long long)col * n_rows;
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/rapids_singlecell/_cuda/wilcoxon/wilcoxon_ovr.cu` around lines 182 - 203,
After the launches of fill_row_indices_kernel and rank_sums_from_sorted_kernel
(and optionally after the cub::DeviceSegmentedRadixSort::SortPairs call),
immediately call cudaGetLastError() and handle/report any non-success result so
launch-time failures (e.g., invalid <<<...>>> or shared memory size) are caught
earlier; locate the kernel launches by the symbols fill_row_indices_kernel and
rank_sums_from_sorted_kernel in wilcoxon_ovr.cu and add a cudaGetLastError()
check right after each launch (or use your project's CUDA error-check macro) to
log/propagate the launch error before later stream synchronization.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@src/rapids_singlecell/_cuda/wilcoxon/wilcoxon_ovr.cu`:
- Around line 686-692: After launching csr_col_histogram_kernel(...) add a
cudaGetLastError() (and optionally cudaDeviceSynchronize()) check immediately
after the kernel launch and before the subsequent cudaMemcpy to h_col_counts; if
an error is returned, log/return the CUDA error (or propagate it) rather than
proceeding to cudaMemcpy. Ensure the check references csr_col_histogram_kernel
launch and the device buffer d_col_counts (and host h_col_counts) so failures
are detected and handled before copying results back to host.
- Around line 182-203: After the launches of fill_row_indices_kernel and
rank_sums_from_sorted_kernel (and optionally after the
cub::DeviceSegmentedRadixSort::SortPairs call), immediately call
cudaGetLastError() and handle/report any non-success result so launch-time
failures (e.g., invalid <<<...>>> or shared memory size) are caught earlier;
locate the kernel launches by the symbols fill_row_indices_kernel and
rank_sums_from_sorted_kernel in wilcoxon_ovr.cu and add a cudaGetLastError()
check right after each launch (or use your project's CUDA error-check macro) to
log/propagate the launch error before later stream synchronization.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: a64a3c4e-ea59-4657-95d3-8688ccbed35a

📥 Commits

Reviewing files that changed from the base of the PR and between 68ad73c and 7559778.

📒 Files selected for processing (12)
  • .github/workflows/publish.yml
  • pyproject.toml
  • src/rapids_singlecell/_cuda/__init__.py
  • src/rapids_singlecell/_cuda/wilcoxon/kernels_wilcoxon.cuh
  • src/rapids_singlecell/_cuda/wilcoxon/wilcoxon_common.cuh
  • src/rapids_singlecell/_cuda/wilcoxon/wilcoxon_ovr.cu
  • src/rapids_singlecell/tools/_rank_genes_groups/__init__.py
  • src/rapids_singlecell/tools/_rank_genes_groups/_core.py
  • src/rapids_singlecell/tools/_rank_genes_groups/_utils.py
  • src/rapids_singlecell/tools/_rank_genes_groups/_wilcoxon.py
  • tests/test_rank_genes_groups_wilcoxon.py
  • tests/test_rank_genes_groups_wilcoxon_binned.py
💤 Files with no reviewable changes (1)
  • src/rapids_singlecell/tools/_rank_genes_groups/_core.py
✅ Files skipped from review due to trivial changes (1)
  • src/rapids_singlecell/tools/_rank_genes_groups/__init__.py
🚧 Files skipped from review as they are similar to previous changes (4)
  • src/rapids_singlecell/_cuda/__init__.py
  • .github/workflows/publish.yml
  • pyproject.toml
  • src/rapids_singlecell/tools/_rank_genes_groups/_utils.py

