Add segmented_reduce python api by oleksandr-pavlyk · Pull Request #3906 · NVIDIA/cccl

oleksandr-pavlyk · 2025-02-21T22:30:03Z

Description

This PR adds a Python API for the segmented_reduce algorithm.
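For readers unfamiliar with the operation, here is a minimal host-side sketch of what a segmented reduction computes. It is plain Python, not the cuda.parallel API added by this PR; the function name `segmented_reduce_reference` and its parameters are assumptions made for illustration only.

```python
from functools import reduce


def segmented_reduce_reference(values, start_offsets, end_offsets, op, init):
    """Reduce values[start:end] with `op` for each (start, end) pair.

    Hypothetical host reference, not the GPU implementation.
    An empty segment yields `init`.
    """
    if len(start_offsets) != len(end_offsets):
        raise ValueError("start_offsets and end_offsets must have equal length")
    return [
        reduce(op, values[begin:end], init)
        for begin, end in zip(start_offsets, end_offsets)
    ]


# Row sums of a 3x4 matrix stored in row-major order.
n_rows, n_cols = 3, 4
flat = list(range(n_rows * n_cols))
offsets = [row * n_cols for row in range(n_rows + 1)]
# End offsets are the start offsets shifted by one position.
row_sums = segmented_reduce_reference(
    flat, offsets[:-1], offsets[1:], lambda a, b: a + b, 0
)
print(row_sums)
```

Each segment is described by one start offset and one end offset, so consecutive segments can share a single offsets sequence.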

Checklist

New or existing tests cover these changes.
The documentation is up to date with these changes.

Also avoid recomputing cccl_value of init in both segmented_reduce and in reduce

1. Include np.complex64.
2. Store the device output size in a variable and reuse it to avoid repeated occurrences of literal values.
3. Generate real/imag values for complex arrays in a single call to the sampling function for efficiency.
4. Change the range of generated integral arrays based on the signedness of the integral data type. For unsigned types we continue to sample in the interval [0, 10); for signed types we sample from [-5, 5].
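The sampling rules in the list above could look roughly like the sketch below. This is not the fixture from conftest.py: the helper name `sample_array`, the `seed` parameter, and the use of `np.random.default_rng` are assumptions chosen for a self-contained example.

```python
import numpy as np


def sample_array(dtype, sample_size, seed=0):
    """Hypothetical sampler following the rules listed above."""
    rng = np.random.default_rng(seed)
    dtype = np.dtype(dtype)
    if np.issubdtype(dtype, np.complexfloating):
        # One sampling call yields both components:
        # row 0 holds the real parts, row 1 the imaginary parts.
        parts = rng.random((2, sample_size))
        return (parts[0] + 1j * parts[1]).astype(dtype)
    if np.issubdtype(dtype, np.unsignedinteger):
        # Unsigned types: half-open interval [0, 10).
        return rng.integers(0, 10, size=sample_size).astype(dtype)
    if np.issubdtype(dtype, np.signedinteger):
        # Signed types: closed interval [-5, 5] (upper bound is exclusive).
        return rng.integers(-5, 6, size=sample_size).astype(dtype)
    return rng.random(sample_size).astype(dtype)
```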

copy-pr-bot · 2025-02-21T22:30:06Z

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

…e base Additionally, changed the __hash__ of IteratorKind to mix the hash of its value with hash of self.__class__.
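A minimal sketch of the hashing idea, assuming a simplified kind class with a single `value` attribute. The class names below are hypothetical stand-ins and do not reproduce the classes in _iterators.py.

```python
class KindSketch:
    """Hypothetical base class: equality and hash both account for the concrete type."""

    def __init__(self, value):
        self.value = value

    def __eq__(self, other):
        return type(self) is type(other) and self.value == other.value

    def __hash__(self):
        # Mix the class into the hash so that two different kinds
        # wrapping the same value are unlikely to collide.
        return hash((type(self), self.value))


class PointerKindSketch(KindSketch):
    pass


class TransformKindSketch(KindSketch):
    # No __eq__/__hash__ override needed: the base already
    # distinguishes subclasses through type(self).
    pass


kinds = {PointerKindSketch("int32"), TransformKindSketch("int32")}
print(len(kinds))
```

Because the base class already separates subclasses, a derived kind only needs its own `__eq__`/`__hash__` when it carries extra state.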

This is used to advance a given iterator `it` by `offset` steps without running into multiple definitions of the advance/dereference methods.

…d_iterator

This calls IteratorBase.__add__ to produce an iterator whose state is advanced by 1, but which shares the same advance/dereference methods.
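The behavior described here can be illustrated with a small pure-Python sketch. `IteratorSketch` is a hypothetical stand-in, not the IteratorBase class from this PR; it only shows the idea of advancing the state of a copy while the advance/dereference methods stay shared.

```python
import copy


class IteratorSketch:
    """Hypothetical iterator: integer state plus advance/dereference methods."""

    def __init__(self, state):
        self.state = state

    @staticmethod
    def advance(state, distance):
        return state + distance

    @staticmethod
    def dereference(state):
        return state

    def __add__(self, offset):
        # Copy the iterator and advance only the copy's state.
        # The methods live on the class, so both objects share them.
        advanced = copy.copy(self)
        advanced.state = type(self).advance(self.state, offset)
        return advanced


start_offsets = IteratorSketch(0)
end_offsets = start_offsets + 1
print(start_offsets.state, end_offsets.state)
```

The original iterator is left untouched, so `start_offsets` and `end_offsets` can be passed side by side to the same algorithm.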

github-actions · 2025-02-24T23:41:30Z

🟩 CI finished in 40m 15s: Pass: 100%/1 | Total: 40m 15s | Avg: 40m 15s | Max: 40m 15s

🟩 python: Pass: 100%/1 | Total: 40m 15s | Avg: 40m 15s | Max: 40m 15s

🟩 cpu
  🟩 amd64              Pass: 100%/1   | Total: 40m 15s | Avg: 40m 15s | Max: 40m 15s
🟩 ctk
  🟩 12.8               Pass: 100%/1   | Total: 40m 15s | Avg: 40m 15s | Max: 40m 15s
🟩 cudacxx
  🟩 nvcc12.8           Pass: 100%/1   | Total: 40m 15s | Avg: 40m 15s | Max: 40m 15s
🟩 cudacxx_family
  🟩 nvcc               Pass: 100%/1   | Total: 40m 15s | Avg: 40m 15s | Max: 40m 15s
🟩 cxx
  🟩 GCC13              Pass: 100%/1   | Total: 40m 15s | Avg: 40m 15s | Max: 40m 15s
🟩 cxx_family
  🟩 GCC                Pass: 100%/1   | Total: 40m 15s | Avg: 40m 15s | Max: 40m 15s
🟩 gpu
  🟩 rtx2080            Pass: 100%/1   | Total: 40m 15s | Avg: 40m 15s | Max: 40m 15s
🟩 jobs
  🟩 Test               Pass: 100%/1   | Total: 40m 15s | Avg: 40m 15s | Max: 40m 15s

👃 Inspect Changes

Modifications in project?

	Project
	CCCL Infrastructure
	libcu++
	CUB
	Thrust
	CUDA Experimental
+/-	python
	CCCL C Parallel Library
	Catch2Helper

Modifications in project or dependencies?

	Project
	CCCL Infrastructure
	libcu++
	CUB
	Thrust
	CUDA Experimental
+/-	python
	CCCL C Parallel Library
	Catch2Helper

🏃‍ Runner counts (total jobs: 1)

#	Runner
1	`linux-amd64-gpu-rtx2080-latest-1`

github-actions · 2025-02-25T18:50:38Z

🟩 CI finished in 40m 27s: Pass: 100%/1 | Total: 40m 27s | Avg: 40m 27s | Max: 40m 27s

🟩 python: Pass: 100%/1 | Total: 40m 27s | Avg: 40m 27s | Max: 40m 27s

🟩 cpu
  🟩 amd64              Pass: 100%/1   | Total: 40m 27s | Avg: 40m 27s | Max: 40m 27s
🟩 ctk
  🟩 12.8               Pass: 100%/1   | Total: 40m 27s | Avg: 40m 27s | Max: 40m 27s
🟩 cudacxx
  🟩 nvcc12.8           Pass: 100%/1   | Total: 40m 27s | Avg: 40m 27s | Max: 40m 27s
🟩 cudacxx_family
  🟩 nvcc               Pass: 100%/1   | Total: 40m 27s | Avg: 40m 27s | Max: 40m 27s
🟩 cxx
  🟩 GCC13              Pass: 100%/1   | Total: 40m 27s | Avg: 40m 27s | Max: 40m 27s
🟩 cxx_family
  🟩 GCC                Pass: 100%/1   | Total: 40m 27s | Avg: 40m 27s | Max: 40m 27s
🟩 gpu
  🟩 rtx2080            Pass: 100%/1   | Total: 40m 27s | Avg: 40m 27s | Max: 40m 27s
🟩 jobs
  🟩 Test               Pass: 100%/1   | Total: 40m 27s | Avg: 40m 27s | Max: 40m 27s

python/cuda_parallel/cuda/parallel/experimental/algorithms/segmented_reduce.py

python/cuda_parallel/cuda/parallel/experimental/iterators/_iterators.py

python/cuda_parallel/tests/conftest.py

Also make generation of the complex array in test_reduce.py more efficient by generating real and imaginary components in a single call to np.random.random instead of using two calls.

These were only defined for TransformIterator and AdvancedIterator classes, but not for other classes. Implemented review suggestion to use type(self) instead of self.__class__.

…cumulation For short range data types we take a small slice of the input array to avoid running into the overflow problem. This works because input_array fixture samples from uniform discrete distribution with small upper range (8), hence using 31 uint8 elements can run up to 31 * 7 = 217 ( < 255) and fits in the type.
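The bound quoted above can be checked with a few lines of arithmetic. The helper name `max_safe_length` is an assumption for illustration; the numbers (exclusive upper bound 8, slice of 31 elements, uint8) come from the commit message.

```python
def max_safe_length(type_max, max_sample):
    """Largest element count whose worst-case sum still fits in the type."""
    return type_max // max_sample


UINT8_MAX = 2**8 - 1          # 255
upper_exclusive = 8           # samples are drawn from [0, 8)
max_sample = upper_exclusive - 1

worst_case_sum = 31 * max_sample
print(worst_case_sum, worst_case_sum <= UINT8_MAX)
print(max_safe_length(UINT8_MAX, max_sample))
```

With these values a slice of 31 elements leaves some headroom, since up to 36 elements would still fit in uint8.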

…d_reduce.py

github-actions · 2025-02-26T16:17:31Z

🟩 CI finished in 40m 55s: Pass: 100%/1 | Total: 40m 55s | Avg: 40m 55s | Max: 40m 55s

🟩 python: Pass: 100%/1 | Total: 40m 55s | Avg: 40m 55s | Max: 40m 55s

🟩 cpu
  🟩 amd64              Pass: 100%/1   | Total: 40m 55s | Avg: 40m 55s | Max: 40m 55s
🟩 ctk
  🟩 12.8               Pass: 100%/1   | Total: 40m 55s | Avg: 40m 55s | Max: 40m 55s
🟩 cudacxx
  🟩 nvcc12.8           Pass: 100%/1   | Total: 40m 55s | Avg: 40m 55s | Max: 40m 55s
🟩 cudacxx_family
  🟩 nvcc               Pass: 100%/1   | Total: 40m 55s | Avg: 40m 55s | Max: 40m 55s
🟩 cxx
  🟩 GCC13              Pass: 100%/1   | Total: 40m 55s | Avg: 40m 55s | Max: 40m 55s
🟩 cxx_family
  🟩 GCC                Pass: 100%/1   | Total: 40m 55s | Avg: 40m 55s | Max: 40m 55s
🟩 gpu
  🟩 rtx2080            Pass: 100%/1   | Total: 40m 55s | Avg: 40m 55s | Max: 40m 55s
🟩 jobs
  🟩 Test               Pass: 100%/1   | Total: 40m 55s | Avg: 40m 55s | Max: 40m 55s

This finds compute capability and include paths and appends them to the algorithm-specific arguments. Used the utility in segmented_reduce.
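A rough sketch of the wrapper pattern described above, under stated assumptions: every name below (`call_build_sketch`, `query_compute_capability_stub`, `find_include_paths_stub`, `fake_build`) is hypothetical, the stubs return fixed placeholder values, and the argument order is a guess rather than the signature of `_bindings.call_build`.

```python
def query_compute_capability_stub():
    # Placeholder: a real utility would query the current device.
    return 7, 5


def find_include_paths_stub():
    # Placeholder paths; a real utility would locate the needed headers.
    return ["/opt/include/a", "/opt/include/b"]


def call_build_sketch(build_impl, *args, **kwargs):
    """Forward algorithm-specific arguments, then append common build arguments."""
    cc_major, cc_minor = query_compute_capability_stub()
    include_paths = find_include_paths_stub()
    return build_impl(*args, cc_major, cc_minor, *include_paths, **kwargs)


def fake_build(*args, **kwargs):
    # Stand-in for an algorithm's build entry point; records what it received.
    return args, kwargs


received_args, received_kwargs = call_build_sketch(
    fake_build, "d_in", "d_out", op="add"
)
print(received_args, received_kwargs)
```

Centralizing the lookup means each algorithm passes only its own arguments, which is what allows the same helper to be reused across algorithms.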

rwgk

I really like your last commit 13d8d19!

python/cuda_parallel/cuda/parallel/experimental/_bindings.py

github-actions · 2025-02-26T17:59:39Z

🟩 CI finished in 52m 20s: Pass: 100%/1 | Total: 52m 20s | Avg: 52m 20s | Max: 52m 20s

🟩 python: Pass: 100%/1 | Total: 52m 20s | Avg: 52m 20s | Max: 52m 20s

🟩 cpu
  🟩 amd64              Pass: 100%/1   | Total: 52m 20s | Avg: 52m 20s | Max: 52m 20s
🟩 ctk
  🟩 12.8               Pass: 100%/1   | Total: 52m 20s | Avg: 52m 20s | Max: 52m 20s
🟩 cudacxx
  🟩 nvcc12.8           Pass: 100%/1   | Total: 52m 20s | Avg: 52m 20s | Max: 52m 20s
🟩 cudacxx_family
  🟩 nvcc               Pass: 100%/1   | Total: 52m 20s | Avg: 52m 20s | Max: 52m 20s
🟩 cxx
  🟩 GCC13              Pass: 100%/1   | Total: 52m 20s | Avg: 52m 20s | Max: 52m 20s
🟩 cxx_family
  🟩 GCC                Pass: 100%/1   | Total: 52m 20s | Avg: 52m 20s | Max: 52m 20s
🟩 gpu
  🟩 rtx2080            Pass: 100%/1   | Total: 52m 20s | Avg: 52m 20s | Max: 52m 20s
🟩 jobs
  🟩 Test               Pass: 100%/1   | Total: 52m 20s | Avg: 52m 20s | Max: 52m 20s

github-actions · 2025-02-26T22:47:05Z

🟩 CI finished in 51m 12s: Pass: 100%/1 | Total: 51m 12s | Avg: 51m 12s | Max: 51m 12s

🟩 python: Pass: 100%/1 | Total: 51m 12s | Avg: 51m 12s | Max: 51m 12s

🟩 cpu
  🟩 amd64              Pass: 100%/1   | Total: 51m 12s | Avg: 51m 12s | Max: 51m 12s
🟩 ctk
  🟩 12.8               Pass: 100%/1   | Total: 51m 12s | Avg: 51m 12s | Max: 51m 12s
🟩 cudacxx
  🟩 nvcc12.8           Pass: 100%/1   | Total: 51m 12s | Avg: 51m 12s | Max: 51m 12s
🟩 cudacxx_family
  🟩 nvcc               Pass: 100%/1   | Total: 51m 12s | Avg: 51m 12s | Max: 51m 12s
🟩 cxx
  🟩 GCC13              Pass: 100%/1   | Total: 51m 12s | Avg: 51m 12s | Max: 51m 12s
🟩 cxx_family
  🟩 GCC                Pass: 100%/1   | Total: 51m 12s | Avg: 51m 12s | Max: 51m 12s
🟩 gpu
  🟩 rtx2080            Pass: 100%/1   | Total: 51m 12s | Avg: 51m 12s | Max: 51m 12s
🟩 jobs
  🟩 Test               Pass: 100%/1   | Total: 51m 12s | Avg: 51m 12s | Max: 51m 12s

github-actions · 2025-02-28T03:46:52Z

🟩 CI finished in 52m 35s: Pass: 100%/1 | Total: 52m 35s | Avg: 52m 35s | Max: 52m 35s

🟩 python: Pass: 100%/1 | Total: 52m 35s | Avg: 52m 35s | Max: 52m 35s

🟩 cpu
  🟩 amd64              Pass: 100%/1   | Total: 52m 35s | Avg: 52m 35s | Max: 52m 35s
🟩 ctk
  🟩 12.8               Pass: 100%/1   | Total: 52m 35s | Avg: 52m 35s | Max: 52m 35s
🟩 cudacxx
  🟩 nvcc12.8           Pass: 100%/1   | Total: 52m 35s | Avg: 52m 35s | Max: 52m 35s
🟩 cudacxx_family
  🟩 nvcc               Pass: 100%/1   | Total: 52m 35s | Avg: 52m 35s | Max: 52m 35s
🟩 cxx
  🟩 GCC13              Pass: 100%/1   | Total: 52m 35s | Avg: 52m 35s | Max: 52m 35s
🟩 cxx_family
  🟩 GCC                Pass: 100%/1   | Total: 52m 35s | Avg: 52m 35s | Max: 52m 35s
🟩 gpu
  🟩 rtx2080            Pass: 100%/1   | Total: 52m 35s | Avg: 52m 35s | Max: 52m 35s
🟩 jobs
  🟩 Test               Pass: 100%/1   | Total: 52m 35s | Avg: 52m 35s | Max: 52m 35s

shwina

Looks great - thank you, Sasha!

…A#3968)

* Add algorithms.segmented_reduce Python API. Also avoid recomputing cccl_value of init in both segmented_reduce and in reduce.
* Change to input_array fixture:
  1. Include np.complex64.
  2. Store the device output size in a variable and reuse it to avoid repeated occurrences of literal values.
  3. Generate real/imag values for complex arrays in a single call to the sampling function for efficiency.
  4. Change the range of generated integral arrays based on the signedness of the integral data type. For unsigned types we continue to sample in the interval [0, 10); for signed types we sample from [-5, 5].
* Corrected docstring of segmented_reduce function
* Add initial tests for segmented_reduce
* Improve readability of test_segmented_reduce_api example
* TransformIteratorKind need not override __eq__/__hash__ methods of the base. Additionally, changed the __hash__ of IteratorKind to mix the hash of its value with hash of self.__class__.
* Add AdvancedIterator(it, offset=1) function. This is used to advance a given iterator `it` by `offset` steps without running into multiple definitions of the advance/dereference methods.
* Add example for summing rows of a matrix using segmented_reduce
* Implement IteratorBase.__add__(self, offset : int) using make_advanced_iterator
* Use end_offsets = start_offsets + 1. This calls IteratorBase.__add__ to produce an iterator whose state is advanced by 1, but which shares the same advance/dereference methods.
* Add a test for segmented_reduce on gpu_struct
* Change hash of transform iterator to mix its kind
* Rename variable n to sample_size. Also make generation of the complex array in test_reduce.py more efficient by generating real and imaginary components in a single call to np.random.random instead of using two calls.
* Remove __hash__ and __eq__ special methods from some iterator classes. These were only defined for TransformIterator and AdvancedIterator classes, but not for other classes. Implemented review suggestion to use type(self) instead of self.__class__.
* Tweak test_scan_array_input to avoid integer overflows during host accumulation. For short range data types we take a small slice of the input array to avoid running into the overflow problem. This works because input_array fixture samples from uniform discrete distribution with small upper range (8), hence using 31 uint8 elements can run up to 31 * 7 = 217 ( < 255) and fits in the type.
* Add cccl.set_cccl_iterator_state utility function and use in segmented_reduce.py
* Introduce _bindings.call_build utility. This finds compute capability and include paths and appends them to the algorithm-specific arguments. Used the utility in segmented_reduce.
* Make call_build take *args, **kwargs

…A#3968)

oleksandr-pavlyk added 4 commits February 21, 2025 16:23

Add algorithms.segmented_reduce Python API

9672a22

Also avoid recomputing cccl_value of init in both segmented_reduce and in reduce

Corrected docstring of segmented_reduce function

ae9ee6f

Add initial tests for segmented_reduce

ad3b103

oleksandr-pavlyk added 7 commits February 24, 2025 08:10

Improve readability of test_segmented_reduce_api example

6937a17

TransformIteratorKind need not override __eq__/__hash__ methods of th…

2753cf5

…e base Additionally, changed the __hash__ of IteratorKind to mix the hash of its value with hash of self.__class__.

Add AdvancedIterator(it, offset=1) function

5c0ce63

This is used to advance a given iterator `it` by `offset` steps without running into multiple definitions of the advance/dereference methods.

Add example for summing rows of a matrix using segmented_reduce

bb10d46

Implement IteratorBase.__add__(self, offset : int) using make_advance…

799267a

…d_iterator

Use end_offsets = start_offsets + 1

57fed46

This calls IteratorBase.__add__ to produce an iterator whose state is advanced by 1, but which shares the same advance/dereference methods.

Add a test for segmented_reduce on gpu_struct

b96e9e2

oleksandr-pavlyk marked this pull request as ready for review February 24, 2025 22:58

oleksandr-pavlyk requested a review from a team as a code owner February 24, 2025 22:58

oleksandr-pavlyk requested a review from rwgk February 24, 2025 22:58

Merge branch 'main' into add-segmented-reduce-python-api

c651a67

oleksandr-pavlyk requested review from leofang and shwina February 24, 2025 22:59

Change hash of transform iterator to mix its kind

ed864d7

oleksandr-pavlyk force-pushed the add-segmented-reduce-python-api branch from 73e2154 to ed864d7 Compare February 25, 2025 18:07

rwgk approved these changes Feb 25, 2025

oleksandr-pavlyk added 4 commits February 26, 2025 08:29

Rename variable n to sample_size

2a83978

Also make generation of the complex array in test_reduce.py more efficient by generating real and imaginary components in a single call to np.random.random instead of using two calls.

Remove __hash__ and __eq__ special methods from some iterator classes

15a3012

These were only defined for TransformIterator and AdvancedIterator classes, but not for other classes. Implemented review suggestion to use type(self) instead of self.__class__.

Add cccl.set_cccl_iterator_state utility function and use in segmente…

d6d39fa

…d_reduce.py

oleksandr-pavlyk requested a review from rwgk February 26, 2025 15:33

rwgk approved these changes Feb 26, 2025

oleksandr-pavlyk added 2 commits February 26, 2025 10:42

Introduce _bindings.call_build utility

13d8d19

This finds compute capability and include paths and appends them to the algorithm-specific arguments. Used the utility in segmented_reduce.

Merge branch 'main' into add-segmented-reduce-python-api

9f65dee

rwgk approved these changes Feb 26, 2025

shwina reviewed Feb 26, 2025

python/cuda_parallel/cuda/parallel/experimental/_bindings.py (outdated; resolved)

Make call_build take *args, **kwargs

ecfca41

oleksandr-pavlyk requested a review from shwina February 26, 2025 22:16

Merge branch 'main' into add-segmented-reduce-python-api

af23cca

shwina approved these changes Feb 28, 2025

shwina merged commit 0183959 into NVIDIA:main Feb 28, 2025
16 of 19 checks passed

github-project-automation bot moved this from In Review to Done in CCCL Feb 28, 2025

oleksandr-pavlyk deleted the add-segmented-reduce-python-api branch February 28, 2025 16:44

oleksandr-pavlyk mentioned this pull request Feb 28, 2025

cuda-parallel: Apply utilities introduced in gh-3906 to other algorithms #3967

Closed

oleksandr-pavlyk added a commit to oleksandr-pavlyk/cccl that referenced this pull request Feb 28, 2025

Change to apply utilities added in NVIDIAgh-3906 in algorithms

a3d994e

oleksandr-pavlyk added a commit that referenced this pull request Mar 11, 2025

Change to apply utilities added in gh-3906 in algorithms (#3968)

35cc4d0

davebayer pushed a commit to davebayer/cccl that referenced this pull request Mar 12, 2025

Change to apply utilities added in NVIDIAgh-3906 in algorithms (NVIDI…

e7f625d

…A#3968)

bernhardmgruber pushed a commit to bernhardmgruber/cccl that referenced this pull request Mar 13, 2025

Change to apply utilities added in NVIDIAgh-3906 in algorithms (NVIDI…

48db0ab

…A#3968)

davebayer pushed a commit to davebayer/cccl that referenced this pull request Apr 7, 2025

Change to apply utilities added in NVIDIAgh-3906 in algorithms (NVIDI…

fb3ea35

…A#3968)

3 participants