cuco::bloom_filter#101

Closed
sleeepyjack wants to merge 36 commits into NVIDIA:dev from sleeepyjack:bloom-filter

@sleeepyjack
Collaborator

@sleeepyjack sleeepyjack commented Aug 9, 2021

Adds a new class called cuco::bloom_filter for approximate set membership queries.

It is used to test whether an element is a member of a set. False positive matches are possible, but false negatives are not – in other words, a query returns either "possibly in set" or "definitely not in set". Elements can be added to the set, but not removed; the more items added, the larger the probability of false positives.

The implementation used here is known as a "partitioned" or "pattern-blocked" Bloom filter.
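
To illustrate the idea, here is a minimal host-side sketch of a blocked filter in plain C++. It is not the `cuco::bloom_filter` API: the class name `blocked_bloom_filter`, the SplitMix64-style `mix` function used as a stand-in hash, and the 64-bit slot width are all assumptions made for this example. Each key selects exactly one slot and sets a small bit pattern inside it, so a query touches a single word.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical host-side sketch; NOT the cuco::bloom_filter interface.
class blocked_bloom_filter {
 public:
  // num_slots must be greater than zero.
  blocked_bloom_filter(std::size_t num_slots, unsigned num_hashes)
    : slots_(num_slots, 0), num_hashes_(num_hashes) {}

  // Set the key's bit pattern inside the single slot the key maps to.
  void insert(std::uint64_t key) {
    std::uint64_t const h = mix(key);
    slots_[h % slots_.size()] |= pattern(h);
  }

  // True means "possibly in set"; false means "definitely not in set".
  bool contains(std::uint64_t key) const {
    std::uint64_t const h = mix(key);
    std::uint64_t const p = pattern(h);
    return (slots_[h % slots_.size()] & p) == p;
  }

 private:
  // SplitMix64 finalizer, used here only as a stand-in hash function.
  static std::uint64_t mix(std::uint64_t x) {
    x += 0x9e3779b97f4a7c15ULL;
    x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9ULL;
    x = (x ^ (x >> 27)) * 0x94d049bb133111ebULL;
    return x ^ (x >> 31);
  }

  // Build a pattern with up to num_hashes_ bits set in a 64-bit slot
  // (fewer if two derived positions happen to collide).
  std::uint64_t pattern(std::uint64_t h) const {
    std::uint64_t p = 0;
    for (unsigned i = 0; i < num_hashes_; ++i) {
      h = mix(h + i);
      p |= std::uint64_t{1} << (h & 63);
    }
    return p;
  }

  std::vector<std::uint64_t> slots_;
  unsigned num_hashes_;
};
```

Because `insert` only ever sets bits, a key that was inserted always passes `contains`, which is the "no false negatives" property described above. A GPU version would presumably replace the plain `|=` with an atomic OR on the slot.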

This PR comes with examples, benchmarks, as well as unit tests.

@GPUtester

Can one of the admins verify this patch?

@sleeepyjack changed the title from "[WIP] Added bloom_filter with example and benchmarks" to "[REVIEW] Add cuco::bloom_filter" on Aug 18, 2021
@sleeepyjack
Collaborator Author

ok to test

@sleeepyjack
Collaborator Author

Meh, forgot that I don't have the permissions to fire up the CI.

This PR is ready to test and ready for review.

@jrhemstad
Collaborator

add to whitelist

@jrhemstad
Collaborator

okay to test

@jrhemstad
Collaborator

ok to test

@dillon-cullinan
Contributor

add to whitelist

@sleeepyjack
Collaborator Author

rerun tests

@jrhemstad
Collaborator

rerun tests

* in the filter.
*
* @tparam block_size The size of the thread block
* @tparam InputIt Device accessible input iterator whose `value_type` is

As I understand it, input iterators don't enforce the `equality_comparable` property (unlike legacy input iterators or random access iterators). If I'm not mistaken, we might need to rewrite `(first + tid) < last` as `auto size = distance(first, last); tid < size`, or require legacy input iterators in the documentation. I'm not particularly strong in the field of iterator concepts, so correct me if I'm wrong 😅
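
The suggested rewrite could look roughly like the following host-side sketch. The function name `for_each_strided` and its `tid`/`stride` parameters are hypothetical stand-ins for a kernel's global thread index and grid size; they are not taken from the PR. The range size is computed once, and the loop condition then compares integers rather than iterators.

```cpp
#include <cstddef>
#include <iterator>
#include <vector>

// Hypothetical stand-in for one thread's share of a grid-stride loop.
// `tid` and `stride` mimic the global thread index and the grid size.
template <typename InputIt, typename Callback>
void for_each_strided(InputIt first, InputIt last,
                      std::ptrdiff_t tid, std::ptrdiff_t stride, Callback cb)
{
  // Compute the range size once; the loop condition then compares
  // integers only, so no ordering between iterators is required.
  auto const size = std::distance(first, last);
  for (auto i = tid; i < size; i += stride) {
    cb(*(first + i));  // still assumes random access via operator+
  }
}
```

Note that `first + i` itself still assumes a random access iterator, so this sketch only removes the iterator comparison; documenting the iterator requirement would likely still be needed. In device code, a device-callable equivalent of `std::distance` would have to be used instead.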

@jrhemstad
Collaborator

@sleeepyjack can you resolve conflicts?

@sleeepyjack
Collaborator Author

> @sleeepyjack can you resolve conflicts?

on it

@PointKernel added the `topic: build` (CMake build issue) label on Dec 3, 2021
@PointKernel marked this pull request as draft on July 28, 2022 16:34
@sleeepyjack
Collaborator Author

[Image: screenshot, 2022-07-29 14:21:22]

Wider Slot types result in a better FPR. However, since cuda::atomic<__int128_t>::is_lock_free() == false, the query throughput drops drastically.

@sleeepyjack
Collaborator Author

I'm dropping the cuda::annotated_ptr/cuda::apply_access_policy strategy, as the access policy is apparently not applied correctly (virtually no performance difference between L2-persistent and non-persistent filters). Thus, I'm rolling back to the old strategy, i.e., using the CUDA driver API.

Here are some benchmark results on an A100 80GB, comparing L2-resident and non-resident filters:

| KeyType | SlotType | FilterOperation | FilterScope | DataScope | NumInputs | NumBits | NumHashes | nv/filter/fpr | nv/filter/size/mb | Samples | CPU Time | Noise | GPU Time | Noise | Elem/s | GlobalMem BW | BWUtil | Samples | Batch GPU |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| I32 | U64 | INSERT | GMEM | GMEM | 10000000 | 300000000 | 2 | 0.0059597 | 37 | 773x | 656.308 us | 1.37% | 647.584 us | 0.22% | 15.442G | 123.536 GB/s | 6.38% | 814x | 641.942 us |
| I32 | U64 | INSERT | GMEM | GMEM | 100000000 | 300000000 | 2 | 0.24194 | 37 | 81x | 6.194 ms | 1.50% | 6.185 ms | 1.49% | 16.168G | 129.345 GB/s | 6.68% | 85x | 6.163 ms |
| I32 | U64 | INSERT | GMEM | REGS | 10000000 | 300000000 | 2 | 0.0059597 | 37 | 1289x | 396.613 us | 2.47% | 387.908 us | 1.02% | 25.779G | 206.235 GB/s | 10.66% | 1374x | 380.201 us |
| I32 | U64 | INSERT | GMEM | REGS | 100000000 | 300000000 | 2 | 0.24194 | 37 | 171x | 2.940 ms | 0.87% | 2.932 ms | 0.81% | 34.111G | 272.888 GB/s | 14.10% | 180x | 2.940 ms |
| I32 | U64 | INSERT | L2 | GMEM | 10000000 | 300000000 | 2 | 0.0059597 | 37 | 1819x | 283.593 us | 3.21% | 274.877 us | 0.51% | 36.380G | 291.039 GB/s | 15.04% | 1896x | 269.990 us |
| I32 | U64 | INSERT | L2 | GMEM | 100000000 | 300000000 | 2 | 0.24194 | 37 | 201x | 2.505 ms | 0.74% | 2.496 ms | 0.64% | 40.060G | 320.483 GB/s | 16.56% | 202x | 2.519 ms |
| I32 | U64 | INSERT | L2 | REGS | 10000000 | 300000000 | 2 | 0.0059597 | 37 | 1951x | 265.059 us | 3.47% | 256.315 us | 0.62% | 39.014G | 312.115 GB/s | 16.13% | 2031x | 251.270 us |
| I32 | U64 | INSERT | L2 | REGS | 100000000 | 300000000 | 2 | 0.24194 | 37 | 217x | 2.316 ms | 0.39% | 2.307 ms | 0.04% | 43.341G | 346.728 GB/s | 17.92% | 227x | 2.302 ms |
| I32 | U64 | CONTAINS | GMEM | GMEM | 10000000 | 300000000 | 2 | 0.0059597 | 37 | 1793x | 287.897 us | 3.22% | 278.999 us | 0.46% | 35.842G | 286.740 GB/s | 14.82% | 1906x | 262.407 us |
| I32 | U64 | CONTAINS | GMEM | GMEM | 100000000 | 300000000 | 2 | 0.24194 | 37 | 192x | 2.621 ms | 0.64% | 2.612 ms | 0.54% | 38.282G | 306.254 GB/s | 15.82% | 199x | 2.617 ms |
| I32 | U64 | CONTAINS | GMEM | REGS | 10000000 | 300000000 | 2 | 0.0059597 | 37 | 1831x | 282.127 us | 3.39% | 273.182 us | 0.85% | 36.606G | 292.845 GB/s | 15.13% | 1946x | 258.885 us |
| I32 | U64 | CONTAINS | GMEM | REGS | 100000000 | 300000000 | 2 | 0.24194 | 37 | 197x | 2.552 ms | 0.38% | 2.543 ms | 0.13% | 39.323G | 314.587 GB/s | 16.25% | 207x | 2.528 ms |
| I32 | U64 | CONTAINS | L2 | GMEM | 10000000 | 300000000 | 2 | 0.0059597 | 37 | 1873x | 276.017 us | 3.42% | 266.967 us | 0.43% | 37.458G | 299.662 GB/s | 15.48% | 1961x | 260.320 us |
| I32 | U64 | CONTAINS | L2 | GMEM | 100000000 | 300000000 | 2 | 0.24194 | 37 | 194x | 2.593 ms | 0.72% | 2.584 ms | 0.63% | 38.706G | 309.651 GB/s | 16.00% | 196x | 2.599 ms |
| I32 | U64 | CONTAINS | L2 | REGS | 10000000 | 300000000 | 2 | 0.0059597 | 37 | 1891x | 273.571 us | 3.51% | 264.540 us | 0.81% | 37.802G | 302.412 GB/s | 15.63% | 1939x | 258.965 us |
| I32 | U64 | CONTAINS | L2 | REGS | 100000000 | 300000000 | 2 | 0.24194 | 37 | 198x | 2.543 ms | 0.36% | 2.534 ms | 0.06% | 39.458G | 315.667 GB/s | 16.31% | 204x | 2.529 ms |
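
For context, the reported `nv/filter/fpr` values can be compared against the textbook approximation for a standard (non-blocked) Bloom filter, FPR ≈ (1 - e^(-k·n/m))^k, where n is the number of inserted keys, m the number of bits, and k the number of hash functions. The helper below is a small sketch for that comparison only; the name `classical_fpr` is made up for this example and is not part of the benchmark code.

```cpp
#include <cmath>

// Classical approximation for a standard (non-blocked) Bloom filter:
//   FPR ~= (1 - exp(-k * n / m))^k
// with n = num_inputs, m = num_bits, k = num_hashes.
inline double classical_fpr(double num_inputs, double num_bits, double num_hashes)
{
  return std::pow(1.0 - std::exp(-num_hashes * num_inputs / num_bits), num_hashes);
}
```

With NumBits = 300000000 and NumHashes = 2, this gives roughly 0.0042 for 10000000 inputs and roughly 0.237 for 100000000 inputs. The measured values (0.0059597 and 0.24194) are somewhat higher, which would be in line with the general expectation that blocked filters give up some accuracy in exchange for memory locality.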

@kkraus14

kkraus14 commented Aug 7, 2024

@sleeepyjack we would love to see this work pushed forward so we can utilize this. Is there anything that we can do to help here?

@sleeepyjack
Collaborator Author

@kkraus14 I can move this up on my task list and hammer out a new draft PR tomorrow so we can get started on discussing the last few design questions. I'll keep you posted.

@sleeepyjack
Collaborator Author

Superseded by #573

@sleeepyjack sleeepyjack closed this Aug 8, 2024
sleeepyjack added a commit that referenced this pull request Oct 2, 2024
Supersedes #101

Implementation of a GPU "Blocked Bloom Filter".

This PR is an updated/optimized version of #101 and features the
following improvements:

- Incorporate the new library design
- Improve performance by computing the key's bit pattern based on a
single hash value instead of using a double hashing derivative

---------

Co-authored-by: Yunsong Wang <yunsongw@nvidia.com>

Labels

- `In Progress`: Currently a work in progress
- `topic: build`: CMake build issue
- `topic: performance`: Performance related issue
- `type: feature request`: New feature request

7 participants