L2 residency control helper functions#197
Conversation
| template <typename Iterator> | ||
| void register_l2_persistence( | ||
| cudaStream_t& stream, Iterator begin, Iterator end, float hit_rate = 0.6f, float carve_out = 1.0f) |
There was a problem hiding this comment.
For now we use an iterator pair and assume they represent a continuous memory segment. This is going to look a lot cleaner once we can use span.
There was a problem hiding this comment.
Can you include a description of why this is necessary vs. annotated_ptr?
There was a problem hiding this comment.
Could you place stream as the last parameter? Not necessary but try to be consistent with other cuco APIs.
There was a problem hiding this comment.
The most elegant solution to this problem would be to wrap the storage of the filter as an annotated_ptr. However, this first requires a solution to the annotated_ptr + cuda::atomic/cuda::atomic_ref problem, which is not available yet.
The next best approach would be to set the access property manually using cuda::apply_access_property, which I implemented in the previous version of the benchmark script:
cuCollections/benchmarks/bloom_filter/bloom_filter_bench.cu
Lines 199 to 217 in a1ea293
This however showed little to no performance impact. I have not been able to quite get to the bottom of why this is happening. Another disadvantage of this method is that cuda::apply_access_property does not allow hybrid access policies, e.g. 60% cuda::access_property_persisting + 40% cuda::access_property_normal, which showed the best performance in the benchmarks.
tl;dr This is a workaround that is flexible enough for our purposes (can also be used with other cuco map types) but should be replaced once annotated_ptr works as expected.
There was a problem hiding this comment.
Could you place
streamas the last parameter? Not necessary but try to be consistent with other cuco APIs.
Yes, I can do that for the sake of consistency. I initially put it in the front since it's an in/out parameter, i.e., the stream attributes are mutated.
However, this reminds me that I find the placement of the stream parameter in our API unfortunate. I would put the stream in front of the hashers and equality operators, since you are likely to set this one explicitly more often than the others.
It also has a shorter default value compared to the others.
There was a problem hiding this comment.
Wait, I can't put the stream at the end since there are additional defaulted parameters.
There was a problem hiding this comment.
The most elegant solution to this problem would be to wrap the storage of the filter as an
annotated_ptr. However, this first requires a solution to theannotated_ptr+cuda::atomic/cuda::atomic_refproblem, which is not available yet.The next best approach would be to set the access property manually using
cuda::apply_access_property, which I implemented in the previous version of the benchmark script:cuCollections/benchmarks/bloom_filter/bloom_filter_bench.cu
Lines 199 to 217 in a1ea293
This however showed little to no performance impact. I have not been able to quite get to the bottom of why this is happening. Another disadvantage of this method is that
cuda::apply_access_propertydoes not allow hybrid access policies, e.g. 60%cuda::access_property_persisting+ 40%cuda::access_property_normal, which showed the best performance in the benchmarks.tl;dr This is a workaround that is flexible enough for our purposes (can also be used with other cuco map types) but should be replaced once
annotated_ptrworks as expected.
Hey, apologies for bringing up this old PR. I am also testing the cuda::apply_access_property API (on https://leimao.github.io/blog/CUDA-L2-Persistent-Cache/) and in general the prefetch.global.L2::evict_last PTX instruction that it internally uses, but I am not able to see any effects of this. However, when I use the accessPolicyWindow method from the CUDA API to configure L2 persistence for the same scenario, I do observe performance improvements. Did you figure out the reason or any possible resolution for this issue? Any help would be greatly appreciated.
| // Overwrite the access policy attribute of CUDA Stream | ||
| cudaStreamSetAttribute(stream, cudaStreamAttributeAccessPolicyWindow, &stream_attribute); | ||
| // Remove any persistent lines in L2$ | ||
| cudaCtxResetPersistingL2Cache(); |
There was a problem hiding this comment.
This resets the cache globally. Maybe undesirable.
|
A usage example can be found here (#101): https://github.com/NVIDIA/cuCollections/blob/6b61279f8870142e81875f8c7a1fbdea04d619b4/examples/bloom_filter/l2_residency_example.cu |
PointKernel
left a comment
There was a problem hiding this comment.
Should we move the file to include/cuco since it's exposed to the public?
| template <typename Iterator> | ||
| void register_l2_persistence( | ||
| cudaStream_t& stream, Iterator begin, Iterator end, float hit_rate = 0.6f, float carve_out = 1.0f) |
There was a problem hiding this comment.
Could you place stream as the last parameter? Not necessary but try to be consistent with other cuco APIs.
I also had this in mind (that's why I added the doxygen docs). I would say yes. However, I don't know how broad the range of applications for this functionality really is. |
This PR adds helper functions that allow for pinning a region in global memory into L2$.
Cherry-picked from #101 so it can be reviewed separately.
Also related: #101 (comment)