
Limits the number of different offset types for DeviceMergeSort #3328

Merged
elstehle merged 1 commit into NVIDIA:main from elstehle:enh/limit-merge-sort-offset-types
Jan 11, 2025

Conversation


@elstehle elstehle commented Jan 10, 2025

Description

Closes #3312

Benchmark results on H100, comparing the offset types

This summary shows that performance is closely comparable across offset types (each ratio is the given offset type's time relative to the i32 baseline):

MergeSort.keys

      u32/i32 time   u32/i32 time (2^28)   i64/i32 time   i64/i32 time (2^28)   u64/i32 time   u64/i32 time (2^28)
min   97.92%         99.49%                99.53%         99.53%                99.27%         99.48%
max   104.08%        100.89%               105.89%        101.99%               105.78%        102.57%
avg   99.92%         100.09%               100.92%        100.39%               100.72%        100.48%

MergeSort.pairs

      u32/i32 time   u32/i32 time (2^28)   i64/i32 time   i64/i32 time (2^28)   u64/i32 time   u64/i32 time (2^28)
min   97.80%         97.81%                97.04%         97.04%                97.43%         97.43%
max   102.82%        102.58%               103.35%        102.67%               103.59%        102.90%
avg   99.71%         99.92%                100.22%        100.33%               100.05%        100.35%
Detailed benchmark results on H100, sort.pairs

KeyT{ct} ValueT{ct} Elements{io} Entropy  i32 time  u32 time  u32/i32  i64 time  i64/i32  u64 time  u64/i32
I8 I8 2^16 = 65536 1 60.469 62.172 102.82% 62.493 103.35% 62.105 102.71%
I8 I8 2^20 = 1048576 1 171.971 174.273 101.34% 176.582 102.68% 175.85 102.26%
I8 I8 2^24 = 16777216 1 1771 1779 100.45% 1804 101.86% 1799 101.58%
I8 I8 2^28 = 268435456 1 32722 32771 100.15% 33440 102.19% 33347 101.91%
I8 I8 2^16 = 65536 0.201 60.212 61.443 102.04% 61.695 102.46% 61.335 101.87%
I8 I8 2^20 = 1048576 0.201 167.837 168.237 100.24% 170.479 101.57% 169.35 100.90%
I8 I8 2^24 = 16777216 0.201 1689 1694 100.30% 1716 101.60% 1711 101.30%
I8 I8 2^28 = 268435456 0.201 31170 31246 100.24% 31815 102.07% 31729 101.79%
I8 I16 2^16 = 65536 1 61.855 61.386 99.24% 61.909 100.09% 61.53 99.47%
I8 I16 2^20 = 1048576 1 173.999 172.652 99.23% 174.059 100.03% 173.731 99.85%
I8 I16 2^24 = 16777216 1 1760 1760 100.00% 1775 100.85% 1776 100.91%
I8 I16 2^28 = 268435456 1 32091 32118 100.08% 32616 101.64% 32648 101.74%
I8 I16 2^16 = 65536 0.201 60.971 60.476 99.19% 60.963 99.99% 60.6 99.39%
I8 I16 2^20 = 1048576 0.201 166.307 165.372 99.44% 166.748 100.27% 166.414 100.06%
I8 I16 2^24 = 16777216 0.201 1666 1666 100.00% 1679 100.78% 1679 100.78%
I8 I16 2^28 = 268435456 0.201 30519 30543 100.08% 30967 101.47% 30985 101.53%
I8 I32 2^16 = 65536 1 61.401 60.948 99.26% 61.438 100.06% 61.036 99.41%
I8 I32 2^20 = 1048576 1 174.453 173.111 99.23% 174.789 100.19% 174.326 99.93%
I8 I32 2^24 = 16777216 1 1852 1853 100.05% 1839 99.30% 1840 99.35%
I8 I32 2^28 = 268435456 1 34596 34605 100.03% 34390 99.40% 34414 99.47%
I8 I32 2^16 = 65536 0.201 60.35 60.009 99.43% 60.283 99.89% 59.972 99.37%
I8 I32 2^20 = 1048576 0.201 165.483 164.616 99.48% 166.36 100.53% 165.951 100.28%
I8 I32 2^24 = 16777216 0.201 1789 1790 100.06% 1764 98.60% 1764 98.60%
I8 I32 2^28 = 268435456 0.201 33492 33521 100.09% 33152 98.98% 33164 99.02%
I8 I64 2^16 = 65536 1 66.408 66.023 99.42% 66.567 100.24% 66.52 100.17%
I8 I64 2^20 = 1048576 1 205.971 204.717 99.39% 208.454 101.21% 207.996 100.98%
I8 I64 2^24 = 16777216 1 2730 2730 100.00% 2735 100.18% 2734 100.15%
I8 I64 2^28 = 268435456 1 52488 52488 100.00% 52570 100.16% 52567 100.15%
I8 I64 2^16 = 65536 0.201 63.909 63.453 99.29% 64.235 100.51% 64.062 100.24%
I8 I64 2^20 = 1048576 0.201 193.859 193.002 99.56% 196.47 101.35% 195.567 100.88%
I8 I64 2^24 = 16777216 0.201 2688 2688 100.00% 2692 100.15% 2691 100.11%
I8 I64 2^28 = 268435456 0.201 51629 51619 99.98% 51732 100.20% 51691 100.12%
I16 I8 2^16 = 65536 1 65.163 64.677 99.25% 64.94 99.66% 64.662 99.23%
I16 I8 2^20 = 1048576 1 184.111 183.079 99.44% 184.561 100.24% 184.499 100.21%
I16 I8 2^24 = 16777216 1 1885 1881 99.79% 1901 100.85% 1904 101.01%
I16 I8 2^28 = 268435456 1 34856 34760 99.72% 35439 101.67% 35512 101.88%
I16 I8 2^16 = 65536 0.201 64.79 64.413 99.42% 64.6 99.71% 64.361 99.34%
I16 I8 2^20 = 1048576 0.201 177.801 176.825 99.45% 178.261 100.26% 178.001 100.11%
I16 I8 2^24 = 16777216 0.201 1759 1756 99.83% 1778 101.08% 1780 101.19%
I16 I8 2^28 = 268435456 0.201 32009 31963 99.86% 32574 101.77% 32623 101.92%
I16 I16 2^16 = 65536 1 64.217 63.49 98.87% 63.71 99.21% 63.438 98.79%
I16 I16 2^20 = 1048576 1 184.34 181.278 98.34% 181.152 98.27% 180.859 98.11%
I16 I16 2^24 = 16777216 1 1884 1867 99.10% 1890 100.32% 1884 100.00%
I16 I16 2^28 = 268435456 1 34685 34377 99.11% 35091 101.17% 34987 100.87%
I16 I16 2^16 = 65536 0.201 63.897 63.092 98.74% 63.225 98.95% 63.064 98.70%
I16 I16 2^20 = 1048576 0.201 174.862 173.491 99.22% 173.662 99.31% 172.702 98.76%
I16 I16 2^24 = 16777216 0.201 1729 1727 99.88% 1744 100.87% 1739 100.58%
I16 I16 2^28 = 268435456 0.201 31436 31402 99.89% 31889 101.44% 31804 101.17%
I16 I32 2^16 = 65536 1 63.471 63.217 99.60% 62.531 98.52% 62.229 98.04%
I16 I32 2^20 = 1048576 1 183.118 181.636 99.19% 181.413 99.07% 180.83 98.75%
I16 I32 2^24 = 16777216 1 2031 2028 99.85% 2053 101.08% 2051 100.98%
I16 I32 2^28 = 268435456 1 38496 38466 99.92% 39016 101.35% 38993 101.29%
I16 I32 2^16 = 65536 0.201 62.96 62.634 99.48% 62.011 98.49% 61.612 97.86%
I16 I32 2^20 = 1048576 0.201 173.21 172.132 99.38% 172.078 99.35% 171.031 98.74%
I16 I32 2^24 = 16777216 0.201 1972 1970 99.90% 1992 101.01% 1990 100.91%
I16 I32 2^28 = 268435456 0.201 37226 37199 99.93% 37733 101.36% 37715 101.31%
I16 I64 2^16 = 65536 1 70.239 69.664 99.18% 71.071 101.18% 70.808 100.81%
I16 I64 2^20 = 1048576 1 217.62 215.237 98.90% 217.837 100.10% 218.492 100.40%
I16 I64 2^24 = 16777216 1 2983 2976 99.77% 2981 99.93% 2981 99.93%
I16 I64 2^28 = 268435456 1 57573 57385 99.67% 57432 99.76% 57436 99.76%
I16 I64 2^16 = 65536 0.201 68.671 68.079 99.14% 69.468 101.16% 69.094 100.62%
I16 I64 2^20 = 1048576 0.201 202.552 200.335 98.91% 203.229 100.33% 203.869 100.65%
I16 I64 2^24 = 16777216 0.201 2946 2938 99.73% 2943 99.90% 2941 99.83%
I16 I64 2^28 = 268435456 0.201 56736 56548 99.67% 56574 99.71% 56581 99.73%
I32 I8 2^16 = 65536 1 61.947 61.389 99.10% 61.655 99.53% 61.418 99.15%
I32 I8 2^20 = 1048576 1 175.778 174.345 99.18% 175.997 100.12% 175.283 99.72%
I32 I8 2^24 = 16777216 1 1903 1898 99.74% 1912 100.47% 1914 100.58%
I32 I8 2^28 = 268435456 1 35604 35548 99.84% 35876 100.76% 35936 100.93%
I32 I8 2^16 = 65536 0.201 61.927 61.415 99.17% 61.635 99.53% 61.419 99.18%
I32 I8 2^20 = 1048576 0.201 172.177 171.192 99.43% 172.415 100.14% 171.898 99.84%
I32 I8 2^24 = 16777216 0.201 1855 1850 99.73% 1861 100.32% 1862 100.38%
I32 I8 2^28 = 268435456 0.201 34476 34442 99.90% 34621 100.42% 34645 100.49%
I32 I16 2^16 = 65536 1 61.469 61.085 99.38% 61.771 100.49% 61.508 100.06%
I32 I16 2^20 = 1048576 1 175.144 174.042 99.37% 175.55 100.23% 174.548 99.66%
I32 I16 2^24 = 16777216 1 2027 2026 99.95% 2035 100.39% 2033 100.30%
I32 I16 2^28 = 268435456 1 38488 38477 99.97% 38697 100.54% 38681 100.50%
I32 I16 2^16 = 65536 0.201 61.498 61.066 99.30% 61.836 100.55% 61.51 100.02%
I32 I16 2^20 = 1048576 0.201 170.597 169.641 99.44% 170.995 100.23% 169.962 99.63%
I32 I16 2^24 = 16777216 0.201 1988 1986 99.90% 1994 100.30% 1993 100.25%
I32 I16 2^28 = 268435456 0.201 37549 37537 99.97% 37758 100.56% 37746 100.52%
I32 I32 2^16 = 65536 1 61.13 60.775 99.42% 60.869 99.57% 60.534 99.03%
I32 I32 2^20 = 1048576 1 175.796 174.532 99.28% 175.484 99.82% 174.359 99.18%
I32 I32 2^24 = 16777216 1 2319 2317 99.91% 2321 100.09% 2319 100.00%
I32 I32 2^28 = 268435456 1 44815 44809 99.99% 44944 100.29% 44932 100.26%
I32 I32 2^16 = 65536 0.201 61.047 60.693 99.42% 60.666 99.38% 60.457 99.03%
I32 I32 2^20 = 1048576 0.201 169.313 168.314 99.41% 168.892 99.75% 167.805 99.11%
I32 I32 2^24 = 16777216 0.201 2295 2293 99.91% 2295 100.00% 2294 99.96%
I32 I32 2^28 = 268435456 0.201 44265 44260 99.99% 44350 100.19% 44345 100.18%
I32 I64 2^16 = 65536 1 66.24 66.545 100.46% 66.861 100.94% 66.679 100.66%
I32 I64 2^20 = 1048576 1 209.114 205.309 98.18% 214.221 102.44% 213.43 102.06%
I32 I64 2^24 = 16777216 1 3357 3301 98.33% 3352 99.85% 3351 99.82%
I32 I64 2^28 = 268435456 1 65812 64868 98.57% 65826 100.02% 65824 100.02%
I32 I64 2^16 = 65536 0.201 65.834 66.106 100.41% 66.413 100.88% 66.146 100.47%
I32 I64 2^20 = 1048576 0.201 199.061 194.847 97.88% 203.661 102.31% 202.596 101.78%
I32 I64 2^24 = 16777216 0.201 3339 3283 98.32% 3335 99.88% 3334 99.85%
I32 I64 2^28 = 268435456 0.201 65436 64477 98.53% 65417 99.97% 65424 99.98%
I64 I8 2^16 = 65536 1 64.863 64.463 99.38% 65.158 100.45% 64.982 100.18%
I64 I8 2^20 = 1048576 1 208.701 208.781 100.04% 212.329 101.74% 211.39 101.29%
I64 I8 2^24 = 16777216 1 3086 3084 99.94% 3087 100.03% 3086 100.00%
I64 I8 2^28 = 268435456 1 59022 59029 100.01% 59028 100.01% 59030 100.01%
I64 I8 2^16 = 65536 0.201 64.891 64.733 99.76% 65.319 100.66% 64.932 100.06%
I64 I8 2^20 = 1048576 0.201 218.876 218.08 99.64% 221.515 101.21% 220.517 100.75%
I64 I8 2^24 = 16777216 0.201 3101 3100 99.97% 3104 100.10% 3103 100.06%
I64 I8 2^28 = 268435456 0.201 59022 59016 99.99% 59030 100.01% 59026 100.01%
I64 I16 2^16 = 65536 1 66.225 65.579 99.02% 66.251 100.04% 65.835 99.41%
I64 I16 2^20 = 1048576 1 214.779 213.615 99.46% 215.584 100.37% 215.146 100.17%
I64 I16 2^24 = 16777216 1 3342 3341 99.97% 3341 99.97% 3340 99.94%
I64 I16 2^28 = 268435456 1 64248 64251 100.00% 64245 100.00% 64241 99.99%
I64 I16 2^16 = 65536 0.201 66.492 65.944 99.18% 66.418 99.89% 66.09 99.40%
I64 I16 2^20 = 1048576 0.201 225.005 223.872 99.50% 225.115 100.05% 224.705 99.87%
I64 I16 2^24 = 16777216 0.201 3359 3358 99.97% 3358 99.97% 3358 99.97%
I64 I16 2^28 = 268435456 0.201 64230 64228 100.00% 64210 99.97% 64206 99.96%
I64 I32 2^16 = 65536 1 66.853 66.254 99.10% 66.975 100.18% 66.719 99.80%
I64 I32 2^20 = 1048576 1 220.584 219.198 99.37% 219.98 99.73% 219.6 99.55%
I64 I32 2^24 = 16777216 1 3860 3859 99.97% 3859 99.97% 3858 99.95%
I64 I32 2^28 = 268435456 1 74875 74877 100.00% 74871 99.99% 74876 100.00%
I64 I32 2^16 = 65536 0.201 67.2 66.543 99.02% 67.345 100.22% 67.021 99.73%
I64 I32 2^20 = 1048576 0.201 232.095 231.257 99.64% 231.506 99.75% 231.352 99.68%
I64 I32 2^24 = 16777216 0.201 3881 3879 99.95% 3880 99.97% 3879 99.95%
I64 I32 2^28 = 268435456 0.201 74901 74908 100.01% 74894 99.99% 74890 99.99%
I64 I64 2^16 = 65536 1 71.151 71.133 99.97% 71.099 99.93% 70.708 99.38%
I64 I64 2^20 = 1048576 1 254.552 254.578 100.01% 254.541 100.00% 253.679 99.66%
I64 I64 2^24 = 16777216 1 5033 5032 99.98% 5031 99.96% 5031 99.96%
I64 I64 2^28 = 268435456 1 98094 98095 100.00% 98070 99.98% 98081 99.99%
I64 I64 2^16 = 65536 0.201 72.281 72.115 99.77% 72.25 99.96% 71.813 99.35%
I64 I64 2^20 = 1048576 0.201 270.997 270.303 99.74% 270.513 99.82% 269.528 99.46%
I64 I64 2^24 = 16777216 0.201 5062 5061 99.98% 5060 99.96% 5059 99.94%
I64 I64 2^28 = 268435456 0.201 98218 98221 100.00% 98200 99.98% 98203 99.98%
I128 I8 2^16 = 65536 1 74.042 73.617 99.43% 73.573 99.37% 73.07 98.69%
I128 I8 2^20 = 1048576 1 353.016 355.316 100.65% 352.395 99.82% 353.355 100.10%
I128 I8 2^24 = 16777216 1 6046 6045 99.98% 6040 99.90% 6042 99.93%
I128 I8 2^28 = 268435456 1 117026 117024 100.00% 116990 99.97% 116985 99.96%
I128 I8 2^16 = 65536 0.201 74.056 73.824 99.69% 73.732 99.56% 73.319 99.00%
I128 I8 2^20 = 1048576 0.201 349.037 348.581 99.87% 347.431 99.54% 347.867 99.66%
I128 I8 2^24 = 16777216 0.201 6027 6027 100.00% 6022 99.92% 6025 99.97%
I128 I8 2^28 = 268435456 0.201 116271 116257 99.99% 116236 99.97% 116234 99.97%
I128 I16 2^16 = 65536 1 74.519 73.965 99.26% 74.139 99.49% 73.905 99.18%
I128 I16 2^20 = 1048576 1 357.659 357.528 99.96% 356.645 99.72% 356.49 99.67%
I128 I16 2^24 = 16777216 1 6315 6313 99.97% 6310 99.92% 6312 99.95%
I128 I16 2^28 = 268435456 1 122687 122663 99.98% 122692 100.00% 122697 100.01%
I128 I16 2^16 = 65536 0.201 74.603 73.978 99.16% 74.2 99.46% 73.954 99.13%
I128 I16 2^20 = 1048576 0.201 350.734 350.145 99.83% 349.138 99.54% 349.443 99.63%
I128 I16 2^24 = 16777216 0.201 6301 6299 99.97% 6296 99.92% 6300 99.98%
I128 I16 2^28 = 268435456 0.201 121919 121906 99.99% 121915 100.00% 121912 99.99%
I128 I32 2^16 = 65536 1 74.2 73.734 99.37% 74.066 99.82% 73.738 99.38%
I128 I32 2^20 = 1048576 1 367.476 365.169 99.37% 361.483 98.37% 363.991 99.05%
I128 I32 2^24 = 16777216 1 6846 6844 99.97% 6839 99.90% 6842 99.94%
I128 I32 2^28 = 268435456 1 133802 133791 99.99% 133805 100.00% 133806 100.00%
I128 I32 2^16 = 65536 0.201 74.72 73.864 98.85% 74.353 99.51% 73.996 99.03%
I128 I32 2^20 = 1048576 0.201 370.105 367.712 99.35% 366.266 98.96% 367.526 99.30%
I128 I32 2^24 = 16777216 0.201 6836 6835 99.99% 6829 99.90% 6833 99.96%
I128 I32 2^28 = 268435456 0.201 133061 133059 100.00% 133050 99.99% 133048 99.99%
I128 I64 2^16 = 65536 1 76.383 75.9 99.37% 75.937 99.42% 75.408 98.72%
I128 I64 2^20 = 1048576 1 425.401 424.763 99.85% 431.661 101.47% 430.857 101.28%
I128 I64 2^24 = 16777216 1 8006 8004 99.98% 8011 100.06% 8014 100.10%
I128 I64 2^28 = 268435456 1 157048 157035 99.99% 157304 100.16% 157301 100.16%
I128 I64 2^16 = 65536 0.201 76.715 76.214 99.35% 76.234 99.37% 75.665 98.63%
I128 I64 2^20 = 1048576 0.201 425.809 425.006 99.81% 431.345 101.30% 430.825 101.18%
I128 I64 2^24 = 16777216 0.201 8003 8001 99.98% 8007 100.05% 8014 100.14%
I128 I64 2^28 = 268435456 0.201 156339 156336 100.00% 156632 100.19% 156629 100.19%
F32 I8 2^16 = 65536 1 61.645 61.308 99.45% 61.502 99.77% 61.263 99.38%
F32 I8 2^20 = 1048576 1 175.196 174.044 99.34% 175.723 100.30% 175.181 99.99%
F32 I8 2^24 = 16777216 1 1908 1887 98.90% 1899 99.53% 1902 99.69%
F32 I8 2^28 = 268435456 1 35687 35344 99.04% 35661 99.93% 35717 100.08%
F32 I8 2^16 = 65536 0.201 61.455 61.07 99.37% 61.387 99.89% 61.065 99.37%
F32 I8 2^20 = 1048576 0.201 171.073 169.932 99.33% 171.781 100.41% 171.377 100.18%
F32 I8 2^24 = 16777216 0.201 1860 1840 98.92% 1849 99.41% 1850 99.46%
F32 I8 2^28 = 268435456 0.201 34580 34248 99.04% 34416 99.53% 34443 99.60%
F32 I16 2^16 = 65536 1 61.194 60.517 98.89% 60.793 99.34% 60.452 98.79%
F32 I16 2^20 = 1048576 1 174.5 173.16 99.23% 174.605 100.06% 173.537 99.45%
F32 I16 2^24 = 16777216 1 2031 2030 99.95% 2039 100.39% 2037 100.30%
F32 I16 2^28 = 268435456 1 38549 38540 99.98% 38774 100.58% 38754 100.53%
F32 I16 2^16 = 65536 0.201 60.885 60.323 99.08% 60.618 99.56% 60.289 99.02%
F32 I16 2^20 = 1048576 0.201 168.84 167.654 99.30% 169.738 100.53% 168.72 99.93%
F32 I16 2^24 = 16777216 0.201 1992 1990 99.90% 1999 100.35% 1997 100.25%
F32 I16 2^28 = 268435456 0.201 37621 37613 99.98% 37840 100.58% 37826 100.54%
F32 I32 2^16 = 65536 1 60.455 59.969 99.20% 59.916 99.11% 59.537 98.48%
F32 I32 2^20 = 1048576 1 175.024 173.82 99.31% 174.909 99.93% 174.111 99.48%
F32 I32 2^24 = 16777216 1 2333 2330 99.87% 2344 100.47% 2343 100.43%
F32 I32 2^28 = 268435456 1 45065 45047 99.96% 45411 100.77% 45402 100.75%
F32 I32 2^16 = 65536 0.201 60.113 59.565 99.09% 59.451 98.90% 59.264 98.59%
F32 I32 2^20 = 1048576 0.201 168.517 167.118 99.17% 168.938 100.25% 167.901 99.63%
F32 I32 2^24 = 16777216 0.201 2310 2309 99.96% 2321 100.48% 2319 100.39%
F32 I32 2^28 = 268435456 0.201 44552 44541 99.98% 44867 100.71% 44864 100.70%
F32 I64 2^16 = 65536 1 66.874 66.612 99.61% 67.718 101.26% 66.189 98.98%
F32 I64 2^20 = 1048576 1 212.644 211.724 99.57% 214.598 100.92% 214.046 100.66%
F32 I64 2^24 = 16777216 1 3364 3364 100.00% 3366 100.06% 3363 99.97%
F32 I64 2^28 = 268435456 1 65938 65943 100.01% 65961 100.03% 65952 100.02%
F32 I64 2^16 = 65536 0.201 65.951 65.677 99.58% 66.786 101.27% 65.505 99.32%
F32 I64 2^20 = 1048576 0.201 202.944 201.573 99.32% 204.992 101.01% 204.932 100.98%
F32 I64 2^24 = 16777216 0.201 3348 3348 100.00% 3350 100.06% 3347 99.97%
F32 I64 2^28 = 268435456 0.201 65600 65587 99.98% 65597 100.00% 65598 100.00%
F64 I8 2^16 = 65536 1 64.447 63.862 99.09% 64.245 99.69% 64.023 99.34%
F64 I8 2^20 = 1048576 1 208.481 207.645 99.60% 209.471 100.47% 209.217 100.35%
F64 I8 2^24 = 16777216 1 3085 3086 100.03% 3087 100.06% 3086 100.03%
F64 I8 2^28 = 268435456 1 59019 59012 99.99% 59044 100.04% 59043 100.04%
F64 I8 2^16 = 65536 0.201 64.441 63.899 99.16% 64.396 99.93% 64.172 99.58%
F64 I8 2^20 = 1048576 0.201 216.68 215.993 99.68% 217.89 100.56% 217.125 100.21%
F64 I8 2^24 = 16777216 0.201 3093 3094 100.03% 3097 100.13% 3096 100.10%
F64 I8 2^28 = 268435456 0.201 58948 58937 99.98% 58871 99.87% 58981 100.06%
F64 I16 2^16 = 65536 1 64.463 63.961 99.22% 64.529 100.10% 64.143 99.50%
F64 I16 2^20 = 1048576 1 214.411 213.369 99.51% 214.533 100.06% 213.797 99.71%
F64 I16 2^24 = 16777216 1 3342 3343 100.03% 3342 100.00% 3340 99.94%
F64 I16 2^28 = 268435456 1 64286 64286 100.00% 64270 99.98% 64280 99.99%
F64 I16 2^16 = 65536 0.201 64.767 64.209 99.14% 64.606 99.75% 64.232 99.17%
F64 I16 2^20 = 1048576 0.201 222.601 222.166 99.80% 223.098 100.22% 222.373 99.90%
F64 I16 2^24 = 16777216 0.201 3352 3352 100.00% 3351 99.97% 3349 99.91%
F64 I16 2^28 = 268435456 0.201 64176 64184 100.01% 64162 99.98% 64158 99.97%
F64 I32 2^16 = 65536 1 65.066 64.664 99.38% 64.975 99.86% 64.712 99.46%
F64 I32 2^20 = 1048576 1 219.589 217.902 99.23% 220.23 100.29% 217.873 99.22%
F64 I32 2^24 = 16777216 1 3861 3861 100.00% 3859 99.95% 3858 99.92%
F64 I32 2^28 = 268435456 1 74904 74910 100.01% 74905 100.00% 74903 100.00%
F64 I32 2^16 = 65536 0.201 65.568 65.294 99.58% 65.389 99.73% 65.06 99.23%
F64 I32 2^20 = 1048576 0.201 228.72 227.963 99.67% 229.429 100.31% 227.545 99.49%
F64 I32 2^24 = 16777216 0.201 3870 3870 100.00% 3868 99.95% 3867 99.92%
F64 I32 2^28 = 268435456 0.201 74796 74798 100.00% 74786 99.99% 74785 99.99%
F64 I64 2^16 = 65536 1 69.9 69.33 99.18% 69.681 99.69% 69.377 99.25%
F64 I64 2^20 = 1048576 1 255.402 254.788 99.76% 253.73 99.35% 253.354 99.20%
F64 I64 2^24 = 16777216 1 5034 5034 100.00% 5033 99.98% 5031 99.94%
F64 I64 2^28 = 268435456 1 98104 98123 100.02% 98097 99.99% 98097 99.99%
F64 I64 2^16 = 65536 0.201 70.734 70.121 99.13% 70.501 99.67% 70.155 99.18%
F64 I64 2^20 = 1048576 0.201 267.049 266.483 99.79% 265.703 99.50% 264.712 99.12%
F64 I64 2^24 = 16777216 0.201 5045 5045 100.00% 5042 99.94% 5041 99.92%
F64 I64 2^28 = 268435456 0.201 98013 98028 100.02% 98006 99.99% 98006 99.99%
C64 I8 2^16 = 65536 1 208.085 207.977 99.95% 208.066 99.99% 208.453 100.18%
C64 I8 2^20 = 1048576 1 564.478 564.301 99.97% 566.759 100.40% 565.851 100.24%
C64 I8 2^24 = 16777216 1 6822 6830 100.12% 6841 100.28% 6839 100.25%
C64 I8 2^28 = 268435456 1 137348 137496 100.11% 138170 100.60% 137773 100.31%
C64 I8 2^16 = 65536 0.201 322.449 322.035 99.87% 323.104 100.20% 322.604 100.05%
C64 I8 2^20 = 1048576 0.201 879.585 880.189 100.07% 884.521 100.56% 880.899 100.15%
C64 I8 2^24 = 16777216 0.201 12408 12452 100.35% 12516 100.87% 12508 100.81%
C64 I8 2^28 = 268435456 0.201 208819 209601 100.37% 210779 100.94% 210539 100.82%
C64 I16 2^16 = 65536 1 208.341 207.552 99.62% 209.28 100.45% 209.197 100.41%
C64 I16 2^20 = 1048576 1 567.824 566.687 99.80% 569.81 100.35% 571.645 100.67%
C64 I16 2^24 = 16777216 1 6932 6905 99.61% 6920 99.83% 6990 100.84%
C64 I16 2^28 = 268435456 1 139995 138764 99.12% 138841 99.18% 139508 99.65%
C64 I16 2^16 = 65536 0.201 324.128 323.379 99.77% 325.945 100.56% 325.308 100.36%
C64 I16 2^20 = 1048576 0.201 887.645 887.647 100.00% 895.427 100.88% 895.422 100.88%
C64 I16 2^24 = 16777216 0.201 12369 12557 101.52% 12694 102.63% 12722 102.85%
C64 I16 2^28 = 268435456 0.201 208973 211903 101.40% 214552 102.67% 215031 102.90%
C64 I32 2^16 = 65536 1 209.227 208.649 99.72% 209.864 100.30% 209.279 100.02%
C64 I32 2^20 = 1048576 1 575.243 573.046 99.62% 574.36 99.85% 577.525 100.40%
C64 I32 2^24 = 16777216 1 7046 7007 99.45% 7025 99.70% 7050 100.06%
C64 I32 2^28 = 268435456 1 140066 140054 99.99% 140284 100.16% 140810 100.53%
C64 I32 2^16 = 65536 0.201 323.872 322.755 99.66% 325.765 100.58% 324.742 100.27%
C64 I32 2^20 = 1048576 0.201 897.381 891.446 99.34% 896.875 99.94% 897.256 99.99%
C64 I32 2^24 = 16777216 0.201 12812 12530 97.80% 12720 99.28% 12753 99.54%
C64 I32 2^28 = 268435456 0.201 217276 212511 97.81% 215843 99.34% 216371 99.58%
C64 I64 2^16 = 65536 1 215.955 214.916 99.52% 215.9 99.97% 215.998 100.02%
C64 I64 2^20 = 1048576 1 603.612 605.543 100.32% 610.727 101.18% 608.608 100.83%
C64 I64 2^24 = 16777216 1 7414 7451 100.50% 7638 103.02% 7680 103.59%
C64 I64 2^28 = 268435456 1 147588 147841 100.17% 150052 101.67% 150657 102.08%
C64 I64 2^16 = 65536 0.201 333.665 332.798 99.74% 334.537 100.26% 333.773 100.03%
C64 I64 2^20 = 1048576 0.201 922.161 923.434 100.14% 939.801 101.91% 940.523 101.99%
C64 I64 2^24 = 16777216 0.201 12926 13268 102.65% 12616 97.60% 12674 98.05%
C64 I64 2^28 = 268435456 0.201 221625 227338 102.58% 215071 97.04% 215922 97.43%
Detailed benchmark results on H100, sort.keys


T{ct} Elements{io} Entropy  i32 time  u32 time  u32/i32  i64 time  i64/i32  u64 time  u64/i32
I8 2^16 = 65536 1   49.911 51.946 104.08% 52.852 105.89% 52.794 105.78%
I8 2^20 = 1048576 1   122.972 125.387 101.96% 127.503 103.68% 126.937 103.22%
I8 2^24 = 16777216 1   840.934 848.612 100.91% 855.068 101.68% 854.571 101.62%
I8 2^28 = 268435456 1   14776 14768 99.95% 14944 101.14% 14943 101.13%
I8 2^16 = 65536 0.201   49.736 50.945 102.43% 52.108 104.77% 51.591 103.73%
I8 2^20 = 1048576 0.201   119.974 121.815 101.53% 124.746 103.98% 124.05 103.40%
I8 2^24 = 16777216 0.201   794.997 799.318 100.54% 806.67 101.47% 805.044 101.26%
I8 2^28 = 268435456 0.201   13737 13750 100.09% 13936 101.45% 13930 101.40%
I16 2^16 = 65536 1   54.43 54.124 99.44% 54.953 100.96% 54.827 100.73%
I16 2^20 = 1048576 1   132.44 132.352 99.93% 134.235 101.36% 134.9 101.86%
I16 2^24 = 16777216 1   970.768 972.345 100.16% 978.736 100.82% 979.577 100.91%
I16 2^28 = 268435456 1   17360 17514 100.89% 17705 101.99% 17807 102.57%
I16 2^16 = 65536 0.201   54.307 53.989 99.41% 55.017 101.31% 54.794 100.90%
I16 2^20 = 1048576 0.201   129.126 127.942 99.08% 131.474 101.82% 131.622 101.93%
I16 2^24 = 16777216 0.201   914.877 914.914 100.00% 923.232 100.91% 921.525 100.73%
I16 2^28 = 268435456 0.201   15542 15568 100.17% 15637 100.61% 15633 100.59%
I32 2^16 = 65536 1   52.46 51.828 98.80% 52.83 100.71% 52.601 100.27%
I32 2^20 = 1048576 1   127.595 126.857 99.42% 129.407 101.42% 128.511 100.72%
I32 2^24 = 16777216 1   1278 1277 99.92% 1280 100.16% 1279 100.08%
I32 2^28 = 268435456 1   23714 23718 100.02% 23733 100.08% 23742 100.12%
I32 2^16 = 65536 0.201   52.606 52.014 98.87% 52.912 100.58% 52.794 100.36%
I32 2^20 = 1048576 0.201   125.541 123.857 98.66% 126.949 101.12% 125.819 100.22%
I32 2^24 = 16777216 0.201   1266 1265 99.92% 1268 100.16% 1266 100.00%
I32 2^28 = 268435456 0.201   22969 22968 100.00% 22965 99.98% 22963 99.97%
I64 2^16 = 65536 1   60.702 59.993 98.83% 61.128 100.70% 60.705 100.00%
I64 2^20 = 1048576 1   185.814 184.984 99.55% 186.317 100.27% 185.768 99.98%
I64 2^24 = 16777216 1   2749 2748 99.96% 2752 100.11% 2751 100.07%
I64 2^28 = 268435456 1   52944 52674 99.49% 52693 99.53% 52667 99.48%
I64 2^16 = 65536 0.201   61.672 60.391 97.92% 61.72 100.08% 61.219 99.27%
I64 2^20 = 1048576 0.201   196.116 193.851 98.85% 196.456 100.17% 195.955 99.92%
I64 2^24 = 16777216 0.201   2770 2761 99.68% 2764 99.78% 2763 99.75%
I64 2^28 = 268435456 0.201   52704 52595 99.79% 52607 99.82% 52601 99.80%
I128 2^16 = 65536 1   71.854 71.305 99.24% 72.619 101.06% 72.313 100.64%
I128 2^20 = 1048576 1   329.35 331.707 100.72% 333.206 101.17% 332.985 101.10%
I128 2^24 = 16777216 1   5718 5715 99.95% 5714 99.93% 5713 99.91%
I128 2^28 = 268435456 1   110798 110770 99.97% 110748 99.95% 110743 99.95%
I128 2^16 = 65536 0.201   71.974 71.32 99.09% 72.707 101.02% 72.353 100.53%
I128 2^20 = 1048576 0.201   322.19 324.03 100.57% 325.3 100.97% 324.488 100.71%
I128 2^24 = 16777216 0.201   5695 5694 99.98% 5693 99.96% 5693 99.96%
I128 2^28 = 268435456 0.201   110039 110018 99.98% 110005 99.97% 109998 99.96%
F32 2^16 = 65536 1   52.28 52.001 99.47% 53.143 101.65% 52.91 101.21%
F32 2^20 = 1048576 1   127.809 127.204 99.53% 130.39 102.02% 129.863 101.61%
F32 2^24 = 16777216 1   1276 1278 100.16% 1281 100.39% 1280 100.31%
F32 2^28 = 268435456 1   23569 23562 99.97% 23576 100.03% 23571 100.01%
F32 2^16 = 65536 0.201   52.585 52.033 98.95% 53.153 101.08% 52.988 100.77%
F32 2^20 = 1048576 0.201   125.9 124.395 98.80% 128.203 101.83% 127.677 101.41%
F32 2^24 = 16777216 0.201   1267 1267 100.00% 1269 100.16% 1268 100.08%
F32 2^28 = 268435456 0.201   23005 23004 100.00% 23003 99.99% 23002 99.99%
F64 2^16 = 65536 1   60.225 59.26 98.40% 60.89 101.10% 60.48 100.42%
F64 2^20 = 1048576 1   185.706 185.101 99.67% 187.54 100.99% 186.721 100.55%
F64 2^24 = 16777216 1   2756 2756 100.00% 2759 100.11% 2757 100.04%
F64 2^28 = 268435456 1   52719 52732 100.02% 52744 100.05% 52743 100.05%
F64 2^16 = 65536 0.201   60.839 59.914 98.48% 61.486 101.06% 61.011 100.28%
F64 2^20 = 1048576 0.201   193.126 192.901 99.88% 195.149 101.05% 194.363 100.64%
F64 2^24 = 16777216 0.201   2759 2759 100.00% 2762 100.11% 2761 100.07%
F64 2^28 = 268435456 0.201   52613 52618 100.01% 52638 100.05% 52633 100.04%
C64 2^16 = 65536 1   201.379 201.49 100.06% 202.386 100.50% 202.198 100.41%
C64 2^20 = 1048576 1   471.037 470.731 99.94% 475.033 100.85% 473.644 100.55%
C64 2^24 = 16777216 1   6300 6314 100.22% 6300 100.00% 6324 100.38%
C64 2^28 = 268435456 1   126559 127456 100.71% 128288 101.37% 129571 102.38%
C64 2^16 = 65536 0.201   319.445 319.099 99.89% 320.483 100.32% 318.318 99.65%
C64 2^20 = 1048576 0.201   794.535 796.631 100.26% 801.399 100.86% 797.999 100.44%
C64 2^24 = 16777216 0.201   12412 12450 100.31% 12447 100.28% 12434 100.18%
C64 2^28 = 268435456 0.201   209144 209901 100.36% 209664 100.25% 209672 100.25%

Checklist

  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@elstehle elstehle requested review from a team as code owners January 10, 2025 16:40
@elstehle elstehle mentioned this pull request Jan 10, 2025
@github-actions

🟩 CI finished in 1h 29m: Pass: 100%/96 | Total: 1d 19h | Avg: 27m 02s | Max: 1h 00m | Hits: 403%/15012
  • 🟩 cub: Pass: 100%/47 | Total: 1d 06h | Avg: 38m 54s | Max: 1h 00m | Hits: 569%/3900

    🟩 cpu
      🟩 amd64              Pass: 100%/45  | Total:  1d 05h | Avg: 38m 40s | Max:  1h 00m | Hits: 569%/3900  
      🟩 arm64              Pass: 100%/2   | Total:  1h 28m | Avg: 44m 22s | Max: 45m 17s
    🟩 ctk
      🟩 12.0               Pass: 100%/8   | Total:  5h 42m | Avg: 42m 46s | Max: 55m 12s | Hits: 569%/1560  
      🟩 12.5               Pass: 100%/2   | Total:  1h 28m | Avg: 44m 03s | Max: 46m 10s
      🟩 12.6               Pass: 100%/37  | Total: 23h 18m | Avg: 37m 48s | Max:  1h 00m | Hits: 569%/2340  
    🟩 cudacxx
      🟩 ClangCUDA18        Pass: 100%/2   | Total:  1h 46m | Avg: 53m 01s | Max: 54m 47s
      🟩 nvcc12.0           Pass: 100%/8   | Total:  5h 42m | Avg: 42m 46s | Max: 55m 12s | Hits: 569%/1560  
      🟩 nvcc12.5           Pass: 100%/2   | Total:  1h 28m | Avg: 44m 03s | Max: 46m 10s
      🟩 nvcc12.6           Pass: 100%/35  | Total: 21h 32m | Avg: 36m 56s | Max:  1h 00m | Hits: 569%/2340  
    🟩 cudacxx_family
      🟩 ClangCUDA          Pass: 100%/2   | Total:  1h 46m | Avg: 53m 01s | Max: 54m 47s
      🟩 nvcc               Pass: 100%/45  | Total:  1d 04h | Avg: 38m 17s | Max:  1h 00m | Hits: 569%/3900  
    🟩 cxx
      🟩 Clang9             Pass: 100%/4   | Total:  2h 35m | Avg: 38m 54s | Max: 40m 26s
      🟩 Clang10            Pass: 100%/1   | Total: 41m 48s | Avg: 41m 48s | Max: 41m 48s
      🟩 Clang11            Pass: 100%/1   | Total: 40m 16s | Avg: 40m 16s | Max: 40m 16s
      🟩 Clang12            Pass: 100%/1   | Total: 37m 53s | Avg: 37m 53s | Max: 37m 53s
      🟩 Clang13            Pass: 100%/1   | Total: 37m 47s | Avg: 37m 47s | Max: 37m 47s
      🟩 Clang14            Pass: 100%/1   | Total: 37m 00s | Avg: 37m 00s | Max: 37m 00s
      🟩 Clang15            Pass: 100%/1   | Total: 38m 08s | Avg: 38m 08s | Max: 38m 08s
      🟩 Clang16            Pass: 100%/1   | Total: 38m 47s | Avg: 38m 47s | Max: 38m 47s
      🟩 Clang17            Pass: 100%/1   | Total: 38m 07s | Avg: 38m 07s | Max: 38m 07s
      🟩 Clang18            Pass: 100%/7   | Total:  4h 48m | Avg: 41m 16s | Max: 54m 47s
      🟩 GCC7               Pass: 100%/4   | Total:  2h 37m | Avg: 39m 22s | Max: 41m 10s
      🟩 GCC8               Pass: 100%/1   | Total: 41m 15s | Avg: 41m 15s | Max: 41m 15s
      🟩 GCC9               Pass: 100%/3   | Total:  2h 02m | Avg: 40m 44s | Max: 43m 13s
      🟩 GCC10              Pass: 100%/1   | Total: 42m 46s | Avg: 42m 46s | Max: 42m 46s
      🟩 GCC11              Pass: 100%/1   | Total: 42m 42s | Avg: 42m 42s | Max: 42m 42s
      🟩 GCC12              Pass: 100%/3   | Total:  1h 12m | Avg: 24m 15s | Max: 40m 13s
      🟩 GCC13              Pass: 100%/8   | Total:  3h 49m | Avg: 28m 38s | Max: 43m 27s
      🟩 MSVC14.29          Pass: 100%/3   | Total:  2h 46m | Avg: 55m 30s | Max:  1h 00m | Hits: 569%/2340  
      🟩 MSVC14.39          Pass: 100%/2   | Total:  1h 51m | Avg: 55m 54s | Max: 56m 43s | Hits: 569%/1560  
      🟩 NVHPC24.7          Pass: 100%/2   | Total:  1h 28m | Avg: 44m 03s | Max: 46m 10s
    🟩 cxx_family
      🟩 Clang              Pass: 100%/19  | Total: 12h 34m | Avg: 39m 42s | Max: 54m 47s
      🟩 GCC                Pass: 100%/21  | Total: 11h 48m | Avg: 33m 43s | Max: 43m 27s
      🟩 MSVC               Pass: 100%/5   | Total:  4h 38m | Avg: 55m 40s | Max:  1h 00m | Hits: 569%/3900  
      🟩 NVHPC              Pass: 100%/2   | Total:  1h 28m | Avg: 44m 03s | Max: 46m 10s
    🟩 gpu
      🟩 h100               Pass: 100%/2   | Total: 32m 32s | Avg: 16m 16s | Max: 16m 29s
      🟩 v100               Pass: 100%/45  | Total:  1d 05h | Avg: 39m 55s | Max:  1h 00m | Hits: 569%/3900  
    🟩 jobs
      🟩 Build              Pass: 100%/40  | Total:  1d 03h | Avg: 41m 40s | Max:  1h 00m | Hits: 569%/3900  
      🟩 DeviceLaunch       Pass: 100%/1   | Total: 23m 11s | Avg: 23m 11s | Max: 23m 11s
      🟩 GraphCapture       Pass: 100%/1   | Total: 15m 05s | Avg: 15m 05s | Max: 15m 05s
      🟩 HostLaunch         Pass: 100%/3   | Total:  1h 04m | Avg: 21m 37s | Max: 24m 43s
      🟩 TestGPU            Pass: 100%/2   | Total: 59m 01s | Avg: 29m 30s | Max: 31m 53s
    🟩 sm
      🟩 90                 Pass: 100%/2   | Total: 32m 32s | Avg: 16m 16s | Max: 16m 29s
      🟩 90a                Pass: 100%/1   | Total: 16m 23s | Avg: 16m 23s | Max: 16m 23s
    🟩 std
      🟩 11                 Pass: 100%/5   | Total:  3h 14m | Avg: 38m 55s | Max: 39m 28s
      🟩 14                 Pass: 100%/3   | Total:  2h 09m | Avg: 43m 10s | Max: 51m 06s | Hits: 569%/780   
      🟩 17                 Pass: 100%/13  | Total:  9h 57m | Avg: 45m 57s | Max:  1h 00m | Hits: 569%/2340  
      🟩 20                 Pass: 100%/26  | Total: 15h 07m | Avg: 34m 53s | Max: 56m 43s | Hits: 569%/780   
    
  • 🟩 thrust: Pass: 100%/46 | Total: 12h 10m | Avg: 15m 53s | Max: 40m 15s | Hits: 345%/11112

    🟩 cmake_options
      🟩 -DTHRUST_DISPATCH_TYPE=Force32bit Pass: 100%/2   | Total: 23m 31s | Avg: 11m 45s | Max: 11m 51s
    🟩 cpu
      🟩 amd64              Pass: 100%/44  | Total: 11h 47m | Avg: 16m 04s | Max: 40m 15s | Hits: 345%/11112 
      🟩 arm64              Pass: 100%/2   | Total: 23m 18s | Avg: 11m 39s | Max: 12m 07s
    🟩 ctk
      🟩 12.0               Pass: 100%/8   | Total:  2h 19m | Avg: 17m 24s | Max: 36m 55s | Hits: 340%/3704  
      🟩 12.5               Pass: 100%/2   | Total: 49m 34s | Avg: 24m 47s | Max: 25m 17s
      🟩 12.6               Pass: 100%/36  | Total:  9h 02m | Avg: 15m 03s | Max: 40m 15s | Hits: 347%/7408  
    🟩 cudacxx
      🟩 ClangCUDA18        Pass: 100%/2   | Total: 23m 26s | Avg: 11m 43s | Max: 12m 10s
      🟩 nvcc12.0           Pass: 100%/8   | Total:  2h 19m | Avg: 17m 24s | Max: 36m 55s | Hits: 340%/3704  
      🟩 nvcc12.5           Pass: 100%/2   | Total: 49m 34s | Avg: 24m 47s | Max: 25m 17s
      🟩 nvcc12.6           Pass: 100%/34  | Total:  8h 38m | Avg: 15m 15s | Max: 40m 15s | Hits: 347%/7408  
    🟩 cudacxx_family
      🟩 ClangCUDA          Pass: 100%/2   | Total: 23m 26s | Avg: 11m 43s | Max: 12m 10s
      🟩 nvcc               Pass: 100%/44  | Total: 11h 47m | Avg: 16m 04s | Max: 40m 15s | Hits: 345%/11112 
    🟩 cxx
      🟩 Clang9             Pass: 100%/4   | Total: 48m 34s | Avg: 12m 08s | Max: 12m 50s
      🟩 Clang10            Pass: 100%/1   | Total: 13m 11s | Avg: 13m 11s | Max: 13m 11s
      🟩 Clang11            Pass: 100%/1   | Total: 12m 14s | Avg: 12m 14s | Max: 12m 14s
      🟩 Clang12            Pass: 100%/1   | Total: 12m 51s | Avg: 12m 51s | Max: 12m 51s
      🟩 Clang13            Pass: 100%/1   | Total: 13m 39s | Avg: 13m 39s | Max: 13m 39s
      🟩 Clang14            Pass: 100%/1   | Total: 13m 24s | Avg: 13m 24s | Max: 13m 24s
      🟩 Clang15            Pass: 100%/1   | Total: 13m 22s | Avg: 13m 22s | Max: 13m 22s
      🟩 Clang16            Pass: 100%/1   | Total: 12m 18s | Avg: 12m 18s | Max: 12m 18s
      🟩 Clang17            Pass: 100%/1   | Total: 12m 19s | Avg: 12m 19s | Max: 12m 19s
      🟩 Clang18            Pass: 100%/7   | Total:  1h 22m | Avg: 11m 46s | Max: 14m 50s
      🟩 GCC7               Pass: 100%/4   | Total: 44m 27s | Avg: 11m 06s | Max: 12m 05s
      🟩 GCC8               Pass: 100%/1   | Total: 12m 23s | Avg: 12m 23s | Max: 12m 23s
      🟩 GCC9               Pass: 100%/3   | Total: 35m 38s | Avg: 11m 52s | Max: 12m 24s
      🟩 GCC10              Pass: 100%/1   | Total: 12m 48s | Avg: 12m 48s | Max: 12m 48s
      🟩 GCC11              Pass: 100%/1   | Total: 13m 39s | Avg: 13m 39s | Max: 13m 39s
      🟩 GCC12              Pass: 100%/1   | Total: 14m 33s | Avg: 14m 33s | Max: 14m 33s
      🟩 GCC13              Pass: 100%/8   | Total:  1h 33m | Avg: 11m 44s | Max: 15m 53s
      🟩 MSVC14.29          Pass: 100%/3   | Total:  1h 44m | Avg: 34m 46s | Max: 36m 55s | Hits: 340%/5556  
      🟩 MSVC14.39          Pass: 100%/3   | Total:  1h 55m | Avg: 38m 23s | Max: 40m 15s | Hits: 349%/5556  
      🟩 NVHPC24.7          Pass: 100%/2   | Total: 49m 34s | Avg: 24m 47s | Max: 25m 17s
    🟩 cxx_family
      🟩 Clang              Pass: 100%/19  | Total:  3h 54m | Avg: 12m 20s | Max: 14m 50s
      🟩 GCC                Pass: 100%/19  | Total:  3h 47m | Avg: 11m 58s | Max: 15m 53s
      🟩 MSVC               Pass: 100%/6   | Total:  3h 39m | Avg: 36m 34s | Max: 40m 15s | Hits: 345%/11112 
      🟩 NVHPC              Pass: 100%/2   | Total: 49m 34s | Avg: 24m 47s | Max: 25m 17s
    🟩 gpu
      🟩 v100               Pass: 100%/46  | Total: 12h 10m | Avg: 15m 53s | Max: 40m 15s | Hits: 345%/11112 
    🟩 jobs
      🟩 Build              Pass: 100%/40  | Total: 10h 35m | Avg: 15m 53s | Max: 40m 15s | Hits: 340%/9260  
      🟩 TestCPU            Pass: 100%/3   | Total: 52m 57s | Avg: 17m 39s | Max: 37m 13s | Hits: 365%/1852  
      🟩 TestGPU            Pass: 100%/3   | Total: 42m 23s | Avg: 14m 07s | Max: 15m 53s
    🟩 sm
      🟩 90a                Pass: 100%/1   | Total:  7m 39s | Avg:  7m 39s | Max:  7m 39s
    🟩 std
      🟩 11                 Pass: 100%/5   | Total: 55m 45s | Avg: 11m 09s | Max: 12m 08s
      🟩 14                 Pass: 100%/3   | Total: 56m 51s | Avg: 18m 57s | Max: 32m 32s | Hits: 340%/1852  
      🟩 17                 Pass: 100%/13  | Total:  4h 07m | Avg: 19m 00s | Max: 37m 43s | Hits: 340%/5556  
      🟩 20                 Pass: 100%/23  | Total:  5h 47m | Avg: 15m 06s | Max: 40m 15s | Hits: 353%/3704  
    
  • 🟩 cccl_c_parallel: Pass: 100%/2 | Total: 8m 54s | Avg: 4m 27s | Max: 6m 57s

    🟩 cpu
      🟩 amd64              Pass: 100%/2   | Total:  8m 54s | Avg:  4m 27s | Max:  6m 57s
    🟩 ctk
      🟩 12.6               Pass: 100%/2   | Total:  8m 54s | Avg:  4m 27s | Max:  6m 57s
    🟩 cudacxx
      🟩 nvcc12.6           Pass: 100%/2   | Total:  8m 54s | Avg:  4m 27s | Max:  6m 57s
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/2   | Total:  8m 54s | Avg:  4m 27s | Max:  6m 57s
    🟩 cxx
      🟩 GCC13              Pass: 100%/2   | Total:  8m 54s | Avg:  4m 27s | Max:  6m 57s
    🟩 cxx_family
      🟩 GCC                Pass: 100%/2   | Total:  8m 54s | Avg:  4m 27s | Max:  6m 57s
    🟩 gpu
      🟩 v100               Pass: 100%/2   | Total:  8m 54s | Avg:  4m 27s | Max:  6m 57s
    🟩 jobs
      🟩 Build              Pass: 100%/1   | Total:  1m 57s | Avg:  1m 57s | Max:  1m 57s
      🟩 Test               Pass: 100%/1   | Total:  6m 57s | Avg:  6m 57s | Max:  6m 57s
    
  • 🟩 python: Pass: 100%/1 | Total: 27m 13s | Avg: 27m 13s | Max: 27m 13s

    🟩 cpu
      🟩 amd64              Pass: 100%/1   | Total: 27m 13s | Avg: 27m 13s | Max: 27m 13s
    🟩 ctk
      🟩 12.6               Pass: 100%/1   | Total: 27m 13s | Avg: 27m 13s | Max: 27m 13s
    🟩 cudacxx
      🟩 nvcc12.6           Pass: 100%/1   | Total: 27m 13s | Avg: 27m 13s | Max: 27m 13s
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/1   | Total: 27m 13s | Avg: 27m 13s | Max: 27m 13s
    🟩 cxx
      🟩 GCC13              Pass: 100%/1   | Total: 27m 13s | Avg: 27m 13s | Max: 27m 13s
    🟩 cxx_family
      🟩 GCC                Pass: 100%/1   | Total: 27m 13s | Avg: 27m 13s | Max: 27m 13s
    🟩 gpu
      🟩 v100               Pass: 100%/1   | Total: 27m 13s | Avg: 27m 13s | Max: 27m 13s
    🟩 jobs
      🟩 Test               Pass: 100%/1   | Total: 27m 13s | Avg: 27m 13s | Max: 27m 13s
    

👃 Inspect Changes

Modifications in project?

Project
CCCL Infrastructure
libcu++
+/- CUB
Thrust
CUDA Experimental
python
CCCL C Parallel Library
Catch2Helper

Modifications in project or dependencies?

Project
CCCL Infrastructure
libcu++
+/- CUB
+/- Thrust
CUDA Experimental
+/- python
+/- CCCL C Parallel Library
+/- Catch2Helper

🏃‍ Runner counts (total jobs: 96)

# Runner
69 linux-amd64-cpu16
11 linux-amd64-gpu-v100-latest-1
11 windows-amd64-cpu16
4 linux-arm64-cpu16
1 linux-amd64-gpu-h100-latest-1-testing

Contributor

@fbusato fbusato left a comment
Using `choose_offset_t` instead of promotion is a nice idea!
On the other hand, I'm a fan of using unsigned types for the offsets. The performance here looks good, but there are cases where we can get regressions.

@elstehle elstehle merged commit cc7c1bb into NVIDIA:main Jan 11, 2025
davebayer pushed a commit to davebayer/cccl that referenced this pull request Jan 18, 2025
davebayer added a commit to davebayer/cccl that referenced this pull request Jan 20, 2025
implement `add_sat`

split `signed`/`unsigned` implementation, improve implementation for MSVC

improve device `add_sat` implementation

add `add_sat` test

improve generic `add_sat` implementation for signed types

implement `sub_sat`

allow more msvc intrinsics on x86

add op tests

partially implement `mul_sat`

implement `div_sat` and `saturate_cast`

add `saturate_cast` test

simplify `div_sat` test

Deprecate C++11 and C++14 for libcu++ (#3173)

* Deprecate C++11 and C++14 for libcu++

Co-authored-by: Bernhard Manfred Gruber <bernhardmgruber@gmail.com>

Implement `abs` and `div` from `cstdlib` (#3153)

* implement integer abs functions
* improve tests, fix constexpr support
* just use the our implementation
* implement `cuda::std::div`
* prefer host's `div_t` like types
* provide `cuda::std::abs` overloads for floats
* allow fp abs for NVRTC
* silence msvc's warning about conversion from floating point to integral

Fix missing radix sort policies (#3174)

Fixes NVBug 5009941

Introduces new `DeviceReduce::Arg{Min,Max}` interface with two output iterators (#3148)

* introduces new arg{min,max} interface with two output iterators

* adds fp inf tests

* fixes docs

* improves code example

* fixes exec space specifier

* trying to fix deprecation warning for more compilers

* inlines unzip operator

* trying to fix deprecation warning for nvhpc

* integrates supression fixes in diagnostics

* pre-ctk 11.5 deprecation suppression

* fixes icc

* fix for pre-ctk11.5

* cleans up deprecation suppression

* cleanup

Extend tuning documentation (#3179)

Add codespell pre-commit hook, fix typos in CCCL (#3168)

* Add codespell pre-commit hook
* Automatic changes from codespell.
* Manual changes.

Fix parameter space for TUNE_LOAD in scan benchmark (#3176)

fix various old compiler checks (#3178)

implement C++26 `std::projected` (#3175)

Fix pre-commit config for codespell and remaining typos (#3182)

Massive cleanup of our config (#3155)

Fix UB in atomics with automatic storage (#2586)

* Adds specialized local cuda atomics and injects them into most atomics paths.

Co-authored-by: Georgy Evtushenko <evtushenko.georgy@gmail.com>
Co-authored-by: gonzalobg <65027571+gonzalobg@users.noreply.github.com>

* Allow CUDA 12.2 to keep perf, this addresses earlier comments in #478

* Remove extraneous double brackets in unformatted code.

* Merge unsafe atomic logic into `__cuda_is_local`.

* Use `const_cast` for type conversions in cuda_local.h

* Fix build issues from interface changes

* Fix missing __nanosleep on sm70-

* Guard __isLocal from NVHPC

* Use PTX instead of running nothing from NVHPC

* fixup /s/nvrtc/nvhpc

* Fixup missing CUDA ifdef surrounding device code

* Fix codegen

* Bypass some sort of compiler bug on GCC7

* Apply suggestions from code review

* Use unsafe automatic storage atomics in codegen tests

---------

Co-authored-by: Georgy Evtushenko <evtushenko.georgy@gmail.com>
Co-authored-by: gonzalobg <65027571+gonzalobg@users.noreply.github.com>
Co-authored-by: Michael Schellenberger Costa <miscco@nvidia.com>

Refactor the source code layout for `cuda.parallel` (#3177)

* Refactor the source layout for cuda.parallel

* Add copyright

* Address review feedback

* Don't import anything into `experimental` namespace

* fix import

---------

Co-authored-by: Ashwin Srinath <shwina@users.noreply.github.com>

new type-erased memory resources (#2824)

s/_LIBCUDACXX_DECLSPEC_EMPTY_BASES/_CCCL_DECLSPEC_EMPTY_BASES/g (#3186)

Document address stability of `thrust::transform` (#3181)

* Do not document _LIBCUDACXX_MARK_CAN_COPY_ARGUMENTS
* Reformat and fix UnaryFunction/BinaryFunction in transform docs
* Mention transform can use proclaim_copyable_arguments
* Document cuda::proclaims_copyable_arguments better
* Deprecate depending on transform functor argument addresses

Fixes: #3053

turn off cuda version check for clangd (#3194)

[STF] jacobi example based on parallel_for (#3187)

* Simple jacobi example with parallel for and reductions

* clang-format

* remove useless capture list

fixes pre-nv_diag suppression issues (#3189)

Prefer c2h::type_name over c2h::demangle (#3195)

Fix memcpy_async* tests (#3197)

* memcpy_async_tx: Fix bug in test

Two bugs, one of which occurs in practice:

1. There is a missing fence.proxy.space::global between the writes to
   global memory and the memcpy_async_tx. (Occurs in practice)

2. The end of the kernel should be fenced with `__syncthreads()`,
   because the barrier is invalidated in the destructor. If other
   threads are still waiting on it, there will be UB. (Has not yet
   manifested itself)

* cp_async_bulk_tensor: Pre-emptively fence more in test

Add type annotations and mypy checks for `cuda.parallel`  (#3180)

* Refactor the source layout for cuda.parallel

* Add initial type annotations

* Update pre-commit config

* More typing

* Fix bad merge

* Fix TYPE_CHECKING and numpy annotations

* typing bindings.py correctly

* Address review feedback

---------

Co-authored-by: Ashwin Srinath <shwina@users.noreply.github.com>

Fix rendering of cuda.parallel docs (#3192)

* Fix pre-commit config for codespell and remaining typos

* Fix rendering of docs for cuda.parallel

---------

Co-authored-by: Ashwin Srinath <shwina@users.noreply.github.com>

Enable PDL for DeviceMergeSortBlockSortKernel (#3199)

The kernel already contains a call to _CCCL_PDL_GRID_DEPENDENCY_SYNC.
This commit enables PDL when launching the kernel.

Adds support for large `num_items` to `DeviceReduce::{ArgMin,ArgMax}` (#2647)

* adds benchmarks for reduce::arg{min,max}

* preliminary streaming arg-extremum reduction

* fixes implicit conversion

* uses streaming dispatch class

* changes arg benches to use new streaming reduce

* streaming arg-extrema reduction

* fixes style

* fixes compilation failures

* cleanups

* adds rst style comments

* declare vars const and use clamp

* consolidates argmin argmax benchmarks

* fixes thrust usage

* drops offset type in arg-extrema benchmarks

* fixes clang cuda

* exec space macros

* switch to signed global offset type for slightly better perf

* clarifies documentation

* applies minor benchmark style changes from review comments

* fixes interface documentation and comments

* list-init accumulating output op

* improves style, comments, and tests

* cleans up aggregate init

* renames dispatch class usage in benchmarks

* fixes merge conflicts

* addresses review comments

* addresses review comments

* fixes assertion

* removes superseded implementation

* changes large problem tests to use new interface

* removes obsolete tests for deprecated interface

Fixes for Python 3.7 docs environment (#3206)

Co-authored-by: Ashwin Srinath <shwina@users.noreply.github.com>

Adds support for large number of items to `DeviceTransform` (#3172)

* moves large problem test helper to common file

* adds support for large num items to device transform

* adds tests for large number of items to device interface

* fixes format

* addresses review comments

cp_async_bulk: Fix test (#3198)

* memcpy_async_tx: Fix bug in test

Two bugs, one of which occurs in practice:

1. There is a missing fence.proxy.space::global between the writes to
   global memory and the memcpy_async_tx. (Occurs in practice)

2. The end of the kernel should be fenced with `__syncthreads()`,
   because the barrier is invalidated in the destructor. If other
   threads are still waiting on it, there will be UB. (Has not yet
   manifested itself)

* cp_async_bulk_tensor: Pre-emptively fence more in test

* cp_async_bulk: Fix test

The global memory pointer could be misaligned.

cudax fixes for msvc 14.41 (#3200)

avoid instantiating class templates in `is_same` implementation when possible (#3203)

Fix: make launchers a CUB detail; make kernel source functions hidden. (#3209)

* Fix: make launchers a CUB detail; make kernel source functions hidden.

* [pre-commit.ci] auto code formatting

* Address review comments, fix which macro gets fixed.

help the ranges concepts recognize standard contiguous iterators in c++14/17 (#3202)

unify macros and cmake options that control the suppression of deprecation warnings (#3220)

* unify macros and cmake options that control the suppression of deprecation warnings

* suppress nvcc warning #186 in thrust header tests

* suppress c++ dialect deprecation warnings in libcudacxx header tests

Fix thread-reduce performance regression (#3225)

cuda.parallel: In-memory caching of build objects (#3216)

* Define __eq__ and __hash__ for Iterators

* Define cache_with_key utility and use it to cache Reduce objects

* Add tests for caching Reduce objects

* Tighten up types

* Updates to support 3.7

* Address review feedback

* Introduce IteratorKind to hold iterator type information

* Use the .kind to generate an abi_name

* Remove __eq__ and __hash__ methods from IteratorBase

* Move helper function

* Formatting

* Don't unpack tuple in cache key

---------

Co-authored-by: Ashwin Srinath <shwina@users.noreply.github.com>

Just enough ranges for c++14 `span` (#3211)

use generalized concepts portability macros to simplify the `range` concept (#3217)

fixes some issues in the concepts portability macros and then re-implements the `range` concept with `_CCCL_REQUIRES_EXPR`

Use Ruff to sort imports (#3230)

* Update pyproject.tomls for import sorting

* Update files after running pre-commit

* Move ruff config to pyproject.toml

---------

Co-authored-by: Ashwin Srinath <shwina@users.noreply.github.com>

fix tuning_scan sm90 config issue (#3236)

Co-authored-by: Shijie Chen <shijiec@nvidia.com>

[STF] Logical token (#3196)

* Split the implementation of the void interface into the definition of the interface, and its implementations on streams and graphs.

* Add missing files

* Check if a task implementation can match a prototype where the void_interface arguments are ignored

* Implement ctx.abstract_logical_data() which relies on a void data interface

* Illustrate how to use abstract handles in local contexts

* Introduce an is_void_interface() virtual method in the data interface to potentially optimize some stages

* Small improvements in the examples

* Do not try to allocate or move void data

* Do not use I as a variable

* fix linkage error

* rename abstract_logical_data into logical_token

* Document logical token

* fix spelling error

* fix sphinx error

* reflect name changes

* use meaningful variable names

* simplify logical_token implementation because writeback is already disabled

* add a unit test for token elision

* implement token elision in host_launch

* Remove unused type

* Implement helpers to check if a function can be invoked from a tuple, or from a tuple where we removed tokens

* Much simpler is_tuple_invocable_with_filtered implementation

* Fix buggy test

* Factorize code

* Document that we can ignore tokens for task and host_launch

* Documentation for logical data freeze

Fix ReduceByKey tuning (#3240)

Fix RLE tuning (#3239)

cuda.parallel: Forbid non-contiguous arrays as inputs (or outputs) (#3233)

* Forbid non-contiguous arrays as inputs (or outputs)

* Implement a more robust way to check for contiguity

* Don't bother if cublas unavailable

* Fix how we check for zero-element arrays

* sort imports

---------

Co-authored-by: Ashwin Srinath <shwina@users.noreply.github.com>

expands support for more offset types in segmented benchmark (#3231)

Add escape hatches to the cmake configuration of the header tests so that we can test deprecated compilers / dialects (#3253)

* Add escape hatches to the cmake configuration of the header tests so that we can test deprecated compilers / dialects

* Do not add option twice

ptx: Add add_instruction.py (#3190)

This file helps create the necessary structure for new PTX instructions.

Co-authored-by: Allard Hendriksen <ahendriksen@nvidia.com>

Bump main to 2.9.0. (#3247)

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

Drop cub::Mutex (#3251)

Fixes: #3250

Remove legacy macros from CUB util_arch.cuh (#3257)

Fixes: #3256

Remove thrust::[unary|binary]_traits (#3260)

Fixes: #3259

Architecture and OS identification macros (#3237)

Bump main to 3.0.0. (#3265)

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

Drop thrust not1 and not2 (#3264)

Fixes: #3263

CCCL Internal macro documentation (#3238)

Deprecate GridBarrier and GridBarrierLifetime (#3258)

Fixes: #1389

Require at least gcc7 (#3268)

Fixes: #3267

Drop thrust::[unary|binary]_function (#3274)

Fixes: #3273

Drop ICC from CI (#3277)

[STF] Corruption of the capture list of an extended lambda with a parallel_for construct on a host execution place (#3270)

* Add a test to reproduce a bug observed with parallel_for on a host place

* clang-format

* use _CCCL_ASSERT

* Attempt to debug

* do not create a tuple with a universal reference that is out of scope when we use it, use an lvalue instead

* fix lambda expression

* clang-format

Enable thrust::identity test for non-MSVC (#3281)

This seems to be an oversight when the test was added

Co-authored-by: Michael Schellenberger Costa <miscco@nvidia.com>

Enable PDL in triple chevron launch (#3282)

It seems PDL was disabled by accident when _THRUST_HAS_PDL was renamed
to _CCCL_HAS_PDL during the review introducing the feature.

Disambiguate line continuations and macro continuations in <nv/target> (#3244)

Drop VS 2017 from CI (#3287)

Fixes: #3286

Drop ICC support in code (#3279)

* Drop ICC from code

Fixes: #3278

Co-authored-by: Michael Schellenberger Costa <miscco@nvidia.com>

Make CUB NVRTC commandline arguments come from a cmake template (#3292)

Propose the same components (thrust, cub, libc++, cudax, cuda.parallel,...) in the bug report template than in the feature request template (#3295)

Use process isolation instead of default hyper-v for Windows. (#3294)

Try improving build times by using process isolation instead of hyper-v

Co-authored-by: Michael Schellenberger Costa <miscco@nvidia.com>

[pre-commit.ci] pre-commit autoupdate (#3248)

* [pre-commit.ci] pre-commit autoupdate

updates:
- [github.com/pre-commit/mirrors-clang-format: v18.1.8 → v19.1.6](https://github.com/pre-commit/mirrors-clang-format/compare/v18.1.8...v19.1.6)
- [github.com/astral-sh/ruff-pre-commit: v0.8.3 → v0.8.6](https://github.com/astral-sh/ruff-pre-commit/compare/v0.8.3...v0.8.6)
- [github.com/pre-commit/mirrors-mypy: v1.13.0 → v1.14.1](https://github.com/pre-commit/mirrors-mypy/compare/v1.13.0...v1.14.1)

Co-authored-by: Michael Schellenberger Costa <miscco@nvidia.com>

Drop Thrust legacy arch macros (#3298)

Which were disabled and could be re-enabled using THRUST_PROVIDE_LEGACY_ARCH_MACROS

Drop Thrust's compiler_fence.h (#3300)

Drop CTK 11.x from CI (#3275)

* Add cuda12.0-gcc7 devcontainer
* Move MSVC2017 jobs to CTK 12.6
That is the only combination where rapidsai has devcontainers
* Add /Zc:__cplusplus for the libcudacxx tests
* Only add escape hatch for affected CTKs
* Workaround missing cudaLaunchKernelEx on MSVC
cudaLaunchKernelEx requires C++11, but unfortunately <cuda_runtime.h> checks this using the __cplusplus macro, which is reported wrongly by MSVC. CTK 12.3 fixed this by additionally detecting _MSC_VER. As a workaround, we provide our own copy of cudaLaunchKernelEx when it is not available from the CTK.
* Workaround nvcc+MSVC issue
* Regenerate devcontainers

Fixes: #3249

Co-authored-by: Michael Schellenberger Costa <miscco@nvidia.com>

Drop CUB's util_compiler.cuh (#3302)

All contained macros were deprecated

Update packman and repo_docs versions (#3293)

Co-authored-by: Ashwin Srinath <shwina@users.noreply.github.com>

Drop Thrust's deprecated compiler macros (#3301)

Drop CUB_RUNTIME_ENABLED and __THRUST_HAS_CUDART__ (#3305)

Adds support for large number of items to `DevicePartition::If` with the `ThreeWayPartition` overload (#2506)

* adds support for large number of items to three-way partition

* adapts interface to use choose_signed_offset_t

* integrates applicable feedback from device-select pr

* changes behavior for empty problems

* unifies grid constant macro

* fixes kernel template specialization mismatch

* integrates _CCCL_GRID_CONSTANT changes

* resolve merge conflicts

* fixes checks in test

* fixes test verification

* improves tests

* makes few improvements to streaming dispatch

* improves code comment on test

* fixes unrelated compiler error

* minor style improvements

Refactor scan tunings (#3262)

Require C++17 for compiling Thrust and CUB (#3255)

* Issue an unsuppressable warning when compiling with < C++17
* Remove C++11/14 presets
* Remove CCCL_IGNORE_DEPRECATED_CPP_DIALECT from headers
* Remove [CUB|THRUST|TCT]_IGNORE_DEPRECATED_CPP_[11|14]
* Remove CUB_ENABLE_DIALECT_CPP[11|14]
* Update CI runs
* Remove C++11/14 CI runs for CUB and Thrust
* Raise compiler minimum versions for C++17
* Update ReadMe
* Drop Thrust's cpp14_required.h
* Add escape hatch for C++17 removal

Fixes: #3252

Implement `views::empty` (#3254)

* Disable pair conversion of subrange with clang in C++17

* Fix namespace views

* Implement `views::empty`

This implements `std::ranges::views::empty`, see https://en.cppreference.com/w/cpp/ranges/empty_view

Refactor `limits` and `climits` (#3221)

* implement builtins for huge val, nan and nans

* change `INFINITY` and `NAN` implementation for NVRTC

cuda.parallel: Add documentation for the current iterators along with examples and tests (#3311)

* Add tests demonstrating usage of different iterators

* Update documentation of reduce_into by merging import code snippet with the rest of the example

* Add documentation for current iterators

* Run pre-commit checks and update accordingly

* Fix comments to refer to the proper lines in the code snippets in the docs

Drop clang<14 from CI, update devcontainers. (#3309)

Co-authored-by: Bernhard Manfred Gruber <bernhardmgruber@gmail.com>

[STF] Cleanup task dependencies object constructors (#3291)

* Define tag types for access modes

* - Rework how we build task_dep objects based on access mode tags
- pack_state is now responsible for using a const_cast for read only data

* Greatly simplify the previous attempt : do not define new types, but use integral constants based on the enums

* It seems the const_cast was not necessary, so we can simplify it and not even do some dispatch based on access modes

Disable test with a gcc-14 regression (#3297)

Deprecate Thrust's cpp_compatibility.h macros (#3299)

Remove dropped function objects from docs (#3319)

Document `NV_TARGET` macros (#3313)

[STF] Define ctx.pick_stream() which was missing for the unified context (#3326)

* Define ctx.pick_stream() which was missing for the unified context

* clang-format

Deprecate cub::IterateThreadStore (#3337)

Drop CUB's BinaryFlip operator (#3332)

Deprecate cub::Swap (#3333)

Clarify transform output can overlap input (#3323)

Drop CUB APIs with a debug_synchronous parameter (#3330)

Fixes: #3329

Drop CUB's util_compiler.cuh for real (#3340)

PR #3302 planned to drop the file, but only dropped its content. This
was an oversight. So let's drop the entire file.

Drop cub::ValueCache (#3346)

limits offset types for merge sort (#3328)

Drop CDPv1 (#3344)

Fixes: #3341

Drop thrust::void_t (#3362)

Use cuda::std::addressof in Thrust (#3363)

Fix all_of documentation for empty ranges (#3358)

all_of always returns true on an empty range.

[STF] Do not keep track of dangling events in a CUDA graph backend (#3327)

* Unlike the CUDA stream backend, nodes in a CUDA graph are necessarily done when
the CUDA graph completes. Therefore keeping track of "dangling events" is a
waste of time and resources.

* replace can_ignore_dangling_events by track_dangling_events which leads to more readable code

* When not storing the dangling events, we must still perform the deinit operations that were producing these events!

Extract scan kernels into NVRTC-compilable header (#3334)

* Extract scan kernels into NVRTC-compilable header

* Update cub/cub/device/dispatch/dispatch_scan.cuh

Co-authored-by: Georgii Evtushenko <evtushenko.georgy@gmail.com>

---------

Co-authored-by: Ashwin Srinath <shwina@users.noreply.github.com>
Co-authored-by: Georgii Evtushenko <evtushenko.georgy@gmail.com>

Drop deprecated aliases in Thrust functional (#3272)

Fixes: #3271

Drop cub::DivideAndRoundUp (#3347)

Use cuda::std::min/max in Thrust (#3364)

Implement `cuda::std::numeric_limits` for `__half` and `__nv_bfloat16` (#3361)

* implement `cuda::std::numeric_limits` for `__half` and `__nv_bfloat16`

Cleanup util_arch (#2773)

Deprecate thrust::null_type (#3367)

Deprecate cub::DeviceSpmv (#3320)

Fixes: #896

Improves `DeviceSegmentedSort` test run time for large number of items and segments (#3246)

* fixes segment offset generation

* switches to analytical verification

* switches to analytical verification for pairs

* fixes spelling

* adds tests for large number of segments

* fixes narrowing conversion in tests

* addresses review comments

* fixes includes

Compile basic infra test with C++17 (#3377)

Adds support for large number of items and large number of segments to `DeviceSegmentedSort` (#3308)

* fixes segment offset generation

* switches to analytical verification

* switches to analytical verification for pairs

* addresses review comments

* introduces segment offset type

* adds tests for large number of segments

* adds support for large number of segments

* drops segment offset type

* fixes thrust namespace

* removes about-to-be-deprecated cub iterators

* no exec specifier on defaulted ctor

* fixes gcc7 linker error

* uses local_segment_index_t throughout

* determine offset type based on type returned by segment iterator begin/end iterators

* minor style improvements

Exit with error when RAPIDS CI fails. (#3385)

cuda.parallel: Support structured types as algorithm inputs (#3218)

* Introduce gpu_struct decorator and typing

* Enable `reduce` to accept arrays of structs as inputs

* Add test for reducing arrays-of-struct

* Update documentation

* Use a numpy array rather than ctypes object

* Change zeros -> empty for output array and temp storage

* Add a TODO for typing GpuStruct

* Documentation updates

* Remove test_reduce_struct_type from test_reduce.py

* Revert to `to_cccl_value()` accepting ndarray + GpuStruct

* Bump copyrights

---------

Co-authored-by: Ashwin Srinath <shwina@users.noreply.github.com>

Deprecate thrust::async (#3324)

Fixes: #100

Review/Deprecate CUB `util.ptx` for CCCL 2.x (#3342)

Fix broken `_CCCL_BUILTIN_ASSUME` macro (#3314)

* add compiler-specific path
* fix device code path
* add _CCC_ASSUME

Deprecate thrust::numeric_limits (#3366)

Replace `typedef` with `using` in libcu++ (#3368)

Deprecate thrust::optional (#3307)

Fixes: #3306

Upgrade to Catch2 3.8  (#3310)

Fixes: #1724

refactor `<cuda/std/cstdint>` (#3325)

Co-authored-by: Bernhard Manfred Gruber <bernhardmgruber@gmail.com>

Update CODEOWNERS (#3331)

* Update CODEOWNERS

* Update CODEOWNERS

* Update CODEOWNERS

* [pre-commit.ci] auto code formatting

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

Fix sign-compare warning (#3408)

Implement more cmath functions to be usable on host and device (#3382)

* Implement more cmath functions to be usable on host and device

* Implement math roots functions

* Implement exponential functions

Redefine and deprecate thrust::remove_cvref (#3394)

* Redefine and deprecate thrust::remove_cvref

Co-authored-by: Michael Schellenberger Costa <miscco@nvidia.com>

Fix assert definition for NVHPC due to constexpr issues (#3418)

NVHPC cannot decide at compile time where the code would run, so _CCCL_ASSERT within a constexpr function breaks it.

Fix this by always using the host definition which should also work on device.

Fixes #3411

Extend CUB reduce benchmarks (#3401)

* Rename max.cu to custom.cu, since it uses a custom operator
* Extend types covered by min.cu to all fundamental types
* Add some notes on how to collect tuning parameters

Fixes: #3283

Update upload-pages-artifact to v3 (#3423)

* Update upload-pages-artifact to v3

* Empty commit

---------

Co-authored-by: Ashwin Srinath <shwina@users.noreply.github.com>

Replace and deprecate thrust::cuda_cub::terminate (#3421)

`std::linalg` accessors and `transposed_layout` (#2962)

Add round up/down to multiple (#3234)

[FEA]: Introduce Python module with CCCL headers (#3201)

* Add cccl/python/cuda_cccl directory and use from cuda_parallel, cuda_cooperative

* Run `copy_cccl_headers_to_aude_include()` before `setup()`

* Create python/cuda_cccl/cuda/_include/__init__.py, then simply import cuda._include to find the include path.

* Add cuda.cccl._version exactly as for cuda.cooperative and cuda.parallel

* Bug fix: cuda/_include only exists after shutil.copytree() ran.

* Use `f"cuda-cccl @ file://{cccl_path}/python/cuda_cccl"` in setup.py

* Remove CustomBuildCommand, CustomWheelBuild in cuda_parallel/setup.py (they are equivalent to the default functions)

* Replace := operator (needs Python 3.8+)

* Fix oversights: remove `pip3 install ./cuda_cccl` lines from README.md

* Restore original README.md: `pip3 install -e` now works on first pass.

* cuda_cccl/README.md: FOR INTERNAL USE ONLY

* Remove `$pymajor.$pyminor.` prefix in cuda_cccl _version.py (as suggested under https://github.com/NVIDIA/cccl/pull/3201#discussion_r1894035917)

Command used: ci/update_version.sh 2 8 0

* Modernize pyproject.toml, setup.py

Trigger for this change:

* https://github.com/NVIDIA/cccl/pull/3201#discussion_r1894043178

* https://github.com/NVIDIA/cccl/pull/3201#discussion_r1894044996

* Install CCCL headers under cuda.cccl.include

Trigger for this change:

* https://github.com/NVIDIA/cccl/pull/3201#discussion_r1894048562

Unexpected accidental discovery: cuda.cooperative unit tests pass without CCCL headers entirely.

* Factor out cuda_cccl/cuda/cccl/include_paths.py

* Reuse cuda_cccl/cuda/cccl/include_paths.py from cuda_cooperative

* Add missing Copyright notice.

* Add missing __init__.py (cuda.cccl)

* Add `"cuda.cccl"` to `autodoc.mock_imports`

* Move cuda.cccl.include_paths into function where it is used. (Attempt to resolve Build and Verify Docs failure.)

* Add # TODO: move this to a module-level import

* Modernize cuda_cooperative/pyproject.toml, setup.py

* Convert cuda_cooperative to use hatchling as build backend.

* Revert "Convert cuda_cooperative to use hatchling as build backend."

This reverts commit 61637d608da06fcf6851ef6197f88b5e7dbc3bbe.

* Move numpy from [build-system] requires -> [project] dependencies

* Move pyproject.toml [project] dependencies -> setup.py install_requires, to be able to use CCCL_PATH

* Remove copy_license() and use license_files=["../../LICENSE"] instead.

* Further modernize cuda_cccl/setup.py to use pathlib

* Trivial simplifications in cuda_cccl/pyproject.toml

* Further simplify cuda_cccl/pyproject.toml, setup.py: remove inconsequential code

* Make cuda_cooperative/pyproject.toml more similar to cuda_cccl/pyproject.toml

* Add taplo-pre-commit to .pre-commit-config.yaml

* taplo-pre-commit auto-fixes

* Use pathlib in cuda_cooperative/setup.py

* CCCL_PYTHON_PATH in cuda_cooperative/setup.py

* Modernize cuda_parallel/pyproject.toml, setup.py

* Use pathlib in cuda_parallel/setup.py

* Add `# TOML lint & format` comment.

* Replace MANIFEST.in with `[tool.setuptools.package-data]` section in pyproject.toml

* Use pathlib in cuda/cccl/include_paths.py

* pre-commit autoupdate (EXCEPT clang-format, which was manually restored)

* Fixes after git merge main

* Resolve warning: AttributeError: '_Reduce' object has no attribute 'build_result'

```
=========================================================================== warnings summary ===========================================================================
tests/test_reduce.py::test_reduce_non_contiguous
  /home/coder/cccl/python/devenv/lib/python3.12/site-packages/_pytest/unraisableexception.py:85: PytestUnraisableExceptionWarning: Exception ignored in: <function _Reduce.__del__ at 0x7bf123139080>

  Traceback (most recent call last):
    File "/home/coder/cccl/python/cuda_parallel/cuda/parallel/experimental/algorithms/reduce.py", line 132, in __del__
      bindings.cccl_device_reduce_cleanup(ctypes.byref(self.build_result))
                                                       ^^^^^^^^^^^^^^^^^
  AttributeError: '_Reduce' object has no attribute 'build_result'

    warnings.warn(pytest.PytestUnraisableExceptionWarning(msg))

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
============================================================= 1 passed, 93 deselected, 1 warning in 0.44s ==============================================================
```

* Move `copy_cccl_headers_to_cuda_cccl_include()` functionality to `class CustomBuildPy`

* Introduce cuda_cooperative/constraints.txt

* Also add cuda_parallel/constraints.txt

* Add `--constraint constraints.txt` in ci/test_python.sh

* Update Copyright dates

* Switch to https://github.com/ComPWA/taplo-pre-commit (the other repo has been archived by the owner on Jul 1, 2024)

For completeness: The other repo took a long time to install into the pre-commit cache; so long it led to timeouts in the CCCL CI.

* Remove unused cuda_parallel jinja2 dependency (noticed by chance).

* Remove constraints.txt files, advertise running `pip install cuda-cccl` first instead.

* Make cuda_cooperative, cuda_parallel testing completely independent.

* Run only test_python.sh [skip-rapids][skip-matx][skip-docs][skip-vdc]

* Try using another runner (because V100 runners seem to be stuck) [skip-rapids][skip-matx][skip-docs][skip-vdc]

* Fix sign-compare warning (#3408) [skip-rapids][skip-matx][skip-docs][skip-vdc]

* Revert "Try using another runner (because V100 runners seem to be stuck) [skip-rapids][skip-matx][skip-docs][skip-vdc]"

This reverts commit ea33a218ed77a075156cd1b332047202adb25aa2.

Error message: https://github.com/NVIDIA/cccl/pull/3201#issuecomment-2594012971

* Try using A100 runner (because V100 runners still seem to be stuck) [skip-rapids][skip-matx][skip-docs][skip-vdc]

* Also show cuda-cooperative site-packages, cuda-parallel site-packages (after pip install) [skip-rapids][skip-matx][skip-docs][skip-vdc]

* Try using l4 runner (because V100 runners still seem to be stuck) [skip-rapids][skip-matx][skip-docs][skip-vdc]

* Restore original ci/matrix.yaml [skip-rapids]

* Use for loop in test_python.sh to avoid code duplication.

* Run only test_python.sh [skip-rapids][skip-matx][skip-docs][skip-vdc][skip pre-commit.ci]

* Comment out taplo-lint in pre-commit config [skip-rapids][skip-matx][skip-docs][skip-vdc]

* Revert "Run only test_python.sh [skip-rapids][skip-matx][skip-docs][skip-vdc][skip pre-commit.ci]"

This reverts commit ec206fd8b50a6a293e00a5825b579e125010b13d.

* Implement suggestion by @shwina (https://github.com/NVIDIA/cccl/pull/3201#pullrequestreview-2556918460)

* Address feedback by @leofang

---------

Co-authored-by: Bernhard Manfred Gruber <bernhardmgruber@gmail.com>

cuda.parallel: Add optional stream argument to reduce_into() (#3348)

* Add optional stream argument to reduce_into()

* Add tests to check for reduce_into() stream behavior

* Move protocol related utils to separate file and rework __cuda_stream__ error messages

* Fix synchronization issue in stream test and add one more invalid stream test case

* Rename cuda stream validation function after removing leading underscore

* Unpack values from __cuda_stream__ instead of indexing

* Fix linting errors

* Handle TypeError when unpacking invalid __cuda_stream__ return

* Use stream to allocate cupy memory in new stream test
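The `__cuda_stream__` protocol referenced above yields a `(version, handle)` pair; a minimal validation sketch in that spirit (function name and error messages are illustrative, not the cuda.parallel API):

```python
def validate_cuda_stream(stream) -> int:
    """Return the raw stream handle, raising TypeError for malformed input."""
    try:
        # Unpack instead of indexing: non-iterables and wrong arities
        # both surface here as TypeError/ValueError.
        version, handle = stream.__cuda_stream__()
    except AttributeError:
        raise TypeError(f"{type(stream).__name__} does not implement __cuda_stream__")
    except (TypeError, ValueError):
        raise TypeError("__cuda_stream__ must return a (version, handle) pair")
    if not isinstance(handle, int):
        raise TypeError("stream handle must be an integer")
    return handle
```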

Upgrade to actions/deploy-pages@v4 (from v2), as suggested by @leofang (#3434)

Deprecate `cub::{min, max}` and replace internal uses with those from libcu++ (#3419)

* Deprecate `cub::{min, max}` and replace internal uses with those from libcu++

Fixes #3404

move to c++17, finalize device optimization

fix msvc compilation, update tests

Deprecate C++11 and C++14 for libcu++ (#3173)

* Deprecate C++11 and C++14 for libcu++

Co-authored-by: Bernhard Manfred Gruber <bernhardmgruber@gmail.com>

Implement `abs` and `div` from `cstdlib` (#3153)

* implement integer abs functions
* improve tests, fix constexpr support
* just use the our implementation
* implement `cuda::std::div`
* prefer host's `div_t` like types
* provide `cuda::std::abs` overloads for floats
* allow fp abs for NVRTC
* silence msvc's warning about conversion from floating point to integral

Fix missing radix sort policies (#3174)

Fixes NVBug 5009941

Introduces new `DeviceReduce::Arg{Min,Max}` interface with two output iterators (#3148)

* introduces new arg{min,max} interface with two output iterators

* adds fp inf tests

* fixes docs

* improves code example

* fixes exec space specifier

* trying to fix deprecation warning for more compilers

* inlines unzip operator

* trying to fix deprecation warning for nvhpc

* integrates suppression fixes in diagnostics

* pre-ctk 11.5 deprecation suppression

* fixes icc

* fix for pre-ctk11.5

* cleans up deprecation suppression

* cleanup

Extend tuning documentation (#3179)

Add codespell pre-commit hook, fix typos in CCCL (#3168)

* Add codespell pre-commit hook
* Automatic changes from codespell.
* Manual changes.

Fix parameter space for TUNE_LOAD in scan benchmark (#3176)

fix various old compiler checks (#3178)

implement C++26 `std::projected` (#3175)

Fix pre-commit config for codespell and remaining typos (#3182)

Massive cleanup of our config (#3155)

Fix UB in atomics with automatic storage (#2586)

* Adds specialized local cuda atomics and injects them into most atomics paths.

Co-authored-by: Georgy Evtushenko <evtushenko.georgy@gmail.com>
Co-authored-by: gonzalobg <65027571+gonzalobg@users.noreply.github.com>

* Allow CUDA 12.2 to keep perf, this addresses earlier comments in #478

* Remove extraneous double brackets in unformatted code.

* Merge unsafe atomic logic into `__cuda_is_local`.

* Use `const_cast` for type conversions in cuda_local.h

* Fix build issues from interface changes

* Fix missing __nanosleep on sm70-

* Guard __isLocal from NVHPC

* Use PTX instead of running nothing from NVHPC

* fixup /s/nvrtc/nvhpc

* Fixup missing CUDA ifdef surrounding device code

* Fix codegen

* Bypass some sort of compiler bug on GCC7

* Apply suggestions from code review

* Use unsafe automatic storage atomics in codegen tests

---------

Co-authored-by: Georgy Evtushenko <evtushenko.georgy@gmail.com>
Co-authored-by: gonzalobg <65027571+gonzalobg@users.noreply.github.com>
Co-authored-by: Michael Schellenberger Costa <miscco@nvidia.com>

Refactor the source code layout for `cuda.parallel` (#3177)

* Refactor the source layout for cuda.parallel

* Add copyright

* Address review feedback

* Don't import anything into `experimental` namespace

* fix import

---------

Co-authored-by: Ashwin Srinath <shwina@users.noreply.github.com>

new type-erased memory resources (#2824)

s/_LIBCUDACXX_DECLSPEC_EMPTY_BASES/_CCCL_DECLSPEC_EMPTY_BASES/g (#3186)

Document address stability of `thrust::transform` (#3181)

* Do not document _LIBCUDACXX_MARK_CAN_COPY_ARGUMENTS
* Reformat and fix UnaryFunction/BinaryFunction in transform docs
* Mention transform can use proclaim_copyable_arguments
* Document cuda::proclaims_copyable_arguments better
* Deprecate depending on transform functor argument addresses

Fixes: #3053

turn off cuda version check for clangd (#3194)

[STF] jacobi example based on parallel_for (#3187)

* Simple jacobi example with parallel for and reductions

* clang-format

* remove useless capture list

fixes pre-nv_diag suppression issues (#3189)

Prefer c2h::type_name over c2h::demangle (#3195)

Fix memcpy_async* tests (#3197)

* memcpy_async_tx: Fix bug in test

Two bugs, one of which occurs in practice:

1. There is a missing fence.proxy.space::global between the writes to
   global memory and the memcpy_async_tx. (Occurs in practice)

2. The end of the kernel should be fenced with `__syncthreads()`,
   because the barrier is invalidated in the destructor. If other
   threads are still waiting on it, there will be UB. (Has not yet
   manifested itself)

* cp_async_bulk_tensor: Pre-emptively fence more in test

Add type annotations and mypy checks for `cuda.parallel`  (#3180)

* Refactor the source layout for cuda.parallel

* Add initial type annotations

* Update pre-commit config

* More typing

* Fix bad merge

* Fix TYPE_CHECKING and numpy annotations

* typing bindings.py correctly

* Address review feedback

---------

Co-authored-by: Ashwin Srinath <shwina@users.noreply.github.com>

Fix rendering of cuda.parallel docs (#3192)

* Fix pre-commit config for codespell and remaining typos

* Fix rendering of docs for cuda.parallel

---------

Co-authored-by: Ashwin Srinath <shwina@users.noreply.github.com>

Enable PDL for DeviceMergeSortBlockSortKernel (#3199)

The kernel already contains a call to _CCCL_PDL_GRID_DEPENDENCY_SYNC.
This commit enables PDL when launching the kernel.

Adds support for large `num_items` to `DeviceReduce::{ArgMin,ArgMax}` (#2647)

* adds benchmarks for reduce::arg{min,max}

* preliminary streaming arg-extremum reduction

* fixes implicit conversion

* uses streaming dispatch class

* changes arg benches to use new streaming reduce

* streaming arg-extrema reduction

* fixes style

* fixes compilation failures

* cleanups

* adds rst style comments

* declare vars const and use clamp

* consolidates argmin argmax benchmarks

* fixes thrust usage

* drops offset type in arg-extrema benchmarks

* fixes clang cuda

* exec space macros

* switch to signed global offset type for slightly better perf

* clarifies documentation

* applies minor benchmark style changes from review comments

* fixes interface documentation and comments

* list-init accumulating output op

* improves style, comments, and tests

* cleans up aggregate init

* renames dispatch class usage in benchmarks

* fixes merge conflicts

* addresses review comments

* addresses review comments

* fixes assertion

* removes superseded implementation

* changes large problem tests to use new interface

* removes obsolete tests for deprecated interface

Fixes for Python 3.7 docs environment (#3206)

Co-authored-by: Ashwin Srinath <shwina@users.noreply.github.com>

Adds support for large number of items to `DeviceTransform` (#3172)

* moves large problem test helper to common file

* adds support for large num items to device transform

* adds tests for large number of items to device interface

* fixes format

* addresses review comments

cp_async_bulk: Fix test (#3198)

* memcpy_async_tx: Fix bug in test

Two bugs, one of which occurs in practice:

1. There is a missing fence.proxy.space::global between the writes to
   global memory and the memcpy_async_tx. (Occurs in practice)

2. The end of the kernel should be fenced with `__syncthreads()`,
   because the barrier is invalidated in the destructor. If other
   threads are still waiting on it, there will be UB. (Has not yet
   manifested itself)

* cp_async_bulk_tensor: Pre-emptively fence more in test

* cp_async_bulk: Fix test

The global memory pointer could be misaligned.

cudax fixes for msvc 14.41 (#3200)

avoid instantiating class templates in `is_same` implementation when possible (#3203)

Fix: make launchers a CUB detail; make kernel source functions hidden. (#3209)

* Fix: make launchers a CUB detail; make kernel source functions hidden.

* [pre-commit.ci] auto code formatting

* Address review comments, fix which macro gets fixed.

help the ranges concepts recognize standard contiguous iterators in c++14/17 (#3202)

unify macros and cmake options that control the suppression of deprecation warnings (#3220)

* unify macros and cmake options that control the suppression of deprecation warnings

* suppress nvcc warning #186 in thrust header tests

* suppress c++ dialect deprecation warnings in libcudacxx header tests

Fix thread-reduce performance regression (#3225)

cuda.parallel: In-memory caching of build objects (#3216)

* Define __eq__ and __hash__ for Iterators

* Define cache_with_key utility and use it to cache Reduce objects

* Add tests for caching Reduce objects

* Tighten up types

* Updates to support 3.7

* Address review feedback

* Introduce IteratorKind to hold iterator type information

* Use the .kind to generate an abi_name

* Remove __eq__ and __hash__ methods from IteratorBase

* Move helper function

* Formatting

* Don't unpack tuple in cache key

---------

Co-authored-by: Ashwin Srinath <shwina@users.noreply.github.com>
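A `cache_with_key` utility as described above can be sketched as a decorator that derives the cache key from the call arguments (an illustrative re-implementation, not the cuda.parallel source):

```python
import functools

def cache_with_key(key):
    """Memoize a function, using key(*args, **kwargs) as the cache key."""
    def decorator(func):
        cache = {}

        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            k = key(*args, **kwargs)
            if k not in cache:
                cache[k] = func(*args, **kwargs)
            return cache[k]

        wrapper.cache = cache  # exposed so callers can inspect cache hits
        return wrapper

    return decorator
```

In the PR, the key is built from properties such as dtypes and the iterator's `kind`, so two calls that would compile to the same kernel share one build object.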

Just enough ranges for c++14 `span` (#3211)

use generalized concepts portability macros to simplify the `range` concept (#3217)

fixes some issues in the concepts portability macros and then re-implements the `range` concept with `_CCCL_REQUIRES_EXPR`

Use Ruff to sort imports (#3230)

* Update pyproject.tomls for import sorting

* Update files after running pre-commit

* Move ruff config to pyproject.toml

---------

Co-authored-by: Ashwin Srinath <shwina@users.noreply.github.com>

fix tuning_scan sm90 config issue (#3236)

Co-authored-by: Shijie Chen <shijiec@nvidia.com>

[STF] Logical token (#3196)

* Split the implementation of the void interface into the definition of the interface, and its implementations on streams and graphs.

* Add missing files

* Check if a task implementation can match a prototype where the void_interface arguments are ignored

* Implement ctx.abstract_logical_data() which relies on a void data interface

* Illustrate how to use abstract handles in local contexts

* Introduce an is_void_interface() virtual method in the data interface to potentially optimize some stages

* Small improvements in the examples

* Do not try to allocate or move void data

* Do not use I as a variable

* fix linkage error

* rename abstract_logical_data into logical_token

* Document logical token

* fix spelling error

* fix sphinx error

* reflect name changes

* use meaningful variable names

* simplify logical_token implementation because writeback is already disabled

* add a unit test for token elision

* implement token elision in host_launch

* Remove unused type

* Implement helpers to check if a function can be invoked from a tuple, or from a tuple where we removed tokens

* Much simpler is_tuple_invocable_with_filtered implementation

* Fix buggy test

* Factorize code

* Document that we can ignore tokens for task and host_launch

* Documentation for logical data freeze

Fix ReduceByKey tuning (#3240)

Fix RLE tuning (#3239)

cuda.parallel: Forbid non-contiguous arrays as inputs (or outputs) (#3233)

* Forbid non-contiguous arrays as inputs (or outputs)

* Implement a more robust way to check for contiguity

* Don't bother if cublas unavailable

* Fix how we check for zero-element arrays

* sort imports

---------

Co-authored-by: Ashwin Srinath <shwina@users.noreply.github.com>
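Contiguity can be decided from shape and strides alone, which also makes the zero-element fix above explicit; a pure-Python sketch of a C-contiguity check (the actual cuda.parallel check inspects CUDA array interfaces):

```python
def is_c_contiguous(shape, strides, itemsize):
    """Check C-contiguity from shape and strides (both in elements/bytes).

    Zero-element arrays are trivially contiguous, and axes of extent 1
    place no constraint on their stride.
    """
    if 0 in shape:
        return True
    expected = itemsize
    for extent, stride in zip(reversed(shape), reversed(strides)):
        if extent > 1 and stride != expected:
            return False
        expected *= extent
    return True
```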

expands support for more offset types in segmented benchmark (#3231)

Add escape hatches to the cmake configuration of the header tests so that we can test deprecated compilers / dialects (#3253)

* Add escape hatches to the cmake configuration of the header tests so that we can test deprecated compilers / dialects

* Do not add option twice

ptx: Add add_instruction.py (#3190)

This file helps create the necessary structure for new PTX instructions.

Co-authored-by: Allard Hendriksen <ahendriksen@nvidia.com>

Bump main to 2.9.0. (#3247)

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

Drop cub::Mutex (#3251)

Fixes: #3250

Remove legacy macros from CUB util_arch.cuh (#3257)

Fixes: #3256

Remove thrust::[unary|binary]_traits (#3260)

Fixes: #3259

Architecture and OS identification macros (#3237)

Bump main to 3.0.0. (#3265)

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

Drop thrust not1 and not2 (#3264)

Fixes: #3263

CCCL Internal macro documentation (#3238)

Deprecate GridBarrier and GridBarrierLifetime (#3258)

Fixes: #1389

Require at least gcc7 (#3268)

Fixes: #3267

Drop thrust::[unary|binary]_function (#3274)

Fixes: #3273

Drop ICC from CI (#3277)

[STF] Corruption of the capture list of an extended lambda with a parallel_for construct on a host execution place (#3270)

* Add a test to reproduce a bug observed with parallel_for on a host place

* clang-format

* use _CCCL_ASSERT

* Attempt to debug

* do not create a tuple with a universal reference that is out of scope when we use it, use an lvalue instead

* fix lambda expression

* clang-format

Enable thrust::identity test for non-MSVC (#3281)

This seems to be an oversight when the test was added

Co-authored-by: Michael Schellenberger Costa <miscco@nvidia.com>

Enable PDL in triple chevron launch (#3282)

It seems PDL was disabled by accident when _THRUST_HAS_PDL was renamed
to _CCCL_HAS_PDL during the review introducing the feature.

Disambiguate line continuations and macro continuations in <nv/target> (#3244)

Drop VS 2017 from CI (#3287)

Fixes: #3286

Drop ICC support in code (#3279)

* Drop ICC from code

Fixes: #3278

Co-authored-by: Michael Schellenberger Costa <miscco@nvidia.com>

Make CUB NVRTC commandline arguments come from a cmake template (#3292)

Propose the same components (thrust, cub, libc++, cudax, cuda.parallel,...) in the bug report template as in the feature request template (#3295)

Use process isolation instead of default hyper-v for Windows. (#3294)

Try improving build times by using process isolation instead of hyper-v

Co-authored-by: Michael Schellenberger Costa <miscco@nvidia.com>

[pre-commit.ci] pre-commit autoupdate (#3248)

* [pre-commit.ci] pre-commit autoupdate

updates:
- [github.com/pre-commit/mirrors-clang-format: v18.1.8 → v19.1.6](https://github.com/pre-commit/mirrors-clang-format/compare/v18.1.8...v19.1.6)
- [github.com/astral-sh/ruff-pre-commit: v0.8.3 → v0.8.6](https://github.com/astral-sh/ruff-pre-commit/compare/v0.8.3...v0.8.6)
- [github.com/pre-commit/mirrors-mypy: v1.13.0 → v1.14.1](https://github.com/pre-commit/mirrors-mypy/compare/v1.13.0...v1.14.1)

Co-authored-by: Michael Schellenberger Costa <miscco@nvidia.com>

Drop Thrust legacy arch macros (#3298)

Which were disabled and could be re-enabled using THRUST_PROVIDE_LEGACY_ARCH_MACROS

Drop Thrust's compiler_fence.h (#3300)

Drop CTK 11.x from CI (#3275)

* Add cuda12.0-gcc7 devcontainer
* Move MSVC2017 jobs to CTK 12.6
This is the only combination where rapidsai has devcontainers
* Add /Zc:__cplusplus for the libcudacxx tests
* Only add escape hatch for affected CTKs
* Workaround missing cudaLaunchKernelEx on MSVC
cudaLaunchKernelEx requires C++11, but unfortunately <cuda_runtime.h> checks this using the __cplusplus macro, which is reported wrongly for MSVC. CTK 12.3 fixed this by additionally detecting _MSV_VER. As a workaround, we provide our own copy of cudaLaunchKernelEx when it is not available from the CTK.
* Workaround nvcc+MSVC issue
* Regenerate devcontainers

Fixes: #3249

Co-authored-by: Michael Schellenberger Costa <miscco@nvidia.com>

Update packman and repo_docs versions (#3293)

Co-authored-by: Ashwin Srinath <shwina@users.noreply.github.com>

Drop Thrust's deprecated compiler macros (#3301)

Drop CUB_RUNTIME_ENABLED and __THRUST_HAS_CUDART__ (#3305)

Adds support for large number of items to `DevicePartition::If` with the `ThreeWayPartition` overload (#2506)

* adds support for large number of items to three-way partition

* adapts interface to use choose_signed_offset_t

* integrates applicable feedback from device-select pr

* changes behavior for empty problems

* unifies grid constant macro

* fixes kernel template specialization mismatch

* integrates _CCCL_GRID_CONSTANT changes

* resolve merge conflicts

* fixes checks in test

* fixes test verification

* improves tests

* makes few improvements to streaming dispatch

* improves code comment on test

* fixes unrelated compiler error

* minor style improvements

Refactor scan tunings (#3262)

Require C++17 for compiling Thrust and CUB (#3255)

* Issue an unsuppressable warning when compiling with < C++17
* Remove C++11/14 presets
* Remove CCCL_IGNORE_DEPRECATED_CPP_DIALECT from headers
* Remove [CUB|THRUST|TCT]_IGNORE_DEPRECATED_CPP_[11|14]
* Remove CUB_ENABLE_DIALECT_CPP[11|14]
* Update CI runs
* Remove C++11/14 CI runs for CUB and Thrust
* Raise compiler minimum versions for C++17
* Update ReadMe
* Drop Thrust's cpp14_required.h
* Add escape hatch for C++17 removal

Fixes: #3252

Implement `views::empty` (#3254)

* Disable pair conversion of subrange with clang in C++17

* Fix namespace views

* Implement `views::empty`

This implements `std::ranges::views::empty`, see https://en.cppreference.com/w/cpp/ranges/empty_view

Refactor `limits` and `climits` (#3221)

* implement builtins for huge val, nan and nans

* change `INFINITY` and `NAN` implementation for NVRTC

cuda.parallel: Add documentation for the current iterators along with examples and tests (#3311)

* Add tests demonstrating usage of different iterators

* Update documentation of reduce_into by merging import code snippet with the rest of the example

* Add documentation for current iterators

* Run pre-commit checks and update accordingly

* Fix comments to refer to the proper lines in the code snippets in the docs

Drop clang<14 from CI, update devcontainers. (#3309)

Co-authored-by: Bernhard Manfred Gruber <bernhardmgruber@gmail.com>

[STF] Cleanup task dependencies object constructors (#3291)

* Define tag types for access modes

* - Rework how we build task_dep objects based on access mode tags
- pack_state is now responsible for using a const_cast for read only data

* Greatly simplify the previous attempt : do not define new types, but use integral constants based on the enums

* It seems the const_cast was not necessary, so we can simplify it and not even do some dispatch based on access modes

Disable test with a gcc-14 regression (#3297)

Deprecate Thrust's cpp_compatibility.h macros (#3299)

Remove dropped function objects from docs (#3319)

Document `NV_TARGET` macros (#3313)

[STF] Define ctx.pick_stream() which was missing for the unified context (#3326)

* Define ctx.pick_stream() which was missing for the unified context

* clang-format

Deprecate cub::IterateThreadStore (#3337)

Drop CUB's BinaryFlip operator (#3332)

Deprecate cub::Swap (#3333)

Clarify transform output can overlap input (#3323)

Drop CUB APIs with a debug_synchronous parameter (#3330)

Fixes: #3329

Drop CUB's util_compiler.cuh for real (#3340)

PR #3302 planned to drop the file, but only dropped its content. This
was an oversight. So let's drop the entire file.

Drop cub::ValueCache (#3346)

limits offset types for merge sort (#3328)

Drop CDPv1 (#3344)

Fixes: #3341

Drop thrust::void_t (#3362)

Use cuda::std::addressof in Thrust (#3363)

Fix all_of documentation for empty ranges (#3358)

all_of always returns true on an empty range.
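Python's built-in `all` follows the same vacuous-truth convention, which makes a quick sanity check:

```python
def all_of(iterable, pred):
    """Mirror of thrust::all_of's semantics over a Python iterable."""
    return all(pred(x) for x in iterable)

# On an empty range the predicate is never consulted, so the answer is
# True even for an always-false predicate ("for all x in {}: P(x)").
```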

[STF] Do not keep track of dangling events in a CUDA graph backend (#3327)

* Unlike the CUDA stream backend, nodes in a CUDA graph are necessarily done when
the CUDA graph completes. Therefore keeping track of "dangling events" is a
waste of time and resources.

* replace can_ignore_dangling_events with track_dangling_events, which leads to more readable code

* When not storing the dangling events, we must still perform the deinit operations that were producing these events!

Extract scan kernels into NVRTC-compilable header (#3334)

* Extract scan kernels into NVRTC-compilable header

* Update cub/cub/device/dispatch/dispatch_scan.cuh

Co-authored-by: Georgii Evtushenko <evtushenko.georgy@gmail.com>

---------

Co-authored-by: Ashwin Srinath <shwina@users.noreply.github.com>
Co-authored-by: Georgii Evtushenko <evtushenko.georgy@gmail.com>

Drop deprecated aliases in Thrust functional (#3272)

Fixes: #3271

Drop cub::DivideAndRoundUp (#3347)

Use cuda::std::min/max in Thrust (#3364)

Implement `cuda::std::numeric_limits` for `__half` and `__nv_bfloat16` (#3361)

* implement `cuda::std::numeric_limits` for `__half` and `__nv_bfloat16`

Cleanup util_arch (#2773)

Deprecate thrust::null_type (#3367)

Deprecate cub::DeviceSpmv (#3320)

Fixes: #896

Improves `DeviceSegmentedSort` test run time for large number of items and segments (#3246)

* fixes segment offset generation

* switches to analytical verification

* switches to analytical verification for pairs

* fixes spelling

* adds tests for large number of segments

* fixes narrowing conversion in tests

* addresses review comments

* fixes includes

Compile basic infra test with C++17 (#3377)

Adds support for large number of items and large number of segments to `DeviceSegmentedSort` (#3308)

* fixes segment offset generation

* switches to analytical verification

* switches to analytical verification for pairs

* addresses review comments

* introduces segment offset type

* adds tests for large number of segments

* adds support for large number of segments

* drops segment offset type

* fixes thrust namespace

* removes about-to-be-deprecated cub iterators

* no exec specifier on defaulted ctor

* fixes gcc7 linker error

* uses local_segment_index_t throughout

* determine offset type based on type returned by segment iterator begin/end iterators

* minor style improvements

Exit with error when RAPIDS CI fails. (#3385)

cuda.parallel: Support structured types as algorithm inputs (#3218)

* Introduce gpu_struct decorator and typing

* Enable `reduce` to accept arrays of structs as inputs

* Add test for reducing arrays-of-struct

* Update documentation

* Use a numpy array rather than ctypes object

* Change zeros -> empty for output array and temp storage

* Add a TODO for typing GpuStruct

* Documentation updates

* Remove test_reduce_struct_type from test_reduce.py

* Revert to `to_cccl_value()` accepting ndarray + GpuStruct

* Bump copyrights

---------

Co-authored-by: Ashwin Srinath <shwina@users.noreply.github.com>

Deprecate thrust::async (#3324)

Fixes: #100

Review/Deprecate CUB `util.ptx` for CCCL 2.x (#3342)

Fix broken `_CCCL_BUILTIN_ASSUME` macro (#3314)

* add compiler-specific path
* fix device code path
* add _CCCL_ASSUME

Deprecate thrust::numeric_limits (#3366)

Replace `typedef` with `using` in libcu++ (#3368)

Deprecate thrust::optional (#3307)

Fixes: #3306

Upgrade to Catch2 3.8  (#3310)

Fixes: #1724

refactor `<cuda/std/cstdint>` (#3325)

Co-authored-by: Bernhard Manfred Gruber <bernhardmgruber@gmail.com>

Update CODEOWNERS (#3331)

* Update CODEOWNERS

* Update CODEOWNERS

* Update CODEOWNERS

* [pre-commit.ci] auto code formatting

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

Fix sign-compare warning (#3408)

Implement more cmath functions to be usable on host and device (#3382)

* Implement more cmath functions to be usable on host and device

* Implement math roots functions

* Implement exponential functions

Redefine and deprecate thrust::remove_cvref (#3394)

* Redefine and deprecate thrust::remove_cvref

Co-authored-by: Michael Schellenberger Costa <miscco@nvidia.com>

Fix assert definition for NVHPC due to constexpr issues (#3418)

NVHPC cannot decide at compile time where the code would run so _CCCL_ASSERT within a constexpr function breaks it.

Fix this by always using the host definition which should also work on device.

Fixes #3411

Extend CUB reduce benchmarks (#3401)

* Rename max.cu to custom.cu, since it uses a custom operator
* Extend types covered my min.cu to all fundamental types
* Add some notes on how to collect tuning parameters

Fixes: #3283

Update upload-pages-artifact to v3 (#3423)

* Update upload-pages-artifact to v3

* Empty commit

---------

Co-authored-by: Ashwin Srinath <shwina@users.noreply.github.com>

Replace and deprecate thrust::cuda_cub::terminate (#3421)

`std::linalg` accessors and `transposed_layout` (#2962)

Add round up/down to multiple (#3234)

[FEA]: Introduce Python module with CCCL headers (#3201)

* Add cccl/python/cuda_cccl directory and use from cuda_parallel, cuda_cooperative

* Run `copy_cccl_headers_to_aude_include()` before `setup()`

* Create python/cuda_cccl/cuda/_include/__init__.py, then simply import cuda._include to find the include path.

* Add cuda.cccl._version exactly as for cuda.cooperative and cuda.parallel

* Bug fix: cuda/_include only exists after shutil.copytree() ran.

* Use `f"cuda-cccl @ file://{cccl_path}/python/cuda_cccl"` in setup.py

* Remove CustomBuildCommand, CustomWheelBuild in cuda_parallel/setup.py (they are equivalent to the default functions)

* Replace := operator (needs Python 3.8+)

* Fix oversights: remove `pip3 install ./cuda_cccl` lines from README.md

* Restore original README.md: `pip3 install -e` now works on first pass.

* cuda_cccl/README.md: FOR INTERNAL USE ONLY

* Remove `$pymajor.$pyminor.` prefix in cuda_cccl _version.py (as suggested under https://github.com/NVIDIA/cccl/pull/3201#discussion_r1894035917)

Command used: ci/update_version.sh 2 8 0

* Modernize pyproject.toml, setup.py

Trigger for this change:

* https://github.com/NVIDIA/cccl/pull/3201#discussion_r1894043178

* https://github.com/NVIDIA/cccl/pull/3201#discussion_r1894044996

* Install CCCL headers under cuda.cccl.include

Trigger for this change:

* https://github.com/NVIDIA/cccl/pull/3201#discussion_r1894048562

Accidental discovery: the cuda.cooperative unit tests pass entirely without CCCL headers.

* Factor out cuda_cccl/cuda/cccl/include_paths.py

* Reuse cuda_cccl/cuda/cccl/include_paths.py from cuda_cooperative

* Add missing Copyright notice.

* Add missing __init__.py (cuda.cccl)

* Add `"cuda.cccl"` to `autodoc.mock_imports`

* Move cuda.cccl.include_paths into function where it is used. (Attempt to resolve Build and Verify Docs failure.)

* Add # TODO: move this to a module-level import

* Modernize cuda_cooperative/pyproject.toml, setup.py

* Convert cuda_cooperative to use hatchling as build backend.

* Revert "Convert cuda_cooperative to use hatchling as build backend."

This reverts commit 61637d608da06fcf6851ef6197f88b5e7dbc3bbe.

* Move numpy from [build-system] requires -> [project] dependencies

* Move pyproject.toml [project] dependencies -> setup.py install_requires, to be able to use CCCL_PATH

* Remove copy_license() and use license_files=["../../LICENSE"] instead.

* Further modernize cuda_cccl/setup.py to use pathlib

* Trivial simplifications in cuda_cccl/pyproject.toml

* Further simplify cuda_cccl/pyproject.toml, setup.py: remove inconsequential code

* Make cuda_cooperative/pyproject.toml more similar to cuda_cccl/pyproject.toml

* Add taplo-pre-commit to .pre-commit-config.yaml

* taplo-pre-commit auto-fixes

* Use pathlib in cuda_cooperative/setup.py

* CCCL_PYTHON_PATH in cuda_cooperative/setup.py

* Modernize cuda_parallel/pyproject.toml, setup.py

* Use pathlib in cuda_parallel/setup.py

* Add `# TOML lint & format` comment.

* Replace MANIFEST.in with `[tool.setuptools.package-data]` section in pyproject.toml
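For reference, such a replacement typically looks like the fragment below (the package name and globs are illustrative, not the exact ones used in this repository):

```toml
# pyproject.toml -- replaces MANIFEST.in for shipping data files
[tool.setuptools.package-data]
"cuda.cccl.include" = ["**/*.h", "**/*.cuh", "**/*.hpp"]
```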

* Use pathlib in cuda/cccl/include_paths.py

* pre-commit autoupdate (EXCEPT clang-format, which was manually restored)

* Fixes after git merge main

* Resolve warning: AttributeError: '_Reduce' object has no attribute 'build_result'

```
=========================================================================== warnings summary ===========================================================================
tests/test_reduce.py::test_reduce_non_contiguous
  /home/coder/cccl/python/devenv/lib/python3.12/site-packages/_pytest/unraisableexception.py:85: PytestUnraisableExceptionWarning: Exception ignored in: <function _Reduce.__del__ at 0x7bf123139080>

  Traceback (most recent call last):
    File "/home/coder/cccl/python/cuda_parallel/cuda/parallel/experimental/algorithms/reduce.py", line 132, in __del__
      bindings.cccl_device_reduce_cleanup(ctypes.byref(self.build_result))
                                                       ^^^^^^^^^^^^^^^^^
  AttributeError: '_Reduce' object has no attribute 'build_result'

    warnings.warn(pytest.PytestUnraisableExceptionWarning(msg))

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
============================================================= 1 passed, 93 deselected, 1 warning in 0.44s ==============================================================
```

* Move `copy_cccl_headers_to_cuda_cccl_include()` functionality to `class CustomBuildPy`

* Introduce cuda_cooperative/constraints.txt

* Also add cuda_parallel/constraints.txt

* Add `--constraint constraints.txt` in ci/test_python.sh

* Update Copyright dates

* Switch to https://github.com/ComPWA/taplo-pre-commit (the other repo was archived by its owner on Jul 1, 2024)

For completeness: the other repo took a long time to install into the pre-commit cache; so long that it led to timeouts in the CCCL CI.

* Remove unused cuda_parallel jinja2 dependency (noticed by chance).

* Remove constraints.txt files, advertise running `pip install cuda-cccl` first instead.

* Make cuda_cooperative, cuda_parallel testing completely independent.

* Run only test_python.sh [skip-rapids][skip-matx][skip-docs][skip-vdc]

* Try using another runner (because V100 runners seem to be stuck) [skip-rapids][skip-matx][skip-docs][skip-vdc]

* Fix sign-compare warning (#3408) [skip-rapids][skip-matx][skip-docs][skip-vdc]

* Revert "Try using another runner (because V100 runners seem to be stuck) [skip-rapids][skip-matx][skip-docs][skip-vdc]"

This reverts commit ea33a218ed77a075156cd1b332047202adb25aa2.

Error message: https://github.com/NVIDIA/cccl/pull/3201#issuecomment-2594012971

* Try using A100 runner (because V100 runners still seem to be stuck) [skip-rapids][skip-matx][skip-docs][skip-vdc]

* Also show cuda-cooperative site-packages, cuda-parallel site-packages (after pip install) [skip-rapids][skip-matx][skip-docs][skip-vdc]

* Try using l4 runner (because V100 runners still seem to be stuck) [skip-rapids][skip-matx][skip-docs][skip-vdc]

* Restore original ci/matrix.yaml [skip-rapids]

* Use for loop in test_python.sh to avoid code duplication.

* Run only test_python.sh [skip-rapids][skip-matx][skip-docs][skip-vdc][skip pre-commit.ci]

* Comment out taplo-lint in pre-commit config [skip-rapids][skip-matx][skip-docs][skip-vdc]

* Revert "Run only test_python.sh [skip-rapids][skip-matx][skip-docs][skip-vdc][skip pre-commit.ci]"

This reverts commit ec206fd8b50a6a293e00a5825b579e125010b13d.

* Implement suggestion by @shwina (https://github.com/NVIDIA/cccl/pull/3201#pullrequestreview-2556918460)

* Address feedback by @leofang

---------

Co-authored-by: Bernhard Manfred Gruber <bernhardmgruber@gmail.com>

cuda.parallel: Add optional stream argument to reduce_into() (#3348)

* Add optional stream argument to reduce_into()

* Add tests to check for reduce_into() stream behavior

* Move protocol related utils to separate file and rework __cuda_stream__ error messages

* Fix synchronization issue in stream test and add one more invalid stream test case

* Rename cuda stream validation function after removing leading underscore

* Unpack values from __cuda_stream__ instead of indexing

* Fix linting errors

* Handle TypeError when unpacking invalid __cuda_stream__ return

* Use stream to allocate cupy memory in new stream test
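The validation described in the last few commits can be sketched roughly like this (a hedged reconstruction; the real helper in cuda.parallel lives in a separate protocol-utils module and its error messages differ):

```python
def validate_and_get_stream(stream) -> int:
    """Extract the raw handle from an object implementing the
    __cuda_stream__ protocol, which returns a (version, handle) pair."""
    try:
        cuda_stream = stream.__cuda_stream__
    except AttributeError:
        raise TypeError(
            f"{type(stream).__name__} does not implement __cuda_stream__"
        ) from None
    try:
        # Unpack rather than index; a return value that is not a
        # two-element iterable raises TypeError or ValueError here.
        version, handle = cuda_stream()
    except (TypeError, ValueError):
        raise TypeError(
            "__cuda_stream__ must return a (version, handle) pair"
        ) from None
    if version != 0:
        raise TypeError(f"unsupported __cuda_stream__ version: {version}")
    return handle
```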

Upgrade to actions/deploy-pages@v4 (from v2), as suggested by @leofang (#3434)

Deprecate `cub::{min, max}` and replace internal uses with those from libcu++ (#3419)

* Deprecate `cub::{min, max}` and replace internal uses with those from libcu++

Fixes #3404

Fix CI issues (#3443)

update docs

fix review

restrict allowed types

replace constexpr implementations with generic

optimize `__is_arithmetic_integral`
Successfully merging this pull request may close these issues: Reduce the number of offset types for DeviceMergeSort.