
Limits the number of different offset types for DeviceMergeSort #3328

Merged
elstehle merged 1 commit into NVIDIA:main from elstehle:enh/limit-merge-sort-offset-types
Jan 11, 2025

Conversation


@elstehle elstehle commented Jan 10, 2025

Description

Closes #3312

Benchmark results on H100, comparing the offset types

This summary shows that performance is closely comparable across offset types (each ratio is the given offset type's time relative to the i32 baseline):

MergeSort.keys

      u32/i32 time   u32/i32 time (2^28)   i64/i32 time   i64/i32 time (2^28)   u64/i32 time   u64/i32 time (2^28)
min   97.92%         99.49%                99.53%         99.53%                99.27%         99.48%
max   104.08%        100.89%               105.89%        101.99%               105.78%        102.57%
avg   99.92%         100.09%               100.92%        100.39%               100.72%        100.48%

MergeSort.pairs

      u32/i32 time   u32/i32 time (2^28)   i64/i32 time   i64/i32 time (2^28)   u64/i32 time   u64/i32 time (2^28)
min   97.80%         97.81%                97.04%         97.04%                97.43%         97.43%
max   102.82%        102.58%               103.35%        102.67%               103.59%        102.90%
avg   99.71%         99.92%                100.22%        100.33%               100.05%        100.35%
Detailed benchmark results on H100, sort.pairs

KeyT{ct} ValueT{ct} Elements{io} Entropy  i32 time  u32 time  u32/i32  i64 time  i64/i32  u64 time  u64/i32
I8 I8 2^16 = 65536 1 60.469 62.172 102.82% 62.493 103.35% 62.105 102.71%
I8 I8 2^20 = 1048576 1 171.971 174.273 101.34% 176.582 102.68% 175.85 102.26%
I8 I8 2^24 = 16777216 1 1771 1779 100.45% 1804 101.86% 1799 101.58%
I8 I8 2^28 = 268435456 1 32722 32771 100.15% 33440 102.19% 33347 101.91%
I8 I8 2^16 = 65536 0.201 60.212 61.443 102.04% 61.695 102.46% 61.335 101.87%
I8 I8 2^20 = 1048576 0.201 167.837 168.237 100.24% 170.479 101.57% 169.35 100.90%
I8 I8 2^24 = 16777216 0.201 1689 1694 100.30% 1716 101.60% 1711 101.30%
I8 I8 2^28 = 268435456 0.201 31170 31246 100.24% 31815 102.07% 31729 101.79%
I8 I16 2^16 = 65536 1 61.855 61.386 99.24% 61.909 100.09% 61.53 99.47%
I8 I16 2^20 = 1048576 1 173.999 172.652 99.23% 174.059 100.03% 173.731 99.85%
I8 I16 2^24 = 16777216 1 1760 1760 100.00% 1775 100.85% 1776 100.91%
I8 I16 2^28 = 268435456 1 32091 32118 100.08% 32616 101.64% 32648 101.74%
I8 I16 2^16 = 65536 0.201 60.971 60.476 99.19% 60.963 99.99% 60.6 99.39%
I8 I16 2^20 = 1048576 0.201 166.307 165.372 99.44% 166.748 100.27% 166.414 100.06%
I8 I16 2^24 = 16777216 0.201 1666 1666 100.00% 1679 100.78% 1679 100.78%
I8 I16 2^28 = 268435456 0.201 30519 30543 100.08% 30967 101.47% 30985 101.53%
I8 I32 2^16 = 65536 1 61.401 60.948 99.26% 61.438 100.06% 61.036 99.41%
I8 I32 2^20 = 1048576 1 174.453 173.111 99.23% 174.789 100.19% 174.326 99.93%
I8 I32 2^24 = 16777216 1 1852 1853 100.05% 1839 99.30% 1840 99.35%
I8 I32 2^28 = 268435456 1 34596 34605 100.03% 34390 99.40% 34414 99.47%
I8 I32 2^16 = 65536 0.201 60.35 60.009 99.43% 60.283 99.89% 59.972 99.37%
I8 I32 2^20 = 1048576 0.201 165.483 164.616 99.48% 166.36 100.53% 165.951 100.28%
I8 I32 2^24 = 16777216 0.201 1789 1790 100.06% 1764 98.60% 1764 98.60%
I8 I32 2^28 = 268435456 0.201 33492 33521 100.09% 33152 98.98% 33164 99.02%
I8 I64 2^16 = 65536 1 66.408 66.023 99.42% 66.567 100.24% 66.52 100.17%
I8 I64 2^20 = 1048576 1 205.971 204.717 99.39% 208.454 101.21% 207.996 100.98%
I8 I64 2^24 = 16777216 1 2730 2730 100.00% 2735 100.18% 2734 100.15%
I8 I64 2^28 = 268435456 1 52488 52488 100.00% 52570 100.16% 52567 100.15%
I8 I64 2^16 = 65536 0.201 63.909 63.453 99.29% 64.235 100.51% 64.062 100.24%
I8 I64 2^20 = 1048576 0.201 193.859 193.002 99.56% 196.47 101.35% 195.567 100.88%
I8 I64 2^24 = 16777216 0.201 2688 2688 100.00% 2692 100.15% 2691 100.11%
I8 I64 2^28 = 268435456 0.201 51629 51619 99.98% 51732 100.20% 51691 100.12%
I16 I8 2^16 = 65536 1 65.163 64.677 99.25% 64.94 99.66% 64.662 99.23%
I16 I8 2^20 = 1048576 1 184.111 183.079 99.44% 184.561 100.24% 184.499 100.21%
I16 I8 2^24 = 16777216 1 1885 1881 99.79% 1901 100.85% 1904 101.01%
I16 I8 2^28 = 268435456 1 34856 34760 99.72% 35439 101.67% 35512 101.88%
I16 I8 2^16 = 65536 0.201 64.79 64.413 99.42% 64.6 99.71% 64.361 99.34%
I16 I8 2^20 = 1048576 0.201 177.801 176.825 99.45% 178.261 100.26% 178.001 100.11%
I16 I8 2^24 = 16777216 0.201 1759 1756 99.83% 1778 101.08% 1780 101.19%
I16 I8 2^28 = 268435456 0.201 32009 31963 99.86% 32574 101.77% 32623 101.92%
I16 I16 2^16 = 65536 1 64.217 63.49 98.87% 63.71 99.21% 63.438 98.79%
I16 I16 2^20 = 1048576 1 184.34 181.278 98.34% 181.152 98.27% 180.859 98.11%
I16 I16 2^24 = 16777216 1 1884 1867 99.10% 1890 100.32% 1884 100.00%
I16 I16 2^28 = 268435456 1 34685 34377 99.11% 35091 101.17% 34987 100.87%
I16 I16 2^16 = 65536 0.201 63.897 63.092 98.74% 63.225 98.95% 63.064 98.70%
I16 I16 2^20 = 1048576 0.201 174.862 173.491 99.22% 173.662 99.31% 172.702 98.76%
I16 I16 2^24 = 16777216 0.201 1729 1727 99.88% 1744 100.87% 1739 100.58%
I16 I16 2^28 = 268435456 0.201 31436 31402 99.89% 31889 101.44% 31804 101.17%
I16 I32 2^16 = 65536 1 63.471 63.217 99.60% 62.531 98.52% 62.229 98.04%
I16 I32 2^20 = 1048576 1 183.118 181.636 99.19% 181.413 99.07% 180.83 98.75%
I16 I32 2^24 = 16777216 1 2031 2028 99.85% 2053 101.08% 2051 100.98%
I16 I32 2^28 = 268435456 1 38496 38466 99.92% 39016 101.35% 38993 101.29%
I16 I32 2^16 = 65536 0.201 62.96 62.634 99.48% 62.011 98.49% 61.612 97.86%
I16 I32 2^20 = 1048576 0.201 173.21 172.132 99.38% 172.078 99.35% 171.031 98.74%
I16 I32 2^24 = 16777216 0.201 1972 1970 99.90% 1992 101.01% 1990 100.91%
I16 I32 2^28 = 268435456 0.201 37226 37199 99.93% 37733 101.36% 37715 101.31%
I16 I64 2^16 = 65536 1 70.239 69.664 99.18% 71.071 101.18% 70.808 100.81%
I16 I64 2^20 = 1048576 1 217.62 215.237 98.90% 217.837 100.10% 218.492 100.40%
I16 I64 2^24 = 16777216 1 2983 2976 99.77% 2981 99.93% 2981 99.93%
I16 I64 2^28 = 268435456 1 57573 57385 99.67% 57432 99.76% 57436 99.76%
I16 I64 2^16 = 65536 0.201 68.671 68.079 99.14% 69.468 101.16% 69.094 100.62%
I16 I64 2^20 = 1048576 0.201 202.552 200.335 98.91% 203.229 100.33% 203.869 100.65%
I16 I64 2^24 = 16777216 0.201 2946 2938 99.73% 2943 99.90% 2941 99.83%
I16 I64 2^28 = 268435456 0.201 56736 56548 99.67% 56574 99.71% 56581 99.73%
I32 I8 2^16 = 65536 1 61.947 61.389 99.10% 61.655 99.53% 61.418 99.15%
I32 I8 2^20 = 1048576 1 175.778 174.345 99.18% 175.997 100.12% 175.283 99.72%
I32 I8 2^24 = 16777216 1 1903 1898 99.74% 1912 100.47% 1914 100.58%
I32 I8 2^28 = 268435456 1 35604 35548 99.84% 35876 100.76% 35936 100.93%
I32 I8 2^16 = 65536 0.201 61.927 61.415 99.17% 61.635 99.53% 61.419 99.18%
I32 I8 2^20 = 1048576 0.201 172.177 171.192 99.43% 172.415 100.14% 171.898 99.84%
I32 I8 2^24 = 16777216 0.201 1855 1850 99.73% 1861 100.32% 1862 100.38%
I32 I8 2^28 = 268435456 0.201 34476 34442 99.90% 34621 100.42% 34645 100.49%
I32 I16 2^16 = 65536 1 61.469 61.085 99.38% 61.771 100.49% 61.508 100.06%
I32 I16 2^20 = 1048576 1 175.144 174.042 99.37% 175.55 100.23% 174.548 99.66%
I32 I16 2^24 = 16777216 1 2027 2026 99.95% 2035 100.39% 2033 100.30%
I32 I16 2^28 = 268435456 1 38488 38477 99.97% 38697 100.54% 38681 100.50%
I32 I16 2^16 = 65536 0.201 61.498 61.066 99.30% 61.836 100.55% 61.51 100.02%
I32 I16 2^20 = 1048576 0.201 170.597 169.641 99.44% 170.995 100.23% 169.962 99.63%
I32 I16 2^24 = 16777216 0.201 1988 1986 99.90% 1994 100.30% 1993 100.25%
I32 I16 2^28 = 268435456 0.201 37549 37537 99.97% 37758 100.56% 37746 100.52%
I32 I32 2^16 = 65536 1 61.13 60.775 99.42% 60.869 99.57% 60.534 99.03%
I32 I32 2^20 = 1048576 1 175.796 174.532 99.28% 175.484 99.82% 174.359 99.18%
I32 I32 2^24 = 16777216 1 2319 2317 99.91% 2321 100.09% 2319 100.00%
I32 I32 2^28 = 268435456 1 44815 44809 99.99% 44944 100.29% 44932 100.26%
I32 I32 2^16 = 65536 0.201 61.047 60.693 99.42% 60.666 99.38% 60.457 99.03%
I32 I32 2^20 = 1048576 0.201 169.313 168.314 99.41% 168.892 99.75% 167.805 99.11%
I32 I32 2^24 = 16777216 0.201 2295 2293 99.91% 2295 100.00% 2294 99.96%
I32 I32 2^28 = 268435456 0.201 44265 44260 99.99% 44350 100.19% 44345 100.18%
I32 I64 2^16 = 65536 1 66.24 66.545 100.46% 66.861 100.94% 66.679 100.66%
I32 I64 2^20 = 1048576 1 209.114 205.309 98.18% 214.221 102.44% 213.43 102.06%
I32 I64 2^24 = 16777216 1 3357 3301 98.33% 3352 99.85% 3351 99.82%
I32 I64 2^28 = 268435456 1 65812 64868 98.57% 65826 100.02% 65824 100.02%
I32 I64 2^16 = 65536 0.201 65.834 66.106 100.41% 66.413 100.88% 66.146 100.47%
I32 I64 2^20 = 1048576 0.201 199.061 194.847 97.88% 203.661 102.31% 202.596 101.78%
I32 I64 2^24 = 16777216 0.201 3339 3283 98.32% 3335 99.88% 3334 99.85%
I32 I64 2^28 = 268435456 0.201 65436 64477 98.53% 65417 99.97% 65424 99.98%
I64 I8 2^16 = 65536 1 64.863 64.463 99.38% 65.158 100.45% 64.982 100.18%
I64 I8 2^20 = 1048576 1 208.701 208.781 100.04% 212.329 101.74% 211.39 101.29%
I64 I8 2^24 = 16777216 1 3086 3084 99.94% 3087 100.03% 3086 100.00%
I64 I8 2^28 = 268435456 1 59022 59029 100.01% 59028 100.01% 59030 100.01%
I64 I8 2^16 = 65536 0.201 64.891 64.733 99.76% 65.319 100.66% 64.932 100.06%
I64 I8 2^20 = 1048576 0.201 218.876 218.08 99.64% 221.515 101.21% 220.517 100.75%
I64 I8 2^24 = 16777216 0.201 3101 3100 99.97% 3104 100.10% 3103 100.06%
I64 I8 2^28 = 268435456 0.201 59022 59016 99.99% 59030 100.01% 59026 100.01%
I64 I16 2^16 = 65536 1 66.225 65.579 99.02% 66.251 100.04% 65.835 99.41%
I64 I16 2^20 = 1048576 1 214.779 213.615 99.46% 215.584 100.37% 215.146 100.17%
I64 I16 2^24 = 16777216 1 3342 3341 99.97% 3341 99.97% 3340 99.94%
I64 I16 2^28 = 268435456 1 64248 64251 100.00% 64245 100.00% 64241 99.99%
I64 I16 2^16 = 65536 0.201 66.492 65.944 99.18% 66.418 99.89% 66.09 99.40%
I64 I16 2^20 = 1048576 0.201 225.005 223.872 99.50% 225.115 100.05% 224.705 99.87%
I64 I16 2^24 = 16777216 0.201 3359 3358 99.97% 3358 99.97% 3358 99.97%
I64 I16 2^28 = 268435456 0.201 64230 64228 100.00% 64210 99.97% 64206 99.96%
I64 I32 2^16 = 65536 1 66.853 66.254 99.10% 66.975 100.18% 66.719 99.80%
I64 I32 2^20 = 1048576 1 220.584 219.198 99.37% 219.98 99.73% 219.6 99.55%
I64 I32 2^24 = 16777216 1 3860 3859 99.97% 3859 99.97% 3858 99.95%
I64 I32 2^28 = 268435456 1 74875 74877 100.00% 74871 99.99% 74876 100.00%
I64 I32 2^16 = 65536 0.201 67.2 66.543 99.02% 67.345 100.22% 67.021 99.73%
I64 I32 2^20 = 1048576 0.201 232.095 231.257 99.64% 231.506 99.75% 231.352 99.68%
I64 I32 2^24 = 16777216 0.201 3881 3879 99.95% 3880 99.97% 3879 99.95%
I64 I32 2^28 = 268435456 0.201 74901 74908 100.01% 74894 99.99% 74890 99.99%
I64 I64 2^16 = 65536 1 71.151 71.133 99.97% 71.099 99.93% 70.708 99.38%
I64 I64 2^20 = 1048576 1 254.552 254.578 100.01% 254.541 100.00% 253.679 99.66%
I64 I64 2^24 = 16777216 1 5033 5032 99.98% 5031 99.96% 5031 99.96%
I64 I64 2^28 = 268435456 1 98094 98095 100.00% 98070 99.98% 98081 99.99%
I64 I64 2^16 = 65536 0.201 72.281 72.115 99.77% 72.25 99.96% 71.813 99.35%
I64 I64 2^20 = 1048576 0.201 270.997 270.303 99.74% 270.513 99.82% 269.528 99.46%
I64 I64 2^24 = 16777216 0.201 5062 5061 99.98% 5060 99.96% 5059 99.94%
I64 I64 2^28 = 268435456 0.201 98218 98221 100.00% 98200 99.98% 98203 99.98%
I128 I8 2^16 = 65536 1 74.042 73.617 99.43% 73.573 99.37% 73.07 98.69%
I128 I8 2^20 = 1048576 1 353.016 355.316 100.65% 352.395 99.82% 353.355 100.10%
I128 I8 2^24 = 16777216 1 6046 6045 99.98% 6040 99.90% 6042 99.93%
I128 I8 2^28 = 268435456 1 117026 117024 100.00% 116990 99.97% 116985 99.96%
I128 I8 2^16 = 65536 0.201 74.056 73.824 99.69% 73.732 99.56% 73.319 99.00%
I128 I8 2^20 = 1048576 0.201 349.037 348.581 99.87% 347.431 99.54% 347.867 99.66%
I128 I8 2^24 = 16777216 0.201 6027 6027 100.00% 6022 99.92% 6025 99.97%
I128 I8 2^28 = 268435456 0.201 116271 116257 99.99% 116236 99.97% 116234 99.97%
I128 I16 2^16 = 65536 1 74.519 73.965 99.26% 74.139 99.49% 73.905 99.18%
I128 I16 2^20 = 1048576 1 357.659 357.528 99.96% 356.645 99.72% 356.49 99.67%
I128 I16 2^24 = 16777216 1 6315 6313 99.97% 6310 99.92% 6312 99.95%
I128 I16 2^28 = 268435456 1 122687 122663 99.98% 122692 100.00% 122697 100.01%
I128 I16 2^16 = 65536 0.201 74.603 73.978 99.16% 74.2 99.46% 73.954 99.13%
I128 I16 2^20 = 1048576 0.201 350.734 350.145 99.83% 349.138 99.54% 349.443 99.63%
I128 I16 2^24 = 16777216 0.201 6301 6299 99.97% 6296 99.92% 6300 99.98%
I128 I16 2^28 = 268435456 0.201 121919 121906 99.99% 121915 100.00% 121912 99.99%
I128 I32 2^16 = 65536 1 74.2 73.734 99.37% 74.066 99.82% 73.738 99.38%
I128 I32 2^20 = 1048576 1 367.476 365.169 99.37% 361.483 98.37% 363.991 99.05%
I128 I32 2^24 = 16777216 1 6846 6844 99.97% 6839 99.90% 6842 99.94%
I128 I32 2^28 = 268435456 1 133802 133791 99.99% 133805 100.00% 133806 100.00%
I128 I32 2^16 = 65536 0.201 74.72 73.864 98.85% 74.353 99.51% 73.996 99.03%
I128 I32 2^20 = 1048576 0.201 370.105 367.712 99.35% 366.266 98.96% 367.526 99.30%
I128 I32 2^24 = 16777216 0.201 6836 6835 99.99% 6829 99.90% 6833 99.96%
I128 I32 2^28 = 268435456 0.201 133061 133059 100.00% 133050 99.99% 133048 99.99%
I128 I64 2^16 = 65536 1 76.383 75.9 99.37% 75.937 99.42% 75.408 98.72%
I128 I64 2^20 = 1048576 1 425.401 424.763 99.85% 431.661 101.47% 430.857 101.28%
I128 I64 2^24 = 16777216 1 8006 8004 99.98% 8011 100.06% 8014 100.10%
I128 I64 2^28 = 268435456 1 157048 157035 99.99% 157304 100.16% 157301 100.16%
I128 I64 2^16 = 65536 0.201 76.715 76.214 99.35% 76.234 99.37% 75.665 98.63%
I128 I64 2^20 = 1048576 0.201 425.809 425.006 99.81% 431.345 101.30% 430.825 101.18%
I128 I64 2^24 = 16777216 0.201 8003 8001 99.98% 8007 100.05% 8014 100.14%
I128 I64 2^28 = 268435456 0.201 156339 156336 100.00% 156632 100.19% 156629 100.19%
F32 I8 2^16 = 65536 1 61.645 61.308 99.45% 61.502 99.77% 61.263 99.38%
F32 I8 2^20 = 1048576 1 175.196 174.044 99.34% 175.723 100.30% 175.181 99.99%
F32 I8 2^24 = 16777216 1 1908 1887 98.90% 1899 99.53% 1902 99.69%
F32 I8 2^28 = 268435456 1 35687 35344 99.04% 35661 99.93% 35717 100.08%
F32 I8 2^16 = 65536 0.201 61.455 61.07 99.37% 61.387 99.89% 61.065 99.37%
F32 I8 2^20 = 1048576 0.201 171.073 169.932 99.33% 171.781 100.41% 171.377 100.18%
F32 I8 2^24 = 16777216 0.201 1860 1840 98.92% 1849 99.41% 1850 99.46%
F32 I8 2^28 = 268435456 0.201 34580 34248 99.04% 34416 99.53% 34443 99.60%
F32 I16 2^16 = 65536 1 61.194 60.517 98.89% 60.793 99.34% 60.452 98.79%
F32 I16 2^20 = 1048576 1 174.5 173.16 99.23% 174.605 100.06% 173.537 99.45%
F32 I16 2^24 = 16777216 1 2031 2030 99.95% 2039 100.39% 2037 100.30%
F32 I16 2^28 = 268435456 1 38549 38540 99.98% 38774 100.58% 38754 100.53%
F32 I16 2^16 = 65536 0.201 60.885 60.323 99.08% 60.618 99.56% 60.289 99.02%
F32 I16 2^20 = 1048576 0.201 168.84 167.654 99.30% 169.738 100.53% 168.72 99.93%
F32 I16 2^24 = 16777216 0.201 1992 1990 99.90% 1999 100.35% 1997 100.25%
F32 I16 2^28 = 268435456 0.201 37621 37613 99.98% 37840 100.58% 37826 100.54%
F32 I32 2^16 = 65536 1 60.455 59.969 99.20% 59.916 99.11% 59.537 98.48%
F32 I32 2^20 = 1048576 1 175.024 173.82 99.31% 174.909 99.93% 174.111 99.48%
F32 I32 2^24 = 16777216 1 2333 2330 99.87% 2344 100.47% 2343 100.43%
F32 I32 2^28 = 268435456 1 45065 45047 99.96% 45411 100.77% 45402 100.75%
F32 I32 2^16 = 65536 0.201 60.113 59.565 99.09% 59.451 98.90% 59.264 98.59%
F32 I32 2^20 = 1048576 0.201 168.517 167.118 99.17% 168.938 100.25% 167.901 99.63%
F32 I32 2^24 = 16777216 0.201 2310 2309 99.96% 2321 100.48% 2319 100.39%
F32 I32 2^28 = 268435456 0.201 44552 44541 99.98% 44867 100.71% 44864 100.70%
F32 I64 2^16 = 65536 1 66.874 66.612 99.61% 67.718 101.26% 66.189 98.98%
F32 I64 2^20 = 1048576 1 212.644 211.724 99.57% 214.598 100.92% 214.046 100.66%
F32 I64 2^24 = 16777216 1 3364 3364 100.00% 3366 100.06% 3363 99.97%
F32 I64 2^28 = 268435456 1 65938 65943 100.01% 65961 100.03% 65952 100.02%
F32 I64 2^16 = 65536 0.201 65.951 65.677 99.58% 66.786 101.27% 65.505 99.32%
F32 I64 2^20 = 1048576 0.201 202.944 201.573 99.32% 204.992 101.01% 204.932 100.98%
F32 I64 2^24 = 16777216 0.201 3348 3348 100.00% 3350 100.06% 3347 99.97%
F32 I64 2^28 = 268435456 0.201 65600 65587 99.98% 65597 100.00% 65598 100.00%
F64 I8 2^16 = 65536 1 64.447 63.862 99.09% 64.245 99.69% 64.023 99.34%
F64 I8 2^20 = 1048576 1 208.481 207.645 99.60% 209.471 100.47% 209.217 100.35%
F64 I8 2^24 = 16777216 1 3085 3086 100.03% 3087 100.06% 3086 100.03%
F64 I8 2^28 = 268435456 1 59019 59012 99.99% 59044 100.04% 59043 100.04%
F64 I8 2^16 = 65536 0.201 64.441 63.899 99.16% 64.396 99.93% 64.172 99.58%
F64 I8 2^20 = 1048576 0.201 216.68 215.993 99.68% 217.89 100.56% 217.125 100.21%
F64 I8 2^24 = 16777216 0.201 3093 3094 100.03% 3097 100.13% 3096 100.10%
F64 I8 2^28 = 268435456 0.201 58948 58937 99.98% 58871 99.87% 58981 100.06%
F64 I16 2^16 = 65536 1 64.463 63.961 99.22% 64.529 100.10% 64.143 99.50%
F64 I16 2^20 = 1048576 1 214.411 213.369 99.51% 214.533 100.06% 213.797 99.71%
F64 I16 2^24 = 16777216 1 3342 3343 100.03% 3342 100.00% 3340 99.94%
F64 I16 2^28 = 268435456 1 64286 64286 100.00% 64270 99.98% 64280 99.99%
F64 I16 2^16 = 65536 0.201 64.767 64.209 99.14% 64.606 99.75% 64.232 99.17%
F64 I16 2^20 = 1048576 0.201 222.601 222.166 99.80% 223.098 100.22% 222.373 99.90%
F64 I16 2^24 = 16777216 0.201 3352 3352 100.00% 3351 99.97% 3349 99.91%
F64 I16 2^28 = 268435456 0.201 64176 64184 100.01% 64162 99.98% 64158 99.97%
F64 I32 2^16 = 65536 1 65.066 64.664 99.38% 64.975 99.86% 64.712 99.46%
F64 I32 2^20 = 1048576 1 219.589 217.902 99.23% 220.23 100.29% 217.873 99.22%
F64 I32 2^24 = 16777216 1 3861 3861 100.00% 3859 99.95% 3858 99.92%
F64 I32 2^28 = 268435456 1 74904 74910 100.01% 74905 100.00% 74903 100.00%
F64 I32 2^16 = 65536 0.201 65.568 65.294 99.58% 65.389 99.73% 65.06 99.23%
F64 I32 2^20 = 1048576 0.201 228.72 227.963 99.67% 229.429 100.31% 227.545 99.49%
F64 I32 2^24 = 16777216 0.201 3870 3870 100.00% 3868 99.95% 3867 99.92%
F64 I32 2^28 = 268435456 0.201 74796 74798 100.00% 74786 99.99% 74785 99.99%
F64 I64 2^16 = 65536 1 69.9 69.33 99.18% 69.681 99.69% 69.377 99.25%
F64 I64 2^20 = 1048576 1 255.402 254.788 99.76% 253.73 99.35% 253.354 99.20%
F64 I64 2^24 = 16777216 1 5034 5034 100.00% 5033 99.98% 5031 99.94%
F64 I64 2^28 = 268435456 1 98104 98123 100.02% 98097 99.99% 98097 99.99%
F64 I64 2^16 = 65536 0.201 70.734 70.121 99.13% 70.501 99.67% 70.155 99.18%
F64 I64 2^20 = 1048576 0.201 267.049 266.483 99.79% 265.703 99.50% 264.712 99.12%
F64 I64 2^24 = 16777216 0.201 5045 5045 100.00% 5042 99.94% 5041 99.92%
F64 I64 2^28 = 268435456 0.201 98013 98028 100.02% 98006 99.99% 98006 99.99%
C64 I8 2^16 = 65536 1 208.085 207.977 99.95% 208.066 99.99% 208.453 100.18%
C64 I8 2^20 = 1048576 1 564.478 564.301 99.97% 566.759 100.40% 565.851 100.24%
C64 I8 2^24 = 16777216 1 6822 6830 100.12% 6841 100.28% 6839 100.25%
C64 I8 2^28 = 268435456 1 137348 137496 100.11% 138170 100.60% 137773 100.31%
C64 I8 2^16 = 65536 0.201 322.449 322.035 99.87% 323.104 100.20% 322.604 100.05%
C64 I8 2^20 = 1048576 0.201 879.585 880.189 100.07% 884.521 100.56% 880.899 100.15%
C64 I8 2^24 = 16777216 0.201 12408 12452 100.35% 12516 100.87% 12508 100.81%
C64 I8 2^28 = 268435456 0.201 208819 209601 100.37% 210779 100.94% 210539 100.82%
C64 I16 2^16 = 65536 1 208.341 207.552 99.62% 209.28 100.45% 209.197 100.41%
C64 I16 2^20 = 1048576 1 567.824 566.687 99.80% 569.81 100.35% 571.645 100.67%
C64 I16 2^24 = 16777216 1 6932 6905 99.61% 6920 99.83% 6990 100.84%
C64 I16 2^28 = 268435456 1 139995 138764 99.12% 138841 99.18% 139508 99.65%
C64 I16 2^16 = 65536 0.201 324.128 323.379 99.77% 325.945 100.56% 325.308 100.36%
C64 I16 2^20 = 1048576 0.201 887.645 887.647 100.00% 895.427 100.88% 895.422 100.88%
C64 I16 2^24 = 16777216 0.201 12369 12557 101.52% 12694 102.63% 12722 102.85%
C64 I16 2^28 = 268435456 0.201 208973 211903 101.40% 214552 102.67% 215031 102.90%
C64 I32 2^16 = 65536 1 209.227 208.649 99.72% 209.864 100.30% 209.279 100.02%
C64 I32 2^20 = 1048576 1 575.243 573.046 99.62% 574.36 99.85% 577.525 100.40%
C64 I32 2^24 = 16777216 1 7046 7007 99.45% 7025 99.70% 7050 100.06%
C64 I32 2^28 = 268435456 1 140066 140054 99.99% 140284 100.16% 140810 100.53%
C64 I32 2^16 = 65536 0.201 323.872 322.755 99.66% 325.765 100.58% 324.742 100.27%
C64 I32 2^20 = 1048576 0.201 897.381 891.446 99.34% 896.875 99.94% 897.256 99.99%
C64 I32 2^24 = 16777216 0.201 12812 12530 97.80% 12720 99.28% 12753 99.54%
C64 I32 2^28 = 268435456 0.201 217276 212511 97.81% 215843 99.34% 216371 99.58%
C64 I64 2^16 = 65536 1 215.955 214.916 99.52% 215.9 99.97% 215.998 100.02%
C64 I64 2^20 = 1048576 1 603.612 605.543 100.32% 610.727 101.18% 608.608 100.83%
C64 I64 2^24 = 16777216 1 7414 7451 100.50% 7638 103.02% 7680 103.59%
C64 I64 2^28 = 268435456 1 147588 147841 100.17% 150052 101.67% 150657 102.08%
C64 I64 2^16 = 65536 0.201 333.665 332.798 99.74% 334.537 100.26% 333.773 100.03%
C64 I64 2^20 = 1048576 0.201 922.161 923.434 100.14% 939.801 101.91% 940.523 101.99%
C64 I64 2^24 = 16777216 0.201 12926 13268 102.65% 12616 97.60% 12674 98.05%
C64 I64 2^28 = 268435456 0.201 221625 227338 102.58% 215071 97.04% 215922 97.43%
Detailed benchmark results on H100, sort.keys


T{ct} Elements{io} Entropy  i32 time  u32 time  u32/i32  i64 time  i64/i32  u64 time  u64/i32
I8 2^16 = 65536 1   49.911 51.946 104.08% 52.852 105.89% 52.794 105.78%
I8 2^20 = 1048576 1   122.972 125.387 101.96% 127.503 103.68% 126.937 103.22%
I8 2^24 = 16777216 1   840.934 848.612 100.91% 855.068 101.68% 854.571 101.62%
I8 2^28 = 268435456 1   14776 14768 99.95% 14944 101.14% 14943 101.13%
I8 2^16 = 65536 0.201   49.736 50.945 102.43% 52.108 104.77% 51.591 103.73%
I8 2^20 = 1048576 0.201   119.974 121.815 101.53% 124.746 103.98% 124.05 103.40%
I8 2^24 = 16777216 0.201   794.997 799.318 100.54% 806.67 101.47% 805.044 101.26%
I8 2^28 = 268435456 0.201   13737 13750 100.09% 13936 101.45% 13930 101.40%
I16 2^16 = 65536 1   54.43 54.124 99.44% 54.953 100.96% 54.827 100.73%
I16 2^20 = 1048576 1   132.44 132.352 99.93% 134.235 101.36% 134.9 101.86%
I16 2^24 = 16777216 1   970.768 972.345 100.16% 978.736 100.82% 979.577 100.91%
I16 2^28 = 268435456 1   17360 17514 100.89% 17705 101.99% 17807 102.57%
I16 2^16 = 65536 0.201   54.307 53.989 99.41% 55.017 101.31% 54.794 100.90%
I16 2^20 = 1048576 0.201   129.126 127.942 99.08% 131.474 101.82% 131.622 101.93%
I16 2^24 = 16777216 0.201   914.877 914.914 100.00% 923.232 100.91% 921.525 100.73%
I16 2^28 = 268435456 0.201   15542 15568 100.17% 15637 100.61% 15633 100.59%
I32 2^16 = 65536 1   52.46 51.828 98.80% 52.83 100.71% 52.601 100.27%
I32 2^20 = 1048576 1   127.595 126.857 99.42% 129.407 101.42% 128.511 100.72%
I32 2^24 = 16777216 1   1278 1277 99.92% 1280 100.16% 1279 100.08%
I32 2^28 = 268435456 1   23714 23718 100.02% 23733 100.08% 23742 100.12%
I32 2^16 = 65536 0.201   52.606 52.014 98.87% 52.912 100.58% 52.794 100.36%
I32 2^20 = 1048576 0.201   125.541 123.857 98.66% 126.949 101.12% 125.819 100.22%
I32 2^24 = 16777216 0.201   1266 1265 99.92% 1268 100.16% 1266 100.00%
I32 2^28 = 268435456 0.201   22969 22968 100.00% 22965 99.98% 22963 99.97%
I64 2^16 = 65536 1   60.702 59.993 98.83% 61.128 100.70% 60.705 100.00%
I64 2^20 = 1048576 1   185.814 184.984 99.55% 186.317 100.27% 185.768 99.98%
I64 2^24 = 16777216 1   2749 2748 99.96% 2752 100.11% 2751 100.07%
I64 2^28 = 268435456 1   52944 52674 99.49% 52693 99.53% 52667 99.48%
I64 2^16 = 65536 0.201   61.672 60.391 97.92% 61.72 100.08% 61.219 99.27%
I64 2^20 = 1048576 0.201   196.116 193.851 98.85% 196.456 100.17% 195.955 99.92%
I64 2^24 = 16777216 0.201   2770 2761 99.68% 2764 99.78% 2763 99.75%
I64 2^28 = 268435456 0.201   52704 52595 99.79% 52607 99.82% 52601 99.80%
I128 2^16 = 65536 1   71.854 71.305 99.24% 72.619 101.06% 72.313 100.64%
I128 2^20 = 1048576 1   329.35 331.707 100.72% 333.206 101.17% 332.985 101.10%
I128 2^24 = 16777216 1   5718 5715 99.95% 5714 99.93% 5713 99.91%
I128 2^28 = 268435456 1   110798 110770 99.97% 110748 99.95% 110743 99.95%
I128 2^16 = 65536 0.201   71.974 71.32 99.09% 72.707 101.02% 72.353 100.53%
I128 2^20 = 1048576 0.201   322.19 324.03 100.57% 325.3 100.97% 324.488 100.71%
I128 2^24 = 16777216 0.201   5695 5694 99.98% 5693 99.96% 5693 99.96%
I128 2^28 = 268435456 0.201   110039 110018 99.98% 110005 99.97% 109998 99.96%
F32 2^16 = 65536 1   52.28 52.001 99.47% 53.143 101.65% 52.91 101.21%
F32 2^20 = 1048576 1   127.809 127.204 99.53% 130.39 102.02% 129.863 101.61%
F32 2^24 = 16777216 1   1276 1278 100.16% 1281 100.39% 1280 100.31%
F32 2^28 = 268435456 1   23569 23562 99.97% 23576 100.03% 23571 100.01%
F32 2^16 = 65536 0.201   52.585 52.033 98.95% 53.153 101.08% 52.988 100.77%
F32 2^20 = 1048576 0.201   125.9 124.395 98.80% 128.203 101.83% 127.677 101.41%
F32 2^24 = 16777216 0.201   1267 1267 100.00% 1269 100.16% 1268 100.08%
F32 2^28 = 268435456 0.201   23005 23004 100.00% 23003 99.99% 23002 99.99%
F64 2^16 = 65536 1   60.225 59.26 98.40% 60.89 101.10% 60.48 100.42%
F64 2^20 = 1048576 1   185.706 185.101 99.67% 187.54 100.99% 186.721 100.55%
F64 2^24 = 16777216 1   2756 2756 100.00% 2759 100.11% 2757 100.04%
F64 2^28 = 268435456 1   52719 52732 100.02% 52744 100.05% 52743 100.05%
F64 2^16 = 65536 0.201   60.839 59.914 98.48% 61.486 101.06% 61.011 100.28%
F64 2^20 = 1048576 0.201   193.126 192.901 99.88% 195.149 101.05% 194.363 100.64%
F64 2^24 = 16777216 0.201   2759 2759 100.00% 2762 100.11% 2761 100.07%
F64 2^28 = 268435456 0.201   52613 52618 100.01% 52638 100.05% 52633 100.04%
C64 2^16 = 65536 1   201.379 201.49 100.06% 202.386 100.50% 202.198 100.41%
C64 2^20 = 1048576 1   471.037 470.731 99.94% 475.033 100.85% 473.644 100.55%
C64 2^24 = 16777216 1   6300 6314 100.22% 6300 100.00% 6324 100.38%
C64 2^28 = 268435456 1   126559 127456 100.71% 128288 101.37% 129571 102.38%
C64 2^16 = 65536 0.201   319.445 319.099 99.89% 320.483 100.32% 318.318 99.65%
C64 2^20 = 1048576 0.201   794.535 796.631 100.26% 801.399 100.86% 797.999 100.44%
C64 2^24 = 16777216 0.201   12412 12450 100.31% 12447 100.28% 12434 100.18%
C64 2^28 = 268435456 0.201   209144 209901 100.36% 209664 100.25% 209672 100.25%

Checklist

  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@elstehle elstehle requested review from a team as code owners January 10, 2025 16:40
@elstehle elstehle mentioned this pull request Jan 10, 2025
@github-actions

🟩 CI finished in 1h 29m: Pass: 100%/96 | Total: 1d 19h | Avg: 27m 02s | Max: 1h 00m | Hits: 403%/15012
  • 🟩 cub: Pass: 100%/47 | Total: 1d 06h | Avg: 38m 54s | Max: 1h 00m | Hits: 569%/3900

    🟩 cpu
      🟩 amd64              Pass: 100%/45  | Total:  1d 05h | Avg: 38m 40s | Max:  1h 00m | Hits: 569%/3900  
      🟩 arm64              Pass: 100%/2   | Total:  1h 28m | Avg: 44m 22s | Max: 45m 17s
    🟩 ctk
      🟩 12.0               Pass: 100%/8   | Total:  5h 42m | Avg: 42m 46s | Max: 55m 12s | Hits: 569%/1560  
      🟩 12.5               Pass: 100%/2   | Total:  1h 28m | Avg: 44m 03s | Max: 46m 10s
      🟩 12.6               Pass: 100%/37  | Total: 23h 18m | Avg: 37m 48s | Max:  1h 00m | Hits: 569%/2340  
    🟩 cudacxx
      🟩 ClangCUDA18        Pass: 100%/2   | Total:  1h 46m | Avg: 53m 01s | Max: 54m 47s
      🟩 nvcc12.0           Pass: 100%/8   | Total:  5h 42m | Avg: 42m 46s | Max: 55m 12s | Hits: 569%/1560  
      🟩 nvcc12.5           Pass: 100%/2   | Total:  1h 28m | Avg: 44m 03s | Max: 46m 10s
      🟩 nvcc12.6           Pass: 100%/35  | Total: 21h 32m | Avg: 36m 56s | Max:  1h 00m | Hits: 569%/2340  
    🟩 cudacxx_family
      🟩 ClangCUDA          Pass: 100%/2   | Total:  1h 46m | Avg: 53m 01s | Max: 54m 47s
      🟩 nvcc               Pass: 100%/45  | Total:  1d 04h | Avg: 38m 17s | Max:  1h 00m | Hits: 569%/3900  
    🟩 cxx
      🟩 Clang9             Pass: 100%/4   | Total:  2h 35m | Avg: 38m 54s | Max: 40m 26s
      🟩 Clang10            Pass: 100%/1   | Total: 41m 48s | Avg: 41m 48s | Max: 41m 48s
      🟩 Clang11            Pass: 100%/1   | Total: 40m 16s | Avg: 40m 16s | Max: 40m 16s
      🟩 Clang12            Pass: 100%/1   | Total: 37m 53s | Avg: 37m 53s | Max: 37m 53s
      🟩 Clang13            Pass: 100%/1   | Total: 37m 47s | Avg: 37m 47s | Max: 37m 47s
      🟩 Clang14            Pass: 100%/1   | Total: 37m 00s | Avg: 37m 00s | Max: 37m 00s
      🟩 Clang15            Pass: 100%/1   | Total: 38m 08s | Avg: 38m 08s | Max: 38m 08s
      🟩 Clang16            Pass: 100%/1   | Total: 38m 47s | Avg: 38m 47s | Max: 38m 47s
      🟩 Clang17            Pass: 100%/1   | Total: 38m 07s | Avg: 38m 07s | Max: 38m 07s
      🟩 Clang18            Pass: 100%/7   | Total:  4h 48m | Avg: 41m 16s | Max: 54m 47s
      🟩 GCC7               Pass: 100%/4   | Total:  2h 37m | Avg: 39m 22s | Max: 41m 10s
      🟩 GCC8               Pass: 100%/1   | Total: 41m 15s | Avg: 41m 15s | Max: 41m 15s
      🟩 GCC9               Pass: 100%/3   | Total:  2h 02m | Avg: 40m 44s | Max: 43m 13s
      🟩 GCC10              Pass: 100%/1   | Total: 42m 46s | Avg: 42m 46s | Max: 42m 46s
      🟩 GCC11              Pass: 100%/1   | Total: 42m 42s | Avg: 42m 42s | Max: 42m 42s
      🟩 GCC12              Pass: 100%/3   | Total:  1h 12m | Avg: 24m 15s | Max: 40m 13s
      🟩 GCC13              Pass: 100%/8   | Total:  3h 49m | Avg: 28m 38s | Max: 43m 27s
      🟩 MSVC14.29          Pass: 100%/3   | Total:  2h 46m | Avg: 55m 30s | Max:  1h 00m | Hits: 569%/2340  
      🟩 MSVC14.39          Pass: 100%/2   | Total:  1h 51m | Avg: 55m 54s | Max: 56m 43s | Hits: 569%/1560  
      🟩 NVHPC24.7          Pass: 100%/2   | Total:  1h 28m | Avg: 44m 03s | Max: 46m 10s
    🟩 cxx_family
      🟩 Clang              Pass: 100%/19  | Total: 12h 34m | Avg: 39m 42s | Max: 54m 47s
      🟩 GCC                Pass: 100%/21  | Total: 11h 48m | Avg: 33m 43s | Max: 43m 27s
      🟩 MSVC               Pass: 100%/5   | Total:  4h 38m | Avg: 55m 40s | Max:  1h 00m | Hits: 569%/3900  
      🟩 NVHPC              Pass: 100%/2   | Total:  1h 28m | Avg: 44m 03s | Max: 46m 10s
    🟩 gpu
      🟩 h100               Pass: 100%/2   | Total: 32m 32s | Avg: 16m 16s | Max: 16m 29s
      🟩 v100               Pass: 100%/45  | Total:  1d 05h | Avg: 39m 55s | Max:  1h 00m | Hits: 569%/3900  
    🟩 jobs
      🟩 Build              Pass: 100%/40  | Total:  1d 03h | Avg: 41m 40s | Max:  1h 00m | Hits: 569%/3900  
      🟩 DeviceLaunch       Pass: 100%/1   | Total: 23m 11s | Avg: 23m 11s | Max: 23m 11s
      🟩 GraphCapture       Pass: 100%/1   | Total: 15m 05s | Avg: 15m 05s | Max: 15m 05s
      🟩 HostLaunch         Pass: 100%/3   | Total:  1h 04m | Avg: 21m 37s | Max: 24m 43s
      🟩 TestGPU            Pass: 100%/2   | Total: 59m 01s | Avg: 29m 30s | Max: 31m 53s
    🟩 sm
      🟩 90                 Pass: 100%/2   | Total: 32m 32s | Avg: 16m 16s | Max: 16m 29s
      🟩 90a                Pass: 100%/1   | Total: 16m 23s | Avg: 16m 23s | Max: 16m 23s
    🟩 std
      🟩 11                 Pass: 100%/5   | Total:  3h 14m | Avg: 38m 55s | Max: 39m 28s
      🟩 14                 Pass: 100%/3   | Total:  2h 09m | Avg: 43m 10s | Max: 51m 06s | Hits: 569%/780   
      🟩 17                 Pass: 100%/13  | Total:  9h 57m | Avg: 45m 57s | Max:  1h 00m | Hits: 569%/2340  
      🟩 20                 Pass: 100%/26  | Total: 15h 07m | Avg: 34m 53s | Max: 56m 43s | Hits: 569%/780   
    
  • 🟩 thrust: Pass: 100%/46 | Total: 12h 10m | Avg: 15m 53s | Max: 40m 15s | Hits: 345%/11112

    🟩 cmake_options
      🟩 -DTHRUST_DISPATCH_TYPE=Force32bit Pass: 100%/2   | Total: 23m 31s | Avg: 11m 45s | Max: 11m 51s
    🟩 cpu
      🟩 amd64              Pass: 100%/44  | Total: 11h 47m | Avg: 16m 04s | Max: 40m 15s | Hits: 345%/11112 
      🟩 arm64              Pass: 100%/2   | Total: 23m 18s | Avg: 11m 39s | Max: 12m 07s
    🟩 ctk
      🟩 12.0               Pass: 100%/8   | Total:  2h 19m | Avg: 17m 24s | Max: 36m 55s | Hits: 340%/3704  
      🟩 12.5               Pass: 100%/2   | Total: 49m 34s | Avg: 24m 47s | Max: 25m 17s
      🟩 12.6               Pass: 100%/36  | Total:  9h 02m | Avg: 15m 03s | Max: 40m 15s | Hits: 347%/7408  
    🟩 cudacxx
      🟩 ClangCUDA18        Pass: 100%/2   | Total: 23m 26s | Avg: 11m 43s | Max: 12m 10s
      🟩 nvcc12.0           Pass: 100%/8   | Total:  2h 19m | Avg: 17m 24s | Max: 36m 55s | Hits: 340%/3704  
      🟩 nvcc12.5           Pass: 100%/2   | Total: 49m 34s | Avg: 24m 47s | Max: 25m 17s
      🟩 nvcc12.6           Pass: 100%/34  | Total:  8h 38m | Avg: 15m 15s | Max: 40m 15s | Hits: 347%/7408  
    🟩 cudacxx_family
      🟩 ClangCUDA          Pass: 100%/2   | Total: 23m 26s | Avg: 11m 43s | Max: 12m 10s
      🟩 nvcc               Pass: 100%/44  | Total: 11h 47m | Avg: 16m 04s | Max: 40m 15s | Hits: 345%/11112 
    🟩 cxx
      🟩 Clang9             Pass: 100%/4   | Total: 48m 34s | Avg: 12m 08s | Max: 12m 50s
      🟩 Clang10            Pass: 100%/1   | Total: 13m 11s | Avg: 13m 11s | Max: 13m 11s
      🟩 Clang11            Pass: 100%/1   | Total: 12m 14s | Avg: 12m 14s | Max: 12m 14s
      🟩 Clang12            Pass: 100%/1   | Total: 12m 51s | Avg: 12m 51s | Max: 12m 51s
      🟩 Clang13            Pass: 100%/1   | Total: 13m 39s | Avg: 13m 39s | Max: 13m 39s
      🟩 Clang14            Pass: 100%/1   | Total: 13m 24s | Avg: 13m 24s | Max: 13m 24s
      🟩 Clang15            Pass: 100%/1   | Total: 13m 22s | Avg: 13m 22s | Max: 13m 22s
      🟩 Clang16            Pass: 100%/1   | Total: 12m 18s | Avg: 12m 18s | Max: 12m 18s
      🟩 Clang17            Pass: 100%/1   | Total: 12m 19s | Avg: 12m 19s | Max: 12m 19s
      🟩 Clang18            Pass: 100%/7   | Total:  1h 22m | Avg: 11m 46s | Max: 14m 50s
      🟩 GCC7               Pass: 100%/4   | Total: 44m 27s | Avg: 11m 06s | Max: 12m 05s
      🟩 GCC8               Pass: 100%/1   | Total: 12m 23s | Avg: 12m 23s | Max: 12m 23s
      🟩 GCC9               Pass: 100%/3   | Total: 35m 38s | Avg: 11m 52s | Max: 12m 24s
      🟩 GCC10              Pass: 100%/1   | Total: 12m 48s | Avg: 12m 48s | Max: 12m 48s
      🟩 GCC11              Pass: 100%/1   | Total: 13m 39s | Avg: 13m 39s | Max: 13m 39s
      🟩 GCC12              Pass: 100%/1   | Total: 14m 33s | Avg: 14m 33s | Max: 14m 33s
      🟩 GCC13              Pass: 100%/8   | Total:  1h 33m | Avg: 11m 44s | Max: 15m 53s
      🟩 MSVC14.29          Pass: 100%/3   | Total:  1h 44m | Avg: 34m 46s | Max: 36m 55s | Hits: 340%/5556  
      🟩 MSVC14.39          Pass: 100%/3   | Total:  1h 55m | Avg: 38m 23s | Max: 40m 15s | Hits: 349%/5556  
      🟩 NVHPC24.7          Pass: 100%/2   | Total: 49m 34s | Avg: 24m 47s | Max: 25m 17s
    🟩 cxx_family
      🟩 Clang              Pass: 100%/19  | Total:  3h 54m | Avg: 12m 20s | Max: 14m 50s
      🟩 GCC                Pass: 100%/19  | Total:  3h 47m | Avg: 11m 58s | Max: 15m 53s
      🟩 MSVC               Pass: 100%/6   | Total:  3h 39m | Avg: 36m 34s | Max: 40m 15s | Hits: 345%/11112 
      🟩 NVHPC              Pass: 100%/2   | Total: 49m 34s | Avg: 24m 47s | Max: 25m 17s
    🟩 gpu
      🟩 v100               Pass: 100%/46  | Total: 12h 10m | Avg: 15m 53s | Max: 40m 15s | Hits: 345%/11112 
    🟩 jobs
      🟩 Build              Pass: 100%/40  | Total: 10h 35m | Avg: 15m 53s | Max: 40m 15s | Hits: 340%/9260  
      🟩 TestCPU            Pass: 100%/3   | Total: 52m 57s | Avg: 17m 39s | Max: 37m 13s | Hits: 365%/1852  
      🟩 TestGPU            Pass: 100%/3   | Total: 42m 23s | Avg: 14m 07s | Max: 15m 53s
    🟩 sm
      🟩 90a                Pass: 100%/1   | Total:  7m 39s | Avg:  7m 39s | Max:  7m 39s
    🟩 std
      🟩 11                 Pass: 100%/5   | Total: 55m 45s | Avg: 11m 09s | Max: 12m 08s
      🟩 14                 Pass: 100%/3   | Total: 56m 51s | Avg: 18m 57s | Max: 32m 32s | Hits: 340%/1852  
      🟩 17                 Pass: 100%/13  | Total:  4h 07m | Avg: 19m 00s | Max: 37m 43s | Hits: 340%/5556  
      🟩 20                 Pass: 100%/23  | Total:  5h 47m | Avg: 15m 06s | Max: 40m 15s | Hits: 353%/3704  
    
  • 🟩 cccl_c_parallel: Pass: 100%/2 | Total: 8m 54s | Avg: 4m 27s | Max: 6m 57s

    🟩 cpu
      🟩 amd64              Pass: 100%/2   | Total:  8m 54s | Avg:  4m 27s | Max:  6m 57s
    🟩 ctk
      🟩 12.6               Pass: 100%/2   | Total:  8m 54s | Avg:  4m 27s | Max:  6m 57s
    🟩 cudacxx
      🟩 nvcc12.6           Pass: 100%/2   | Total:  8m 54s | Avg:  4m 27s | Max:  6m 57s
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/2   | Total:  8m 54s | Avg:  4m 27s | Max:  6m 57s
    🟩 cxx
      🟩 GCC13              Pass: 100%/2   | Total:  8m 54s | Avg:  4m 27s | Max:  6m 57s
    🟩 cxx_family
      🟩 GCC                Pass: 100%/2   | Total:  8m 54s | Avg:  4m 27s | Max:  6m 57s
    🟩 gpu
      🟩 v100               Pass: 100%/2   | Total:  8m 54s | Avg:  4m 27s | Max:  6m 57s
    🟩 jobs
      🟩 Build              Pass: 100%/1   | Total:  1m 57s | Avg:  1m 57s | Max:  1m 57s
      🟩 Test               Pass: 100%/1   | Total:  6m 57s | Avg:  6m 57s | Max:  6m 57s
    
  • 🟩 python: Pass: 100%/1 | Total: 27m 13s | Avg: 27m 13s | Max: 27m 13s

    🟩 cpu
      🟩 amd64              Pass: 100%/1   | Total: 27m 13s | Avg: 27m 13s | Max: 27m 13s
    🟩 ctk
      🟩 12.6               Pass: 100%/1   | Total: 27m 13s | Avg: 27m 13s | Max: 27m 13s
    🟩 cudacxx
      🟩 nvcc12.6           Pass: 100%/1   | Total: 27m 13s | Avg: 27m 13s | Max: 27m 13s
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/1   | Total: 27m 13s | Avg: 27m 13s | Max: 27m 13s
    🟩 cxx
      🟩 GCC13              Pass: 100%/1   | Total: 27m 13s | Avg: 27m 13s | Max: 27m 13s
    🟩 cxx_family
      🟩 GCC                Pass: 100%/1   | Total: 27m 13s | Avg: 27m 13s | Max: 27m 13s
    🟩 gpu
      🟩 v100               Pass: 100%/1   | Total: 27m 13s | Avg: 27m 13s | Max: 27m 13s
    🟩 jobs
      🟩 Test               Pass: 100%/1   | Total: 27m 13s | Avg: 27m 13s | Max: 27m 13s
    

👃 Inspect Changes

Modifications in project?

Project
CCCL Infrastructure
libcu++
+/- CUB
Thrust
CUDA Experimental
python
CCCL C Parallel Library
Catch2Helper

Modifications in project or dependencies?

Project
CCCL Infrastructure
libcu++
+/- CUB
+/- Thrust
CUDA Experimental
+/- python
+/- CCCL C Parallel Library
+/- Catch2Helper

🏃‍ Runner counts (total jobs: 96)

# Runner
69 linux-amd64-cpu16
11 linux-amd64-gpu-v100-latest-1
11 windows-amd64-cpu16
4 linux-arm64-cpu16
1 linux-amd64-gpu-h100-latest-1-testing

Contributor

@fbusato fbusato left a comment
Using `choose_offset_t` instead of promotion is a nice idea!
On the other hand, I'm a fan of using unsigned types for the offsets. The performance here looks good, but there are cases where we can get regressions.

@elstehle elstehle merged commit cc7c1bb into NVIDIA:main Jan 11, 2025
davebayer pushed a commit to davebayer/cccl that referenced this pull request Jan 18, 2025
davebayer added a commit to davebayer/cccl that referenced this pull request Jan 20, 2025
implement `add_sat`

split `signed`/`unsigned` implementation, improve implementation for MSVC

improve device `add_sat` implementation

add `add_sat` test

improve generic `add_sat` implementation for signed types

implement `sub_sat`

allow more msvc intrinsics on x86

add op tests

partially implement `mul_sat`

implement `div_sat` and `saturate_cast`

add `saturate_cast` test

simplify `div_sat` test

Deprecate C++11 and C++14 for libcu++ (#3173)

* Deprecate C++11 and C++14 for libcu++

Co-authored-by: Bernhard Manfred Gruber <bernhardmgruber@gmail.com>

Implement `abs` and `div` from `cstdlib` (#3153)

* implement integer abs functions
* improve tests, fix constexpr support
* just use the our implementation
* implement `cuda::std::div`
* prefer host's `div_t` like types
* provide `cuda::std::abs` overloads for floats
* allow fp abs for NVRTC
* silence msvc's warning about conversion from floating point to integral

Fix missing radix sort policies (#3174)

Fixes NVBug 5009941

Introduces new `DeviceReduce::Arg{Min,Max}` interface with two output iterators (#3148)

* introduces new arg{min,max} interface with two output iterators

* adds fp inf tests

* fixes docs

* improves code example

* fixes exec space specifier

* trying to fix deprecation warning for more compilers

* inlines unzip operator

* trying to fix deprecation warning for nvhpc

* integrates supression fixes in diagnostics

* pre-ctk 11.5 deprecation suppression

* fixes icc

* fix for pre-ctk11.5

* cleans up deprecation suppression

* cleanup

Extend tuning documentation (#3179)

Add codespell pre-commit hook, fix typos in CCCL (#3168)

* Add codespell pre-commit hook
* Automatic changes from codespell.
* Manual changes.

Fix parameter space for TUNE_LOAD in scan benchmark (#3176)

fix various old compiler checks (#3178)

implement C++26 `std::projected` (#3175)

Fix pre-commit config for codespell and remaining typos (#3182)

Massive cleanup of our config (#3155)

Fix UB in atomics with automatic storage (#2586)

* Adds specialized local cuda atomics and injects them into most atomics paths.

Co-authored-by: Georgy Evtushenko <evtushenko.georgy@gmail.com>
Co-authored-by: gonzalobg <65027571+gonzalobg@users.noreply.github.com>

* Allow CUDA 12.2 to keep perf, this addresses earlier comments in #478

* Remove extraneous double brackets in unformatted code.

* Merge unsafe atomic logic into `__cuda_is_local`.

* Use `const_cast` for type conversions in cuda_local.h

* Fix build issues from interface changes

* Fix missing __nanosleep on sm70-

* Guard __isLocal from NVHPC

* Use PTX instead of running nothing from NVHPC

* fixup /s/nvrtc/nvhpc

* Fixup missing CUDA ifdef surrounding device code

* Fix codegen

* Bypass some sort of compiler bug on GCC7

* Apply suggestions from code review

* Use unsafe automatic storage atomics in codegen tests

---------

Co-authored-by: Georgy Evtushenko <evtushenko.georgy@gmail.com>
Co-authored-by: gonzalobg <65027571+gonzalobg@users.noreply.github.com>
Co-authored-by: Michael Schellenberger Costa <miscco@nvidia.com>

Refactor the source code layout for `cuda.parallel` (#3177)

* Refactor the source layout for cuda.parallel

* Add copyright

* Address review feedback

* Don't import anything into `experimental` namespace

* fix import

---------

Co-authored-by: Ashwin Srinath <shwina@users.noreply.github.com>

new type-erased memory resources (#2824)

s/_LIBCUDACXX_DECLSPEC_EMPTY_BASES/_CCCL_DECLSPEC_EMPTY_BASES/g (#3186)

Document address stability of `thrust::transform` (#3181)

* Do not document _LIBCUDACXX_MARK_CAN_COPY_ARGUMENTS
* Reformat and fix UnaryFunction/BinaryFunction in transform docs
* Mention transform can use proclaim_copyable_arguments
* Document cuda::proclaims_copyable_arguments better
* Deprecate depending on transform functor argument addresses

Fixes: #3053

turn off cuda version check for clangd (#3194)

[STF] jacobi example based on parallel_for (#3187)

* Simple jacobi example with parallel for and reductions

* clang-format

* remove useless capture list

fixes pre-nv_diag suppression issues (#3189)

Prefer c2h::type_name over c2h::demangle (#3195)

Fix memcpy_async* tests (#3197)

* memcpy_async_tx: Fix bug in test

Two bugs, one of which occurs in practice:

1. There is a missing fence.proxy.space::global between the writes to
   global memory and the memcpy_async_tx. (Occurs in practice)

2. The end of the kernel should be fenced with `__syncthreads()`,
   because the barrier is invalidated in the destructor. If other
   threads are still waiting on it, there will be UB. (Has not yet
   manifested itself)

* cp_async_bulk_tensor: Pre-emptively fence more in test

Add type annotations and mypy checks for `cuda.parallel`  (#3180)

* Refactor the source layout for cuda.parallel

* Add initial type annotations

* Update pre-commit config

* More typing

* Fix bad merge

* Fix TYPE_CHECKING and numpy annotations

* typing bindings.py correctly

* Address review feedback

---------

Co-authored-by: Ashwin Srinath <shwina@users.noreply.github.com>

Fix rendering of cuda.parallel docs (#3192)

* Fix pre-commit config for codespell and remaining typos

* Fix rendering of docs for cuda.parallel

---------

Co-authored-by: Ashwin Srinath <shwina@users.noreply.github.com>

Enable PDL for DeviceMergeSortBlockSortKernel (#3199)

The kernel already contains a call to _CCCL_PDL_GRID_DEPENDENCY_SYNC.
This commit enables PDL when launching the kernel.

Adds support for large `num_items` to `DeviceReduce::{ArgMin,ArgMax}` (#2647)

* adds benchmarks for reduce::arg{min,max}

* preliminary streaming arg-extremum reduction

* fixes implicit conversion

* uses streaming dispatch class

* changes arg benches to use new streaming reduce

* streaming arg-extrema reduction

* fixes style

* fixes compilation failures

* cleanups

* adds rst style comments

* declare vars const and use clamp

* consolidates argmin argmax benchmarks

* fixes thrust usage

* drops offset type in arg-extrema benchmarks

* fixes clang cuda

* exec space macros

* switch to signed global offset type for slightly better perf

* clarifies documentation

* applies minor benchmark style changes from review comments

* fixes interface documentation and comments

* list-init accumulating output op

* improves style, comments, and tests

* cleans up aggregate init

* renames dispatch class usage in benchmarks

* fixes merge conflicts

* addresses review comments

* addresses review comments

* fixes assertion

* removes superseded implementation

* changes large problem tests to use new interface

* removes obsolete tests for deprecated interface

Fixes for Python 3.7 docs environment (#3206)

Co-authored-by: Ashwin Srinath <shwina@users.noreply.github.com>

Adds support for large number of items to `DeviceTransform` (#3172)

* moves large problem test helper to common file

* adds support for large num items to device transform

* adds tests for large number of items to device interface

* fixes format

* addresses review comments

cp_async_bulk: Fix test (#3198)

* memcpy_async_tx: Fix bug in test

Two bugs, one of which occurs in practice:

1. There is a missing fence.proxy.space::global between the writes to
   global memory and the memcpy_async_tx. (Occurs in practice)

2. The end of the kernel should be fenced with `__syncthreads()`,
   because the barrier is invalidated in the destructor. If other
   threads are still waiting on it, there will be UB. (Has not yet
   manifested itself)

* cp_async_bulk_tensor: Pre-emptively fence more in test

* cp_async_bulk: Fix test

The global memory pointer could be misaligned.

cudax fixes for msvc 14.41 (#3200)

avoid instantiating class templates in `is_same` implementation when possible (#3203)

Fix: make launchers a CUB detail; make kernel source functions hidden. (#3209)

* Fix: make launchers a CUB detail; make kernel source functions hidden.

* [pre-commit.ci] auto code formatting

* Address review comments, fix which macro gets fixed.

help the ranges concepts recognize standard contiguous iterators in c++14/17 (#3202)

unify macros and cmake options that control the suppression of deprecation warnings (#3220)

* unify macros and cmake options that control the suppression of deprecation warnings

* suppress nvcc warning #186 in thrust header tests

* suppress c++ dialect deprecation warnings in libcudacxx header tests

Fix thread-reduce performance regression (#3225)

cuda.parallel: In-memory caching of build objects (#3216)

* Define __eq__ and __hash__ for Iterators

* Define cache_with_key utility and use it to cache Reduce objects

* Add tests for caching Reduce objects

* Tighten up types

* Updates to support 3.7

* Address review feedback

* Introduce IteratorKind to hold iterator type information

* Use the .kind to generate an abi_name

* Remove __eq__ and __hash__ methods from IteratorBase

* Move helper function

* Formatting

* Don't unpack tuple in cache key

---------

Co-authored-by: Ashwin Srinath <shwina@users.noreply.github.com>

Just enough ranges for c++14 `span` (#3211)

use generalized concepts portability macros to simplify the `range` concept (#3217)

fixes some issues in the concepts portability macros and then re-implements the `range` concept with `_CCCL_REQUIRES_EXPR`

Use Ruff to sort imports (#3230)

* Update pyproject.tomls for import sorting

* Update files after running pre-commit

* Move ruff config to pyproject.toml

---------

Co-authored-by: Ashwin Srinath <shwina@users.noreply.github.com>

fix tuning_scan sm90 config issue (#3236)

Co-authored-by: Shijie Chen <shijiec@nvidia.com>

[STF] Logical token (#3196)

* Split the implementation of the void interface into the definition of the interface, and its implementations on streams and graphs.

* Add missing files

* Check if a task implementation can match a prototype where the void_interface arguments are ignored

* Implement ctx.abstract_logical_data() which relies on a void data interface

* Illustrate how to use abstract handles in local contexts

* Introduce an is_void_interface() virtual method in the data interface to potentially optimize some stages

* Small improvements in the examples

* Do not try to allocate or move void data

* Do not use I as a variable

* fix linkage error

* rename abstract_logical_data into logical_token

* Document logical token

* fix spelling error

* fix sphinx error

* reflect name changes

* use meaningful variable names

* simplify logical_token implementation because writeback is already disabled

* add a unit test for token elision

* implement token elision in host_launch

* Remove unused type

* Implement helpers to check if a function can be invoked from a tuple, or from a tuple where we removed tokens

* Much simpler is_tuple_invocable_with_filtered implementation

* Fix buggy test

* Factorize code

* Document that we can ignore tokens for task and host_launch

* Documentation for logical data freeze

Fix ReduceByKey tuning (#3240)

Fix RLE tuning (#3239)

cuda.parallel: Forbid non-contiguous arrays as inputs (or outputs) (#3233)

* Forbid non-contiguous arrays as inputs (or outputs)

* Implement a more robust way to check for contiguity

* Don't bother if cublas unavailable

* Fix how we check for zero-element arrays

* sort imports

---------

Co-authored-by: Ashwin Srinath <shwina@users.noreply.github.com>

expands support for more offset types in segmented benchmark (#3231)

Add escape hatches to the cmake configuration of the header tests so that we can test deprecated compilers / dialects (#3253)

* Add escape hatches to the cmake configuration of the header tests so that we can test deprecated compilers / dialects

* Do not add option twice

ptx: Add add_instruction.py (#3190)

This file helps create the necessary structure for new PTX instructions.

Co-authored-by: Allard Hendriksen <ahendriksen@nvidia.com>

Bump main to 2.9.0. (#3247)

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

Drop cub::Mutex (#3251)

Fixes: #3250

Remove legacy macros from CUB util_arch.cuh (#3257)

Fixes: #3256

Remove thrust::[unary|binary]_traits (#3260)

Fixes: #3259

Architecture and OS identification macros (#3237)

Bump main to 3.0.0. (#3265)

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

Drop thrust not1 and not2 (#3264)

Fixes: #3263

CCCL Internal macro documentation (#3238)

Deprecate GridBarrier and GridBarrierLifetime (#3258)

Fixes: #1389

Require at least gcc7 (#3268)

Fixes: #3267

Drop thrust::[unary|binary]_function (#3274)

Fixes: #3273

Drop ICC from CI (#3277)

[STF] Corruption of the capture list of an extended lambda with a parallel_for construct on a host execution place (#3270)

* Add a test to reproduce a bug observed with parallel_for on a host place

* clang-format

* use _CCCL_ASSERT

* Attempt to debug

* do not create a tuple with a universal reference that is out of scope when we use it, use an lvalue instead

* fix lambda expression

* clang-format

Enable thrust::identity test for non-MSVC (#3281)

This seems to be an oversight when the test was added

Co-authored-by: Michael Schellenberger Costa <miscco@nvidia.com>

Enable PDL in triple chevron launch (#3282)

It seems PDL was disabled by accident when _THRUST_HAS_PDL was renamed
to _CCCL_HAS_PDL during the review introducing the feature.

Disambiguate line continuations and macro continuations in <nv/target> (#3244)

Drop VS 2017 from CI (#3287)

Fixes: #3286

Drop ICC support in code (#3279)

* Drop ICC from code

Fixes: #3278

Co-authored-by: Michael Schellenberger Costa <miscco@nvidia.com>

Make CUB NVRTC commandline arguments come from a cmake template (#3292)

Propose the same components (thrust, cub, libc++, cudax, cuda.parallel,...) in the bug report template than in the feature request template (#3295)

Use process isolation instead of default hyper-v for Windows. (#3294)

Try improving build times by using process isolation instead of hyper-v

Co-authored-by: Michael Schellenberger Costa <miscco@nvidia.com>

[pre-commit.ci] pre-commit autoupdate (#3248)

* [pre-commit.ci] pre-commit autoupdate

updates:
- [github.com/pre-commit/mirrors-clang-format: v18.1.8 → v19.1.6](https://github.com/pre-commit/mirrors-clang-format/compare/v18.1.8...v19.1.6)
- [github.com/astral-sh/ruff-pre-commit: v0.8.3 → v0.8.6](https://github.com/astral-sh/ruff-pre-commit/compare/v0.8.3...v0.8.6)
- [github.com/pre-commit/mirrors-mypy: v1.13.0 → v1.14.1](https://github.com/pre-commit/mirrors-mypy/compare/v1.13.0...v1.14.1)

Co-authored-by: Michael Schellenberger Costa <miscco@nvidia.com>

Drop Thrust legacy arch macros (#3298)

Which were disabled and could be re-enabled using THRUST_PROVIDE_LEGACY_ARCH_MACROS

Drop Thrust's compiler_fence.h (#3300)

Drop CTK 11.x from CI (#3275)

* Add cuda12.0-gcc7 devcontainer
* Move MSVC2017 jobs to CTK 12.6
That is the only combination where rapidsai has devcontainers
* Add /Zc:__cplusplus for the libcudacxx tests
* Only add escape hatch for affected CTKs
* Workaround missing cudaLaunchKernelEx on MSVC
cudaLaunchKernelEx requires C++11, but unfortunately <cuda_runtime.h> checks this using the __cplusplus macro, which is reported wrongly by MSVC. CTK 12.3 fixed this by additionally detecting _MSC_VER. As a workaround, we provide our own copy of cudaLaunchKernelEx when it is not available from the CTK.
* Workaround nvcc+MSVC issue
* Regenerate devcontainers

Fixes: #3249

Co-authored-by: Michael Schellenberger Costa <miscco@nvidia.com>

Drop CUB's util_compiler.cuh (#3302)

All contained macros were deprecated

Update packman and repo_docs versions (#3293)

Co-authored-by: Ashwin Srinath <shwina@users.noreply.github.com>

Drop Thrust's deprecated compiler macros (#3301)

Drop CUB_RUNTIME_ENABLED and __THRUST_HAS_CUDART__ (#3305)

Adds support for large number of items to `DevicePartition::If` with the `ThreeWayPartition` overload (#2506)

* adds support for large number of items to three-way partition

* adapts interface to use choose_signed_offset_t

* integrates applicable feedback from device-select pr

* changes behavior for empty problems

* unifies grid constant macro

* fixes kernel template specialization mismatch

* integrates _CCCL_GRID_CONSTANT changes

* resolve merge conflicts

* fixes checks in test

* fixes test verification

* improves tests

* makes few improvements to streaming dispatch

* improves code comment on test

* fixes unrelated compiler error

* minor style improvements

Refactor scan tunings (#3262)

Require C++17 for compiling Thrust and CUB (#3255)

* Issue an unsuppressable warning when compiling with < C++17
* Remove C++11/14 presets
* Remove CCCL_IGNORE_DEPRECATED_CPP_DIALECT from headers
* Remove [CUB|THRUST|TCT]_IGNORE_DEPRECATED_CPP_[11|14]
* Remove CUB_ENABLE_DIALECT_CPP[11|14]
* Update CI runs
* Remove C++11/14 CI runs for CUB and Thrust
* Raise compiler minimum versions for C++17
* Update ReadMe
* Drop Thrust's cpp14_required.h
* Add escape hatch for C++17 removal

Fixes: #3252

Implement `views::empty` (#3254)

* Disable pair conversion of subrange with clang in C++17

* Fix namespace views

* Implement `views::empty`

This implements `std::ranges::views::empty`, see https://en.cppreference.com/w/cpp/ranges/empty_view

Refactor `limits` and `climits` (#3221)

* implement builtins for huge val, nan and nans

* change `INFINITY` and `NAN` implementation for NVRTC

cuda.parallel: Add documentation for the current iterators along with examples and tests (#3311)

* Add tests demonstrating usage of different iterators

* Update documentation of reduce_into by merging import code snippet with the rest of the example

* Add documentation for current iterators

* Run pre-commit checks and update accordingly

* Fix comments to refer to the proper lines in the code snippets in the docs

Drop clang<14 from CI, update devcontainers. (#3309)

Co-authored-by: Bernhard Manfred Gruber <bernhardmgruber@gmail.com>

[STF] Cleanup task dependencies object constructors (#3291)

* Define tag types for access modes

* - Rework how we build task_dep objects based on access mode tags
- pack_state is now responsible for using a const_cast for read only data

* Greatly simplify the previous attempt : do not define new types, but use integral constants based on the enums

* It seems the const_cast was not necessary, so we can simplify it and not even do some dispatch based on access modes

Disable test with a gcc-14 regression (#3297)

Deprecate Thrust's cpp_compatibility.h macros (#3299)

Remove dropped function objects from docs (#3319)

Document `NV_TARGET` macros (#3313)

[STF] Define ctx.pick_stream() which was missing for the unified context (#3326)

* Define ctx.pick_stream() which was missing for the unified context

* clang-format

Deprecate cub::IterateThreadStore (#3337)

Drop CUB's BinaryFlip operator (#3332)

Deprecate cub::Swap (#3333)

Clarify transform output can overlap input (#3323)

Drop CUB APIs with a debug_synchronous parameter (#3330)

Fixes: #3329

Drop CUB's util_compiler.cuh for real (#3340)

PR #3302 planned to drop the file, but only dropped its content. This
was an oversight. So let's drop the entire file.

Drop cub::ValueCache (#3346)

limits offset types for merge sort (#3328)

Drop CDPv1 (#3344)

Fixes: #3341

Drop thrust::void_t (#3362)

Use cuda::std::addressof in Thrust (#3363)

Fix all_of documentation for empty ranges (#3358)

all_of always returns true on an empty range.

[STF] Do not keep track of dangling events in a CUDA graph backend (#3327)

* Unlike the CUDA stream backend, nodes in a CUDA graph are necessarily done when
the CUDA graph completes. Therefore keeping track of "dangling events" is a
waste of time and resources.

* replace can_ignore_dangling_events by track_dangling_events which leads to more readable code

* When not storing the dangling events, we must still perform the deinit operations that were producing these events!

Extract scan kernels into NVRTC-compilable header (#3334)

* Extract scan kernels into NVRTC-compilable header

* Update cub/cub/device/dispatch/dispatch_scan.cuh

Co-authored-by: Georgii Evtushenko <evtushenko.georgy@gmail.com>

---------

Co-authored-by: Ashwin Srinath <shwina@users.noreply.github.com>
Co-authored-by: Georgii Evtushenko <evtushenko.georgy@gmail.com>

Drop deprecated aliases in Thrust functional (#3272)

Fixes: #3271

Drop cub::DivideAndRoundUp (#3347)

Use cuda::std::min/max in Thrust (#3364)

Implement `cuda::std::numeric_limits` for `__half` and `__nv_bfloat16` (#3361)

* implement `cuda::std::numeric_limits` for `__half` and `__nv_bfloat16`

Cleanup util_arch (#2773)

Deprecate thrust::null_type (#3367)

Deprecate cub::DeviceSpmv (#3320)

Fixes: #896

Improves `DeviceSegmentedSort` test run time for large number of items and segments (#3246)

* fixes segment offset generation

* switches to analytical verification

* switches to analytical verification for pairs

* fixes spelling

* adds tests for large number of segments

* fixes narrowing conversion in tests

* addresses review comments

* fixes includes

Compile basic infra test with C++17 (#3377)

Adds support for large number of items and large number of segments to `DeviceSegmentedSort` (#3308)

* fixes segment offset generation

* switches to analytical verification

* switches to analytical verification for pairs

* addresses review comments

* introduces segment offset type

* adds tests for large number of segments

* adds support for large number of segments

* drops segment offset type

* fixes thrust namespace

* removes about-to-be-deprecated cub iterators

* no exec specifier on defaulted ctor

* fixes gcc7 linker error

* uses local_segment_index_t throughout

* determine offset type based on type returned by segment iterator begin/end iterators

* minor style improvements

Exit with error when RAPIDS CI fails. (#3385)

cuda.parallel: Support structured types as algorithm inputs (#3218)

* Introduce gpu_struct decorator and typing

* Enable `reduce` to accept arrays of structs as inputs

* Add test for reducing arrays-of-struct

* Update documentation

* Use a numpy array rather than ctypes object

* Change zeros -> empty for output array and temp storage

* Add a TODO for typing GpuStruct

* Documentation updates

* Remove test_reduce_struct_type from test_reduce.py

* Revert to `to_cccl_value()` accepting ndarray + GpuStruct

* Bump copyrights

---------

Co-authored-by: Ashwin Srinath <shwina@users.noreply.github.com>

Deprecate thrust::async (#3324)

Fixes: #100

Review/Deprecate CUB `util.ptx` for CCCL 2.x (#3342)

Fix broken `_CCCL_BUILTIN_ASSUME` macro (#3314)

* add compiler-specific path
* fix device code path
* add _CCC_ASSUME

Deprecate thrust::numeric_limits (#3366)

Replace `typedef` with `using` in libcu++ (#3368)

Deprecate thrust::optional (#3307)

Fixes: #3306

Upgrade to Catch2 3.8  (#3310)

Fixes: #1724

refactor `<cuda/std/cstdint>` (#3325)

Co-authored-by: Bernhard Manfred Gruber <bernhardmgruber@gmail.com>

Update CODEOWNERS (#3331)

* Update CODEOWNERS

* Update CODEOWNERS

* Update CODEOWNERS

* [pre-commit.ci] auto code formatting

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

Fix sign-compare warning (#3408)

Implement more cmath functions to be usable on host and device (#3382)

* Implement more cmath functions to be usable on host and device

* Implement math roots functions

* Implement exponential functions

Redefine and deprecate thrust::remove_cvref (#3394)

* Redefine and deprecate thrust::remove_cvref

Co-authored-by: Michael Schellenberger Costa <miscco@nvidia.com>

Fix assert definition for NVHPC due to constexpr issues (#3418)

NVHPC cannot decide at compile time where the code would run, so _CCCL_ASSERT within a constexpr function breaks it.

Fix this by always using the host definition which should also work on device.

Fixes #3411

Extend CUB reduce benchmarks (#3401)

* Rename max.cu to custom.cu, since it uses a custom operator
* Extend types covered by min.cu to all fundamental types
* Add some notes on how to collect tuning parameters

Fixes: #3283

Update upload-pages-artifact to v3 (#3423)

* Update upload-pages-artifact to v3

* Empty commit

---------

Co-authored-by: Ashwin Srinath <shwina@users.noreply.github.com>

Replace and deprecate thrust::cuda_cub::terminate (#3421)

`std::linalg` accessors and `transposed_layout` (#2962)

Add round up/down to multiple (#3234)

[FEA]: Introduce Python module with CCCL headers (#3201)

* Add cccl/python/cuda_cccl directory and use from cuda_parallel, cuda_cooperative

* Run `copy_cccl_headers_to_aude_include()` before `setup()`

* Create python/cuda_cccl/cuda/_include/__init__.py, then simply import cuda._include to find the include path.

* Add cuda.cccl._version exactly as for cuda.cooperative and cuda.parallel

* Bug fix: cuda/_include only exists after shutil.copytree() ran.

* Use `f"cuda-cccl @ file://{cccl_path}/python/cuda_cccl"` in setup.py

* Remove CustomBuildCommand, CustomWheelBuild in cuda_parallel/setup.py (they are equivalent to the default functions)

* Replace := operator (needs Python 3.8+)

* Fix oversights: remove `pip3 install ./cuda_cccl` lines from README.md

* Restore original README.md: `pip3 install -e` now works on first pass.

* cuda_cccl/README.md: FOR INTERNAL USE ONLY

* Remove `$pymajor.$pyminor.` prefix in cuda_cccl _version.py (as suggested under https://github.com/NVIDIA/cccl/pull/3201#discussion_r1894035917)

Command used: ci/update_version.sh 2 8 0

* Modernize pyproject.toml, setup.py

Trigger for this change:

* https://github.com/NVIDIA/cccl/pull/3201#discussion_r1894043178

* https://github.com/NVIDIA/cccl/pull/3201#discussion_r1894044996

* Install CCCL headers under cuda.cccl.include

Trigger for this change:

* https://github.com/NVIDIA/cccl/pull/3201#discussion_r1894048562

Unexpected accidental discovery: cuda.cooperative unit tests pass without CCCL headers entirely.

* Factor out cuda_cccl/cuda/cccl/include_paths.py

* Reuse cuda_cccl/cuda/cccl/include_paths.py from cuda_cooperative

* Add missing Copyright notice.

* Add missing __init__.py (cuda.cccl)

* Add `"cuda.cccl"` to `autodoc.mock_imports`

* Move cuda.cccl.include_paths into function where it is used. (Attempt to resolve Build and Verify Docs failure.)

* Add # TODO: move this to a module-level import

* Modernize cuda_cooperative/pyproject.toml, setup.py

* Convert cuda_cooperative to use hatchling as build backend.

* Revert "Convert cuda_cooperative to use hatchling as build backend."

This reverts commit 61637d608da06fcf6851ef6197f88b5e7dbc3bbe.

* Move numpy from [build-system] requires -> [project] dependencies

* Move pyproject.toml [project] dependencies -> setup.py install_requires, to be able to use CCCL_PATH

* Remove copy_license() and use license_files=["../../LICENSE"] instead.

* Further modernize cuda_cccl/setup.py to use pathlib

* Trivial simplifications in cuda_cccl/pyproject.toml

* Further simplify cuda_cccl/pyproject.toml, setup.py: remove inconsequential code

* Make cuda_cooperative/pyproject.toml more similar to cuda_cccl/pyproject.toml

* Add taplo-pre-commit to .pre-commit-config.yaml

* taplo-pre-commit auto-fixes

* Use pathlib in cuda_cooperative/setup.py

* CCCL_PYTHON_PATH in cuda_cooperative/setup.py

* Modernize cuda_parallel/pyproject.toml, setup.py

* Use pathlib in cuda_parallel/setup.py

* Add `# TOML lint & format` comment.

* Replace MANIFEST.in with `[tool.setuptools.package-data]` section in pyproject.toml

* Use pathlib in cuda/cccl/include_paths.py

* pre-commit autoupdate (EXCEPT clang-format, which was manually restored)

* Fixes after git merge main

* Resolve warning: AttributeError: '_Reduce' object has no attribute 'build_result'

```
=========================================================================== warnings summary ===========================================================================
tests/test_reduce.py::test_reduce_non_contiguous
  /home/coder/cccl/python/devenv/lib/python3.12/site-packages/_pytest/unraisableexception.py:85: PytestUnraisableExceptionWarning: Exception ignored in: <function _Reduce.__del__ at 0x7bf123139080>

  Traceback (most recent call last):
    File "/home/coder/cccl/python/cuda_parallel/cuda/parallel/experimental/algorithms/reduce.py", line 132, in __del__
      bindings.cccl_device_reduce_cleanup(ctypes.byref(self.build_result))
                                                       ^^^^^^^^^^^^^^^^^
  AttributeError: '_Reduce' object has no attribute 'build_result'

    warnings.warn(pytest.PytestUnraisableExceptionWarning(msg))

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
============================================================= 1 passed, 93 deselected, 1 warning in 0.44s ==============================================================
```

* Move `copy_cccl_headers_to_cuda_cccl_include()` functionality to `class CustomBuildPy`

* Introduce cuda_cooperative/constraints.txt

* Also add cuda_parallel/constraints.txt

* Add `--constraint constraints.txt` in ci/test_python.sh

* Update Copyright dates

* Switch to https://github.com/ComPWA/taplo-pre-commit (the other repo has been archived by the owner on Jul 1, 2024)

For completeness: The other repo took a long time to install into the pre-commit cache; so long it led to timeouts in the CCCL CI.

* Remove unused cuda_parallel jinja2 dependency (noticed by chance).

* Remove constraints.txt files, advertise running `pip install cuda-cccl` first instead.

* Make cuda_cooperative, cuda_parallel testing completely independent.

* Run only test_python.sh [skip-rapids][skip-matx][skip-docs][skip-vdc]

* Try using another runner (because V100 runners seem to be stuck) [skip-rapids][skip-matx][skip-docs][skip-vdc]

* Fix sign-compare warning (#3408) [skip-rapids][skip-matx][skip-docs][skip-vdc]

* Revert "Try using another runner (because V100 runners seem to be stuck) [skip-rapids][skip-matx][skip-docs][skip-vdc]"

This reverts commit ea33a218ed77a075156cd1b332047202adb25aa2.

Error message: https://github.com/NVIDIA/cccl/pull/3201#issuecomment-2594012971

* Try using A100 runner (because V100 runners still seem to be stuck) [skip-rapids][skip-matx][skip-docs][skip-vdc]

* Also show cuda-cooperative site-packages, cuda-parallel site-packages (after pip install) [skip-rapids][skip-matx][skip-docs][skip-vdc]

* Try using l4 runner (because V100 runners still seem to be stuck) [skip-rapids][skip-matx][skip-docs][skip-vdc]

* Restore original ci/matrix.yaml [skip-rapids]

* Use for loop in test_python.sh to avoid code duplication.

* Run only test_python.sh [skip-rapids][skip-matx][skip-docs][skip-vdc][skip pre-commit.ci]

* Comment out taplo-lint in pre-commit config [skip-rapids][skip-matx][skip-docs][skip-vdc]

* Revert "Run only test_python.sh [skip-rapids][skip-matx][skip-docs][skip-vdc][skip pre-commit.ci]"

This reverts commit ec206fd8b50a6a293e00a5825b579e125010b13d.

* Implement suggestion by @shwina (https://github.com/NVIDIA/cccl/pull/3201#pullrequestreview-2556918460)

* Address feedback by @leofang

---------

Co-authored-by: Bernhard Manfred Gruber <bernhardmgruber@gmail.com>

cuda.parallel: Add optional stream argument to reduce_into() (#3348)

* Add optional stream argument to reduce_into()

* Add tests to check for reduce_into() stream behavior

* Move protocol related utils to separate file and rework __cuda_stream__ error messages

* Fix synchronization issue in stream test and add one more invalid stream test case

* Rename cuda stream validation function after removing leading underscore

* Unpack values from __cuda_stream__ instead of indexing

* Fix linting errors

* Handle TypeError when unpacking invalid __cuda_stream__ return

* Use stream to allocate cupy memory in new stream test
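The `__cuda_stream__` protocol referenced above yields a `(version, handle)` pair; a minimal validation sketch in that spirit (function name and error messages are illustrative, not the cuda.parallel API):

```python
def validate_cuda_stream(stream) -> int:
    """Return the raw stream handle, raising TypeError for malformed input."""
    try:
        # Unpack instead of indexing: non-iterables and wrong arities
        # both surface here as TypeError/ValueError.
        version, handle = stream.__cuda_stream__()
    except AttributeError:
        raise TypeError(f"{type(stream).__name__} does not implement __cuda_stream__")
    except (TypeError, ValueError):
        raise TypeError("__cuda_stream__ must return a (version, handle) pair")
    if not isinstance(handle, int):
        raise TypeError("stream handle must be an integer")
    return handle
```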

Upgrade to actions/deploy-pages@v4 (from v2), as suggested by @leofang (#3434)

Deprecate `cub::{min, max}` and replace internal uses with those from libcu++ (#3419)

* Deprecate `cub::{min, max}` and replace internal uses with those from libcu++

Fixes #3404

move to c++17, finalize device optimization

fix msvc compilation, update tests

Deprecate C++11 and C++14 for libcu++ (#3173)

* Deprecate C++11 and C++14 for libcu++

Co-authored-by: Bernhard Manfred Gruber <bernhardmgruber@gmail.com>

Implement `abs` and `div` from `cstdlib` (#3153)

* implement integer abs functions
* improve tests, fix constexpr support
* just use the our implementation
* implement `cuda::std::div`
* prefer host's `div_t` like types
* provide `cuda::std::abs` overloads for floats
* allow fp abs for NVRTC
* silence msvc's warning about conversion from floating point to integral

Fix missing radix sort policies (#3174)

Fixes NVBug 5009941

Introduces new `DeviceReduce::Arg{Min,Max}` interface with two output iterators (#3148)

* introduces new arg{min,max} interface with two output iterators

* adds fp inf tests

* fixes docs

* improves code example

* fixes exec space specifier

* trying to fix deprecation warning for more compilers

* inlines unzip operator

* trying to fix deprecation warning for nvhpc

* integrates suppression fixes in diagnostics

* pre-ctk 11.5 deprecation suppression

* fixes icc

* fix for pre-ctk11.5

* cleans up deprecation suppression

* cleanup

Extend tuning documentation (#3179)

Add codespell pre-commit hook, fix typos in CCCL (#3168)

* Add codespell pre-commit hook
* Automatic changes from codespell.
* Manual changes.

Fix parameter space for TUNE_LOAD in scan benchmark (#3176)

fix various old compiler checks (#3178)

implement C++26 `std::projected` (#3175)

Fix pre-commit config for codespell and remaining typos (#3182)

Massive cleanup of our config (#3155)

Fix UB in atomics with automatic storage (#2586)

* Adds specialized local cuda atomics and injects them into most atomics paths.

Co-authored-by: Georgy Evtushenko <evtushenko.georgy@gmail.com>
Co-authored-by: gonzalobg <65027571+gonzalobg@users.noreply.github.com>

* Allow CUDA 12.2 to keep perf, this addresses earlier comments in #478

* Remove extraneous double brackets in unformatted code.

* Merge unsafe atomic logic into `__cuda_is_local`.

* Use `const_cast` for type conversions in cuda_local.h

* Fix build issues from interface changes

* Fix missing __nanosleep on sm70-

* Guard __isLocal from NVHPC

* Use PTX instead of running nothing from NVHPC

* fixup /s/nvrtc/nvhpc

* Fixup missing CUDA ifdef surrounding device code

* Fix codegen

* Bypass some sort of compiler bug on GCC7

* Apply suggestions from code review

* Use unsafe automatic storage atomics in codegen tests

---------

Co-authored-by: Georgy Evtushenko <evtushenko.georgy@gmail.com>
Co-authored-by: gonzalobg <65027571+gonzalobg@users.noreply.github.com>
Co-authored-by: Michael Schellenberger Costa <miscco@nvidia.com>

Refactor the source code layout for `cuda.parallel` (#3177)

* Refactor the source layout for cuda.parallel

* Add copyright

* Address review feedback

* Don't import anything into `experimental` namespace

* fix import

---------

Co-authored-by: Ashwin Srinath <shwina@users.noreply.github.com>

new type-erased memory resources (#2824)

s/_LIBCUDACXX_DECLSPEC_EMPTY_BASES/_CCCL_DECLSPEC_EMPTY_BASES/g (#3186)

Document address stability of `thrust::transform` (#3181)

* Do not document _LIBCUDACXX_MARK_CAN_COPY_ARGUMENTS
* Reformat and fix UnaryFunction/BinaryFunction in transform docs
* Mention transform can use proclaim_copyable_arguments
* Document cuda::proclaims_copyable_arguments better
* Deprecate depending on transform functor argument addresses

Fixes: #3053

turn off cuda version check for clangd (#3194)

[STF] jacobi example based on parallel_for (#3187)

* Simple jacobi example with parallel for and reductions

* clang-format

* remove useless capture list

fixes pre-nv_diag suppression issues (#3189)

Prefer c2h::type_name over c2h::demangle (#3195)

Fix memcpy_async* tests (#3197)

* memcpy_async_tx: Fix bug in test

Two bugs, one of which occurs in practice:

1. There is a missing fence.proxy.space::global between the writes to
   global memory and the memcpy_async_tx. (Occurs in practice)

2. The end of the kernel should be fenced with `__syncthreads()`,
   because the barrier is invalidated in the destructor. If other
   threads are still waiting on it, there will be UB. (Has not yet
   manifested itself)

* cp_async_bulk_tensor: Pre-emptively fence more in test

Add type annotations and mypy checks for `cuda.parallel`  (#3180)

* Refactor the source layout for cuda.parallel

* Add initial type annotations

* Update pre-commit config

* More typing

* Fix bad merge

* Fix TYPE_CHECKING and numpy annotations

* typing bindings.py correctly

* Address review feedback

---------

Co-authored-by: Ashwin Srinath <shwina@users.noreply.github.com>

Fix rendering of cuda.parallel docs (#3192)

* Fix pre-commit config for codespell and remaining typos

* Fix rendering of docs for cuda.parallel

---------

Co-authored-by: Ashwin Srinath <shwina@users.noreply.github.com>

Enable PDL for DeviceMergeSortBlockSortKernel (#3199)

The kernel already contains a call to _CCCL_PDL_GRID_DEPENDENCY_SYNC.
This commit enables PDL when launching the kernel.

Adds support for large `num_items` to `DeviceReduce::{ArgMin,ArgMax}` (#2647)

* adds benchmarks for reduce::arg{min,max}

* preliminary streaming arg-extremum reduction

* fixes implicit conversion

* uses streaming dispatch class

* changes arg benches to use new streaming reduce

* streaming arg-extrema reduction

* fixes style

* fixes compilation failures

* cleanups

* adds rst style comments

* declare vars const and use clamp

* consolidates argmin argmax benchmarks

* fixes thrust usage

* drops offset type in arg-extrema benchmarks

* fixes clang cuda

* exec space macros

* switch to signed global offset type for slightly better perf

* clarifies documentation

* applies minor benchmark style changes from review comments

* fixes interface documentation and comments

* list-init accumulating output op

* improves style, comments, and tests

* cleans up aggregate init

* renames dispatch class usage in benchmarks

* fixes merge conflicts

* addresses review comments

* addresses review comments

* fixes assertion

* removes superseded implementation

* changes large problem tests to use new interface

* removes obsolete tests for deprecated interface

Fixes for Python 3.7 docs environment (#3206)

Co-authored-by: Ashwin Srinath <shwina@users.noreply.github.com>

Adds support for large number of items to `DeviceTransform` (#3172)

* moves large problem test helper to common file

* adds support for large num items to device transform

* adds tests for large number of items to device interface

* fixes format

* addresses review comments

cp_async_bulk: Fix test (#3198)

* memcpy_async_tx: Fix bug in test

Two bugs, one of which occurs in practice:

1. There is a missing fence.proxy.space::global between the writes to
   global memory and the memcpy_async_tx. (Occurs in practice)

2. The end of the kernel should be fenced with `__syncthreads()`,
   because the barrier is invalidated in the destructor. If other
   threads are still waiting on it, there will be UB. (Has not yet
   manifested itself)

* cp_async_bulk_tensor: Pre-emptively fence more in test

* cp_async_bulk: Fix test

The global memory pointer could be misaligned.

cudax fixes for msvc 14.41 (#3200)

avoid instantiating class templates in `is_same` implementation when possible (#3203)

Fix: make launchers a CUB detail; make kernel source functions hidden. (#3209)

* Fix: make launchers a CUB detail; make kernel source functions hidden.

* [pre-commit.ci] auto code formatting

* Address review comments, fix which macro gets fixed.

help the ranges concepts recognize standard contiguous iterators in c++14/17 (#3202)

unify macros and cmake options that control the suppression of deprecation warnings (#3220)

* unify macros and cmake options that control the suppression of deprecation warnings

* suppress nvcc warning #186 in thrust header tests

* suppress c++ dialect deprecation warnings in libcudacxx header tests

Fix thread-reduce performance regression (#3225)

cuda.parallel: In-memory caching of build objects (#3216)

* Define __eq__ and __hash__ for Iterators

* Define cache_with_key utility and use it to cache Reduce objects

* Add tests for caching Reduce objects

* Tighten up types

* Updates to support 3.7

* Address review feedback

* Introduce IteratorKind to hold iterator type information

* Use the .kind to generate an abi_name

* Remove __eq__ and __hash__ methods from IteratorBase

* Move helper function

* Formatting

* Don't unpack tuple in cache key

---------

Co-authored-by: Ashwin Srinath <shwina@users.noreply.github.com>
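A `cache_with_key` utility as described above can be sketched as a decorator that derives the cache key from the call arguments (an illustrative re-implementation, not the cuda.parallel source):

```python
import functools

def cache_with_key(key):
    """Memoize a function, using key(*args, **kwargs) as the cache key."""
    def decorator(func):
        cache = {}

        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            k = key(*args, **kwargs)
            if k not in cache:
                cache[k] = func(*args, **kwargs)
            return cache[k]

        wrapper.cache = cache  # exposed so callers can inspect cache hits
        return wrapper

    return decorator
```

In the PR, the key is built from properties such as dtypes and the iterator's `kind`, so two calls that would compile to the same kernel share one build object.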

Just enough ranges for c++14 `span` (#3211)

use generalized concepts portability macros to simplify the `range` concept (#3217)

fixes some issues in the concepts portability macros and then re-implements the `range` concept with `_CCCL_REQUIRES_EXPR`

Use Ruff to sort imports (#3230)

* Update pyproject.tomls for import sorting

* Update files after running pre-commit

* Move ruff config to pyproject.toml

---------

Co-authored-by: Ashwin Srinath <shwina@users.noreply.github.com>

fix tuning_scan sm90 config issue (#3236)

Co-authored-by: Shijie Chen <shijiec@nvidia.com>

[STF] Logical token (#3196)

* Split the implementation of the void interface into the definition of the interface, and its implementations on streams and graphs.

* Add missing files

* Check if a task implementation can match a prototype where the void_interface arguments are ignored

* Implement ctx.abstract_logical_data() which relies on a void data interface

* Illustrate how to use abstract handles in local contexts

* Introduce an is_void_interface() virtual method in the data interface to potentially optimize some stages

* Small improvements in the examples

* Do not try to allocate or move void data

* Do not use I as a variable

* fix linkage error

* rename abstract_logical_data into logical_token

* Document logical token

* fix spelling error

* fix sphinx error

* reflect name changes

* use meaningful variable names

* simplify logical_token implementation because writeback is already disabled

* add a unit test for token elision

* implement token elision in host_launch

* Remove unused type

* Implement helpers to check if a function can be invoked from a tuple, or from a tuple where we removed tokens

* Much simpler is_tuple_invocable_with_filtered implementation

* Fix buggy test

* Factorize code

* Document that we can ignore tokens for task and host_launch

* Documentation for logical data freeze

Fix ReduceByKey tuning (#3240)

Fix RLE tuning (#3239)

cuda.parallel: Forbid non-contiguous arrays as inputs (or outputs) (#3233)

* Forbid non-contiguous arrays as inputs (or outputs)

* Implement a more robust way to check for contiguity

* Don't bother if cublas unavailable

* Fix how we check for zero-element arrays

* sort imports

---------

Co-authored-by: Ashwin Srinath <shwina@users.noreply.github.com>
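Contiguity can be decided from shape and strides alone, which also makes the zero-element fix above explicit; a pure-Python sketch of a C-contiguity check (the actual cuda.parallel check inspects CUDA array interfaces):

```python
def is_c_contiguous(shape, strides, itemsize):
    """Check C-contiguity from shape and strides (both in elements/bytes).

    Zero-element arrays are trivially contiguous, and axes of extent 1
    place no constraint on their stride.
    """
    if 0 in shape:
        return True
    expected = itemsize
    for extent, stride in zip(reversed(shape), reversed(strides)):
        if extent > 1 and stride != expected:
            return False
        expected *= extent
    return True
```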

expands support for more offset types in segmented benchmark (#3231)

Add escape hatches to the cmake configuration of the header tests so that we can test deprecated compilers / dialects (#3253)

* Add escape hatches to the cmake configuration of the header tests so that we can test deprecated compilers / dialects

* Do not add option twice

ptx: Add add_instruction.py (#3190)

This file helps create the necessary structure for new PTX instructions.

Co-authored-by: Allard Hendriksen <ahendriksen@nvidia.com>

Bump main to 2.9.0. (#3247)

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

Drop cub::Mutex (#3251)

Fixes: #3250

Remove legacy macros from CUB util_arch.cuh (#3257)

Fixes: #3256

Remove thrust::[unary|binary]_traits (#3260)

Fixes: #3259

Architecture and OS identification macros (#3237)

Bump main to 3.0.0. (#3265)

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

Drop thrust not1 and not2 (#3264)

Fixes: #3263

CCCL Internal macro documentation (#3238)

Deprecate GridBarrier and GridBarrierLifetime (#3258)

Fixes: #1389

Require at least gcc7 (#3268)

Fixes: #3267

Drop thrust::[unary|binary]_function (#3274)

Fixes: #3273

Drop ICC from CI (#3277)

[STF] Corruption of the capture list of an extended lambda with a parallel_for construct on a host execution place (#3270)

* Add a test to reproduce a bug observed with parallel_for on a host place

* clang-format

* use _CCCL_ASSERT

* Attempt to debug

* do not create a tuple with a universal reference that is out of scope when we use it, use an lvalue instead

* fix lambda expression

* clang-format

Enable thrust::identity test for non-MSVC (#3281)

This seems to be an oversight when the test was added

Co-authored-by: Michael Schellenberger Costa <miscco@nvidia.com>

Enable PDL in triple chevron launch (#3282)

It seems PDL was disabled by accident when _THRUST_HAS_PDL was renamed
to _CCCL_HAS_PDL during the review introducing the feature.

Disambiguate line continuations and macro continuations in <nv/target> (#3244)

Drop VS 2017 from CI (#3287)

Fixes: #3286

Drop ICC support in code (#3279)

* Drop ICC from code

Fixes: #3278

Co-authored-by: Michael Schellenberger Costa <miscco@nvidia.com>

Make CUB NVRTC commandline arguments come from a cmake template (#3292)

Propose the same components (thrust, cub, libc++, cudax, cuda.parallel,...) in the bug report template as in the feature request template (#3295)

Use process isolation instead of default hyper-v for Windows. (#3294)

Try improving build times by using process isolation instead of hyper-v

Co-authored-by: Michael Schellenberger Costa <miscco@nvidia.com>

[pre-commit.ci] pre-commit autoupdate (#3248)

* [pre-commit.ci] pre-commit autoupdate

updates:
- [github.com/pre-commit/mirrors-clang-format: v18.1.8 → v19.1.6](https://github.com/pre-commit/mirrors-clang-format/compare/v18.1.8...v19.1.6)
- [github.com/astral-sh/ruff-pre-commit: v0.8.3 → v0.8.6](https://github.com/astral-sh/ruff-pre-commit/compare/v0.8.3...v0.8.6)
- [github.com/pre-commit/mirrors-mypy: v1.13.0 → v1.14.1](https://github.com/pre-commit/mirrors-mypy/compare/v1.13.0...v1.14.1)

Co-authored-by: Michael Schellenberger Costa <miscco@nvidia.com>

Drop Thrust legacy arch macros (#3298)

Which were disabled and could be re-enabled using THRUST_PROVIDE_LEGACY_ARCH_MACROS

Drop Thrust's compiler_fence.h (#3300)

Drop CTK 11.x from CI (#3275)

* Add cuda12.0-gcc7 devcontainer
* Move MSVC2017 jobs to CTK 12.6
This is the only combination where rapidsai has devcontainers
* Add /Zc:__cplusplus for the libcudacxx tests
* Only add escape hatch for affected CTKs
* Workaround missing cudaLaunchKernelEx on MSVC
cudaLaunchKernelEx requires C++11, but unfortunately <cuda_runtime.h> checks this using the __cplusplus macro, which is reported wrongly for MSVC. CTK 12.3 fixed this by additionally detecting _MSV_VER. As a workaround, we provide our own copy of cudaLaunchKernelEx when it is not available from the CTK.
* Workaround nvcc+MSVC issue
* Regenerate devcontainers

Fixes: #3249

Co-authored-by: Michael Schellenberger Costa <miscco@nvidia.com>

Update packman and repo_docs versions (#3293)

Co-authored-by: Ashwin Srinath <shwina@users.noreply.github.com>

Drop Thrust's deprecated compiler macros (#3301)

Drop CUB_RUNTIME_ENABLED and __THRUST_HAS_CUDART__ (#3305)

Adds support for large number of items to `DevicePartition::If` with the `ThreeWayPartition` overload (#2506)

* adds support for large number of items to three-way partition

* adapts interface to use choose_signed_offset_t

* integrates applicable feedback from device-select pr

* changes behavior for empty problems

* unifies grid constant macro

* fixes kernel template specialization mismatch

* integrates _CCCL_GRID_CONSTANT changes

* resolve merge conflicts

* fixes checks in test

* fixes test verification

* improves tests

* makes few improvements to streaming dispatch

* improves code comment on test

* fixes unrelated compiler error

* minor style improvements

Refactor scan tunings (#3262)

Require C++17 for compiling Thrust and CUB (#3255)

* Issue an unsuppressable warning when compiling with < C++17
* Remove C++11/14 presets
* Remove CCCL_IGNORE_DEPRECATED_CPP_DIALECT from headers
* Remove [CUB|THRUST|TCT]_IGNORE_DEPRECATED_CPP_[11|14]
* Remove CUB_ENABLE_DIALECT_CPP[11|14]
* Update CI runs
* Remove C++11/14 CI runs for CUB and Thrust
* Raise compiler minimum versions for C++17
* Update ReadMe
* Drop Thrust's cpp14_required.h
* Add escape hatch for C++17 removal

Fixes: #3252

Implement `views::empty` (#3254)

* Disable pair conversion of subrange with clang in C++17

* Fix namespace views

* Implement `views::empty`

This implements `std::ranges::views::empty`, see https://en.cppreference.com/w/cpp/ranges/empty_view

Refactor `limits` and `climits` (#3221)

* implement builtins for huge val, nan and nans

* change `INFINITY` and `NAN` implementation for NVRTC

cuda.parallel: Add documentation for the current iterators along with examples and tests (#3311)

* Add tests demonstrating usage of different iterators

* Update documentation of reduce_into by merging import code snippet with the rest of the example

* Add documentation for current iterators

* Run pre-commit checks and update accordingly

* Fix comments to refer to the proper lines in the code snippets in the docs

Drop clang<14 from CI, update devcontainers. (#3309)

Co-authored-by: Bernhard Manfred Gruber <bernhardmgruber@gmail.com>

[STF] Cleanup task dependencies object constructors (#3291)

* Define tag types for access modes

* - Rework how we build task_dep objects based on access mode tags
- pack_state is now responsible for using a const_cast for read only data

* Greatly simplify the previous attempt : do not define new types, but use integral constants based on the enums

* It seems the const_cast was not necessary, so we can simplify it and not even do some dispatch based on access modes

Disable test with a gcc-14 regression (#3297)

Deprecate Thrust's cpp_compatibility.h macros (#3299)

Remove dropped function objects from docs (#3319)

Document `NV_TARGET` macros (#3313)

[STF] Define ctx.pick_stream() which was missing for the unified context (#3326)

* Define ctx.pick_stream() which was missing for the unified context

* clang-format

Deprecate cub::IterateThreadStore (#3337)

Drop CUB's BinaryFlip operator (#3332)

Deprecate cub::Swap (#3333)

Clarify transform output can overlap input (#3323)

Drop CUB APIs with a debug_synchronous parameter (#3330)

Fixes: #3329

Drop CUB's util_compiler.cuh for real (#3340)

PR #3302 planned to drop the file, but only dropped its content. This
was an oversight. So let's drop the entire file.

Drop cub::ValueCache (#3346)

limits offset types for merge sort (#3328)

Drop CDPv1 (#3344)

Fixes: #3341

Drop thrust::void_t (#3362)

Use cuda::std::addressof in Thrust (#3363)

Fix all_of documentation for empty ranges (#3358)

all_of always returns true on an empty range.
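Python's built-in `all` follows the same vacuous-truth convention, which makes a quick sanity check:

```python
def all_of(iterable, pred):
    """Mirror of thrust::all_of's semantics over a Python iterable."""
    return all(pred(x) for x in iterable)

# On an empty range the predicate is never consulted, so the answer is
# True even for an always-false predicate ("for all x in {}: P(x)").
```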

[STF] Do not keep track of dangling events in a CUDA graph backend (#3327)

* Unlike the CUDA stream backend, nodes in a CUDA graph are necessarily done when
the CUDA graph completes. Therefore keeping track of "dangling events" is a
waste of time and resources.

* replace can_ignore_dangling_events with track_dangling_events, which leads to more readable code

* When not storing the dangling events, we must still perform the deinit operations that were producing these events!

Extract scan kernels into NVRTC-compilable header (#3334)

* Extract scan kernels into NVRTC-compilable header

* Update cub/cub/device/dispatch/dispatch_scan.cuh

Co-authored-by: Georgii Evtushenko <evtushenko.georgy@gmail.com>

---------

Co-authored-by: Ashwin Srinath <shwina@users.noreply.github.com>
Co-authored-by: Georgii Evtushenko <evtushenko.georgy@gmail.com>

Drop deprecated aliases in Thrust functional (#3272)

Fixes: #3271

Drop cub::DivideAndRoundUp (#3347)

Use cuda::std::min/max in Thrust (#3364)

Implement `cuda::std::numeric_limits` for `__half` and `__nv_bfloat16` (#3361)

* implement `cuda::std::numeric_limits` for `__half` and `__nv_bfloat16`

Cleanup util_arch (#2773)

Deprecate thrust::null_type (#3367)

Deprecate cub::DeviceSpmv (#3320)

Fixes: #896

Improves `DeviceSegmentedSort` test run time for large number of items and segments (#3246)

* fixes segment offset generation

* switches to analytical verification

* switches to analytical verification for pairs

* fixes spelling

* adds tests for large number of segments

* fixes narrowing conversion in tests

* addresses review comments

* fixes includes

Compile basic infra test with C++17 (#3377)

Adds support for large number of items and large number of segments to `DeviceSegmentedSort` (#3308)

* fixes segment offset generation

* switches to analytical verification

* switches to analytical verification for pairs

* addresses review comments

* introduces segment offset type

* adds tests for large number of segments

* adds support for large number of segments

* drops segment offset type

* fixes thrust namespace

* removes about-to-be-deprecated cub iterators

* no exec specifier on defaulted ctor

* fixes gcc7 linker error

* uses local_segment_index_t throughout

* determine offset type based on type returned by segment iterator begin/end iterators

* minor style improvements

Exit with error when RAPIDS CI fails. (#3385)

cuda.parallel: Support structured types as algorithm inputs (#3218)

* Introduce gpu_struct decorator and typing

* Enable `reduce` to accept arrays of structs as inputs

* Add test for reducing arrays-of-struct

* Update documentation

* Use a numpy array rather than ctypes object

* Change zeros -> empty for output array and temp storage

* Add a TODO for typing GpuStruct

* Documentation updates

* Remove test_reduce_struct_type from test_reduce.py

* Revert to `to_cccl_value()` accepting ndarray + GpuStruct

* Bump copyrights

---------

Co-authored-by: Ashwin Srinath <shwina@users.noreply.github.com>

Deprecate thrust::async (#3324)

Fixes: #100

Review/Deprecate CUB `util.ptx` for CCCL 2.x (#3342)

Fix broken `_CCCL_BUILTIN_ASSUME` macro (#3314)

* add compiler-specific path
* fix device code path
* add _CCCL_ASSUME

Deprecate thrust::numeric_limits (#3366)

Replace `typedef` with `using` in libcu++ (#3368)

Deprecate thrust::optional (#3307)

Fixes: #3306

Upgrade to Catch2 3.8  (#3310)

Fixes: #1724

refactor `<cuda/std/cstdint>` (#3325)

Co-authored-by: Bernhard Manfred Gruber <bernhardmgruber@gmail.com>

Update CODEOWNERS (#3331)

* Update CODEOWNERS

* Update CODEOWNERS

* Update CODEOWNERS

* [pre-commit.ci] auto code formatting

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

Fix sign-compare warning (#3408)

Implement more cmath functions to be usable on host and device (#3382)

* Implement more cmath functions to be usable on host and device

* Implement math roots functions

* Implement exponential functions

Redefine and deprecate thrust::remove_cvref (#3394)

* Redefine and deprecate thrust::remove_cvref

Co-authored-by: Michael Schellenberger Costa <miscco@nvidia.com>

Fix assert definition for NVHPC due to constexpr issues (#3418)

NVHPC cannot decide at compile time where the code would run so _CCCL_ASSERT within a constexpr function breaks it.

Fix this by always using the host definition which should also work on device.

Fixes #3411

Extend CUB reduce benchmarks (#3401)

* Rename max.cu to custom.cu, since it uses a custom operator
* Extend types covered my min.cu to all fundamental types
* Add some notes on how to collect tuning parameters

Fixes: #3283

Update upload-pages-artifact to v3 (#3423)

* Update upload-pages-artifact to v3

* Empty commit

---------

Co-authored-by: Ashwin Srinath <shwina@users.noreply.github.com>

Replace and deprecate thrust::cuda_cub::terminate (#3421)

`std::linalg` accessors and `transposed_layout` (#2962)

Add round up/down to multiple (#3234)

[FEA]: Introduce Python module with CCCL headers (#3201)

* Add cccl/python/cuda_cccl directory and use from cuda_parallel, cuda_cooperative

* Run `copy_cccl_headers_to_aude_include()` before `setup()`

* Create python/cuda_cccl/cuda/_include/__init__.py, then simply import cuda._include to find the include path.

* Add cuda.cccl._version exactly as for cuda.cooperative and cuda.parallel

* Bug fix: cuda/_include only exists after shutil.copytree() ran.

* Use `f"cuda-cccl @ file://{cccl_path}/python/cuda_cccl"` in setup.py

* Remove CustomBuildCommand, CustomWheelBuild in cuda_parallel/setup.py (they are equivalent to the default functions)

* Replace := operator (needs Python 3.8+)

* Fix oversights: remove `pip3 install ./cuda_cccl` lines from README.md

* Restore original README.md: `pip3 install -e` now works on first pass.

* cuda_cccl/README.md: FOR INTERNAL USE ONLY

* Remove `$pymajor.$pyminor.` prefix in cuda_cccl _version.py (as suggested under https://github.com/NVIDIA/cccl/pull/3201#discussion_r1894035917)

Command used: ci/update_version.sh 2 8 0

* Modernize pyproject.toml, setup.py

Trigger for this change:

* https://github.com/NVIDIA/cccl/pull/3201#discussion_r1894043178

* https://github.com/NVIDIA/cccl/pull/3201#discussion_r1894044996

* Install CCCL headers under cuda.cccl.include

Trigger for this change:

* https://github.com/NVIDIA/cccl/pull/3201#discussion_r1894048562

Accidental discovery: the cuda.cooperative unit tests pass entirely without CCCL headers.

* Factor out cuda_cccl/cuda/cccl/include_paths.py

* Reuse cuda_cccl/cuda/cccl/include_paths.py from cuda_cooperative

* Add missing Copyright notice.

* Add missing __init__.py (cuda.cccl)

* Add `"cuda.cccl"` to `autodoc.mock_imports`

* Move cuda.cccl.include_paths into function where it is used. (Attempt to resolve Build and Verify Docs failure.)

* Add # TODO: move this to a module-level import

* Modernize cuda_cooperative/pyproject.toml, setup.py

* Convert cuda_cooperative to use hatchling as build backend.

* Revert "Convert cuda_cooperative to use hatchling as build backend."

This reverts commit 61637d608da06fcf6851ef6197f88b5e7dbc3bbe.

* Move numpy from [build-system] requires -> [project] dependencies

* Move pyproject.toml [project] dependencies -> setup.py install_requires, to be able to use CCCL_PATH

* Remove copy_license() and use license_files=["../../LICENSE"] instead.

* Further modernize cuda_cccl/setup.py to use pathlib

* Trivial simplifications in cuda_cccl/pyproject.toml

* Further simplify cuda_cccl/pyproject.toml, setup.py: remove inconsequential code

* Make cuda_cooperative/pyproject.toml more similar to cuda_cccl/pyproject.toml

* Add taplo-pre-commit to .pre-commit-config.yaml

* taplo-pre-commit auto-fixes

* Use pathlib in cuda_cooperative/setup.py

* CCCL_PYTHON_PATH in cuda_cooperative/setup.py

* Modernize cuda_parallel/pyproject.toml, setup.py

* Use pathlib in cuda_parallel/setup.py

* Add `# TOML lint & format` comment.

* Replace MANIFEST.in with `[tool.setuptools.package-data]` section in pyproject.toml
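For reference, such a replacement typically looks like the fragment below (the package name and globs are illustrative, not the exact ones used in this repository):

```toml
# pyproject.toml -- replaces MANIFEST.in for shipping data files
[tool.setuptools.package-data]
"cuda.cccl.include" = ["**/*.h", "**/*.cuh", "**/*.hpp"]
```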

* Use pathlib in cuda/cccl/include_paths.py

* pre-commit autoupdate (EXCEPT clang-format, which was manually restored)

* Fixes after git merge main

* Resolve warning: AttributeError: '_Reduce' object has no attribute 'build_result'

```
=========================================================================== warnings summary ===========================================================================
tests/test_reduce.py::test_reduce_non_contiguous
  /home/coder/cccl/python/devenv/lib/python3.12/site-packages/_pytest/unraisableexception.py:85: PytestUnraisableExceptionWarning: Exception ignored in: <function _Reduce.__del__ at 0x7bf123139080>

  Traceback (most recent call last):
    File "/home/coder/cccl/python/cuda_parallel/cuda/parallel/experimental/algorithms/reduce.py", line 132, in __del__
      bindings.cccl_device_reduce_cleanup(ctypes.byref(self.build_result))
                                                       ^^^^^^^^^^^^^^^^^
  AttributeError: '_Reduce' object has no attribute 'build_result'

    warnings.warn(pytest.PytestUnraisableExceptionWarning(msg))

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
============================================================= 1 passed, 93 deselected, 1 warning in 0.44s ==============================================================
```

* Move `copy_cccl_headers_to_cuda_cccl_include()` functionality to `class CustomBuildPy`

* Introduce cuda_cooperative/constraints.txt

* Also add cuda_parallel/constraints.txt

* Add `--constraint constraints.txt` in ci/test_python.sh

* Update Copyright dates

* Switch to https://github.com/ComPWA/taplo-pre-commit (the other repo was archived by its owner on Jul 1, 2024)

For completeness: the other repo took a long time to install into the pre-commit cache; so long that it led to timeouts in the CCCL CI.

* Remove unused cuda_parallel jinja2 dependency (noticed by chance).

* Remove constraints.txt files, advertise running `pip install cuda-cccl` first instead.

* Make cuda_cooperative, cuda_parallel testing completely independent.

* Run only test_python.sh [skip-rapids][skip-matx][skip-docs][skip-vdc]

* Try using another runner (because V100 runners seem to be stuck) [skip-rapids][skip-matx][skip-docs][skip-vdc]

* Fix sign-compare warning (#3408) [skip-rapids][skip-matx][skip-docs][skip-vdc]

* Revert "Try using another runner (because V100 runners seem to be stuck) [skip-rapids][skip-matx][skip-docs][skip-vdc]"

This reverts commit ea33a218ed77a075156cd1b332047202adb25aa2.

Error message: https://github.com/NVIDIA/cccl/pull/3201#issuecomment-2594012971

* Try using A100 runner (because V100 runners still seem to be stuck) [skip-rapids][skip-matx][skip-docs][skip-vdc]

* Also show cuda-cooperative site-packages, cuda-parallel site-packages (after pip install) [skip-rapids][skip-matx][skip-docs][skip-vdc]

* Try using l4 runner (because V100 runners still seem to be stuck) [skip-rapids][skip-matx][skip-docs][skip-vdc]

* Restore original ci/matrix.yaml [skip-rapids]

* Use for loop in test_python.sh to avoid code duplication.

* Run only test_python.sh [skip-rapids][skip-matx][skip-docs][skip-vdc][skip pre-commit.ci]

* Comment out taplo-lint in pre-commit config [skip-rapids][skip-matx][skip-docs][skip-vdc]

* Revert "Run only test_python.sh [skip-rapids][skip-matx][skip-docs][skip-vdc][skip pre-commit.ci]"

This reverts commit ec206fd8b50a6a293e00a5825b579e125010b13d.

* Implement suggestion by @shwina (https://github.com/NVIDIA/cccl/pull/3201#pullrequestreview-2556918460)

* Address feedback by @leofang

---------

Co-authored-by: Bernhard Manfred Gruber <bernhardmgruber@gmail.com>

cuda.parallel: Add optional stream argument to reduce_into() (#3348)

* Add optional stream argument to reduce_into()

* Add tests to check for reduce_into() stream behavior

* Move protocol related utils to separate file and rework __cuda_stream__ error messages

* Fix synchronization issue in stream test and add one more invalid stream test case

* Rename cuda stream validation function after removing leading underscore

* Unpack values from __cuda_stream__ instead of indexing

* Fix linting errors

* Handle TypeError when unpacking invalid __cuda_stream__ return

* Use stream to allocate cupy memory in new stream test
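The validation described in the last few commits can be sketched roughly like this (a hedged reconstruction; the real helper in cuda.parallel lives in a separate protocol-utils module and its error messages differ):

```python
def validate_and_get_stream(stream) -> int:
    """Extract the raw handle from an object implementing the
    __cuda_stream__ protocol, which returns a (version, handle) pair."""
    try:
        cuda_stream = stream.__cuda_stream__
    except AttributeError:
        raise TypeError(
            f"{type(stream).__name__} does not implement __cuda_stream__"
        ) from None
    try:
        # Unpack rather than index; a return value that is not a
        # two-element iterable raises TypeError or ValueError here.
        version, handle = cuda_stream()
    except (TypeError, ValueError):
        raise TypeError(
            "__cuda_stream__ must return a (version, handle) pair"
        ) from None
    if version != 0:
        raise TypeError(f"unsupported __cuda_stream__ version: {version}")
    return handle
```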

Upgrade to actions/deploy-pages@v4 (from v2), as suggested by @leofang (#3434)

Deprecate `cub::{min, max}` and replace internal uses with those from libcu++ (#3419)

* Deprecate `cub::{min, max}` and replace internal uses with those from libcu++

Fixes #3404

Fix CI issues (#3443)

update docs

fix review

restrict allowed types

replace constexpr implementations with generic

optimize `__is_arithmetic_integral`
Successfully merging this pull request may close these issues: Reduce the number of offset types for DeviceMergeSort.