Commit 1b0a2ba
Update cmake_cuda_architecture to control package size (#23671)
### Description
Action items:
* ~~Add LTO support when CUDA 12.8 and Relocatable Device Code (RDC)/separate compilation are enabled, to reduce a potential perf regression.~~ LTO needs further testing.
* Reduce nuget/whl package size by selecting target devices and their CUDA binary (SASS)/PTX assembly at ORT build time:
  * Make sure the ORT nuget package is < 250 MB and the python wheel is < 300 MB.
  * Suggest creating an internal repo to publish a pre-built package with Blackwell sm100/120 SASS and sm120 PTX, e.g. [onnxruntime-blackwell](https://aiinfra.visualstudio.com/PublicPackages/_artifacts/feed/onnxruntime-blackwell), since that package's size will be much larger than the nuget/pypi limits.
* Considering the most popular datacenter/consumer GPUs, the cuda_arch lists for Linux/Windows are given in the table below.
* With this change, perf of the next ORT release is optimal on Linux with Tesla P100 (sm60), V100 (sm70), T4 (sm75), A100 (sm80), A10 (sm86, py whl), and H100 (sm90); on Windows with GTX 980 (sm52), GTX 1080 (sm61), RTX 2080 (sm75), RTX 3090 (sm86), and RTX 4090 (sm89). GPUs of other, newer architectures remain compatible.
| OS | cmake_cuda_architecture | package size |
| --- | --- | --- |
| Linux nupkg | 60-real;70-real;75-real;80-real;90 | 215 MB |
| Linux whl | 60-real;70-real;75-real;80-real;86-real;90 | 268 MB |
| Windows nupkg | 52-real;61-real;75-real;86-real;89-real;90-virtual | 197 MB |
| Windows whl | 52-real;61-real;75-real;86-real;89-real;90-virtual | 204 MB |
* [TODO] Validate on the Windows CUDA CI pipeline with cu128.
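For readers unfamiliar with the suffixes: in `CMAKE_CUDA_ARCHITECTURES`, `XX-real` emits only SASS (native binary) for `sm_XX`, `XX-virtual` emits only PTX for `compute_XX`, and a bare `XX` emits both. The hypothetical helper below (`how_it_runs` is not part of ORT, purely an illustration) sketches, under NVIDIA's documented compatibility rules, how a given GPU would run a package built with one of the lists above:

```python
# Hypothetical helper (not part of ORT): decide how a GPU of a given
# compute capability runs a package built with a CMAKE_CUDA_ARCHITECTURES
# list such as "52-real;61-real;75-real;86-real;89-real;90-virtual".
#
# Rules assumed (per NVIDIA's compatibility docs):
#   * "XX-real"    -> SASS only; binary-compatible with GPUs of the same
#                     major compute capability and minor >= XX's minor.
#   * "XX-virtual" -> PTX only; JIT-compiles on any GPU with cc >= XX.
#   * "XX"         -> both SASS and PTX.

def how_it_runs(arch_list: str, device_cc: int) -> str:
    sass, ptx = [], []
    for entry in arch_list.split(";"):
        if entry.endswith("-real"):
            sass.append(int(entry[: -len("-real")]))
        elif entry.endswith("-virtual"):
            ptx.append(int(entry[: -len("-virtual")]))
        else:  # bare arch emits both SASS and PTX
            sass.append(int(entry))
            ptx.append(int(entry))
    # Native SASS: same major version, device minor >= compiled minor.
    if any(cc // 10 == device_cc // 10 and device_cc >= cc for cc in sass):
        return "native SASS"
    # Otherwise fall back to JIT-compiling a usable PTX stage.
    if any(device_cc >= cc for cc in ptx):
        return "PTX JIT"
    return "unsupported"

win_whl = "52-real;61-real;75-real;86-real;89-real;90-virtual"
print(how_it_runs(win_whl, 89))   # RTX 4090
print(how_it_runs(win_whl, 120))  # Blackwell RTX
print(how_it_runs(win_whl, 50))   # older than any listed arch
```

This is why the Windows lists end in `90-virtual`: newer GPUs without matching SASS can still JIT the `compute_90` PTX, at the cost of a one-time JIT compile on first run.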
### Motivation and Context
Addresses the topics discussed in #23562 and #23309.
#### Stats
| libonnxruntime_providers_cuda lib size | Main 75;80;90 | 75-real;80-real;90-virtual | 75-real;80;90-virtual | 75-real;80-real;86-virtual;89-virtual;90-virtual | 75-real;86-real;89 | 75-real;80;90 | 75-real;80-real;90 | 61-real;75-real;86-real;89 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Linux | 446 MB | 241 MB | 362 MB | 482 MB | N/A | 422 MB | 301 MB | |
| Windows | 417 MB | 224 MB | 338 MB | 450 MB | 279 MB | N/A | | 292 MB |

| nupkg size | Main 75;80;90 | 75-real;80-real;90-virtual | 75-real;80;90-virtual | 75-real;80-real;86-virtual;89-virtual;90-virtual | 75-real;86-real;89 | 75-real;80;90 | 75-real;80-real;90 | 61-real;75-real;86-real;89 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Linux | 287 MB | TBD | 224 MB | 299 MB | | | 197 MB | N/A |
| Windows | 264 MB | TBD | 205 MB | 274 MB | | | N/A | 188 MB |

| whl size | Main 75;80;90 | 75-real;80-real;90-virtual | 75-real;80;90-virtual | 75-real;80-real;86-virtual;89-virtual;90-virtual | 75-real;86-real;89 | 75-real;80;90 | 75-real;80-real;90 | 61-real;75-real;86-real;89 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Linux | 294 MB | 154 MB | TBD | TBD | N/A | 278 MB | 203 MB | N/A |
| Windows | 271 MB | 142 MB | TBD | 280 MB | 184 MB | N/A | N/A | 194 MB |
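As a quick sanity check on the lib-size table above, the headline reduction from main's `75;80;90` build to `75-real;80-real;90-virtual` can be computed directly (all numbers copied from the table):

```python
# Arithmetic over the lib-size table: reduction of
# libonnxruntime_providers_cuda relative to main's 75;80;90 build.
main_size = {"Linux": 446, "Windows": 417}  # MB, main 75;80;90
new_size = {"Linux": 241, "Windows": 224}   # MB, 75-real;80-real;90-virtual

for os_name in main_size:
    cut = 100 * (1 - new_size[os_name] / main_size[os_name])
    print(f"{os_name}: {cut:.0f}% smaller")
```

Dropping the embedded PTX for sm75/sm80 (the `-real` suffix) accounts for most of the roughly 46% saving on both platforms.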
### Reference
* https://developer.nvidia.com/cuda-gpus
* [Improving GPU Application Performance with NVIDIA CUDA 11.2 Device Link Time Optimization](https://developer.nvidia.com/blog/improving-gpu-app-performance-with-cuda-11-2-device-lto/)
* [PTX Compatibility](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#ptx-compatibility)
* [Application Compatibility on the NVIDIA Ada GPU Architecture](https://docs.nvidia.com/cuda/ada-compatibility-guide/#application-compatibility-on-the-nvidia-ada-gpu-architecture)
* [Software Migration Guide for NVIDIA Blackwell RTX GPUs: A Guide to CUDA 12.8, PyTorch, TensorRT, and Llama.cpp](https://forums.developer.nvidia.com/t/software-migration-guide-for-nvidia-blackwell-rtx-gpus-a-guide-to-cuda-12-8-pytorch-tensorrt-and-llama-cpp/321330)
### Failed/unfinished experiments to control package size
1. Building ORT with `CUDNN_FRONTEND_SKIP_JSON_LIB=ON` does not help much with package size.
2. ORT packaging uses 7z to pack the package, which for nuget/pypi can only use zip's deflate compression. In that format, setting the compression ratio to ultra (`-mx=9`) does not help much either; 7z's LZMA compression is much better, but is not supported by nuget/pypi.
3. Simply replacing `sm_xx` with `lto_xx` increases the CUDA EP library size by ~50% (perf not tested yet). This needs further validation.
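Experiment 2 can be reproduced in miniature with the Python standard library: compress the same payload with zip's deflate (what nuget/pypi packages must use) and with LZMA (what 7z would prefer). The payload here is synthetic, purely for illustration:

```python
# Sketch of experiment 2: zip's deflate vs 7z-style LZMA on the same
# compressible payload, using only the standard library.
import lzma
import zlib

# Synthetic, repetitive payload standing in for binary kernel data.
payload = b"onnxruntime-cuda-kernel-" * 100_000  # ~2.4 MB

deflate_len = len(zlib.compress(payload, level=9))  # deflate at max level
lzma_len = len(lzma.compress(payload, preset=9))    # LZMA at max preset

print(f"deflate: {deflate_len} bytes, LZMA: {lzma_len} bytes")
```

Deflate's small (32 KB) window limits how much redundancy it can exploit, so LZMA comes out well ahead; but since nuget/pypi only accept zip/deflate archives, that advantage cannot be used for the shipped packages.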
5 files changed (6 additions, 6 deletions) under tools/ci_build/github.