[Relax][Pass] Lowering passes for GPU IPC memory and allreduce #16759

Merged: tqchen merged 1 commit into apache:main on Mar 21, 2024
Conversation
Force-pushed 0af4ac4 to 0211a3a

tqchen approved these changes on Mar 21, 2024
This PR introduces the lowering passes for GPU IPC memory and all-reduce. It contains the following changes:

1. A pass `IPCAllreduceRewrite` that rewrites `"runtime.disco.allreduce"` to `"runtime.disco.cuda_ipc.custom_allreduce"` and accordingly rewrites the storage scope of the all-reduce inputs from `"global"` to `"ipc_memory"`.
2. A memory planning enhancement that makes planning aware of storage scopes, so that each storage scope is planned independently.
3. A pass `LowerGPUIPCAllocStorage` that rewrites the storage allocation of IPC memory from builtin ops to calls to the function `"runtime.disco.cuda_ipc.alloc_storage"`.
4. Support for the op `relax.builtin.alloc_tensor` with a storage scope. The default storage scope is `"global"`.

We wrote the new passes in Python for experimentation and fast development. They are good demos of the efficient development that the architecture enabled by TVM makes possible.
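The core of the `IPCAllreduceRewrite` transformation can be illustrated on a toy IR. The sketch below is not TVM code: the `Buffer` and `Call` records and the `rewrite_allreduce` helper are hypothetical stand-ins for the Relax mutator, showing only how the op name and the input storage scopes are rewritten together.

```python
from dataclasses import dataclass

@dataclass
class Buffer:
    # Hypothetical stand-in for a Relax storage object with a scope.
    name: str
    scope: str = "global"

@dataclass
class Call:
    # Hypothetical stand-in for a Relax call node.
    op: str
    args: list

def rewrite_allreduce(calls):
    """Mimic IPCAllreduceRewrite on the toy IR: retarget the allreduce
    op and move its input buffers from "global" to "ipc_memory" scope."""
    for call in calls:
        if call.op == "runtime.disco.allreduce":
            call.op = "runtime.disco.cuda_ipc.custom_allreduce"
            for buf in call.args:
                if buf.scope == "global":
                    buf.scope = "ipc_memory"
    return calls

x = Buffer("x")
prog = rewrite_allreduce([Call("runtime.disco.allreduce", [x])])
print(prog[0].op)           # runtime.disco.cuda_ipc.custom_allreduce
print(prog[0].args[0].scope)  # ipc_memory
```

After this rewrite, the scope-aware memory planner (change 2 above) would place the `"ipc_memory"` allocations in their own plan, and `LowerGPUIPCAllocStorage` would lower them to `"runtime.disco.cuda_ipc.alloc_storage"` calls.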
Force-pushed 0211a3a to 3b8183f

tqchen approved these changes on Mar 21, 2024
thaisacs pushed a commit to thaisacs/tvm that referenced this pull request on Apr 3, 2024