Add complex64 and complex128 gpu support for tensor_scatter_nd_add #40585
Merged
tensorflow-copybara merged 1 commit on Jul 8, 2020
Conversation
This PR adds complex64 and complex128 GPU support for tensor_scatter_nd_add, as requested in #40577.

This PR fixes #40577.

Signed-off-by: Yong Tang <yong.tang.github@outlook.com>
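For readers unfamiliar with the op, here is a minimal pure-Python sketch of what a scatter-add does on complex values. It is a reference for the semantics only, not the GPU kernel this PR touches; the helper name `scatter_nd_add_1d` is hypothetical and only rank-1 tensors are handled.

```python
def scatter_nd_add_1d(tensor, indices, updates):
    """Reference semantics for a rank-1 scatter-add (hypothetical helper).

    `indices` has shape [num_updates, 1], mirroring the layout that
    tensor_scatter_nd_add expects for a rank-1 tensor. Duplicate indices
    accumulate, and the input tensor is left unmodified.
    """
    result = list(tensor)  # work on a copy so the input stays intact
    for (i,), update in zip(indices, updates):
        result[i] += update
    return result


# Python's built-in complex type stands in for complex64/complex128 here.
tensor = [complex(1, 1), complex(2, 2), complex(3, 3)]
indices = [[0], [2], [0]]  # index 0 appears twice, so both updates are summed
updates = [complex(10, -1), complex(0, 5), complex(1, 1)]

result = scatter_nd_add_1d(tensor, indices, updates)
print(result)
```

With TensorFlow installed, the corresponding call would be `tf.tensor_scatter_nd_add(tensor, indices, updates)` on tensors of dtype `tf.complex64` or `tf.complex128`.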
jaingaurav (Contributor) reviewed on Jul 3, 2020 and left a comment:
Is there a test case that covers this already?
jaingaurav approved these changes on Jul 6, 2020
copybara-service Bot pushed a commit that referenced this pull request on Apr 14, 2026

…to PreparedTransfer

Imported from GitHub PR openxla/xla#40585

📝 Summary of Changes
This PR is the first in a sequence of PRs that will refactor cross-host data transfer implementations to eventually rely on a shared helper function `CrossHostTransferBuffers`. `CrossHostTransferBuffers` is planned to eventually be integrated into the PJRT APIs to enable receiving data into preallocated receive buffers (this feature is being planned in collaboration with @gspschmid, @emilyfertig, and @pschuh).

As a first step, this PR unifies `Prepared{Send,Receive}` structs into a single `PreparedTransfer` struct. This PR also moves waiting for dependency events out of NCCL group calls since the benefit of a NCCL group section is to aggregate the collectives launched inside of it (unrelated to waiting on dependency events).

🎯 Justification
It is difficult to achieve good comm/compute overlap with cross-host data transfers as the current implementation always allocates receive-buffers 'just-in-time', and because the GPU memory allocator blocks on the compute stream. `CrossHostTransferBuffers` will enable users to receive into preallocated receive buffers, making it easier to avoid the allocator blocking issue. This PR is a first step towards implementing `CrossHostTransferBuffers`.

🚀 Kind of Contribution
♻️ Cleanup (eventually ✨ New Feature)

🧪 Unit Tests: This PR only refactors the implementation of `CrossHost{Send/Receive}Buffers`, so the pre-existing unit tests for those methods already test this PR.

🧪 Execution Tests: Verified that [these 4 correctness tests](https://gist.github.com/rao-ashish/24ac0df0cb18243c649ac535964b31b8) continue to pass.

Copybara import of the project:

-- 8d21ba239d7f46fabaf864c48b52d1cb655d0f10 by Ashish Rao <asrao@nvidia.com>:
Unify Prepared{Send,Receive} into PreparedTransfer

Merging this change closes #40585

FUTURE_COPYBARA_INTEGRATE_REVIEW=openxla/xla#40585 from rao-ashish:asrao/cross_host_refactor_v2_1 8d21ba239d7f46fabaf864c48b52d1cb655d0f10
PiperOrigin-RevId: 899628303
copybara-service Bot pushed a commit that referenced this pull request on Apr 14, 2026

…to PreparedTransfer

Imported from GitHub PR openxla/xla#40585

📝 Summary of Changes
This PR is the first in a sequence of PRs that will refactor cross-host data transfer implementations to eventually rely on a shared helper function `CrossHostTransferBuffers`. `CrossHostTransferBuffers` is planned to eventually be integrated into the PJRT APIs to enable receiving data into preallocated receive buffers (this feature is being planned in collaboration with @gspschmid, @emilyfertig, and @pschuh).

As a first step, this PR unifies `Prepared{Send,Receive}` structs into a single `PreparedTransfer` struct. This PR also moves waiting for dependency events out of NCCL group calls since the benefit of a NCCL group section is to aggregate the collectives launched inside of it (unrelated to waiting on dependency events).

🎯 Justification
It is difficult to achieve good comm/compute overlap with cross-host data transfers as the current implementation always allocates receive-buffers 'just-in-time', and because the GPU memory allocator blocks on the compute stream. `CrossHostTransferBuffers` will enable users to receive into preallocated receive buffers, making it easier to avoid the allocator blocking issue. This PR is a first step towards implementing `CrossHostTransferBuffers`.

🚀 Kind of Contribution
♻️ Cleanup (eventually ✨ New Feature)

🧪 Unit Tests: This PR only refactors the implementation of `CrossHost{Send/Receive}Buffers`, so the pre-existing unit tests for those methods already test this PR.

🧪 Execution Tests: Verified that [these 4 correctness tests](https://gist.github.com/rao-ashish/24ac0df0cb18243c649ac535964b31b8) continue to pass.

Copybara import of the project:

-- 8d21ba239d7f46fabaf864c48b52d1cb655d0f10 by Ashish Rao <asrao@nvidia.com>:
Unify Prepared{Send,Receive} into PreparedTransfer

Merging this change closes #40585

PiperOrigin-RevId: 899784485
copybara-service Bot pushed a commit that referenced this pull request on Apr 15, 2026

…vice into ScheduleTransfersOnLocalDevice

Imported from GitHub PR openxla/xla#40919

📝 Summary of Changes
This PR builds on [XLA #40585](openxla/xla#40585), and is the next in a sequence of PRs that will refactor cross-host data transfer implementations to eventually rely on a shared helper function `CrossHostTransferBuffers`. `CrossHostTransferBuffers` is planned to eventually be integrated into the PJRT APIs to enable receiving data into preallocated receive buffers (this feature is being planned in collaboration with @gspschmid, @emilyfertig, and @pschuh).

As part of implementing `CrossHostTransferBuffers`, this PR introduces a `CrossHostTransferSpec` struct and refactors `ScheduleSendsOnLocalDevice` into a more general `ScheduleTransfersOnLocalDevice`.

This PR also cleans up some of the error handling around the `prepare_transfers` closure inside `ScheduleTransfersOnLocalDevice` (formerly the `setup_sends` closure inside `ScheduleSendsOnLocalDevice`). Previously, if we got an error when we tried to extract the `LocalDeviceState`, we failed to set the `transfer_event` as an error. The current changes make sure that the error from `prepare_transfers` is always propagated through the transfer event.

🎯 Justification
It is difficult to achieve good comm/compute overlap with cross-host data transfers as the current implementation always allocates receive-buffers 'just-in-time', and because the GPU memory allocator blocks on the compute stream. `CrossHostTransferBuffers` will enable users to receive into preallocated receive buffers, making it easier to avoid the allocator blocking issue. This PR is a step towards implementing `CrossHostTransferBuffers`.

🚀 Kind of Contribution
♻️ Cleanup (eventually ✨ New Feature)

🧪 Unit Tests: This PR only refactors the implementation of `CrossHost{Send/Receive}Buffers`, so the pre-existing unit tests for those methods already test this PR.

🧪 Execution Tests: Verified that [these 4 correctness tests](https://gist.github.com/rao-ashish/24ac0df0cb18243c649ac535964b31b8) continue to pass.

Copybara import of the project:

-- fcf2fa1139441fda4b3ce339bc1c222f93ae7023 by Ashish Rao <asrao@nvidia.com>:
Refactor ScheduleSendsOnLocalDevice into more generic ScheduleTransfersOnLocalDevice

-- 426c90fd3ef312eb161b6a9575383ed6c733ec2e by Ashish Rao <asrao@nvidia.com>:
Add comment clarifying rank ids; add const keywords

Merging this change closes #40919

FUTURE_COPYBARA_INTEGRATE_REVIEW=openxla/xla#40919 from rao-ashish:asrao/cross_host_refactor_v2_2 426c90fd3ef312eb161b6a9575383ed6c733ec2e
PiperOrigin-RevId: 900247155
copybara-service Bot pushed a commit that referenced this pull request on Apr 15, 2026

…vice into ScheduleTransfersOnLocalDevice

Imported from GitHub PR openxla/xla#40919

📝 Summary of Changes
This PR builds on [XLA #40585](openxla/xla#40585), and is the next in a sequence of PRs that will refactor cross-host data transfer implementations to eventually rely on a shared helper function `CrossHostTransferBuffers`. `CrossHostTransferBuffers` is planned to eventually be integrated into the PJRT APIs to enable receiving data into preallocated receive buffers (this feature is being planned in collaboration with @gspschmid, @emilyfertig, and @pschuh).

As part of implementing `CrossHostTransferBuffers`, this PR introduces a `CrossHostTransferSpec` struct and refactors `ScheduleSendsOnLocalDevice` into a more general `ScheduleTransfersOnLocalDevice`.

This PR also cleans up some of the error handling around the `prepare_transfers` closure inside `ScheduleTransfersOnLocalDevice` (formerly the `setup_sends` closure inside `ScheduleSendsOnLocalDevice`). Previously, if we got an error when we tried to extract the `LocalDeviceState`, we failed to set the `transfer_event` as an error. The current changes make sure that the error from `prepare_transfers` is always propagated through the transfer event.

🎯 Justification
It is difficult to achieve good comm/compute overlap with cross-host data transfers as the current implementation always allocates receive-buffers 'just-in-time', and because the GPU memory allocator blocks on the compute stream. `CrossHostTransferBuffers` will enable users to receive into preallocated receive buffers, making it easier to avoid the allocator blocking issue. This PR is a step towards implementing `CrossHostTransferBuffers`.

🚀 Kind of Contribution
♻️ Cleanup (eventually ✨ New Feature)

🧪 Unit Tests: This PR only refactors the implementation of `CrossHost{Send/Receive}Buffers`, so the pre-existing unit tests for those methods already test this PR.

🧪 Execution Tests: Verified that [these 4 correctness tests](https://gist.github.com/rao-ashish/24ac0df0cb18243c649ac535964b31b8) continue to pass.

Copybara import of the project:

-- fcf2fa1139441fda4b3ce339bc1c222f93ae7023 by Ashish Rao <asrao@nvidia.com>:
Refactor ScheduleSendsOnLocalDevice into more generic ScheduleTransfersOnLocalDevice

-- 426c90fd3ef312eb161b6a9575383ed6c733ec2e by Ashish Rao <asrao@nvidia.com>:
Add comment clarifying rank ids; add const keywords

Merging this change closes #40919

PiperOrigin-RevId: 900297557