Skip to content

Add distributed Scatter/Gather #2113

Merged
bvanessen merged 31 commits into
LBANN:developfrom
szaman19:nvhsmem-scatter-gather
Feb 1, 2023
Merged

Add distributed Scatter/Gather #2113
bvanessen merged 31 commits into
LBANN:developfrom
szaman19:nvhsmem-scatter-gather

Conversation

@szaman19
Copy link
Copy Markdown
Collaborator

@szaman19 szaman19 commented Jun 7, 2022

  • Enables distconv implementations of Scatter and Gather layers
  • Implements NVSHMEM based RMA kernels for scatter/gather on DiHydrogen tensors
  • Adds example applications in applications/graph/DistConvGNN/synthetic for benchmarking distributed Scatter, Gather, and GCN
  • Adds unit tests

To do:

  • Fix error in distconv identity layer causing mismatched mini-batch dimension

@szaman19 szaman19 marked this pull request as ready for review July 1, 2022 17:32
@szaman19
Copy link
Copy Markdown
Collaborator Author

szaman19 commented Nov 9, 2022

Tagging as this affected by #2145

@bvanessen bvanessen requested a review from benson31 December 22, 2022 19:20
@bvanessen bvanessen self-requested a review January 11, 2023 23:07
Comment thread include/lbann/layers/transform/distconv/distconv_gather.hpp Outdated
Comment thread include/lbann/layers/transform/distconv/distconv_scatter.hpp Outdated
Comment on lines +171 to +192
auto dims_to_str = [] (const std::vector<int>& dims) -> std::string {
std::ostringstream ss;
for (size_t i=0; i<dims.size(); ++i) {
ss << (i>0 ? "x" : "") << dims[i];
}
return ss.str();
};
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't we have a function for this? If no, we should have one at the library utils level...

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's fantastic. Not your problem, but we should probably functionify that...

Comment thread src/layers/transform/distconv/CMakeLists.txt Outdated
Comment thread src/layers/transform/distconv/distconv_gather.cpp Outdated
Comment thread src/layers/transform/distconv/distconv_gather.cpp Outdated
Comment thread src/layers/transform/CMakeLists.txt Outdated
# Add the subdirectories
add_subdirectory(cereal_registration)

if (LBANN_HAS_DISTCONV)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did we need NVSHMEM protection here, too?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, for now. Possibly not true in the future

Copy link
Copy Markdown
Collaborator

@benson31 benson31 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would like to verify this builds +distconv~nvshmem.

@szaman19 szaman19 force-pushed the nvhsmem-scatter-gather branch from 82797a8 to a9d0ffc Compare February 1, 2023 17:07
- Adds distconv-side scatter-gather classes
- Adds scatterNVSHMEM class to handle NVSHMEM memory using DC NVSHMEM formalism
- Updated CMakeList to include files in compilation
- Added test CI code for distconv scatter
Updated the CI test for distconv scatter
- Adds kernels implementing RMA ops on shared memory buffers for scatter gathers
- Adds some helper functions on distconv utils
- Appropriate plumbing to integrate LBANN scatter with DiHydrogen scatter
- Fixed distconv-lbann layer mismatch
- Added hybrid data-parallel distconv support
- Added synthetic GCN example
- Fixed ci tests
- Added fix to the CI
- Added a debug to data type distconv adapter to diagnose mismatched mini-batch dimension
in the identity layer
removed debug prints
@szaman19 szaman19 force-pushed the nvhsmem-scatter-gather branch from 6e012e0 to f72b463 Compare February 1, 2023 17:55
- Removed extraneous prinouts
- Added some more reasonable prints in debug mode
@bvanessen bvanessen merged commit e91fef8 into LBANN:develop Feb 1, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants