Add distributed Scatter/Gather #2113
Tagging as this is affected by #2145
auto dims_to_str = [] (const std::vector<int>& dims) -> std::string {
  std::ostringstream ss;
  for (size_t i=0; i<dims.size(); ++i) {
    ss << (i>0 ? "x" : "") << dims[i];
  }
  return ss.str();
};
Don't we have a function for this? If not, we should have one at the library utils level...
Not that I know of, but this pattern does get used quite a bit. Ex:
https://github.com/LLNL/lbann/blob/4acbbe5fcb0b48af35a1afa48e628d789ecaf0f1/include/lbann/layers/transform/concatenate.hpp#L226
That's fantastic. Not your problem, but we should probably functionify that...
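As a rough sketch of what such a shared helper might look like, the lambda above could be lifted into a free function template. The name `join_dims`, its signature, and the default separator are hypothetical choices for illustration, not an existing LBANN utility, and where it would live in the library is left open.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper (name and signature are assumptions, not LBANN API).
// Joins the entries of `dims` into one string, placing `sep` between
// consecutive entries, so {2, 3, 4} becomes "2x3x4" with the default separator.
template <typename T>
std::string join_dims(const std::vector<T>& dims,
                      const std::string& sep = "x")
{
  std::ostringstream ss;
  for (std::size_t i = 0; i < dims.size(); ++i) {
    if (i > 0) {
      ss << sep;  // separator only between entries, never leading or trailing
    }
    ss << dims[i];
  }
  return ss.str();
}
```

With something like this in place, call sites such as the one in this diff could reduce to `join_dims(dims)`. Note that element types that stream as characters (for example `char`) would print as characters rather than numbers, so the sketch is intended for ordinary integer dimension types.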
# Add the subdirectories
add_subdirectory(cereal_registration)

if (LBANN_HAS_DISTCONV)
Did we need NVSHMEM protection here, too?
Yes, for now. Possibly not true in the future
benson31 left a comment
I would like to verify that this builds with +distconv~nvshmem.
82797a8 to a9d0ffc
- Adds distconv-side scatter-gather classes
- Adds scatterNVSHMEM class to handle NVSHMEM memory using DC NVSHMEM formalism
- Updated CMakeList to include files in compilation
- Added test CI code for distconv scatter
…for getting local PE and total PE
Updated the CI test for distconv scatter
- Adds kernels implementing RMA ops on shared memory buffers for scatter gathers
- Adds some helper functions on distconv utils
- Appropriate plumbing to integrate LBANN scatter with DiHydrogen scatter
- Added readme with quick instructions
…sion to be different from input
- Fixed distconv-lbann layer mismatch
- Added hybrid data-parallel distconv support
- Added synthetic GCN example
- Fixed ci tests
- Added fix to the CI
- Added a debug to data type distconv adapter to diagnose mismatched mini-batch dimension in the identity layer
removed debug prints
Passing CI tests
- Added fix for the NVSHMEM initialization at setup time
- Added fix for mismatched NVSHMEM malloc size at setup time that causes undefined behavior
- Passing CI test now
… sure the participant buffers are contiguous.
6e012e0 to f72b463
- Removed extraneous printouts
- Added some more reasonable prints in debug mode
applications/graph/DistConvGNN/synthetic for benchmarking distributed Scatter, Gather, and GCN
To do: