Each node on the system has 4 GH200 connected with NVLink, and communication between nodes goes over the HPE Slingshot Interconnect. The aws-ofi-nccl plugin is used so that NCCL can take advantage of Slingshot. Several versions of the plugin have been tried to make sure that the performance drop is not related to it.
For example, on 8 nodes (32 GH200), running the sendrecv_perf test from nccl-tests with 4 tasks per node, where each task is executed as: sendrecv_perf -g 1 -b 512M -e 512M, gives the following result with NCCL 2.19.4:
# Avg bus bandwidth : 24.2382
With NCCL 2.20.5, the result is:
# Avg bus bandwidth : 18.3153
The numbers above are in GB/s. Up to 4 nodes (16 GH200), both NCCL versions give the same performance.
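To put a number on the regression, here is a minimal sketch that pulls the "Avg bus bandwidth" value out of the nccl-tests summary line and computes the relative drop between the two runs. The function name avg_bus_bandwidth and the two variables holding the output are hypothetical helpers for illustration; the only inputs are the two summary lines quoted above.

```python
import re


def avg_bus_bandwidth(output: str) -> float:
    """Return the value from a '# Avg bus bandwidth : X' summary line (GB/s)."""
    match = re.search(r"#\s*Avg bus bandwidth\s*:\s*([0-9.]+)", output)
    if match is None:
        raise ValueError("no 'Avg bus bandwidth' line found")
    return float(match.group(1))


# Summary lines as reported above (8 nodes, 32 GH200).
out_2_19_4 = "# Avg bus bandwidth : 24.2382"
out_2_20_5 = "# Avg bus bandwidth : 18.3153"

old = avg_bus_bandwidth(out_2_19_4)
new = avg_bus_bandwidth(out_2_20_5)

# Relative drop of the newer version with respect to the older one, in percent.
drop_pct = (old - new) / old * 100
print(f"{old:.4f} -> {new:.4f} GB/s, relative drop {drop_pct:.1f}%")
```

With the two values reported here, this works out to a drop of roughly 24% going from NCCL 2.19.4 to 2.20.5 on 8 nodes.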