Performance drop on >= 8 nodes (32 GH200) for NCCL >= 2.20  #1272

@teojgo

Description

Each node in the system has 4 GH200 connected via NVLink, and communication between nodes goes over the HPE Slingshot interconnect. The aws-ofi-nccl plugin is used to take advantage of Slingshot. Several versions of the plugin have been tried to make sure that the performance drop is not related to it.

For example, on 8 nodes (32 GH200) with 4 tasks per node, the sendrecv_perf test from nccl-tests, where each task is run as: sendrecv_perf -g 1 -b 512M -e 512M, reports the following for NCCL 2.19.4:

# Avg bus bandwidth    : 24.2382 

While for NCCL 2.20.5, it is:

# Avg bus bandwidth    : 18.3153 

The numbers above are in GB/s. Up to 4 nodes (16 GH200), both NCCL versions deliver the same performance.
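
To put a number on the regression, here is a minimal sketch that parses the two summary lines quoted above and computes the relative drop. The helper name `avg_bus_bandwidth` and the regular expression are illustrative assumptions, not part of nccl-tests; only the two bandwidth values come from the runs reported here.

```python
import re


def avg_bus_bandwidth(output: str) -> float:
    """Extract the value from a '# Avg bus bandwidth' summary line (hypothetical helper)."""
    match = re.search(r"#\s*Avg bus bandwidth\s*:\s*([0-9.]+)", output)
    if match is None:
        raise ValueError("no 'Avg bus bandwidth' line found")
    return float(match.group(1))


# Summary lines copied from the 8-node (32 GH200) runs above, in GB/s.
nccl_2_19_4 = avg_bus_bandwidth("# Avg bus bandwidth    : 24.2382 ")
nccl_2_20_5 = avg_bus_bandwidth("# Avg bus bandwidth    : 18.3153 ")

# Relative drop of 2.20.5 with respect to 2.19.4.
drop = (nccl_2_19_4 - nccl_2_20_5) / nccl_2_19_4
print(f"Relative drop: {drop:.1%}")  # prints "Relative drop: 24.4%"
```

That is, NCCL 2.20.5 delivers roughly a quarter less average bus bandwidth than 2.19.4 in this configuration.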
