Ascend Implementation of DeepEP
Supported Hardware Models: Atlas A2 and A3 Series Products
Platform: aarch64/x86
Supporting Software
- Driver Ascend HDK 25.0.RC1.1, CANN Community Edition 8.2.RC1.alpha003 and later versions (refer to the "CANN Software Installation Guide" to install the CANN development kit package, as well as the supporting firmware and drivers)
- Before installing the CANN software, install the dependencies it requires
- Python >= 3.9
- PyTorch >= 2.5.1, torch-npu >= 2.5.1-7.0.0
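The requirements above can be checked before building. The following is a minimal sketch and not part of DeepEP-Ascend: the helper names `python_ok` and `report_requirements` are hypothetical. It only reports the installed versions of the `torch` and `torch-npu` distributions; comparing them against the minimums listed above is left to the reader.

```python
import sys
from importlib import metadata

MIN_PYTHON = (3, 9)  # from the requirement "Python >= 3.9"


def python_ok(version_info=sys.version_info, minimum=MIN_PYTHON):
    """Return True if the interpreter meets the minimum Python version."""
    return tuple(version_info[:2]) >= minimum


def report_requirements(packages=("torch", "torch-npu")):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in packages:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    print("Python >= 3.9:", python_ok())
    for name, version in report_requirements().items():
        print(f"{name}: {version or 'not installed'}")
```

Running the script in the target environment prints one line per requirement, which makes a missing torch-npu installation visible before the build starts.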
DeepEP-Ascend supports both A2 and A3, and the package must be built separately for each: build the A2 package on A2 and the A3 package on A3.
- Prepare the CANN environment variables (modify according to the installation path)
source /usr/local/Ascend/ascend-toolkit/set_env.sh
- Build the project
Before running the build script build.sh, set _ASCEND_INSTALL_PATH on line 7 of build.sh to match your CANN installation path.
- A3
# Build the project
bash build.sh -a deepep
- A2
# Build the project
bash build.sh -a deepep2
- Pip install the .whl file into your Python environment
pip install output/deep_ep*.whl
# Link to the deep_ep_cpp.*.so file
cd "$(pip show deep-ep | grep -E '^Location:' | awk '{print $2}')" && ln -s deep_ep/deep_ep_cpp*.so && cd -
# (Optional) Confirm that the import succeeds
python -c "import deep_ep; print(deep_ep.__path__)"
- Source the CANN environment variables (modify according to the installation path)
source /usr/local/Ascend/ascend-toolkit/set_env.sh
- In the Python project, import deep_ep.
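Because the A2 and A3 packages are built with different targets, a build wrapper may want to select the target from the platform name. The sketch below only encodes the two build commands given in the steps above; the names `BUILD_TARGETS` and `build_command` are hypothetical and are not part of build.sh.

```python
import shlex

# Build targets taken from the commands above: A3 uses "deepep", A2 uses "deepep2".
BUILD_TARGETS = {"A3": "deepep", "A2": "deepep2"}


def build_command(platform):
    """Return the build.sh invocation for the given Atlas series ("A2" or "A3")."""
    key = platform.strip().upper()
    if key not in BUILD_TARGETS:
        raise ValueError(f"unsupported platform {platform!r}; expected A2 or A3")
    return ["bash", "build.sh", "-a", BUILD_TARGETS[key]]


if __name__ == "__main__":
    for series in ("A3", "A2"):
        print(series, "->", shlex.join(build_command(series)))
```

The returned list can be passed to `subprocess.run` from the repository root once the CANN environment has been sourced.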
- The A2 low_latency_dispatch and low_latency_combine operators support two types of internal operators: non-hierarchical and hierarchical. In the hierarchical implementation, intra-node communication uses HCCS, while inter-node communication uses RDMA. In the non-hierarchical implementation, both intra-node and inter-node communication use pure RDMA.
By default, the non-hierarchical operator is executed. If the environment variables HCCL_INTRA_PCIE_ENABLE=1 and HCCL_INTRA_ROCE_ENABLE=0 are configured, the hierarchical operator is executed instead.
- A3 does not need a hierarchical kernel implementation. Both intra-node and inter-node communication use pure HCCS.
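The selection rule for A2 can be restated as a small check on the environment. This is an illustrative sketch only: `a2_low_latency_mode` is a hypothetical helper that mirrors the rule described above, and the actual choice is made inside the operators, not by this function.

```python
import os


def a2_low_latency_mode(env=None):
    """Return which A2 operator variant the documented rule selects."""
    env = os.environ if env is None else env
    # Hierarchical only when BOTH variables carry the documented values.
    hierarchical = (
        env.get("HCCL_INTRA_PCIE_ENABLE") == "1"
        and env.get("HCCL_INTRA_ROCE_ENABLE") == "0"
    )
    return "hierarchical" if hierarchical else "non-hierarchical"


if __name__ == "__main__":
    print("default:", a2_low_latency_mode({}))
    print(
        "configured:",
        a2_low_latency_mode(
            {"HCCL_INTRA_PCIE_ENABLE": "1", "HCCL_INTRA_ROCE_ENABLE": "0"}
        ),
    )
```

Logging the result at startup is one way to confirm which variant a job is expected to run, for example when comparing latency between the two modes.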
Execute deepep-related test scripts
python3 tests/python/deepep/test_fused_deep_moe.py
python3 tests/python/deepep/test_intranode.py
python3 tests/python/deepep/test_low_latency.py
# On an A2 dual-node setup, run the internode test.
# Set the primary node IP in run_test_internode.sh first.
bash run_test_internode.sh
- If deep_ep cannot be imported in the project after installing the .whl file, check whether it is correctly installed in the site-packages directory of the current Python environment. View the installation path:
pip show deep-ep
- If deep_ep_cpp is not found after installing the .whl file, create a symbolic link from the deep_ep_cpp*.so file in the site-packages/deep_ep directory into the site-packages directory. Execute the following command in the site-packages directory:
ln -s deep_ep/deep_ep_cpp*.so
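To find out whether the symbolic link is the missing piece, the two directories can be compared directly. The following is a rough diagnostic sketch: `missing_cpp_links` is a hypothetical helper, and it assumes the layout described above, with the compiled library inside site-packages/deep_ep and a link expected one level up.

```python
from pathlib import Path


def missing_cpp_links(site_packages):
    """List deep_ep_cpp*.so files in deep_ep/ that have no entry in site-packages."""
    root = Path(site_packages)
    libs = sorted((root / "deep_ep").glob("deep_ep_cpp*.so"))
    # exists() follows symlinks, so a dangling link also counts as missing.
    return [lib.name for lib in libs if not (root / lib.name).exists()]


if __name__ == "__main__":
    import sysconfig

    # purelib is the site-packages directory of the running interpreter;
    # compare it with the Location reported by `pip show deep-ep`.
    site = sysconfig.get_paths()["purelib"]
    for name in missing_cpp_links(site):
        print(f"missing link: run `ln -s deep_ep/{name}` in {site}")
```

If the script prints nothing while the import still fails, the package is probably installed into a different Python environment than the one running the project.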