Fix ep deployment issues #4084

Merged
lvhan028 merged 16 commits into InternLM:main from CUHKSZzxy:fix-ep
Nov 27, 2025

@CUHKSZzxy CUHKSZzxy commented Oct 30, 2025

Related

DeepEP mode switching and buffer clear updates, credit to @SHshenhao.
The fixes in the current PR are based on the following PR

Huge thanks also to the collaborators who upgraded DeepEP and DeepGEMM in DLBlas.

Modifications

  1. Upgrade DeepEP / DeepGEMM / DLBlas / FlashMLA
  • DeepEP -> v1.2.1
  • DeepGEMM -> v2.1.1.post3
  • DLBlas -> v0.0.6
  • FlashMLA -> commit 1408756 (no official release, pick the latest one)
  2. Bring back GDRCopy
  3. Expose the DeepEP num_sms environment variable

With the default DeepEP buffer num_sms value, multi-node runs on H200 fail with the error below. We therefore expose this setting as an environment variable so that users can configure it. A value that works on H200 is DEEPEP_BUFFER_NUM_SMS=16.

csrc/kernels/internode.cu:386, condition: ibgda_get_state()->num_rc_per_pe == num_channels or ibgda_get_state()->num_rc_per_pe >= num_sms

This is a known issue in DeepEP.
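As a rough illustration of the configuration described above, the following sketch reads DEEPEP_BUFFER_NUM_SMS from the environment and mirrors the quoted assertion as a plain predicate. The helper names `read_deepep_num_sms` and `internode_condition_holds` are hypothetical and are not part of lmdeploy, DLBlas, or DeepEP; only the variable name, the value 16, and the condition itself come from this PR.

```python
import os


def read_deepep_num_sms(env=None):
    """Return the SM count requested via DEEPEP_BUFFER_NUM_SMS.

    Returns None when the variable is unset or empty, meaning
    "keep whatever default the library uses" (hypothetical helper).
    """
    env = os.environ if env is None else env
    raw = env.get("DEEPEP_BUFFER_NUM_SMS")
    if raw is None or raw.strip() == "":
        return None
    value = int(raw)  # raises ValueError on non-integer input
    if value <= 0:
        raise ValueError(f"DEEPEP_BUFFER_NUM_SMS must be positive, got {value}")
    return value


def internode_condition_holds(num_rc_per_pe, num_channels, num_sms):
    """Python mirror of the condition quoted from csrc/kernels/internode.cu:386."""
    return num_rc_per_pe == num_channels or num_rc_per_pe >= num_sms
```

For example, `read_deepep_num_sms({"DEEPEP_BUFFER_NUM_SMS": "16"})` yields the H200 value suggested above, while an empty mapping yields `None`. How the real code consumes the variable may differ from this sketch.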

  4. Fix DeepEP mode in CUDA graph

Flip the DeepEP mode between prefill and decode, and also clear the buffer (DLBlas does this when the mode is set to low latency). Otherwise, a CUDA illegal memory access is triggered in DeepEP or in the DeepGEMM kernel that follows, as known in
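The switching logic described above can be sketched as a small state tracker. This is a minimal illustration under stated assumptions, not the code in this PR: `EPMode`, `ModeSwitcher`, and the `clear_buffer` callback are hypothetical names, and the mapping of prefill to a normal mode and decode to a low-latency mode is assumed for the sketch. The one behavior taken from the description is that the buffer is cleared when the mode is set to low latency.

```python
from enum import Enum


class EPMode(Enum):
    NORMAL = "normal"            # assumed mode for prefill
    LOW_LATENCY = "low_latency"  # assumed mode for decode


class ModeSwitcher:
    """Hypothetical tracker; not the lmdeploy or DLBlas API."""

    def __init__(self, clear_buffer):
        self._mode = None
        # Callback standing in for the buffer clear done on the DLBlas side.
        self._clear_buffer = clear_buffer

    def set_mode(self, is_decoding):
        target = EPMode.LOW_LATENCY if is_decoding else EPMode.NORMAL
        if target is self._mode:
            return target  # no flip needed, so no clear either
        if target is EPMode.LOW_LATENCY:
            # Clear stale buffer contents before low-latency kernels run.
            self._clear_buffer()
        self._mode = target
        return target
```

Calling `set_mode(False)` for a prefill step and `set_mode(True)` for a decode step invokes the clear callback once per transition into low-latency mode, and repeated decode steps do not clear again.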

  5. Others
  • Add some deep_gemm CUDA dependencies
  • Pin torch version to avoid build / runtime version mismatch (leads to undefined symbol for deep_gemm)
  • Add vim
  • Add some comments

@CUHKSZzxy CUHKSZzxy changed the title from "Fix ep" to "Fix ep deployment issues" Oct 30, 2025
@CUHKSZzxy CUHKSZzxy marked this pull request as draft October 30, 2025 03:12
@windreamer windreamer self-requested a review October 30, 2025 03:47
@CUHKSZzxy CUHKSZzxy marked this pull request as ready for review November 25, 2025 08:15
@lvhan028 lvhan028 requested a review from windreamer November 25, 2025 13:00
@lvhan028

@CUHKSZzxy test_docker workflow failed

rm -rf /var/lib/apt/lists/*

# install GDRCopy
GDRCOPY_VERSION=2.5.1

Will this affect the A100 platform?

@CUHKSZzxy CUHKSZzxy Nov 26, 2025

GDRCopy is only used with DeepEP, so in theory it won't affect non-Hopper devices. But I haven't tested it on A100 yet.

@grimoire grimoire left a comment

LGTM

@lvhan028 lvhan028 merged commit a8d7cb0 into InternLM:main Nov 27, 2025
15 checks passed
@lvhan028 lvhan028 mentioned this pull request Nov 27, 2025
@CUHKSZzxy CUHKSZzxy deleted the fix-ep branch November 27, 2025 11:51