Conversation
```python
# Prefill
prefill_request_dict = copy.deepcopy(request_dict)
prefill_request_dict['max_tokens'] = 1
prefill_request_dict['max_completion_tokens'] = 1
```
What is `max_completion_tokens` used for, and what's the difference between `prefill_request_dict['max_completion_tokens']` and `prefill_request_dict['max_tokens']`?
Fixed.
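For context, a minimal sketch of the prefill-only trick the diff implements (the helper name is hypothetical; the field semantics follow the OpenAI API, where `max_completion_tokens` supersedes the deprecated `max_tokens` in chat completions):

```python
import copy

def make_prefill_request(request_dict: dict) -> dict:
    # Hypothetical helper illustrating the diff above: copy the original
    # request and cap generation at a single token, so the engine runs the
    # full prompt prefill with essentially no decode work.
    prefill_request = copy.deepcopy(request_dict)
    # Legacy completions-style field.
    prefill_request['max_tokens'] = 1
    # Chat-completions replacement for the deprecated `max_tokens`;
    # setting both covers either request schema.
    prefill_request['max_completion_tokens'] = 1
    return prefill_request
```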
Setup: H800, `lmdeploy serve proxy`.

```shell
export LMDEPLOY_DP_MASTER_ADDR=0.0.0.0
export LMDEPLOY_DP_MASTER_PORT=8888
lmdeploy serve api_server Qwen/Qwen3-235B-A22B --dp 2 --tp 8 --max-batch-size 64 --cache-max-entry-count 0.6 --max-prefill-token-num 4096 --proxy-url http://0.0.0.0:8000

# oc evaluation
opencompass workspace/eval/qwen3_235b_infer.py -m infer -w workspace/eval/qwen3-235b-dp2-tp8 -r latest
```

Got OOM.
When serving the Qwen/Qwen3-235B-A22B-FP8 model, unbalanced CUDA memory occupation is observed across ranks. In contrast, this issue is not present with the BF16 model (Qwen/Qwen3-235B-A22B).

The FFN intermediate dim is 1536 = 128 × 12, where 128 is the FP8 quantization block size; 12 blocks cannot be split evenly across 8 TP ranks.
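A back-of-the-envelope sketch of the imbalance (the numbers come from the comment above; the greedy assignment of leftover blocks to the first ranks is an assumption about how the split ends up):

```python
FFN_DIM = 1536   # FFN intermediate dimension
BLOCK = 128      # FP8 quantization block size
TP = 8           # tensor-parallel world size

blocks = FFN_DIM // BLOCK                 # 12 quantization blocks
base, rem = divmod(blocks, TP)            # 1 block per rank, 4 left over
per_rank = [base + (1 if r < rem else 0) for r in range(TP)]
print(per_rank)                           # [2, 2, 2, 2, 1, 1, 1, 1]
# Ranks holding 2 blocks store twice the FFN weight of ranks holding 1,
# which shows up as unbalanced CUDA memory occupation across GPUs.
```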
Evaluation test failed. Reference leaderboard: https://rank.opencompass.org.cn/leaderboard-llm-academic/?m=REALTIME (Qwen3-235B-A22B-Thinking-2507: GPQA 79.8, AIME2025 90.9).
FINALLY~! |
Requirements

Enable different TP sizes for Attention/MLP/MoE, as in the sketch below.
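A hedged sketch of what a per-module TP configuration could look like (the class and field names are hypothetical, not an existing lmdeploy API):

```python
from dataclasses import dataclass

@dataclass
class ModuleTPConfig:
    # Hypothetical per-module tensor-parallel degrees: keep attention at
    # the full TP width, but use a smaller TP for the FFN/MoE so the FP8
    # block count divides evenly per rank.
    attn_tp: int = 8   # attention heads split across all 8 ranks
    mlp_tp: int = 4    # 12 blocks / 4 ranks = 3 blocks each, balanced
    moe_tp: int = 4    # expert FFNs use the same block-aligned split

cfg = ModuleTPConfig()
assert (1536 // 128) % cfg.mlp_tp == 0  # FP8 block count splits evenly
```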