Conversation
```python
# Prefill
prefill_request_dict = copy.deepcopy(request_dict)
prefill_request_dict['max_tokens'] = 1
prefill_request_dict['max_completion_tokens'] = 1
```
What is `max_completion_tokens` used for, and what's the difference between `prefill_request_dict['max_completion_tokens']` and `prefill_request_dict['max_tokens']`?
Fixed.
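For context, a minimal sketch of the prefill-only trick the diff implements (the helper name is hypothetical; the field semantics follow the OpenAI API, where `max_completion_tokens` supersedes the deprecated `max_tokens` in chat completions):

```python
import copy

def make_prefill_request(request_dict: dict) -> dict:
    # Hypothetical helper illustrating the diff above: copy the original
    # request and cap generation at a single token, so the engine runs the
    # full prompt prefill with essentially no decode work.
    prefill_request = copy.deepcopy(request_dict)
    # Legacy completions-style field.
    prefill_request['max_tokens'] = 1
    # Chat-completions replacement for the deprecated `max_tokens`;
    # setting both covers either request schema.
    prefill_request['max_completion_tokens'] = 1
    return prefill_request
```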
Setup: H800, `lmdeploy serve proxy`.

```shell
export LMDEPLOY_DP_MASTER_ADDR=0.0.0.0
export LMDEPLOY_DP_MASTER_PORT=8888
lmdeploy serve api_server Qwen/Qwen3-235B-A22B --dp 2 --tp 8 --max-batch-size 64 --cache-max-entry-count 0.6 --max-prefill-token-num 4096 --proxy-url http://0.0.0.0:8000

# oc evaluation
opencompass workspace/eval/qwen3_235b_infer.py -m infer -w workspace/eval/qwen3-235b-dp2-tp8 -r latest
```

Got OOM.
When serving the Qwen/Qwen3-235B-A22B-FP8 model, unbalanced CUDA memory occupation is observed across ranks. In contrast, this issue is not present with the BF16 model (Qwen/Qwen3-235B-A22B).

The FFN intermediate dim is 1536 = 128 × 12, where 128 is the FP8 quantization block size; 12 blocks cannot be split evenly across 8 TP ranks.
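A back-of-the-envelope sketch of the imbalance (the numbers come from the comment above; the greedy assignment of leftover blocks to the first ranks is an assumption about how the split ends up):

```python
FFN_DIM = 1536   # FFN intermediate dimension
BLOCK = 128      # FP8 quantization block size
TP = 8           # tensor-parallel world size

blocks = FFN_DIM // BLOCK                 # 12 quantization blocks
base, rem = divmod(blocks, TP)            # 1 block per rank, 4 left over
per_rank = [base + (1 if r < rem else 0) for r in range(TP)]
print(per_rank)                           # [2, 2, 2, 2, 1, 1, 1, 1]
# Ranks holding 2 blocks store twice the FFN weight of ranks holding 1,
# which shows up as unbalanced CUDA memory occupation across GPUs.
```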
Evaluation test failed. Reference leaderboard: https://rank.opencompass.org.cn/leaderboard-llm-academic/?m=REALTIME (Qwen3-235B-A22B-Thinking-2507: GPQA 79.8, AIME2025 90.9).
FINALLY~! |
Requirements

Enable different TP sizes for Attention/MLP/MoE, as in the sketch below.
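A hedged sketch of what a per-module TP configuration could look like (the class and field names are hypothetical, not an existing lmdeploy API):

```python
from dataclasses import dataclass

@dataclass
class ModuleTPConfig:
    # Hypothetical per-module tensor-parallel degrees: keep attention at
    # the full TP width, but use a smaller TP for the FFN/MoE so the FP8
    # block count divides evenly per rank.
    attn_tp: int = 8   # attention heads split across all 8 ranks
    mlp_tp: int = 4    # 12 blocks / 4 ranks = 3 blocks each, balanced
    moe_tp: int = 4    # expert FFNs use the same block-aligned split

cfg = ModuleTPConfig()
assert (1536 // 128) % cfg.mlp_tp == 0  # FP8 block count splits evenly
```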