[feat] support async dataflow by resampling in next rollout step #1198
Merged
YanhuiDua merged 20 commits into InternLM:main on Nov 5, 2025
Conversation
YanhuiDua commented on Nov 3, 2025
hhaAndroid reviewed on Nov 4, 2025
hhaAndroid reviewed on Nov 5, 2025
    tensor_parallel_size=rollout_tp_size,
    expert_parallel_size=rollout_ep_size,
    gpu_memory_utilization=0.8,
    context_length = max_response_length + 2048,
Collaborator suggested change:
-    context_length = max_response_length + 2048,
+    context_length = max_response_length + max_prompt_length,
    eval_data_path = os.environ["EVAL_DATA_PATH"]
    enable_evaluate = True if eval_data_path != "" else False
    enbale_partial_rollout = int(os.environ.get("ENBALE_PARTIAL_ROLLOUT", "0"))
    max_concurrent = int(os.environ.get("XTUNER_MAX_CONCURRENCY", "512"))
Collaborator suggested change:
-    max_concurrent = int(os.environ.get("XTUNER_MAX_CONCURRENCY", "512"))
+    max_concurrent = int(os.environ.get("XTUNER_MAX_CONCURRENCY", 512))
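The two spellings of the default are behaviorally equivalent once wrapped in int(), since os.environ.get only returns the default when the variable is unset; a quick check (the env var name is taken from the diff above, and it is cleared here purely to exercise the default path):

```python
import os

# Clear the variable so both lookups fall through to their defaults.
os.environ.pop("XTUNER_MAX_CONCURRENCY", None)

a = int(os.environ.get("XTUNER_MAX_CONCURRENCY", "512"))  # string default
b = int(os.environ.get("XTUNER_MAX_CONCURRENCY", 512))    # int default
print(a == b == 512)
```

So the suggestion is a stylistic cleanup rather than a behavior change; note that when the variable *is* set, os.environ.get returns its string value either way and int() parses it.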
    eval_data_path = os.environ["EVAL_DATA_PATH"]
    enable_evaluate = True if eval_data_path != "" else False
    enbale_partial_rollout = int(os.environ.get("ENBALE_PARTIAL_ROLLOUT", "0"))
    max_concurrent = int(os.environ.get("XTUNER_MAX_CONCURRENCY", "512"))
Collaborator:
Add a TODO noting that this should later be refactored into a single externally exposed parameter, defined per GPU; all other internal parameters should be derived from it automatically.
hhaAndroid approved these changes on Nov 5, 2025
This PR supports async dataflow by resampling in the next rollout step and uses non-streaming inference by calling abort_request on the inference engine. An aborted (paused) request carries "abort" as its finish_reason. LMDeploy supports this feature in PR: add endpoint /abort_request lmdeploy#4092
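The abort-and-resample flow described above can be sketched as follows. This is a minimal illustration, not XTuner's actual implementation: the names RolloutSample, collect_step, and pending_queue are hypothetical, and only the finish_reason == "abort" convention is taken from the PR text.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional

@dataclass
class RolloutSample:
    """Hypothetical container for one in-flight rollout request."""
    prompt: str
    generated: str = ""
    finish_reason: Optional[str] = None

def collect_step(results, pending_queue):
    """Split one rollout step's results: requests the engine aborted
    (finish_reason == "abort") are re-queued so their sampling resumes
    in the next rollout step; everything else counts as finished."""
    finished = []
    for sample in results:
        if sample.finish_reason == "abort":
            pending_queue.append(sample)  # paused, resample next step
        else:
            finished.append(sample)
    return finished

queue = deque()
done = collect_step(
    [RolloutSample("a", finish_reason="stop"),
     RolloutSample("b", finish_reason="abort")],
    queue,
)
```

Under this reading, non-streaming inference plus abort_request lets the dataflow pause long-running generations at a step boundary instead of blocking on them.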
This PR also standardizes the concurrency parameters for the dataflow process. XTUNER_MAX_CONCURRENCY now controls the concurrency of the dataflow itself. The inference engine's concurrency is derived from the dataflow concurrency, prompt_repeat_k, and tp_size. The httpx concurrency is set to the inference engine's concurrency multiplied by tp_size. Finally, RAY_MAX_CONCURRENCY controls Ray's concurrency and is set to the dataflow concurrency multiplied by prompt_repeat_k.
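The relationships above can be sketched as one derivation. Note the hedges: derive_concurrency is a hypothetical helper, not XTuner's API, and the engine formula is an assumption (the PR only says it is "calculated based on" the dataflow concurrency, prompt_repeat_k, and tp_size); the httpx and Ray formulas are stated explicitly in the text.

```python
import os

def derive_concurrency(prompt_repeat_k: int, tp_size: int):
    """Hypothetical sketch of the concurrency scheme described in the PR."""
    # Single user-facing knob for the dataflow itself.
    dataflow = int(os.environ.get("XTUNER_MAX_CONCURRENCY", 512))
    # ASSUMED combination; the PR does not give the exact formula.
    engine = dataflow * prompt_repeat_k // tp_size
    # Stated in the PR: httpx limit = engine concurrency * tp_size.
    httpx_limit = engine * tp_size
    # Stated in the PR: RAY_MAX_CONCURRENCY = dataflow * prompt_repeat_k.
    ray = dataflow * prompt_repeat_k
    return dataflow, engine, httpx_limit, ray

os.environ.pop("XTUNER_MAX_CONCURRENCY", None)  # use the default of 512
dataflow, engine, httpx_limit, ray = derive_concurrency(prompt_repeat_k=8, tp_size=2)
```

This layering matches the reviewer's TODO above: one externally exposed knob, with every internal limit computed from it.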