support variable-length trajectory groups end-to-end #6
Open
zeocax wants to merge 4 commits into
Track per-instance completed session ids so a group is ready iff every session has emitted a terminal row, rather than counting raw rows. Propagate each session's terminal reward back to its earlier rows (preserving the step reward in extra_info.step_reward). Add an optional max_groups cap on /get_rollout_data so the rollout side can fetch one group at a time.
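The readiness rule and reward propagation described above can be sketched roughly as follows. This is an illustration only, not the code in `buffer_server`: the `Row` dataclass, the `terminal` flag, and the helpers `group_is_ready`, `propagate_terminal_rewards`, and `take_ready_groups` are hypothetical names chosen for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Row:
    """Hypothetical stand-in for one buffered rollout row."""
    session_id: str
    reward: float
    terminal: bool = False
    extra_info: dict = field(default_factory=dict)


def group_is_ready(rows, expected_sessions):
    """Ready iff every expected session has emitted a terminal row."""
    completed = {r.session_id for r in rows if r.terminal}
    return set(expected_sessions) <= completed


def propagate_terminal_rewards(rows):
    """Copy each session's terminal reward onto its earlier rows, in place."""
    terminal_reward = {r.session_id: r.reward for r in rows if r.terminal}
    for r in rows:
        if r.terminal or r.session_id not in terminal_reward:
            continue
        r.extra_info["step_reward"] = r.reward  # keep the original step reward
        r.reward = terminal_reward[r.session_id]
    return rows


def take_ready_groups(groups, expected, max_groups=None):
    """Pop ready groups from `groups`, at most `max_groups` if a cap is given."""
    ready = [gid for gid, rows in groups.items()
             if group_is_ready(rows, expected[gid])]
    if max_groups is not None:
        ready = ready[:max_groups]
    return {gid: groups.pop(gid) for gid in ready}
```

Counting completed session ids rather than raw rows means a session that emits five rows and one that emits two are treated alike: each contributes exactly one entry to `completed`.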
…le groups
Stop routing samples through slime's data_buffer (its add_samples asserts len(group) == n_samples_per_prompt, which fails for variable-length trajectory groups). Accumulate groups in a local list instead, fetch from buffer_server with the new max_groups cap, and dedupe rewards on session-completed rows when filtering by weight version / dapo. Add rl.variable_group_rewards.post_process_rewards for slime's --custom-reward-post-process-path: groups by sample.group_index and applies GRPO-style baseline (and optional std normalization) per group, which works regardless of group size.
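A minimal sketch of the per-group GRPO-style normalization follows. It does not reproduce the signature slime expects from a `--custom-reward-post-process-path` callable. The function name `grpo_normalize`, the flat-list inputs, the `eps` term, and the use of the population standard deviation are all assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean, pstdev


def grpo_normalize(group_indices, rewards, use_std=False, eps=1e-6):
    """Subtract each group's mean reward; optionally divide by its std.

    group_indices[i] is the group that rewards[i] belongs to. Groups may
    have different sizes, since each is normalized independently.
    """
    positions_by_group = defaultdict(list)
    for pos, gid in enumerate(group_indices):
        positions_by_group[gid].append(pos)

    out = [0.0] * len(rewards)
    for positions in positions_by_group.values():
        values = [rewards[p] for p in positions]
        baseline = mean(values)
        # Assumption: population std plus eps, to avoid dividing by zero
        # when every reward in a group is identical.
        scale = pstdev(values) + eps if use_std else 1.0
        for p in positions:
            out[p] = (rewards[p] - baseline) / scale
    return out
```

Because the baseline is computed from whatever rows share a group index, nothing in this approach depends on a fixed `n_samples_per_prompt`.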
…oup rewards
Rename SLIME_GLOBAL_BATCH_SIZE / SLIME_ROLLOUT_BATCH_SIZE / SLIME_N_SAMPLES_PER_PROMPT to RL_GLOBAL_BATCH_SIZE / RL_ROLLOUT_GROUP_BATCH_SIZE / RL_GROUP_SIZE, with the script deriving whichever batch-size var is unset and asserting all three are positive. Pass --custom-reward-post-process-path rl.variable_group_rewards.post_process_rewards so per-group GRPO normalization handles the new variable-length groups, and enable --use-dynamic-global-batch-size on the trainer.
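The derivation of the unset batch-size variable might look like the sketch below, written in Python for illustration rather than as the example script itself. The relation `RL_GLOBAL_BATCH_SIZE = RL_ROLLOUT_GROUP_BATCH_SIZE * RL_GROUP_SIZE` is an assumption, since the commit message does not state the formula. The helper name `resolve_batch_sizes` is hypothetical.

```python
def resolve_batch_sizes(env):
    """Fill in whichever batch-size variable is unset, then validate all three.

    Assumed relation (not confirmed by the PR text):
        RL_GLOBAL_BATCH_SIZE = RL_ROLLOUT_GROUP_BATCH_SIZE * RL_GROUP_SIZE
    """
    def read(name):
        raw = env.get(name, "")
        return int(raw) if raw else None

    group_size = read("RL_GROUP_SIZE")
    global_bs = read("RL_GLOBAL_BATCH_SIZE")
    rollout_bs = read("RL_ROLLOUT_GROUP_BATCH_SIZE")

    if group_size is None or group_size <= 0:
        raise ValueError("RL_GROUP_SIZE must be set to a positive integer")
    if global_bs is None and rollout_bs is None:
        raise ValueError("set RL_GLOBAL_BATCH_SIZE or RL_ROLLOUT_GROUP_BATCH_SIZE")

    if global_bs is None:
        global_bs = rollout_bs * group_size
    elif rollout_bs is None:
        rollout_bs = global_bs // group_size

    resolved = {
        "RL_GLOBAL_BATCH_SIZE": global_bs,
        "RL_ROLLOUT_GROUP_BATCH_SIZE": rollout_bs,
        "RL_GROUP_SIZE": group_size,
    }
    for name, value in resolved.items():
        if value <= 0:
            raise ValueError(f"{name} must be positive, got {value}")
    return resolved
```

In practice this would be called with `os.environ`. Taking a plain mapping keeps the sketch easy to exercise without touching the process environment.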
Signed-off-by: zeocax <zeocax@zeocax.com>
Summary
- `buffer_server`: a prompt group is ready when every session has emitted its terminal row (not when the row count reaches K). Terminal session rewards are propagated back to earlier rows of the same session (the original step reward stays in `extra_info.step_reward`). `/get_rollout_data` accepts an optional `max_groups` cap.
- `slime_generator`: stop routing samples through slime's `data_buffer`, because its `add_samples` asserts `len(group) == n_samples_per_prompt`, which fails for variable-length trajectory groups. Accumulate groups in a local list, fetch by group count via the new `max_groups`, and dedupe per-group rewards on session-completed rows when filtering by weight version / dapo.
- Add `rl/variable_group_rewards.py`, exposing `post_process_rewards` for slime's `--custom-reward-post-process-path`. It groups samples by `Sample.group_index` and applies a GRPO-style baseline (and optional std normalization) per group, so it works regardless of group size.
- `geo3k_vl` example: rename the `SLIME_*` batch-size envs to `RL_GLOBAL_BATCH_SIZE` / `RL_ROLLOUT_GROUP_BATCH_SIZE` / `RL_GROUP_SIZE` (the script derives whichever is unset), wire the new custom reward post-process path, and enable `--use-dynamic-global-batch-size`.

Commits
- buffer_server: support variable-length trajectory groups
- slime_generator: fetch rollout data by group count and support variable groups
- geo3k_vl example: rename batch-size vars to RL_* and wire variable-group rewards