
support variable-length trajectory groups end-to-end#6

Open
zeocax wants to merge 4 commits into AI45Lab:main from zeocax:feature/variable-group-trajectories

@zeocax (Collaborator) commented May 21, 2026

Summary

  • buffer_server: a prompt group is ready when every session has emitted its terminal row (not when row-count reaches K). Terminal session rewards are
    propagated back to earlier rows of the same session (original step reward stays in extra_info.step_reward). /get_rollout_data accepts an optional
    max_groups cap.
  • slime_generator: stop routing samples through slime's data_buffer — its add_samples asserts len(group) == n_samples_per_prompt, which fails for
    variable-length trajectory groups. Accumulate groups in a local list, fetch by group count via the new max_groups, and dedupe per-group rewards on
    session-completed rows when filtering by weight version / dapo.
  • Add rl/variable_group_rewards.py exposing post_process_rewards for slime's --custom-reward-post-process-path. It groups samples by
    Sample.group_index and applies GRPO-style baseline (and optional std normalization) per group — works regardless of group size.
  • geo3k_vl example: rename SLIME_* batch-size envs to RL_GLOBAL_BATCH_SIZE / RL_ROLLOUT_GROUP_BATCH_SIZE / RL_GROUP_SIZE (the script derives
    whichever is unset), wire the new custom reward post-process path, and enable --use-dynamic-global-batch-size.

Commits

  1. buffer_server: support variable-length trajectory groups
  2. slime_generator: fetch rollout data by group count and support variable groups
  3. geo3k_vl example: rename batch-size vars to RL_* and wire variable-group rewards

zeocax added 3 commits May 21, 2026 18:43
Track per-instance completed session ids so a group is ready iff every
session has emitted a terminal row, rather than counting raw rows. Propagate
each session's terminal reward back to its earlier rows (preserving the step
reward in extra_info.step_reward). Add an optional max_groups cap on
/get_rollout_data so the rollout side can fetch one group at a time.
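
The readiness rule and reward propagation described above can be sketched roughly as follows. This is a minimal illustration, not the buffer_server code: the `Row` type, its field names (`session_id`, `is_terminal`), and both helper functions are hypothetical; only `extra_info.step_reward` is named in this PR.

```python
from dataclasses import dataclass, field


@dataclass
class Row:
    # Hypothetical row shape; the real buffer_server schema may differ.
    session_id: str
    reward: float
    is_terminal: bool = False
    extra_info: dict = field(default_factory=dict)


def group_is_ready(rows, expected_sessions):
    """A group is ready once every expected session has a terminal row."""
    completed = {r.session_id for r in rows if r.is_terminal}
    return set(expected_sessions) <= completed


def propagate_terminal_rewards(rows):
    """Copy each session's terminal reward onto its earlier rows, in place."""
    terminal = {r.session_id: r.reward for r in rows if r.is_terminal}
    for r in rows:
        if r.is_terminal or r.session_id not in terminal:
            continue
        r.extra_info["step_reward"] = r.reward  # keep the original step reward
        r.reward = terminal[r.session_id]
    return rows
```

Note that readiness depends only on which sessions have finished, so two groups with the same number of sessions can hold very different numbers of rows.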
…le groups

Stop routing samples through slime's data_buffer (its add_samples asserts
len(group) == n_samples_per_prompt, which fails for variable-length
trajectory groups). Accumulate groups in a local list instead, fetch from
buffer_server with the new max_groups cap, and dedupe rewards on
session-completed rows when filtering by weight version / dapo.
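
One way to picture the reward dedupe: once terminal rewards are propagated to every row of a session, counting all rows would weight long sessions more heavily, so a per-group check can look at session-completed rows only. The sketch below assumes rows are plain dicts with a hypothetical `session_completed` flag, and assumes the DAPO-style filter drops groups whose rewards are all identical; neither detail is confirmed by this PR.

```python
def session_rewards(group):
    """Return one reward per session, read from its session-completed row."""
    return [row["reward"] for row in group if row.get("session_completed")]


def group_has_reward_signal(group):
    """True when the deduped session rewards are not all identical."""
    return len(set(session_rewards(group))) > 1
```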

Add rl.variable_group_rewards.post_process_rewards for slime's
--custom-reward-post-process-path: groups by sample.group_index and
applies GRPO-style baseline (and optional std normalization) per group,
which works regardless of group size.
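
As an illustration of per-group normalization that does not depend on group size, here is a small sketch over plain lists. It is not the `post_process_rewards` implementation and does not use slime's hook signature; the function name, the `use_std` and `eps` parameters, and the choice of population standard deviation are all assumptions.

```python
from collections import defaultdict
from statistics import fmean, pstdev


def normalize_rewards_per_group(group_indices, rewards, use_std=False, eps=1e-6):
    """Subtract each group's mean reward; optionally divide by its std."""
    positions_by_group = defaultdict(list)
    for pos, group in enumerate(group_indices):
        positions_by_group[group].append(pos)

    out = [0.0] * len(rewards)
    for positions in positions_by_group.values():
        values = [rewards[p] for p in positions]
        mean = fmean(values)
        std = pstdev(values) if use_std else 0.0
        for p in positions:
            centered = rewards[p] - mean
            # eps guards against zero std (e.g. a single-sample group)
            out[p] = centered / (std + eps) if use_std else centered
    return out
```

Because samples are bucketed by group index rather than reshaped into a fixed `(n_groups, n_samples_per_prompt)` array, groups of size 3 and size 2 can sit in the same batch.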
…oup rewards

Rename SLIME_GLOBAL_BATCH_SIZE / SLIME_ROLLOUT_BATCH_SIZE /
SLIME_N_SAMPLES_PER_PROMPT to RL_GLOBAL_BATCH_SIZE /
RL_ROLLOUT_GROUP_BATCH_SIZE / RL_GROUP_SIZE, with the script deriving
whichever batch-size var is unset and asserting all three are positive.
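
The derivation could look like the sketch below, written in Python for illustration rather than as the example script itself. It assumes the three values are tied by global batch size = rollout group batch size × group size, which this PR does not state; the function name and parameter names are hypothetical.

```python
def derive_batch_sizes(global_bs=None, rollout_group_bs=None, group_size=None):
    """Fill in at most one unset value and check that all are positive."""
    given = {
        "global_bs": global_bs,
        "rollout_group_bs": rollout_group_bs,
        "group_size": group_size,
    }
    missing = [name for name, value in given.items() if value is None]
    if len(missing) > 1:
        raise ValueError(f"set at least two of the three values, missing: {missing}")
    for name, value in given.items():
        if value is not None and value <= 0:
            raise ValueError(f"{name} must be positive, got {value}")

    # Assumed relation: global_bs == rollout_group_bs * group_size
    if global_bs is None:
        global_bs = rollout_group_bs * group_size
    elif rollout_group_bs is None:
        if global_bs % group_size:
            raise ValueError("global_bs is not divisible by group_size")
        rollout_group_bs = global_bs // group_size
    elif group_size is None:
        if global_bs % rollout_group_bs:
            raise ValueError("global_bs is not divisible by rollout_group_bs")
        group_size = global_bs // rollout_group_bs
    return global_bs, rollout_group_bs, group_size
```

For example, `derive_batch_sizes(rollout_group_bs=32, group_size=8)` fills in a global batch size of 256 under the assumed relation.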

Pass --custom-reward-post-process-path
rl.variable_group_rewards.post_process_rewards so per-group GRPO
normalization handles the new variable-length groups, and enable
--use-dynamic-global-batch-size on the trainer.
@two-tiger two-tiger requested review from WangXuhongCN and two-tiger and removed request for WangXuhongCN May 21, 2026 10:57
Signed-off-by: zeocax <zeocax@zeocax.com>