support variable-length trajectory groups end-to-end by zeocax · Pull Request #6 · AI45Lab/Safactory

zeocax · 2026-05-21T10:47:54Z

Summary

buffer_server: a prompt group is ready when every session has emitted its terminal row (not when the row count reaches K). Terminal session rewards are propagated back to earlier rows of the same session (the original step reward stays in extra_info.step_reward). /get_rollout_data accepts an optional max_groups cap.
slime_generator: stop routing samples through slime's data_buffer, because its add_samples asserts len(group) == n_samples_per_prompt, which fails for variable-length trajectory groups. Accumulate groups in a local list, fetch by group count via the new max_groups, and dedupe per-group rewards on session-completed rows when filtering by weight version / dapo.
rl/variable_group_rewards.py: new module exposing post_process_rewards for slime's --custom-reward-post-process-path. It groups samples by Sample.group_index and applies a GRPO-style baseline (and optional std normalization) per group, so it works regardless of group size.
geo3k_vl example: rename the SLIME_* batch-size envs to RL_GLOBAL_BATCH_SIZE / RL_ROLLOUT_GROUP_BATCH_SIZE / RL_GROUP_SIZE (the script derives whichever is unset), wire the new custom reward post-process path, and enable --use-dynamic-global-batch-size.

Commits

buffer_server: support variable-length trajectory groups
slime_generator: fetch rollout data by group count and support variable groups
geo3k_vl example: rename batch-size vars to RL_* and wire variable-group rewards

Track per-instance completed session ids so a group is ready iff every session has emitted a terminal row, rather than counting raw rows. Propagate each session's terminal reward back to its earlier rows (preserving the step reward in extra_info.step_reward). Add an optional max_groups cap on /get_rollout_data so the rollout side can fetch one group at a time.
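The readiness rule and the reward back-propagation described above can be sketched roughly as follows. This is only an illustration, not the buffer_server code: the row layout (dicts with `session_id`, `reward`, `is_terminal`, `extra_info`) and both function names are hypothetical, and the sketch assumes the step reward is preserved only on non-terminal rows.

```python
def group_is_ready(rows, expected_session_ids):
    """Ready iff every expected session has emitted a terminal row.

    The raw row count is irrelevant, so sessions may differ in length.
    """
    completed = {r["session_id"] for r in rows if r["is_terminal"]}
    return set(expected_session_ids) <= completed


def propagate_terminal_rewards(rows):
    """Return copies of rows where earlier rows carry the session's terminal reward.

    The original per-step reward is kept under extra_info["step_reward"].
    Rows of sessions without a terminal row are returned unchanged.
    """
    terminal = {r["session_id"]: r["reward"] for r in rows if r["is_terminal"]}
    out = []
    for r in rows:
        new_row = dict(r)
        sid = r["session_id"]
        if not r["is_terminal"] and sid in terminal:
            extra = dict(r.get("extra_info") or {})
            extra["step_reward"] = r["reward"]  # preserve the step reward
            new_row["extra_info"] = extra
            new_row["reward"] = terminal[sid]   # overwrite with terminal reward
        out.append(new_row)
    return out
```

Keying readiness on completed session ids rather than on a fixed K is what lets a group contain trajectories with different numbers of rows.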

…le groups

Stop routing samples through slime's data_buffer (its add_samples asserts len(group) == n_samples_per_prompt, which fails for variable-length trajectory groups). Accumulate groups in a local list instead, fetch from buffer_server with the new max_groups cap, and dedupe rewards on session-completed rows when filtering by weight version / dapo.

Add rl.variable_group_rewards.post_process_rewards for slime's --custom-reward-post-process-path: it groups by sample.group_index and applies a GRPO-style baseline (and optional std normalization) per group, which works regardless of group size.
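As a rough illustration of per-group GRPO-style normalization, the sketch below subtracts each group's mean reward and optionally divides by the group's standard deviation. It does not reproduce the actual post_process_rewards signature that slime expects: the `Sample` dataclass, the function name `normalize_per_group`, and the `eps` term are assumptions, and the sketch treats every sample as one reward entry (it does not model the per-session dedup mentioned above).

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass
class Sample:
    """Hypothetical stand-in for a rollout sample; only these two fields are used."""
    group_index: int
    reward: float


def normalize_per_group(samples, use_std=False, eps=1e-6):
    """Return one advantage-like value per sample, in input order."""
    # Collect positions by group so groups of any size are handled.
    positions = defaultdict(list)
    for i, s in enumerate(samples):
        positions[s.group_index].append(i)

    out = [0.0] * len(samples)
    for idxs in positions.values():
        rewards = [samples[i].reward for i in idxs]
        baseline = mean(rewards)  # group mean as the baseline
        scale = (pstdev(rewards) + eps) if use_std else 1.0
        for i in idxs:
            out[i] = (samples[i].reward - baseline) / scale
    return out
```

Because the grouping is driven by `group_index` rather than by slicing the batch into fixed chunks of n_samples_per_prompt, a group of three and a group of two can sit in the same batch.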

…oup rewards

Rename SLIME_GLOBAL_BATCH_SIZE / SLIME_ROLLOUT_BATCH_SIZE / SLIME_N_SAMPLES_PER_PROMPT to RL_GLOBAL_BATCH_SIZE / RL_ROLLOUT_GROUP_BATCH_SIZE / RL_GROUP_SIZE, with the script deriving whichever batch-size var is unset and asserting all three are positive.

Pass --custom-reward-post-process-path rl.variable_group_rewards.post_process_rewards so per-group GRPO normalization handles the new variable-length groups, and enable --use-dynamic-global-batch-size on the trainer.
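The derive-whichever-is-unset behavior could look something like the sketch below. The PR text does not state how the three values relate, so this assumes RL_GLOBAL_BATCH_SIZE = RL_ROLLOUT_GROUP_BATCH_SIZE × RL_GROUP_SIZE and that at least two of them are set; the real example script is a launcher script and may use a different rule. The function name `derive_batch_sizes` is hypothetical.

```python
NAMES = ("RL_GLOBAL_BATCH_SIZE", "RL_ROLLOUT_GROUP_BATCH_SIZE", "RL_GROUP_SIZE")


def derive_batch_sizes(env):
    """Fill in the one unset batch-size variable from the other two.

    ASSUMPTION: global = rollout_group_batch * group_size.
    """
    values = {n: int(env[n]) if env.get(n) else None for n in NAMES}
    unset = [n for n, v in values.items() if v is None]
    if len(unset) > 1:
        raise ValueError(f"set at least two of {NAMES}; unset: {unset}")
    for n, v in values.items():
        if v is not None and v <= 0:
            raise ValueError(f"{n} must be positive, got {v}")

    g, r, k = (values[n] for n in NAMES)
    if g is None:
        g = r * k
    elif r is None:
        r, rem = divmod(g, k)
        if rem:
            raise ValueError("global batch size is not divisible by group size")
    elif k is None:
        k, rem = divmod(g, r)
        if rem:
            raise ValueError("global batch size is not divisible by rollout group batch size")

    result = dict(zip(NAMES, (g, r, k)))
    # Mirrors the "all three are positive" assertion mentioned in the commit message.
    assert all(v > 0 for v in result.values())
    return result
```

For example, under the stated assumption, setting only RL_ROLLOUT_GROUP_BATCH_SIZE=8 and RL_GROUP_SIZE=4 would yield RL_GLOBAL_BATCH_SIZE=32.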

Signed-off-by: zeocax <zeocax@zeocax.com>

zeocax added 3 commits May 21, 2026 18:43

two-tiger requested review from WangXuhongCN and two-tiger and removed request for WangXuhongCN May 21, 2026 10:57

Fix system prompt clearing issue.

63f8229

Signed-off-by: zeocax <zeocax@zeocax.com>

zeocax wants to merge 4 commits into AI45Lab:main from zeocax:feature/variable-group-trajectories
