As we discussed in comment #470 (comment), I would like to add two RL algorithms as features. Here is the guide I intend to follow; please check it and verify whether adding these two algorithms is feasible:
DAPO and GiGPO Implementation Guide for Trinity-RFT
This document explains how to add two group-based RL algorithms from the referenced papers into Trinity-RFT:
| DAPO technique | Existing support | Gap |
| --- | --- | --- |
| Decoupled clipping | PPOPolicyLossFn supports clip_range_low / clip_range_high separately | No algorithm_type: dapo bundle; not documented as DAPO defaults |
| Token-level policy loss | policy_loss_fn_args.loss_agg_mode: token-mean | Same: needs DAPO defaults |
| Overlong reward shaping | MathDAPORewardFn in trinity/common/rewards/dapo_reward.py | Wired in examples/dapo_math/dapo.yaml but algorithm is still grpo |
| Dynamic sampling | RewardSTDFilter drops zero-variance groups | No batch-size replenishment; no explicit "accuracy ∈ {0,1}" filter naming |
examples/dapo_math/README.md states DAPO algorithm is still WIP; only the DAPO-Math dataset + GRPO + DAPO-style reward/clip are demonstrated.
GiGPO support
GiGPO is not registered. Multi-turn infrastructure exists (EID.run, EID.step, multi_step_grpo, AgentScope / ALFWorld examples) but there is no hierarchical episode+step advantage or anchor-state grouping.
2. Trinity extension model (shared by both algorithms)
Trinity decomposes RL into pluggable modules (see AlgorithmType in trinity/algorithm/algorithm.py):
flowchart LR
Explorer --> Buffer
Buffer --> Trainer
subgraph algorithm_bundle
SampleStrategy
AdvantageFn
PolicyLossFn
KLFn
EntropyLossFn
end
Trainer --> algorithm_bundle
Typical work for a new algorithm:
Implement only modules that differ from GRPO / multi_step_grpo.
Register classes in advantage_fn, policy_loss_fn, and optionally buffer/operators.
Add XxxAlgorithm(AlgorithmType) + "xxx" in ALGORITHM_TYPE.
Add examples/<xxx>/ with YAML + README and tests under tests/algorithm/.
Recommended first PR path: prototype in trinity/plugins/, validate, then upstream to core registries (see CONTRIBUTING.md).
3. DAPO implementation plan
3.1 What DAPO changes (paper summary)
DAPO keeps a GRPO-style critic-free group baseline and fixes long-CoT training with four techniques (Section 3, arXiv:2503.14476):
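Clip-Higher: decouple the lower and upper clip ranges (clip_range_low / clip_range_high) so the upper bound can be raised independently of the lower one.
Dynamic sampling: drop prompt groups where all rollouts are correct or all incorrect (zero learning signal); resample until the batch has enough "informative" groups.
Token-level policy gradient loss: average the loss over all tokens in the batch instead of taking a per-sequence mean first (important when responses are very long).
Overlong reward shaping: apply a soft penalty as responses approach the maximum length (stabilizes format/length).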
Mathematically, DAPO is still group-relative advantage + clipped importance-weighted policy gradient; the novelty is training system details, not a new critic or value network.
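To make the token-level aggregation concrete, here is a minimal sketch in plain Python. The function names are hypothetical and are not Trinity APIs; the only point is how the two aggregation modes weight a short and a long response differently.

```python
from typing import List


def seq_mean_loss(per_token_losses: List[List[float]]) -> float:
    """Average each response's tokens first, then average across responses."""
    per_seq = [sum(seq) / len(seq) for seq in per_token_losses]
    return sum(per_seq) / len(per_seq)


def token_mean_loss(per_token_losses: List[List[float]]) -> float:
    """Average over every token in the batch, so long responses weigh more."""
    total = sum(sum(seq) for seq in per_token_losses)
    count = sum(len(seq) for seq in per_token_losses)
    return total / count


# One short response (2 tokens) and one long response (8 tokens).
losses = [[1.0, 1.0], [0.0] * 8]
print(seq_mean_loss(losses))    # 0.5: each response counts equally
print(token_mean_loss(losses))  # 0.2: each token counts equally
```

With the per-sequence mean, the two-token response contributes half of the loss; with the token mean, it contributes in proportion to its length.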
Align numeric hyperparameters with examples/dapo_math/dapo.yaml and the open DAPO/verl recipe when reproducing paper numbers.
3.3 Dynamic sampling operator (main new code)
Behavior: For each task group (same prompt / eid.tid), compute rollout rewards. If std(rewards) == 0 (all pass or all fail), exclude the whole group from the trainer batch. Optionally trigger additional explorer rollouts until batch_size valid groups are collected (paper: keep effective gradient count stable).
Relation to existing code: RewardSTDFilter in trinity/buffer/operators/filters/reward_filter.py already skips groups with variance <= threshold. Extend or replace with:
DAPODynamicSamplingFilter: explicit metrics dropped_all_correct, dropped_all_wrong, kept_groups
Optional min_valid_groups hook for explorer scheduling (may require a small change in explorer/buffer batch assembly if resampling is not only filter-side)
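The filter-side half of this behavior can be sketched as below. This is a standalone illustration, not the Trinity operator interface: the function name, the `(task_id, reward)` input shape, and the "reward > 0 means correct" rule are assumptions made for the example. Resampling is not shown.

```python
from collections import defaultdict
from statistics import pstdev
from typing import Dict, List, Tuple


def filter_informative_groups(
    rollouts: List[Tuple[str, float]], eps: float = 1e-8
) -> Tuple[Dict[str, List[float]], Dict[str, int]]:
    """Group rewards by task id and drop groups whose rewards do not vary."""
    groups: Dict[str, List[float]] = defaultdict(list)
    for task_id, reward in rollouts:
        groups[task_id].append(reward)

    kept: Dict[str, List[float]] = {}
    metrics = {"dropped_all_correct": 0, "dropped_all_wrong": 0, "kept_groups": 0}
    for task_id, rewards in groups.items():
        if pstdev(rewards) <= eps:
            # Assumption: binary rewards, where a positive reward means "correct".
            key = "dropped_all_correct" if rewards[0] > 0 else "dropped_all_wrong"
            metrics[key] += 1
        else:
            kept[task_id] = rewards
            metrics["kept_groups"] += 1
    return kept, metrics
```

A group with a single rollout has zero standard deviation and is therefore dropped, which matches the N=1 edge case a unit test would need to cover.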
PPOPolicyLossFn already implements decoupled clipping via separate clip_range_low / clip_range_high. A dedicated DAPOPolicyLossFn subclass is optional (defaults only) for clarity in configs and docs.
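For reference, decoupled clipping on a single token can be sketched as follows. This is an illustrative scalar version, not the PPOPolicyLossFn code; the function name is hypothetical, and the default bounds of 0.2 and 0.28 are the values reported in the DAPO paper, used here only as an example.

```python
import math


def clipped_surrogate_loss(
    logp_new: float,
    logp_old: float,
    advantage: float,
    clip_range_low: float = 0.2,
    clip_range_high: float = 0.28,
) -> float:
    """PPO-style clipped loss for one token with separate lower/upper bounds."""
    ratio = math.exp(logp_new - logp_old)
    clipped_ratio = min(max(ratio, 1.0 - clip_range_low), 1.0 + clip_range_high)
    # Pessimistic (min) surrogate, negated so that lower is better.
    return -min(ratio * advantage, clipped_ratio * advantage)
```

With a positive advantage, the ratio is capped at 1 + clip_range_high; with a negative advantage, it is floored at 1 - clip_range_low. Raising only the upper bound therefore leaves the behavior for negative advantages unchanged.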
Episode-level (A^E(\tau_i)): same as GRPO over (N) full trajectories sharing task (x) and initial state — normalize total return (R(\tau_i)) within the episode group.
Step-level (A^S(a^{(i)}_t)): build anchor state groups (G^S(\tilde{s})) by hashing environment states (\tilde{s}) seen across trajectories; compare discounted returns (R^{(i)}_t = \sum_{k \ge t} \gamma^{k-t} r^{(i)}_k) within each group.
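The discounted return used for the step-level comparison can be computed in a single backward pass over one trajectory's step rewards. The helper below is a hypothetical illustration, not an existing Trinity function.

```python
from typing import List


def discounted_returns(step_rewards: List[float], gamma: float = 1.0) -> List[float]:
    """Return R_t = sum_{k >= t} gamma^(k - t) * r_k for every step t."""
    returns = [0.0] * len(step_rewards)
    running = 0.0
    for t in reversed(range(len(step_rewards))):
        running = step_rewards[t] + gamma * running
        returns[t] = running
    return returns


print(discounted_returns([0.0, 0.0, 1.0], gamma=0.5))  # [0.25, 0.5, 1.0]
```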
One Experience per environment step (multi-turn action_mask).
Set eid.run ∈ {0..N-1} for trajectory index within a task group.
Set eid.step for time index (t).
Store in experience.info:
env_state_hash (or canonical serialized observation) for anchor grouping
step_reward (r^{(i)}_t) (scalar per step)
optional episode_return at terminal step for cross-check
Anchor state hash: Must be deterministic for “same environment state” (e.g., ALFWorld room layout string, WebShop page DOM fingerprint). GiGPO groups all ((a,r)) with matching env_state_hash across runs and steps.
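One way to get a deterministic key is to hash a canonical serialization of the observation. The sketch below assumes the observation is JSON-serializable, which may not hold for every environment; the function name mirrors the proposed info key but is otherwise hypothetical.

```python
import hashlib
import json
from typing import Any


def env_state_hash(observation: Any) -> str:
    """Hash a JSON-serializable observation so equal states yield equal keys."""
    # sort_keys and fixed separators make the string independent of dict order.
    canonical = json.dumps(observation, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Environments whose observations contain volatile fields (timestamps, step counters, random ids) would need those stripped before hashing; otherwise identical states never land in the same anchor group.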
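| Component | Trinity piece | Location / change |
| --- | --- | --- |
| Advantage | GiGPOAdvantageFn | trinity/algorithm/advantage_fn/gigpo_advantage.py (new) |
| Policy loss | ppo (clipped, token-mean) | |
| Algorithm | GiGPOAlgorithm | trinity/algorithm/algorithm.py, __init__.py |
| Workflow | env_state_hash + step rewards | e.g. step_wise_alfworld_workflow, AgentScope workflows |
| Explorer | repeat_times: N trajectories per task | Config only |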
GiGPOAdvantageFn algorithm (pseudocode)
Implement AdvantageFn (or extend patterns from StepWiseGRPOAdvantageFn + GRPOGroupedAdvantage):
process(batch of experiences):
    # 1) Episode-level
    for each task group (eid.tid):
        for each run (eid.rid):
            R_i = sum_t step_reward or terminal reward
        compute A_E per run (mean/std or mean/1 normalization)
        broadcast A_E to every step in that run
    # 2) Step-level anchor groups
    build map: env_state_hash -> list of (experience, discounted_return R_t)
    for each hash group with |group| >= 2:
        compute A_S for each member (normalize R_t within group)
    for singleton groups: A_S = 0
    # 3) Combine
    for each experience:
        advantages = (A_E + omega * A_S) * action_mask
        returns = advantages.clone()
Discount (\gamma): Configurable advantage_fn_args.gamma (paper uses standard RL discount; agent tasks often use (\gamma \approx 1) for sparse terminal reward).
Normalization: fnorm: std | none where none means divide by 1 (RLOO-style).
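The pseudocode above can be turned into a small runnable sketch on scalar per-step advantages. It is not the Trinity AdvantageFn interface: the Step dataclass and function names are hypothetical, token-level action_mask broadcasting is omitted, and anchor groups are keyed by (task, state hash), which is an assumption about scoping that the pseudocode leaves open.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean, pstdev
from typing import Dict, List, Tuple


@dataclass
class Step:
    task: str          # task group id
    run: int           # trajectory index within the task group
    step: int          # time index t
    state_hash: str    # anchor-state key
    reward: float      # per-step reward r_t
    advantage: float = 0.0


def _normalize(values: List[float], fnorm: str, eps: float) -> List[float]:
    mu = mean(values)
    scale = pstdev(values) + eps if fnorm == "std" else 1.0
    return [(v - mu) / scale for v in values]


def gigpo_advantages(steps: List[Step], omega: float = 1.0, gamma: float = 1.0,
                     fnorm: str = "none", eps: float = 1e-6) -> List[Step]:
    runs: Dict[Tuple[str, int], List[Step]] = defaultdict(list)
    for s in steps:
        runs[(s.task, s.run)].append(s)

    # Discounted return R_t for every step, computed backwards per trajectory.
    ret: Dict[int, float] = {}
    for traj in runs.values():
        traj.sort(key=lambda s: s.step)
        running = 0.0
        for s in reversed(traj):
            running = s.reward + gamma * running
            ret[id(s)] = running

    # 1) Episode-level: normalize total return across runs of the same task.
    by_task: Dict[str, List[Tuple[int, float]]] = defaultdict(list)
    for (task, run), traj in runs.items():
        by_task[task].append((run, sum(s.reward for s in traj)))
    a_e: Dict[Tuple[str, int], float] = {}
    for task, items in by_task.items():
        normed = _normalize([r for _, r in items], fnorm, eps)
        for (run, _), a in zip(items, normed):
            a_e[(task, run)] = a

    # 2) Step-level: normalize R_t inside each anchor-state group.
    anchors: Dict[Tuple[str, str], List[Step]] = defaultdict(list)
    for s in steps:
        anchors[(s.task, s.state_hash)].append(s)
    a_s: Dict[int, float] = {}
    for members in anchors.values():
        if len(members) < 2:
            a_s[id(members[0])] = 0.0  # singleton group: no step-level signal
            continue
        normed = _normalize([ret[id(m)] for m in members], fnorm, eps)
        for m, a in zip(members, normed):
            a_s[id(m)] = a

    # 3) Combine.
    for s in steps:
        s.advantage = a_e[(s.task, s.run)] + omega * a_s[id(s)]
    return steps
```

In a real implementation each scalar would be broadcast over the step's response tokens and multiplied by action_mask, as in the pseudocode.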
Register as "gigpo": "trinity.algorithm.advantage_fn.gigpo_advantage.GiGPOAdvantageFn".
Default GiGPOAlgorithm.default_config() (target)
algorithm:
  algorithm_type: gigpo
  repeat_times: 8
  advantage_fn: gigpo
  advantage_fn_args:
    omega: 1.0
    gamma: 1.0
    fnorm: none  # or std for math-like grouping
    epsilon: 1e-6
  policy_loss_fn: ppo
  policy_loss_fn_args:
    clip_range_low: 0.2
    clip_range_high: 0.2
    loss_agg_mode: token-mean
  kl_penalty_fn: none
  kl_loss_fn: k2
  entropy_loss_fn: default
explorer_input:
  taskset:
    rollout_args:
      # N trajectories per task
    workflow_args:
      max_env_steps: 30
  default_workflow_type: step_wise_alfworld_workflow  # after hash support
Base on MultiStepGRPOAlgorithm flags: use_critic: false, compute_advantage_in_trainer: false, schema: experience.
4.4 Difference vs multi_step_grpo
| | multi_step_grpo | GiGPO |
| --- | --- | --- |
| Grouping | Task-level GRPO on last step only; broadcast to earlier steps | Episode group + anchor-state step groups |
| Credit | Same scalar advantage for all steps in a run | (A^E + \omega A^S) per step |
| State identity | Not used | env_state_hash required |
| Extra rollouts | No | No |
GiGPO is not a small config tweak; it needs the new advantage class and workflow metadata.
4.5 Orthogonality with DAPO
The GiGPO paper notes compatibility with group-based methods including DAPO. A future gigpo + DAPO-style clip/reward could be:
Referenced papers: arXiv:2503.14476 (DAPO), arXiv:2505.10978 (GiGPO)
Related onboarding docs:
docs/sphinx_doc/source/tutorial/develop_algorithm.md
1. Current status in Trinity-RFT
Already supported (baseline)
Registered algorithm_type values live in trinity/algorithm/__init__.py. Relevant baselines:
grpo: grouped outcome advantage + PPO-style clipped loss (GRPOAlgorithm)
multi_step_grpo: multi-turn rollouts with last-step GRPO broadcast (MultiStepGRPOAlgorithm + step_wise_grpo advantage)
Partial DAPO support (not a dedicated algorithm)
| DAPO technique | Existing support | Gap |
| --- | --- | --- |
| Decoupled clipping | PPOPolicyLossFn supports clip_range_low / clip_range_high separately | No algorithm_type: dapo bundle; not documented as DAPO defaults |
| Token-level policy loss | policy_loss_fn_args.loss_agg_mode: token-mean | Same: needs DAPO defaults |
| Overlong reward shaping | MathDAPORewardFn in trinity/common/rewards/dapo_reward.py | Wired in examples/dapo_math/dapo.yaml but algorithm is still grpo |
| Dynamic sampling | RewardSTDFilter drops zero-variance groups | No batch-size replenishment; no explicit "accuracy ∈ {0,1}" filter naming |
examples/dapo_math/README.md states that the DAPO algorithm is still WIP; only the DAPO-Math dataset + GRPO + DAPO-style reward/clip are demonstrated.
GiGPO support
GiGPO is not registered. Multi-turn infrastructure exists (EID.run, EID.step, multi_step_grpo, AgentScope / ALFWorld examples), but there is no hierarchical episode+step advantage or anchor-state grouping.
2. Trinity extension model (shared by both algorithms)
Trinity decomposes RL into pluggable modules (see AlgorithmType in trinity/algorithm/algorithm.py):
flowchart LR
Explorer --> Buffer
Buffer --> Trainer
subgraph algorithm_bundle
SampleStrategy
AdvantageFn
PolicyLossFn
KLFn
EntropyLossFn
end
Trainer --> algorithm_bundle
Typical work for a new algorithm:
Implement only modules that differ from GRPO / multi_step_grpo.
Register classes in advantage_fn, policy_loss_fn, and optionally buffer/operators.
Add XxxAlgorithm(AlgorithmType) + "xxx" in ALGORITHM_TYPE.
Add examples/<xxx>/ with YAML + README and tests under tests/algorithm/.
Recommended first PR path: prototype in trinity/plugins/, validate, then upstream to core registries (see CONTRIBUTING.md).
3. DAPO implementation plan
3.1 What DAPO changes (paper summary)
DAPO keeps a GRPO-style critic-free group baseline and fixes long-CoT training with four techniques (Section 3, arXiv:2503.14476):
Mathematically, DAPO is still group-relative advantage + clipped importance-weighted policy gradient; the novelty is training system details, not a new critic or value network.
3.2 Mapping to Trinity modules
DAPOAlgorithm defaults: trinity/algorithm/algorithm.py, trinity/algorithm/__init__.py
grpo (GRPOGroupedAdvantage)
ppo with decoupled clip + token-mean
dapo → thin wrapper over PPOPolicyLossFn for discoverability
math_dapo_reward: trinity/common/rewards/dapo_reward.py
trinity/buffer/operators/filters/dapo_dynamic_sampling.py (new)
Default DAPOAlgorithm.default_config() (target)
Align numeric hyperparameters with examples/dapo_math/dapo.yaml and the open DAPO/verl recipe when reproducing paper numbers.
3.3 Dynamic sampling operator (main new code)
Behavior: For each task group (same prompt / eid.tid), compute rollout rewards. If std(rewards) == 0 (all pass or all fail), exclude the whole group from the trainer batch. Optionally trigger additional explorer rollouts until batch_size valid groups are collected (paper: keep effective gradient count stable).
Relation to existing code: RewardSTDFilter in trinity/buffer/operators/filters/reward_filter.py already skips groups with variance <= threshold. Extend or replace with:
DAPODynamicSamplingFilter: explicit metrics dropped_all_correct, dropped_all_wrong, kept_groups
Optional min_valid_groups hook for explorer scheduling (may require a small change in explorer/buffer batch assembly if resampling is not only filter-side)
Registration: trinity/buffer/operators/__init__.py → "dapo_dynamic_sampling": "...".
3.4 Policy loss: optional dapo alias
PPOPolicyLossFn already implements decoupled clipping via separate clip_range_low / clip_range_high. A dedicated DAPOPolicyLossFn subclass is optional (defaults only) for clarity in configs and docs.
3.5 Example and tests
examples/dapo_math/dapo.yaml: change algorithm_type: grpo → dapo
examples/dapo_math/README.md: document four techniques and metrics
tests/algorithm/test_dapo_dynamic_sampling.py: group filter edge cases (N=1, all 0, all 1, mixed)
3.6 Suggested PR sequence (DAPO)
DAPOAlgorithm + registry + update examples/dapo_math to algorithm_type: dapo (reuse existing reward + clip args).
DAPODynamicSamplingFilter + buffer config wiring + tests.
4. GiGPO implementation plan
4.1 What GiGPO changes (paper summary)
GiGPO targets multi-turn LLM agents where GRPO only assigns one advantage per trajectory. It stays critic-free and avoids extra per-state rollouts.
Two-level advantages (Equation 8, arXiv:2505.10978):
Normalization (F_{\text{norm}}) can be std (GRPO-style) or 1 (RLOO-style); agent benchmarks in the paper often benefit from (F_{\text{norm}}=1).
Policy update: standard clipped objective over each step’s tokens (same family as GRPO/PPO on multi-turn experiences).
Reference implementation (external): langfengQ/verl-agent (cited in the paper).
4.2 Prerequisites in Trinity (explorer / workflow)
GiGPO requires per-step experiences with stable grouping keys. Trinity already documents this on EID:
Workflow contract (new or extended):
One Experience per environment step (multi-turn action_mask).
Set eid.run ∈ {0..N-1} for trajectory index within a task group.
Set eid.step for time index (t).
Store in experience.info:
env_state_hash (or canonical serialized observation) for anchor grouping
step_reward (r^{(i)}_t) (scalar per step)
optional episode_return at terminal step for cross-check
Anchor state hash: Must be deterministic for “same environment state” (e.g., ALFWorld room layout string, WebShop page DOM fingerprint). GiGPO groups all ((a,r)) with matching env_state_hash across runs and steps.
Existing references:
examples/grpo_alfworld_general_multi_step/: multi_step_grpo + step-wise workflows
examples/agentscope_*: ReAct / tool agents (good GiGPO targets after hash plumbing)
4.3 Mapping to Trinity modules
GiGPOAdvantageFn: trinity/algorithm/advantage_fn/gigpo_advantage.py (new)
ppo (clipped, token-mean)
GiGPOAlgorithm: trinity/algorithm/algorithm.py, __init__.py
env_state_hash + step rewards: e.g. step_wise_alfworld_workflow, AgentScope workflows
Explorer: repeat_times: N trajectories per task (config only)
GiGPOAdvantageFn algorithm (pseudocode)
Implement AdvantageFn (or extend patterns from StepWiseGRPOAdvantageFn + GRPOGroupedAdvantage):
Discount (\gamma): Configurable advantage_fn_args.gamma (paper uses standard RL discount; agent tasks often use (\gamma \approx 1) for sparse terminal reward).
Normalization: fnorm: std | none where none means divide by 1 (RLOO-style).
Register as "gigpo": "trinity.algorithm.advantage_fn.gigpo_advantage.GiGPOAdvantageFn".
Default GiGPOAlgorithm.default_config() (target)
Base on MultiStepGRPOAlgorithm flags: use_critic: false, compute_advantage_in_trainer: false, schema: experience.
4.4 Difference vs multi_step_grpo
| | multi_step_grpo | GiGPO |
| --- | --- | --- |
| Grouping | Task-level GRPO on last step only; broadcast to earlier steps | Episode group + anchor-state step groups |
| Credit | Same scalar advantage for all steps in a run | (A^E + \omega A^S) per step |
| State identity | Not used | env_state_hash required |
| Extra rollouts | No | No |
GiGPO is not a small config tweak; it needs the new advantage class and workflow metadata.
4.5 Orthogonality with DAPO
The GiGPO paper notes compatibility with group-based methods including DAPO. A future gigpo + DAPO-style clip/reward could be:
for agent tasks with very long generations (less common than math DAPO setup).
4.6 Example and tests
examples/gigpo_alfworld/gigpo.yaml (fork grpo_alfworld_general_multi_step)
env_state_hash
tests/algorithm/test_gigpo_advantage.py: synthetic trajectories with repeated states
examples/agentscope_frozenlake) with toy hash
4.7 Suggested PR sequence (GiGPO)
GiGPOAdvantageFn + unit tests (no workflow changes; mock info["env_state_hash"]).
GiGPOAlgorithm + registry + examples/gigpo_alfworld using existing workflow + hash in one environment.
multi_step_grpo vs gigpo.
5. Side-by-side comparison
GiGPOAdvantageFn + workflow state hash
grpo, math_dapo_reward, PPOPolicyLossFn
multi_step_grpo explorer pattern, ppo loss
examples/dapo_math
examples/grpo_alfworld_general_multi_step → examples/gigpo_*
6. Registry checklist (both algorithms)
When upstreaming from plugins to core:
trinity/algorithm/advantage_fn/__init__.py: register gigpo (DAPO reuses grpo)
trinity/algorithm/policy_loss_fn/__init__.py: optional dapo alias
trinity/algorithm/algorithm.py: DAPOAlgorithm, GiGPOAlgorithm
trinity/algorithm/__init__.py: "dapo", "gigpo" in ALGORITHM_TYPE
trinity/buffer/operators/__init__.py: dapo_dynamic_sampling
tests/algorithm/: unit tests
docs/rl_algorithm_improvement_guide.md: mark DAPO/GiGPO as implemented
README.md: supported algorithms list
Run before PR:
7. Validation metrics
DAPO
pg_clipfrac, KL, reward mean/std per group
algorithm_type: grpo with same data)
GiGPO
multi_step_grpo
fnorm: std vs none
8. References
docs/sphinx_doc/source/tutorial/develop_algorithm.md
trinity/algorithm/advantage_fn/grpo_advantage.py
trinity/algorithm/advantage_fn/multi_step_grpo_advantage.py