[FEATURE] DAPO and GiGPO Implementation Suggestion #551

@artaasd95

As discussed in #470 (comment), I want to add two RL algorithms as features. Below is the guide I plan to follow; please check it and confirm whether adding these two algorithms is feasible:

DAPO and GiGPO Implementation Guide for Trinity-RFT

This document explains how to add two group-based RL algorithms from the referenced papers into Trinity-RFT:

| Paper | arXiv | Algorithm | Primary use case |
|---|---|---|---|
| DAPO: An Open-Source LLM Reinforcement Learning System at Scale | 2503.14476 | DAPO | Long-CoT math/reasoning (single-turn, outcome reward) |
| Group-in-Group Policy Optimization for LLM Agent Training | 2505.10978 | GiGPO | Multi-turn LLM agents (sparse rewards, step credit) |

Related onboarding docs:

Naming note: Earlier Trinity docs mention “GiGRPO”; the NeurIPS 2025 paper defines GiGPO (Group-in-Group Policy Optimization). This guide uses GiGPO and registers algorithm_type: gigpo.


1. Current status in Trinity-RFT

Already supported (baseline)

Registered algorithm_type values live in trinity/algorithm/__init__.py. Relevant baselines:

  • grpo — grouped outcome advantage + PPO-style clipped loss (GRPOAlgorithm)
  • multi_step_grpo — multi-turn rollouts with last-step GRPO broadcast (MultiStepGRPOAlgorithm + step_wise_grpo advantage)

Partial DAPO support (not a dedicated algorithm)

| DAPO technique | Trinity today | Gap |
|---|---|---|
| Decoupled clip (Clip-Higher) | PPOPolicyLossFn supports clip_range_low / clip_range_high separately | No algorithm_type: dapo bundle; not documented as DAPO defaults |
| Token-level policy loss | policy_loss_fn_args.loss_agg_mode: token-mean | Same: needs DAPO defaults |
| Overlong reward shaping | MathDAPORewardFn in trinity/common/rewards/dapo_reward.py | Wired in examples/dapo_math/dapo.yaml but algorithm is still grpo |
| Dynamic sampling | RewardSTDFilter drops zero-variance groups | No batch-size replenishment; no explicit “accuracy ∈ {0,1}” filter naming |

examples/dapo_math/README.md states that the DAPO algorithm is still WIP; it only demonstrates the DAPO-Math dataset with GRPO plus a DAPO-style reward and clip.

GiGPO support

GiGPO is not registered. Multi-turn infrastructure exists (EID.run, EID.step, multi_step_grpo, AgentScope / ALFWorld examples) but there is no hierarchical episode+step advantage or anchor-state grouping.


2. Trinity extension model (shared by both algorithms)

Trinity decomposes RL into pluggable modules (see AlgorithmType in trinity/algorithm/algorithm.py):

flowchart LR
  Explorer --> Buffer
  Buffer --> Trainer
  subgraph algorithm_bundle
    SampleStrategy
    AdvantageFn
    PolicyLossFn
    KLFn
    EntropyLossFn
  end
  Trainer --> algorithm_bundle

Typical work for a new algorithm:

  1. Implement only modules that differ from GRPO / multi_step_grpo.
  2. Register classes in advantage_fn, policy_loss_fn, and optionally buffer/operators.
  3. Add XxxAlgorithm(AlgorithmType) + "xxx" in ALGORITHM_TYPE.
  4. Add examples/<xxx>/ with YAML + README and tests under tests/algorithm/.

Recommended first PR path: prototype in trinity/plugins/, validate, then upstream to core registries (see CONTRIBUTING.md).


3. DAPO implementation plan

3.1 What DAPO changes (paper summary)

DAPO keeps a GRPO-style critic-free group baseline and fixes long-CoT training with four techniques (Section 3, arXiv:2503.14476):

  1. Clip-Higher (decoupled clip) — asymmetric PPO clip: lower bound for negative advantages, higher upper bound for positive advantages (reduces entropy collapse).
  2. Dynamic sampling — drop prompt groups where all rollouts are correct or all incorrect (zero learning signal); resample until the batch has enough “informative” groups.
  3. Token-level policy gradient loss — aggregate loss per token, not per sequence mean (important when responses are very long).
  4. Overlong reward shaping — soft penalty as responses approach max length (stabilizes format/length).

Mathematically, DAPO is still group-relative advantage + clipped importance-weighted policy gradient; the novelty is training system details, not a new critic or value network.
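
To make the Clip-Higher and token-level loss items concrete, here is a minimal NumPy sketch of a clipped surrogate with separate lower and upper clip ranges and token-mean aggregation. The function name `dapo_policy_loss` and its array-based signature are illustrative assumptions; this is not Trinity's PPOPolicyLossFn, only the same idea in a few lines.

```python
import numpy as np


def dapo_policy_loss(logprob, old_logprob, advantages, action_mask,
                     clip_range_low=0.2, clip_range_high=0.28):
    """Clipped surrogate loss with decoupled clip ranges and token-mean aggregation.

    Hypothetical helper for illustration. All arrays have shape
    (batch, seq_len); action_mask is 1 on response tokens and 0 elsewhere.
    """
    ratio = np.exp(logprob - old_logprob)
    # Decoupled clip: the upper bound (1 + clip_range_high) can be looser
    # than the lower bound (1 - clip_range_low).
    clipped = np.clip(ratio, 1.0 - clip_range_low, 1.0 + clip_range_high)
    # PPO keeps the pessimistic (larger) of the two negated surrogates.
    per_token = np.maximum(-advantages * ratio, -advantages * clipped)
    # Token-mean: a single average over all response tokens in the batch,
    # so long responses contribute proportionally more terms.
    return float((per_token * action_mask).sum() / max(action_mask.sum(), 1.0))
```

With a positive advantage and an importance ratio of 1.5, the symmetric setting (0.2 / 0.2) caps the surrogate at 1.2, while the Clip-Higher setting (0.2 / 0.28) caps it at 1.28, which leaves slightly more room to raise the probability of such tokens.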

3.2 Mapping to Trinity modules

| Component | Implementation | Files to touch |
|---|---|---|
| Algorithm bundle | DAPOAlgorithm defaults | trinity/algorithm/algorithm.py, trinity/algorithm/__init__.py |
| Advantage | Reuse grpo (GRPOGroupedAdvantage) | No new file required initially |
| Policy loss | Reuse ppo with decoupled clip + token-mean | Optionally alias dapo → thin wrapper over PPOPolicyLossFn for discoverability |
| Reward | Reuse math_dapo_reward | Already in trinity/common/rewards/dapo_reward.py |
| Dynamic sampling | New buffer/explorer filter | trinity/buffer/operators/filters/dapo_dynamic_sampling.py (new) |

Default DAPOAlgorithm.default_config() (target)

algorithm:
  algorithm_type: dapo
  repeat_times: 16
  advantage_fn: grpo
  policy_loss_fn: ppo
  policy_loss_fn_args:
    clip_range_low: 0.2
    clip_range_high: 0.28      # Clip-Higher
    loss_agg_mode: token-mean  # Token-level loss
  kl_penalty_fn: none
  kl_loss_fn: k2
  entropy_loss_fn: default

buffer:
  # pipeline operator (exact config key depends on buffer schema)
  operators:
    - type: dapo_dynamic_sampling
      min_std: 1e-6
      resample: true

explorer_input:
  reward_fn_args:
    enable_overlong_penalty: true
    penalty_factor: 1.0
    max_response_length: 20480
    cache_length: 4096

Align numeric hyperparameters with examples/dapo_math/dapo.yaml and the open DAPO/verl recipe when reproducing paper numbers.
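
For reference, the overlong-penalty arguments in the target config can be read against the soft overlong punishment described in the DAPO paper: no penalty up to `max_response_length - cache_length`, a linear ramp inside the cache window, and the full penalty beyond the maximum. The sketch below assumes that `max_response_length` plays the role of the paper's maximum length and that `penalty_factor` simply scales the result; the function name is hypothetical, and MathDAPORewardFn may differ in detail.

```python
def soft_overlong_penalty(response_length, max_response_length=20480,
                          cache_length=4096, penalty_factor=1.0):
    """Length penalty in [-penalty_factor, 0] (hypothetical helper).

    Assumes max_response_length is the hard maximum and cache_length is the
    soft-punishment window that ends at that maximum.
    """
    expected_length = max_response_length - cache_length
    if response_length <= expected_length:
        return 0.0
    if response_length <= max_response_length:
        # Linear ramp from 0 down to -penalty_factor inside the cache window.
        return (expected_length - response_length) / cache_length * penalty_factor
    return -penalty_factor
```

With the defaults above, a 16384-token response is not penalized, an 18432-token response receives -0.5, and anything at or past 20480 tokens receives -1.0.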

3.3 Dynamic sampling operator (main new code)

Behavior: For each task group (same prompt / eid.tid), compute rollout rewards. If std(rewards) == 0 (all pass or all fail), exclude the whole group from the trainer batch. Optionally trigger additional explorer rollouts until batch_size valid groups are collected (paper: keep effective gradient count stable).

Relation to existing code: RewardSTDFilter in trinity/buffer/operators/filters/reward_filter.py already skips groups with variance <= threshold. Extend or replace with:

  • DAPODynamicSamplingFilter — explicit metrics: dropped_all_correct, dropped_all_wrong, kept_groups
  • Optional min_valid_groups hook for explorer scheduling (may require a small change in explorer/buffer batch assembly if resampling is not only filter-side)

Registration: trinity/buffer/operators/__init__.py → "dapo_dynamic_sampling": "...".
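
The filtering behavior described in this section can be sketched independently of Trinity's operator interface. The example below works on plain `(group_key, reward)` pairs and assumes binary accuracy rewards (1.0 correct, 0.0 wrong); the function name and data layout are illustrative assumptions, and a real DAPODynamicSamplingFilter would operate on Experience objects instead.

```python
from collections import defaultdict
from statistics import mean, pstdev


def dapo_dynamic_sampling(rollouts, min_std=1e-6):
    """Drop prompt groups whose rewards have (near-)zero spread.

    Hypothetical helper: `rollouts` is a list of (group_key, reward) pairs,
    with rewards assumed to be 1.0 for correct and 0.0 for wrong answers.
    Returns (kept_rollouts, metrics).
    """
    groups = defaultdict(list)
    for key, reward in rollouts:
        groups[key].append(reward)

    metrics = {"kept_groups": 0, "dropped_all_correct": 0, "dropped_all_wrong": 0}
    kept_keys = set()
    for key, rewards in groups.items():
        # pstdev of a single-element group is 0.0, so N=1 groups are dropped.
        if pstdev(rewards) > min_std:
            kept_keys.add(key)
            metrics["kept_groups"] += 1
        elif mean(rewards) > 0.5:
            metrics["dropped_all_correct"] += 1
        else:
            metrics["dropped_all_wrong"] += 1

    # Preserve the original rollout order for the groups that survive.
    kept = [(key, reward) for key, reward in rollouts if key in kept_keys]
    return kept, metrics
```

A scheduler-side hook could then compare `metrics["kept_groups"]` with the desired number of valid groups and request more rollouts for the shortfall; how that request reaches the explorer is the open design question noted above.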

3.4 Policy loss: optional dapo alias

PPOPolicyLossFn already implements decoupled clipping via separate clip_range_low / clip_range_high. A dedicated DAPOPolicyLossFn subclass is optional (defaults only) for clarity in configs and docs.

3.5 Example and tests

| Deliverable | Path |
|---|---|
| End-to-end config | examples/dapo_math/dapo.yaml: change algorithm_type: grpo → dapo |
| README | examples/dapo_math/README.md: document four techniques and metrics |
| Unit tests | tests/algorithm/test_dapo_dynamic_sampling.py: group filter edge cases (N=1, all 0, all 1, mixed) |
| Regression | Existing GRPO examples unchanged |

3.6 Suggested PR sequence (DAPO)

  1. PR 1 (small): DAPOAlgorithm + registry + update examples/dapo_math to algorithm_type: dapo (reuse existing reward + clip args).
  2. PR 2 (medium): DAPODynamicSamplingFilter + buffer config wiring + tests.
  3. PR 3 (optional): Benchmark vs GRPO on DAPO-Math / AIME eval configs; document entropy and clip fraction metrics.

4. GiGPO implementation plan

4.1 What GiGPO changes (paper summary)

GiGPO targets multi-turn LLM agents where GRPO only assigns one advantage per trajectory. It stays critic-free and avoids extra per-state rollouts.

Two-level advantages (Equation 8, arXiv:2505.10978):

  • Episode-level (A^E(\tau_i)): same as GRPO over (N) full trajectories sharing task (x) and initial state — normalize total return (R(\tau_i)) within the episode group.
  • Step-level (A^S(a^{(i)}_t)): build anchor state groups (G^S(\tilde{s})) by hashing environment states (\tilde{s}) seen across trajectories; compare discounted returns (R^{(i)}_t = \sum_{k\ge t} \gamma^{k-t} r^{(i)}_k) within each group.
  • Combined: (A(a^{(i)}_t) = A^E(\tau_i) + \omega \cdot A^S(a^{(i)}_t)).

Normalization (F_{\text{norm}}) can be std (GRPO-style) or 1 (RLOO-style); agent benchmarks in the paper often benefit from (F_{\text{norm}}=1).
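
Written out with the symbols used above, the two advantage levels take the following form. This is a restatement from the definitions in this section rather than a verbatim copy of the paper's equations, so the exact notation should be checked against arXiv:2505.10978 before it is reused.

```latex
\[
\begin{aligned}
A^E(\tau_i) &= \frac{R(\tau_i) - \operatorname{mean}\bigl(\{R(\tau_j)\}_{j=1}^{N}\bigr)}
                    {F_{\text{norm}}\bigl(\{R(\tau_j)\}_{j=1}^{N}\bigr)}, \\[4pt]
A^S(a^{(i)}_t) &= \frac{R^{(i)}_t - \operatorname{mean}\bigl(\{R^{(j)}_k : (a^{(j)}_k, R^{(j)}_k) \in G^S(\tilde{s})\}\bigr)}
                       {F_{\text{norm}}\bigl(\{R^{(j)}_k : (a^{(j)}_k, R^{(j)}_k) \in G^S(\tilde{s})\}\bigr)}, \\[4pt]
A(a^{(i)}_t) &= A^E(\tau_i) + \omega \cdot A^S(a^{(i)}_t).
\end{aligned}
\]
```

Setting (F_{\text{norm}}) to the group standard deviation gives the GRPO-style variant, and setting it to 1 leaves only the mean-centering.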

Policy update: standard clipped objective over each step’s tokens (same family as GRPO/PPO on multi-turn experiences).

Reference implementation (external): langfengQ/verl-agent (cited in the paper).

4.2 Prerequisites in Trinity (explorer / workflow)

GiGPO requires per-step experiences with stable grouping keys. Trinity already documents this on EID:

    To enable the full functionality of the experience grouping, user should manually set the `run` and `step` fields in custom workflows.
    ...
    run: int = 0
    ...
    step: int = 0

Workflow contract (new or extended):

  1. One Experience per environment step (multi-turn action_mask).
  2. Set eid.run ∈ {0..N-1} as the trajectory index within a task group.
  3. Set eid.step for time index (t).
  4. Store in experience.info:
    • env_state_hash (or canonical serialized observation) for anchor grouping
    • step_reward (r^{(i)}_t) (scalar per step)
    • optional episode_return at terminal step for cross-check

Anchor state hash: Must be deterministic for “same environment state” (e.g., ALFWorld room layout string, WebShop page DOM fingerprint). GiGPO groups all ((a,r)) with matching env_state_hash across runs and steps.
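
One way to get a deterministic hash is to serialize the observation canonically and digest it with a cryptographic hash. The sketch below assumes the observation is JSON-serializable; the function name `env_state_hash` mirrors the info key proposed above but is otherwise hypothetical. Python's built-in `hash()` is not suitable here because string hashes are randomized per process by default.

```python
import hashlib
import json


def env_state_hash(observation):
    """Return a deterministic hex digest for a JSON-serializable observation.

    Hypothetical helper; the canonical form of a real environment state
    (ALFWorld, WebShop, ...) has to be chosen per environment.
    """
    # sort_keys and fixed separators make the serialization independent of
    # dict insertion order and whitespace.
    canonical = json.dumps(observation, sort_keys=True,
                           separators=(",", ":"), ensure_ascii=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Two workers that observe `{"room": "kitchen", "holding": None}` and `{"holding": None, "room": "kitchen"}` would then produce the same key, so their steps land in the same anchor group.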

Existing references:

  • examples/grpo_alfworld_general_multi_step/ (multi_step_grpo + step-wise workflows)
  • examples/agentscope_* — ReAct / tool agents (good GiGPO targets after hash plumbing)

4.3 Mapping to Trinity modules

| Component | Implementation | Files to touch |
|---|---|---|
| Advantage | GiGPOAdvantageFn | trinity/algorithm/advantage_fn/gigpo_advantage.py (new) |
| Policy loss | Reuse ppo (clipped, token-mean) | Same as multi-step GRPO |
| Algorithm bundle | GiGPOAlgorithm | trinity/algorithm/algorithm.py, __init__.py |
| Workflows | Emit env_state_hash + step rewards | e.g. step_wise_alfworld_workflow, AgentScope workflows |
| Explorer | repeat_times: N trajectories per task | Config only |

GiGPOAdvantageFn algorithm (pseudocode)

Implement AdvantageFn (or extend patterns from StepWiseGRPOAdvantageFn + GRPOGroupedAdvantage):

process(batch of experiences):
  # 1) Episode-level
  for each task group (eid.tid):
    for each run (eid.rid):
      R_i = sum_t step_reward or terminal reward
    compute A_E per run (mean/std or mean/1 normalization)
    broadcast A_E to every step in that run

  # 2) Step-level anchor groups
  build map: env_state_hash -> list of (experience, discounted_return R_t)
  for each hash group with |group| >= 2:
    compute A_S for each member (normalize R_t within group)
  for singleton groups: A_S = 0

  # 3) Combine
  for each experience:
    advantages = (A_E + omega * A_S) * action_mask
    returns = advantages.clone()

Discount (\gamma): Configurable advantage_fn_args.gamma (paper uses standard RL discount; agent tasks often use (\gamma \approx 1) for sparse terminal reward).

Normalization: fnorm: std | none where none means divide by 1 (RLOO-style).
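
The pseudocode above can be prototyped on plain dictionaries before it is wired into Trinity's AdvantageFn interface. The sketch below is a standalone illustration under several assumptions: each step is a dict with the keys `task`, `run`, `step`, `step_reward`, and `env_state_hash`; anchor groups are formed within a task (keyed by task and hash); the episode return is the undiscounted sum of step rewards; and the per-token multiplication by action_mask is left out. None of the names are existing Trinity APIs.

```python
from collections import defaultdict
from statistics import mean, pstdev


def _normalize(values, fnorm, epsilon):
    """Center values on their mean; divide by std only when fnorm == 'std'."""
    mu = mean(values)
    scale = pstdev(values) + epsilon if fnorm == "std" else 1.0
    return [(v - mu) / scale for v in values]


def gigpo_advantages(steps, omega=1.0, gamma=1.0, fnorm="none", epsilon=1e-6):
    """Return one scalar advantage per step, in input order (illustrative)."""
    # Collect step indices per trajectory, keyed by (task, run).
    runs = defaultdict(list)
    for idx, s in enumerate(steps):
        runs[(s["task"], s["run"])].append(idx)

    episode_return = {}
    discounted = [0.0] * len(steps)
    for key, idxs in runs.items():
        idxs.sort(key=lambda i: steps[i]["step"])
        running = 0.0
        for i in reversed(idxs):  # R_t = r_t + gamma * R_{t+1}
            running = steps[i]["step_reward"] + gamma * running
            discounted[i] = running
        episode_return[key] = sum(steps[i]["step_reward"] for i in idxs)

    # 1) Episode-level: normalize total returns across runs of the same task.
    by_task = defaultdict(list)
    for key in runs:
        by_task[key[0]].append(key)
    a_e = {}
    for keys in by_task.values():
        normed = _normalize([episode_return[k] for k in keys], fnorm, epsilon)
        a_e.update(zip(keys, normed))

    # 2) Step-level: normalize discounted returns inside each anchor group;
    #    singleton groups keep A_S = 0.
    anchors = defaultdict(list)
    for idx, s in enumerate(steps):
        anchors[(s["task"], s["env_state_hash"])].append(idx)
    a_s = [0.0] * len(steps)
    for idxs in anchors.values():
        if len(idxs) >= 2:
            normed = _normalize([discounted[i] for i in idxs], fnorm, epsilon)
            for i, value in zip(idxs, normed):
                a_s[i] = value

    # 3) Combine the two levels.
    return [a_e[(s["task"], s["run"])] + omega * a_s[i]
            for i, s in enumerate(steps)]
```

For two runs that share the start state, where only the first run earns a terminal reward of 1.0, the shared first step gets a larger advantage in the successful run than its later step does, which is the per-step credit that multi_step_grpo's broadcast cannot express.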

Register as "gigpo": "trinity.algorithm.advantage_fn.gigpo_advantage.GiGPOAdvantageFn".

Default GiGPOAlgorithm.default_config() (target)

algorithm:
  algorithm_type: gigpo
  repeat_times: 8
  advantage_fn: gigpo
  advantage_fn_args:
    omega: 1.0
    gamma: 1.0
    fnorm: none          # or std for math-like grouping
    epsilon: 1e-6
  policy_loss_fn: ppo
  policy_loss_fn_args:
    clip_range_low: 0.2
    clip_range_high: 0.2
    loss_agg_mode: token-mean
  kl_penalty_fn: none
  kl_loss_fn: k2
  entropy_loss_fn: default

explorer_input:
  taskset:
    rollout_args:
      # N trajectories per task
    workflow_args:
      max_env_steps: 30
  default_workflow_type: step_wise_alfworld_workflow  # after hash support

Base the bundle flags on MultiStepGRPOAlgorithm: use_critic: false, compute_advantage_in_trainer: false, schema: experience.

4.4 Difference vs multi_step_grpo

| | multi_step_grpo | GiGPO |
|---|---|---|
| Grouping | Task-level GRPO on last step only; broadcast to earlier steps | Episode group + anchor-state step groups |
| Credit | Same scalar advantage for all steps in a run | (A^E + \omega A^S) per step |
| State identity | Not used | env_state_hash required |
| Extra rollouts | No | No |

GiGPO is not a small config tweak; it needs the new advantage class and workflow metadata.

4.5 Orthogonality with DAPO

The GiGPO paper notes compatibility with group-based methods including DAPO. A future gigpo + DAPO-style clip/reward could be:

algorithm_type: gigpo
policy_loss_fn_args:
  clip_range_low: 0.2
  clip_range_high: 0.28

This would suit agent tasks with very long generations, which are less common than the math DAPO setup.

4.6 Example and tests

| Deliverable | Path |
|---|---|
| Minimal example | examples/gigpo_alfworld/gigpo.yaml (fork grpo_alfworld_general_multi_step) |
| README | Math for (A^E), (A^S), (\omega); how to set env_state_hash |
| Unit tests | tests/algorithm/test_gigpo_advantage.py: synthetic trajectories with repeated states |
| Integration | Optional smoke test on FrozenLake (examples/agentscope_frozenlake) with toy hash |

4.7 Suggested PR sequence (GiGPO)

  1. PR 1: GiGPOAdvantageFn + unit tests (no workflow changes; mock info["env_state_hash"]).
  2. PR 2: GiGPOAlgorithm + registry + examples/gigpo_alfworld using existing workflow + hash in one environment.
  3. PR 3: AgentScope / WebShop workflows + benchmark README comparing multi_step_grpo vs gigpo.

5. Side-by-side comparison

| Dimension | DAPO | GiGPO |
|---|---|---|
| Turn structure | Single-turn outcome reward | Multi-turn per-step experiences |
| Primary new code | Buffer dynamic sampling (+ algorithm bundle) | GiGPOAdvantageFn + workflow state hash |
| Reuse from Trinity | grpo, math_dapo_reward, PPOPolicyLossFn | multi_step_grpo explorer pattern, ppo loss |
| Example upgrade | examples/dapo_math | examples/grpo_alfworld_general_multi_step → examples/gigpo_* |
| Eval focus | AIME / math reasoning | ALFWorld, WebShop, search-augmented QA |

6. Registry checklist (both algorithms)

When upstreaming from plugins to core:

  • trinity/algorithm/advantage_fn/__init__.py — register gigpo (DAPO reuses grpo)
  • trinity/algorithm/policy_loss_fn/__init__.py — optional dapo alias
  • trinity/algorithm/algorithm.py → DAPOAlgorithm, GiGPOAlgorithm
  • trinity/algorithm/__init__.py → "dapo", "gigpo" in ALGORITHM_TYPE
  • trinity/buffer/operators/__init__.py → dapo_dynamic_sampling
  • tests/algorithm/ — unit tests
  • docs/rl_algorithm_improvement_guide.md — mark DAPO/GiGPO as implemented
  • README.md — supported algorithms list

Run before PR:

python -m pytest tests/algorithm/
pre-commit run --all-files

7. Validation metrics

DAPO

  • Training: policy entropy, pg_clipfrac, KL, reward mean/std per group
  • Eval: AIME 2024 / held-out math set (compare against algorithm_type: grpo with same data)
  • Ablations: disable each of the four techniques one at a time

GiGPO

  • Training: fraction of anchor groups with size > 1, mean (|A^S|), mean (|A^E|)
  • Eval: task success rate on ALFWorld / WebShop vs multi_step_grpo
  • Ablations: (\omega = 0) (episode-only), (\omega > 0), fnorm: std vs none

8. References

  • DAPO: Yu et al., arXiv:2503.14476, project page dapo-sia.github.io
  • GiGPO: Feng et al., arXiv:2505.10978, code langfengQ/verl-agent
  • Trinity algorithm development: docs/sphinx_doc/source/tutorial/develop_algorithm.md
  • Existing GRPO implementation: trinity/algorithm/advantage_fn/grpo_advantage.py
  • Existing multi-step GRPO: trinity/algorithm/advantage_fn/multi_step_grpo_advantage.py
