[FEATURE] DAPO and GiGPO Implementation Suggestion

As discussed in https://github.com/agentscope-ai/Trinity-RFT/issues/470#issuecomment-4514366963, I would like to add two RL algorithms as features. Below is the guide I plan to follow; please review it and confirm whether adding these two algorithms is feasible.

# DAPO and GiGPO Implementation Guide for Trinity-RFT

This document explains how to add two group-based RL algorithms, taken from the papers below, to Trinity-RFT:

| Paper | arXiv | Algorithm | Primary use case |
|-------|-------|-----------|------------------|
| [DAPO: An Open-Source LLM Reinforcement Learning System at Scale](https://arxiv.org/abs/2503.14476) | `2503.14476` | **DAPO** | Long-CoT math/reasoning (single-turn, outcome reward) |
| [Group-in-Group Policy Optimization for LLM Agent Training](https://arxiv.org/abs/2505.10978) | `2505.10978` | **GiGPO** | Multi-turn LLM agents (sparse rewards, step credit) |

Related onboarding docs:

- [project_modules_onboarding.md](project_modules_onboarding.md) — repository map and runtime flow
- [rl_algorithm_improvement_guide.md](rl_algorithm_improvement_guide.md) — algorithm registry and extension pattern
- Official dev flow: `docs/sphinx_doc/source/tutorial/develop_algorithm.md`

> **Naming note:** Earlier Trinity docs mention “GiGRPO”; the NeurIPS 2025 paper defines **GiGPO** (Group-in-Group Policy Optimization). This guide uses **GiGPO** and registers `algorithm_type: gigpo`.

---

## 1. Current status in Trinity-RFT

### Already supported (baseline)

Registered `algorithm_type` values live in `trinity/algorithm/__init__.py`. Relevant baselines:

- **`grpo`** — grouped outcome advantage + PPO-style clipped loss (`GRPOAlgorithm`)
- **`multi_step_grpo`** — multi-turn rollouts with last-step GRPO broadcast (`MultiStepGRPOAlgorithm` + `step_wise_grpo` advantage)

### Partial DAPO support (not a dedicated algorithm)

| DAPO technique | Trinity today | Gap |
|----------------|---------------|-----|
| Decoupled clip (Clip-Higher) | `PPOPolicyLossFn` supports `clip_range_low` / `clip_range_high` separately | No `algorithm_type: dapo` bundle; not documented as DAPO defaults |
| Token-level policy loss | `policy_loss_fn_args.loss_agg_mode: token-mean` | Same — needs DAPO defaults |
| Overlong reward shaping | `MathDAPORewardFn` in `trinity/common/rewards/dapo_reward.py` | Wired in `examples/dapo_math/dapo.yaml` but algorithm is still `grpo` |
| Dynamic sampling | `RewardSTDFilter` drops zero-variance **groups** | No batch-size replenishment; no explicit “accuracy ∈ {0,1}” filter naming |

`examples/dapo_math/README.md` states that the DAPO **algorithm** is still a work in progress; the example demonstrates only the DAPO-Math dataset trained with GRPO plus a DAPO-style reward and clip range.

### GiGPO support

GiGPO is **not** registered. Multi-turn infrastructure exists (`EID.run`, `EID.step`, `multi_step_grpo`, AgentScope / ALFWorld examples) but there is no hierarchical episode+step advantage or anchor-state grouping.

---

## 2. Trinity extension model (shared by both algorithms)

Trinity decomposes RL into pluggable modules (see `AlgorithmType` in `trinity/algorithm/algorithm.py`):

```mermaid
flowchart LR
  Explorer --> Buffer
  Buffer --> Trainer
  subgraph algorithm_bundle
    SampleStrategy
    AdvantageFn
    PolicyLossFn
    KLFn
    EntropyLossFn
  end
  Trainer --> algorithm_bundle
```

**Typical work for a new algorithm:**

1. Implement only modules that differ from GRPO / `multi_step_grpo`.
2. Register classes in `advantage_fn`, `policy_loss_fn`, and optionally `buffer/operators`.
3. Add `XxxAlgorithm(AlgorithmType)` + `"xxx"` in `ALGORITHM_TYPE`.
4. Add `examples/<xxx>/` with YAML + README and tests under `tests/algorithm/`.

**Recommended first PR path:** prototype in `trinity/plugins/`, validate, then upstream to core registries (see `CONTRIBUTING.md`).

---

## 3. DAPO implementation plan

### 3.1 What DAPO changes (paper summary)

DAPO keeps a **GRPO-style critic-free group baseline** and fixes long-CoT training with four techniques (Section 3, arXiv:2503.14476):

1. **Clip-Higher (decoupled clip):** split the PPO clip range into `1 - ε_low` and `1 + ε_high` and raise only `ε_high`, so low-probability tokens with a positive advantage can grow further before being clipped (mitigates entropy collapse).
2. **Dynamic sampling:** drop prompt groups where all rollouts are correct or all are incorrect (zero advantage, hence zero gradient); keep sampling until the batch has enough “informative” groups.
3. **Token-level policy gradient loss:** average the loss over all tokens in the batch instead of averaging per-sequence means, so tokens in very long responses are not down-weighted.
4. **Overlong reward shaping:** apply a soft penalty as a response approaches the maximum length (stabilizes format and length).

Mathematically, DAPO is still group-relative advantage + clipped importance-weighted policy gradient; the novelty is **training system details**, not a new critic or value network.
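
For reference, the four techniques can be read off the DAPO objective. The form below is written in the paper's notation (\(G\) rollouts \(o_i\) for a question \(q\) with reference answer \(a\), outcome reward \(R_i\)), not in Trinity's identifiers, and should be checked against the paper before being used as a specification:

\[
\mathcal{J}_{\text{DAPO}}(\theta) =
\mathbb{E}_{(q,a)\sim\mathcal{D},\ \{o_i\}_{i=1}^{G}\sim\pi_{\theta_{\text{old}}}(\cdot\mid q)}
\left[
\frac{1}{\sum_{i=1}^{G}|o_i|}
\sum_{i=1}^{G}\sum_{t=1}^{|o_i|}
\min\Big(
r_{i,t}(\theta)\,\hat{A}_{i,t},\
\operatorname{clip}\big(r_{i,t}(\theta),\,1-\varepsilon_{\text{low}},\,1+\varepsilon_{\text{high}}\big)\,\hat{A}_{i,t}
\Big)
\right]
\]

\[
\text{s.t.}\quad 0 < \big|\{\,o_i \mid \texttt{is\_equivalent}(a, o_i)\,\}\big| < G,
\qquad
r_{i,t}(\theta) = \frac{\pi_\theta(o_{i,t}\mid q, o_{i,<t})}{\pi_{\theta_{\text{old}}}(o_{i,t}\mid q, o_{i,<t})},
\qquad
\hat{A}_{i,t} = \frac{R_i - \operatorname{mean}(\{R_j\}_{j=1}^{G})}{\operatorname{std}(\{R_j\}_{j=1}^{G})}.
\]

Here \(\varepsilon_{\text{low}} \neq \varepsilon_{\text{high}}\) is Clip-Higher, the constraint is dynamic sampling, the \(1/\sum_i |o_i|\) factor is the token-level loss, and overlong reward shaping enters through \(R_i\).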

### 3.2 Mapping to Trinity modules

| Component | Implementation | Files to touch |
|-----------|----------------|--------------|
| **Algorithm bundle** | `DAPOAlgorithm` defaults | `trinity/algorithm/algorithm.py`, `trinity/algorithm/__init__.py` |
| **Advantage** | Reuse `grpo` (`GRPOGroupedAdvantage`) | No new file required initially |
| **Policy loss** | Reuse `ppo` with decoupled clip + `token-mean` | Optionally alias `dapo` → thin wrapper over `PPOPolicyLossFn` for discoverability |
| **Reward** | Reuse `math_dapo_reward` | Already in `trinity/common/rewards/dapo_reward.py` |
| **Dynamic sampling** | New buffer/explorer filter | `trinity/buffer/operators/filters/dapo_dynamic_sampling.py` (new) |

#### Default `DAPOAlgorithm.default_config()` (target)

```yaml
algorithm:
  algorithm_type: dapo
  repeat_times: 16
  advantage_fn: grpo
  policy_loss_fn: ppo
  policy_loss_fn_args:
    clip_range_low: 0.2
    clip_range_high: 0.28      # Clip-Higher
    loss_agg_mode: token-mean  # Token-level loss
  kl_penalty_fn: none
  kl_loss_fn: k2               # GRPO default; the DAPO paper drops the KL term, so a no-op KL loss may match it more closely
  entropy_loss_fn: default

buffer:
  # pipeline operator (exact config key depends on buffer schema)
  operators:
    - type: dapo_dynamic_sampling
      min_std: 1e-6
      resample: true

explorer_input:
  reward_fn_args:
    enable_overlong_penalty: true
    penalty_factor: 1.0
    max_response_length: 20480
    cache_length: 4096
```

Align numeric hyperparameters with `examples/dapo_math/dapo.yaml` and the open DAPO/verl recipe when reproducing paper numbers.
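
To make the `reward_fn_args` above concrete, here is a minimal sketch of the soft overlong penalty as described in the DAPO paper: zero up to `max_response_length - cache_length`, then linearly decreasing to `-1` at `max_response_length`. The function name is hypothetical, and multiplying by `penalty_factor` is an assumption about how that key is meant to be used; it has not been checked against `MathDAPORewardFn`.

```python
def soft_overlong_penalty(
    response_length: int,
    max_response_length: int = 20480,
    cache_length: int = 4096,
    penalty_factor: float = 1.0,  # assumption: a plain multiplier on the penalty
) -> float:
    """Length penalty in [-penalty_factor, 0] to add to the outcome reward."""
    if cache_length <= 0 or cache_length > max_response_length:
        raise ValueError("cache_length must be in (0, max_response_length]")
    # Responses up to this length are not penalized at all.
    expected_length = max_response_length - cache_length
    if response_length <= expected_length:
        return 0.0
    # Beyond the hard limit the penalty saturates.
    if response_length > max_response_length:
        return -penalty_factor
    # Inside the buffer zone the penalty grows linearly from 0 to -penalty_factor.
    return (expected_length - response_length) / cache_length * penalty_factor
```

With the defaults shown, a response of 18432 tokens sits halfway through the 4096-token buffer and receives half of the maximum penalty.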

### 3.3 Dynamic sampling operator (main new code)

**Behavior:** For each task group (same prompt / `eid.tid`), compute rollout rewards. If `std(rewards) == 0` (all pass or all fail), exclude the whole group from the trainer batch. Optionally trigger **additional explorer rollouts** until `batch_size` valid groups are collected (paper: keep effective gradient count stable).

**Relation to existing code:** `RewardSTDFilter` in `trinity/buffer/operators/filters/reward_filter.py` already skips groups with `variance <= threshold`. Extend or replace with:

- `DAPODynamicSamplingFilter` — explicit metrics: `dropped_all_correct`, `dropped_all_wrong`, `kept_groups`
- Optional `min_valid_groups` hook for explorer scheduling (may require a small change in explorer/buffer batch assembly if resampling is not only filter-side)

**Registration:** `trinity/buffer/operators/__init__.py` → `"dapo_dynamic_sampling": "..."`.
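
The filtering behavior can be sketched without any Trinity dependency. The snippet below is an illustration only: it works on plain `(task_id, reward)` pairs instead of `Experience` objects, the function names are hypothetical, and the all-correct / all-wrong split by a reward threshold is a heuristic that assumes a binary accuracy reward (shaped rewards would need a different rule).

```python
from collections import defaultdict
from statistics import pstdev


def dynamic_sampling_filter(rollouts, min_std=1e-6, correct_reward=1.0):
    """Drop task groups whose rewards have (near-)zero spread.

    rollouts: iterable of (task_id, reward) pairs.
    Returns (kept, metrics), where kept maps task_id -> list of rewards.
    """
    groups = defaultdict(list)
    for task_id, reward in rollouts:
        groups[task_id].append(float(reward))

    kept = {}
    metrics = {"kept_groups": 0, "dropped_all_correct": 0, "dropped_all_wrong": 0}
    for task_id, rewards in groups.items():
        # A group with a single rollout has no spread and is always dropped.
        if len(rewards) > 1 and pstdev(rewards) > min_std:
            kept[task_id] = rewards
            metrics["kept_groups"] += 1
        elif min(rewards) >= correct_reward:
            metrics["dropped_all_correct"] += 1
        else:
            # Every other zero-spread group is counted here.
            metrics["dropped_all_wrong"] += 1
    return kept, metrics


def groups_still_needed(metrics, batch_size):
    """How many more informative groups the explorer would have to supply."""
    return max(0, batch_size - metrics["kept_groups"])
```

`groups_still_needed` is the number a resampling hook would act on; how that request reaches the explorer depends on the batch-assembly change mentioned above.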

### 3.4 Policy loss: optional `dapo` alias

`PPOPolicyLossFn` already implements decoupled clipping via separate `clip_range_low` / `clip_range_high`. A dedicated `DAPOPolicyLossFn` subclass is optional (defaults only) for clarity in configs and docs.
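
As a reference for what the defaults are expected to compute, the following is a pure-Python sketch of a decoupled-clip surrogate with token-mean aggregation. It is not Trinity's `PPOPolicyLossFn`: it uses nested lists of floats instead of tensors and masks, and the function name is hypothetical.

```python
import math


def clip_higher_token_mean_loss(
    logprobs, old_logprobs, advantages, clip_range_low=0.2, clip_range_high=0.28
):
    """Clipped surrogate loss averaged over all tokens in the batch.

    Each argument is a list of per-response lists with one float per token.
    """
    total, n_tokens = 0.0, 0
    for lp_seq, old_seq, adv_seq in zip(logprobs, old_logprobs, advantages):
        for lp, old_lp, adv in zip(lp_seq, old_seq, adv_seq):
            ratio = math.exp(lp - old_lp)
            # Decoupled clip: the two bounds are set independently.
            clipped = min(max(ratio, 1.0 - clip_range_low), 1.0 + clip_range_high)
            # PPO maximizes min(ratio * A, clipped * A); the loss is its negation.
            total += -min(ratio * adv, clipped * adv)
            n_tokens += 1
    # Token-mean: one global average, so every token carries equal weight.
    return total / max(n_tokens, 1)
```

Because the average runs over all tokens rather than over per-sequence means, a 2000-token response contributes 2000 terms while a 20-token response contributes 20.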

### 3.5 Example and tests

| Deliverable | Path |
|-------------|------|
| End-to-end config | `examples/dapo_math/dapo.yaml` — change `algorithm_type: grpo` → `dapo` |
| README | `examples/dapo_math/README.md` — document four techniques and metrics |
| Unit tests | `tests/algorithm/test_dapo_dynamic_sampling.py` — group filter edge cases (N=1, all 0, all 1, mixed) |
| Regression | Existing GRPO examples unchanged |

### 3.6 Suggested PR sequence (DAPO)

1. **PR 1 (small):** `DAPOAlgorithm` + registry + update `examples/dapo_math` to `algorithm_type: dapo` (reuse existing reward + clip args).
2. **PR 2 (medium):** `DAPODynamicSamplingFilter` + buffer config wiring + tests.
3. **PR 3 (optional):** Benchmark vs GRPO on DAPO-Math / AIME eval configs; document entropy and clip fraction metrics.

---

## 4. GiGPO implementation plan

### 4.1 What GiGPO changes (paper summary)

GiGPO targets **multi-turn LLM agents** where GRPO only assigns one advantage per trajectory. It stays **critic-free** and avoids extra per-state rollouts.

Two-level advantages (Equation 8, arXiv:2505.10978):

- **Episode-level** \(A^E(\tau_i)\): same as GRPO over \(N\) full trajectories sharing task \(x\) and initial state — normalize total return \(R(\tau_i)\) within the episode group.
- **Step-level** \(A^S(a^{(i)}_t)\): build **anchor state groups** \(G^S(\tilde{s})\) by hashing environment states \(\tilde{s}\) seen across trajectories; compare discounted returns \(R^{(i)}_t = \sum_{k\ge t} \gamma^{k-t} r^{(i)}_k\) within each group.
- **Combined:** \(A(a^{(i)}_t) = A^E(\tau_i) + \omega \cdot A^S(a^{(i)}_t)\).

Normalization \(F_{\text{norm}}\) can be `std` (GRPO-style) or `1` (RLOO-style); agent benchmarks in the paper often benefit from \(F_{\text{norm}}=1\).
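
Written out from the definitions above (the paper's exact notation may differ), the two advantages are:

\[
A^E(\tau_i) = \frac{R(\tau_i) - \operatorname{mean}\big(\{R(\tau_j)\}_{j=1}^{N}\big)}{F_{\text{norm}}\big(\{R(\tau_j)\}_{j=1}^{N}\big)},
\qquad
A^S(a^{(i)}_t) = \frac{R^{(i)}_t - \operatorname{mean}\big(\{R^{(j)}_k : s^{(j)}_k = \tilde{s}\}\big)}{F_{\text{norm}}\big(\{R^{(j)}_k : s^{(j)}_k = \tilde{s}\}\big)},
\]

where \(\tilde{s} = s^{(i)}_t\) and the set in \(A^S\) ranges over every step \((j, k)\) in the episode group whose state matches the anchor \(\tilde{s}\). For example, with \(N = 2\), returns \(R(\tau_1) = 1\), \(R(\tau_2) = 0\) and \(F_{\text{norm}} = 1\), the episode-level advantages are \(+0.5\) and \(-0.5\).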

Policy update: standard clipped objective over **each step’s tokens** (same family as GRPO/PPO on multi-turn experiences).

Reference implementation (external): [langfengQ/verl-agent](https://github.com/langfengQ/verl-agent) (cited in the paper).

### 4.2 Prerequisites in Trinity (explorer / workflow)

GiGPO requires **per-step experiences** with stable grouping keys. Trinity already documents this on `EID`:

```python
# Excerpt from trinity/common/experience.py (EID docstring and fields); "..." marks omitted lines.
# To enable the full functionality of the experience grouping, user should manually
# set the `run` and `step` fields in custom workflows.
...
run: int = 0
...
step: int = 0
```

**Workflow contract (new or extended):**

1. One `Experience` per environment step (multi-turn `action_mask`).
2. Set `eid.run` ∈ `{0..N-1}` for trajectory index within a task group.
3. Set `eid.step` for time index \(t\).
4. Store in `experience.info`:
   - `env_state_hash` (or canonical serialized observation) for anchor grouping
   - `step_reward` \(r^{(i)}_t\) (scalar per step)
   - optional `episode_return` at terminal step for cross-check

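**Anchor state hash:** Must be deterministic for “same environment state” (e.g., ALFWorld room layout string, WebShop page DOM fingerprint). GiGPO groups all \((a,r)\) pairs with matching `env_state_hash` across runs and steps within the same task group (`eid.tid`); steps from different tasks should not share an anchor group even if their hashes coincide.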
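
One way to obtain such a hash is to fingerprint a canonical serialization of the state. The sketch below assumes the workflow can expose the state as a JSON-serializable object; the helper name is hypothetical. Python's built-in `hash()` is unsuitable here because string hashing is salted per process, so values would not match across explorer workers.

```python
import hashlib
import json


def compute_env_state_hash(state) -> str:
    """Deterministic fingerprint of a JSON-serializable environment state."""
    # sort_keys and fixed separators make the text independent of dict
    # insertion order and of whitespace choices.
    canonical = json.dumps(
        state, sort_keys=True, separators=(",", ":"), ensure_ascii=True
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

A workflow would then store the result under `experience.info["env_state_hash"]` for each step. What goes into `state` is the real design decision: including volatile fields such as timestamps or step counters would make every state unique and leave all anchor groups as singletons.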

Existing references:

- `examples/grpo_alfworld_general_multi_step/` — `multi_step_grpo` + step-wise workflows
- `examples/agentscope_*` — ReAct / tool agents (good GiGPO targets after hash plumbing)

### 4.3 Mapping to Trinity modules

| Component | Implementation | Files to touch |
|-----------|----------------|--------------|
| **Advantage** | `GiGPOAdvantageFn` | `trinity/algorithm/advantage_fn/gigpo_advantage.py` (new) |
| **Policy loss** | Reuse `ppo` (clipped, token-mean) | Same as multi-step GRPO |
| **Algorithm bundle** | `GiGPOAlgorithm` | `trinity/algorithm/algorithm.py`, `__init__.py` |
| **Workflows** | Emit `env_state_hash` + step rewards | e.g. `step_wise_alfworld_workflow`, AgentScope workflows |
| **Explorer** | `repeat_times: N` trajectories per task | Config only |

#### `GiGPOAdvantageFn` algorithm (pseudocode)

Implement `AdvantageFn` (or extend patterns from `StepWiseGRPOAdvantageFn` + `GRPOGroupedAdvantage`):

```
process(batch of experiences):
  # 1) Episode-level
  for each task group (eid.tid):
    for each run (eid.rid):
      R_i = sum_t step_reward   # equals the terminal reward when only it is non-zero
    compute A_E per run (mean/std or mean/1 normalization)
    broadcast A_E to every step in that run

  # 2) Step-level anchor groups (built inside each task group, never across tasks)
  for each run: R_t = sum_{k >= t} gamma^(k-t) * step_reward_k
  build map: (eid.tid, env_state_hash) -> list of (experience, R_t)
  for each anchor group with |group| >= 2:
    compute A_S for each member (normalize R_t within group)
  for singleton groups: A_S = 0

  # 3) Combine
  for each experience:
    advantages = (A_E + omega * A_S) * action_mask
    returns = advantages.clone()
```
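
The same procedure as a runnable sketch on plain dictionaries, useful as a reference for unit tests. It is not the proposed `GiGPOAdvantageFn`: the input keys (`task`, `run`, `step`, `state_hash`, `reward`) are illustrative stand-ins for the `EID` fields and `experience.info` entries, it returns one scalar per step instead of a masked tensor, and it takes the episode return as the undiscounted sum of step rewards.

```python
from collections import defaultdict
from statistics import mean, pstdev


def _normalize(values, fnorm="none", epsilon=1e-6):
    """Center on the mean; divide by the std only when fnorm == 'std'."""
    mu = mean(values)
    scale = pstdev(values) + epsilon if fnorm == "std" else 1.0
    return [(v - mu) / scale for v in values]


def gigpo_advantages(steps, omega=1.0, gamma=1.0, fnorm="none", epsilon=1e-6):
    """Return one scalar advantage per step, in input order."""
    # Indices of each run's steps, ordered by time.
    runs = defaultdict(list)
    for idx, s in enumerate(steps):
        runs[(s["task"], s["run"])].append(idx)
    for indices in runs.values():
        indices.sort(key=lambda i: steps[i]["step"])

    # Discounted return-to-go per step and total return per run.
    returns_to_go = [0.0] * len(steps)
    episode_return = {}
    for key, indices in runs.items():
        running = 0.0
        for i in reversed(indices):
            running = steps[i]["reward"] + gamma * running
            returns_to_go[i] = running
        episode_return[key] = sum(steps[i]["reward"] for i in indices)

    # 1) Episode level: normalize total returns across runs of the same task.
    runs_by_task = defaultdict(list)
    for key in runs:
        runs_by_task[key[0]].append(key)
    a_episode = {}
    for keys in runs_by_task.values():
        normed = _normalize([episode_return[k] for k in keys], fnorm, epsilon)
        a_episode.update(zip(keys, normed))

    # 2) Step level: anchor groups keyed by (task, state_hash).
    anchors = defaultdict(list)
    for idx, s in enumerate(steps):
        anchors[(s["task"], s["state_hash"])].append(idx)
    a_step = [0.0] * len(steps)  # singleton groups keep A_S = 0
    for indices in anchors.values():
        if len(indices) >= 2:
            normed = _normalize([returns_to_go[i] for i in indices], fnorm, epsilon)
            for i, a in zip(indices, normed):
                a_step[i] = a

    # 3) Combine the two levels.
    return [
        a_episode[(s["task"], s["run"])] + omega * a_step[i]
        for i, s in enumerate(steps)
    ]
```

With two runs that share the initial state and only one of which ends in reward 1, the shared first step receives both an episode-level and a step-level signal, while the later, unshared steps fall back to the episode-level advantage alone.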

**Discount \(\gamma\):** Configurable `advantage_fn_args.gamma` (paper uses standard RL discount; agent tasks often use \(\gamma \approx 1\) for sparse terminal reward).

**Normalization:** `fnorm: std | none` where `none` means divide by 1 (RLOO-style).

Register as `"gigpo": "trinity.algorithm.advantage_fn.gigpo_advantage.GiGPOAdvantageFn"`.

#### Default `GiGPOAlgorithm.default_config()` (target)

```yaml
algorithm:
  algorithm_type: gigpo
  repeat_times: 8
  advantage_fn: gigpo
  advantage_fn_args:
    omega: 1.0
    gamma: 1.0
    fnorm: none          # or std for math-like grouping
    epsilon: 1e-6
  policy_loss_fn: ppo
  policy_loss_fn_args:
    clip_range_low: 0.2
    clip_range_high: 0.2
    loss_agg_mode: token-mean
  kl_penalty_fn: none
  kl_loss_fn: k2
  entropy_loss_fn: default

explorer_input:
  taskset:
    rollout_args:
      # N trajectories per task
    workflow_args:
      max_env_steps: 30
  default_workflow_type: step_wise_alfworld_workflow  # after hash support
```

Base the class on `MultiStepGRPOAlgorithm` and keep its flags: `use_critic: false`, `compute_advantage_in_trainer: false`, `schema: experience`.

### 4.4 Difference vs `multi_step_grpo`

| | `multi_step_grpo` | GiGPO |
|--|-------------------|--------|
| Grouping | Task-level GRPO on **last step** only; broadcast to earlier steps | Episode group **+** anchor-state step groups |
| Credit | Same scalar advantage for all steps in a run | \(A^E + \omega A^S\) per step |
| State identity | Not used | `env_state_hash` required |
| Extra rollouts | No | No |

GiGPO is **not** a small config tweak; it needs the new advantage class and workflow metadata.

### 4.5 Orthogonality with DAPO

The GiGPO paper notes compatibility with group-based methods including DAPO. A future `gigpo` + DAPO-style clip/reward could be:

```yaml
algorithm_type: gigpo
policy_loss_fn_args:
  clip_range_low: 0.2
  clip_range_high: 0.28
```

for agent tasks with very long generations (a less common situation than in the math DAPO setup).

### 4.6 Example and tests

| Deliverable | Path |
|-------------|------|
| Minimal example | `examples/gigpo_alfworld/gigpo.yaml` (fork `grpo_alfworld_general_multi_step`) |
| README | Math for \(A^E\), \(A^S\), \(\omega\); how to set `env_state_hash` |
| Unit tests | `tests/algorithm/test_gigpo_advantage.py` — synthetic trajectories with repeated states |
| Integration | Optional smoke test on FrozenLake (`examples/agentscope_frozenlake`) with toy hash |

### 4.7 Suggested PR sequence (GiGPO)

1. **PR 1:** `GiGPOAdvantageFn` + unit tests (no workflow changes; mock `info["env_state_hash"]`).
2. **PR 2:** `GiGPOAlgorithm` + registry + `examples/gigpo_alfworld` using existing workflow + hash in one environment.
3. **PR 3:** AgentScope / WebShop workflows + benchmark README comparing `multi_step_grpo` vs `gigpo`.

---

## 5. Side-by-side comparison

| Dimension | DAPO | GiGPO |
|-----------|------|-------|
| Turn structure | Single-turn outcome reward | Multi-turn per-step experiences |
| Primary new code | Buffer dynamic sampling (+ algorithm bundle) | `GiGPOAdvantageFn` + workflow state hash |
| Reuse from Trinity | `grpo`, `math_dapo_reward`, `PPOPolicyLossFn` | `multi_step_grpo` explorer pattern, `ppo` loss |
| Example upgrade | `examples/dapo_math` | `examples/grpo_alfworld_general_multi_step` → `examples/gigpo_*` |
| Eval focus | AIME / math reasoning | ALFWorld, WebShop, search-augmented QA |

---

## 6. Registry checklist (both algorithms)

When upstreaming from plugins to core:

- [ ] `trinity/algorithm/advantage_fn/__init__.py` — register `gigpo` (DAPO reuses `grpo`)
- [ ] `trinity/algorithm/policy_loss_fn/__init__.py` — optional `dapo` alias
- [ ] `trinity/algorithm/algorithm.py` — `DAPOAlgorithm`, `GiGPOAlgorithm`
- [ ] `trinity/algorithm/__init__.py` — `"dapo"`, `"gigpo"` in `ALGORITHM_TYPE`
- [ ] `trinity/buffer/operators/__init__.py` — `dapo_dynamic_sampling`
- [ ] `tests/algorithm/` — unit tests
- [ ] `docs/rl_algorithm_improvement_guide.md` — mark DAPO/GiGPO as implemented
- [ ] `README.md` — supported algorithms list

Run before PR:

```bash
python -m pytest tests/algorithm/
pre-commit run --all-files
```

---

## 7. Validation metrics

**DAPO**

- Training: policy entropy, `pg_clipfrac`, KL, reward mean/std per group
- Eval: AIME 2024 / held-out math set (compare against `algorithm_type: grpo` with same data)
- Ablations: disable each of the four techniques one at a time

**GiGPO**

- Training: fraction of anchor groups with size > 1, mean \(|A^S|\), mean \(|A^E|\)
- Eval: task success rate on ALFWorld / WebShop vs `multi_step_grpo`
- Ablations: \(\omega = 0\) (episode-only), \(\omega > 0\), `fnorm: std` vs `none`
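
The anchor-group metric is worth logging early, because when almost every group is a singleton GiGPO reduces to the episode-level advantage. A minimal sketch, assuming each step is identified by a `(task_id, env_state_hash)` pair; the function and metric names are hypothetical:

```python
from collections import Counter


def anchor_group_coverage(keys):
    """keys: iterable of (task_id, env_state_hash) pairs, one per step."""
    sizes = Counter(keys)
    if not sizes:
        return {"multi_member_group_frac": 0.0, "steps_with_step_signal_frac": 0.0}
    multi = [n for n in sizes.values() if n > 1]
    total_steps = sum(sizes.values())
    return {
        # Share of anchor groups that contain more than one step.
        "multi_member_group_frac": len(multi) / len(sizes),
        # Share of steps that can receive a non-trivial step-level advantage.
        "steps_with_step_signal_frac": sum(multi) / total_steps,
    }
```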

---

## 8. References

- DAPO: Yu et al., arXiv:[2503.14476](https://arxiv.org/abs/2503.14476), project page [dapo-sia.github.io](https://dapo-sia.github.io/)
- GiGPO: Feng et al., arXiv:[2505.10978](https://arxiv.org/abs/2505.10978), code [langfengQ/verl-agent](https://github.com/langfengQ/verl-agent)
- Trinity algorithm development: `docs/sphinx_doc/source/tutorial/develop_algorithm.md`
- Existing GRPO implementation: `trinity/algorithm/advantage_fn/grpo_advantage.py`
- Existing multi-step GRPO: `trinity/algorithm/advantage_fn/multi_step_grpo_advantage.py`

