DAPO Integration in Trinity: Algorithm Registration, Dynamic Sampling Pipeline, Overlong Handling, and Reviewer-Focused Docs #552

Open
artaasd95 wants to merge 4 commits into agentscope-ai:main from artaasd95:main

artaasd95 commented May 25, 2026

Description

This PR introduces DAPO as a first-class algorithm in Trinity and wires it end-to-end across algorithm registration, reward processing, experience filtering, examples, and tests. The implementation is additive and does not change existing algorithm defaults.

As discussed in issue #551, the changes are listed below:

  1. Algorithm integration and defaults

  2. Added DAPO algorithm type class at algorithm.py.

  3. DAPO default configuration is defined in algorithm.py:

    • repeat_times: 16
    • advantage_fn: grpo
    • sample_strategy: default
    • policy_loss_fn: ppo
    • kl_penalty_fn: none
    • kl_loss_fn: none
    • entropy_loss_fn: default
  4. Registered algorithm key dapo in the global algorithm registry at init.py.

  5. Added dapo to grouped-advantage handling in the config manager at algorithm_config_manager.py (two locations).
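
The defaults and registry key listed above could be expressed roughly as follows. This is a minimal standalone sketch, not the code in algorithm.py: the class name `DAPOAlgorithmSketch`, the `ALGORITHM_REGISTRY_SKETCH` dict, and the `resolve_config` helper are hypothetical names used only for illustration. Only the key/value pairs come from the list above.

```python
from typing import Any, Dict


class DAPOAlgorithmSketch:
    """Illustrative stand-in for the DAPO algorithm type (hypothetical name)."""

    @classmethod
    def default_config(cls) -> Dict[str, Any]:
        # Values copied from the PR description. Note that "none" is a
        # string key here, not Python's None.
        return {
            "repeat_times": 16,
            "advantage_fn": "grpo",
            "sample_strategy": "default",
            "policy_loss_fn": "ppo",
            "kl_penalty_fn": "none",
            "kl_loss_fn": "none",
            "entropy_loss_fn": "default",
        }


# Hypothetical registry: maps the algorithm key to its type class.
ALGORITHM_REGISTRY_SKETCH = {"dapo": DAPOAlgorithmSketch}


def resolve_config(algorithm_type: str, overrides: Dict[str, Any]) -> Dict[str, Any]:
    """Merge user overrides on top of the registered defaults."""
    config = ALGORITHM_REGISTRY_SKETCH[algorithm_type].default_config()
    config.update(overrides)
    return config
```

Under these assumptions, `resolve_config("dapo", {"repeat_times": 8})` would return the defaults with only `repeat_times` changed, which mirrors the additive intent of the PR: other algorithm keys are untouched.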

  6. Dynamic sampling operator (DAPO behavior)

  7. Added DAPODynamicSamplingFilter at reward_filter.py.

  8. Outcome score extraction logic at reward_filter.py.

  9. Correctness decision logic at reward_filter.py.

  10. Group filtering conditions:

    • Drop all-wrong groups at reward_filter.py
    • Drop all-correct groups at reward_filter.py
  11. Registered operator names in operator registry:

    • dapo_dynamic_sampling at init.py
    • mask_response_truncated at init.py
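
The group filtering conditions above (drop all-wrong groups, drop all-correct groups) can be sketched as below. This is a simplified illustration, not the `DAPODynamicSamplingFilter` in reward_filter.py: the `ExperienceSketch` record, its field names, and the "score above zero means correct" rule are assumptions made for the example.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ExperienceSketch:
    """Hypothetical minimal experience record (not Trinity's own class)."""

    task_id: str
    outcome_score: float


def is_correct(exp: ExperienceSketch, threshold: float = 0.0) -> bool:
    # Assumption: a score above the threshold counts as correct, which
    # fits a symmetric +1/-1 accuracy score.
    return exp.outcome_score > threshold


def dynamic_sampling_filter(
    experiences: List[ExperienceSketch],
) -> List[ExperienceSketch]:
    """Keep only task groups containing both correct and wrong responses."""
    groups: Dict[str, List[ExperienceSketch]] = defaultdict(list)
    for exp in experiences:
        groups[exp.task_id].append(exp)

    kept: List[ExperienceSketch] = []
    for group in groups.values():
        n_correct = sum(is_correct(e) for e in group)
        # n_correct == 0 is an all-wrong group; n_correct == len(group)
        # is an all-correct group. Both are dropped.
        if 0 < n_correct < len(group):
            kept.extend(group)
    return kept
```

In this sketch the decision is made per task group on correctness alone, so a batch can shrink after filtering. That is the trade-off the PR describes: fewer effective samples, in exchange for keeping only groups with mixed outcomes.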
  12. Overlong response handling

  13. Added MaskResponseTruncatedOperator at reward_filter.py.

  14. Workflow-level masking safety added in math workflows:

    • Sync workflow path at customized_math_workflows.py (two locations)
    • Async workflow path at customized_math_workflows.py (two locations)
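
As a rough illustration of what masking a truncated response can look like, the sketch below zeroes the per-token mask of any response flagged as truncated, so its tokens would not contribute to the loss. The `ResponseSketch` record and its `truncated` and `action_mask` fields are hypothetical; the actual `MaskResponseTruncatedOperator` and workflow code may represent this differently.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ResponseSketch:
    """Hypothetical record; field names are illustrative only."""

    tokens: List[int]
    truncated: bool  # True when generation stopped at the length limit
    action_mask: List[int] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Default: every response token participates in the loss.
        if not self.action_mask:
            self.action_mask = [1] * len(self.tokens)


def mask_truncated(responses: List[ResponseSketch]) -> List[ResponseSketch]:
    """Zero the mask of truncated responses; leave the others unchanged."""
    for resp in responses:
        if resp.truncated:
            resp.action_mask = [0] * len(resp.tokens)
    return responses
```

Applying the same rule in both the pipeline operator and the workflow is redundant by design in this reading: either path alone would be enough to keep a truncated response out of the gradient.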
  15. Reward function updates for DAPO

  16. Added MathDAPORewardFn class at dapo_reward.py.

  17. Accuracy score uses symmetric rule:

    • correctness thresholding at dapo_reward.py
    • +1/-1 assignment at dapo_reward.py
  18. Overlong penalty logic:

    • enable flag branch at dapo_reward.py
    • penalty function at dapo_reward.py
    • expected length formula at dapo_reward.py
    • hard overlong penalty return at dapo_reward.py
  19. Registered reward type math_dapo_reward at init.py.
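
The reward rules above (symmetric accuracy score plus an optional overlong penalty) can be sketched as follows. The penalty shape here follows the soft overlong punishment described in the DAPO paper, which is an assumption about what dapo_reward.py implements; the function and parameter names (`max_len`, `cache_len`, `enable_overlong_penalty`) are hypothetical and not taken from `MathDAPORewardFn`.

```python
def accuracy_score(is_correct: bool) -> float:
    """Symmetric accuracy score: +1 for correct, -1 for wrong."""
    return 1.0 if is_correct else -1.0


def overlong_penalty(response_len: int, max_len: int, cache_len: int) -> float:
    """Soft length penalty in [-1, 0], following the DAPO paper's scheme."""
    if cache_len <= 0:
        raise ValueError("cache_len must be positive")
    expected_len = max_len - cache_len  # no penalty up to this length
    if response_len <= expected_len:
        return 0.0
    if response_len <= max_len:
        # Linear ramp from 0 down to -1 across the cache window.
        return (expected_len - response_len) / cache_len
    return -1.0  # hard penalty beyond the maximum length


def dapo_reward(
    is_correct: bool,
    response_len: int,
    max_len: int,
    cache_len: int,
    enable_overlong_penalty: bool = True,
) -> float:
    """Accuracy score, plus the overlong penalty when it is enabled."""
    reward = accuracy_score(is_correct)
    if enable_overlong_penalty:
        reward += overlong_penalty(response_len, max_len, cache_len)
    return reward
```

For example, with `max_len=200` and `cache_len=50`, a correct response of 175 tokens sits halfway through the penalty window and would score `1.0 - 0.5 = 0.5` in this sketch.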

  20. Example configuration and docs

  21. End-to-end DAPO example config:

    • algorithm type at dapo.yaml
    • clip settings at dapo.yaml (two locations)
    • token-level aggregation at dapo.yaml
    • pipeline operators at dapo.yaml (two locations)
    • overlong reward args at dapo.yaml (four locations)
    • default reward binding at dapo.yaml
  22. DAPO feature documentation:

    • DAPO listed in main README at README.md
    • Dataset perspective update (English) at example_dataset_perspective.md
    • Dataset perspective update (Chinese) at example_dataset_perspective.md
    • DAPO behavior note for dynamic sampling and std_threshold guidance at README.md
    • Metrics guidance at README.md
  23. Tests and platform compatibility

  24. Added DAPO test suite at test_dapo_algorithm.py.

  25. Covered cases:

    • registry at test_dapo_algorithm.py
    • default config at test_dapo_algorithm.py
    • no-reference policy expectation at test_dapo_algorithm.py
    • dynamic sampling all-correct at test_dapo_algorithm.py
    • dynamic sampling all-wrong at test_dapo_algorithm.py
    • dynamic sampling mixed correctness at test_dapo_algorithm.py
    • overlong-different but all-correct behavior at test_dapo_algorithm.py
    • truncation masking at test_dapo_algorithm.py
    • symmetric reward correctness at test_dapo_algorithm.py
  26. Windows collection compatibility for posix-only resource import:

    • guard and mock setup at conftest.py (three locations)
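
A guard of this kind might look like the sketch below. Python's `resource` module is only available on Unix-like platforms, so test collection on Windows fails if any imported module pulls it in. The helper name `install_resource_stub` is hypothetical, and the real conftest.py may structure the guard differently.

```python
import sys
from unittest import mock


def install_resource_stub() -> bool:
    """Install a mock `resource` module when the real one is unavailable.

    Returns True if a stub was installed, False if the real module exists.
    """
    try:
        import resource  # noqa: F401  (POSIX-only standard library module)
        return False
    except ImportError:
        # Later `import resource` statements will receive this mock
        # instead of raising ImportError during test collection.
        sys.modules["resource"] = mock.MagicMock(name="resource")
        return True
```

Calling such a helper at the top of conftest.py, before any project imports, would let collection proceed on Windows while leaving POSIX behavior unchanged.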
  27. Reviewer-focused behavior summary

  28. This is additive: no default behavior changes for non-DAPO algorithms.

  29. DAPO path is activated through algorithm_type dapo and dedicated pipeline/reward wiring.

  30. Dynamic sampling keeps only mixed-correctness task groups. This can reduce the effective sample count, but groups with uniform outcomes contribute little or no group-relative advantage signal, so training signal quality should improve.

  31. Overlong handling is enforced in both the pipeline operator and the workflow mask path.

  32. Documentation and examples are complete for reproducibility.

Validation

  1. DAPO test command executed in meridian-test environment.
  2. Result: 9 passed.

If you want, I can also give you a shorter reviewer checklist version (one page) plus an expanded architecture version (for design review) from this same material.
Note: Since this is my first PR in this project, please let me know if I have overlooked anything!

Checklist

Please check the following items before code is ready to be reviewed.

  • Code has passed all tests
  • Docstrings have been added/updated in Google Style
  • Documentation has been updated
  • Code is ready for review

artaasd95 added 4 commits May 25, 2026 16:47
- Introduced DAPO (Decoupled Clip and Dynamic Sampling Policy Optimization) algorithm in the codebase.
- Updated example configurations and README files to reflect DAPO usage on the DAPO-Math-17k dataset.
- Added tests for DAPO algorithm registration and dynamic sampling filter functionality.
- Enhanced existing documentation to include DAPO techniques and implementation details.
- Updated experience processing to support DAPO-specific features such as dynamic sampling and response truncation handling.