Add Countdown number game RL recipe #594

Open

YujiaBao wants to merge 2 commits into thinking-machines-lab:main from YujiaBao:countdown-recipe-pr
Conversation

@YujiaBao (Member) commented Apr 3, 2026

Summary

  • Add tinker_cookbook/recipes/countdown_rl/ — GRPO training on Jiayi-Pan/Countdown-Tasks-3to4 where models combine 3-4 numbers with arithmetic to reach a target
  • Partial credit rewards that grade proximity to target, converting all-bad GRPO groups into useful training signal (+4% over binary rewards)
  • Configurable reward mode (binary vs partial), KL penalty support, fewshot prefix
  • 14 unit tests for reward verification logic
  • Best config reaches 85% test accuracy on Qwen3-4B-Instruct-2507 (from 68% baseline)
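The partial-credit idea described above can be sketched as a small scoring function. This is an illustrative sketch only, not the recipe's actual API: the function name, the distance-based decay, and the 0.5 cap on non-exact answers are assumptions chosen so that a near-miss always scores strictly below an exact hit while still creating reward variance inside an all-wrong GRPO group.

```python
def partial_credit_reward(value: float, target: float, tolerance: float = 50.0) -> float:
    """Hypothetical partial-credit reward for Countdown.

    Returns 1.0 for an exact hit. Otherwise, credit decays linearly with
    distance from the target and is capped at 0.5 so a close-but-wrong
    answer never outranks a correct one.
    """
    if value == target:
        return 1.0
    closeness = max(0.0, 1.0 - abs(value - target) / tolerance)
    return 0.5 * closeness
```

Under binary rewards, a GRPO group where every rollout misses the target has zero advantage everywhere; with a shaping like this, the rollout that lands closest still gets a positive relative advantage.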

Hyperparameter sweep

| Config                     | Best Test Acc | Key Finding                   |
|----------------------------|---------------|-------------------------------|
| binary, 512tok             | 68%           | Baseline                      |
| partial, 1024tok           | 76%           | Partial rewards +4%           |
| partial, 2048tok, 40 steps | 85%           | Token budget is biggest lever |

Test plan

  • 14 unit tests pass (pytest tinker_cookbook/recipes/countdown_rl/)
  • Training script imports cleanly
  • 8 training experiments completed successfully
  • CI

🤖 Generated with Claude Code

YujiaBao and others added 2 commits April 3, 2026 18:22
GRPO training on [Jiayi-Pan/Countdown-Tasks-3to4](https://huggingface.co/datasets/Jiayi-Pan/Countdown-Tasks-3to4)
where models learn to reach a target number using 3-4 input numbers
with basic arithmetic (+, -, *, /).

**Key features:**
- Verifiable reward with partial credit for "close but wrong" answers,
  converting all-bad GRPO groups into useful training signal
- Configurable reward mode (`binary` vs `partial`)
- KL penalty support via KLReferenceConfig
- Fewshot prefix for cold-start format compliance
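The verifiable-reward check above boils down to validating a candidate arithmetic expression. A minimal sketch of what such a verifier might do, assuming the model emits a plain infix expression; the function name and AST-whitelist approach are illustrative, not the cookbook's actual implementation:

```python
import ast

def verify_countdown(expr: str, numbers: list[int], target: int) -> bool:
    """Check that `expr` uses exactly the given numbers (each once),
    contains only +, -, *, /, and evaluates to the target."""
    allowed = (ast.Expression, ast.BinOp, ast.Constant,
               ast.Add, ast.Sub, ast.Mult, ast.Div,
               ast.UnaryOp, ast.USub)
    try:
        tree = ast.parse(expr, mode="eval")
    except SyntaxError:
        return False
    used = []
    for node in ast.walk(tree):
        if not isinstance(node, allowed):
            return False  # reject names, calls, anything non-arithmetic
        if isinstance(node, ast.Constant):
            if not isinstance(node.value, int):
                return False
            used.append(node.value)
    if sorted(used) != sorted(numbers):
        return False  # must use exactly the provided numbers
    try:
        value = eval(compile(tree, "<expr>", "eval"))
    except ZeroDivisionError:
        return False
    return abs(value - target) < 1e-6
```

Walking the AST with a node whitelist (rather than calling `eval` on raw model output) keeps the check safe against arbitrary code in the completion.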

**Hyperparameter sweep results** (Qwen3-4B-Instruct-2507, LoRA rank 32):

| Config                    | Best Test Acc |
|---------------------------|---------------|
| binary, 512tok (baseline) | 68%           |
| partial, 1024tok          | 76%           |
| partial, 2048tok, 40 steps| **85%**       |

Key findings: (1) partial credit adds ~4% by creating within-group
reward variance, (2) token budget is the biggest lever (512→2048 = +17%),
(3) model learns conciseness naturally through GRPO (avg tokens drops
from 1100→500 over training).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>