Add Countdown number game RL recipe #594

Open

YujiaBao wants to merge 2 commits into thinking-machines-lab:main from YujiaBao:countdown-recipe-pr
Conversation

@YujiaBao (Member) commented Apr 3, 2026

Summary

  • Add tinker_cookbook/recipes/countdown_rl/ — GRPO training on Jiayi-Pan/Countdown-Tasks-3to4 where models combine 3-4 numbers with arithmetic to reach a target
  • Partial credit rewards that grade proximity to target, converting all-bad GRPO groups into useful training signal (+4% over binary rewards)
  • Configurable reward mode (binary vs partial), KL penalty support, fewshot prefix
  • 14 unit tests for reward verification logic
  • Best config reaches 85% test accuracy on Qwen3-4B-Instruct-2507 (from 68% baseline)
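The partial-credit idea described above can be sketched as a small scoring function. This is an illustrative sketch only, not the recipe's actual API: the function name, the distance-based decay, and the 0.5 cap on non-exact answers are assumptions chosen so that a near-miss always scores strictly below an exact hit while still creating reward variance inside an all-wrong GRPO group.

```python
def partial_credit_reward(value: float, target: float, tolerance: float = 50.0) -> float:
    """Hypothetical partial-credit reward for Countdown.

    Returns 1.0 for an exact hit. Otherwise, credit decays linearly with
    distance from the target and is capped at 0.5 so a close-but-wrong
    answer never outranks a correct one.
    """
    if value == target:
        return 1.0
    closeness = max(0.0, 1.0 - abs(value - target) / tolerance)
    return 0.5 * closeness
```

Under binary rewards, a GRPO group where every rollout misses the target has zero advantage everywhere; with a shaping like this, the rollout that lands closest still gets a positive relative advantage.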

Hyperparameter sweep

| Config                     | Best Test Acc | Key Finding                   |
|----------------------------|---------------|-------------------------------|
| binary, 512tok             | 68%           | Baseline                      |
| partial, 1024tok           | 76%           | Partial rewards +4%           |
| partial, 2048tok, 40 steps | 85%           | Token budget is biggest lever |

Test plan

  • 14 unit tests pass (pytest tinker_cookbook/recipes/countdown_rl/)
  • Training script imports cleanly
  • 8 training experiments completed successfully
  • CI

🤖 Generated with Claude Code

YujiaBao and others added 2 commits April 3, 2026 18:22
GRPO training on [Jiayi-Pan/Countdown-Tasks-3to4](https://huggingface.co/datasets/Jiayi-Pan/Countdown-Tasks-3to4)
where models learn to reach a target number using 3-4 input numbers
with basic arithmetic (+, -, *, /).

**Key features:**
- Verifiable reward with partial credit for "close but wrong" answers,
  converting all-bad GRPO groups into useful training signal
- Configurable reward mode (`binary` vs `partial`)
- KL penalty support via KLReferenceConfig
- Fewshot prefix for cold-start format compliance
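The verifiable-reward check above boils down to validating a candidate arithmetic expression. A minimal sketch of what such a verifier might do, assuming the model emits a plain infix expression; the function name and AST-whitelist approach are illustrative, not the cookbook's actual implementation:

```python
import ast

def verify_countdown(expr: str, numbers: list[int], target: int) -> bool:
    """Check that `expr` uses exactly the given numbers (each once),
    contains only +, -, *, /, and evaluates to the target."""
    allowed = (ast.Expression, ast.BinOp, ast.Constant,
               ast.Add, ast.Sub, ast.Mult, ast.Div,
               ast.UnaryOp, ast.USub)
    try:
        tree = ast.parse(expr, mode="eval")
    except SyntaxError:
        return False
    used = []
    for node in ast.walk(tree):
        if not isinstance(node, allowed):
            return False  # reject names, calls, anything non-arithmetic
        if isinstance(node, ast.Constant):
            if not isinstance(node.value, int):
                return False
            used.append(node.value)
    if sorted(used) != sorted(numbers):
        return False  # must use exactly the provided numbers
    try:
        value = eval(compile(tree, "<expr>", "eval"))
    except ZeroDivisionError:
        return False
    return abs(value - target) < 1e-6
```

Walking the AST with a node whitelist (rather than calling `eval` on raw model output) keeps the check safe against arbitrary code in the completion.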

**Hyperparameter sweep results** (Qwen3-4B-Instruct-2507, LoRA rank 32):

| Config                    | Best Test Acc |
|---------------------------|---------------|
| binary, 512tok (baseline) | 68%           |
| partial, 1024tok          | 76%           |
| partial, 2048tok, 40 steps| **85%**       |

Key findings: (1) partial credit adds ~4% by creating within-group
reward variance, (2) token budget is the biggest lever (512→2048 = +17%),
(3) model learns conciseness naturally through GRPO (avg tokens drops
from 1100→500 over training).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>