Harbor is a benchmark + runner that manages sandboxed task environments (e.g. Docker or Daytona) and can verify solutions via a task-specific verifier.
This folder provides lightweight examples for using Harbor as a Gym-style terminal environment, and connecting it to tinker-cookbook for RL training.
From the repo root:
```bash
pip install harbor
```

A task is a directory that typically contains:

- `instruction.md`: the natural-language goal
- `environment/`: how the sandbox is built (e.g. a Dockerfile)
- `tests/`: verifier scripts/tests that produce a reward (often `reward.txt`)
- `solution/`: optional reference solution
- `task.toml`: metadata and timeouts
This repo includes an example task under `harbor_envs/devops_task`.
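For orientation, the example task follows the layout described above. The sketch below is illustrative; the exact file names inside `environment/` and `tests/` may differ in the actual task directory:

```
harbor_envs/devops_task/
├── task.toml        # metadata and timeouts
├── instruction.md   # natural-language goal given to the agent
├── environment/     # sandbox build context (e.g. a Dockerfile)
├── tests/           # verifier scripts that produce the reward (often reward.txt)
└── solution/        # optional reference solution
```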
`AsyncTerminalGymEnv` (implemented in `terminal_env.py`) wraps Harbor's environment lifecycle and exposes an async Gym-like API:

- `reset() -> (obs, info)`
- `step(action) -> (obs, reward, done, info)`
It uses tmux for interaction (so it works with interactive TUIs) and computes the reward sparsely by running the task verifier when `done=True`.
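For illustration, a minimal driver loop over this API might look like the sketch below. The constructor arguments (`task_dir`, `backend`) and the string-typed `action` are assumptions made for the example; see `terminal_env.py` for the actual signature.

```python
# Minimal rollout sketch against the async Gym-like API described above.
# Constructor arguments and the exact action type are illustrative assumptions;
# see terminal_env.py for the real interface.
import asyncio

from terminal_env import AsyncTerminalGymEnv  # adjust the import path to your checkout


async def rollout(env: AsyncTerminalGymEnv) -> float:
    obs, info = await env.reset()      # builds the sandbox, returns the initial terminal screen
    total_reward, done = 0.0, False
    while not done:
        action = "echo hello\n"        # during RL training this comes from the policy
        obs, reward, done, info = await env.step(action)
        total_reward += reward         # reward is sparse: the verifier only runs when done=True
    return total_reward


# asyncio.run(rollout(AsyncTerminalGymEnv(task_dir="harbor_envs/devops_task", backend="docker")))
```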
`HarborTerminalTinkerEnv` (in `wrapped_env.py`) adapts the Gym-style terminal env to the tinker-cookbook RL `Env` interface:

- Observation: a `tinker.ModelInput` built by a tinker-cookbook renderer over a chat history
- Action: model completion tokens that decode into assistant text
- The assistant text is expected to be JSON in the `terminus-json-plain` format describing terminal keystrokes
This is a small “glue layer” so you can use Harbor sandboxes as RL environments inside tinker-cookbook.
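To make the action decoding concrete, here is a small sketch of turning an assistant reply into keystrokes. The JSON field name (`keystrokes`) is an assumption for illustration only; the authoritative schema is whatever `terminus-json-plain` defines in `wrapped_env.py`.

```python
# Sketch: parse a JSON assistant reply into terminal keystrokes.
# The "keystrokes" field is an illustrative assumption, not the confirmed
# terminus-json-plain schema; see wrapped_env.py for the real parsing logic.
import json


def parse_assistant_action(assistant_text: str) -> str:
    """Return the keystrokes to send to the tmux session, or "" if the reply is malformed."""
    try:
        payload = json.loads(assistant_text)
    except json.JSONDecodeError:
        return ""  # the real env may instead penalize or re-prompt on malformed JSON
    if not isinstance(payload, dict):
        return ""
    return payload.get("keystrokes", "")


reply = '{"keystrokes": "cat /etc/os-release\\n"}'
print(parse_assistant_action(reply))  # -> cat /etc/os-release
```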
The script `train.py` wires together:

- `HarborSingleTaskRLDatasetBuilder(groups_per_batch, group_size)`
- Harbor sandboxes (Docker or Daytona)
- the tinker-cookbook training loop
Example (Daytona):

```bash
export DAYTONA_API_KEY=...
export TINKER_API_KEY=...
python tinker_cookbook/recipes/terminal_rl/train.py \
--task-dir tinker_cookbook/recipes/terminal_rl/harbor_envs/devops_task \
--env daytona \
--model-name Qwen/Qwen3-30B-A3B-Instruct-2507 \
--renderer-name qwen3 \
--max-tokens 4096 \
--temperature 0.7 \
--groups-per-batch 1 \
--group-size 8 \
--num-batches 25 \
--learning-rate 4e-5 \
--lora-rank 32 \
--save-every 10 \
--log-path ./runs/devops_qwen3_lr4e-5_r32_nb25
```

Example (local Docker):

```bash
export TINKER_API_KEY=...
python tinker_cookbook/recipes/terminal_rl/train.py \
--task-dir tinker_cookbook/recipes/terminal_rl/harbor_envs/devops_task \
--env docker \
--model-name Qwen/Qwen3-30B-A3B-Instruct-2507 \
--renderer-name qwen3 \
--max-tokens 4096 \
--temperature 0.7 \
--groups-per-batch 1 \
--group-size 8 \
--num-batches 25 \
--learning-rate 4e-5 \
--lora-rank 32 \
--save-every 10 \
--log-path ./runs/devops_qwen3_lr4e-5_r32_nb25
```

Notes:
- Requires `DAYTONA_API_KEY` (for the Daytona backend) and `TINKER_API_KEY`.
- tinker / tinker-cookbook are optional dependencies; install them separately in your training environment.
To validate the integration between Harbor and tinker-cookbook, we ran a hyperparameter search with Qwen3-30B-A3B-Instruct-2507 (rendered with the `qwen3` renderer) on the example DevOps task.
Training used the `importance_sampling` loss function with a group size of 8. We compared three learning rates (1e-5, 2e-5, and 4e-5) over 25 optimization steps. The primary goal of this experiment was to validate the effectiveness of the environment integration; due to the latency of sandboxed execution, we restricted training to a single representative DevOps task with a limited horizon.
(Figure: Comparison of Total Reward and Turns per Episode across different learning rates)
- **Higher LR is beneficial for Sparse Rewards:**
  - LR 4e-5 (green) showed the best performance, rapidly increasing the total reward from ~0.25 to nearly 0.80 within just 25 steps.
  - Lower learning rates (1e-5, 2e-5) struggled to escape the initial policy, resulting in flat reward curves. This suggests that for sparse-reward terminal tasks, a larger learning rate helps the model explore effective command sequences more aggressively.
- **Efficiency Gains (Lower Turns):**
  - As the model learned to solve the task (step 15+), turns per episode for the 4e-5 run decreased significantly, dropping from ~7 to ~5 turns.
  - This indicates the agent didn't just learn to solve the task, but learned to solve it efficiently (e.g., avoiding redundant `ls` or `pwd` commands), which is a critical metric for terminal agents.
- **Environment Stability:**
  - The `HarborTerminalTinkerEnv` successfully handled the rollout, execution, and verification cycle inside the sandbox. The clear correlation between the increase in reward and the decrease in turns confirms that the environment provides valid, learnable signals for RL training.