Hi author,
Could you share your training config for the Qwen3-base series? We found that in the DAPO setting without a KL penalty, the Qwen3-base model easily collapses after 200 training steps. Did you see similar training instability with it when doing off-policy RL? Thank you!