50-reward GRPO training: a 0.1 temp change collapsed the system (zenodo.org)
4 points by HenryAvery 2 months ago