Speech Synthesis: Multilingual and Cross-Lingual Approaches

Wed-2-11-2 MULTI-LINGUAL MULTI-SPEAKER TEXT-TO-SPEECH SYNTHESIS FOR VOICE CLONING WITH ONLINE SPEAKER ENROLLMENT

Zhaoyu Liu (The Hong Kong University of Science and Technology) and Brian Mak (The Hong Kong University of Science and Technology)
Abstract: Recent studies in multi-lingual and multi-speaker text-to-speech synthesis proposed approaches that use proprietary corpora of performing artists and require fine-tuning to enroll new voices. To reduce these costs, we investigate a novel approach for generating high-quality speech in multiple languages for speakers enrolled in their native language. In our proposed system, we introduce tone/stress embeddings that extend the language embedding to represent tone and stress information. By manipulating the tone/stress embedding input, our system can synthesize speech with either a native or a foreign accent. To support online enrollment of new speakers, we condition the Tacotron-based synthesizer on speaker embeddings derived from a pretrained x-vector speaker encoder via transfer learning. We introduce a shared phoneme set that encourages more phoneme sharing than the IPA. Our MOS results demonstrate that the native speech in all languages is highly intelligible and natural. We also find that L2-norm normalization and ZCA-whitening of the x-vectors help improve system stability and audio quality. Finally, WaveNet performance appears to be language-independent: a WaveNet model trained on any one of the three supported languages in our system can generate speech in the other two languages very well.
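The abstract names L2-norm normalization and ZCA-whitening of x-vectors as preprocessing steps that improve stability and audio quality. A minimal sketch of those two operations is shown below; the 512-dimensional embedding size, the NumPy implementation, and the order of the two steps are assumptions for illustration, not details taken from the paper.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    """Scale each x-vector to unit L2 norm."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def zca_whiten(X, eps=1e-5):
    """ZCA-whiten a matrix of x-vectors (one per row).

    Returns the whitened vectors plus the (mean, transform) pair so the
    same transform can be reused for newly enrolled speakers.
    """
    mean = X.mean(axis=0, keepdims=True)
    Xc = X - mean
    cov = Xc.T @ Xc / (Xc.shape[0] - 1)
    eigvals, eigvecs = np.linalg.eigh(cov)
    W = eigvecs @ np.diag(1.0 / np.sqrt(eigvals + eps)) @ eigvecs.T
    return Xc @ W, (mean, W)

# Illustrative usage: preprocess training x-vectors, then apply the same
# transform to the x-vector of a speaker enrolled online (placeholder data).
train_xvectors = np.random.randn(1000, 512)
whitened, (mu, W) = zca_whiten(l2_normalize(train_xvectors))
new_speaker = np.random.randn(1, 512)
new_embedding = (l2_normalize(new_speaker) - mu) @ W  # conditioning input for the synthesizer
```

Keeping the mean and whitening matrix fixed after training is one way such a transform could support online enrollment without refitting; the paper's exact procedure is not specified in the abstract.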