Wed-2-11-5 Cross-lingual Text-To-Speech Synthesis via Domain Adaptation and Perceptual Similarity Regression in Speaker Space

Detai Xin (The University of Tokyo), Yuki Saito (The University of Tokyo), Shinnosuke Takamichi (The University of Tokyo), Tomoki Koriyama (The University of Tokyo) and Hiroshi Saruwatari (The University of Tokyo)

Abstract: We present a method for improving the performance of cross-lingual text-to-speech synthesis. Previous methods can model speaker individuality in a speaker space via a speaker encoder, but their performance degrades when synthesizing cross-lingual speech. This is because the speaker space formed by all speaker embeddings is completely language-dependent. To construct a language-independent speaker space, we regard cross-lingual speech synthesis as a domain adaptation problem and propose a training method that lets the speaker encoder adapt speaker embeddings of different languages into the same space. Furthermore, to improve speaker individuality and construct a human-interpretable speaker space, we propose a regression method for constructing a perceptually correlated speaker space. Experimental results demonstrate that our method not only improves the performance of both cross-lingual and intra-lingual speech synthesis but also finds perceptually similar speakers across languages.
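
The two training signals named in the abstract can be illustrated with a toy sketch. The abstract does not specify the loss functions, so everything here is an assumption: `similarity_regression_loss` regresses the cosine similarity of speaker embeddings onto hypothetical listener-rated similarity scores, and `mean_discrepancy` is a simple stand-in for a domain adaptation objective that pulls the embeddings of two languages into the same region. Neither function should be read as the authors' actual formulation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def similarity_regression_loss(embeddings, scores):
    """Mean squared error between embedding cosine similarity and a
    perceptual similarity score for each rated speaker pair.

    `scores` maps a pair of speaker indices (i, j) to a hypothetical
    listener rating scaled to [-1, 1] (an assumption of this sketch).
    """
    errors = [
        (cosine(embeddings[i], embeddings[j]) - s) ** 2
        for (i, j), s in scores.items()
    ]
    return sum(errors) / len(errors)


def mean_discrepancy(group_a, group_b):
    """Squared Euclidean distance between the mean embeddings of two
    language groups. Assumed stand-in for a domain adaptation loss."""
    dim = len(group_a[0])
    mean_a = [sum(v[d] for v in group_a) / len(group_a) for d in range(dim)]
    mean_b = [sum(v[d] for v in group_b) / len(group_b) for d in range(dim)]
    return sum((a - b) ** 2 for a, b in zip(mean_a, mean_b))


# Toy data: three 2-dimensional speaker embeddings and two rated pairs.
toy_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_scores = {(0, 1): 0.0, (0, 2): 1.0}
regression_loss = similarity_regression_loss(toy_embeddings, toy_scores)

# Toy embeddings for speakers of two different languages.
language_gap = mean_discrepancy([[0.0, 0.0], [2.0, 0.0]], [[1.0, 1.0], [1.0, 3.0]])
```

In a real system both quantities would be differentiable terms added to the speaker encoder's training loss; the pure-Python version only shows what each term measures.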
