Detai Xin (The University of Tokyo), Yuki Saito (The University of Tokyo), Shinnosuke Takamichi (The University of Tokyo), Tomoki Koriyama (The University of Tokyo) and Hiroshi Saruwatari (The University of Tokyo)
Abstract:
We present a method for improving the performance of cross-lingual text-to-speech synthesis.
Previous works can model speaker individuality in a speaker space via a speaker encoder, but their performance degrades when synthesizing cross-lingual speech.
This is because the speaker space formed by all speaker embeddings is completely language-dependent.
To construct a language-independent speaker space, we regard cross-lingual speech synthesis as a domain adaptation problem and propose a training method that lets the speaker encoder map speaker embeddings from different languages into the same space.
Furthermore, to improve speaker individuality and make the speaker space human-interpretable, we propose a regression method that constructs a perceptually correlated speaker space.
Experimental results demonstrate that our method not only improves the performance of both cross-lingual and intra-lingual speech synthesis but also finds perceptually similar speakers across languages.