Mon-2-7-3 Non-parallel Many-to-many Voice Conversion with PSR-StarGAN

Yanping Li (Nanjing University of Posts and Telecommunications), Dongxiang Xu (Nanjing University of Posts and Telecommunications), Yan Zhang (JIT), Yang Wang (vivo AI Lab) and Binbin Chen (vivo AI Lab)

Abstract: Voice conversion (VC) aims to modify a source speaker's speech so that it sounds like that of a target speaker while preserving the linguistic content of the speech. StarGAN-VC, a recently proposed method, uses a variant of Generative Adversarial Networks (GANs) to perform non-parallel many-to-many VC. However, the quality of the generated speech is not yet satisfactory. This paper proposes an improved method, "PSR-StarGAN-VC", which incorporates three improvements. First, perceptual loss functions are introduced to optimize the StarGAN-VC generator so that it learns high-level spectral features. Second, because Switchable Normalization (SN) can learn different normalization operations for different normalization layers of a model, it replaces Batch Normalization (BN) in StarGAN-VC. Third, a Residual Network (ResNet) is applied to establish mappings between corresponding layers of the generator's encoder and decoder, which helps retain more semantic features during conversion and makes training easier. Experimental results on the VCC 2018 datasets demonstrate the superiority of the proposed method in terms of naturalness and speaker similarity.
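
As an illustration of the second improvement, the general idea behind Switchable Normalization can be sketched as follows: the layer computes instance-, layer-, and batch-wise statistics and blends them with softmax weights that are learned per layer. This is a minimal NumPy sketch of that general idea, not the authors' implementation; the function names `softmax` and `switchable_norm`, the argument names, and the (N, C, H, W) input layout are assumptions made for the example.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D array of logits."""
    e = np.exp(z - z.max())
    return e / e.sum()

def switchable_norm(x, mean_logits, var_logits, gamma=1.0, beta=0.0, eps=1e-5):
    """Blend instance, layer, and batch statistics (hypothetical helper).

    x: array of shape (N, C, H, W).
    mean_logits, var_logits: length-3 arrays; in a real layer these would be
    learnable parameters, ordered as (instance, layer, batch).
    """
    # Statistics of the three candidate normalizers.
    stats = [
        (x.mean(axis=(2, 3), keepdims=True), x.var(axis=(2, 3), keepdims=True)),        # instance norm
        (x.mean(axis=(1, 2, 3), keepdims=True), x.var(axis=(1, 2, 3), keepdims=True)),  # layer norm
        (x.mean(axis=(0, 2, 3), keepdims=True), x.var(axis=(0, 2, 3), keepdims=True)),  # batch norm
    ]
    w_mean = softmax(np.asarray(mean_logits, dtype=float))
    w_var = softmax(np.asarray(var_logits, dtype=float))

    # Weighted combination; broadcasting expands each statistic to x's shape.
    mean = sum(w * m for w, (m, _) in zip(w_mean, stats))
    var = sum(w * v for w, (_, v) in zip(w_var, stats))
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

# Example: logits that strongly favor the batch statistics make the layer
# behave approximately like Batch Normalization.
x = np.random.default_rng(0).normal(size=(4, 3, 5, 6))
y = switchable_norm(x, [0.0, 0.0, 50.0], [0.0, 0.0, 50.0])
```

With equal logits the three normalizers contribute equally; training can shift the weights so that different layers settle on different mixtures.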

Paper

prev Mon-2-7-2 Improving the Speaker Identity of Non-Parallel Many-to-Many Voice Conversion with Adversarial Speaker Recognition

next Mon-2-7-4 TTS Skins: Speaker Conversion via ASR
