Mon-2-7-2 Improving the Speaker Identity of Non-Parallel Many-to-Many Voice Conversion with Adversarial Speaker Recognition

Shaojin Ding (Texas A&M University), Guanlong Zhao (Texas A&M University), and Ricardo Gutierrez-Osuna (Texas A&M University)
Abstract: Phonetic Posteriorgrams (PPGs) have received much attention for non-parallel many-to-many Voice Conversion (VC) and have been shown to achieve state-of-the-art performance. These methods implicitly assume that PPGs are speaker-independent and contain only the linguistic information in an utterance. In practice, however, PPGs carry speaker individuality cues, such as accent, intonation, and speaking rate. As a result, these cues can leak into the converted speech, making it sound similar to the source speaker. To address this issue, we propose an adversarial learning approach that removes speaker-dependent information from VC models based on a PPG2speech synthesizer. During training, the encoder output of the PPG2speech synthesizer is fed to a classifier trained to identify the corresponding speaker, while the encoder is trained to fool the classifier. As a result, a more speaker-independent representation is learned. The proposed method is advantageous in that the adversarial speaker classifier is jointly trained with the PPG2speech synthesizer end-to-end, with no need to pre-train the classifier. We conduct objective and subjective experiments on the CSTR VCTK Corpus under standard and one-shot VC conditions. Results show that the proposed method significantly improves the speaker identity of VC syntheses compared with a baseline system trained without adversarial learning.
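The joint adversarial scheme the abstract describes is commonly implemented with a gradient reversal layer, which lets the speaker classifier and the synthesizer encoder be optimized in a single backward pass. The PyTorch sketch below illustrates that general pattern, not the authors' exact system: the `encoder`, `decoder`, and `SpeakerClassifier` modules, the layer sizes, the L1 synthesis loss, and the weighting factor `lam` are all illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; negates and scales gradients on backward."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # The reversed gradient pushes the encoder to *confuse* the classifier.
        return -ctx.lam * grad_output, None

def grad_reverse(x, lam=1.0):
    return GradReverse.apply(x, lam)

class SpeakerClassifier(nn.Module):
    """Predicts speaker identity from time-pooled encoder features (hypothetical sizes)."""
    def __init__(self, feat_dim, n_speakers):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, 256),
            nn.ReLU(),
            nn.Linear(256, n_speakers),
        )

    def forward(self, h):                  # h: (batch, time, feat_dim)
        return self.net(h.mean(dim=1))     # average over time, then classify

def train_step(encoder, decoder, spk_clf, optimizer,
               ppg, mel_target, spk_id, lam=0.1):
    """One joint update: synthesis loss plus adversarial speaker loss."""
    h = encoder(ppg)                       # bottleneck features from PPGs
    mel_pred = decoder(h)
    synth_loss = F.l1_loss(mel_pred, mel_target)
    # The classifier receives reversed gradients w.r.t. the encoder, so one
    # backward pass trains it to identify speakers while simultaneously
    # pushing the encoder toward a speaker-independent representation.
    spk_logits = spk_clf(grad_reverse(h, lam))
    adv_loss = F.cross_entropy(spk_logits, spk_id)
    loss = synth_loss + adv_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return synth_loss.item(), adv_loss.item()
```

Because the reversal layer negates gradients flowing into the encoder, minimizing the classifier's cross-entropy trains the classifier to identify speakers while training the encoder to discard the cues it relies on, matching the "fool the classifier" objective without any classifier pre-training.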