Thu-1-1-3 A Cyclical Post-filtering Approach to Mismatch Refinement of Neural Vocoder for Text-to-speech Systems

Yi-Chiao Wu (Nagoya University), Patrick Lumban Tobing (Nagoya University), Kazuki Yasuhara (Nagoya University), Noriyuki Matsunaga (AI Inc.), Yamato Ohtani (AI Inc.) and Tomoki Toda (Nagoya University)

Abstract: Recently, the effectiveness of text-to-speech (TTS) systems combined with neural vocoders to generate high-fidelity speech has been shown. However, collecting the required training data and building these advanced systems from scratch are time- and resource-consuming. A more economical approach is to develop a neural vocoder to enhance the speech generated by existing TTS systems. Nonetheless, this approach usually suffers from two issues: 1) temporal mismatches between TTS and natural waveforms and 2) acoustic mismatches between training and testing data. To address these issues, we adopt a cyclic voice conversion (VC) model to generate temporally matched pseudo-VC data for training and acoustically matched enhanced data for testing the neural vocoders. Because of its generality, this framework can be applied to arbitrary neural vocoders. In this paper, we apply the proposed method with a state-of-the-art WaveNet vocoder for two different TTS systems, and both objective and subjective experimental results confirm the effectiveness of the proposed framework.

Index Terms: temporal mismatch, acoustic mismatch, cycle-consistent, voice conversion, post-filter for text-to-speech

