Thu-1-11-3 Non-autoregressive End-to-End TTS with Coarse-to-Fine Decoding

Tao Wang, Jianhua Tao, Jiangyan Yi, Ruibo Fu and Zhengqi Wen (National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences)
Abstract: Most end-to-end neural text-to-speech (TTS) systems generate acoustic features autoregressively from left to right, which leads to two problems: 1) low efficiency during inference; 2) "exposure bias". To overcome these shortcomings, this paper proposes a non-autoregressive speech synthesis model based on the transformer structure. During training, the ground-truth acoustic features are masked according to a schedule, and the decoder must predict the entire feature sequence from the text and the masked ground truth. During inference, only the text is needed as input, and the network predicts the acoustic features in one step. Additionally, to enable the model to fuse left and right context when decoding, we decompose the decoding process into two stages. Given an input text embedding, we first generate coarse acoustic features, which capture the meaning of the sentence. Then we fill in the missing details of the acoustic features by taking into account both the text information and the coarse acoustic features. Experiments on a Chinese female corpus show that our approach achieves speech naturalness competitive with an autoregressive model. Most importantly, our model speeds up acoustic feature generation by 296× compared with the transformer-based autoregressive model.
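The abstract describes two mechanisms: masking the ground-truth acoustic features during training (so the decoder learns to reconstruct all frames in parallel) and a coarse-to-fine two-stage decoder. Below is a minimal, self-contained PyTorch sketch of both ideas, written under stated assumptions; it is not the authors' code. All names (CoarseToFineDecoder, mask_emb, mask_ratio) and hyper-parameters are illustrative, and the target frame count T at inference is assumed to come from some length predictor, which the abstract does not specify.

```python
# Minimal sketch (not the authors' implementation) of schedule-masked training
# and coarse-to-fine parallel decoding for a non-autoregressive transformer
# TTS decoder. Names and hyper-parameters are illustrative assumptions.
import torch
import torch.nn as nn


class CoarseToFineDecoder(nn.Module):
    def __init__(self, d_model=256, n_mels=80, n_heads=4, n_layers=3):
        super().__init__()

        def _decoder():
            layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
            return nn.TransformerDecoder(layer, n_layers)

        self.mel_in = nn.Linear(n_mels, d_model)
        # Two decoding stages: coarse features first, then refined details.
        self.coarse_dec, self.fine_dec = _decoder(), _decoder()
        self.coarse_out = nn.Linear(d_model, n_mels)
        self.fine_out = nn.Linear(d_model, n_mels)
        # Learned embedding substituted for masked ground-truth frames.
        self.mask_emb = nn.Parameter(torch.zeros(d_model))

    def forward(self, text_memory, mel_gt, mask_ratio):
        # Training: mask a fraction of ground-truth frames (the fraction
        # follows a schedule over training); the decoder must reconstruct
        # the entire sequence in one parallel pass.
        B, T, _ = mel_gt.shape
        x = self.mel_in(mel_gt)
        masked = torch.rand(B, T, device=x.device) < mask_ratio
        x[masked] = self.mask_emb
        coarse = self.coarse_out(self.coarse_dec(x, text_memory))
        # Fine stage attends to the text and the coarse prediction,
        # filling in the missing details.
        fine = self.fine_out(self.fine_dec(self.mel_in(coarse), text_memory))
        return coarse, fine

    @torch.no_grad()
    def infer(self, text_memory, T):
        # Inference: no ground truth is available, so every decoder input
        # frame is the mask embedding; one parallel pass per stage.
        # T is assumed to come from a separate length/duration predictor.
        B = text_memory.size(0)
        x = self.mask_emb.expand(B, T, -1)
        coarse = self.coarse_out(self.coarse_dec(x, text_memory))
        fine = self.fine_out(self.fine_dec(self.mel_in(coarse), text_memory))
        return fine
```

In training, both stage outputs would typically be compared against the ground-truth mel-spectrogram (e.g., with an L1 or L2 loss) while mask_ratio is annealed according to the schedule; at inference, a single call to infer produces the full feature sequence without any left-to-right loop, which is the source of the reported speedup.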