Wed-1-3-8 DurIAN: Duration Informed Attention Network For Speech Synthesis

Chengzhu Yu (Tencent), Heng Lu (Tencent America), Na Hu (Tencent), Meng Yu (Tencent), Chao Weng (Tencent AI Lab), Kun Xu (Tencent), Peng Liu (Tencent), Deyi Tuo (Tencent), Shiyin Kang (Tencent), Guangzhi Lei (Tencent), Dan Su (Tencent AI Lab Shenzhen) and Dong Yu (Tencent)
Abstract: In this paper, we present a generic, robust, and effective speech synthesis system. Its key component is the Duration Informed Attention Network (DurIAN), an autoregressive model in which the alignments between the input text and the output acoustic features are inferred from a duration model. This differs from the end-to-end attention mechanism used in existing end-to-end speech synthesis systems such as Tacotron, where the attention mechanism is responsible for various unavoidable artifacts. To improve the efficiency of speech generation, we also propose a multi-band parallel generation strategy on top of the WaveRNN model. The proposed Multi-band WaveRNN reduces the total computational complexity from 9.8 to 3.6 GFLOPS and generates audio 6 times faster than real time on a single CPU core. We show that DurIAN can generate highly natural speech on par with state-of-the-art end-to-end systems, while avoiding the word skipping and repeating errors observed in those systems.
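To illustrate the core idea of duration-informed alignment, below is a minimal sketch (not the authors' code; function and variable names are illustrative): phone-level encoder states are expanded to frame level according to durations predicted by a separate duration model, and the resulting frame-level sequence conditions the autoregressive decoder. Because the text-to-frame alignment is fixed before decoding, attention failure modes such as word skipping or repetition cannot occur.

```python
# Minimal sketch of duration-informed state expansion, assuming
# phone-level encoder outputs and integer frame durations from a
# separately trained duration model.
import numpy as np

def expand_states(encoder_states: np.ndarray, durations: np.ndarray) -> np.ndarray:
    """Repeat each phone's encoder state for its predicted number of frames.

    encoder_states: (num_phones, hidden_dim) phone-level encodings.
    durations:      (num_phones,) integer frame counts per phone.
    returns:        (sum(durations), hidden_dim) frame-level conditioning
                    consumed by the autoregressive acoustic decoder.
    """
    return np.repeat(encoder_states, durations, axis=0)

# Example: 3 phones, hidden size 4, durations of 2/5/3 frames -> 10 frames.
states = np.random.randn(3, 4).astype(np.float32)
frames = expand_states(states, np.array([2, 5, 3]))
assert frames.shape == (10, 4)
```

This replaces the learned soft attention of Tacotron-style models with a deterministic repeat operation, trading end-to-end alignment learning for robustness.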