Wed-3-4-5 Enhancing Monotonicity for Robust Autoregressive Transformer TTS

Xiangyu Liang (Tsinghua University), Zhiyong Wu (The Chinese University of Hong Kong), Runnan Li (Tsinghua University (THU)), Yanqing Liu (Search Technology Center Asia (STCA), Microsoft), Sheng Zhao (Search Technology Center Asia (STCA), Microsoft) and Helen Meng (The Chinese University of Hong Kong)
Abstract: With the development of sequence-to-sequence modeling algorithms, Text-to-Speech (TTS) techniques have achieved significant improvements in speech quality and naturalness. Deep learning models such as recurrent neural networks (RNNs) and their memory-enhanced variants have shown a strong ability to reconstruct acoustic features from input linguistic features. However, the efficiency of these models is limited by their sequential processing in both training and inference. Recently, the Transformer, with its superior parallelism, has been proposed for TTS. It employs positional embeddings instead of a recurrent mechanism for position modeling and significantly boosts training speed. However, this approach lacks a monotonic constraint and suffers from issues such as pronunciation skipping. Therefore, in this paper, we propose a monotonicity-enhancing approach that combines Stepwise Monotonic Attention (SMA) with multi-head attention for Transformer-based TTS systems. Experiments show that the proposed approach reduces bad cases from 53 of 500 sentences to 1, together with an improvement in mean opinion score (MOS) from 4.09 to 4.17 in the naturalness test.
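The abstract names Stepwise Monotonic Attention but does not state its update rule. As a rough illustration only, the sketch below follows the commonly cited SMA recurrence, in which at each decoder step the attention either stays on the current input position or advances by exactly one. The function name `sma_step` and the argument `stay_prob` are hypothetical, and how the paper combines SMA with multi-head attention is not shown here.

```python
def sma_step(prev_alpha, stay_prob):
    """One decoder step of a stepwise monotonic attention recurrence (sketch).

    prev_alpha[j]: attention weight on input position j at the previous step.
    stay_prob[j]:  assumed probability of staying on position j; the
                   remaining 1 - stay_prob[j] moves one position forward.
    """
    alpha = []
    for j in range(len(prev_alpha)):
        # Weight that was on j and stays on j.
        stay = prev_alpha[j] * stay_prob[j]
        # Weight that was on j - 1 and advances to j (nothing precedes j = 0).
        move = prev_alpha[j - 1] * (1.0 - stay_prob[j - 1]) if j > 0 else 0.0
        alpha.append(stay + move)
    return alpha


# Attention starts entirely on the first input position.
alpha = [1.0, 0.0, 0.0]
stay_prob = [0.25, 0.5, 0.5]  # illustrative values, not taken from the paper
alpha = sma_step(alpha, stay_prob)
print(alpha)  # [0.25, 0.75, 0.0]
```

Because weight can only stay or move forward by one position, the alignment cannot skip or jump back over input tokens, which is the kind of constraint the abstract says plain Transformer attention lacks. In this sketch any weight that advances past the last position is simply dropped.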