Wed-1-3-5 Reformer-TTS: Neural Speech Synthesis with Reformer Network

Hyeongrae Ihm, Joun Yeop Lee, Byoung Jin Choi, Sung Jun Cheon and Nam Soo Kim (Seoul National University)
Abstract: Recent end-to-end text-to-speech (TTS) systems based on deep neural networks (DNNs) have achieved state-of-the-art performance in speech synthesis. In particular, attention-based sequence-to-sequence models have successfully improved the quality of the alignment between text and spectrogram. Leveraging this improvement, speech synthesis using a Transformer network has been reported to generate human-like speech audio. However, such sequence-to-sequence models demand intensive computing power and memory during training, since attention scores are calculated over the entire set of keys for every query. To mitigate this issue, we propose Reformer-TTS, a model based on the Reformer network, which utilizes locality-sensitive hashing (LSH) attention and reversible residual networks. As a result, we show that the Reformer network consumes almost half as much memory as the Transformer, which leads to fast convergence when training an end-to-end TTS system. We demonstrate these advantages with memory usage measurements and a subjective performance evaluation.
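For readers unfamiliar with the two Reformer components named in the abstract, here is a minimal illustrative sketch, not taken from the paper; all function names (rev_forward, rev_inverse, lsh_buckets, lsh_attention) and parameters (n_buckets, seed) are hypothetical, and the tanh-linear sublayers stand in for the real attention and feed-forward sublayers. The first sketch shows a RevNet-style reversible residual block: the inputs are exactly recoverable from the outputs, so intermediate activations can be recomputed during the backward pass instead of being stored, which is the source of the memory savings the abstract reports.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 8
Wf = rng.standard_normal((dim, dim)) / np.sqrt(dim)
Wg = rng.standard_normal((dim, dim)) / np.sqrt(dim)
f = lambda h: np.tanh(h @ Wf)   # stand-in for the attention sublayer
g = lambda h: np.tanh(h @ Wg)   # stand-in for the feed-forward sublayer

def rev_forward(x1, x2):
    # Two-stream reversible residual: y1 = x1 + F(x2); y2 = x2 + G(y1)
    y1 = x1 + f(x2)
    y2 = x2 + g(y1)
    return y1, y2

def rev_inverse(y1, y2):
    # Exact reconstruction of the inputs from the outputs, so
    # activations can be recomputed instead of cached for backprop.
    x2 = y2 - g(y1)
    x1 = y1 - f(x2)
    return x1, x2

# Round-trip check: inverting the forward pass recovers the inputs.
x1, x2 = rng.standard_normal((2, 4, dim))
assert np.allclose(rev_inverse(*rev_forward(x1, x2)), (x1, x2))
```

The second sketch shows the idea behind LSH attention with shared queries and keys: positions are hashed into buckets via random projections, and each query attends only to positions in the same bucket. This dense-mask version is for clarity only; the actual Reformer sorts positions by bucket and attends within fixed-size chunks, which is what reduces the cost from O(L^2) toward O(L log L).

```python
import numpy as np

def lsh_buckets(x, n_buckets, rng):
    # Angular LSH: project onto random directions and take the
    # argmax over [xR; -xR], giving one bucket id per position.
    r = rng.standard_normal((x.shape[-1], n_buckets // 2))
    proj = x @ r
    return np.argmax(np.concatenate([proj, -proj], axis=-1), axis=-1)

def lsh_attention(q, v, n_buckets=8, seed=0):
    # Shared-QK attention restricted to same-bucket positions.
    rng = np.random.default_rng(seed)
    b = lsh_buckets(q, n_buckets, rng)
    scores = (q @ q.T) / np.sqrt(q.shape[-1])
    scores[b[:, None] != b[None, :]] = -1e9   # mask cross-bucket pairs
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

q = np.random.default_rng(1).standard_normal((16, 8))
out = lsh_attention(q, q, n_buckets=4)   # shape (16, 8)
```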