Hyeongrae Ihm (Seoul National University), Joun Yeop Lee (Seoul National University), Byoung Jin Choi (Seoul National University), Sung Jun Cheon (Seoul National University) and Nam Soo Kim (Seoul National University)
Abstract:
Recent end-to-end text-to-speech (TTS) systems based on deep neural networks (DNNs) have shown state-of-the-art performance in the field of speech synthesis. In particular, attention-based sequence-to-sequence models have successfully improved the quality of the alignment between text and spectrogram. Leveraging this improvement, speech synthesis using a Transformer network has been reported to generate human-like speech audio. However, such sequence-to-sequence models require intensive computing power and memory during training, since attention scores are computed over the entire key sequence for every query. To mitigate this issue, we propose Reformer-TTS, a model based on the Reformer network, which utilizes locality-sensitive hashing (LSH) attention and reversible residual networks. As a result, we show that the Reformer network consumes almost half the memory of the Transformer, which leads to fast convergence when training an end-to-end TTS system. We demonstrate these advantages with memory usage measurements and a subjective performance evaluation.
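The memory argument above can be illustrated with a toy sketch: full attention scores every query against every key (an L x L matrix), whereas LSH attention only scores pairs of positions that hash to the same bucket. The NumPy snippet below is a minimal illustration, not the paper's implementation; the bucket count, sequence length, and random-rotation hash are illustrative assumptions.

```python
import numpy as np

def lsh_buckets(x, n_buckets=8, seed=0):
    # Angular LSH via random rotations (the idea behind Reformer's hashing):
    # project each vector onto random directions and take the argmax over
    # the directions and their negations as the bucket id.
    rng = np.random.default_rng(seed)
    d = x.shape[-1]
    rotations = rng.normal(size=(d, n_buckets // 2))
    rotated = x @ rotations                       # (L, n_buckets // 2)
    rotated = np.concatenate([rotated, -rotated], axis=-1)
    return np.argmax(rotated, axis=-1)            # one bucket id per position

L, d, n_buckets = 1024, 64, 8                     # toy sizes, not from the paper
q = np.random.default_rng(1).normal(size=(L, d))

# Full attention materializes an L x L score matrix.
full_scores = L * L

# LSH attention only scores position pairs that share a bucket.
buckets = lsh_buckets(q, n_buckets=n_buckets)
lsh_scores = sum(int((buckets == b).sum()) ** 2 for b in range(n_buckets))

print(f"full attention score entries: {full_scores}")
print(f"LSH attention score entries:  {lsh_scores}")
```

With roughly uniform buckets, the pair count drops by about a factor of `n_buckets`, which is the source of the memory savings the abstract refers to (the reversible residual network saves additional activation memory on top of this).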