Wang Dai (Beijing Language and Culture University), Jinsong Zhang (Beijing Language and Culture University), Yingming Gao (Institute of Acoustics and Speech Communication, Technische Universität Dresden), Wei Wei (Beijing Language and Culture University), Dengfeng Ke (National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences), Binghuai Lin (MIG, Tencent Science and Technology Ltd., Beijing) and Yanlu Xie (Beijing Language and Culture University)
Abstract:
Formant tracking is one of the most fundamental problems in speech processing. Traditionally, formants are estimated using signal processing methods. Recent studies have shown that generic convolutional architectures can outperform recurrent networks on temporal tasks such as speech synthesis and machine translation. In this paper, we explored the use of a Temporal Convolutional Network (TCN) for formant tracking. In addition to the conventional implementation, we modified the architecture in three respects. First, we turned off the “causal” mode of the dilated convolutions so that they could also see future speech frames. Second, each hidden layer reused the outputs of all previous layers through dense connections. Third, we adopted a gating mechanism to alleviate the vanishing-gradient problem by selectively forgetting unimportant information. The model was validated on the open-access formant database VTR. Experiments showed that our model converged easily and achieved an overall mean absolute percent error (MAPE) of 8.2% on speech-labeled frames, compared with 9.4% (LSTM), 9.1% (Bi-LSTM) and 8.9% (TCN) for three competitive baselines.
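As a rough illustration of the first modification and of the reported metric, the following NumPy sketch contrasts a causal with a non-causal dilated 1-D convolution and computes a mean absolute percent error. It is not the authors' implementation: the function names `dilated_conv1d` and `mape`, the zero padding, the single-channel setup and the averaging over all supplied values are assumptions made for illustration, and the abstract does not specify how MAPE is aggregated over formants and frames.

```python
import numpy as np

def dilated_conv1d(x, w, dilation=1, causal=True):
    """Dilated 1-D convolution with zero padding; output has the same length as x.

    x: (T,) sequence of frame values; w: (K,) kernel taps.
    causal=True  -> y[t] depends only on x[t], x[t-d], x[t-2d], ...
    causal=False -> the receptive field is centred on t, so future frames are used.
    """
    x = np.asarray(x, dtype=float)
    w = np.asarray(w, dtype=float)
    T, K = len(x), len(w)
    span = (K - 1) * dilation            # total extent of the kernel in frames
    left = span if causal else span // 2  # all padding on the past side when causal
    right = span - left
    xp = np.pad(x, (left, right))
    y = np.zeros(T)
    for k in range(K):
        # tap k reads the padded input shifted by k * dilation frames
        y += w[k] * xp[k * dilation : k * dilation + T]
    return y

def mape(pred, ref):
    """Mean absolute percent error, in percent, over all supplied values."""
    pred = np.asarray(pred, dtype=float)
    ref = np.asarray(ref, dtype=float)
    return 100.0 * np.mean(np.abs(pred - ref) / np.abs(ref))

# An impulse at frame 5 shows which output frames can "see" it.
x = np.zeros(11)
x[5] = 1.0
w = np.ones(3)
causal_frames = np.nonzero(dilated_conv1d(x, w, dilation=2, causal=True))[0]
noncausal_frames = np.nonzero(dilated_conv1d(x, w, dilation=2, causal=False))[0]
```

With the causal setting, only frames at or after the impulse respond to it. With the non-causal setting, frame 3 also responds, which means the output at that frame uses information from a future frame.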