Jiaxing Liu (Tianjin University), Zhilei Liu (Tianjin University), Longbiao Wang (Tianjin University), Yuan Gao (Tianjin University), Lili Guo (Tianjin University) and Jianwu Dang (JAIST)
As fundamental research in affective computing, speech emotion recognition (SER) has attracted much attention. Unlike common deep learning tasks, SER is restricted by the scarcity of emotional speech datasets. In this paper, the vector quantized variational autoencoder (VQ-VAE) is introduced and trained on massive unlabeled data in an unsupervised manner. Benefiting from its invariant distribution encoding capability and discrete embedding space, the pre-trained VQ-VAE can extract latent representations from labeled data; these representations serve as additional source data, making training data abundant. While addressing the data scarcity issue, sequence information modeling, which is considered useful for SER, is also taken into account. The proposed sequence model, the temporal attention convolutional network (TACN), is simple yet effective at learning contextual information from limited data, a setting unfriendly to the complicated structures of recurrent neural network (RNN) based sequence models. To validate the effectiveness of the latent representation, t-distributed stochastic neighbor embedding (t-SNE) is introduced to analyze the visualizations. To verify the performance of the proposed TACN, quantitative classification results of commonly used sequence models are provided. The proposed model achieves state-of-the-art performance on IEMOCAP.
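The discrete embedding space the abstract mentions comes from VQ-VAE's vector-quantization step: each continuous encoder output is replaced by its nearest codebook embedding. The sketch below illustrates only that nearest-neighbor lookup, not the authors' full model; the function name, codebook values, and dimensionality are illustrative assumptions.

```python
import math

def vq_nearest(z, codebook):
    """Return (index, embedding) of the codebook entry closest to z.

    This is the discretisation step of VQ-VAE: the continuous encoder
    output z is mapped to argmin_k ||z - e_k||_2 over codebook entries
    e_k. A toy stand-in, not the paper's implementation.
    """
    best_k, best_d = 0, math.inf
    for k, e in enumerate(codebook):
        # squared Euclidean distance between z and candidate embedding e
        d = sum((zi - ei) ** 2 for zi, ei in zip(z, e))
        if d < best_d:
            best_k, best_d = k, d
    return best_k, codebook[best_k]

# usage: a hypothetical 3-entry codebook of 2-D embeddings
codebook = [(0.0, 0.0), (1.0, 1.0), (-1.0, 0.5)]
k, e = vq_nearest((0.9, 1.2), codebook)
# k == 1: (1.0, 1.0) is the nearest code
```

Because downstream layers only ever see codebook entries, the labeled emotional speech is encoded into the same discrete space learned from the massive unlabeled data, which is what lets the latent representations act as additional source data.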