Mon-1-9-4 WISE: Word-Level Interaction-Based Multimodal Fusion for Speech Emotion Recognition

Guang Shen (Harbin Engineering University), Riwei Lai (Harbin Engineering University), Rui Chen (Harbin Engineering University), Yu Zhang (Southern University of Science and Technology), Kejia Zhang (Harbin Engineering University), Qilong Han (Harbin Engineering University), and Hongtao Song (Harbin Engineering University)
Abstract: Despite its numerous real-world applications, speech emotion recognition remains a technically challenging problem. Effectively leveraging the multiple modalities inherent in speech data (e.g., audio and text) is key to accurate classification. Existing studies typically fuse multimodal features at the utterance level and largely neglect the dynamic interplay of features from different modalities at a fine-grained level over time. In this paper, we explicitly model dynamic interactions between audio and text at the word level via interaction units placed between two long short-term memory (LSTM) networks representing audio and text. We also devise a hierarchical representation of audio information at the frame, phoneme, and word levels, which substantially improves the expressiveness of the resulting audio features. Finally, we propose WISE, a novel word-level interaction-based multimodal fusion framework for speech emotion recognition that accommodates the aforementioned components. We evaluate WISE on the public IEMOCAP benchmark corpus and demonstrate that it outperforms state-of-the-art methods.
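The abstract specifies only the high-level architecture (two modality-specific LSTMs coupled by word-level interaction units), not its internals. The following PyTorch sketch illustrates one plausible reading of that idea under stated assumptions: the interaction unit is modeled as a simple learned gate over the two per-word hidden states, and all class, layer, and dimension names (WordLevelFusion, gate, audio_dim, etc.) are hypothetical, not the authors' implementation.

```python
import torch
import torch.nn as nn

class WordLevelFusion(nn.Module):
    """Sketch of word-level audio-text interaction for emotion recognition.

    Two LSTMs run over word-aligned audio and text features; at each word
    step an interaction unit (here a learned sigmoid gate -- an assumption,
    the paper's exact interaction-unit design is not given in the abstract)
    mixes the two hidden states before utterance-level classification.
    """

    def __init__(self, audio_dim, text_dim, hidden_dim, num_classes):
        super().__init__()
        self.audio_lstm = nn.LSTM(audio_dim, hidden_dim, batch_first=True)
        self.text_lstm = nn.LSTM(text_dim, hidden_dim, batch_first=True)
        # Interaction unit: a gate computed from both modalities per word.
        self.gate = nn.Linear(2 * hidden_dim, hidden_dim)
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, audio_words, text_words):
        # audio_words, text_words: (batch, num_words, feat_dim), already
        # aligned so that step t of both streams covers the same word t.
        h_a, _ = self.audio_lstm(audio_words)
        h_t, _ = self.text_lstm(text_words)
        z = torch.sigmoid(self.gate(torch.cat([h_a, h_t], dim=-1)))
        fused = z * h_a + (1 - z) * h_t   # per-word fused representation
        utterance = fused.mean(dim=1)     # pool word states over the utterance
        return self.classifier(utterance)

# Example: a batch of 2 utterances, each aligned to 12 words.
model = WordLevelFusion(audio_dim=64, text_dim=300, hidden_dim=128, num_classes=4)
logits = model(torch.randn(2, 12, 64), torch.randn(2, 12, 300))
print(logits.shape)  # torch.Size([2, 4])
```

The key design point the abstract emphasizes is that fusion happens per word step rather than once per utterance, so each modality can influence the other as the sequence unfolds; the word-level audio inputs here would come from the paper's frame-to-phoneme-to-word hierarchical pooling, which this sketch does not reproduce.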