Mon-1-9-3 Multi-modal Attention for Speech Emotion Recognition

Zexu Pan (National University of Singapore), Zhaojie Luo (Osaka University), Jichen Yang (National University of Singapore) and Haizhou Li (National University of Singapore)

Abstract: Emotion represents an essential aspect of human speech that is manifested in speech prosody. Speech, visual, and textual cues are complementary in human communication. In this paper, we study a hybrid fusion method, referred to as the multi-modal attention network (MMAN), to make use of visual and textual cues in speech emotion recognition. We propose a novel multi-modal attention mechanism, cLSTM-MMA, which facilitates attention across the three modalities and selectively fuses the information. cLSTM-MMA is fused with other uni-modal sub-networks in the late fusion. The experiments show that speech emotion recognition benefits significantly from visual and textual cues, and that the proposed cLSTM-MMA alone is competitive with other fusion methods in terms of accuracy, but with a much more compact network structure. The proposed hybrid network MMAN achieves state-of-the-art performance on the IEMOCAP database for emotion recognition.

