Krishna D N (Sizzle.gg) and Ankita Patil (Sizzle.gg)
Abstract:
In this work, we propose a new approach for multimodal emotion recognition using cross-modal attention and raw-waveform-based convolutional neural networks. Our approach uses audio
and text information to predict the emotion label. We use an audio encoder to process the raw audio waveform and extract high-level features from the audio, and a text encoder to extract
high-level semantic information from the text. We use cross-modal attention, where the features from the audio encoder attend to the features from the text encoder and vice versa. This helps model the
interaction between the speech and text sequences and extract the most relevant features for emotion recognition. Our experiments show that the proposed approach obtains state-of-the-art results on the IEMOCAP dataset [1]. We obtain a 1.9% absolute improvement in accuracy compared to the previous state-of-the-art method [2]. Our proposed approach uses a 1D convolutional neural network to process the raw waveform instead of spectrogram features. Our experiments also show that processing the raw
waveform gives a 0.54% improvement over a spectrogram-based model.
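The cross-modal attention described above can be sketched with plain NumPy as scaled dot-product attention in both directions: audio features query text features, and text features query audio features. This is only an illustrative sketch under assumed feature shapes; the actual audio/text encoders, learned projection matrices, and classification head from the paper are not shown.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(queries, keys_values):
    """Scaled dot-product attention: one modality's features (queries)
    attend over the other modality's features (keys and values)."""
    d_k = queries.shape[-1]
    scores = queries @ keys_values.T / np.sqrt(d_k)  # (Tq, Tkv)
    weights = softmax(scores, axis=-1)               # each row sums to 1
    return weights @ keys_values                     # (Tq, d)

rng = np.random.default_rng(0)
d = 64
audio_feats = rng.standard_normal((100, d))  # e.g. 100 audio frames (hypothetical)
text_feats = rng.standard_normal((20, d))    # e.g. 20 text tokens (hypothetical)

# Audio attends to text, and text attends to audio ("vice versa").
audio_attended = cross_modal_attention(audio_feats, text_feats)
text_attended = cross_modal_attention(text_feats, audio_feats)
print(audio_attended.shape, text_attended.shape)  # (100, 64) (20, 64)
```

Each attended output keeps the query modality's sequence length but is a convex combination of the other modality's features; in the full model such outputs would typically be pooled and fed to an emotion classifier.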