Ziping Zhao (Tianjin Normal University), Qifei Li (Tianjin Normal University), Nicholas Cummins (University of Augsburg), Bin Liu (National Laboratory of Pattern Recognition, CASIA, Beijing), Haishuai Wang (Tianjin Normal University), Jianhua Tao (National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences) and Björn Schuller (University of Augsburg / Imperial College London)
A fast-growing area of mental health research is the search for speech-based objective markers of conditions such as depression. One vital challenge in developing speech-based depression severity assessment systems is the extraction of depression-relevant features from speech signals. To deliver a more comprehensive feature representation, we explore the benefits of a hybrid network that encodes depression-related characteristics in speech for the task of depression severity assessment. The proposed network combines self-attention networks (SAN) trained on low-level acoustic features with deep convolutional neural networks (DCNN) trained on 3D Log-Mel spectrograms. The feature representations learnt by the SAN and DCNN are concatenated, and average pooling is used to aggregate the complementary segment-level features. Finally, support vector regression is applied to predict a speaker’s Beck Depression Inventory-II score. Experiments on a subset of the Audio-Visual Depressive Language Corpus, as used in the 2013 and 2014 Audio/Visual Emotion Challenges, demonstrate the effectiveness of our proposed hybrid approach.
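
The fusion and aggregation steps described above can be illustrated with a minimal sketch. It assumes that each recording has already been split into segments and that the SAN and DCNN branches each yield one embedding per segment. The function names `fuse_segments` and `average_pool`, the toy values, and the embedding sizes are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
from statistics import fmean


def fuse_segments(san_feats, dcnn_feats):
    """Concatenate the SAN and DCNN embeddings of each segment.

    Both inputs are lists with one embedding (a list of floats) per segment.
    Hypothetical helper: the name and interface are assumptions.
    """
    if len(san_feats) != len(dcnn_feats):
        raise ValueError("both branches must cover the same segments")
    return [list(s) + list(d) for s, d in zip(san_feats, dcnn_feats)]


def average_pool(segment_feats):
    """Average the fused embeddings over all segments of one recording."""
    if not segment_feats:
        raise ValueError("at least one segment is required")
    # zip(*...) iterates over feature dimensions, so each mean runs across segments.
    return [fmean(dim) for dim in zip(*segment_feats)]


# Toy example: two segments, a 2-dimensional SAN embedding and a
# 1-dimensional DCNN embedding per segment (illustrative values only).
san = [[1.0, 2.0], [3.0, 4.0]]
dcnn = [[0.5], [1.5]]

fused = fuse_segments(san, dcnn)      # one 3-dimensional vector per segment
utterance_vec = average_pool(fused)   # one 3-dimensional vector per recording
print(utterance_vec)
```

In a full pipeline, the pooled recording-level vector would serve as the input to a support vector regressor that predicts the Beck Depression Inventory-II score; that stage is omitted here to keep the sketch free of external dependencies.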