Zhi Zhu (Fairy Devices Inc.) and Yoshinao Sato (Fairy Devices Inc.)
Research on affective computing has achieved remarkable success with the development of deep learning.
One of the major difficulties in emotion recognition is inconsistent criteria for emotion categorization between multiple corpora.
Most previous studies using multiple corpora discard or merge some of the emotion classes.
This practice causes a catastrophic loss of information about emotion categorization.
Furthermore, the influences of corpus-specific factors other than emotions, such as languages, speech registers, and recording environments, should be eliminated to fully utilize multiple corpora.
In this paper, we address the challenge of reconciling multiple emotion corpora by learning a corpus-independent emotion encoding disentangled from all the remaining factors without causing catastrophic information loss.
For this purpose, we propose a model that consists of a shared emotion encoder, multiple emotion classifiers, and an adversarial corpus discriminator.
This model is trained with multi-task learning harnessed by adversarial learning.
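The architecture described above can be sketched as a forward pass: a shared encoder produces a corpus-independent embedding, each corpus keeps its own emotion classifier head (so no classes are discarded or merged), and a discriminator tries to identify the source corpus. All dimensions and class counts below are illustrative assumptions, and the adversarial part of training (a gradient-reversal step between the discriminator and the encoder) is only described in comments, not implemented:

```python
import numpy as np

rng = np.random.default_rng(0)

FEAT_DIM = 40  # assumed acoustic feature dimensionality
EMB_DIM = 16   # assumed embedding size
N_CLASSES = {"EmoDB": 7, "CREMA-D": 6}  # per-corpus emotion class counts (assumed)
N_CORPORA = len(N_CLASSES)

# Shared emotion encoder (a single linear layer stands in for a deep network).
W_enc = rng.normal(size=(FEAT_DIM, EMB_DIM))

# One emotion classifier head per corpus, preserving each corpus's own label set.
W_cls = {name: rng.normal(size=(EMB_DIM, k)) for name, k in N_CLASSES.items()}

# Adversarial corpus discriminator: predicts which corpus an embedding came from.
# During training, its gradient would be sign-flipped (gradient reversal) before
# flowing into the encoder, pushing the encoder toward corpus-independent embeddings.
W_disc = rng.normal(size=(EMB_DIM, N_CORPORA))

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def forward(x, corpus):
    """Route a batch through the shared encoder, that corpus's classifier
    head, and the corpus discriminator."""
    h = np.tanh(x @ W_enc)                  # shared, (ideally) corpus-independent embedding
    p_emotion = softmax(h @ W_cls[corpus])  # corpus-specific emotion posterior
    p_corpus = softmax(h @ W_disc)          # discriminator's corpus posterior
    return h, p_emotion, p_corpus

x = rng.normal(size=(4, FEAT_DIM))          # a toy batch of 4 utterances
h, p_emo, p_cor = forward(x, "EmoDB")
```

In the multi-task setup, the emotion classification losses from all heads and the (reversed) discriminator loss would be summed into a single training objective for the shared encoder.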
We conducted speech emotion classification experiments with our method on two corpora, namely, EmoDB and CREMA-D.
The results demonstrate that our method achieves higher accuracies than mono-corpus models.
In addition, the results indicate that the proposed method suppresses corpus-dependent factors other than emotion in the embedding space.