Tomoya Koike(The University of Tokyo), Kun Qian(The University of Tokyo), Björn Schuller(University of Augsburg / Imperial College London) and Yoshiharu Yamamoto(The University of Tokyo)
Human hand-crafted features are always regarded as expensive, time-consuming, and difficult in almost all of the machine-learning-related tasks. First, those well-designed features extremely rely on human expert domain knowledge, which may restrain the collaboration work across fields. Second, the features extracted in such a brute-force scenario may not be easy to be transferred to another task, which means a series of new features should be designed. To this end, we introduce a method based on a transfer learning strategy combined with data augmentation techniques for the ComParE 2020 Challenge Mask Sub-Challenge. Unlike the previous studies mainly based on pre-trained models by image data, we use a pre-trained model based on large scale audio data, i.e., AudioSet. In addition, the SpecAugment and mixup methods are used to improve the generalisation of the deep models. Experimental results demonstrate that the best-proposed model can significantly (p<.001, by one-tailed z-test) improve the unweighted average recall (UAR) from 71.8% (baseline) to 76.2% on the test set. Finally, the best result, i.e., 77.5% of the UAR on the test set, is achieved by a late fusion of the two best proposed models and the best single model in the baseline.