Jeno Szep (University of Arizona) and Salim Hariri (University of Arizona)
In this study, we address the ComParE 2020 Paralinguistics Mask sub-challenge, in which the task is to detect, from short speech segments, whether the speaker is wearing a surgical mask. In our approach, we propose a computer-vision-based pipeline that leverages the deep convolutional neural network image classifiers developed in recent years and applies this technology to a specific class of spectrograms. Several linear- and logarithmic-scale spectrograms were tested, and the best performance is achieved on linear-scale, 3-channel spectrograms created from the audio segments. A single-model image classifier yields a 6.1% better result than the best single-dataset baseline model. An ensemble of our models further improves accuracy, achieving 73.0% UAR when trained only on the ‘train’ dataset and reaching 80.1% UAR on the test set when training also includes the ‘devel’ dataset, a result 8.3% higher than the baseline. We also provide an activation-mapping analysis to identify frequency ranges that are critical in the ‘mask’ versus ‘clear’ classification.
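As an illustration of the kind of input described above, the following minimal sketch computes a linear-frequency-scale magnitude spectrogram from a raw audio segment and stacks it into three channels so it can be fed to a standard image classifier. The FFT size, hop length, sample rate, and the choice to replicate one spectrogram across all three channels are assumptions for illustration, not the paper's exact configuration.

```python
import numpy as np
from scipy.signal import spectrogram


def linear_scale_spectrogram_3ch(audio, sr=16000, n_fft=512, hop=128):
    """Compute a linear-frequency magnitude spectrogram and stack it
    into three channels, image-classifier style (hypothetical settings)."""
    freqs, times, spec = spectrogram(
        audio, fs=sr, nperseg=n_fft, noverlap=n_fft - hop, mode="magnitude"
    )
    # log-compress magnitudes to tame dynamic range, then min-max
    # normalize to [0, 1] so the array behaves like image pixel data
    spec = np.log1p(spec)
    spec = (spec - spec.min()) / (spec.max() - spec.min() + 1e-9)
    # replicate into 3 channels (H, W, 3) so pretrained RGB CNNs accept it
    return np.stack([spec] * 3, axis=-1)


# one second of synthetic audio standing in for a speech segment
audio = np.random.randn(16000).astype(np.float32)
img = linear_scale_spectrogram_3ch(audio)
print(img.shape)  # (freq_bins, time_frames, 3)
```

In practice the three channels need not be identical copies; they could carry, for example, spectrograms at different window lengths, which is one common way to exploit the RGB input of pretrained image networks.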