Samuele Cornell (Università Politecnica delle Marche), Maurizio Omologo (Fondazione Bruno Kessler - irst), Stefano Squartini (Università Politecnica delle Marche) and Emmanuel Vincent (Inria)
Abstract:
We consider the problem of detecting speaker activity and counting overlapping speakers in distant-microphone recordings. We treat supervised Voice Activity Detection (VAD), Overlapped Speech Detection (OSD), joint VAD+OSD, and speaker counting as instances of a general Overlapped Speech Detection and Counting (OSDC) task, and we design a Temporal Convolutional Network (TCN) based method to address it.
We show that TCNs significantly outperform state-of-the-art methods on two real-world distant speech datasets. In particular, our best architecture obtains, for OSD, absolute improvements in Average Precision of 29.1% and 25.5% over previous techniques on the AMI and CHiME-6 datasets, respectively.
Furthermore, we find that generalization for joint VAD+OSD improves by using a speaker counting objective rather than a VAD+OSD objective.
We also study the effectiveness of forced-alignment-based labeling and data augmentation, and show that both can improve OSD performance.
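To illustrate how VAD, OSD, joint VAD+OSD, and speaker counting can be viewed as instances of one OSDC task, the following minimal sketch derives the frame-level targets for each of them from a single sequence of per-frame speaker counts. It is an illustration only, not the authors' implementation: the function name `count_to_task_labels`, the dictionary keys, and the `max_count` cap are hypothetical choices made for this example.

```python
from typing import Dict, List


def count_to_task_labels(counts: List[int], max_count: int = 4) -> Dict[str, List[int]]:
    """Derive per-frame targets for each task from per-frame speaker counts.

    `max_count` is an assumed cap for the counting classes (counts above it
    are merged into the last class); it is not taken from the source.
    """
    return {
        # VAD: is at least one speaker active?
        "vad": [int(c >= 1) for c in counts],
        # OSD: are at least two speakers active at the same time?
        "osd": [int(c >= 2) for c in counts],
        # Joint VAD+OSD: 0 = no speech, 1 = one speaker, 2 = overlapped speech
        "vad_osd": [min(c, 2) for c in counts],
        # Speaker counting: number of active speakers, capped at max_count
        "counting": [min(c, max_count) for c in counts],
    }


if __name__ == "__main__":
    labels = count_to_task_labels([0, 1, 2, 3, 5])
    for task, target in labels.items():
        print(task, target)
```

Under this view, a model trained on the counting targets can still be evaluated on VAD or OSD by merging classes, which is one way to read the comparison between the speaker counting objective and the VAD+OSD objective.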