Xinyuan Qian (National University of Singapore) and Shoufeng Lin (National University of Singapore)
Abstract:
Multi-speaker tracking using both audio and video modalities is a key task in human-robot interaction and video conferencing. The complementary nature of audio and video signals improves tracking robustness against noise and outliers compared with uni-modal approaches. However, online tracking of multiple speakers via audio-video fusion, especially when the number of targets is not known in advance, remains an open challenge. In this paper, we propose a Generalized Labeled Multi-Bernoulli-based framework that jointly estimates the number of targets and their respective states online. Experimental results on the AV16.3 dataset demonstrate the effectiveness of the proposed method.