Zhengyang Chen(MoE Key Lab of Artificial Intelligence SpeechLab, Department of Computer Science and EngineeringShanghai Jiao Tong University, Shanghai), Shuai Wang(Shanghai Jiao Tong University) and Yanmin Qian(Shanghai Jiao Tong University)
The information from different modalities usually compensates each other. In this paper, we use the audio and visual data in VoxCeleb dataset to do person verification. We explored different information fusion strategies and loss functions for the audio-visual person verification system at the embedding level. System performance is evaluated using the public trail lists on VoxCeleb1 dataset. Our best system using audio-visual knowledge at the embedding level achieves 0.585%, 0.427% and 0.735% EER on the three official trial lists of VoxCeleb1, which are the best reported results on this dataset. Moreover, to imitate more complex test environment with one modality corrupted or missing, we construct a noisy evaluation set based on VoxCeleb1 dataset. We use a data augmentation strategy at the embedding level to help our audio-visual system to distinguish the noisy and the clean embedding. With such data augmented strategy, the proposed audio-visual person verification system is more robust on the noisy evaluation set.