Jiahao Xu (The University of Sydney), Kun Hu (The University of Sydney), Chang Xu (The University of Sydney), Duc Chung Tran (Computing Fundamental Department, FPT University), and Zhiyong Wang (The University of Sydney)
Predicting time-frequency (T-F) masks and applying them to mixture signals has been successfully used for speech separation. However, existing studies have not fully exploited the identity context of a speaker when inferring masks. In this paper, we propose a novel speaker-aware monaural speech
separation model. We first devise an encoder to disentangle speaker identity information under the supervision of an auxiliary speaker verification task. We then develop a spectrogram masking network to predict speaker masks, which are applied to the mixture signal to reconstruct the source signals. Experimental results on two WSJ0 mixed datasets demonstrate that our proposed model outperforms existing models in different separation scenarios.
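
As a rough illustration of the masking step described above, the following sketch applies per-speaker T-F masks to a mixture spectrogram. It is not the proposed model: the masks here are ratio masks computed from known sources, used only as a stand-in for the masks a network would predict, and the function names (ratio_masks, apply_masks), array shapes, and random spectrograms are all assumptions made for the example.

```python
import numpy as np

def ratio_masks(source_mags, eps=1e-8):
    """Soft masks, one per source, from source magnitudes.

    Stand-in for network-predicted masks (an assumption for illustration).
    In every T-F bin the masks sum to approximately 1.
    """
    source_mags = np.asarray(source_mags, dtype=float)
    total = source_mags.sum(axis=0, keepdims=True)
    return source_mags / (total + eps)

def apply_masks(mixture_spec, masks):
    """Multiply each mask element-wise with the mixture spectrogram."""
    return masks * mixture_spec[np.newaxis, ...]

# Hypothetical complex spectrograms (frequency bins x frames) for two speakers.
rng = np.random.default_rng(0)
n_freq, n_frames = 5, 4
s1 = rng.normal(size=(n_freq, n_frames)) + 1j * rng.normal(size=(n_freq, n_frames))
s2 = rng.normal(size=(n_freq, n_frames)) + 1j * rng.normal(size=(n_freq, n_frames))
mixture = s1 + s2

masks = ratio_masks([np.abs(s1), np.abs(s2)])
estimates = apply_masks(mixture, masks)  # shape: (2, n_freq, n_frames)
```

Each slice of estimates is a masked copy of the mixture spectrogram; in a full pipeline an inverse STFT would turn it back into a waveform. Because the masks sum to roughly one in each bin, the estimates add back up to the mixture.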