Jianshu Zhao (Tokyo Institute of Technology), Shengzhou Gao (Tokyo Institute of Technology), and Takahiro Shinozaki (Tokyo Institute of Technology)
Target-speaker speech separation, owing to its importance in industrial applications, has long been an active research topic. The key metric for judging a separation algorithm remains its separation performance, i.e., the quality of the separated voice. In this paper, we present WaveFilter, a novel high-performance target-speaker speech separation architecture that operates on time-domain waveforms. Unlike most previous studies, which adopt time-frequency approaches, WaveFilter applies convolutional neural network (CNN) based feature extractors directly to the raw time-domain audio, in both the speech separation network and the auxiliary target-speaker feature extraction network. Our method achieves a 10.46 dB signal-to-noise ratio (SNR) improvement on the WSJ0 2-mix dataset and a 10.44 dB SNR improvement on the Librispeech dataset, substantially higher than existing approaches. It also achieves a 4.9 dB SNR improvement on the WSJ0 3-mix data. This demonstrates that WaveFilter can separate the target speaker's voice from multi-speaker mixtures without knowing the exact number of speakers in advance, which in turn shows that our method is ready for real-world applications.
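To make the time-domain front end concrete, the following is a minimal NumPy sketch of a CNN-style 1-D filter bank applied directly to a raw waveform, in the spirit of the encoder described above. The filter count, kernel length, and stride here are illustrative placeholders, not the actual WaveFilter hyperparameters, and random kernels stand in for learned CNN weights.

```python
import numpy as np

def conv1d_encoder(waveform, filters, stride):
    """Apply a bank of 1-D filters directly to raw time-domain audio.

    waveform: (T,) raw audio samples
    filters:  (N, L) bank of N filters of length L (stand-ins for learned kernels)
    stride:   hop size between analysis frames
    returns:  (N, num_frames) time-domain feature map
    """
    n_filters, length = filters.shape
    num_frames = (len(waveform) - length) // stride + 1
    # Slice the waveform into overlapping frames, then correlate each
    # frame with every filter via a single matrix multiplication.
    frames = np.stack([waveform[i * stride:i * stride + length]
                       for i in range(num_frames)])  # (num_frames, L)
    return filters @ frames.T                        # (N, num_frames)

# Toy example: 1 s of audio at 8 kHz, 256 random kernels of length 40, hop 20.
rng = np.random.default_rng(0)
audio = rng.standard_normal(8000)
kernels = rng.standard_normal((256, 40))
features = conv1d_encoder(audio, kernels, stride=20)
print(features.shape)  # (256, 399)
```

In a trained system the kernels would be learned end-to-end together with the separation network, so the feature map replaces a fixed time-frequency representation such as an STFT spectrogram.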