Mon-3-11-6 A Unified Framework for Low-Latency Speaker Extraction in Cocktail Party Environments

Yunzhe Hao(Institute of Automation, Chinese Academy of Sciences), Jiaming Xu(Institute of Automation, Chinese Academy of Sciences), Jing Shi(Institute of Automation, Chinese Academy of Sciences), Peng Zhang(Institute of Automation, Chinese Academy of Sciences), Lei Qin(Huawei Consumer Business Group) and Bo Xu(Institute of Automation, Chinese Academy of Sciences)
Abstract: Speech recognition technology in single-talker scenes has matured in recent years. However, in noisy environments, especially in multi-talker scenes, speech recognition performance is significantly reduced. To address the cocktail party problem, we propose a unified time-domain target speaker extraction framework. In this framework, we obtain a voiceprint from clean speech of the target speaker and then use that voiceprint to extract the same speaker's speech from mixed speech. Because the voiceprint identifies which speaker to extract, the framework avoids the permutation problem. In addition, a time-domain model avoids the phase reconstruction problem of traditional time-frequency domain models. Our framework is suitable for scenes where the set of people is relatively fixed and their voiceprints are easily registered, such as a car, a home, or a meeting room. The proposed global model, based on the dual-path recurrent neural network (DPRNN) block, achieved state-of-the-art performance on the speaker extraction task on the WSJ0-2mix dataset. We also built corresponding low-latency models. The results showed performance comparable to that of time-frequency domain models, with a much shorter upper limit on latency. We found that the performance of the low-latency model decreased gradually as latency decreased, a trade-off that is important when deploying models in actual application scenarios.
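
The latency comparison in the abstract can be illustrated with a small calculation. A frame-based model cannot emit output for a frame until the whole analysis window has arrived, so the window length sets a floor on algorithmic latency. The sketch below is only an illustration of that arithmetic: the 8 kHz sampling rate and the window sizes (256 samples for a time-frequency front end, 16 samples for a time-domain encoder) are assumed values chosen for the example, not figures taken from the paper, and `window_latency_ms` is a hypothetical helper.

```python
def window_latency_ms(window_samples: int, sample_rate_hz: int) -> float:
    """Return the time in milliseconds needed to fill one analysis window.

    This is a lower bound on algorithmic latency for frame-based processing;
    it ignores look-ahead context and compute time.
    """
    if window_samples <= 0 or sample_rate_hz <= 0:
        raise ValueError("window_samples and sample_rate_hz must be positive")
    return 1000.0 * window_samples / sample_rate_hz


# Assumed example values (not from the paper).
SAMPLE_RATE_HZ = 8000
STFT_WINDOW = 256      # hypothetical time-frequency analysis window
ENCODER_WINDOW = 16    # hypothetical time-domain encoder window

stft_ms = window_latency_ms(STFT_WINDOW, SAMPLE_RATE_HZ)
encoder_ms = window_latency_ms(ENCODER_WINDOW, SAMPLE_RATE_HZ)
print(f"time-frequency window: {stft_ms} ms, time-domain window: {encoder_ms} ms")
```

Under these assumed settings the short time-domain window fills much sooner than the longer time-frequency window, which is one reason a time-domain model can reach a lower latency bound. Any additional future context the model uses adds to this floor.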
