Yunzhe Hao (Institute of Automation, Chinese Academy of Sciences), Jiaming Xu (Institute of Automation, Chinese Academy of Sciences), Jing Shi (Institute of Automation, Chinese Academy of Sciences), Peng Zhang (Institute of Automation, Chinese Academy of Sciences), Lei Qin (Huawei Consumer Business Group) and Bo Xu (Institute of Automation, Chinese Academy of Sciences)
Abstract:
Speech recognition technology for single-talker scenarios has matured in recent years. In noisy environments, however, and especially in multi-talker scenarios, recognition performance degrades significantly. To address the cocktail party problem, we propose a unified time-domain target speaker extraction framework. In this framework, we first obtain a voiceprint from clean speech of the target speaker and then use that voiceprint to extract the same speaker's speech from a mixture. Voiceprint information lets the framework avoid the permutation problem, and time-domain modeling avoids the phase reconstruction problem of traditional time-frequency domain models. The framework suits scenarios where the set of speakers is relatively fixed and their voiceprints are easily registered, such as a car, a home, or a meeting room. The proposed global model, built on the dual-path recurrent neural network (DPRNN) block, achieved state-of-the-art performance on the speaker extraction task on the WSJ0-2mix dataset. We also built corresponding low-latency models, which showed comparable performance with a much lower upper bound on latency than time-frequency domain models. We found that the performance of the low-latency model degraded gradually as latency decreased, an important consideration when deploying models in real application scenarios.
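
To make the two-stage pipeline described above concrete, here is a minimal PyTorch sketch of the conditioning scheme: a speaker encoder pools enrollment speech into a fixed voiceprint embedding, and a time-domain separator, conditioned on that embedding, masks the encoded mixture to recover only the target speaker. The class name, layer sizes, and the single GRU standing in for the stacked DPRNN blocks are illustrative assumptions for exposition, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class TargetSpeakerExtractor(nn.Module):
    """Illustrative sketch of voiceprint-conditioned time-domain extraction.

    A 1-D conv encoder/decoder pair replaces the STFT/iSTFT, so no phase
    reconstruction is needed. All module choices here are placeholders.
    """

    def __init__(self, n_filters=64, emb_dim=128, kernel_size=16):
        super().__init__()
        # Learned time-domain analysis/synthesis (no phase to reconstruct).
        self.encoder = nn.Conv1d(1, n_filters, kernel_size,
                                 stride=kernel_size // 2, bias=False)
        self.decoder = nn.ConvTranspose1d(n_filters, 1, kernel_size,
                                          stride=kernel_size // 2, bias=False)
        # Stand-in voiceprint network: frame features pooled over time.
        self.spk_encoder = nn.Sequential(
            nn.Conv1d(1, emb_dim, kernel_size, stride=kernel_size // 2),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
        )
        # Stand-in for the stacked DPRNN blocks; a single unidirectional
        # GRU keeps the sketch short while preserving the data flow.
        self.separator = nn.GRU(n_filters + emb_dim, n_filters,
                                batch_first=True)
        self.mask = nn.Sequential(nn.Linear(n_filters, n_filters),
                                  nn.Sigmoid())

    def forward(self, mixture, enrollment):
        # mixture, enrollment: (batch, samples)
        mix_feat = self.encoder(mixture.unsqueeze(1))           # (B, F, T)
        voiceprint = self.spk_encoder(enrollment.unsqueeze(1))  # (B, E, 1)
        voiceprint = voiceprint.expand(-1, -1, mix_feat.size(2))
        # Condition every mixture frame on the target voiceprint.
        x = torch.cat([mix_feat, voiceprint], dim=1).transpose(1, 2)
        h, _ = self.separator(x)                                # (B, T, F)
        masked = mix_feat * self.mask(h).transpose(1, 2)        # target mask
        return self.decoder(masked).squeeze(1)                  # (B, samples)

model = TargetSpeakerExtractor()
est = model(torch.randn(2, 16000), torch.randn(2, 16000))  # 1 s at 16 kHz
print(est.shape)  # torch.Size([2, 16000])
```

Because the separator is conditioned on a fixed voiceprint rather than producing one output stream per source, the output has no speaker ordering ambiguity, which is why this formulation sidesteps the permutation problem.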