Keisuke Kinoshita (NTT), Thilo von Neumann (Paderborn University), Marc Delcroix (NTT Communication Science Laboratories), Tomohiro Nakatani (NTT Corporation) and Reinhold Haeb-Umbach (Paderborn University)
Abstract:
Time-domain audio source separation based on the dual-path recurrent neural network (DPRNN) has recently improved separation performance considerably. DPRNN is a simple but effective model for long sequential data. It first splits the input into short chunks and then alternately applies intra-chunk and inter-chunk RNNs for local and global modeling, respectively. While DPRNN efficiently models sequences of utterance length, i.e., about 5 to 10 seconds of data, it is harder to apply to longer sequences such as whole conversations consisting of multiple utterances. To mitigate this problem, this paper proposes the multi-path RNN (MPRNN), a generalization of DPRNN that models the input data hierarchically. In the MPRNN framework, the input data is represented at several (≥3) resolutions, each of which is modeled by a dedicated RNN sub-module. For example, the RNN sub-module operating at the finest resolution may model temporal relationships within a phoneme, while the sub-module handling the coarsest resolution may capture relationships between utterances, such as speaker information. We perform experiments on simulated dialogue-like mixtures and show that the proposed MPRNN outperforms the current state-of-the-art DPRNN framework.
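To make the hierarchical segmentation concrete, the following is a minimal sketch of how a frame sequence could be arranged at three resolutions and which sub-sequences each RNN path would then process. It is an illustration only, not the authors' implementation: the function names (`split`, `three_level_view`, `path_sequences`), the zero padding, and the use of non-overlapping chunks are assumptions made for brevity, and no actual RNN is applied.

```python
def split(items, size):
    """Split a list into consecutive, non-overlapping groups of `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def three_level_view(frames, chunk_len, chunks_per_block, pad=0.0):
    """Arrange frames as blocks[b][c][t] (block, chunk, frame-in-chunk).

    Assumption: the sequence is padded at the end with `pad` so that its
    length is a multiple of chunk_len * chunks_per_block.
    """
    block_len = chunk_len * chunks_per_block
    n_pad = (-len(frames)) % block_len
    padded = list(frames) + [pad] * n_pad
    return split(split(padded, chunk_len), chunks_per_block)


def path_sequences(blocks):
    """List the sequences that each of the three paths would run over.

    Assumes `blocks` is non-empty and rectangular, as produced above.
    """
    n_blocks = len(blocks)
    n_chunks = len(blocks[0])
    n_frames = len(blocks[0][0])
    # Finest resolution: consecutive frames inside one chunk.
    intra_chunk = [blocks[b][c] for b in range(n_blocks) for c in range(n_chunks)]
    # Middle resolution: same frame position across the chunks of one block.
    inter_chunk = [[blocks[b][c][t] for c in range(n_chunks)]
                   for b in range(n_blocks) for t in range(n_frames)]
    # Coarsest resolution: same chunk and frame position across all blocks.
    inter_block = [[blocks[b][c][t] for b in range(n_blocks)]
                   for c in range(n_chunks) for t in range(n_frames)]
    return intra_chunk, inter_chunk, inter_block


# Toy input: 12 "frames" labelled by their time index.
frames = list(range(12))
blocks = three_level_view(frames, chunk_len=2, chunks_per_block=3)
intra_chunk, inter_chunk, inter_block = path_sequences(blocks)
```

In this toy setting, the intra-chunk path sees neighbouring frames, the inter-chunk path sees frames spaced one chunk apart, and the inter-block path sees frames spaced one block apart, which is the sense in which each path operates at a different resolution. Dropping the inter-block path recovers a dual-path arrangement.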