Wed-2-4-7 Multi-path RNN for hierarchical modeling of long sequential data and its application to speaker stream separation

Keisuke Kinoshita (NTT), Thilo von Neumann (Paderborn University), Marc Delcroix (NTT Communication Science Laboratories), Tomohiro Nakatani (NTT Corporation) and Reinhold Haeb-Umbach (Paderborn University)
Abstract: Time-domain audio source separation based on the dual-path recurrent neural network (DPRNN) has recently improved separation performance greatly. DPRNN is a simple but effective model for long sequential data. It first splits the input data into short chunks and then alternately applies intra-chunk and inter-chunk RNNs for local and global modeling, respectively. While DPRNN is quite efficient at modeling sequential data of the length of an utterance, i.e., about 5 to 10 seconds, it is harder to apply to longer sequences such as whole conversations consisting of multiple utterances. To mitigate this problem, this paper proposes the multi-path RNN (MPRNN), a generalized version of DPRNN that models the input data in a hierarchical manner. In the MPRNN framework, the input data is represented at several (≥3) resolutions, each of which is modeled by a specific RNN sub-module. For example, the RNN sub-module that deals with the finest resolution may model temporal relationships within a phoneme, while the RNN sub-module handling the coarsest resolution may capture relationships between utterances, such as speaker information. We perform experiments using simulated dialogue-like mixtures and show that the proposed MPRNN outperforms the current state-of-the-art DPRNN framework.
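
The chunking idea described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names (`segment`, `running_mean`, `multi_path_block`), the non-overlapping segmentation, the residual connection, and the running mean used as a stand-in for each RNN sub-module are all assumptions made for illustration. The sketch only shows how a long sequence can be folded into nested segments so that one sequence model runs along each time axis, from the finest resolution to the coarsest.

```python
import numpy as np

def segment(x, sizes):
    """Fold a (T, F) sequence into nested, non-overlapping segments.

    sizes lists the segment length at each level, finest first, e.g.
    (chunk_len, chunks_per_block). The result has shape
    (n_top, ..., sizes[1], sizes[0], F); the tail of x is zero-padded.
    Non-overlapping segments are an assumption of this sketch.
    """
    T, F = x.shape
    span = int(np.prod(sizes))          # frames covered by one top-level unit
    n_top = -(-T // span)               # ceiling division
    padded = np.zeros((n_top * span, F), dtype=x.dtype)
    padded[:T] = x
    return padded.reshape((n_top, *reversed(sizes), F))

def running_mean(h, axis):
    """Toy causal stand-in for an RNN: running mean along one axis."""
    steps = np.arange(1, h.shape[axis] + 1)
    shape = [1] * h.ndim
    shape[axis] = -1
    return np.cumsum(h, axis=axis) / steps.reshape(shape)

def multi_path_block(h):
    """Apply one sequence model per time axis, finest resolution first."""
    for axis in range(h.ndim - 2, -1, -1):   # every axis except features
        h = h + running_mean(h, axis)         # residual connection (assumed)
    return h

# Example: 30 frames with 2 features each.
x = np.arange(60, dtype=float).reshape(30, 2)

dual = segment(x, (4,))       # two time axes: within chunk, across chunks
multi = segment(x, (4, 3))    # three time axes: chunk, block, across blocks
out = multi_path_block(multi)
print(dual.shape, multi.shape, out.shape)
```

With `sizes=(4,)` the folded tensor has two time axes, which corresponds to the dual-path case; adding a second entry to `sizes` gives a third, coarser axis, which is the hierarchical extension the abstract describes. In a real model each `running_mean` call would be replaced by a trained RNN sub-module.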