Wed-1-11-1 Singing Voice Extraction with Attention based Spectrograms Fusion

Hao Shi(Tianjin Key Laboratory of Cognitive Computing and Application, College of Intelligence and Computing, Tianjin University), Longbiao Wang(Tianjin University), Sheng Li(National Institute of Information and Communications Technology (NICT), Advanced Speech Technology Laboratory), Chenchen Ding(NICT), Meng Ge(Tianjin University), Nan Li(Tianjin University), Jianwu Dang(JAIST) and Hiroshi Seki(Huiyan Technology (Tianjin) Co. Ltd., Tianjin)
Abstract: We propose a novel attention-based spectrogram fusion system with minimum difference mask (MDM) estimation for singing voice extraction. Compared with previous works that use a fully connected neural network, our system takes advantage of the multi-head attention mechanism. Specifically, we 1) explore a variety of embedding methods for the multiple spectrograms used as the input of the attention mechanism, which provide multi-scale correlation information between adjacent frames of the spectrograms; 2) add a regularization term to the loss function to obtain better spectrogram continuity; 3) use the phase of the linearly fused waveform to reconstruct the final waveform, which reduces the impact of inconsistent spectrograms. Experiments on the MIR-1K dataset show that our system consistently improves quantitative evaluation results in terms of perceptual evaluation of speech quality (PESQ), signal-to-distortion ratio (SDR), signal-to-interference ratio (SIR), and signal-to-artifact ratio (SAR).
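To make the fusion idea concrete, the sketch below shows a minimal self-attention fusion of several candidate magnitude spectrograms in NumPy. This is a hypothetical illustration of attention-based spectrogram fusion, not the authors' architecture: the function name `attention_fusion`, the single-head scaled dot-product formulation, and the final averaging step are all assumptions; the paper uses a learned multi-head attention mechanism with trained embeddings.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax along the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_fusion(specs):
    """Fuse candidate magnitude spectrograms with self-attention.

    specs: array of shape (n_systems, n_frames, n_freq), one spectrogram
    estimate per pre-separation system. Returns a fused (n_frames, n_freq)
    spectrogram. Illustrative only; the paper's model is learned.
    """
    # rearrange so that, for each time frame, the candidate estimates
    # form the attention sequence: (n_frames, n_systems, n_freq)
    x = specs.transpose(1, 0, 2)
    d = x.shape[-1]
    # scaled dot-product scores between candidates, per frame:
    # (n_frames, n_systems, n_systems)
    scores = x @ x.transpose(0, 2, 1) / np.sqrt(d)
    weights = softmax(scores, axis=-1)
    # attention-weighted combination of the candidates, then average
    fused = weights @ x            # (n_frames, n_systems, n_freq)
    return fused.mean(axis=1)      # (n_frames, n_freq)
```

In a trained system the queries, keys, and values would come from learned projections (and multiple heads), and the fused magnitude would be combined with a phase estimate, e.g. the phase of a linearly fused waveform, before inverse STFT.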