Chenda Li (Shanghai Jiao Tong University) and Yanmin Qian (Shanghai Jiao Tong University)
Solving the cocktail party problem with multi-modal approaches has become popular in recent years. Given multi-talker mixed speech, humans can focus on the speech they are interested in by hearing the mixture, watching the speaker, and understanding the context of what the speaker is talking about. In this paper, we attempt to solve the speaker-independent speech separation problem with all three audio-visual-contextual modalities for the first time: hearing the speech, watching the speaker, and understanding the contextual language. Compared to previous methods using a pure audio modality or audio-visual modalities, a dedicated model is designed to extract contextual language information for all target speakers directly from the speech mixture. The extracted contextual knowledge is then incorporated into the multi-modal speech separation architecture with an appropriate attention mechanism. Experiments show a significant performance improvement with the newly proposed audio-visual-contextual speech separation.
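The abstract mentions incorporating the extracted contextual knowledge into the separation network via an attention mechanism. As a rough illustration only (the paper's exact fusion architecture, feature dimensions, and function names are not specified here), a generic scaled dot-product attention over contextual embeddings, followed by concatenation with the audio-visual features, could be sketched as:

```python
import numpy as np

def attention_fuse(av_features, context_embeddings):
    """Fuse contextual embeddings into audio-visual features via
    scaled dot-product attention. A hypothetical sketch, not the
    paper's exact architecture or parameterization.

    av_features:        (T, d) per-frame audio-visual features (queries)
    context_embeddings: (N, d) contextual language embeddings (keys/values)
    """
    d = av_features.shape[-1]
    # Attention scores: each frame attends over the contextual embeddings.
    scores = av_features @ context_embeddings.T / np.sqrt(d)    # (T, N)
    # Softmax over the context dimension (numerically stabilized).
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # One attended context vector per frame.
    attended = weights @ context_embeddings                     # (T, d)
    # Simple fusion: concatenate along the feature dimension.
    return np.concatenate([av_features, attended], axis=-1)    # (T, 2d)

# Toy example: 5 frames of 8-dim features, 3 contextual embeddings.
rng = np.random.default_rng(0)
fused = attention_fuse(rng.standard_normal((5, 8)),
                       rng.standard_normal((3, 8)))
print(fused.shape)
```

In practice such fusion would sit inside a learned separation network (e.g. with projection matrices for queries, keys, and values); the sketch above omits all trainable parameters to show only the attention-and-concatenate data flow.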