Chenda Li (Shanghai Jiao Tong University) and Yanmin Qian (Shanghai Jiao Tong University)
Solving the cocktail party problem with multi-modal approaches has become popular in recent years. Given multi-talker mixed speech, humans can focus on the speech they are interested in by hearing the mixture, watching the speaker, and understanding the context of what the speaker is talking about. In this paper, we attempt to solve the speaker-independent speech separation problem with all three audio-visual-contextual modalities for the first time: hearing the speech, watching the speaker, and understanding the contextual language. Compared to previous methods using a pure audio modality or audio-visual modalities, a dedicated model is designed to extract contextual language information for all target speakers directly from the speech mixture. The extracted contextual knowledge is then incorporated into the multi-modal speech separation architecture with an appropriate attention mechanism. Experiments show a significant performance improvement with the newly proposed audio-visual-contextual speech separation.
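The abstract mentions incorporating the extracted contextual knowledge into the separation network via an attention mechanism. As a rough illustration only (the paper's exact fusion architecture, feature dimensions, and function names are not specified here), a generic scaled dot-product attention over contextual embeddings, followed by concatenation with the audio-visual features, could be sketched as:

```python
import numpy as np

def attention_fuse(av_features, context_embeddings):
    """Fuse contextual embeddings into audio-visual features via
    scaled dot-product attention. A hypothetical sketch, not the
    paper's exact architecture or parameterization.

    av_features:        (T, d) per-frame audio-visual features (queries)
    context_embeddings: (N, d) contextual language embeddings (keys/values)
    """
    d = av_features.shape[-1]
    # Attention scores: each frame attends over the contextual embeddings.
    scores = av_features @ context_embeddings.T / np.sqrt(d)    # (T, N)
    # Softmax over the context dimension (numerically stabilized).
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # One attended context vector per frame.
    attended = weights @ context_embeddings                     # (T, d)
    # Simple fusion: concatenate along the feature dimension.
    return np.concatenate([av_features, attended], axis=-1)    # (T, 2d)

# Toy example: 5 frames of 8-dim features, 3 contextual embeddings.
rng = np.random.default_rng(0)
fused = attention_fuse(rng.standard_normal((5, 8)),
                       rng.standard_normal((3, 8)))
print(fused.shape)
```

In practice such fusion would sit inside a learned separation network (e.g. with projection matrices for queries, keys, and values); the sketch above omits all trainable parameters to show only the attention-and-concatenate data flow.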