Wed-3-12-9 Lip Graph Assisted Audio-Visual Speech Recognition Using Bidirectional Synchronous Fusion

Hong Liu (Shenzhen Graduate School, Peking University), Zhan Chen (Peking University) and Bing Yang (Shenzhen Graduate School, Peking University)

Abstract: Current studies have shown that extracting representative visual features and efficiently fusing the audio and visual modalities are vital for audio-visual speech recognition (AVSR), but both remain challenging. To this end, we propose a lip graph assisted AVSR method with bidirectional synchronous fusion. First, a hybrid visual stream combines an image branch and a graph branch to capture discriminative visual features. Specifically, the lip graph exploits the natural and dynamic connections between the lip key points to model the lip shape, and the temporal evolution of the lip graph is captured by graph convolutional networks followed by bidirectional gated recurrent units. Second, the hybrid visual stream is combined with the audio stream by an attention-based bidirectional synchronous fusion, which allows bidirectional information interaction to resolve the asynchrony between the two modalities during fusion. Experimental results on the LRW-BBC dataset show that our method outperforms the end-to-end AVSR baseline in both clean and noisy conditions.
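The two ideas named in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the ring-shaped lip graph, the feature sizes, the random weights, and the helper names (`normalized_adjacency`, `graph_conv`, `cross_attention`, `bidirectional_fusion`) are all assumptions made for illustration, and the bidirectional gated recurrent units are omitted. It shows one standard graph convolution over lip key points, followed by a generic cross-attention in which each stream attends to the other, so that streams with different frame counts can still exchange information.

```python
import numpy as np

def normalized_adjacency(edges, num_nodes):
    """Symmetrically normalized adjacency with self-loops: D^-1/2 (A + I) D^-1/2."""
    a = np.eye(num_nodes)
    for i, j in edges:
        a[i, j] = a[j, i] = 1.0
    d_inv_sqrt = 1.0 / np.sqrt(a.sum(axis=1))
    return a * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def graph_conv(a_hat, x, w):
    """One graph convolution with ReLU. x: (nodes, in_dim), w: (in_dim, out_dim)."""
    return np.maximum(a_hat @ x @ w, 0.0)

def cross_attention(query, context):
    """Scaled dot-product attention: every query frame attends over all context frames."""
    scores = query @ context.T / np.sqrt(query.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ context, weights

def bidirectional_fusion(audio, video):
    """Each stream is enriched with an attention summary of the other stream."""
    audio_from_video, _ = cross_attention(audio, video)
    video_from_audio, _ = cross_attention(video, audio)
    fused_audio = np.concatenate([audio, audio_from_video], axis=-1)
    fused_video = np.concatenate([video, video_from_audio], axis=-1)
    return fused_audio, fused_video

rng = np.random.default_rng(0)

# Hypothetical lip graph: 6 key points connected in a ring (an assumption).
num_points = 6
edges = [(i, (i + 1) % num_points) for i in range(num_points)]
a_hat = normalized_adjacency(edges, num_points)

# One frame of 2-D key point coordinates passed through one graph convolution.
coords = rng.normal(size=(num_points, 2))
w = rng.normal(size=(2, 8))
node_feats = graph_conv(a_hat, coords, w)

# Audio and video sequences with different frame counts (assumed sizes).
audio = rng.normal(size=(10, 8))
video = rng.normal(size=(5, 8))
fused_audio, fused_video = bidirectional_fusion(audio, video)
```

Because attention weights are computed between every pair of audio and video frames, neither stream has to be resampled to the other's frame rate before fusion, which is one common way to handle the kind of asynchrony the abstract mentions.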
