Wed-3-12-4 Audio-visual Multi-channel Recognition of Overlapped Speech

Jianwei Yu (The Chinese University of Hong Kong), Bo Wu (Tencent AI Lab), Rongzhi Gu (Peking University Shenzhen Graduate School), Shi-Xiong Zhang (Tencent AI Lab), Lianwu Chen (Tencent), Yong Xu (Tencent AI Lab), Meng Yu (Tencent AI Lab), Dan Su (Tencent AI Lab Shenzhen), Dong Yu (Tencent AI Lab), Xunying Liu (The Chinese University of Hong Kong) and Helen Meng (The Chinese University of Hong Kong)
Abstract: Automatic speech recognition (ASR) of overlapped speech remains a highly challenging task to date. To this end, multi-channel microphone array data are widely used in state-of-the-art ASR systems. Motivated by the invariance of the visual modality to acoustic signal corruption, this paper presents an audio-visual multi-channel overlapped speech recognition system featuring a tightly integrated separation front-end and recognition back-end. A series of audio-visual multi-channel speech separation networks based on TF masking, filter&sum and mask-based MVDR beamforming approaches were developed. To reduce the error cost mismatch between the separation and recognition components, they are jointly fine-tuned using the connectionist temporal classification (CTC) loss function, or a multi-task criterion interpolating it with a scale-invariant signal-to-noise ratio (Si-SNR) error cost. Experiments suggest that the proposed multi-channel AVSR system outperforms the baseline audio-only ASR system by up to 6.81% (26.83% relative) and 22.22% (56.87% relative) absolute word error rate (WER) reduction on overlapped speech constructed using either simulation or replay of the Lip Reading Sentences 2 (LRS2) dataset, respectively.
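
As context for the mask-based MVDR approach mentioned in the abstract, the following is a minimal sketch, not the authors' implementation, of how TF masks are commonly turned into beamforming weights via mask-weighted spatial covariance matrices (the Souden-style MVDR solution). The function name, array shapes and regularization constants are illustrative assumptions.

    import numpy as np

    def mask_based_mvdr(Y, speech_mask, noise_mask, ref_channel=0):
        """Mask-based MVDR beamforming (hypothetical sketch).

        Y:           complex STFT of the multi-channel mixture, shape (C, F, T)
        speech_mask: real-valued TF mask for the target speaker, shape (F, T)
        noise_mask:  real-valued TF mask for interference/noise, shape (F, T)
        Returns the beamformed single-channel STFT, shape (F, T).
        """
        C, F, T = Y.shape
        out = np.zeros((F, T), dtype=complex)
        for f in range(F):
            Yf = Y[:, f, :]  # (C, T) observations at frequency bin f
            # Mask-weighted spatial covariance matrices of speech and noise
            Phi_s = (speech_mask[f] * Yf) @ Yf.conj().T / max(speech_mask[f].sum(), 1e-8)
            Phi_n = (noise_mask[f] * Yf) @ Yf.conj().T / max(noise_mask[f].sum(), 1e-8)
            Phi_n += 1e-6 * np.eye(C)  # diagonal loading for numerical stability
            # MVDR weights: w = (Phi_n^{-1} Phi_s) u / trace(Phi_n^{-1} Phi_s),
            # where u selects the reference channel
            num = np.linalg.solve(Phi_n, Phi_s)
            w = num[:, ref_channel] / (np.trace(num) + 1e-8)
            out[f] = w.conj() @ Yf  # apply beamformer: w^H y per time frame
        return out
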
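Similarly, a minimal PyTorch sketch of the kind of joint fine-tuning criterion described above: CTC loss on the recognition back-end interpolated with a negative Si-SNR term on the separation front-end output. The interpolation weight alpha, function names and tensor shapes are assumptions rather than the paper's exact formulation; the Si-SNR term is negated because a higher Si-SNR is better while losses are minimized.

    import torch
    import torch.nn.functional as F

    def si_snr(est, ref, eps=1e-8):
        """Scale-invariant SNR in dB between estimated and reference waveforms (B, N)."""
        est = est - est.mean(dim=-1, keepdim=True)
        ref = ref - ref.mean(dim=-1, keepdim=True)
        # Project the estimate onto the reference (scale-invariant target component)
        proj = (est * ref).sum(-1, keepdim=True) * ref / (ref.pow(2).sum(-1, keepdim=True) + eps)
        noise = est - proj
        ratio = proj.pow(2).sum(-1) / (noise.pow(2).sum(-1) + eps)
        return 10 * torch.log10(ratio + eps)

    def multitask_loss(log_probs, targets, input_lens, target_lens,
                       est_wav, ref_wav, alpha=0.1):
        """Interpolate CTC and negative Si-SNR; alpha is an assumed trade-off weight.

        log_probs: (T, B, C) log-softmax outputs of the recognition back-end
        est_wav/ref_wav: (B, N) separated and reference waveforms
        """
        ctc = F.ctc_loss(log_probs, targets, input_lens, target_lens)
        return (1 - alpha) * ctc - alpha * si_snr(est_wav, ref_wav).mean()
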