Mon-2-5-8 Exploiting Cross-Domain Visual Feature Generation for Disordered Speech Recognition

Shansong Liu (The Chinese University of Hong Kong), Xurong Xie (The Chinese University of Hong Kong), Jianwei Yu (The Chinese University of Hong Kong), Shoukang Hu (The Chinese University of Hong Kong), Mengzhe Geng (The Chinese University of Hong Kong), Rongfeng Su (Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences), Shi-Xiong Zhang (Tencent AI Lab), Xunying Liu (The Chinese University of Hong Kong) and Helen Meng (The Chinese University of Hong Kong)
Abstract: Audio-visual speech recognition (AVSR) technologies have been successfully applied to a wide range of tasks. When developing AVSR systems for disordered speech, which is characterized by severe degradation of voice quality and a large mismatch against normal speech, it is difficult to record large amounts of high-quality audio-visual data. To address this issue, a cross-domain visual feature generation approach is proposed in this paper. An audio-visual inversion DNN system constructed using widely available out-of-domain audio-visual data was used to generate visual features for disordered speakers for whom video data are either very limited or unavailable. Experiments conducted on the UASpeech corpus suggest that the proposed cross-domain visual feature generation based AVSR system consistently outperformed both the baseline ASR system and an AVSR system using the original visual features. An overall word error rate reduction of 3.6% absolute (14% relative) was obtained over the previously published best system on the 8 UASpeech dysarthric speakers with audio-visual data available for the same task.
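The abstract does not specify the configuration of the audio-visual inversion DNN; the following is a minimal sketch, assuming a simple feed-forward regression network in PyTorch, of how an audio-to-visual inversion model of this kind could be set up. The feature dimensions (40-dim filterbank frames spliced over a 9-frame context, 30-dim visual targets), layer widths, and training hyperparameters are illustrative assumptions, not details taken from the paper.

```python
# Illustrative sketch (not the authors' implementation) of an audio-to-visual
# inversion DNN: a network trained on out-of-domain audio-visual pairs to
# regress visual features from acoustic frames, so that visual features can
# later be generated for speakers with little or no video data.
import torch
import torch.nn as nn

class AudioVisualInversionDNN(nn.Module):
    def __init__(self, audio_dim=40 * 9, visual_dim=30, hidden_dim=512):
        # audio_dim: assumed 40-dim filterbank frames with 9-frame splicing;
        # visual_dim: assumed low-dimensional lip-region visual features.
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(audio_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, visual_dim),  # regress visual features
        )

    def forward(self, audio_frames):
        return self.net(audio_frames)

# Train on parallel audio-visual data (random tensors stand in for real
# time-aligned out-of-domain recordings here) with an MSE objective.
model = AudioVisualInversionDNN()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.MSELoss()

audio = torch.randn(256, 40 * 9)   # batch of spliced acoustic frames
visual = torch.randn(256, 30)      # time-aligned visual feature targets

for step in range(10):
    optimizer.zero_grad()
    loss = criterion(model(audio), visual)
    loss.backward()
    optimizer.step()

# At test time the trained network generates visual features from a
# disordered speaker's audio alone, for use as the visual stream of an
# AVSR front end.
generated_visual = model(torch.randn(1, 40 * 9))
```

The key design point the abstract implies is that the inversion network is trained entirely on out-of-domain audio-visual data, then applied cross-domain to disordered speech; the sketch above mirrors that split between the training loop and the test-time generation step.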