Wed-3-12-3 Fusion Architectures for Word-based Audiovisual Speech Recognition

Michael Wand (The Swiss AI Lab IDSIA) and Juergen Schmidhuber (The Swiss AI Lab IDSIA)
Abstract: In this study we investigate architectures for modality fusion in audiovisual speech recognition, where one aims to alleviate the adverse effect of acoustic noise on the speech recognition accuracy by using video images of the speaker's face as an additional modality. Starting from an established neural network fusion system, we substantially improve the recognition accuracy by taking single-modality losses into account: late fusion (at the output logits level) is substantially more robust than the baseline, in particular for unseen acoustic noise, at the expense of having to determine the optimal weighting of the input streams. The latter requirement can be removed by making the fusion itself a trainable part of the network.
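
As a rough illustration of the late fusion described in the abstract, the sketch below combines per-stream output logits with a fixed stream weight and adds single-modality losses to the fused loss. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the convex weighting `audio_weight`, the unweighted sum of the three cross-entropy terms, and the toy logit values are all hypothetical.

```python
import numpy as np


def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))


def cross_entropy(logits, target):
    """Cross-entropy of a single logit vector against a word index."""
    return float(-log_softmax(logits)[target])


def late_fusion(audio_logits, video_logits, audio_weight):
    """Fuse at the output-logit level with a fixed stream weight.

    Assumption: a convex combination; audio_weight must be tuned by hand,
    which is the cost of late fusion that the abstract mentions.
    """
    if not 0.0 <= audio_weight <= 1.0:
        raise ValueError("audio_weight must lie in [0, 1]")
    return audio_weight * audio_logits + (1.0 - audio_weight) * video_logits


def total_loss(audio_logits, video_logits, target, audio_weight):
    """Fused loss plus single-modality losses (assumed unweighted sum)."""
    fused = late_fusion(audio_logits, video_logits, audio_weight)
    return (cross_entropy(fused, target)
            + cross_entropy(audio_logits, target)
            + cross_entropy(video_logits, target))


# Toy three-word vocabulary: the (noisy) audio stream favours word 0,
# the video stream favours word 1. Values are made up for illustration.
audio_logits = np.array([2.0, 0.5, 0.1])
video_logits = np.array([0.2, 3.0, 0.1])

pred_video_heavy = int(np.argmax(late_fusion(audio_logits, video_logits, 0.3)))
pred_audio_heavy = int(np.argmax(late_fusion(audio_logits, video_logits, 0.9)))
loss = total_loss(audio_logits, video_logits, target=1, audio_weight=0.3)
```

In this toy case the predicted word depends on the chosen stream weight, which is why a hand-set weight has to be tuned per noise condition. Making the fusion trainable, as the abstract proposes, would replace `audio_weight` with parameters learned jointly with the rest of the network.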