Triantafyllos Afouras (University of Oxford), Joon Son Chung (University of Oxford) and Andrew Zisserman (University of Oxford)
The goal of this work is to train models that can identify a spoken language solely by
interpreting the speaker's lip movements.
To this end, we collect a large-scale multilingual audio-visual speech dataset with language labels,
starting from TEDx talks downloaded from YouTube.
Our contributions are the following:
(i) We show that models can learn to discriminate among 14 different languages using only visual speech information;
(ii) we compare different sequence modelling and utterance-level aggregation designs in order to
determine the best architecture for this task;
(iii) we investigate which factors provide discriminative cues
and show that our model indeed solves the problem by finding temporal patterns
in mouth movements, rather than by exploiting spurious correlations.
We demonstrate this further by evaluating our models on challenging examples from bilingual speakers.
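The pipeline implied by contribution (ii) can be illustrated with a minimal sketch: a per-frame sequence model over visual features, followed by utterance-level aggregation into a single language prediction. This is not the paper's architecture; the dimensions, the 1-D temporal convolution, the mean-pooling aggregator, and all weights below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions (assumptions, not from the paper):
# 75 video frames, 512-D visual features per frame, 14 target languages.
T, D, NUM_LANGS = 75, 512, 14

# Random placeholder weights for a toy temporal conv + linear classifier.
W_conv = rng.normal(scale=0.02, size=(5, D, 256))   # kernel width 5
W_cls = rng.normal(scale=0.02, size=(256, NUM_LANGS))

def temporal_conv(x, w):
    """Valid 1-D convolution over the time axis, with ReLU."""
    k = w.shape[0]
    out = np.stack([
        np.tensordot(x[t:t + k], w, axes=([0, 1], [0, 1]))
        for t in range(x.shape[0] - k + 1)
    ])
    return np.maximum(out, 0.0)

def predict_language(features):
    """Map a (T, D) sequence of visual features to language probabilities."""
    h = temporal_conv(features, W_conv)   # (T - 4, 256): sequence modelling
    utterance = h.mean(axis=0)            # (256,): utterance-level aggregation
    logits = utterance @ W_cls            # (NUM_LANGS,)
    e = np.exp(logits - logits.max())
    return e / e.sum()                    # softmax over the 14 languages

probs = predict_language(rng.normal(size=(T, D)))
print(probs.shape)  # (14,)
```

Swapping `temporal_conv` for a recurrent or attention-based module, or `mean` for another pooling operator, corresponds to the design axes the paper compares.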