Mandar Gogate (Edinburgh Napier University), Kia Dashtipour (Edinburgh Napier University) and Amir Hussain (Edinburgh Napier University)
In this paper, we present VISION, a first-of-its-kind audio-visual (AV) corpus consisting of 2500 utterances produced by 209 speakers and recorded in real noisy environments, including social gatherings, streets, and cafeterias. While a number of speech enhancement frameworks that exploit AV cues have been proposed in the literature, there exists no visual speech corpus recorded in real environments with a sufficient variety of speakers to evaluate the generalisation capability of AV frameworks across a wide range of background visual and acoustic noises. The main purpose of the dataset is to foster research in the area of AV signal processing and to provide a corpus that can be used for reliable evaluation of AV speech enhancement systems in everyday settings. In addition, we present a baseline deep neural network (DNN) based spectral mask estimation model for speech enhancement. Comparative simulation results from subjective listening tests demonstrate a significant performance improvement of the baseline DNN over state-of-the-art speech enhancement approaches.
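To make the spectral-masking idea concrete, the following is a minimal NumPy sketch of mask-based enhancement. The synthetic spectrograms and the use of an ideal ratio mask (IRM) are illustrative assumptions only; in the paper's baseline, a DNN would estimate the mask from audio-visual features, and the mask would be applied to a real short-time Fourier transform of the noisy signal.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical magnitude spectrograms (freq_bins x frames); stand-ins
# for real STFT magnitudes of clean speech and background noise.
clean = np.abs(rng.normal(size=(257, 100)))
noise = np.abs(rng.normal(size=(257, 100)))
noisy = clean + noise  # simplified additive mixing in the magnitude domain

# Ideal ratio mask: per time-frequency bin, the fraction of energy
# attributable to clean speech (this is the quantity a DNN would predict).
irm = clean / (clean + noise + 1e-8)

# Enhancement step: multiply the noisy spectrogram by the mask.
enhanced = irm * noisy

# The masked spectrogram should be much closer to the clean target.
err_before = np.mean((noisy - clean) ** 2)
err_after = np.mean((enhanced - clean) ** 2)
assert err_after < err_before
```

In practice the enhanced magnitude would be combined with the noisy phase and inverted back to a waveform; this sketch only shows the mask-application step that the baseline DNN's output feeds into.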