Venkat Krishnamohan (Indian Institute of Science), Akshara Soman (Indian Institute of Science, Bangalore), Anshul Gupta (Mercedes Benz Research and Development), and Sriram Ganapathy (Indian Institute of Science, Bangalore, India, 560012)
Abstract:
Audio-visual correspondence learning is the task of acquiring the association between images and their corresponding audio. In this paper, we propose a novel experimental paradigm in which unfamiliar pseudo-images and spoken pseudowords are presented to both humans and machine systems. The task is to learn the association between each image-audio pair, which is later evaluated with a retrieval task. The machine system used in the study is pretrained on the ImageNet corpus along with the corresponding audio labels. This model is then adapted to the new image-audio pairs through transfer learning. Using the proposed paradigm, we perform a direct comparison of one-shot, two-shot and three-shot learning performance for humans and machine systems. The human behavioral experiment confirms that the majority of the correspondence learning happens in the first exposure to the audio-visual pair. The proposed machine model performs on par with humans in audio-visual correspondence learning. However, compared to the machine model, humans exhibit better generalization to new input samples after a single exposure.
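To make the retrieval evaluation concrete, the following is a minimal sketch of how a top-1 retrieval score could be computed. It is not the authors' implementation: it assumes that each image and each audio clip has already been mapped to a fixed-length embedding vector, that matching pairs share the same index, and that cosine similarity is the ranking criterion. The function names (`cosine`, `retrieve`, `retrieval_accuracy`) are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def retrieve(query, candidates):
    """Return the index of the candidate most similar to the query embedding."""
    return max(range(len(candidates)), key=lambda i: cosine(query, candidates[i]))


def retrieval_accuracy(audio_embs, image_embs):
    """Top-1 accuracy, assuming audio_embs[i] is the true match of image_embs[i]."""
    hits = sum(
        1 for i, query in enumerate(audio_embs) if retrieve(query, image_embs) == i
    )
    return hits / len(audio_embs)


if __name__ == "__main__":
    # Toy embeddings chosen by hand for illustration only; not data from the study.
    audio = [[1.0, 0.0], [0.0, 1.0]]
    images = [[0.9, 0.1], [0.2, 0.8]]
    print(retrieval_accuracy(audio, images))
```

Under these assumptions, one-shot, two-shot and three-shot conditions would differ only in how many exposures the embeddings were learned from; the scoring step itself stays the same.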