Masahiro Yasuda (NTT Corporation), Yasunori Ohishi (NTT Corporation), Yuma Koizumi (NTT Media Intelligence Laboratories) and Noboru Harada (NTT Corporation)
Recent advancements in representation learning enable crossmodal retrieval by modeling an audio-visual co-occurrence in a single aspect, such as physical or linguistic. Unfortunately, in real-world media data, co-occurrences in various aspects are complexly intertwined, making it difficult to distinguish a specific target co-occurrence from the many non-target co-occurrences; this causes crossmodal retrieval to fail. To overcome this problem, we propose a triplet-loss-based representation learning method that incorporates an awareness mechanism. We adopt weakly-supervised event detection, which constrains the representation learning so that our method can ``be aware'' of a specific target audio-visual co-occurrence and discriminate it from other non-target co-occurrences.
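For reference, the triplet margin loss underlying such representation learning can be sketched as follows. This is a generic NumPy sketch under standard definitions, not the paper's implementation; the function name and toy embeddings are illustrative:

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Standard triplet margin loss on embedding vectors.

    Pulls the anchor toward the positive example and pushes it away from
    the negative until their distance gap exceeds `margin`.
    """
    d_pos = np.linalg.norm(anchor - positive)  # anchor-positive distance
    d_neg = np.linalg.norm(anchor - negative)  # anchor-negative distance
    return max(0.0, d_pos - d_neg + margin)

# Toy embeddings: the positive lies much closer to the anchor than the negative.
a = np.array([1.0, 0.0])
p = np.array([0.9, 0.1])
n = np.array([-1.0, 0.0])
print(triplet_loss(a, p, n))  # 0.0: the margin is already satisfied
```

In an audio-visual setting, the anchor would be a video embedding and the positive/negative would be embeddings of co-occurring and non-co-occurring audio, so minimizing this loss aligns matching pairs across modalities.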
We evaluated the performance of our method by applying it to a sound effect retrieval task using recorded TV broadcast data, in which a sound effect appropriate for a given video input should be retrieved. We conducted objective and subjective evaluations, and the results indicate that the proposed method produces significantly better associations between sound effects and videos than baselines without an awareness mechanism.