Mon-3-11-9 Crossmodal Sound Retrieval based on Specific Target Co-occurrence Denoted with Weak Labels

Masahiro Yasuda (NTT Corporation), Yasunori Ohishi (NTT Corporation), Yuma Koizumi (NTT Media Intelligence Laboratories) and Noboru Harada (NTT Corporation)

Abstract: Recent advancements in representation learning enable crossmodal retrieval by modeling an audio-visual co-occurrence in a single aspect, such as a physical or linguistic one. Unfortunately, in real-world media data, co-occurrences in various aspects are intricately mixed, so it is difficult to distinguish a specific target co-occurrence from the many non-target co-occurrences, which causes crossmodal retrieval to fail. To overcome this problem, we propose a triplet-loss-based representation learning method that incorporates an awareness mechanism. We adopt weakly-supervised event detection, which provides a constraint in representation learning so that our method can "be aware" of a specific target audio-visual co-occurrence and discriminate it from other non-target co-occurrences. We evaluated the performance of our method by applying it to a sound effect retrieval task using recorded TV broadcast data, in which a sound effect appropriate for a given video input should be retrieved. Objective and subjective evaluations indicate that the proposed method produces significantly better associations of sound and visual effects than baselines with no awareness mechanism.
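
For readers unfamiliar with the triplet loss mentioned in the abstract, the following is a minimal sketch of the generic triplet margin loss only. It is not the authors' method: the awareness mechanism and the weakly-supervised event detection constraint are not modeled here. The function name `triplet_loss`, the margin value, and the toy embeddings are all assumptions made for illustration.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Generic triplet margin loss over batches of embeddings, shape (batch, dim).

    Pulls each anchor toward its positive and pushes it away from its
    negative until the two distances differ by at least `margin`.
    """
    d_pos = np.linalg.norm(anchor - positive, axis=1)  # anchor-to-positive distance
    d_neg = np.linalg.norm(anchor - negative, axis=1)  # anchor-to-negative distance
    return np.maximum(d_pos - d_neg + margin, 0.0).mean()

# Hypothetical toy embeddings: video anchors, co-occurring audio, unrelated audio.
video = np.array([[1.0, 0.0], [0.0, 1.0]])
matching_audio = np.array([[1.0, 0.0], [0.0, 1.0]])
other_audio = np.array([[-1.0, 0.0], [0.0, -1.0]])

loss = triplet_loss(video, matching_audio, other_audio)
```

In a crossmodal setting such as the one described, the anchor would presumably be a video embedding and the positive and negative would be audio embeddings (or the reverse). How the triplets are selected so that only the target co-occurrence counts as a positive is the part this sketch leaves out.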

