Yasunori Ohishi (NTT Corporation), Akisato Kimura (NTT Corporation), Takahito Kawanishi (NTT Corporation), Kunio Kashino (NTT Corporation), David Harwath (Massachusetts Institute of Technology) and James Glass (Massachusetts Institute of Technology)
We propose a data expansion method for learning a multilingual semantic embedding model using disjoint datasets containing images and their multilingual audio captions.
Here, disjoint means that no images are shared among the datasets of different languages, in contrast
to existing work on multilingual semantic embedding based on visually grounded speech audio, which assumes that each image is associated with spoken captions in multiple languages.
Although learning on disjoint datasets is more challenging, we consider it crucial in practical situations.
Our main idea is to refer to other paired data when computing the loss for an anchor image.
We call this scheme ``pair expansion''. The motivation behind this idea is to exploit even disjoint pairs by finding similarities, or commonalities, that may exist between different images.
Specifically, we examine two approaches for calculating similarities: one using image embedding vectors and the other using object recognition results.
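As a rough illustration of the embedding-based approach, the sketch below pairs each image in one language's dataset with its most similar image in the other language's dataset via cosine similarity of image embedding vectors. The function name, array shapes, and nearest-neighbor matching criterion are our illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

def expand_pairs(embeddings_a, embeddings_b):
    """For each image in dataset A, find the most similar image in
    dataset B by cosine similarity of their embedding vectors.
    Returns indices: result[i] = j pairs A's image i with B's image j."""
    # L2-normalize rows so the dot product equals cosine similarity
    a = embeddings_a / np.linalg.norm(embeddings_a, axis=1, keepdims=True)
    b = embeddings_b / np.linalg.norm(embeddings_b, axis=1, keepdims=True)
    sim = a @ b.T  # (N_a, N_b) similarity matrix
    return sim.argmax(axis=1)

# Toy example: 3 images with English captions, 4 with Japanese captions
rng = np.random.default_rng(0)
emb_en = rng.standard_normal((3, 8))
emb_ja = rng.standard_normal((4, 8))
matches = expand_pairs(emb_en, emb_ja)
```

The expanded pairs (an English-caption image matched with a Japanese-caption image) can then contribute additional terms to the loss for an anchor image, even though the two datasets share no images.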
Our experiments show that expanded pairs improve cross-modal and cross-lingual retrieval accuracy compared with non-expanded cases.
They also show that similarities measured by the image embedding vectors yield better accuracy than those based on object recognition results.