Wed-3-12-8 Speech-Image Semantic Alignment Does Not Depend on Any Prior Classification Tasks

Masood Mortazavi (Futurewei)
Abstract: Semantically aligned (speech, image) datasets can be used to explore "visually grounded speech". In most existing investigations, features of the image signal are extracted with neural networks "pre-trained" on other tasks (e.g., classification on ImageNet); in still others, pre-trained networks are used to extract audio features prior to semantic embedding. Without such "transfer learning" through pre-trained initialization or pre-trained feature extraction, previous results have tended to show low recall rates in speech-to-image and image-to-speech queries. By choosing appropriate neural architectures for the encoders in the speech and image branches and using large datasets, one can obtain competitive recall rates without any reliance on pre-trained initialization or feature extraction: (speech, image) semantic alignment and speech-to-image and image-to-speech retrieval are canonical tasks worthy of investigation in their own right, and they allow one to explore further questions, for example, that the size of the audio embedder can be reduced significantly with little loss in recall rates for speech-to-image and image-to-speech queries.
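The recall metric the abstract refers to can be made concrete. A minimal sketch of recall@K for speech-to-image retrieval is shown below (image-to-speech is the same computation with the arguments swapped). The embedding shapes and the use of cosine similarity are assumptions for illustration; the abstract does not specify the paper's scoring function.

```python
import numpy as np

def recall_at_k(speech_emb, image_emb, k=1):
    """Recall@K for speech-to-image retrieval.

    speech_emb, image_emb: (N, D) arrays where row j of each array
    comes from the j-th (speech, image) pair in the evaluation set.
    A query "hits" if its paired item ranks in the top-k by
    cosine similarity (an assumed scoring function).
    """
    # L2-normalize so the dot product is cosine similarity.
    s = speech_emb / np.linalg.norm(speech_emb, axis=1, keepdims=True)
    i = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    sim = s @ i.T  # (N, N); row j scores speech j against all images

    # For each speech query, take the k highest-scoring images.
    topk = np.argsort(-sim, axis=1)[:, :k]
    # Hit if the ground-truth index j appears among the top-k.
    hits = np.any(topk == np.arange(len(sim))[:, None], axis=1)
    return hits.mean()
```

With perfectly aligned embeddings (each speech vector identical to its paired image vector and distinct from the rest), recall@1 is 1.0; random embeddings give roughly k/N.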