Mon-1-11-4 What does an End-to-End Dialect Identification Model Learn about Non-dialectal Information?

Shammur Absar Chowdhury (University of Trento), Ahmed Ali (Qatar Computing Research Institute), Suwon Shon (Massachusetts Institute of Technology) and James Glass (Massachusetts Institute of Technology)

Abstract: An end-to-end dialect identification system generates the likelihood of each dialect, given a speech utterance. Its performance relies on its ability to discriminate between the acoustic properties of different dialects, even though the input signal contains non-dialectal information such as speaker and channel. In this work, we study how non-dialectal information is encoded inside the end-to-end dialect identification model. We design several proxy tasks to understand the model's ability to represent speech input for differentiating non-dialectal information -- such as (a) gender and voice identity of speakers, (b) languages, (c) channel (recording and transmission) quality -- and compare with dialectal information (i.e., predicting the geographic region of the dialects). By analyzing non-dialectal representations from layers of an end-to-end Arabic dialect identification (ADI) model, we observe that the model retains gender and channel information throughout the network while learning a speaker-invariant representation. Our findings also suggest that the CNN layers of the end-to-end model mirror feature extractors capturing voice-specific information, while the fully-connected layers encode more dialectal information.
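
The idea of a proxy task on layer representations can be illustrated with a small probing sketch. The abstract does not say which classifier or features the authors use, so everything here is an assumption for illustration: the activations are synthetic, the layer names are hypothetical, and a nearest-centroid classifier stands in for whatever probe the paper actually trains. The point is only that a property retained by a layer can be predicted from its activations, while a property the layer has discarded yields near-chance accuracy.

```python
import random


def nearest_centroid_probe(train_x, train_y, test_x, test_y):
    """Fit one centroid per label on the training split; return test accuracy."""
    dims = len(train_x[0])
    sums, counts = {}, {}
    for x, y in zip(train_x, train_y):
        acc = sums.setdefault(y, [0.0] * dims)
        for i, v in enumerate(x):
            acc[i] += v
        counts[y] = counts.get(y, 0) + 1
    centroids = {y: [s / counts[y] for s in sums[y]] for y in sums}

    def predict(x):
        # Pick the label whose centroid is closest in squared Euclidean distance.
        return min(
            centroids,
            key=lambda y: sum((a - b) ** 2 for a, b in zip(x, centroids[y])),
        )

    correct = sum(predict(x) == y for x, y in zip(test_x, test_y))
    return correct / len(test_y)


def make_layer(rng, labels, shift, dims=4):
    """Synthetic activations: label 1 is offset by `shift` in every dimension."""
    return [[rng.gauss(shift * y, 1.0) for _ in range(dims)] for y in labels]


rng = random.Random(0)
# Hypothetical binary property (for example, speaker gender) for 400 utterances.
labels = [rng.randint(0, 1) for _ in range(400)]

# Two hypothetical layers: one keeps the property, one has discarded it.
layers = {
    "layer_keeps_property": make_layer(rng, labels, shift=3.0),
    "layer_discards_property": make_layer(rng, labels, shift=0.0),
}

probe_accuracy = {}
for name, feats in layers.items():
    # First half of the utterances trains the probe, second half evaluates it.
    probe_accuracy[name] = nearest_centroid_probe(
        feats[:200], labels[:200], feats[200:], labels[200:]
    )

for name, acc in probe_accuracy.items():
    print(f"{name}: probe accuracy = {acc:.2f}")
```

Comparing probe accuracy layer by layer is what lets an analysis of this kind say that some information persists through the network while other information is suppressed. With real data, the synthetic activations would be replaced by representations extracted from each layer of the trained model.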

