Victoria Mingote (University of Zaragoza), Antonio Miguel (ViVoLAB, Aragon Institute for Engineering Research (I3A), University of Zaragoza, Spain), Alfonso Ortega (University of Zaragoza) and Eduardo Lleida Solano (University of Zaragoza)
Abstract:
In this paper, we present a new approach to the enrollment process in a deep neural network (DNN) system, in which the speaker model is learned through an optimization process. Most Speaker Verification (SV) systems extract representations, called embeddings, for both the enrollment and test utterances, and then apply a similarity metric or a complex back-end to carry out the verification. Unlike previous works, we propose to take advantage of the knowledge the DNN has acquired to model the speakers of the training set, since the last layer of the DNN can be seen as an embedding dictionary that represents the training speakers. Thus, after the initial training phase, we introduce a new learnable vector for each enrollment speaker. Furthermore, to guide this training process, we employ a loss function better suited to verification, the approximated Detection Cost Function (aDCF) loss. The new strategy for producing enrollment models for each target speaker was tested on the RSR-Part II database for text-dependent speaker verification, where the proposed approach outperforms the reference system, which directly averages the embeddings extracted from the enrollment data with the network and applies cosine similarity.
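To make the contrast between the two enrollment strategies concrete, the following is a minimal NumPy sketch, not the implementation evaluated in the paper. It assumes the embeddings have already been extracted by a frozen network, that scores are cosine similarities, and that the aDCF can be written as a weighted sum of sigmoid-smoothed miss and false-alarm rates. The function names, the threshold `thr`, the sigmoid slope `alpha`, the weights `gamma` and `beta`, the learning rate, and the use of plain gradient descent are all illustrative assumptions. The non-target embeddings stand in for whatever negative examples (for instance, training speakers) are available at enrollment time.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def cosine_scores(w, X):
    """Cosine similarity between a speaker vector w and each row of X."""
    w_hat = w / np.linalg.norm(w)
    X_hat = X / np.linalg.norm(X, axis=1, keepdims=True)
    return X_hat @ w_hat


def average_enrollment(X_tar):
    """Baseline: length-normalized average of the enrollment embeddings."""
    mean = X_tar.mean(axis=0)
    return mean / np.linalg.norm(mean)


def adcf_loss_and_grad(w, X_tar, X_non, thr=0.5, alpha=10.0,
                       gamma=0.5, beta=0.5):
    """Assumed aDCF form: gamma * P_fa + beta * P_miss, with both error
    rates smoothed by a sigmoid of slope alpha around threshold thr.
    Returns the loss and its gradient with respect to w."""
    w_norm = np.linalg.norm(w)
    w_hat = w / w_norm

    def soft_rate(X, sign):
        X_hat = X / np.linalg.norm(X, axis=1, keepdims=True)
        s = X_hat @ w_hat                       # cosine scores
        p = sigmoid(sign * alpha * (s - thr))   # soft error indicator
        ds_dw = (X_hat - s[:, None] * w_hat) / w_norm  # d(cosine)/dw
        dp_ds = sign * alpha * p * (1.0 - p)
        return p.mean(), (dp_ds[:, None] * ds_dw).mean(axis=0)

    p_miss, g_miss = soft_rate(X_tar, -1.0)  # target scored below thr
    p_fa, g_fa = soft_rate(X_non, +1.0)      # non-target scored above thr
    return gamma * p_fa + beta * p_miss, gamma * g_fa + beta * g_miss


def optimize_enrollment(X_tar, X_non, steps=300, lr=0.05, **loss_kwargs):
    """Learn one vector per enrollment speaker by gradient descent on the
    smoothed aDCF, starting from the averaging baseline."""
    w = average_enrollment(X_tar)
    for _ in range(steps):
        _, grad = adcf_loss_and_grad(w, X_tar, X_non, **loss_kwargs)
        w = w - lr * grad
    return w
```

Under these assumptions, `average_enrollment` corresponds to the reference system and `optimize_enrollment` to the learnable-vector idea: both produce a single vector per speaker that is scored against test embeddings with `cosine_scores`, but only the second is adjusted to lower a verification-oriented cost.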