Yuheng Wei(School of Computer Science and Technology, Xidian University), Junzhao Du(School of Computer Science and Technology, Xidian University) and Hui Liu(School of Computer Science and Technology, Xidian University)
Speaker recognition for unseen speakers out of the training dataset relies on the discrimination of speaker embedding. Recent studies use the angular softmax losses with angular margin penalties to enhance the intra-class compactness of speaker embedding, which achieve obvious performance improvement. However, the classification layer encounters the problem of dimension explosion in these losses with the growth of training speakers. In this paper, like the prototype network loss in the few-short learning and the generalized end-to-end loss, we optimize the cosine distances between speaker embeddings and their corresponding centroids rather than the weight vectors in the classification layer. For the intra-class compactness, we impose the additive angular margin to shorten the cosine distance between speaker embeddings belonging to the same speaker. Meanwhile, we also explicitly improve the inter-class separability by enlarging the cosine distance between different speaker centroids. Experiments show that our loss achieves comparable performance with the stat-of-the-art angular margin softmax loss in both verification and identification tasks and markedly reduces the training iterations.