Wed-2-12-7 An Effective Speaker Recognition Method Based on Joint Identification and Verification Supervisions

Ying Liu (National Engineering Laboratory for Speech and Language Information Processing, University of Science and Technology of China, Hefei), Yan Song (National Engineering Laboratory for Speech and Language Information Processing, University of Science and Technology of China, Hefei), Yiheng Jiang (National Engineering Laboratory for Speech and Language Information Processing, University of Science and Technology of China, Hefei), Ian McLoughlin (National Engineering Laboratory for Speech and Language Information Processing, University of Science and Technology of China, Hefei), Lin Liu (iFLYTEK Research, iFLYTEK CO., LTD., Hefei, Anhui 230088) and Lirong Dai (National Engineering Laboratory for Speech and Language Information Processing, University of Science and Technology of China, Hefei)

Abstract: Speaker verification methods based on deep embedding learning have attracted significant recent research interest due to their superior performance. Existing methods mainly focus on designing frame-level feature extraction structures, utterance-level aggregation methods and loss functions to learn discriminative speaker embeddings. The scores of verification trials are then computed using cosine distance or a Probabilistic Linear Discriminant Analysis (PLDA) classifier. This paper proposes an effective speaker recognition method based on joint identification and verification supervision, inspired by multi-task learning frameworks. Specifically, a deep architecture with a convolutional feature extractor, attentive pooling and two classifier branches is presented. The first, an identification branch, is trained with additive margin softmax (AM-Softmax) loss to classify speaker identities. The second, a verification branch, trains a discriminator with binary cross entropy (BCE) loss to optimize a new triplet-based mutual information. To balance the two losses across training stages, a ramp-up/ramp-down weighting scheme is employed. Furthermore, an attentive bilinear pooling method is proposed to improve the effectiveness of the embeddings. Extensive experiments on VoxCeleb1 show that the proposed method achieves a 22% relative reduction in equal error rate (EER) compared to a baseline system using identification supervision only.
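
As a rough illustration of how the two supervisions described in the abstract might be combined, the sketch below computes an AM-Softmax loss and a BCE loss for a single sample and mixes them with a time-dependent weight. It is not the authors' implementation. The scale and margin values, the Gaussian-shaped ramp, the choice to apply the weight to the verification term, and all function names (`am_softmax_loss`, `bce_loss`, `ramp_weight`, `joint_loss`) are assumptions made for this example; the abstract does not specify them.

```python
import math


def am_softmax_loss(cosines, target, scale=30.0, margin=0.35):
    """AM-Softmax loss for one sample.

    cosines: cosine similarity between the embedding and each speaker's
             class weight vector.
    target:  index of the true speaker.
    scale, margin: assumed hyperparameter values, not taken from the paper.
    """
    # Subtract the additive margin from the target-class cosine only.
    logits = [scale * (c - margin) if j == target else scale * c
              for j, c in enumerate(cosines)]
    # Log-sum-exp with max subtraction for numerical stability.
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_norm - logits[target]


def bce_loss(prob, label):
    """Binary cross entropy for one discriminator output in (0, 1)."""
    eps = 1e-12
    prob = min(max(prob, eps), 1.0 - eps)
    return -(label * math.log(prob) + (1 - label) * math.log(1.0 - prob))


def ramp_weight(epoch, total_epochs, ramp_len):
    """Hypothetical ramp-up/ramp-down weight in (0, 1].

    Uses the Gaussian-shaped ramp exp(-5 * (1 - t)^2); the schedule used
    in the paper may differ.
    """
    if epoch < ramp_len:
        t = epoch / ramp_len                   # ramp-up phase
    elif epoch > total_epochs - ramp_len:
        t = (total_epochs - epoch) / ramp_len  # ramp-down phase
    else:
        return 1.0                             # plateau
    return math.exp(-5.0 * (1.0 - t) ** 2)


def joint_loss(cosines, target, disc_prob, disc_label,
               epoch, total_epochs=100, ramp_len=20):
    """Identification loss plus a weighted verification loss (assumed form)."""
    w = ramp_weight(epoch, total_epochs, ramp_len)
    return am_softmax_loss(cosines, target) + w * bce_loss(disc_prob, disc_label)
```

For example, `joint_loss([0.8, 0.1, -0.2], 0, 0.9, 1, epoch=50)` evaluates both terms at full weight, while the same call with `epoch=0` nearly suppresses the verification term.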

