Mon-2-10-7 ARET: Aggregated Residual Extended Time-delay Neural Networks for Speaker Verification

Ruiteng Zhang(Tianjin University), Jianguo Wei(Tianjin University), Wenhuan Lu(Tianjin University), Longbiao Wang(Tianjin University), Meng Liu(Tianjin University), Lin Zhang(Tianjin University), Jiayu Jin(Tianjin University) and Junhai Xu(Tianjin Key Laboratory of Cognitive Computing and Application, College of Intelligence and Computing, Tianjin University)
Abstract: The time-delay neural network (TDNN) is widely used in speaker verification to extract long-term temporal features of speakers. Although common TDNN approaches capture time-sequential information well, they lack the delicate transformations needed for deep representation. To solve this problem, we propose two TDNN architectures: RET integrates short-cut connections into conventional time-delay blocks, and ARET adopts a split-transform-merge strategy to extract more discriminative representations. Experiments on the VoxCeleb datasets without augmentation indicate that ARET realizes satisfactory performance on the VoxCeleb1 test set, VoxCeleb1-E, and VoxCeleb1-H, with 1.389%, 1.520%, and 2.614% equal error rate (EER), respectively. Compared to state-of-the-art results on these test sets, RET achieves a 23% ∼ 43% relative reduction in EER, and ARET achieves a 32% ∼ 45% relative reduction.
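The abstract does not spell out the block internals, so the following is a minimal PyTorch sketch of what a residual time-delay (RET) block and its aggregated, split-transform-merge (ARET) counterpart could look like. The dilated 1-D convolutions stand in for TDNN layers, and all channel sizes, context widths, and the cardinality are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn


class RETBlock(nn.Module):
    """Residual time-delay block: a dilated 1-D convolution (TDNN layer)
    wrapped with a short-cut connection, ResNet-style.
    Channel sizes and context (kernel/dilation) are illustrative."""

    def __init__(self, channels, context=3, dilation=1):
        super().__init__()
        self.tdnn = nn.Sequential(
            nn.Conv1d(channels, channels, kernel_size=context,
                      dilation=dilation,
                      padding=dilation * (context - 1) // 2),
            nn.BatchNorm1d(channels),
            nn.ReLU(inplace=True),
            nn.Conv1d(channels, channels, kernel_size=1),
            nn.BatchNorm1d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        # Short-cut connection: add the block input to its transformed output.
        return self.relu(x + self.tdnn(x))


class ARETBlock(nn.Module):
    """Aggregated residual time-delay block: the split-transform-merge
    strategy is approximated with a grouped convolution (ResNeXt-style
    cardinality), again wrapped with a short-cut connection."""

    def __init__(self, channels, context=3, dilation=1, cardinality=8, width=4):
        super().__init__()
        inner = cardinality * width
        self.branches = nn.Sequential(
            nn.Conv1d(channels, inner, kernel_size=1),        # split
            nn.BatchNorm1d(inner),
            nn.ReLU(inplace=True),
            nn.Conv1d(inner, inner, kernel_size=context,      # transform (grouped)
                      dilation=dilation, groups=cardinality,
                      padding=dilation * (context - 1) // 2),
            nn.BatchNorm1d(inner),
            nn.ReLU(inplace=True),
            nn.Conv1d(inner, channels, kernel_size=1),        # merge
            nn.BatchNorm1d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(x + self.branches(x))


if __name__ == "__main__":
    # Dummy frame-level features: (batch, channels, frames).
    feats = torch.randn(2, 512, 200)
    out = ARETBlock(512)(RETBlock(512)(feats))
    print(out.shape)  # torch.Size([2, 512, 200])
```

In this sketch the frame length is preserved by symmetric padding, so blocks can be stacked with increasing dilation to widen the temporal context before statistics pooling, which is the usual role of such blocks in a speaker-embedding front end.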