Ruirui Li (Amazon), Jyun-Yu Jiang (University of California, Los Angeles), Xian Wu (University of Notre Dame), Chu-Cheng Hsieh (Amazon), and Andreas Stolcke (Amazon)
Speaker identification based on voice input is a fundamental capability in speech processing, enabling versatile downstream applications such as personalization and authentication. With the advent of deep learning, most state-of-the-art methods apply machine learning techniques and derive acoustic embeddings from utterances with convolutional neural networks (CNNs) and recurrent neural networks (RNNs). This paper addresses two inherent limitations of current approaches. First, voice characteristics over long time spans might not be fully captured by CNNs and RNNs, as they are designed to focus on local feature extraction and adjacent-dependency modeling, respectively. Second, complicated deep learning models can be fragile with regard to subtle but intentional changes in model inputs, also known as adversarial perturbations. To distill informative global acoustic embedding representations from utterances and remain robust to adversarial perturbations, we propose a Self-Attentive Adversarial Speaker Identification method (SAASI). In experiments on the VCTK dataset, SAASI significantly outperforms four state-of-the-art baselines in recognizing both known and new speakers.
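The abstract does not specify SAASI's architecture, but the core idea it names, using self-attention to aggregate an entire utterance into one global embedding rather than relying on a CNN's local window or an RNN's step-by-step recurrence, can be illustrated with a minimal single-head attentive-pooling sketch in NumPy. All names and shapes below are illustrative assumptions, not the paper's actual model.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over a 1-D score vector.
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

def self_attentive_pool(frames, w):
    """Collapse T frame-level features (shape T x d) into a single
    utterance-level embedding (shape d).

    Each frame gets a scalar relevance score via the learnable vector w;
    softmax turns the scores into attention weights, and the embedding is
    the weighted average of all frames. Because every frame contributes
    directly, dependencies spanning the whole utterance are captured in
    one step, unlike a local convolution or a sequential recurrence.
    """
    scores = frames @ w          # (T,) one relevance score per frame
    alpha = softmax(scores)      # (T,) attention weights, sum to 1
    return alpha @ frames, alpha # (d,) weighted average over frames

# Toy usage: 50 acoustic frames, 16-dimensional features.
rng = np.random.default_rng(0)
frames = rng.standard_normal((50, 16))
w = rng.standard_normal(16)      # stand-in for a learned attention vector
embedding, alpha = self_attentive_pool(frames, w)
```

In a trained model, `w` (or a small MLP in its place) would be learned jointly with the frame encoder, so the attention weights highlight the frames most indicative of speaker identity.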