Seyyed Saeed Sarfjoo (Idiap Research Institute), Srikanth Madikeri (Idiap Research Institute), Petr Motlicek (Idiap Research Institute) and Sebastien Marcel
To adapt a speaker verification (SV) system to a target domain with limited data, this paper investigates transfer learning from a model pre-trained on source-domain data. To that end, we study layer-by-layer adaptation starting from either the initial or the final layers of the pre-trained model. We show that the model adapted from the initial layers outperforms the model adapted from the final layers. Based on this evidence, and inspired by work in the image recognition field, we hypothesize that low-level convolutional neural network (CNN) layers characterize domain-specific components, while high-level CNN layers are domain-independent and have more discriminative power. To adapt these domain-specific components, angular margin softmax (AMSoftmax) is applied to the CNN-based implementation of the x-vector architecture. In addition, to reduce over-fitting on the limited target data, transfer learning on the batch norm layers is investigated. Mean shift and covariance estimation of batch norm allow the represented components of the target domain to be mapped to the source domain. Using TDNN and E-TDNN versions of the x-vectors as baseline models, the adapted models outperformed the baselines on the development set of NIST SRE 2018, with relative improvements of 11.0% and 13.8%, respectively.
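
The idea that batch norm statistics can map target-domain features onto the source-domain representation can be illustrated with a minimal sketch. It is not the procedure used in this work: it assumes a simple per-channel batch norm, keeps the scale and shift parameters fixed, and only re-estimates the mean and variance on target data. The function names (`batch_norm`, `estimate_stats`) and the toy data are hypothetical and chosen only for illustration.

```python
import math


def batch_norm(x, mean, var, gamma, beta, eps=1e-5):
    """Normalize one feature vector per channel, then apply scale and shift."""
    return [g * (xi - m) / math.sqrt(v + eps) + b
            for xi, m, v, g, b in zip(x, mean, var, gamma, beta)]


def estimate_stats(batch):
    """Per-channel mean and biased variance over a list of feature vectors."""
    n = len(batch)
    dims = len(batch[0])
    mean = [sum(row[d] for row in batch) / n for d in range(dims)]
    var = [sum((row[d] - mean[d]) ** 2 for row in batch) / n
           for d in range(dims)]
    return mean, var


# Toy data (assumption): target features are a shifted and rescaled
# copy of the source features, standing in for a domain mismatch.
source = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]]
target = [[3.0 * a + 10.0, 0.5 * b - 2.0] for a, b in source]

# Scale and shift are kept fixed here (assumption, for simplicity).
gamma, beta = [1.0, 1.0], [0.0, 0.0]

src_mean, src_var = estimate_stats(source)
tgt_mean, tgt_var = estimate_stats(target)

# Source features normalized with source statistics (reference behaviour).
src_out = [batch_norm(x, src_mean, src_var, gamma, beta) for x in source]
# Target features normalized with source statistics (domain mismatch).
mismatched = [batch_norm(x, src_mean, src_var, gamma, beta) for x in target]
# Target features normalized with re-estimated target statistics.
adapted = [batch_norm(x, tgt_mean, tgt_var, gamma, beta) for x in target]
```

In this toy setting the mismatch is a per-channel affine transform, so re-estimating the statistics brings the normalized target features back onto the normalized source features. Real domain shift is generally not a pure affine transform, which is one reason further adaptation of the layers may still be needed.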