Genshun Wan (University of Science and Technology of China), Jia Pan (University of Science and Technology of China), Qingran Wang (iFlytek Research, iFlytek Co., Ltd.), Jianqing Gao (iFlytek Research, iFlytek Co., Ltd.) and Zhongfu Ye (University of Science and Technology of China)
In our previous work, we introduced a speaker adaptive training method based on a frame-level attention mechanism for speech recognition, which proved to be an effective approach to speaker adaptive training. In this paper, we present an improved method that introduces an attention-over-attention mechanism. This attention module further measures the contribution of each frame to the speaker embedding within an utterance, and then generates an utterance-level speaker embedding to perform speaker adaptive training. Compared with frame-level embeddings, the generated utterance-level speaker embeddings are more representative and stable. Experiments on both the Switchboard and AISHELL-2 tasks show that our method achieves a relative word error rate reduction of approximately 8.0% over the speaker-independent model, and of over 6.0% over the traditional utterance-level d-vector-based speaker adaptive training method.
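The core idea of pooling frame-level speaker embeddings into a single utterance-level embedding via attention can be sketched as follows. This is a minimal illustration, not the paper's exact model: the score function, parameter `w`, and dimensions are hypothetical stand-ins for the learned attention module described in the abstract.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attentive_pooling(frame_embeddings, w):
    """Pool frame-level speaker embeddings of shape (T, D) into one
    utterance-level embedding of shape (D,) via attention weights."""
    scores = frame_embeddings @ w    # (T,) one relevance score per frame
    alpha = softmax(scores)          # attention weight per frame, sums to 1
    return alpha @ frame_embeddings  # weighted sum over frames -> (D,)

rng = np.random.default_rng(0)
T, D = 50, 64                        # frames per utterance, embedding dim
frames = rng.normal(size=(T, D))     # hypothetical frame-level embeddings
w = rng.normal(size=D)               # hypothetical attention parameter
utt_embedding = attentive_pooling(frames, w)
print(utt_embedding.shape)           # (64,)
```

In the paper's attention-over-attention variant, a second attention stage re-weights the first-stage frame contributions before pooling; the single-stage version above shows only the basic pooling step, where frames deemed more speaker-discriminative receive larger weights in the utterance-level embedding.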