Xingchen Song (Tsinghua University), Guangsen Wang (Salesforce Research Asia), Yiheng Huang (Tencent AI Lab), Zhiyong Wu (Tsinghua University), Dan Su (Tencent AI Lab, Shenzhen) and Helen Meng (The Chinese University of Hong Kong)
Self-attention networks (SANs) can benefit significantly from bi-directional representation learning through unsupervised pretraining paradigms such as BERT and XLNet. In this paper, we present an XLNet-like pretraining scheme, "Speech-XLNet", for learning speech representations with SANs. First, we find that by shuffling the speech frame order, Speech-XLNet acts as a strong regularizer that encourages the SAN to make inferences by focusing on global structure through its attention weights. Second, Speech-XLNet allows the model to explore bi-directional context information while maintaining the autoregressive training manner. Visualization shows that our approach generalizes better, with flatter and more widely distributed optima than the conventional approach. Experimental results on TIMIT demonstrate that Speech-XLNet greatly improves the hybrid SAN/HMM system in both convergence speed and recognition accuracy; our best system achieves a relative improvement of 15.2% on the TIMIT task. We also apply the pretrained model to an end-to-end SAN on the WSJ dataset, where the WER is reduced by up to 68% when only a few hours of transcribed data are used.
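The permuted-order autoregressive idea can be illustrated with an attention mask: each frame attends only to the frames that precede it in a randomly sampled factorization order, so the model stays autoregressive while seeing context from both directions of the original sequence. The following NumPy sketch is our own minimal illustration (the function name and details are assumptions, not the paper's implementation).

```python
import numpy as np

def permutation_attention_mask(T, rng=None):
    """Sketch of an XLNet-style permuted autoregressive mask over T
    speech frames: query frame q may attend to key frame k iff k comes
    earlier than q in the sampled factorization order.

    Returns (perm, mask) with mask[q, k] == 1 when attention is allowed.
    Hypothetical helper, not taken from the Speech-XLNet code.
    """
    rng = rng or np.random.default_rng()
    perm = rng.permutation(T)           # random factorization order
    rank = np.empty(T, dtype=int)
    rank[perm] = np.arange(T)           # rank[t] = position of frame t in perm
    # frame q sees frame k only if k precedes q in the permutation;
    # in the original time order this mixes left and right context
    mask = (rank[None, :] < rank[:, None]).astype(np.int8)
    return perm, mask
```

Because the permutation is resampled per sequence, each frame is predicted from a different mix of past and future frames across training steps, which is the source of both the bi-directional context and the regularization effect described above.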