Chengyi Wang (Nankai University), Yu Wu (Microsoft Research Asia), Liang Lu (Microsoft), Shujie Liu (Microsoft Research Asia, Beijing), Jinyu Li (Microsoft), Guoli Ye (Microsoft), and Ming Zhou (Microsoft Research Asia)
The attention-based Transformer model has achieved promising results for speech recognition (SR) in the offline mode. In the streaming mode, however, applying a fixed-length look-ahead window in each encoder layer means the Transformer model usually incurs significant latency to maintain its recognition accuracy. In this paper, we propose a novel low-latency streaming approach for Transformer models, which consists of a scout network and a recognition network. The scout network detects whole-word boundaries without seeing any future frames, while the recognition network predicts the next subword using the information from all frames before the predicted boundary. Our model achieves the best performance (2.7/6.4 WER) with a latency of only 639 ms on the test-clean and test-other data sets of LibriSpeech.
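The interplay between the two networks can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the scout network emits one boundary probability per frame and that a boundary is declared by simple thresholding, and the names `detect_boundaries`, `visible_frames`, and `threshold` are hypothetical. Each mask row marks the frames a recognition step would be allowed to use once the corresponding boundary has been detected.

```python
from typing import List


def detect_boundaries(boundary_probs: List[float], threshold: float = 0.5) -> List[int]:
    """Return the frame indices whose boundary probability reaches the threshold.

    The per-frame probabilities stand in for a scout network's output
    (an assumption made for illustration only).
    """
    return [t for t, p in enumerate(boundary_probs) if p >= threshold]


def visible_frames(boundaries: List[int], num_frames: int) -> List[List[bool]]:
    """Build one mask row per boundary: True for frames at or before that boundary."""
    return [[t <= b for t in range(num_frames)] for b in boundaries]


# Toy per-frame boundary probabilities for a 7-frame utterance (made-up values).
probs = [0.1, 0.2, 0.9, 0.1, 0.3, 0.8, 0.2]
boundaries = detect_boundaries(probs)
masks = visible_frames(boundaries, len(probs))

for b, row in zip(boundaries, masks):
    print(b, "".join("x" if visible else "." for visible in row))
```

In this toy setting no frame after a detected boundary is ever visible, so a recognition step never has to wait for a fixed look-ahead window.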