Wed-1-5-5 Improved hybrid streaming ASR with Transformer language models

Pau Baquero-Arnal(Machine Learning and Language Processing (MLLP) research group, Valencian Research Institute for Artificial Intelligence (VRAIN), Universitat Politècnica de València (Spain)), Javier Jorge(Machine Learning and Language Processing (MLLP) research group, Valencian Research Institute for Artificial Intelligence (VRAIN), Universitat Politècnica de València (Spain)), Adrià Giménez(Machine Learning and Language Processing (MLLP) research group, Valencian Research Institute for Artificial Intelligence (VRAIN), Universitat Politècnica de València (Spain)), Joan Albert Silvestre-Cerdà(Machine Learning and Language Processing (MLLP) research group, Valencian Research Institute for Artificial Intelligence (VRAIN), Universitat Politècnica de València (Spain)), Javier Iranzo-Sánchez(Machine Learning and Language Processing (MLLP) research group, Valencian Research Institute for Artificial Intelligence (VRAIN), Universitat Politècnica de València (Spain)), Albert Sanchis(Machine Learning and Language Processing (MLLP) research group, Valencian Research Institute for Artificial Intelligence (VRAIN), Universitat Politècnica de València (Spain)), Jorge Civera(Machine Learning and Language Processing (MLLP) research group, Valencian Research Institute for Artificial Intelligence (VRAIN), Universitat Politècnica de València (Spain)) and Alfons Juan(Machine Learning and Language Processing (MLLP) research group, Valencian Research Institute for Artificial Intelligence (VRAIN),
Abstract: Streaming ASR is gaining momentum due to its wide applicability, though it is still unclear how best to come close to the accuracy of state-of-the-art off-line ASR systems when the output must come within a short delay after the incoming audio stream. Following our previous work on streaming one-pass decoding with hybrid ASR systems and LSTM language models, in this work we report further improvements by replacing LSTMs with Transformer models. First, two key ideas are discussed so as to run these models fast during inference. Then, empirical results on LibriSpeech and TED-LIUM are provided showing that Transformer language models lead to improved recognition rates on both tasks. The ASR systems obtained in this work can be seamlessly transferred to a streaming setup with minimal quality losses. Indeed, to the best of our knowledge, no better results have been reported on these tasks when assessed under a streaming setup.