Wed-2-8-9 Utterance invariant training for hybrid two-pass end-to-end speech recognition

Dhananjaya Gowda(Samsung Research), Abhinav Garg(SR), Ankur Kumar(SRIB), Kwangyoun Kim(Samsung Electronics), Jiyeon Kim(Samsung), Sachin Singh(SRIB), Mehul Kumar(Samsung), Shatrughan Singh(SRIB) and Chanwoo Kim(Samsung Research)
Abstract: In this paper, we propose an utterance invariant training (UIT) scheme specifically designed to improve the performance of a two-pass end-to-end hybrid ASR system. Our proposed hybrid ASR solution uses a shared encoder with a monotonic chunkwise attention (MoChA) decoder for streaming capability, and a low-latency bidirectional full-attention (BFA) decoder for enhancing overall ASR accuracy. A modified sequence summary network (SSN) based utterance invariant training is adapted to suit the two-pass model architecture. The input feature stream, self-conditioned by scaling and shifting with its own sequence summary, is used as concatenative conditioning on the bidirectional encoder layers sitting on top of the shared encoder. In effect, the proposed utterance invariant training combines three different types of conditioning: concatenative, multiplicative, and additive. Experimental results show that the proposed approach reduces word error rates by up to 7% relative on Librispeech, and by 10-15% relative on a large-scale Korean end-to-end two-pass hybrid ASR model.
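The combined conditioning described in the abstract can be illustrated with a minimal sketch. This is a hypothetical pure-Python illustration, not the paper's implementation: the summary is taken as a simple time average of the frames, the learned per-dimension weights `w_scale` and `w_shift` are assumed placeholders, and the summary vector is appended to each conditioned frame to mimic concatenative conditioning.

```python
def sequence_summary(x):
    """Average the utterance's frames over time: a stand-in for the
    sequence summary network (SSN) output. x is a list of T frames,
    each a list of D features."""
    T, D = len(x), len(x[0])
    return [sum(frame[d] for frame in x) / T for d in range(D)]

def uit_condition(x, w_scale, w_shift):
    """Self-condition the feature stream x with its own summary using
    the three conditioning types named in the abstract:
      - multiplicative: scale each frame by (1 + w_scale * summary)
      - additive:       shift each frame by (w_shift * summary)
      - concatenative:  append the summary vector to every frame
    Weights here are illustrative placeholders for learned parameters."""
    s = sequence_summary(x)
    out = []
    for frame in x:
        conditioned = [f * (1.0 + ws * sd) + wb * sd
                       for f, ws, wb, sd in zip(frame, w_scale, w_shift, s)]
        out.append(conditioned + s)  # concatenation doubles the feature dim
    return out

# Example: two frames of 2-dim features; with zero weights the scale/shift
# is an identity and only the concatenated summary changes the output.
x = [[1.0, 1.0], [3.0, 3.0]]
result = uit_condition(x, w_scale=[0.0, 0.0], w_shift=[0.0, 0.0])
print(result)  # [[1.0, 1.0, 2.0, 2.0], [3.0, 3.0, 2.0, 2.0]]
```

With zero weights the output frames are the originals with the utterance summary `[2.0, 2.0]` appended; nonzero weights additionally scale and shift each frame by summary-dependent amounts before concatenation.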