Thu-1-2-8 Emitting Word Timings with End-to-End Models

Tara Sainath (Google), Ruoming Pang (Google Inc.), David Rybach (Google), Basi Garcia (Google Inc.) and Trevor Strohman (Google, Inc.)
Abstract: Having end-to-end (E2E) models emit the start and end times of words on-device is important for various applications at Google. This unsolved problem presents challenges with respect to model size, latency and accuracy. In this paper, we present an approach to emitting word timings by constraining the attention head of the Listen, Attend, Spell (LAS) 2nd-pass rescorer \cite{SainathPang19}. On a Voice Search task, we show that this approach does not degrade accuracy compared to a model in which no attention head is constrained. In addition, it meets on-device size and latency constraints. In comparison, constraining the alignment of a 1st-pass Recurrent Neural Network Transducer (RNN-T) model to emit word timings results in quality degradation. Furthermore, a low-frame-rate conventional acoustic model \cite{SainathPang19}, which is trained with a constrained alignment and is used in many applications for word timings, is slower to detect start and end times than our proposed 2nd-pass LAS approach.