Wed-1-5-3 Knowledge Distillation from Offline to Streaming RNN Transducer for End-to-end Speech Recognition

Gakuto Kurata (IBM Research) and George Saon (IBM Research)

Abstract: End-to-end training of recurrent neural network transducers (RNN-Ts) does not require frame-level alignments between audio and output symbols. As a result, the posterior lattices defined by the predictive distributions of different RNN-Ts trained on the same data can differ substantially, which poses a new set of challenges for knowledge distillation between such models. These discrepancies are especially prominent between the posterior lattices of an offline model and a streaming model, which is expected because the streaming RNN-T emits symbols later than the offline RNN-T. We propose a method to train an RNN-T so that the posterior peaks at each node in its posterior lattice are aligned with those of a pretrained model for the same utterance. Using this method, we can train an offline RNN-T that serves as a good teacher for a student streaming RNN-T. Experimental results on the standard Switchboard conversational telephone speech corpus demonstrate accuracy improvements for a streaming unidirectional RNN-T through knowledge distillation from an offline bidirectional counterpart.
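
To illustrate why misaligned posterior peaks matter for distillation, here is a minimal sketch that is not taken from the paper. It assumes the distillation loss is a node-wise KL divergence between teacher and student posteriors, which the abstract does not specify. The function `lattice_kl`, the toy lattices, and the two-symbol vocabulary are all hypothetical and chosen only for illustration.

```python
import math

def lattice_kl(teacher, student, eps=1e-12):
    """Sum of KL(teacher || student) over all (t, u) nodes of a posterior lattice.

    Each lattice is a nested list indexed [t][u][k]; the innermost list is a
    probability distribution over output symbols (index 0 is blank here).
    This node-wise KL is an assumed loss, not the authors' stated objective.
    """
    total = 0.0
    for t_row, s_row in zip(teacher, student):
        for p, q in zip(t_row, s_row):
            total += sum(
                pk * math.log((pk + eps) / (qk + eps))
                for pk, qk in zip(p, q)
                if pk > 0.0
            )
    return total

# Toy lattice: 4 frames, 1 label position, vocabulary {blank, "a"}.
BLANK = [0.9, 0.1]  # node mostly predicts blank
EMIT = [0.1, 0.9]   # node mostly predicts "a"

teacher = [[BLANK], [EMIT], [BLANK], [BLANK]]  # posterior peak at frame 1
aligned = [[BLANK], [EMIT], [BLANK], [BLANK]]  # student peak at the same frame
delayed = [[BLANK], [BLANK], [EMIT], [BLANK]]  # student peak one frame later

kl_aligned = lattice_kl(teacher, aligned)
kl_delayed = lattice_kl(teacher, delayed)
print(kl_aligned, kl_delayed)
```

In this toy setting the aligned student incurs no penalty, while the delayed student is penalized at both the frame where the teacher peaks and the frame where the student peaks, even though both students emit the same symbol. That is the kind of mismatch the abstract describes between offline and streaming models.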

