Gakuto Kurata (IBM Research) and George Saon (IBM Research)
Abstract:
End-to-end training of recurrent neural network transducers (RNN-Ts)
does not require frame-level alignments between audio and output
symbols. As a result, the posterior lattices defined by the
predictive distributions of different RNN-Ts trained on the same data can
differ substantially, which poses new challenges for knowledge
distillation between such models.
These discrepancies are especially prominent between the posterior lattices of an offline model and a streaming model, as expected because the streaming RNN-T emits symbols later than the offline RNN-T.
We propose a method to train an RNN-T so that the posterior peaks at each node in its posterior lattice are aligned with those from a pretrained model for the same utterance.
With this method, we can train an offline RNN-T that
serves as a good teacher for a student streaming RNN-T.
Experimental results on the standard Switchboard conversational
telephone speech corpus demonstrate accuracy improvements for
a streaming unidirectional RNN-T by knowledge distillation from
an offline bidirectional counterpart.
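
To make the lattice mismatch described above concrete, the following is a minimal sketch, not the training procedure of this work. It assumes a posterior lattice is stored as a nested list indexed by frame t, output position u, and symbol k, with symbol 0 standing for blank. The function names `lattice_kl` and `peak_agreement` and the toy lattices are hypothetical and chosen only for illustration. The sketch measures how far two lattices are apart node by node, which is the quantity a lattice-level distillation loss would typically penalize, and how often their posterior peaks coincide.

```python
import math

def lattice_kl(teacher, student, eps=1e-12):
    """Mean KL(teacher || student) over all (t, u) lattice nodes."""
    total, nodes = 0.0, 0
    for t_row, s_row in zip(teacher, student):
        for p, q in zip(t_row, s_row):
            # KL divergence between the two distributions at this node.
            total += sum(
                pk * math.log((pk + eps) / (qk + eps))
                for pk, qk in zip(p, q)
                if pk > 0
            )
            nodes += 1
    return total / nodes

def peak_agreement(teacher, student):
    """Fraction of lattice nodes whose most probable symbol matches."""
    matches, nodes = 0, 0
    for t_row, s_row in zip(teacher, student):
        for p, q in zip(t_row, s_row):
            t_peak = max(range(len(p)), key=p.__getitem__)
            s_peak = max(range(len(q)), key=q.__getitem__)
            matches += int(t_peak == s_peak)
            nodes += 1
    return matches / nodes

# Toy lattices with 2 frames, 1 output position, 3 symbols (blank, a, b).
# The hypothetical "offline" model emits symbol a at frame 0, while the
# hypothetical "streaming" model emits it one frame later.
offline = [[[0.1, 0.8, 0.1]], [[0.8, 0.1, 0.1]]]
streaming = [[[0.8, 0.1, 0.1]], [[0.1, 0.8, 0.1]]]

kl_mismatch = lattice_kl(offline, streaming)
kl_self = lattice_kl(offline, offline)
agreement = peak_agreement(offline, streaming)
```

In this toy case the two models agree on which symbols are emitted but not on when, so the peaks never coincide and the node-wise divergence is large. This is the kind of emission-delay mismatch that makes naive lattice-level distillation from an offline teacher to a streaming student difficult.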