Mon-2-2-2 Self-Distillation for Improving CTC-Transformer-based ASR Systems

Takafumi Moriya (NTT Corporation), Tsubasa Ochiai (NTT Communication Science Laboratories), Shigeki Karita (NTT Communication Science Laboratories), Hiroshi Sato (NTT media intelligent laboratory), Tomohiro Tanaka (NTT Corporation), Takanori Ashihara (NTT Corporation), Ryo Masumura (NTT Corporation), Yusuke Shinohara (NTT Corporation) and Marc Delcroix (NTT Communication Science Laboratories)
Abstract: We present a novel training approach for encoder-decoder-based sequence-to-sequence (S2S) models. A key component of S2S models is the attention mechanism, which captures the relationships between input and output sequences. The attention weights indicate which time frames should be attended to when predicting each output label. In previous work, we proposed distilling S2S knowledge into connectionist temporal classification (CTC) models by using the attention characteristics to create pseudo-targets for an auxiliary cross-entropy loss term. This approach can significantly improve CTC models. However, it remained unclear whether our proposal could also be used to improve S2S models. In this paper, we extend our previous work to create a strong S2S model, i.e., a Transformer with CTC (CTC-Transformer). We use the Transformer outputs and the source attention weights to create pseudo-targets that contain both the posterior and the timing information of each Transformer output. These pseudo-targets are used to train the shared encoder of the CTC-Transformer so that it receives direct feedback from the Transformer decoder and learns more informative representations. Experiments on various tasks demonstrate that our proposal is also effective for enhancing S2S model training. In particular, our best system on a Japanese ASR task outperforms the previous state-of-the-art alternative.
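The abstract does not spell out how the pseudo-targets are computed, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes that frame-level soft targets are obtained by weighting the decoder posteriors with the source attention weights and renormalizing per frame, and that the auxiliary term is a soft-label cross entropy against the encoder-side posteriors. The function names, the uniform fallback for unattended frames, and the toy numbers are all hypothetical.

```python
import math

def make_pseudo_targets(attn, dec_post):
    """Build frame-level soft targets (assumed formulation).

    attn:     U x T source-attention weights, one row per output token.
    dec_post: U x V decoder posteriors, one row per output token.
    Returns a T x V list of per-frame label distributions.
    """
    U, T, V = len(attn), len(attn[0]), len(dec_post[0])
    targets = []
    for t in range(T):
        # Attention-weighted mix of the decoder posteriors at frame t.
        row = [sum(attn[u][t] * dec_post[u][v] for u in range(U))
               for v in range(V)]
        total = sum(row)
        # Assumption: frames that no token attends to get a uniform target.
        row = [x / total for x in row] if total > 0 else [1.0 / V] * V
        targets.append(row)
    return targets

def soft_cross_entropy(targets, enc_post, eps=1e-12):
    """Mean over frames of -sum_v target[v] * log(enc_post[v])."""
    loss = 0.0
    for t_row, p_row in zip(targets, enc_post):
        loss -= sum(t * math.log(p + eps) for t, p in zip(t_row, p_row))
    return loss / len(targets)

# Toy example: 2 output tokens, 3 frames, 2 labels (hypothetical values).
attn = [[1.0, 0.0, 0.0],
        [0.0, 0.5, 0.5]]
dec_post = [[0.9, 0.1],
            [0.2, 0.8]]
enc_post = [[0.7, 0.3],
            [0.4, 0.6],
            [0.3, 0.7]]

pseudo_targets = make_pseudo_targets(attn, dec_post)
aux_loss = soft_cross_entropy(pseudo_targets, enc_post)
```

In a full system this auxiliary term would presumably be added, with some weight, to the usual CTC and attention losses; how the terms are balanced is not stated in the abstract.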