Wed-2-8-8 Phoneme-to-Grapheme Conversion Based Large-Scale Pre-Training for End-to-End Automatic Speech Recognition

Ryo Masumura (NTT Corporation), Naoki Makishima (NTT Corporation), Mana Ihori (NTT Corporation), Akihiko Takashima (NTT Corporation), Tomohiro Tanaka (NTT Corporation) and Shota Orihashi (NTT Corporation)
Abstract: This paper describes a simple and efficient pre-training method that uses a large number of external texts to enhance end-to-end automatic speech recognition (ASR). Speech-to-text paired data are essential for constructing end-to-end ASR models, but collecting a large amount of such data is difficult in practice. One issue caused by this data scarcity is poor ASR performance on out-of-domain tasks, i.e., tasks whose domain differs from that of the available speech-to-text paired data, because the mapping from speech information to textual information is not well learned. To address this problem, we leverage a large amount of phoneme-to-grapheme (P2G) paired data, which can be easily created from external texts and a rich pronunciation dictionary. P2G conversion and end-to-end ASR are regarded as similar transformation tasks in which input phonetic information is converted into textual information. Our method uses the P2G conversion task to pre-train the decoder network of a Transformer encoder-decoder based end-to-end ASR model. Experiments using 4 billion tokens of Web text demonstrate that our pre-training significantly improves ASR performance on out-of-domain tasks.
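
The abstract outlines a two-step recipe: generate P2G pairs from raw text with a pronunciation dictionary, then use those pairs to pre-train the ASR decoder. Below is a minimal sketch of the pair-generation step only; the dictionary (PRON_DICT) and helper name (text_to_p2g_pair) are toy stand-ins invented for illustration, whereas the paper relies on a rich pronunciation dictionary and 4 billion tokens of Web text.

```python
# Minimal sketch (not the authors' code): building phoneme-to-grapheme (P2G)
# training pairs from external text using a pronunciation dictionary.
# PRON_DICT is a toy stand-in for a full lexicon such as CMUdict.

PRON_DICT = {
    "end": ["EH", "N", "D"],
    "to": ["T", "UW"],
    "speech": ["S", "P", "IY", "CH"],
    "recognition": ["R", "EH", "K", "AH", "G", "N", "IH", "SH", "AH", "N"],
}

def text_to_p2g_pair(text: str):
    """Turn one text line into a (phoneme sequence, grapheme sequence) pair.

    Words missing from the dictionary are skipped here; a practical
    system would back off to a trained G2P model instead.
    """
    phonemes, graphemes = [], []
    for word in text.lower().split():
        if word not in PRON_DICT:
            continue
        phonemes.extend(PRON_DICT[word])   # source side: phonetic input
        graphemes.extend(list(word))       # target side: character output
    return phonemes, graphemes

if __name__ == "__main__":
    src, tgt = text_to_p2g_pair("end to end speech recognition")
    print("P2G input (phonemes):", src)
    print("P2G target (graphemes):", tgt)
```

The resulting (phoneme, grapheme) sequence pairs can then serve as source/target data for pre-training the Transformer decoder before fine-tuning on genuine speech-to-text pairs, mirroring how the decoder later maps encoded phonetic information to text during ASR.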