Xinhui Hu (Hithink Flush Information Network Co Ltd), Qi Zhang (Hithink RoyaFlush AI Research Institute), Lei Yang (Hithink RoyaFlush AI Research Institute), Binbin Gu (Hithink RoyaFlush AI Research Institute), and Xinkang Xu (Hithink RoyaFlush AI Research Institute)
Abstract:
To address the scarcity of data for training language
models (LMs) for code-switching (CS) speech recognition, we
propose an approach that obtains augmentation texts from three
different viewpoints. The first enhances the monolingual
LM by selecting corresponding sentences for existing conversational corpora. The second is based on replacements under a
syntactic constraint in a monolingual Chinese corpus, with the
help of an aligned word list obtained from a pseudo-parallel
corpus and the part-of-speech (POS) tags of words. The third
uses text generation based on a pointer-generator network with
a copy mechanism, trained on real CS text data. The sentences obtained from each of these approaches improve the CS LMs,
and they are finally fused into a single LM for CS automatic speech recognition (ASR) tasks.
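The second viewpoint, POS-constrained replacement, can be illustrated with a minimal sketch. It is not the authors' implementation: the aligned word list, the POS tag set, the rule that only nouns may be switched, and the function name `code_switch` are all assumptions made for illustration, since the abstract does not specify the exact constraint.

```python
import random

# Hypothetical aligned word list (Chinese word -> English candidates).
# In the paper this list is obtained from a pseudo-parallel corpus;
# the entries here are invented for illustration only.
ALIGNED = {"会议": ["meeting"], "项目": ["project"], "讨论": ["discuss"]}

# Assumed syntactic constraint: only words with these POS tags may be replaced.
SWITCHABLE_POS = {"NN"}


def code_switch(tagged, aligned=ALIGNED, allowed=SWITCHABLE_POS, rng=None):
    """Replace Chinese words with aligned English words when the POS tag allows it.

    `tagged` is a list of (word, pos) pairs from a segmented, POS-tagged sentence.
    """
    rng = rng or random.Random(0)
    out = []
    for word, pos in tagged:
        candidates = aligned.get(word)
        if candidates and pos in allowed:
            # The word has an aligned English counterpart and a permitted POS tag.
            out.append(rng.choice(candidates))
        else:
            # Keep the original Chinese word.
            out.append(word)
    return out


sentence = [("我们", "PN"), ("明天", "NT"), ("讨论", "VV"), ("项目", "NN")]
print(" ".join(code_switch(sentence)))  # 我们 明天 讨论 project
```

In this sketch the verb 讨论 stays in Chinese even though it appears in the aligned list, because its tag is outside the assumed set of switchable tags. This is the role a syntactic constraint plays in keeping the generated CS sentences plausible.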
LMs built on the above augmented data were evaluated
on two Mandarin-English CS speech sets, DTANG and SEAME. Perplexity was greatly reduced
with every kind of augmented text, and speech recognition performance improved steadily. The mixed word error rate
(MER) on the DTANG and SEAME evaluation sets was reduced
by 9.10% and 29.73% relative, respectively.
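For reference, a relative reduction is conventionally computed against the baseline system's error rate. The formula below assumes that usual definition, and the subscripts are labels introduced here rather than notation from the paper:

```latex
\[
\text{relative reduction}
  = \frac{\mathrm{MER}_{\text{baseline}} - \mathrm{MER}_{\text{augmented}}}
         {\mathrm{MER}_{\text{baseline}}} \times 100\%
\]
```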