Yizhou Lu (Shanghai Jiao Tong University), Mingkun Huang (Shanghai Jiao Tong University), Hao Li (Shanghai Jiao Tong University), Jiaqi Guo (Shanghai Jiao Tong University) and Yanmin Qian (Shanghai Jiao Tong University)
Code-switching speech recognition is a challenging task that has been studied in many previous works, and one main challenge for this task is the lack of code-switching data. In this paper, we study end-to-end models for Mandarin-English code-switching automatic speech recognition. External monolingual data are utilized to alleviate the data sparsity problem. More importantly, we propose a Mixture of Experts (MoE) architecture based on a bi-encoder transformer network to better leverage these data. We decouple Mandarin and English modeling with two separate encoders to better capture language-specific information, and a gating network is employed to explicitly handle the language identification task. For the gating network, different models and training modes are explored to learn better MoE interpolation coefficients. Experimental results show that, compared with the baseline transformer model, the proposed MoE architecture obtains up to 10.4% relative error reduction on the code-switching test set.
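The core idea of the bi-encoder MoE can be illustrated with a minimal sketch: two language-specific encoders each produce a representation of the input frame, and a gating network predicts per-frame language weights used to interpolate the two outputs. The toy linear "encoders" and the specific weight matrices below are hypothetical stand-ins for the paper's transformer encoders, chosen only to make the interpolation mechanism concrete.

```python
import math

def softmax(xs):
    # numerically stable softmax over a list of logits
    m = max(xs)
    e = [math.exp(x - m) for x in xs]
    s = sum(e)
    return [v / s for v in e]

def linear(frame, weights):
    # toy linear map standing in for a full transformer encoder / gating network
    return [sum(w * x for w, x in zip(row, frame)) for row in weights]

def moe_frame(frame, enc_zh, enc_en, gate_w):
    # language-specific encoders yield two candidate representations
    h_zh = linear(frame, enc_zh)
    h_en = linear(frame, enc_en)
    # gating network predicts Mandarin/English weights for this frame
    g_zh, g_en = softmax(linear(frame, gate_w))
    # MoE output: gated interpolation of the two encoders' outputs
    out = [g_zh * a + g_en * b for a, b in zip(h_zh, h_en)]
    return out, (g_zh, g_en)

# hypothetical 2-dim frame and 2x2 parameter matrices for illustration
frame = [1.0, 2.0]
enc_zh = [[1.0, 0.0], [0.0, 1.0]]
enc_en = [[0.5, 0.5], [0.5, 0.5]]
gate_w = [[1.0, -1.0], [-1.0, 1.0]]
out, (g_zh, g_en) = moe_frame(frame, enc_zh, enc_en, gate_w)
```

In the actual model the interpolation coefficients are learned jointly with the encoders, and the gating network can additionally be supervised with frame-level language identification labels, which is what "explicitly handle the language identification task" refers to.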