Longfei Yang (Tokyo Institute of Technology), Kaiqi Fu (Beijing Language and Culture University), Jinsong Zhang (Beijing Language and Culture University), and Takahiro Shinozaki (Tokyo Institute of Technology)
Abstract:
Pronunciation erroneous tendency (PET) is designed to provide instructive feedback to guide second language learners, and PET detection plays an important role in computer-aided pronunciation training (CAPT) systems. However, PET detection suffers from a data sparsity problem because collecting and annotating second language data is time-consuming. In this paper, we propose a contrastive predictive coding (CPC) based unsupervised learning approach that extracts relevant knowledge from a large amount of unlabeled raw speech in two native languages for PET detection. We establish a unified framework in which language adversarial training is incorporated to guide CPC to align the feature distributions of the two languages. In addition, a sinc convolutional layer is introduced to extract formant-like features, which are considered relevant to some kinds of erroneous pronunciations. Experiments on the Japanese part of the BLCU inter-Chinese speech corpus show that the proposed language adversarial contrastive predictive coding with sinc convolution improves the performance of PET detection for second language learners.
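
As an illustration of the contrastive objective that CPC is generally trained with, the following minimal sketch computes an InfoNCE-style loss for a single prediction step. It is not the implementation used in this work: the function names `dot` and `info_nce_loss` are hypothetical, the plain dot-product score is an assumption (the abstract does not specify the scoring function), and the language adversarial branch and the sinc convolutional layer are omitted.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def info_nce_loss(prediction, candidates, positive_index):
    """InfoNCE-style loss for one prediction step (illustrative only).

    prediction     : vector predicted from the context representation
    candidates     : encoded frames, one true future frame plus negatives
    positive_index : position of the true future frame in `candidates`

    Assumption: similarity is a plain dot product; the scoring function
    of the original system is not stated in the abstract.
    """
    scores = [dot(prediction, c) for c in candidates]
    # Numerically stable log-sum-exp over all candidate scores.
    top = max(scores)
    log_norm = top + math.log(sum(math.exp(s - top) for s in scores))
    # Negative log-probability of selecting the true future frame.
    return log_norm - scores[positive_index]


# Toy usage: the prediction matches the first candidate.
pred = [1.0, 0.0]
frames = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
loss_correct = info_nce_loss(pred, frames, positive_index=0)
loss_wrong = info_nce_loss(pred, frames, positive_index=2)
```

In this toy case the loss is lower when the frame marked as positive is the one the prediction actually resembles, which is the signal that drives the encoder to learn predictive representations from unlabeled speech.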