Mon-1-10-4 LAIX Corpus of Chinese Learner English: Towards a Benchmark for L2 English ASR

Huan Luan (LAIX), Jiahong Yuan (LAIX), Hui Lin (LAIX) and Yanhong Wang (LAIX)
Abstract: This paper introduces a corpus of Chinese Learner English containing 82 hours of L2 English speech by Chinese learners from all major dialect regions, collected through mobile apps developed by LAIX Inc. The LAIX corpus was created to serve as a benchmark dataset for evaluating Automatic Speech Recognition (ASR) performance on L2 English, the first of its kind as far as we know. The paper describes our effort to build the corpus, including corpus design, data selection, and transcription. Multiple rounds of quality checks were conducted during the transcription process. Transcription errors are analyzed in terms of error types, rounds of review, and learners' proficiency levels. Word error rates of state-of-the-art ASR systems on the benchmark corpus are also reported.