Wed-2-6-10 MLS: A Large-Scale Multilingual Dataset for Speech Research

Vineel Pratap (Facebook), Qiantong Xu (Facebook AI Research), Anuroop Sriram (Facebook AI), Gabriel Synnaeve (Facebook AI Research) and Ronan Collobert (Facebook AI Research)
Abstract: This paper introduces the Multilingual LibriSpeech (MLS) dataset, a large multilingual corpus suitable for speech research. The dataset is derived from read audiobooks from LibriVox and covers 8 languages, including about 32K hours of English and a total of 4.5K hours for the other languages. We provide baseline Automatic Speech Recognition (ASR) models and Language Models (LM) for all the languages in our dataset. We believe such a large transcribed dataset will open new avenues in ASR and Text-To-Speech (TTS) research. The dataset will be made freely available to anyone at http://www.openslr.org.