Syed Shahnawazuddin (National Institute of Technology Patna), Nagaraj Adiga (University of Crete), Kunal Kumar (National Institute of Technology Patna), Aayushi Poddar (National Institute of Technology Patna) and Waquar Ahmad (National Institute of Technology Calicut)
Automatic recognition of children's speech is a challenging research problem for several reasons. One of them is the unavailability of large amounts of speech data from child speakers for developing automatic speech recognition (ASR) systems that employ deep learning architectures. Training on a limited amount of data limits the power of the learned system. To overcome this issue, we explore ways to make effective use of adults' speech data for training an ASR system. For that purpose, generative adversarial network (GAN) based voice conversion (VC) is used to modify the acoustic attributes of adults' speech so that it becomes perceptually similar to children's speech. The original and converted speech samples from adult speakers are then pooled together to learn the statistical model parameters. VC-based data augmentation yields a significantly improved recognition rate for children's speech. To further enhance the recognition rate, a limited amount of children's speech data is also pooled into training. A large reduction in error rate is observed in this case as well. It is worth mentioning that GAN-based VC does not change the speaking rate. To demonstrate the need to deal with speaking-rate differences, we report results for time-scale modification of the children's speech test data.
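The pooling step described above can be illustrated with a minimal sketch. It is not the authors' pipeline: the `Utterance` record, the `tag_converted` and `build_training_pool` helpers, the `-vc` ID suffix, and all file paths are hypothetical names introduced here for illustration, and the GAN-based conversion itself is assumed to have already been run offline. The sketch only shows how original adult, converted adult, and (optionally) child utterances might be combined into one training list, with converted copies reusing the transcripts of their originals.

```python
from dataclasses import dataclass, replace
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class Utterance:
    """Hypothetical manifest entry for one training utterance."""
    utt_id: str
    wav_path: str
    transcript: str
    source: str  # "adult", "adult_vc", or "child"


def tag_converted(adult: Iterable[Utterance], converted_dir: str) -> List[Utterance]:
    """Create manifest entries for voice-converted copies of adult utterances.

    Assumes the converted audio already exists under `converted_dir` with the
    same base file name. The transcript is reused unchanged.
    """
    return [
        replace(
            u,
            utt_id=u.utt_id + "-vc",  # distinct ID so both copies can coexist
            wav_path=f"{converted_dir}/{u.utt_id}.wav",
            source="adult_vc",
        )
        for u in adult
    ]


def build_training_pool(
    adult: Iterable[Utterance],
    converted: Iterable[Utterance],
    child: Optional[Iterable[Utterance]] = None,
) -> List[Utterance]:
    """Pool original adult, converted adult, and optional child utterances."""
    pool = list(adult) + list(converted) + list(child or [])
    ids = [u.utt_id for u in pool]
    if len(ids) != len(set(ids)):
        raise ValueError("duplicate utterance IDs in pooled training set")
    return pool


# Toy data; IDs, paths, and transcripts are placeholders.
adult = [
    Utterance("a001", "adult/a001.wav", "open the door", "adult"),
    Utterance("a002", "adult/a002.wav", "close the window", "adult"),
]
converted = tag_converted(adult, "adult_vc")
child = [Utterance("c001", "child/c001.wav", "hello there", "child")]

pool_vc = build_training_pool(adult, converted)          # adult + converted
pool_all = build_training_pool(adult, converted, child)  # plus child data
```

Here `pool_vc` corresponds to the VC-based augmentation setting and `pool_all` to the setting in which a limited amount of children's speech is added to training.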