Yan Huang (Microsoft Corporation), Jinyu Li (Microsoft), Lei He (Microsoft), Wenning Wei (Microsoft), William Gale (Microsoft) and Yifan Gong (Microsoft Corp)
Rapid unsupervised speaker adaptation in an E2E system poses new challenges due to its end-to-end unified structure, in addition to the intrinsic difficulties of data sparsity and imperfect labels. Previously, we proposed utilizing content-relevant personalized speech synthesis for rapid speaker adaptation and achieved a significant performance breakthrough in a hybrid system. In this paper, we answer the following two questions: First, how can rapid speaker adaptation be performed effectively in an RNN-T? Second, is our previously proposed approach still beneficial for the RNN-T, and what modifications and distinct observations arise? We apply the proposed methodology to a speaker adaptation task in a state-of-the-art presentation transcription RNN-T system. In the 1 min setup, it yields 11.58% or 7.95% relative word error rate (WER) reduction for supervised or unsupervised adaptation respectively, compared to the negligible gain obtained when adapting with 1 min of source speech. In the 10 min setup, it yields 15.71% or 8.00% relative WER reduction, doubling the gain of source speech adaptation. We further apply various data filtering techniques and significantly bridge the gap between supervised and unsupervised adaptation.