Yan Huang (Microsoft Corporation), Jinyu Li (Microsoft), Lei He (Microsoft), Wenning Wei (Microsoft), William Gale (Microsoft) and Yifan Gong (Microsoft Corp)
Rapid unsupervised speaker adaptation in an E2E system poses new challenges due to its end-to-end unified structure, in addition to the intrinsic difficulties of data sparsity and imperfect labels. Previously, we proposed utilizing content-relevant personalized speech synthesis for rapid speaker adaptation and achieved a significant performance breakthrough in a hybrid system. In this paper, we answer the following two questions: First, how can rapid speaker adaptation be performed effectively in an RNN-T? Second, is our previously proposed approach still beneficial for the RNN-T, and what modifications and distinct observations arise? We apply the proposed methodology to a speaker adaptation task in a state-of-the-art presentation transcription RNN-T system. In the 1 min setup, it yields 11.58% or 7.95% relative word error rate (WER) reduction for supervised or unsupervised adaptation respectively, compared to the negligible gain obtained when adapting with 1 min of source speech. In the 10 min setup, it yields 15.71% or 8.00% relative WER reduction, doubling the gain of source speech adaptation. We further apply various data filtering techniques and significantly bridge the gap between supervised and unsupervised adaptation.