Mon-2-7-6 Spoken Content and Voice Factorization for Few-shot Speaker Adaptation

Tao Wang, Jianhua Tao, Ruibo Fu, Jiangyan Yi, Zhengqi Wen and Rongxiu Zhong (National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences)
Abstract: The low similarity and naturalness of synthesized speech remain challenging problems for speaker adaptation with few resources. Because the acoustic model is too complex to interpret, overfitting occurs when training with little data. To prevent the model from overfitting, this paper proposes a novel speaker adaptation framework that decomposes the parameter space of the end-to-end acoustic model into two parts: one predicts the spoken content and the other models the speaker's voice. The spoken content is represented by a phone posteriorgram (PPG), which is speaker independent. By adapting the two sub-modules separately, overfitting can be alleviated effectively. Moreover, we propose two different adaptation strategies depending on whether the data has text annotations, so speaker adaptation can also be performed without text annotations. Experimental results confirm the adaptability of the proposed method of factorizing spoken content and voice, and listening tests demonstrate that with just 10 sentences it achieves better naturalness and speaker similarity than speaker adaptation conducted on Tacotron.
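The factorization idea in the abstract can be illustrated with a minimal sketch: split the acoustic model into a speaker-independent content module (text features to PPG) and a speaker-dependent voice module (PPG to acoustic features), then fit only the voice module on the few adaptation sentences. This is a hypothetical toy implementation with linear layers and a hand-derived least-squares gradient, not the authors' actual model; all names and dimensions here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

class Linear:
    """A toy stand-in for a neural sub-module: a single linear map."""
    def __init__(self, d_in, d_out):
        self.W = rng.standard_normal((d_in, d_out)) * 0.1
    def __call__(self, x):
        return x @ self.W

# Speaker-independent content module: text/phone features -> PPG.
content_module = Linear(16, 8)
# Speaker-dependent voice module: PPG -> acoustic (e.g. mel) features.
voice_module = Linear(8, 4)

def synthesize(text_feats):
    ppg = content_module(text_feats)
    return voice_module(ppg)

def adapt_voice(text_feats, target_mel, lr=0.05, steps=100):
    """Few-shot adaptation: freeze the content module and take gradient
    steps on the voice module only, so far fewer parameters are fit to
    the small adaptation set."""
    for _ in range(steps):
        ppg = content_module(text_feats)           # frozen forward pass
        pred = voice_module(ppg)
        # Gradient of mean squared error w.r.t. the voice weights only.
        grad = 2 * ppg.T @ (pred - target_mel) / len(text_feats)
        voice_module.W -= lr * grad                # update voice params only

# Toy adaptation set standing in for "10 sentences" of a new speaker.
text_feats = rng.standard_normal((10, 16))
target_mel = rng.standard_normal((10, 4))

content_snapshot = content_module.W.copy()
before = np.mean((synthesize(text_feats) - target_mel) ** 2)
adapt_voice(text_feats, target_mel)
after = np.mean((synthesize(text_feats) - target_mel) ** 2)
```

The design point is that adaptation touches only `voice_module.W`; the content module stays exactly as trained on the multi-speaker corpus, which is the mechanism the paper credits with alleviating overfitting.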