Wed-3-10-8 Learning Syllable-Level Discrete Prosodic Representation for Expressive Speech Generation

Guangyan Zhang(Dept. of Electronic Engineering, The Chinese University of Hong Kong), Ying Qin(The Chinese University of Hong Kong) and Tan Lee(The Chinese University of Hong Kong)
Abstract: This paper presents an extension of the Tacotron 2 end-to-end speech synthesis architecture that learns syllable-level discrete prosodic representations from speech data. The learned representations can be used for transferring or controlling prosody in expressive speech generation. The proposed design starts with a syllable-level text encoder that encodes input text at the syllable level instead of the phoneme level. A continuous prosodic representation is then extracted for each syllable. A Vector-Quantised Variational Auto-Encoder (VQ-VAE) is used to discretize the learned continuous prosodic representations. The discrete representations are finally concatenated with the text encoder output to achieve prosody transfer or control. Subjective evaluation is carried out on the naturalness of the syllable-level TTS system and on the effectiveness of prosody transfer. The results show that the proposed syllable-level neural TTS system produces more natural speech than the conventional phoneme-level TTS system. It is also shown that prosody transfer can be achieved and that the latent prosody codes are explainable in relation to specific prosody variations.
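The core of the VQ-VAE step described in the abstract is replacing each continuous syllable-level prosodic vector with its nearest entry in a learned codebook, yielding one discrete prosody code per syllable. A minimal NumPy sketch of that nearest-neighbour quantization follows; the function name, dimensions, and random data are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def vector_quantize(z, codebook):
    """Map each continuous prosodic vector to its nearest codebook entry.

    z: (num_syllables, dim) continuous syllable-level representations
    codebook: (num_codes, dim) learned discrete embedding table
    Returns (indices, quantized): the discrete code per syllable and the
    corresponding codebook vectors passed on to the decoder.
    """
    # Squared Euclidean distance from every syllable vector to every code
    dists = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    indices = dists.argmin(axis=1)      # one discrete prosody code per syllable
    return indices, codebook[indices]   # quantized vectors replace z downstream

# Toy example (hypothetical sizes): 3 syllables, 4 codes, 2-dim embeddings
rng = np.random.default_rng(0)
z = rng.normal(size=(3, 2))
codebook = rng.normal(size=(4, 2))
codes, z_q = vector_quantize(z, codebook)
```

In a full VQ-VAE the codebook is trained jointly with the encoder (with a straight-through gradient estimator and a commitment loss); the sketch shows only the inference-time lookup that produces the discrete representations concatenated with the text encoder output.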