Wed-3-10-5 Principal Style Components: Expressive Style Control and Cross-Speaker Transfer in Neural TTS

Alexander Sorin (IBM Research - Haifa), Slava Shechtman (IBM Research - Haifa) and Ron Hoory (IBM Research - Haifa)

Abstract: We propose a novel semi-supervised technique that enables expressive style control and cross-speaker transfer in neural text-to-speech (TTS) when the available training data contains a limited amount of labeled expressive speech from a single speaker. The technique learns, without supervision, a style-related latent space generated by a previously proposed reference audio encoding technique, and then transforms it by means of Principal Component Analysis (PCA) into another low-dimensional space. The latter space represents style information in a purified form, disentangled from text- and speaker-related information. Encodings for the expressive styles present in the training data are easily constructed in this space. Furthermore, the technique provides control over speech rate, pitch level, and articulation type, which can be used for TTS voice transformation. We present the results of subjective crowd evaluations confirming that the synthesized speech convincingly conveys the desired expressive styles and preserves a high level of quality.
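
To make the pipeline in the abstract concrete, here is a minimal sketch of the general idea: fit Principal Component Analysis on style embeddings, then build one encoding per labeled style in the reduced space. It is not the authors' implementation. The embeddings are synthetic stand-ins for reference-encoder outputs, the dimensions (16 and 3) are arbitrary, the helper names (`fit_pca`, `project`, `style_codes`, `to_latent`) are hypothetical, and averaging the projected embeddings per label is only one plausible way to construct a style encoding; the abstract does not specify how this is done.

```python
import numpy as np

def fit_pca(embeddings, n_components):
    """Return the data mean and the top principal directions (rows)."""
    mean = embeddings.mean(axis=0)
    # Rows of vt are orthonormal principal directions, ordered by variance.
    _, _, vt = np.linalg.svd(embeddings - mean, full_matrices=False)
    return mean, vt[:n_components]

def project(embeddings, mean, components):
    """Map latent embeddings into the low-dimensional PCA space."""
    return (embeddings - mean) @ components.T

def style_codes(projected, labels):
    """ASSUMPTION: one code per style = mean of its projected embeddings."""
    names = sorted(set(labels.tolist()))
    return {name: projected[labels == name].mean(axis=0) for name in names}

def to_latent(code, mean, components):
    """Map a PCA-space code back to the original latent space."""
    return mean + code @ components

# Synthetic stand-in for reference-encoder outputs (not real data).
rng = np.random.default_rng(0)
dim = 16
labels = np.array(["neutral"] * 100 + ["happy"] * 100)
embeddings = 0.1 * rng.normal(size=(200, dim))
embeddings[labels == "happy", 0] += 2.0  # styles differ along one axis

mean, components = fit_pca(embeddings, n_components=3)
projected = project(embeddings, mean, components)
codes = style_codes(projected, labels)

# Hypothetical control knob: scale a style code before decoding it.
latent = to_latent(1.5 * codes["happy"], mean, components)
```

In this toy setup the two styles separate along the first principal component, so a style code can be scaled or interpolated there before being mapped back to the latent space that would condition a synthesizer. Whether the paper exposes control in exactly this way is an assumption.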
