Wed-1-3-3 Complex-Valued Variational Autoencoder: A Novel Deep Generative Model for Direct Representation of Complex Spectra

Toru Nakashika (The University of Electro-Communications)
Abstract: In recent years, variational autoencoders (VAEs) have been attracting interest for many applications and generative tasks. Although the VAE is one of the most powerful deep generative models, it still has difficulty representing complex-valued data such as the complex spectra of speech. In speech synthesis, the VAE is usually used to encode Mel-cepstra, or raw amplitude spectra, computed from a speech signal into normally distributed latent features, and the speech is then synthesized from the reconstruction by using the Griffin-Lim algorithm or other vocoders. Such inputs are originally calculated from complex spectra but lack the phase information, which leads to degradation when recovering speech. In this work, we propose a novel generative model that directly encodes the complex spectra by extending the conventional VAE. The proposed model, which we call the complex-valued VAE (CVAE), consists of two complex-valued neural networks (CVNNs): an encoder and a decoder. In the CVAE, not only the inputs and the parameters of the encoder and decoder but also the latent features are defined as complex-valued, preserving the phase information throughout the network. The results of our speech encoding experiments demonstrated the effectiveness of the CVAE compared to the conventional VAE in terms of both objective and subjective criteria.
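To make the architecture described in the abstract concrete, the following is a minimal NumPy sketch of a complex-valued encoder/decoder with a complex Gaussian latent. The layer sizes, the split-tanh activation, and the circularly symmetric latent sampling are illustrative assumptions; the paper's actual network design and training objective are not specified here.

```python
import numpy as np

rng = np.random.default_rng(0)

def init(shape):
    # Complex-valued parameter initialization (assumption: small random values).
    return (rng.standard_normal(shape) + 1j * rng.standard_normal(shape)) * 0.1

def split_tanh(z):
    # "Split" activation: tanh applied to real and imaginary parts separately,
    # a common choice in complex-valued neural networks (assumed here).
    return np.tanh(z.real) + 1j * np.tanh(z.imag)

D, H, Z = 8, 16, 4  # spectrum dim, hidden dim, latent dim (toy sizes)

# Encoder: complex spectrum -> parameters of a complex Gaussian latent.
W_h, b_h = init((D, H)), init(H)
W_mu, b_mu = init((H, Z)), init(Z)
W_lv, b_lv = rng.standard_normal((H, Z)) * 0.1, np.zeros(Z)  # log-variance stays real

# Decoder: complex latent -> reconstructed complex spectrum.
W_g, b_g = init((Z, H)), init(H)
W_o, b_o = init((H, D)), init(D)

def encode(x):
    h = split_tanh(x @ W_h + b_h)
    mu = h @ W_mu + b_mu                # complex mean of the latent
    log_var = h.real @ W_lv + b_lv      # real-valued log-variance
    return mu, log_var

def reparameterize(mu, log_var):
    # Sample from a circularly symmetric complex Gaussian:
    # z = mu + sigma * (eps_r + i * eps_i) / sqrt(2), so E[|z - mu|^2] = sigma^2.
    eps = (rng.standard_normal(mu.shape)
           + 1j * rng.standard_normal(mu.shape)) / np.sqrt(2)
    return mu + np.exp(0.5 * log_var) * eps

def decode(z):
    g = split_tanh(z @ W_g + b_g)
    return g @ W_o + b_o                # reconstructed complex spectrum

x = init((2, D))                        # toy batch of two complex "spectra"
mu, log_var = encode(x)
x_hat = decode(reparameterize(mu, log_var))
```

Because every weight, input, and latent variable is complex, the phase of the input spectrum is carried end to end rather than being discarded, which is the key difference from encoding amplitude-only features.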