Patrick Lumban Tobing(Nagoya University), Tomoki Hayashi(Nagoya University), Yi-Chiao Wu(Nagoya University), Kazuhiro Kobayashi(Nagoya University) and Tomoki Toda(Nagoya University)
We present a novel approach of cyclic spectral modeling for unsupervised discovery of speech units into voice conversion with excitation network and waveform modeling. Specifically, we propose two spectral modeling techniques: 1) cyclic vector-quantized autoencoder (CycleVQVAE), and 2) cyclic variational autoencoder (CycleVAE). In CycleVQVAE, a discrete latent space is used for the speech units, whereas, in CycleVAE, a continuous latent space is used. The cyclic structure is developed using the reconstruction flow and the cyclic reconstruction flow of spectral features, where the latter is obtained by recycling the converted spectral features. This method is used to obtain a possible speaker-independent latent space because of marginalization on all possible speaker conversion pairs during training. On the other hand, speaker-dependent space is conditioned with a one-hot speaker-code. Excitation modeling is developed in a separate manner for CycleVQVAE, while it is in a joint manner for CycleVAE. To generate speech waveform, WaveNet-based waveform modeling is used. The proposed framework is entried for the ZeroSpeech Challenge 2020, and is capable of reaching a character error rate of 0.21, a speaker similarity score of 3.91, a mean opinion score of 3.84 for the naturalness of the converted speech in the 2019 voice conversion task.