Mon-1-5-9 Can Auditory Nerve models tell us what’s different about WaveNet vocoded speech?

Sébastien Le Maguer (ADAPT Centre / Trinity College Dublin) and Naomi Harte (Trinity College Dublin)
Abstract: Nowadays, synthetic speech is almost indistinguishable from human speech. This remarkable quality is mainly due to the displacement of signal-processing-based vocoders in favour of neural vocoders and, in particular, the WaveNet architecture. At the same time, speech synthesis evaluation is still struggling to adjust to these improvements. The difficulties are even more prevalent for objective evaluation methodologies, which do not correlate well with human perception. Yet an often-forgotten use of objective evaluation is to uncover prominent differences between speech signals. Such differences are crucial for deciphering the improvement introduced by WaveNet, so abandoning objective evaluation could be a serious mistake. In this paper, we analyze vocoded synthetic speech re-rendered using WaveNet, comparing it to standard vocoded speech. To do so, we objectively compare spectrograms and neurograms, the latter being the output of Auditory Nerve (AN) models. The spectrograms allow us to look at the speech production side, and the neurograms relate to the speech perception path. While we were not yet able to pinpoint how WaveNet and WORLD differ, our results suggest that Mean Rate (MR) neurograms in particular warrant further investigation.
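
The comparison pipeline the abstract describes can be illustrated with a minimal sketch. The code below is not the authors' implementation: the file names (wavenet.wav, world.wav), the STFT parameters, and the stand-in firing-rate matrix are all illustrative assumptions. It computes log-magnitude spectrograms of two renderings of the same utterance with librosa, takes a simple frame-wise distance between them, and mimics a Mean Rate (MR) neurogram by averaging a (here randomly generated) auditory-nerve firing-rate matrix over coarse time bins.

```python
# Minimal sketch of an objective spectrogram/neurogram comparison.
# NOT the paper's code: file names, parameters, and the stand-in
# firing-rate matrix are illustrative assumptions.
import numpy as np
import librosa

def log_spectrogram(path, sr=16000, n_fft=1024, hop=256):
    """Load audio and return its log-magnitude spectrogram in dB."""
    y, _ = librosa.load(path, sr=sr)
    S = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop))
    return librosa.amplitude_to_db(S, ref=np.max)

def mean_rate_neurogram(rates, bin_size=64):
    """Collapse a fine-timing firing-rate matrix (CF x time) into an
    MR-style neurogram by averaging over coarse time bins."""
    n_cf, n_t = rates.shape
    n_bins = n_t // bin_size
    return rates[:, :n_bins * bin_size].reshape(n_cf, n_bins, bin_size).mean(axis=2)

# Spectrogram distance between the two renderings (hypothetical files).
S_wavenet = log_spectrogram("wavenet.wav")
S_world = log_spectrogram("world.wav")
n = min(S_wavenet.shape[1], S_world.shape[1])  # naive length alignment
rmse = np.sqrt(np.mean((S_wavenet[:, :n] - S_world[:, :n]) ** 2))
print(f"Spectrogram RMSE: {rmse:.2f} dB")

# Stand-in for an AN-model output: 40 characteristic frequencies over time.
rng = np.random.default_rng(0)
fake_rates = rng.poisson(lam=50, size=(40, 4096)).astype(float)
mr = mean_rate_neurogram(fake_rates)
print(f"MR neurogram shape: {mr.shape}")  # (40, 64)
```

In the paper itself the neurograms come from an Auditory Nerve model rather than random data; the sketch only shows the shape of the comparison: one utterance, two renderings, a production-side representation (spectrogram) and a perception-side representation (neurogram).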