Wed-2-6-3 Speaker dependent articulatory-to-acoustic mapping using real-time MRI of the vocal tract

Tamás Gábor Csapó (Department of Telecommunications and Media Informatics, Budapest University of Technology and Economics)

Abstract: Articulatory-to-acoustic (forward) mapping is a technique for predicting speech from data recorded with various articulatory acquisition techniques (e.g. ultrasound tongue imaging, lip video). Real-time MRI (rtMRI) of the vocal tract has not been used before for this purpose. The advantage of MRI is that it has a high 'relative' spatial resolution: it can capture not only lingual, labial and jaw motion, but also the velum and the pharyngeal region, which is typically not possible with other techniques. In the current paper, we train various DNNs (fully connected, convolutional and recurrent neural networks) for articulatory-to-speech conversion, using rtMRI as input, in a speaker-specific way. We use two male and two female speakers of the USC-TIMIT articulatory database, each of them uttering 460 sentences. We evaluate the results with objective (Normalized MSE and MCD) and subjective measures (perceptual test) and show that CNN-LSTM networks, which take multiple images as input, are preferred and achieve MCD scores between 2.8 and 4.5 dB. In the experiments, we find that the predictions for speaker 'm1' are significantly weaker than those for the other speakers. We show that this is caused by the fact that 74% of the recordings of speaker 'm1' are out of sync.
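
The abstract reports Mel-Cepstral Distortion (MCD) in dB but does not spell out the formula. As a minimal sketch, assuming the commonly used definition (per-frame Euclidean distance between time-aligned mel-cepstral vectors, scaled by 10·sqrt(2)/ln 10, excluding the 0th energy coefficient, and averaged over frames), it could be computed as below. The function name `mel_cepstral_distortion` and the `skip_c0` parameter are illustrative and are not taken from the paper; the paper's exact variant may differ.

```python
import math


def mel_cepstral_distortion(reference, predicted, skip_c0=True):
    """Return the mean MCD in dB over time-aligned frames.

    reference, predicted: sequences of frames, each frame a sequence of
    mel-cepstral coefficients. Both must already be time-aligned.
    skip_c0: assumption of this sketch; drops the 0th (energy) coefficient.
    """
    if len(reference) != len(predicted):
        raise ValueError("sequences must have the same number of frames")
    if len(reference) == 0:
        raise ValueError("at least one frame is required")

    start = 1 if skip_c0 else 0
    # Constant that converts the cepstral distance to decibels.
    scale = 10.0 / math.log(10.0) * math.sqrt(2.0)

    total = 0.0
    for ref_frame, pred_frame in zip(reference, predicted):
        if len(ref_frame) != len(pred_frame):
            raise ValueError("frames must have the same number of coefficients")
        squared_error = sum(
            (r - p) ** 2
            for r, p in zip(ref_frame[start:], pred_frame[start:])
        )
        total += scale * math.sqrt(squared_error)

    return total / len(reference)
```

Lower values indicate predicted spectra that are closer to the reference; identical sequences give 0 dB.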
