Renuka Mannem(Indian Institute of Science), Navaneetha Gaddam(Rajiv Gandhi University of Knowledge Technologies, Kadapa) and Prasanta Ghosh(Assistant Professor, EE, IISc)
The real-time Magnetic Resonance Imaging (rtMRI) is often used for speech production research as it captures the complete view of the vocal tract during speech. Air-tissue boundaries (ATBs) are the contours that trace the transition between high-intensity tissue region and low-intensity airway cavity region in the rtMRI video. The ATBs are used in several speech related applications. However, the ATB segmentation is a challenging task as the rtMRI frames have low resolution and low signal-to-noise ratio. Several works have been proposed in the past for ATB segmentation. Among these, the supervised algorithms have been shown to perform well compared to the unsupervised algorithms. However, the supervised algorithms have limited generalizability towards unseen subjects which are not involved in training. In this work, we propose a 3-dimensional convolutional neural network (3D-CNN) which utilizes both spatial and temporal information from the rtMRI video for accurate ATB segmentation. The 3D-CNN model captures the vocal tract dynamics in an rtMRI video independent of the morphology of the subject leading to accurate ATB segmentation for unseen subjects. In a leave-one-subject-out experimental setup, it is observed that the proposed approach provides ~32% relative percentage improvement in the performance compared to the baseline SegNet based approach.