Qiang Fang (Phonetics Lab., Institute of Linguistics, Chinese Academy of Social Sciences)
For decades, the average Root Mean Square Error (RMSE) over all articulatory channels has been one of the most prevalent cost functions for training statistical models for acoustic-to-articulatory inversion (AAI). One underlying assumption is that the samples of all articulatory channels used for training are balanced and play the same role in AAI. However, this assumption does not hold from the point of view of speech production. In this study, at each time instant, each articulatory channel is classified as critical or noncritical according to its role in forming constrictions along the vocal tract during the production of speech sounds. It is found that the training set is dominated by samples of noncritical articulatory channels. To deal with this unbalanced-dataset problem, several Bi-LSTM networks are trained by removing the noncritical portions of each articulatory channel when the training errors are less than a dynamic threshold. The results indicate that the average RMSE over all articulatory channels, the average RMSE over the critical articulators, and the average RMSE over the noncritical articulators can all be reduced significantly by the proposed method.
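As a rough illustration of the idea described above, the following NumPy sketch contrasts the plain average RMSE with a masked variant in which noncritical samples are dropped once their error falls below a threshold. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names `average_rmse` and `keep_mask`, the `(frames, channels)` array layout, the use of the absolute per-sample error as the quantity compared with the threshold, and the choice to pass the threshold in as a fixed number (the abstract does not say how the dynamic threshold is computed) are all hypothetical.

```python
import numpy as np


def average_rmse(pred, target, keep=None):
    """Per-channel RMSE averaged over channels.

    pred, target: arrays of shape (frames, channels).
    keep: optional boolean mask of the same shape; only samples where
          keep is True contribute to the error (assumed convention).
    """
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    sq_err = (pred - target) ** 2
    if keep is None:
        keep = np.ones(sq_err.shape, dtype=bool)
    keep = np.asarray(keep, dtype=bool)
    # Avoid division by zero for a channel with no retained samples;
    # such a channel contributes an RMSE of 0 in this sketch.
    counts = np.maximum(keep.sum(axis=0), 1)
    mse = (sq_err * keep).sum(axis=0) / counts
    return float(np.sqrt(mse).mean())


def keep_mask(pred, target, critical, threshold):
    """Keep every critical sample, plus noncritical samples whose
    absolute error is still at or above the threshold (assumption)."""
    abs_err = np.abs(np.asarray(pred, dtype=float) - np.asarray(target, dtype=float))
    critical = np.asarray(critical, dtype=bool)
    return critical | (abs_err >= threshold)


# Toy data: 4 frames, 2 articulatory channels (values are made up).
pred = np.zeros((4, 2))
target = np.array([[1.0, 0.1], [0.1, 1.0], [2.0, 0.05], [0.05, 2.0]])
critical = np.array([[True, False], [False, True],
                     [False, False], [False, False]])

mask = keep_mask(pred, target, critical, threshold=0.5)
plain_loss = average_rmse(pred, target)
masked_loss = average_rmse(pred, target, keep=mask)
```

In this toy case the well-fitted noncritical samples are excluded, so the masked loss is driven only by critical samples and by noncritical samples that are still poorly predicted. In an actual training loop the mask would presumably be recomputed as the threshold changes, which this sketch does not model.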