Yingming Gao (Institute of Acoustics and Speech Communication, Technische Universität Dresden), Xinyu Zhang (Institute of Acoustics and Speech Communication, TU Dresden), Yi Xu (University College London), Jinsong Zhang (Beijing Language and Culture University) and Peter Birkholz (Institute of Acoustics and Speech Communication, TU Dresden)
Abstract:
The complex f0 variations in continuous speech make automatic tone recognition rather difficult in a language like Chinese. In this study, we tested the use of the target approximation model (TAM) for continuous tone recognition on two datasets. TAM simulates f0 production from an articulatory point of view and thus allows the underlying pitch targets to be recovered from the surface f0 contour. In the first dataset, the f0 contour of each tone was represented by 30 equidistant points and simulated with TAM. Tone classification with a support vector machine (SVM) showed that the estimated three-dimensional TAM parameters characterized tone patterns about as well as the representation by 30 f0 values. TAM was then tested on the second dataset, which contains more complex tonal variations. With an equal or smaller number of features, the TAM parameters performed better than the coefficients of the cosine transform and slightly worse than the statistical f0 parameters for tone recognition. Furthermore, we investigated a bidirectional long short-term memory network (BLSTM) for modelling the sequential tonal variations, which proved to be more powerful than the SVM classifier. The BLSTM system incorporating TAM and statistical f0 parameters achieved the best accuracy of 87.56%.
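As an illustration of the 30-point representation mentioned in the abstract, the following minimal sketch resamples an f0 contour at equally spaced time points. It assumes linear interpolation between measured samples and strictly increasing time stamps; the abstract does not state how the points were obtained, and the function name `resample_equidistant` and the example values are hypothetical.

```python
def resample_equidistant(times, f0, n_points=30):
    """Return f0 values at n_points equally spaced times.

    Assumption: linear interpolation between neighbouring samples;
    `times` must be strictly increasing. This is an illustrative
    sketch, not the procedure used in the study.
    """
    if len(times) != len(f0) or len(times) < 2:
        raise ValueError("need at least two (time, f0) samples of equal length")
    if n_points < 2:
        raise ValueError("n_points must be at least 2")

    start, end = times[0], times[-1]
    step = (end - start) / (n_points - 1)
    resampled = []
    j = 0  # index of the left neighbour of the current target time
    for i in range(n_points):
        t = min(start + i * step, end)  # guard against rounding past the end
        # Advance to the segment [times[j], times[j + 1]] that contains t.
        while j < len(times) - 2 and times[j + 1] < t:
            j += 1
        t0, t1 = times[j], times[j + 1]
        weight = (t - t0) / (t1 - t0)
        resampled.append(f0[j] + weight * (f0[j + 1] - f0[j]))
    return resampled


# Hypothetical syllable: three f0 measurements (Hz) over 0.2 s.
contour = resample_equidistant([0.0, 0.1, 0.2], [100.0, 120.0, 110.0])
```

A fixed-length vector of this kind can be passed directly to a classifier such as an SVM, which is the baseline representation the TAM parameters are compared against.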