Binghuai Lin (Tencent Technology Co., Ltd), Liyuan Wang (Tencent Technology Co., Ltd), Xiaoli FENG (Center for Studies of Chinese as a Second Language, Beijing Language and Culture University) and Jinsong Zhang (Beijing Language and Culture University)
Prosodic event detection plays an important role in spoken language processing tasks and Computer-Assisted Pronunciation Training (CAPT) systems. Traditional methods for detecting sentence stress and phrase boundaries rely on machine learning models that capture only limited contextual information and take little account of the interaction between these two prosodic events. In this paper, we propose a hierarchical network that models contextual factors at the granularity of phoneme, syllable and word, based on Bidirectional Long Short-Term Memory (BLSTM). Moreover, to account for the inherent connection between sentence stress and phrase boundary, we model these two prosodic events jointly in a multi-task learning (MTL) framework that shares common prosodic features. We evaluate the network on the Aix-Machine Readable Spoken English Corpus (Aix-MARSEC). Experimental results show that the proposed method obtains an F1-measure of 90% for sentence stress detection and 91% for phrase boundary detection, outperforming a Conditional Random Field (CRF) baseline by about 4% and 9%, respectively.
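For readers unfamiliar with the reported metric, the following is a minimal sketch of how an F1-measure could be computed separately for each of the two detection tasks. It assumes that each word carries one binary label per task (1 = stressed or boundary, 0 = otherwise); the abstract does not state the labeling unit, so this is an assumption. The function name `f1_measure` and the toy label sequences are hypothetical and are not data or results from the paper.

```python
def f1_measure(reference, predicted):
    """Return the F1-measure for two equal-length sequences of binary labels.

    Assumption: 1 marks a prosodic event (stress or boundary), 0 marks its absence.
    """
    if len(reference) != len(predicted):
        raise ValueError("reference and predicted must have the same length")

    pairs = list(zip(reference, predicted))
    true_pos = sum(1 for r, p in pairs if r == 1 and p == 1)
    false_pos = sum(1 for r, p in pairs if r == 0 and p == 1)
    false_neg = sum(1 for r, p in pairs if r == 1 and p == 0)

    # With no true positives, precision or recall is zero (or undefined),
    # so F1 is reported as 0.0 by convention.
    if true_pos == 0:
        return 0.0

    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy labels for a six-word utterance (not from Aix-MARSEC).
stress_ref = [1, 0, 0, 1, 0, 1]
stress_pred = [1, 0, 1, 1, 0, 0]
boundary_ref = [0, 0, 1, 0, 0, 1]
boundary_pred = [0, 0, 1, 0, 0, 1]

stress_f1 = f1_measure(stress_ref, stress_pred)
boundary_f1 = f1_measure(boundary_ref, boundary_pred)
print(f"stress F1 = {stress_f1:.3f}, boundary F1 = {boundary_f1:.3f}")
```

Scoring the two tasks separately mirrors the way the abstract reports one F1 value for sentence stress and another for phrase boundary, even though the model itself is trained jointly.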