Speech Synthesis: Text Processing, Data and Evaluation

Tue-1-7-2 A Mask-based Model for Mandarin Chinese Polyphone Disambiguation

Haiteng Zhang (Databaker (Beijing) Technology Co., Ltd), Huashan Pan (Databaker (Beijing) Technology Co., Ltd), Xiulin Li (Databaker (Beijing) Technology Co., Ltd)
Abstract: Polyphone disambiguation is an essential part of a Mandarin text-to-speech (TTS) system. However, a conventional system that models the entire Pinyin set can predict a pronunciation that belongs to an unrelated polyphonic character rather than to the current input character, which degrades TTS performance. To address this issue, we introduce a mask-based model for polyphone disambiguation. The model takes a mask vector extracted from the context as an extra input. In our model, the mask vector not only acts as a weighting factor in Weighted-softmax to prevent such mispredictions but also eliminates the contribution of the non-candidate set to the overall loss. Moreover, to mitigate the uneven distribution of pronunciations, we introduce a new loss called Modified Focal Loss. The experimental results show the effectiveness of the proposed mask-based model. We also empirically studied the impact of Weighted-softmax and Modified Focal Loss. We found that Weighted-softmax can effectively prevent the model from predicting outside the candidate set, and that Modified Focal Loss can reduce the adverse impact of the uneven distribution of pronunciations.
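
Illustrative sketch (not taken from the paper): the abstract does not give the exact form of Weighted-softmax, so the Python below assumes the simplest reading, in which each exponentiated logit is multiplied by a 0/1 mask entry before normalization. The function names, the four-entry Pinyin inventory, and the logits are hypothetical and chosen only to show how a mask can keep the prediction inside the candidate set and keep non-candidates out of the loss. Modified Focal Loss is not sketched because the abstract does not define it.

```python
import math


def weighted_softmax(logits, mask):
    """Softmax in which each exp(logit) is scaled by its mask weight.

    Assumption: mask[i] is 1.0 for pronunciations the current character
    can take and 0.0 for every other entry of the Pinyin set.
    """
    if not any(w > 0 for w in mask):
        raise ValueError("mask must select at least one candidate")
    # Subtract the largest candidate logit for numerical stability.
    peak = max(l for l, w in zip(logits, mask) if w > 0)
    weighted = [w * math.exp(l - peak) if w > 0 else 0.0
                for l, w in zip(logits, mask)]
    total = sum(weighted)
    return [v / total for v in weighted]


def masked_cross_entropy(logits, mask, target):
    """Cross-entropy over the masked distribution.

    Non-candidates have probability 0, so they add nothing to the loss.
    """
    if mask[target] <= 0:
        raise ValueError("target must be inside the candidate set")
    return -math.log(weighted_softmax(logits, mask)[target])


# Hypothetical toy inventory: two readings of 乐 and two readings of 行.
pinyin_set = ["le4", "yue4", "hang2", "xing2"]
logits = [3.0, 1.0, 0.5, 0.5]   # unmasked argmax would be "le4"
mask = [0.0, 0.0, 1.0, 1.0]     # the input character is 行

probs = weighted_softmax(logits, mask)
predicted = pinyin_set[probs.index(max(probs))]
loss = masked_cross_entropy(logits, mask, pinyin_set.index("xing2"))
```

Under this assumption, the unmasked logits favor "le4", a reading of a different character, while the masked distribution puts all probability mass on the two readings of 行.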