Fenglin Ding (University of Science and Technology of China), Wu Guo (University of Science and Technology of China), Bin Gu (University of Science and Technology of China), Zhenhua Ling (University of Science and Technology of China) and Jun Du (University of Science and Technology of China)
Abstract:
In this paper, we propose two novel regularization-based
speaker adaptive training approaches for connectionist temporal
classification (CTC) based speech recognition. The first method
is center loss (CL) regularization, which penalizes the
distances between the embeddings of different speakers and a
single shared center. The second method is speaker variance loss (SVL)
regularization, in which we directly minimize the speaker interclass
variance during model training. Both methods train an
adaptive model on the fly by adding regularization
terms to the training loss function. Our experiments on
the AISHELL-1 Mandarin recognition task show that both methods
are effective at adapting the CTC model without requiring any
specific fine-tuning or additional complexity, achieving character
error rate improvements of up to 8.1% and 8.6% over the
speaker independent (SI) model, respectively.
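
The two regularizers described above can be illustrated with a minimal sketch. The abstract does not give the exact formulas, so the following is an assumption: CL is taken as the mean squared Euclidean distance between embeddings and one shared center, SVL as the variance of per-speaker mean embeddings around their global mean, and the regularizer is added to the CTC loss with a weight. The function names, the `weight` parameter, and the toy data are hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def center_loss(embeddings, center):
    """Assumed CL form: mean squared distance from each embedding to one center."""
    total = 0.0
    for emb in embeddings:
        total += sum((e - c) ** 2 for e, c in zip(emb, center))
    return total / len(embeddings)


def speaker_variance_loss(embeddings, speaker_ids):
    """Assumed SVL form: variance of per-speaker mean embeddings."""
    # Group the embeddings by speaker.
    groups = defaultdict(list)
    for emb, spk in zip(embeddings, speaker_ids):
        groups[spk].append(emb)
    dim = len(embeddings[0])
    # One mean embedding per speaker.
    means = [
        [sum(vec[d] for vec in vecs) / len(vecs) for d in range(dim)]
        for vecs in groups.values()
    ]
    # Global mean over the speaker means.
    global_mean = [sum(m[d] for m in means) / len(means) for d in range(dim)]
    # Spread of the speaker means around the global mean.
    return center_loss(means, global_mean)


def regularized_loss(ctc_loss, reg_loss, weight=0.1):
    """Training loss with a regularization term added (weight is hypothetical)."""
    return ctc_loss + weight * reg_loss


# Toy data: four 2-dimensional embeddings from two speakers.
embeddings = [[0, 0], [2, 0], [4, 0], [6, 0]]
speaker_ids = ["a", "a", "b", "b"]

cl = center_loss(embeddings, [3, 0])
svl = speaker_variance_loss(embeddings, speaker_ids)
print(cl, svl)  # 5.0 4.0
```

In an actual training setup both terms would be computed on differentiable tensors inside the network's loss, and the center in CL could be learned or updated during training; this sketch only shows the arithmetic on plain lists.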