Fenglin Ding (University of Science and Technology of China), Wu Guo (University of Science and Technology of China), Bin Gu (University of Science and Technology of China), Zhenhua Ling (University of Science and Technology of China) and Jun Du (University of Science and Technology of China)
Abstract:
In this paper, we propose a new speaker normalization technique
for acoustic model adaptation in connectionist temporal
classification (CTC)-based automatic speech recognition. In the
proposed method, for the inputs of a hidden layer, the mean
and variance of each activation are first estimated at the speaker
level. Then, each speaker's representation is normalized independently
so that it follows a standard normal distribution.
Furthermore, we propose using an auxiliary network to dynamically
generate the scaling and shifting parameters of speaker
normalization, and an attention mechanism is introduced to improve
performance. Experiments are conducted on the public
Chinese dataset AISHELL-1. The proposed methods are
effective in adapting the CTC model, achieving up
to a 17.5% character error rate improvement over the speaker-independent
(SI) model.
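
The speaker-level normalization described above can be illustrated with a minimal sketch. It is not the authors' implementation. The function name `speaker_normalize`, the biased variance estimate, the `eps` stabilizer, and the plain-list data layout are all assumptions made for illustration. The optional `gamma` and `beta` arguments are fixed stand-ins for the scaling and shifting parameters that the abstract says an auxiliary network generates; the auxiliary network and the attention mechanism are not modeled here.

```python
from collections import defaultdict
from math import sqrt


def speaker_normalize(activations, speaker_ids, gamma=None, beta=None, eps=1e-5):
    """Normalize hidden-layer inputs per speaker to zero mean and unit variance.

    activations: list of frames, each a list of floats (one value per activation).
    speaker_ids: one speaker label per frame.
    gamma, beta: optional per-activation scale and shift (hypothetical fixed
        values; in the abstract these come from an auxiliary network).
    eps: small constant for numerical stability (an assumption).
    """
    dim = len(activations[0])
    gamma = gamma if gamma is not None else [1.0] * dim
    beta = beta if beta is not None else [0.0] * dim

    # Group frame indices by speaker so statistics are speaker-level.
    frames_by_speaker = defaultdict(list)
    for idx, spk in enumerate(speaker_ids):
        frames_by_speaker[spk].append(idx)

    out = [None] * len(activations)
    for idxs in frames_by_speaker.values():
        n = len(idxs)
        # Mean and (biased) variance of each activation for this speaker.
        mean = [sum(activations[i][d] for i in idxs) / n for d in range(dim)]
        var = [
            sum((activations[i][d] - mean[d]) ** 2 for i in idxs) / n
            for d in range(dim)
        ]
        # Standardize each frame, then apply scale and shift.
        for i in idxs:
            out[i] = [
                gamma[d] * (activations[i][d] - mean[d]) / sqrt(var[d] + eps) + beta[d]
                for d in range(dim)
            ]
    return out


# Two speakers with very different activation ranges.
acts = [[1.0, 10.0], [3.0, 30.0], [100.0, -5.0], [200.0, 5.0]]
spk = ["a", "a", "b", "b"]
normalized = speaker_normalize(acts, spk)
```

After this call, the frames of each speaker have approximately zero mean and unit variance in every dimension, regardless of how far apart the speakers' raw activation ranges were.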