Kenichi Kumatani (Microsoft), Dimitrios Dimitriadis (Microsoft), Robert Gmyr (Microsoft), Yashesh Gaur (Microsoft), Sefik Emre Eskimez (Microsoft), Jinyu Li (Microsoft) and Michael Zeng (Microsoft)
Abstract:
In this work, we develop new unsupervised learning techniques with an attention-based sequence-to-sequence (seq2seq)
model for automatic speech recognition (ASR). For untranscribed speech data, the hypothesis from an ASR system must
be used as a label. However, imperfect ASR results make it difficult for
unsupervised learning to improve recognition performance consistently, especially when multiple powerful teacher models are unavailable. In contrast to
conventional approaches, we adopt the multi-task
learning (MTL) framework where the n-th best ASR hypothesis is used as the label of each task. The seq2seq network is
updated through the MTL framework so as to find the common
representation that can cover multiple hypotheses. By doing
so, the effect of the hard-decision errors can be alleviated. We
first demonstrate the effectiveness of our unsupervised learning
method through ASR experiments in an accent adaptation task
between US and British English speech. Our experimental results show that our method can reduce the word error rate (WER) on the British
speech data from 14.55% to 10.36% compared to the baseline
model trained with the US English data only. Moreover, we investigate the effect of our unsupervised learning methods in a
federated learning scenario.
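
The multi-task idea described above, where each of the N-best hypotheses serves as the label of one task, can be illustrated with a small loss computation. The following is only a minimal sketch under stated assumptions, not the authors' implementation: the abstract does not specify the loss function, how the tasks are weighted, or which parts of the network are shared. The sketch assumes a per-hypothesis negative log-likelihood computed with teacher forcing and, by default, uniform task weights. The names `sequence_nll`, `nbest_mtl_loss`, and `task_weights` are hypothetical.

```python
import math


def sequence_nll(step_log_probs, token_ids):
    """Negative log-likelihood of one label sequence.

    step_log_probs[t][v] is the decoder's log-probability of token v at
    step t, assumed to be computed with teacher forcing on this same
    label sequence.
    """
    if len(step_log_probs) != len(token_ids):
        raise ValueError("one log-prob vector is needed per label token")
    return -sum(lp[tok] for lp, tok in zip(step_log_probs, token_ids))


def nbest_mtl_loss(per_hyp_log_probs, nbest_hyps, task_weights=None):
    """Weighted sum of per-hypothesis losses, one task per hypothesis.

    Assumption: uniform weights when task_weights is None. The abstract
    does not state how the tasks are combined.
    """
    n = len(nbest_hyps)
    if n == 0 or len(per_hyp_log_probs) != n:
        raise ValueError("need one set of decoder outputs per hypothesis")
    if task_weights is None:
        task_weights = [1.0 / n] * n
    losses = [
        sequence_nll(lp, hyp)
        for lp, hyp in zip(per_hyp_log_probs, nbest_hyps)
    ]
    return sum(w * loss for w, loss in zip(task_weights, losses))


# Toy example: vocabulary of 3 tokens, a decoder that is uniform everywhere,
# and a 2-best list whose hypotheses differ in length.
uniform = [math.log(1.0 / 3.0)] * 3
nbest_hyps = [[0, 2], [0, 1, 2]]
per_hyp_log_probs = [[uniform] * 2, [uniform] * 3]
loss = nbest_mtl_loss(per_hyp_log_probs, nbest_hyps)
```

Because every hypothesis contributes to the gradient, a single wrong 1-best label no longer fully determines the update, which is the intuition behind alleviating hard-decision errors. Non-uniform `task_weights` (for example, derived from hypothesis scores) would be another plausible choice, but that is likewise an assumption here.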