Yao Qian (Microsoft), Yu Shi (Microsoft) and Michael Zeng (Microsoft)
Abstract:
Spoken language understanding (SLU) tries to decode an input speech utterance so that effective semantic actions can be taken to continue a meaningful, interactive spoken dialog (SD). The performance of SLU, however, can be adversely affected by automatic speech recognition (ASR) errors. In this paper, we exploit transfer learning in a Generative Pretrained Transformer (GPT) to jointly optimize ASR error correction and semantic labeling, in terms of dialog act and slot-value, for a given user’s spoken response in the context of an SD system (SDS). With the encoded ASR output and dialog history as context, a conditional generative model is trained to generate the corrected transcript, dialog act, and slot-values successively. The proposed generation model is jointly optimized as a classification task, which utilizes the ground truth and N-best hypotheses in a multi-task, discriminative learning setting. We evaluate its effectiveness on a public SD corpus used in the Second Dialog State Tracking Challenge. The results show that our generation model achieves a relative word error rate (WER) reduction of 25.12% over the original ASR 1-best result, and a sentence error rate (SER) lower than the oracle result from the 10-best ASR hypotheses. The proposed approach of generating dialog acts and slot-values, instead of classifying and tagging them, is promising. The refined ASR hypotheses are critical for improving semantic label generation.
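The metrics named in the abstract can be made concrete with a minimal sketch. It uses the standard definitions: WER as word-level edit distance divided by the number of reference words, SER as the fraction of utterances with any error, and the N-best oracle as the hypothesis closest to the reference. All function names here are illustrative assumptions, not taken from the paper's code. The oracle is the best that any method restricted to selecting among the N-best hypotheses can do, which is why an SER below the oracle is possible only for a model that generates new word sequences.

```python
from typing import List, Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(refs: List[str], hyps: List[str]) -> float:
    """Corpus-level word error rate: total word errors / total reference words."""
    errors = sum(edit_distance(r.split(), h.split()) for r, h in zip(refs, hyps))
    words = sum(len(r.split()) for r in refs)
    return errors / words


def ser(refs: List[str], hyps: List[str]) -> float:
    """Sentence error rate: fraction of utterances that are not an exact match."""
    wrong = sum(r.split() != h.split() for r, h in zip(refs, hyps))
    return wrong / len(refs)


def oracle_pick(ref: str, nbest: List[str]) -> str:
    """Hypothesis in the N-best list with the fewest word errors against ref."""
    return min(nbest, key=lambda h: edit_distance(ref.split(), h.split()))


def relative_reduction(baseline: float, improved: float) -> float:
    """Relative error reduction of `improved` with respect to `baseline`."""
    return (baseline - improved) / baseline
```

Under these definitions, a relative WER reduction of 25.12% means `relative_reduction(wer_1best, wer_model)` evaluates to roughly 0.2512, where `wer_1best` and `wer_model` are hypothetical names for the WER of the ASR 1-best output and of the generated transcripts.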