Kshitiz Kumar, Chaojun Liu, Yifan Gong, and Jian Wu (Microsoft Corporation)
In this work we develop a simple, efficient, and compact automatic speech recognition (ASR) model based on a purely 1-dimensional row-convolution (RC) operation. We refer to our proposed model as the 1-dim row-convolution LSTM (RC-LSTM); it embeds limited future information into standard unidirectional LSTMs (UniLSTMs) via the 1-dim RC operation. We target fast streaming ASR solutions and establish ASR accuracy parity with the latency-controlled bidirectional LSTM (LC-BLSTM). We apply future information at both the ASR feature and hidden-layer stages. We study connections with related techniques, analyze trade-offs, and recommend a uniform future lookahead for all hidden layers. We argue that our architecture implicitly factorizes training into orthogonal time and "frequency" dimensions for effective learning on large-scale tasks. We conduct a series of experiments at medium scale with a 6k-hour English corpus, as well as at large scale with 60k hours of training data, and demonstrate our findings across unified ASR tasks. Compared to the UniLSTM model, RC-LSTM achieved a 16% relative reduction in word error rate (WER). RC-LSTM also achieved accuracy parity with LC-BLSTM on large-scale tasks at significantly lower latency and computational cost.
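The core 1-dim row-convolution idea can be sketched as follows: each output frame is a weighted combination of the current hidden state and a small number of future hidden states, with per-dimension (row-wise) weights. This is a minimal NumPy sketch, not the authors' implementation; the function name `row_convolution`, the weight shape `(C+1, D)`, and zero-padding at the sequence end are illustrative assumptions.

```python
import numpy as np

def row_convolution(h, W):
    """1-dim row convolution over a limited future lookahead.

    h: (T, D) hidden-state sequence from a unidirectional LSTM layer.
    W: (C+1, D) per-dimension weights for the current frame and C
       future frames (shapes are illustrative assumptions).
    Returns a (T, D) sequence; frames past the end of the utterance
    are treated as zeros.
    """
    T, D = h.shape
    C = W.shape[0] - 1
    # Zero-pad C future frames so every time step has a full context.
    padded = np.vstack([h, np.zeros((C, D))])
    out = np.zeros_like(h)
    for tau in range(C + 1):
        # Elementwise (row-wise) weighting of the frame tau steps ahead.
        out += W[tau] * padded[tau:tau + T]
    return out
```

With `W[0]` set to ones and the remaining rows zero, the operation reduces to the identity, which makes the streaming interpretation clear: only `C` future frames are ever needed, so latency is bounded by the lookahead rather than by the full utterance as in a BLSTM.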