Ngoc-Quan Pham (Karlsruhe Institute of Technology), Thanh-Le Ha (Karlsruhe Institute of Technology), Tuan Nam Nguyen (Karlsruhe Institute of Technology), Thai Son Nguyen (Karlsruhe Institute of Technology), Elizabeth Salesky (Johns Hopkins University), Sebastian Stüker (Karlsruhe Institute of Technology), Jan Niehues (Maastricht University) and Alexander Waibel (Carnegie Mellon)
The Transformer is a powerful sequence-to-sequence architecture capable of directly mapping speech inputs to transcriptions or translations. However, its mechanism for modeling positions was tailored for text and is thus less suited to acoustic inputs. In this work, we adapt the relative position encoding scheme to the Speech Transformer, the key idea being to add the relative distance between input states to the self-attention network. As a result, the network can adapt better to the large variation of pattern distributions in speech data. Our experiments show that the resulting model achieves the best recognition result on the Switchboard benchmark in the non-augmentation condition, and the best published result on the MuST-C speech translation benchmark. We also show that this model utilizes simulated data better than the Transformer, and adapts better to the segmentation quality in speech translation.
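The core idea of relative position encoding can be illustrated with a minimal sketch of single-head self-attention in which a learned embedding of the (clipped) relative distance `j - i` is added to the key term of each attention logit, in the style of Shaw et al.'s relative-position attention. This is an illustrative simplification, not the authors' exact implementation; the function name, the clipping distance `max_dist`, and the single-head, unbatched setup are assumptions for clarity.

```python
import numpy as np

def relative_self_attention(x, w_q, w_k, w_v, rel_emb, max_dist):
    """Single-head self-attention with relative position encodings on the keys.

    x:        (T, d) sequence of input states (e.g. acoustic frames)
    w_q/k/v:  (d, d) projection matrices
    rel_emb:  (2*max_dist + 1, d) embeddings for clipped relative
              distances in [-max_dist, max_dist]
    """
    T, d = x.shape
    q, k, v = x @ w_q, x @ w_k, x @ w_v

    # Content term of the logits: q_i . k_j
    logits = q @ k.T

    # Relative term: q_i . a_{j-i}, with distances clipped to +-max_dist
    # and shifted into [0, 2*max_dist] to index the embedding table.
    idx = np.clip(np.arange(T)[None, :] - np.arange(T)[:, None],
                  -max_dist, max_dist) + max_dist          # (T, T)
    logits = logits + np.einsum('id,ijd->ij', q, rel_emb[idx])

    # Scaled softmax over the key dimension, then weighted sum of values.
    logits = logits / np.sqrt(d)
    weights = np.exp(logits - logits.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v
```

Because the attention logits depend on the distance `j - i` rather than on absolute indices, the same learned pattern applies anywhere in the utterance, which is what lets the model cope with the position shifts typical of speech inputs.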