Keyu An (Tsinghua University), Hongyu Xiang (Tsinghua University) and Zhijian Ou (Department of Electronic Engineering, Tsinghua University)
Abstract:
In this paper, we present a new open-source toolkit for speech recognition, named CAT (\underline{C}TC-CRF based \underline{A}SR \underline{T}oolkit). CAT inherits the data efficiency of the hybrid approach and the simplicity of the end-to-end (E2E) approach, providing a full-fledged implementation of CTC-CRFs and complete training and testing scripts for a number of English and Chinese benchmarks.
Experiments show that CAT obtains state-of-the-art results, comparable to those of fine-tuned hybrid models in Kaldi, with a much simpler training pipeline. Compared to existing non-modularized E2E models, CAT performs better on limited-scale datasets, demonstrating its data efficiency.
Furthermore, we propose a new method called contextualized soft forgetting, which enables CAT to perform streaming ASR without accuracy degradation.
We hope CAT, especially the CTC-CRF based framework and software, will be of broad interest to the community, and can be further explored and improved.