Evaluation of Speech Technology Systems and Methods for Resource Construction and Annotation

Session time: Monday 20:30-21:30 (GMT+8), October 26

Mon-2-3-5 Neural Zero-Inflated Quality Estimation Model For Automatic Speech Recognition System

Kai Fan (Alibaba Group), Bo Li (Alibaba Group), Jiayi Wang (Alibaba Group), Shiliang Zhang (Alibaba Group), Boxing Chen (Alibaba), Niyu Ge (IBM Research) and Zhi-Jie Yan (Microsoft Research Asia)

Abstract: The performance of automatic speech recognition (ASR) systems is usually evaluated by word error rate (WER) when manually transcribed data are available; such transcripts, however, are expensive to obtain in real scenarios. In addition, the empirical distribution of WER for most ASR systems tends to put significant mass near zero, making it difficult to model with a single continuous distribution. To address these two issues of ASR quality estimation (QE), we propose a novel neural zero-inflated model to predict the WER of an ASR result without transcripts. We design a neural zero-inflated beta regression on top of a bidirectional transformer language model conditioned on speech features (speech-BERT). We also adopt the pre-training strategy of token-level masked language modeling for speech-BERT, and then fine-tune with our zero-inflated layer for the mixture of discrete and continuous outputs. The experimental results show that our approach achieves better WER prediction performance than strong baselines.
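As a rough illustration of the "mixture of discrete and continuous outputs" mentioned in the abstract, the sketch below writes out the textbook zero-inflated beta likelihood: a point mass at zero with probability `pi_zero`, and a beta density on (0, 1) otherwise. It is not the paper's implementation. The function names, the parameter names (`pi_zero`, `alpha`, `beta`), and the restriction of WER to values below 1 are assumptions made for this example; in the proposed model such parameters would presumably be predicted by a neural layer on top of speech-BERT, which is not shown here.

```python
import math


def beta_log_pdf(y, alpha, beta):
    """Log density of a Beta(alpha, beta) distribution at y, for 0 < y < 1."""
    log_norm = math.lgamma(alpha + beta) - math.lgamma(alpha) - math.lgamma(beta)
    return log_norm + (alpha - 1.0) * math.log(y) + (beta - 1.0) * math.log(1.0 - y)


def zero_inflated_beta_nll(y, pi_zero, alpha, beta):
    """Negative log-likelihood of one WER value under a zero-inflated beta.

    pi_zero is the probability that the WER is exactly zero (hypothetical name).
    Assumes the WER is either 0 or strictly between 0 and 1.
    """
    if y == 0.0:
        # Discrete component: all the mass at exactly zero.
        return -math.log(pi_zero)
    if not 0.0 < y < 1.0:
        raise ValueError("this sketch assumes WER is 0 or lies in (0, 1)")
    # Continuous component: beta density scaled by the non-zero probability.
    return -(math.log(1.0 - pi_zero) + beta_log_pdf(y, alpha, beta))


def expected_wer(pi_zero, alpha, beta):
    """Mean of the mixture: zero with probability pi_zero, else the beta mean."""
    return (1.0 - pi_zero) * alpha / (alpha + beta)
```

Summing `zero_inflated_beta_nll` over a training set would give one possible loss for such a regression, and `expected_wer` gives a point estimate that can be compared against the true WER.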
