Kenichi Arai (NTT Communication Science Laboratories), Shoko Araki (NTT Communication Science Laboratories), Atsunori Ogawa (NTT Communication Science Laboratories), Keisuke Kinoshita (NTT), Tomohiro Nakatani (NTT Corporation), and Toshio Irino (Wakayama University)
Abstract:
The measurement of speech intelligibility (SI) still relies mainly on time-consuming and expensive subjective experiments because no versatile objective measure can predict SI. One promising candidate for SI prediction is an approach that uses a deep neural network (DNN)-based automatic speech recognition (ASR) system, given the great recent advances in such systems. In this paper, we propose and evaluate SI prediction methods based on the posteriors of DNN-based ASR systems. The posteriors, which are the probabilities of phones given acoustic features, are derived using forced alignments between clean speech and a phone sequence. We evaluated several variations of the posteriors to improve the prediction performance. In our experiments, a prediction method using a squared cumulative posterior probability achieved better accuracy than conventional SI predictors based on well-established objective measures (STOI and eSTOI).