Xinhao Wang (Educational Testing Service), Klaus Zechner (Educational Testing Service) and Christopher O Hamill (Educational Testing Service)
This study aims to develop automatic models that provide accurate and actionable diagnostic feedback for spoken language learning and assessment, targeting the content development skill in particular. We focus on a type of test question widely used in speaking assessment, in which test takers first listen to and/or read stimulus material and then give a spontaneous response to a question related to the stimulus. A high-proficiency response should properly cover the critical content from the source material, referred to as “key points”. We propose Transformer-based models that automatically determine, for each key point, whether it is absent from a response or, if present, the span where it occurs. Furthermore, we introduce a multi-task learning approach to measure how well a key point is rendered within a response (quality score). Experimental results show that automatic models can surpass human expert performance on both tasks: for span detection, the system reached an F1 score of 74.5% (vs. human agreement of 68.3%); for quality score prediction, the system reached a Pearson correlation coefficient (r) of 0.744 (vs. human agreement of 0.712). Finally, the proposed key point-based features can be used to predict speaking proficiency scores with a correlation of 0.730.