Raghavendra Pappagari (Johns Hopkins University), Jaejin Cho (Johns Hopkins University), Laureano Moro Velazquez (Johns Hopkins University), and Najim Dehak (Johns Hopkins University)
In this study, we analyze the use of state-of-the-art technologies for speaker recognition and natural language processing to detect Alzheimer's Disease (AD) and to assess its severity by predicting Mini-Mental State Examination (MMSE) scores. For these purposes, we study the use of speech signals and transcriptions. We focus on adapting state-of-the-art models for each modality individually and together to examine their complementarity. We used x-vectors to characterize speech signals and pre-trained BERT models to process human transcriptions, with different back-ends for AD diagnosis and assessment. In addition, we evaluated features based on silence segments of the audio files as a complement to x-vectors. We trained and evaluated our systems on the Interspeech 2020 ADReSS challenge dataset, containing 78 AD patients and 78 sex- and age-matched controls. Our results indicate that the fusion of scores obtained from the acoustic and the transcript-based models provides the best results for detection and assessment, suggesting that the individual models for the two modalities contain complementary information. Adding the silence-related features improved the fusion system even further. A separate analysis of the models suggests that transcript-based models provide better results than acoustic models in the detection task but similar results in the MMSE prediction task.
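The abstract reports that fusing scores from the acoustic (x-vector) and transcript-based (BERT) models gives the best results. A minimal sketch of one common score-level fusion scheme, a weighted average of per-modality probabilities; the weights, threshold, and function names here are illustrative assumptions, not the authors' actual configuration:

```python
# Hypothetical score-level fusion sketch. Assumes each modality's model
# emits a per-subject AD probability in [0, 1]; weights are illustrative.

def fuse_scores(acoustic_score: float, text_score: float,
                w_acoustic: float = 0.4, w_text: float = 0.6) -> float:
    """Weighted average of the acoustic and transcript model scores."""
    return w_acoustic * acoustic_score + w_text * text_score

def classify(fused_score: float, threshold: float = 0.5) -> int:
    """Label a subject as AD (1) or control (0) from the fused score."""
    return 1 if fused_score >= threshold else 0

# Example: the transcript model is confident of AD, the acoustic model less so.
fused = fuse_scores(0.3, 0.8)   # 0.4*0.3 + 0.6*0.8 = 0.6
label = classify(fused)          # 1 (AD)
```

In practice the fusion weights would be tuned on held-out data (or a calibration back-end such as logistic regression would be trained on the two scores); the same fused score can also feed a regressor for MMSE prediction.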