Special Sessions & Challenges
The Organizing Committee of INTERSPEECH 2020 proudly announces the following special sessions and challenges.
Special sessions and challenges focus on relevant ‘special’ topics which may not be covered in regular conference sessions.
Papers must be submitted following the same schedule and procedure as regular papers, and they undergo the same review process by anonymous and independent reviewers.
The Meta Learning for Human Language Technology session focuses on investigating and improving human language technology (HLT) tasks with meta-learning methodologies, i.e., learning to learn. The methodologies include, but are not limited to: 1) network architecture search, 2) learning the optimizer, 3) learning the model initialization, 4) learning metrics or distance measures, 5) learning the training algorithm, and 6) few-shot learning.

This is an oral session that aims to cover the major research directions for meta learning in HLT and to provide opportunities for researchers working on meta learning for various HLT tasks to exchange ideas. We therefore encourage submissions on all HLT tasks, for example automatic speech recognition, speaker adaptation, speech synthesis, voice conversion, noise robustness, spoken language understanding, intent and slot recognition, and dialog management. The session will be dedicated to discussing how meta learning can improve these tasks (e.g., performance, data usage, and computation time) and to highlighting future directions.

URL Link
      For more information visit: https://sunprinces.github.io/interspeech2020-meta-learning

      • Hung-yi Lee (National Taiwan University)
      • Shang-Wen Li (Amazon AWS AI)
      • Yu Zhang (Google Brain)
      • Ngoc Thang Vu (University of Stuttgart)
This shared task will help advance the state of the art in automatic speech recognition (ASR) by considering a challenging domain for ASR: non-native children's speech.  A new data set containing English spoken responses produced by Italian students will be released for training and evaluation.  The spoken responses in the data set were produced in the context of an English speaking proficiency examination.  The following data will be released for this shared task: a training set of 50 hours of transcribed speech, a development set of 2 hours of transcribed speech, a test set of 2 hours of speech, and a baseline Kaldi ASR system with evaluation scripts.
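Since the shared task ships a Kaldi baseline together with evaluation scripts, submissions will presumably be scored by word error rate (WER). As a rough illustration only (not the official scoring script), WER can be computed from the word-level Levenshtein distance between reference and hypothesis:

```python
def wer(ref, hyp):
    """Word error rate: edit distance over word tokens, normalized
    by the number of reference words."""
    r, h = ref.split(), hyp.split()
    # d[i][j] = edit distance between r[:i] and h[:j]
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i  # i deletions
    for j in range(len(h) + 1):
        d[0][j] = j  # j insertions
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / max(len(r), 1)
```

Production scoring additionally handles text normalization and per-utterance alignment reports, which this sketch omits.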

Important Dates
      • Release of training data, development data, and baseline system: February 7, 2020
      • Test data released: March 13, 2020
      • Submission of results on test set: March 20, 2020
      • Test results announced: March 23, 2020

URL Link

      • Daniele Falavigna, Fondazione Bruno Kessler
      • Keelan Evanini, Educational Testing Service
SdSV Challenge 2020: Large-Scale Evaluation of Short-Duration Speaker Verification

Are you searching for new challenges in speaker recognition? Join the SdSV Challenge 2020 – the first challenge with a broad focus on systematic benchmarking and analysis of short-duration speaker verification under varying degrees of phonetic variability.

The SdSV Challenge 2020 consists of two tasks.
      • Task 1 is defined as speaker verification in text-dependent mode where the lexical content (in both English and Persian) of the test utterances is also taken into consideration.
      • Task 2 is defined as speaker verification in text-independent mode with same- and cross-language trials.

The participating teams will get access to a training set and a test set drawn from the DeepMine corpus, the largest public corpus designed for short-duration speaker verification, with voice recordings of 1800 speakers. The challenge leaderboard is hosted on CodaLab. There will be cash prizes for each task; top performers will be determined based on the results of their primary systems on the evaluation subset. In addition to the cash prizes, winners will receive certificates for their achievement.
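The official metrics are defined in the challenge's evaluation plan; as an unofficial illustration, speaker verification systems are commonly summarized by the equal error rate (EER), the operating point where the false-acceptance and false-rejection rates coincide. A minimal sketch over genuine and impostor trial scores:

```python
def eer(genuine, impostor):
    """Approximate equal error rate: sweep candidate thresholds taken
    from the scores themselves and return the point where the
    false-acceptance and false-rejection rates are closest."""
    best_gap, best_eer = 1.0, None
    for t in sorted(set(genuine) | set(impostor)):
        far = sum(s >= t for s in impostor) / len(impostor)  # false accepts
        frr = sum(s < t for s in genuine) / len(genuine)     # false rejects
        if abs(far - frr) < best_gap:
            best_gap, best_eer = abs(far - frr), (far + frr) / 2
    return best_eer
```

Real evaluations interpolate between thresholds and typically also report detection cost functions; this sketch only conveys the idea.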

      For more information visit: https://sdsvc.github.io/
      Evaluation plan: https://sdsvc.github.io/assets/SdSV_Challenge_Evaluation_Plan.pdf
      Contact: sdsvc2020@gmail.com

      • Hossein Zeinali (Amirkabir University of Technology, Iran)
      • Kong Aik Lee (NEC Corporation, Japan)
      • Jahangir Alam (CRIM, Canada)
      • Lukáš Burget (Brno University of Technology, Czech Republic)
This session will focus on exploring weak aspects of existing automatic speaker verification (ASV) systems by attacking them from the attacker's perspective. Studies may focus on degrading the performance of existing systems through various possible attacks in order to highlight loopholes in those systems. Attacks can target ASV systems, anti-spoofing countermeasures, or both jointly. Prospective submissions can use any standard corpus available for ASV or anti-spoofing countermeasures for studies related to this session.

URL Link
For more information, please visit: https://sites.google.com/view/attackers-perspective-on-asv/

      • Rohan Kumar Das (National University of Singapore)
      • Xiaohai Tian (National University of Singapore)
      • Tomi H. Kinnunen (University of Eastern Finland)
      • Haizhou Li (National University of Singapore)
Elderly Emotion, Breathing & Masks

Interspeech ComParE is an open Challenge dealing with states and traits of speakers as manifested in their speech signal’s properties. In this 12th edition, we introduce three new tasks and Sub-Challenges:
      • Elderly Emotion assessment in spoken language,
      • Breathing tracking in audio,
      • Mask condition recognition of speakers with or without facial masks worn.

The Sub-Challenges allow contributors to use their own features with their own machine learning algorithms; however, a standard feature set and tools are provided that may be used. Participants have five trials on the test set per Sub-Challenge. Participation must be accompanied by a paper presenting the results, which undergoes the Interspeech peer review.
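As a minimal, hypothetical illustration of the "own features with own machine learning algorithm" option (entirely separate from the provided baseline tools), a nearest-centroid classifier over pre-extracted acoustic feature vectors might look like:

```python
import math

def fit_centroids(X, y):
    """Compute the per-class mean of the feature vectors."""
    sums, counts = {}, {}
    for x, label in zip(X, y):
        if label not in sums:
            sums[label] = [0.0] * len(x)
            counts[label] = 0
        sums[label] = [a + b for a, b in zip(sums[label], x)]
        counts[label] += 1
    return {c: [v / counts[c] for v in s] for c, s in sums.items()}

def predict(centroids, x):
    """Assign x to the class with the nearest centroid (Euclidean)."""
    return min(centroids, key=lambda c: math.dist(centroids[c], x))
```

Any classifier can stand in here; the Challenge only fixes the data splits and the number of test-set trials, not the modelling approach.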

Contributions using the provided or equivalent data are sought, including (but not limited to):
      • Participation in a Sub-Challenge
      • Contributions around the Challenge topics

Results of the Challenge and Prizes will be presented at Interspeech 2020 in Shanghai, China.

URL Link
      Please visit: http://www.compare.openaudio.eu/compare2020/

      • Björn Schuller (University of Augsburg, Germany / Imperial College, UK / audEERING)
      • Anton Batliner (University of Augsburg, Germany)
      • Christian Bergler (FAU, Germany)
      • Eva-Maria Meßner (University of Ulm, Germany)
      • Antonia Hamilton (University College London, UK)
Singing is a human behavior that generates musical sounds with the voice, augmenting regular speech with tonality, rhythm, lyrics, and various vocal techniques. The singing voice is the central sound source in music and largely determines a song's quality, as it conveys melody, lyrics, emotion, and humanity with high expressivity. However, due to its unstructured natural characteristics, the singing voice remains difficult to process and analyze, which hinders applications in artificial intelligence such as singing voice generation without sheet music or lyrics. Fast-developing technologies offer promising ways to overcome these challenges; in particular, recent progress in deep learning has opened an exciting new era and greatly advanced the state-of-the-art performance on a series of singing voice related tasks. This special session seeks innovative papers from both industry and academia that exploit novel technologies and solutions for highly effective and efficient singing voice computing and processing in music.

Scope and Topics
This special session of Interspeech 2020 invites original and high-quality research relating to all topics of singing voice computing and processing in music, including, but not limited to, the following:
      • singing synthesis especially with expressiveness
      • singing voice detection
      • singing melody extraction
      • singer recognition
      • auto-tagging of singing voice
      • deep learning technologies in singing voice computing and processing
      • singing voice generation

      • Wei Li (Fudan University, China)
      • Shengchen Li (Beijing University of Posts and Telecommunications, China)
      • Yi Yu (National Institute of Informatics, Japan)
      • Xi Shao (Nanjing University of Posts and Telecommunications, China)
In the last decade, deep learning (DL) has achieved remarkable success in speech-related tasks, e.g., speaker verification, language identification, and emotion classification. However, the complexity of these tasks is often beyond non-experts. The rapid growth of a vast range of speech classification applications has created a demand for off-the-shelf speech classification methods that can be used easily and without expert knowledge. Automated Deep Learning (AutoDL) aims to automate the pipeline of training an effective DL model for a given task without any human intervention. Since its proposal, AutoDL has been explored in various applications, and a series of AutoDL competitions, e.g., Automated Natural Language Processing (AutoNLP) and Automated Computer Vision (AutoCV), have been organized by 4Paradigm, Inc. and ChaLearn (sponsored by Google).

In this challenge, we further propose the Automated Speech (AutoSpeech) competition, which aims at automated solutions for speech-related tasks. The challenge is restricted to multi-label classification problems drawn from different speech classification domains. The provided solutions are expected to discover various kinds of paralinguistic speech attribute information, such as speaker, language, and emotion, when only raw data (speech features) and meta information are provided.

URL Link

      • Wei-Wei Tu, 4Paradigm Inc., China and ChaLearn, USA
      • Tom Ko, Southern University of Science and Technology, China
      • Lei Xie, Northwestern Polytechnical University, China
      • Hugo Jair Escalante, INAOE, Mexico and ChaLearn, USA
      • Isabelle Guyon, Université Paris-Saclay, France and ChaLearn, USA
      • Qiang Yang, Hong Kong University of Science and Technology, China
      • Jingsong Wang, 4Paradigm Inc., China
      • Shouxiang Liu, 4Paradigm Inc., China
      • Xiawei Guo, 4Paradigm Inc., China
      • Zhen Xu, 4Paradigm Inc., China
Alzheimer's Dementia Recognition through Spontaneous Speech: The ADReSS Challenge

Dementia is a category of neurodegenerative diseases that entails a long-term and usually gradual decrease of cognitive functioning. The main risk factor for dementia is age, and therefore its greatest incidence is amongst the elderly. Due to the severity of the situation worldwide, institutions and researchers are investing considerably in dementia prevention and early detection, focusing on disease progression. There is a need for cost-effective and scalable methods for the detection of dementia from its most subtle forms, such as the preclinical stage of Subjective Memory Loss (SML), to more severe conditions like Mild Cognitive Impairment (MCI) and Alzheimer's Dementia (AD) itself.

While a number of studies have investigated speech and language features for the detection of Alzheimer's Disease and mild cognitive impairment, and proposed various signal processing and machine learning methods for this prediction task, the field still lacks balanced and standardised data sets on which these different approaches can be systematically compared.

The main objective of the ADReSS challenge is to make available a benchmark dataset of spontaneous speech, which is acoustically pre-processed and balanced in terms of age and gender, defining a shared task through which different approaches to AD recognition in spontaneous speech can be compared. We expect that this challenge will bring together groups working on this active area of research, and provide the community with the very first comprehensive comparison of different approaches to AD recognition using this benchmark dataset.

Important Dates
      • March 15, 2020: ADReSS test data made available
      • March 17, 2020: Submission of results opens
      • March 30, 2020: Paper submission deadline
      • June 19, 2020: Paper acceptance/rejection notification

URL Link

      • Saturnino Luz, Usher Institute, University of Edinburgh
      • Fasih Haider, University of Edinburgh
      • Sofia de la Fuente, University of Edinburgh
      • Davida Fromm, Carnegie Mellon University
      • Brian MacWhinney, Carnegie Mellon University
The Zero Resource Speech Challenge 2020 is the fourth iteration of the Zero Resource Speech Challenge series. The overall goal of the series is to advance research in unsupervised training of speech and dialogue tools, taking inspiration from the fact that young infants learn to perceive and produce speech with no textual supervision, and with applications in the area of speech technology for under-resourced languages. In previous challenges, participants had to do unsupervised acoustic modelling, lexicon/word discovery, and speech synthesis (“TTS without T”). There is substantial further research to be done on all three tasks. Word discovery remains a very difficult task, with a large gap between the state of the art and the ideal performance. High-quality unsupervised unit discovery appears to be feasible if the only criterion is good speech synthesis, but attempts to learn “text-like” units which are substantially compressed in time or space have so far been much less successful, leading to doubts about whether existing techniques learn units that would be useful in tasks traditionally relying on text (language modeling, ASR, machine translation). To bring together continued research on the existing tasks, the Zero Resource Speech Challenge 2020 proposes a consolidated challenge based on the 2017 and 2019 tasks, with training data unchanged and only minimal changes to the evaluation (to address bug fixes and permit novel analyses of the submissions).

URL Link

      • Ewan Dunbar (Université de Paris)
      • Emmanuel Dupoux (Ecole des Hautes Etudes en Sciences Sociales – Facebook AI Research)
      • Robin Algayres (Inria)
      • Sakriani Sakti (NAIST–RIKEN)
      • Xuan-Nga Cao (École Normale Supérieure)
      • Mathieu Bernard (Inria)
      • Julien Karadayi (École Normale Supérieure)
      • Lucas Ondel (Brno University of Technology)
      • Alan W. Black (Carnegie Mellon University)
      • Laurent Besacier (Université Grenoble Alpes)
The VoicePrivacy initiative is spearheading the effort to develop privacy preservation solutions for speech technology. It aims to gather a new community to define the task and metrics and to benchmark initial solutions using common datasets, protocols and metrics. VoicePrivacy takes the form of a competitive challenge. VoicePrivacy 2020 is the first of what is intended to become a biennial challenge series.

Participants are required to develop anonymization algorithms which suppress personally identifiable information contained within speech signals. At the same time, they should preserve linguistic content and naturalness of speech.  Performance will be evaluated using objective and subjective metrics to assess speaker verification/re-identification ability, subjective speech intelligibility and naturalness for human communication scenarios, and automatic speech recognition (ASR) performance for human-machine communication scenarios.

The VoicePrivacy Challenge aims to attract teams from academia and industry, including participants who have prior interest and expertise in privacy, and those who have never studied voice anonymization or related topics before.

The special session will be dedicated to the discussion of applied technologies, performance evaluation, the analysis of challenge results and directions for future challenge editions.

URL Link
For more information please visit: https://www.voiceprivacychallenge.org/

Organizers (in alphabetical order)
      • Jean-François Bonastre - University of Avignon - LIA, France
      • Nicholas Evans - EURECOM, France
      • Andreas Nautsch - EURECOM, France
      • Paul-Gauthier Noé - University of Avignon - LIA, France
      • Jose Patino - EURECOM, France
      • Md Sahidullah - Inria, France
      • Brij Mohan Lal Srivastava - Inria, France
      • Natalia Tomashenko - University of Avignon - LIA, France
      • Massimiliano Todisco - EURECOM, France
      • Emmanuel Vincent - Inria, France
      • Xin Wang - NII, Japan
      • Junichi Yamagishi - NII, Japan and University of Edinburgh, UK
Self-supervised approaches for speech processing combine methods for utilizing unlabeled or partially labeled data, unpaired text and audio data, contextual text and video supervision, and signals from user interactions, which are key for achieving cognitive intelligence in speech processing systems. This special session will bring concentrated discussion via a panel of leading researchers from industry and academia, and a poster session of high-quality papers on self-supervision for speech processing. Alongside research on new self-supervised methods and results, this session calls for novel work on understanding, analyzing, and comparing different self-supervision approaches for speech.

URL Link
For more information visit: https://self-supervised-sp.github.io/Interspeech2020-Special-Session

      • Abdelrahman Mohamed (Facebook)
      • Hung-yi Lee (NTU)
      • Shinji Watanabe (JHU)
      • Tara Sainath (Google)
Speaker verification is a key technology in speech processing and biometrics, with broad impact on our daily lives, e.g., security, customer service, mobile devices, and smart speakers. Recently, speech-based human-computer interaction has become increasingly popular in far-field smart home and smart city applications, e.g., mobile devices, smart speakers, smart TVs, and automobiles. Thanks to deep learning methods, the performance of speaker verification in the telephone channel and the close-talking microphone channel has improved dramatically. However, there are still open research questions to be explored for speaker verification in far-field and complex environments.

We believe that an open, free, and large-scale speech database collected from real speakers with both a close-talking microphone and multiple far-field distributed microphone arrays can serve as a benchmark and foster the exchange of ideas and discussion in this research area.

This challenge includes the following three tasks:
      • Task 1: Far Field Text Dependent Speaker Verification from a single microphone array
      • Task 2: Far Field Text Independent Speaker Verification from a single microphone array
      • Task 3: Far Field Text Dependent Speaker Verification from distributed microphone arrays

URL Link

      • Ming Li, Duke Kunshan University
      • Haizhou Li, National University of Singapore
      • Shrikanth Narayanan, University of Southern California
      • Rohan Kumar Das, National University of Singapore
      • Rao Wei, National University of Singapore
      • Hui Bu, AISHELL foundation 
The Fearless Steps Initiative by UTDallas-CRSS led to the digitization, recovery, and diarization of 19,000 hours of original analog audio data, as well as the development of algorithms to extract meaningful information from this multichannel naturalistic data resource. As an initial step to motivate a streamlined and collaborative effort from the speech and language community, UTDallas-CRSS is hosting a series of progressively complex tasks to promote advanced research on naturalistic “Big Data” corpora. This began at ISCA INTERSPEECH-2019 with "The FEARLESS STEPS Challenge: Massive Naturalistic Audio (FS-#1)". The first edition of this challenge encouraged the development of core unsupervised/semi-supervised speech and language systems for single-channel data with low resource availability, serving as the “First Step” towards extracting high-level information from such massive unlabeled corpora.

As a natural progression following the successful inaugural challenge FS#1, the FEARLESS STEPS Challenge Phase-#2 focuses on the development of single-channel supervised learning strategies. FS#2 provides 80 hours of ground-truth data through Training and Development sets, with an additional 20 hours of blind-set Evaluation data. Based on feedback from the Fearless Steps participants, additional tracks for streamlined speech recognition and speaker diarization have been included in FS#2. The results of this Challenge will be presented at the ISCA INTERSPEECH-2020 special session. We encourage participants to explore any and all research tasks of interest with the Fearless Steps Corpus, with suggested task domains listed below. Participants may, however, also utilize the FS#2 corpus to explore additional problems dealing with naturalistic data, which we welcome as part of the special session.

      • Challenge Start Date (Data Release):    January 25th 2020
      • INTERSPEECH-2020 Papers dealing with FEARLESS STEPS deadline:    March 30, 2020

Challenge Tasks in Phase-2 (FS#2)
      1. Speech Activity Detection (SAD)
      2. Speaker Identification (SID)
      3. Speaker Diarization:
          3a. Track 1: Diarization using reference SAD
          3b. Track 2: Diarization using system SAD
      4. Automatic Speech Recognition (ASR):
          4a. Track 1: ASR using reference Diarization
          4b. Track 2: Continuous stream ASR
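As a toy illustration of the simplest task in the list (SAD), and with no relation to the official baselines or scoring tools, a frame-level energy detector can be sketched as follows:

```python
def energy_sad(samples, frame_len, threshold):
    """Label each non-overlapping frame as speech (1) or non-speech (0)
    by comparing its mean energy against a fixed threshold."""
    labels = []
    for i in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[i:i + frame_len]
        energy = sum(s * s for s in frame) / frame_len
        labels.append(1 if energy >= threshold else 0)
    return labels
```

Competitive SAD systems for this corpus must of course cope with channel noise, cross-talk, and transmission artifacts that defeat a simple energy threshold.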

Registration & Website Link

      • John H.L. Hansen, University of Texas at Dallas
      • Aditya Joglekar, University of Texas at Dallas
      • Meena Chandra Shekar, University of Texas at Dallas
      • Abhijeet Sangwan, University of Texas at Dallas
The goal of this special session is to compare the effectiveness of speech modification techniques to enhance intelligibility in noisy and reverberant conditions. The session will present the outcomes and entries of the international Hurricane Challenge 2.0 to allow for in-depth discussion of different approaches to the same challenging problem.

In addition to focusing on recent advances in speech-modification techniques, one major extension of Hurricane 2.0 compared to the previous challenge (presented during a special session at Interspeech 2013) is that different degrees of reverberation in the listening environment are included as an additional detrimental factor for speech intelligibility (not only masking noise), thus allowing for a more general assessment of the algorithms’ benefit in real rooms. In addition, subjective evaluations were carried out at three different sites (Oldenburg, Germany; Vitoria, Spain; Edinburgh, UK) in the respective languages, thus allowing for a multilingual comparison of the algorithms’ effectiveness. Finally, the knowledge of the listening conditions was limited in a more realistic way by not providing the exact waveforms of the masking noise, but only a waveform and room impulse responses recorded in the same room in the proximity of the listener.

Overall, the new challenge was designed to test the algorithms in a more realistic way, and this special session will discuss the advances and remaining limitations of speech modification techniques to improve speech intelligibility in challenging conditions.

URL Link

      • Jan Rennies-Hochmuth (Fraunhofer IDMT, Germany)
      • Henning Schepker (Signal Processing Group, University of Oldenburg, Germany)
      • Martin Cooke (Ikerbasque & University of the Basque Country, Spain)
      • Cassia Valentini-Botinhao (CSTR, University of Edinburgh, UK)
The DNS Challenge at Interspeech 2020 is intended to promote collaborative research in single-channel speech enhancement aimed at maximizing the perceptual quality and intelligibility of the enhanced speech. The challenge will evaluate speech quality using the online subjective evaluation framework ITU-T P.808. The challenge provides large datasets for training noise suppressors, but participants may use any datasets of their choice and may augment their datasets with the provided data. The challenge also provides an extensive test set containing both synthetic noisy speech and real recordings. The final evaluation will be conducted on a blind test set similar to the open-sourced test set. We also provide model and inference scripts for a recently published baseline noise suppressor.

More details about the open sourced data, baseline noise suppressor, ITU-T P.808 and DNS challenge can be found here.

Submitted papers will fall under one of these two tracks based on the computational complexity.
      • Real-Time Track: This track focuses on low computational complexity. The algorithm should take less than T/2 (in ms) to process a frame of size T (in ms) on an Intel Core i5 quad core machine clocked at 2.4 GHz or equivalent processors. Frame length T should be less than or equal to 40ms.
      • Non-Real-Time Track: This track relaxes the constraints on computational time so that researchers can explore deeper models to attain exceptional speech quality.
In both tracks, the SE method may have a maximum of 40 ms of look-ahead. To infer the current frame of T ms, the algorithm can access any number of past frames but only 40 ms of future frames (T + 40 ms).
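A rough way to sanity-check the Real-Time Track budget on your own hardware is to time the frame-processing function; this is indicative only, since the rule is defined with respect to the Intel Core i5 reference machine, and `process_frame` is a placeholder for your own suppressor:

```python
import time

def meets_realtime_budget(process_frame, frame, frame_ms, n_trials=50):
    """Check the Real-Time Track budget: processing one frame of T ms
    must take less than T/2 ms (averaged here over several runs to
    smooth out timer jitter)."""
    start = time.perf_counter()
    for _ in range(n_trials):
        process_frame(frame)
    elapsed_ms = (time.perf_counter() - start) * 1000 / n_trials
    return elapsed_ms < frame_ms / 2
```

For example, a 20 ms frame must be processed in under 10 ms per call for the check to pass.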

The blind test set will be provided to the participating teams on March 18, 2020. The enhanced clips should be sent back to the organizers by March 22, 2020. The organizers will conduct a subjective evaluation using the ITU-T P.808 framework to obtain the final ranking of the methods. Please visit Rules for more details.

Participants are forbidden from using the blind test set to retrain or tweak their models, and they must not submit clips enhanced by a noise suppression method other than the one they are submitting to INTERSPEECH 2020. Failure to adhere to these rules will lead to disqualification from the challenge.

Please feel free to reach out to us, if you have any questions or need clarification about any aspect of the challenge.

Top three winning teams from each track will be awarded prizes.

URL Link

      • Chandan K A Reddy (Microsoft Corp, USA)
      • Hari Dubey (Microsoft Corp, USA)
      • Ross Cutler (Microsoft Corp, USA)
      • Johannes Gehrke (Microsoft Corp, USA)
      • Vishak Gopal (Microsoft Corp, USA)
      • Robert Aichner (Microsoft Corp, USA)
Voice Production, Acoustics, and Auditory Perception

The appraisal of voice quality is relevant to the clinical care of disordered voices. It contributes to the selection and optimization of clinical treatment as well as to the assessment of the outcome of the treatment. Levels of description of voice quality include the biomechanics of the vocal folds and their kinematics, temporal and spectral acoustic features, as well as the auditory scoring of hoarseness, hyper- and hypo-functionality, creakiness, diplophonia, harshness, etc. Broad and fuzzy definitions of terms regarding voice quality are in use, which impede scientific and clinical communication.

The aim of this special session is to contribute to the improvement of the clinical assessment of voice quality via a translational approach, which focuses on quantifying and explaining relationships between several levels of description. The objective is to gather new insights, advancement of knowledge, and practical tools to assist researchers and clinicians in obtaining effective descriptions of voice quality and reliable measures of its acoustic correlates. Topics of interest include, but are not limited to, (i) the statistical analysis and automatic classification, possibly relying on state-of-the-art machine learning approaches, of distinct types of voice quality via non-obtrusively recorded features, (ii) the analysis and simulation of vocal fold vibrations by means of analytical, kinematic or mechanical modelling, (iii) the interpretation and modeling of acoustic emission and/or high-speed video recordings such as videolaryngoscopy and videokymography, and (iv) the synthesis of disordered voices jointly with auditory experimentation involving synthetic and natural disordered voice stimuli.

URL Link
      For more information please visit: https://sites.google.com/view/voicequality-interspeech2020/home

      • Philipp Aichinger (Medical University of Vienna, Austria)
      • Abeer Alwan (University of California, Los Angeles, USA)
      • Carlo Drioli (University of Udine, Italy)
      • Jody Kreiman (University of California, Los Angeles, USA)
      • Jean Schoentgen (Universite Libre de Bruxelles, Belgium)
Speech is a complex process emitting a wide range of biosignals, including, but not limited to, acoustics. These biosignals – stemming from the articulators, the articulator muscle activities, the neural pathways, or the brain itself – can be used to circumvent limitations of conventional speech processing, and to gain insights into the process of speech production. Motivated by the increased interest and major progress in neural signals for spoken communication in recent years, we invite participation in a special session on Neural Signals for Spoken Communication.

We aim at bringing together researchers and interested users from multiple disciplines (e.g. linguistics, engineering, computer science, neuroscience, medicine) and fields (e.g. speech production, articulation, recognition and synthesis, data acquisition, signal processing, human-machine-interfaces) to discuss the current state and future research of this prosperous and active research field.

      • Tanja Schultz (University of Bremen, Germany)
      • Satoshi Nakamura (Nara Institute of Science and Technology, Japan)
      • Hiroki Tanaka (Nara Institute of Science and Technology, Japan)
      • Christian Herff (Maastricht University, The Netherlands)
      • Dean Krusienski (Virginia Commonwealth University, USA)
      • Jonathan Brumberg (University of Kansas, USA)
