Tutorials
INTERSPEECH conferences are attended by researchers with a long-term track record in speech science and technology, as well as by early-stage researchers and researchers entering a new domain within the INTERSPEECH areas. An important part of the conference is the tutorials held on its first day, October 25, 2020. Presented by speakers with long and deep expertise in speech, they offer the audience a rich learning experience and exposure to longstanding research problems, contemporary research topics, and emerging areas.
 
Time of the tutorials: October 25, 2020; two 3.5-hour sessions, one in the morning and one in the afternoon.
Morning tutorials (9:00-12:30)
Afternoon tutorials (14:00-17:30)

Morning Sessions (9:00 - 12:30)

► Sunday, 25 October, 9:00–12:30

Flexibility and speed are key features of a deep learning framework that allow a fast transition from a research idea to prototyping and production code. We outline how to implement a unified framework for sequence processing that covers various kinds of models and applications. We will discuss our toolkit RETURNN as an example of such an implementation: one that is easy for the user to apply and understand, flexible enough to allow for any kind of architecture or method, and at the same time very efficient. In addition, we provide a comparison of the properties of different machine learning toolkits for sequence classification. The flexibility of such implementations will be demonstrated by describing the setup of recent state-of-the-art models for automatic speech recognition and machine translation, among others.
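
To give a flavour of what a unified, configuration-driven sequence framework looks like in practice, here is a toy sketch of a dictionary-of-layers network definition together with a minimal dependency-resolving "builder". The keys and values are illustrative assumptions for this example only and are not guaranteed to match RETURNN's actual configuration format.

```python
# Schematic only: a dictionary-of-layers network definition in the spirit of
# config-driven toolkits such as RETURNN. Keys/values are illustrative.
network = {
    "encoder": {"class": "rec", "unit": "lstm", "n_out": 512, "from": "data"},
    "attention": {"class": "attention", "from": "encoder"},
    "output": {"class": "softmax", "loss": "ce", "from": "attention"},
}

def build(network):
    """Toy 'builder' that walks the layer dictionary in dependency order."""
    built = {"data": "input placeholder"}
    remaining = dict(network)
    while remaining:
        for name, spec in list(remaining.items()):
            deps = spec["from"] if isinstance(spec["from"], list) else [spec["from"]]
            if all(d in built for d in deps):
                built[name] = f"{spec['class']} layer over {deps}"
                del remaining[name]
    return built

print(build(network))
```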

Organizers:
● Albert Zeyer (RWTH; AppTek)
● Nick Rossenbach (RWTH; AppTek)
● Parnia Bahar (RWTH; AppTek)
● André Merboldt (RWTH)
● Ralf Schlüter (RWTH; AppTek)

Albert Zeyer has been a Ph.D. student in the Human Language Technology Group at RWTH Aachen University, Germany, since 2014, under the supervision of Prof. Hermann Ney. He received both the Diplom (M.Sc.) in Mathematics and the Diplom (M.Sc.) in Computer Science from RWTH Aachen University in 2013. His research focuses on neural networks in general; his first studies of and passion for neural networks and connectionism go back to 1996. His recent work covers recurrent networks, attention models, and end-to-end models in general, with applications in speech recognition, translation, and language modeling, where he has achieved many state-of-the-art results. Albert started developing software in 1995 and has published a variety of open-source projects since then. The TensorFlow-based software RETURNN, which he developed as the main architect for his Ph.D. research, is now widely used by his teammates at RWTH Aachen University and beyond. Albert Zeyer has given university lectures and a workshop at eBay, partly covering the same content as this tutorial.

Nick Rossenbach is a Ph.D. student in the Human Language Technology Group at RWTH Aachen University, Germany, under the supervision of Prof. Dr. Hermann Ney. He has worked as a student researcher at the chair since 2015, spending the first two years on neural machine translation, followed by a bachelor thesis on the same topic. Since 2018, he has worked on speech recognition and on implementing text-to-speech systems in RETURNN, and he completed his master thesis on generating synthetic training data with TTS systems in 2020. Currently, his research focus is on multi-speaker TTS systems trained on noisy data.

Parnia Bahar holds a Master's degree in Electrical Engineering from Stuttgart University and is currently a Ph.D. student in the Human Language Technology Group at RWTH Aachen University, Germany, under the supervision of Prof. Dr. Hermann Ney. Her areas of research are human language technology, machine learning, and neural networks. Her focus includes designing end-to-end neural network models for translation, in both the spoken and the written form, as well as recognition systems; she develops these models using RETURNN. She is author or co-author of papers at high-ranking international conferences such as ACL, EMNLP, and ICASSP, and systems developed with her cooperation have consistently ranked among the best in shared tasks at WMT and IWSLT. She also gives lectures at the university, gives technical and scientific talks at workshops, supervises theses in her field of interest, and works on various research projects.

André Merboldt is a master's student in the Human Language Technology Group at RWTH Aachen University, Germany, under the supervision of Prof. Dr. Hermann Ney. He has worked at the chair since 2017, where he also wrote his bachelor thesis on end-to-end models for speech recognition. Since then, his focus has been on investigating and designing attention and transducer models for ASR using the RETURNN software.

Ralf Schlüter serves as Academic Director and Senior Lecturer in the Department of Computer Science of the Faculty of Computer Science, Mathematics and Natural Sciences at RWTH Aachen University. He leads the Automatic Speech Recognition Group at the Lehrstuhl Informatik 6: Human Language Technology and Pattern Recognition. He studied physics at RWTH Aachen University and the University of Edinburgh and received his Diploma in Physics (1995), PhD in Computer Science (2000), and Habilitation in Computer Science (2019), all at RWTH Aachen University. Dr. Schlüter works on all aspects of automatic speech recognition and has led the scientific work of the Lehrstuhl Informatik 6 in this area in many large national and international research projects, e.g., EU-Bridge and TC-STAR (EU), Babel (US-IARPA), and Quaero (French OSEO).

► Sunday, 25 October, 9:00–12:30

While smartphone assistants and smart speakers are prevalent and expectations for social communicative robots are high, spoken language interaction with such robots has not yet been effectively deployed. This tutorial gives an overview of the issues and challenges in integrating natural multimodal dialogue processing into social robots. We first outline dialogue tasks and interfaces suitable for robots, in comparison with conventional dialogue systems and virtual agents. We then review challenges and approaches in the component technologies, including ASR, TTS, SLU, and dialogue management, with a focus on human-robot interaction. Issues related to multimodal processing are also addressed; in particular, we review non-verbal processing, including gaze and gesturing, for facilitating turn-taking, timing backchannels, and indicating trouble in the interaction. Finally, we briefly discuss open questions concerning architectures for integrating spoken dialogue systems and human-robot interaction.
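
As a rough illustration of how these components fit together in a robot, the sketch below wires placeholder ASR, SLU, dialogue management, and TTS stages into one turn-taking loop, with a non-verbal cue (user gaze) gating the dialogue manager. All class and method names are hypothetical; this is not the architecture of any specific system covered in the tutorial.

```python
# Hypothetical skeleton of a spoken dialogue loop for a robot; every component
# here is a placeholder, not a reference to a particular toolkit or system.
class RobotDialogueSystem:
    def asr(self, audio):
        return "hello robot"                        # speech -> text

    def slu(self, text):
        return {"intent": "greeting", "slots": {}}  # text -> semantic frame

    def dialogue_manager(self, frame, nonverbal):
        # Multimodal cues (e.g. user gaze) can gate turn-taking and backchannels.
        if not nonverbal.get("user_gazing_at_robot", True):
            return {"act": "wait"}
        return {"act": "greet_back"}

    def nlg_tts(self, act):
        return b"synthesized audio for: " + act["act"].encode()

    def turn(self, audio, nonverbal):
        frame = self.slu(self.asr(audio))
        act = self.dialogue_manager(frame, nonverbal)
        return self.nlg_tts(act)

robot = RobotDialogueSystem()
robot.turn(audio=b"...", nonverbal={"user_gazing_at_robot": True})
```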

Organizers:
• Tatsuya Kawahara (Kyoto University, Japan)
• Kristiina Jokinen (AI Research Center AIST Tokyo Waterfront, Japan)

Tatsuya Kawahara received the B.E. in 1987, the M.E. in 1989, and the Ph.D. in 1995, all in information science, from Kyoto University, Kyoto, Japan. From 1995 to 1996, he was a Visiting Researcher at Bell Laboratories, Murray Hill, NJ, USA. He is currently a Professor and the Dean of the School of Informatics, Kyoto University. He has also been an Invited Researcher at ATR and NICT.

He has published more than 400 academic papers on speech recognition, spoken language processing, and spoken dialogue systems. He has conducted several projects, including the speech recognition software Julius, the automatic transcription system for the Japanese Parliament (Diet), and the autonomous android ERICA.
From 2003 to 2006, he was a member of the IEEE SPS Speech Technical Committee. He was a General Chair of ASRU 2007, a Tutorial Chair of INTERSPEECH 2010, and a Local Arrangement Chair of ICASSP 2012. He has been an editorial board member of the Elsevier journal Computer Speech and Language and of the IEEE/ACM Transactions on Audio, Speech, and Language Processing, and he is the Editor-in-Chief of the APSIPA Transactions on Signal and Information Processing.
Dr. Kawahara is a board member of ISCA and APSIPA, and a Fellow of IEEE.

Kristiina Jokinen is Senior Researcher at the AI Research Center, AIST Tokyo Waterfront, Adjunct Professor at the University of Helsinki, and Life Member of Clare Hall, University of Cambridge. She received her first degree from the University of Helsinki and her PhD from the University of Manchester, UK. She was awarded a JSPS Fellowship for research at the Nara Institute of Science and Technology, was an Invited Researcher at ATR, and was a Visiting Professor at Doshisha University, Kyoto.

She developed the Constructive Dialogue Model and has published widely, including four books, on spoken dialogue systems and AI-based multimodal human-robot interaction (eye gaze, speech, gestures). Together with G. Wilcock she developed WikiTalk, a Wikipedia-based robot dialogue system that won the Best Robot Design award (Software Category) at the International Conference on Social Robotics 2017.

She has had a leading role in multiple national and international cooperation projects. She currently serves on the international Advisory Board of the H2020 project EMPATHIC and is a Steering Committee member of the IWSDS dialogue workshop series. She served as General Chair for SIGDial 2017 and ICMI 2013 and as Area Chair for Interspeech 2017 and COLING 2014. She organised the northernmost dialogue conference, IWSDS 2016 in Lapland, and the IJCAI AI-MHRI workshop in Stockholm in 2018.

► Sunday, 25 October, 9:00–12:30

Deep-learning-based human language technology (HLT), such as automatic speech recognition, intent and slot recognition, and dialog management, has become the mainstream of research in recent years and significantly outperforms conventional methods. However, deep learning models are notorious for being data and computation hungry. These downsides limit the deployment of such models to new languages, domains, or styles, since collecting in-genre data and training models from scratch are costly, and the long-tail nature of human language makes the challenge even greater.
A typical machine learning algorithm, e.g., deep learning, can be considered a sophisticated function that takes training data as input and produces a trained model as output. Today, learning algorithms are mostly human-designed; they are usually designed for one specific task and need a large amount of labeled training data. One approach that could potentially overcome these challenges is Meta Learning, also known as 'Learning to Learn', which aims to learn the learning algorithm itself, including better parameter initialization, optimization strategies, network architectures, distance metrics, and beyond. Recently, Meta Learning has shown high potential in several HLT areas, enabling faster fine-tuning, convergence to better performance, and few-shot learning. The goal of this tutorial is to introduce Meta Learning approaches and review the work applying this technology to HLT.
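
As a concrete illustration of learning a parameter initialization (the idea behind MAML-style meta learning), here is a minimal first-order sketch in NumPy for toy one-dimensional regression tasks. The task distribution, model, and step sizes are simplified assumptions for this example, not material from the tutorial itself.

```python
import numpy as np

# First-order MAML sketch for toy 1-D regression tasks y = slope * x.
def grad(w, x, y):
    """Gradient of the mean squared error of the model y_hat = w * x."""
    return np.mean(2.0 * (w * x - y) * x)

def sample_task(rng):
    """Each task is a regression problem with its own slope."""
    slope = rng.uniform(-2.0, 2.0)
    x = rng.uniform(-1.0, 1.0, size=20)
    return x, slope * x

rng = np.random.default_rng(0)
meta_w = 0.0                        # the meta-learned initialization
inner_lr, outer_lr, n_tasks = 0.1, 0.01, 8

for step in range(1000):
    meta_grad = 0.0
    for _ in range(n_tasks):
        x, y = sample_task(rng)
        x_tr, y_tr, x_val, y_val = x[:10], y[:10], x[10:], y[10:]
        # Inner loop: one adaptation step starting from the shared initialization.
        w_task = meta_w - inner_lr * grad(meta_w, x_tr, y_tr)
        # Outer loop (first-order approximation): accumulate the gradient of the
        # post-adaptation validation loss with respect to the adapted weight.
        meta_grad += grad(w_task, x_val, y_val)
    meta_w -= outer_lr * meta_grad / n_tasks
```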

Organizers:
• Hung-yi Lee (Department of Electrical Engineering, National Taiwan University)
• Ngoc Thang Vu (Institute for Natural Language Processing, Stuttgart)
• Shang-Wen Li (Amazon AWS AI)

Hung-yi Lee received the M.S. and Ph.D. degrees from National Taiwan University (NTU), Taipei, Taiwan, in 2010 and 2012, respectively. From September 2012 to August 2013, he was a postdoctoral fellow in Research Center for Information Technology Innovation, Academia Sinica. From September 2013 to July 2014, he was a visiting scientist at the Spoken Language Systems Group of MIT Computer Science and Artificial Intelligence Laboratory (CSAIL). He is currently an associate professor of the Department of Electrical Engineering of National Taiwan University, with a joint appointment at the Department of Computer Science & Information Engineering of the university. His research focuses on machine learning (especially deep learning), spoken language understanding and speech recognition. He owns a YouTube channel teaching deep learning (in Mandarin) with more than 4M views and 50k subscribers.

Ngoc Thang Vu received his Diploma (2009) and PhD (2014) degrees in computer science from Karlsruhe Institute of Technology, Germany. From 2014 to 2015, he worked at Nuance Communications as a senior research scientist and at Ludwig-Maximilian University Munich as an acting professor in computational linguistics. In 2015, he was appointed assistant professor at University of Stuttgart, Germany. Since 2018, he has been a full professor at the Institute for Natural Language Processing in Stuttgart. His main research interests are natural language processing (esp. speech, natural language understanding and dialog systems) and machine learning (esp. deep learning) for low-resource settings.

Shang-Wen Li is a senior Applied Scientist at Amazon AWS AI. His research in human language technology focuses on spoken language understanding, natural language generation, and dialog management. His recent interest is data augmentation for low-resourced conversational bots. He earned his Ph.D. from MIT Computer Science and Artificial Intelligence Laboratory (CSAIL) advised by Professor Victor Zue. He received M.S. and B.S. from National Taiwan University. Before joining Amazon AWS, he also worked at Amazon Alexa and Apple Siri researching on dialog management for error recovery.
► Sunday, 25 October, 9:00–12:30

Although recent successes have demonstrated the effectiveness of deep-learning-based models for the speech enhancement (SE) task, several directions are worth exploring to further improve SE performance. One direction is to derive a better objective function to replace the conventional mean-squared-error-based one used to train deep-learning-based models. In this tutorial, we first present several well-known intelligibility evaluation metrics and then present the theory and implementation details of SE systems trained with metric-based objective functions. The effectiveness of these objectives is confirmed by better standardized objective metrics and subjective listening test scores, as well as higher automatic speech recognition accuracy.
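
To illustrate the idea of replacing a purely MSE-based objective with a metric-based one, here is a minimal PyTorch-style training-step sketch. The `intelligibility_proxy` function is a hypothetical, differentiable stand-in for a metric such as STOI, and `enhancer` is an arbitrary enhancement network; neither corresponds to a specific method presented in the tutorial.

```python
import torch

def mse_loss(enhanced, clean):
    return torch.mean((enhanced - clean) ** 2)

def intelligibility_proxy(enhanced, clean):
    """Hypothetical differentiable stand-in for an intelligibility metric
    (here, a simple normalized correlation between the two waveforms)."""
    e = enhanced - enhanced.mean(dim=-1, keepdim=True)
    c = clean - clean.mean(dim=-1, keepdim=True)
    corr = (e * c).sum(-1) / (e.norm(dim=-1) * c.norm(dim=-1) + 1e-8)
    return corr.mean()

def training_step(enhancer, optimizer, noisy, clean, alpha=0.5):
    enhanced = enhancer(noisy)
    # Combine a signal-level term with a metric-based term; maximizing the
    # metric proxy corresponds to minimizing its negative.
    loss = alpha * mse_loss(enhanced, clean) \
        - (1.0 - alpha) * intelligibility_proxy(enhanced, clean)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```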

Organizers:
• Yu Tsao (The Research Center for Information Technology Innovation (CITI), Academia Sinica)
• Fei Chen (Department of Electrical and Electronic Engineering, Southern University of Science and Technology)

Yu Tsao received the B.S. and M.S. degrees in electrical engineering from National Taiwan University, Taipei, Taiwan, in 1999 and 2001, respectively, and the Ph.D. degree in electrical and computer engineering from the Georgia Institute of Technology, Atlanta, GA, USA, in 2008. From 2009 to 2011, he was a Researcher with the National Institute of Information and Communications Technology, Tokyo, Japan, where he engaged in research and product development in automatic speech recognition for multilingual speech-to-speech translation. He is currently an Associate Research Fellow with the Research Center for Information Technology Innovation, Academia Sinica, Taipei, Taiwan. His research interests include speech and speaker recognition, acoustic and language modeling, audio coding, and bio-signal processing. He is currently an Associate Editor of the IEEE/ACM Transactions on Audio, Speech, and Language Processing and of the IEICE Transactions on Information and Systems. Dr. Tsao received the Academia Sinica Career Development Award in 2017, the National Innovation Award in 2018 and 2019, and the Outstanding Elite Award of the Chung Hwa Rotary Educational Foundation, 2019-2020.

Fei Chen received the B.Sc. and M.Phil. degrees from the Department of Electronic Science and Engineering, Nanjing University, in 1998 and 2001, respectively, and the Ph.D. degree from the Department of Electronic Engineering, The Chinese University of Hong Kong, in 2005. He continued his research as a postdoctoral and senior research fellow at the University of Texas at Dallas (supervised by Prof. Philipos Loizou) and The University of Hong Kong, and joined Southern University of Science and Technology (SUSTech) as a faculty member in 2014. Dr. Chen leads the speech processing research group at SUSTech, with a research focus on speech perception, speech intelligibility modeling, speech enhancement, and assistive hearing technology. He has published over 80 journal papers and over 80 conference papers in IEEE journals and conferences, Interspeech, the Journal of the Acoustical Society of America, etc. He received the best presentation award at the 9th Asia Pacific Conference of Speech, Language and Hearing and a 2011 National Organization for Hearing Research Foundation Research Award in the United States. Dr. Chen serves as an associate editor or editorial board member of Frontiers in Psychology, Biomedical Signal Processing and Control, and Physiological Measurement.

Afternoon Sessions (14:00 - 17:30)

► Sunday, 25 October, 14:00–17:30

In recent years, the field of spoken language processing has been moving at a very fast pace.  The impact of deep learning coupled with access to vast data resources has given rise to unprecedented improvements in the performance of speech processing algorithms and systems.  However, the availability of such pre-recorded datasets and open-source machine-learning toolkits means that practitioners – especially students – are in real danger of becoming detached from the nature and behaviour of actual speech signals.  This tutorial is aimed at providing an appreciation of the fundamental properties of spoken language, from low-level phonetic detail to high-level communicative behaviour, with a special emphasis on aspects that may have significance for current and future research.

Organizer:
• Roger K. Moore (University of Sheffield)

Prof. Moore (http://staffwww.dcs.shef.ac.uk/people/R.K.Moore/) has over 40 years’ experience in Speech Technology R&D and, although an engineer by training, much of his research has been based on insights from human speech perception and production.  As Head of the UK Government's Speech Research Unit from 1985 to 1999, he was responsible for the development of the Aurix range of speech technology products and the subsequent formation of 20/20 Speech Ltd.  Since 2004 he has been Professor of Spoken Language Processing at the University of Sheffield, and also holds Visiting Chairs at Bristol Robotics Laboratory and University College London Psychology & Language Sciences.  Prof. Moore was President of the European/International Speech Communication Association from 1997 to 2001, General Chair for INTERSPEECH-2009 and ISCA Distinguished Lecturer during 2014-15.  In 2017 he organised the first international workshop on ‘Vocal Interactivity in-and-between Humans, Animals and Robots (VIHAR)’.  Prof. Moore is the current Editor-in-Chief of Computer Speech & Language and in 2016 he was awarded the LREC Antonio Zampolli Prize for "Outstanding Contributions to the Advancement of Language Resources & Language Technology Evaluation within Human Language Technologies".
► Sunday, 25 October, 14:00–17:30

A conversational information retrieval (CIR) system is an information retrieval (IR) system with a conversational interface, which allows users to interact with the system and seek information via multi-turn natural language conversations (in spoken or written form). This tutorial surveys recent advances in CIR, focusing on neural approaches developed in the last few years. We present (1) a typical architecture of a CIR system, (2) new tasks and applications that arise from the need to develop such a system, in comparison with traditional keyword-based IR systems, (3) new methods for conversational question answering, and (4) case studies of several CIR systems developed in research communities and industry.
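
The "typical architecture" mentioned in point (1) can be pictured as a small pipeline of query rewriting, retrieval, and answer generation over the dialogue history. The sketch below is a hypothetical skeleton with placeholder components, not a description of any particular system discussed in the tutorial.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class CIRPipeline:
    """Minimal, hypothetical skeleton of a conversational IR system."""
    history: List[str] = field(default_factory=list)

    def rewrite(self, query: str) -> str:
        # Placeholder: resolve ellipsis/coreference using the dialogue history,
        # e.g. "what about its population?" -> "what is the population of X?".
        return " ".join(self.history[-1:] + [query]) if self.history else query

    def retrieve(self, query: str, k: int = 5) -> List[str]:
        # Placeholder: dense or sparse retrieval over a document collection.
        return [f"passage_{i} for '{query}'" for i in range(k)]

    def answer(self, query: str, passages: List[str]) -> str:
        # Placeholder: a reader/generator conditioned on retrieved passages.
        return f"answer derived from {len(passages)} passages"

    def turn(self, user_query: str) -> str:
        q = self.rewrite(user_query)
        response = self.answer(q, self.retrieve(q))
        self.history += [user_query, response]
        return response

pipeline = CIRPipeline()
pipeline.turn("who founded the company?")
pipeline.turn("and when?")
```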

Organizers:
• Jianfeng Gao (Microsoft Research, Redmond)
• Chenyan Xiong (Microsoft Research, Redmond)
• Paul Bennett (Microsoft Research, Redmond)

Jianfeng Gao (primary contact) is a Partner Research Manager at Microsoft Research AI, Redmond, and an IEEE Fellow. He leads the development of AI systems for natural language understanding, vision-language processing, dialogue, and business applications. He frequently gives tutorials on similar or related topics at conferences and summer schools; examples include the tutorials on “deep learning for NLP and IR” at ICASSP 2014, HLT-NAACL 2015, IJCAI 2016, and the International Summer School on Deep Learning 2017 in Bilbao, as well as the tutorials on “neural approaches to conversational AI” at ACL 2018, SIGIR 2018, and ICML 2019.
From 2014 to 2017, he was a Partner Research Manager at the Deep Learning Technology Center at Microsoft Research, Redmond, where he led research on deep learning for text and image processing. From 2006 to 2014, he was a Principal Researcher in the Natural Language Processing Group at Microsoft Research, Redmond, where he worked on Web search, query understanding and reformulation, ads prediction, and statistical machine translation. From 2005 to 2006, he was a Research Lead in the Natural Interactive Services Division at Microsoft, where he worked on Project X, an effort to develop a natural user interface for Windows. From 2000 to 2005, he was a Research Lead in the Natural Language Computing Group at Microsoft Research Asia, where he and his colleagues developed the first Chinese speech recognition system released with Microsoft Office, the Chinese/Japanese Input Method Editors (IME), which were the leading products in the market, and the natural language platform for Microsoft Windows.

Chenyan Xiong is a Senior Researcher at Microsoft Research AI, Redmond. His research lies at the intersection of information retrieval, natural language processing, and deep learning. His current focus is on long-form text understanding, conversational information access, and neural information retrieval. Before joining Microsoft Research AI, Chenyan obtained his Ph.D. from the Language Technologies Institute, Carnegie Mellon University, in 2018.
He has published more than 30 papers at top IR, NLP, and machine learning conferences and is a PC member of SIGIR, WebConf, WSDM, ACL, EMNLP, KDD, NeurIPS, etc. He has organized three workshops on knowledge graphs in IR, served as a guest editor for the Information Retrieval Journal, and is organizing the Conversational Assistance Track (CAsT) at TREC.

Paul Bennett is a Senior Principal Research Manager at Microsoft Research AI, where he leads the Information Data Sciences group. His published research has focused on a variety of topics surrounding the use of machine learning in information retrieval – including ensemble methods and the combination of information sources, calibration, consensus methods for noisy supervision labels, active learning and evaluation, supervised classification and ranking, crowdsourcing, behavioral modeling and analysis, and personalization. Paul gave the tutorial on “Machine Learning and IR: Recent Successes and New Opportunities” at ICML 2009 and ECIR 2010.
Some of his work has been recognized with awards at SIGIR, CHI, and ACM UMAP as well as an ECIR Test of Time Honorable Mention award. Prior to joining Microsoft Research in 2006, he completed his dissertation in the Computer Science Department at Carnegie Mellon with Jaime Carbonell and John Lafferty. While at CMU, he also acted as the Chief Learning Architect on the RADAR project from 2005-2006 while a postdoctoral fellow in the Language Technologies Institute.
► Sunday, 25 October, 14:00–17:30

Speaker diarization is an essential component of speech applications in multi-speaker settings: spoken utterances need to be attributed to speaker-specific classes, with or without prior knowledge of the speakers' identity or profile. Initially, speaker diarization technologies were developed as standalone processes without requiring much context from other components of a given speech application. As speech recognition technology has become more accessible, there is an emerging trend of treating speaker diarization as an integral part of an overall speech recognition application, while benefiting from the speech recognition output to improve diarization accuracy. More recently, joint model training for speaker diarization and speech recognition has been investigated in an attempt to consolidate the training objectives and enhance overall performance. In this tutorial, we will overview the development of speaker diarization in the era of deep learning, present recent approaches to speaker diarization in the context of speech recognition, and share industry perspectives on speaker diarization and its challenges. Finally, we will provide insights into future directions of speaker diarization as part of context-aware interactive systems.
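
As a rough picture of the embedding-plus-clustering paradigm that dominates deep-learning-era diarization, the sketch below extracts a (placeholder) speaker embedding per segment and groups segments with a greedy cosine-similarity threshold. Real systems would use a trained embedding network and more robust clustering; everything here is illustrative only.

```python
import numpy as np

def embed_segment(segment: np.ndarray) -> np.ndarray:
    """Placeholder for a trained speaker-embedding network (x-vector-like)."""
    rng = np.random.default_rng(abs(hash(segment.tobytes())) % (2**32))
    return rng.normal(size=128)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def diarize(segments, threshold: float = 0.6):
    """Greedy clustering: assign each segment to the closest existing speaker
    centroid if the similarity exceeds the threshold, else open a new speaker."""
    centroids, labels = [], []
    for seg in segments:
        emb = embed_segment(seg)
        sims = [cosine(emb, c) for c in centroids]
        if sims and max(sims) >= threshold:
            labels.append(int(np.argmax(sims)))
        else:
            labels.append(len(centroids))
            centroids.append(emb)
    return labels  # speaker index per segment
```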

Organizers:
• Kyu J. Han (ASAPP Inc.)
• Tae Jin Park (University of Southern California)
• Dimitrios Dimitriadis (Microsoft, WA)


Kyu Jeong Han received his PhD from USC in 2009 and is currently working for ASAPP Inc. as a Principal Speech Scientist, focusing on deep learning technologies for speech applications. Dr. Han held research positions at IBM, Ford, Capio.ai (acquired by Twilio), and JD.com. He is actively involved in the speech community as well, serving as a reviewer for IEEE, ISCA, and ACL journals and conferences and, since 2019, as a member of the Speech and Language Processing Technical Committee of the IEEE SPS. He also serves on the Organizing Committee of IEEE SLT 2020. In 2018, he won the ISCA Award for the Best Paper Published in Computer Speech & Language 2013-2017.


Tae Jin Park received his B.S. degree in electrical engineering and M.S. degree in electrical engineering and computer science from Seoul National University, Seoul, South Korea, in 2010 and 2012, respectively. In 2012, he joined the Electronics and Telecommunications Research Institute (ETRI), Daejeon, South Korea, as a researcher. He is currently a Ph.D. candidate in the Signal Analysis and Interpretation Laboratory (SAIL) at the University of Southern California (USC). He is interested in machine learning and speech signal processing, concentrating on speaker diarization.


Dr. D. Dimitriadis is a Principal Researcher at Microsoft, WA, where he leads the Federated Learning research project. He previously worked as a Researcher at IBM Research, NY, and AT&T Labs, NJ, and as a lecturer (P.D. 407/80) at the School of ECE, NTUA, Greece. He is a Senior Member of the IEEE. He was part of the Program Committee for the Multi-Learn’17 Workshop and the Organizing Committees of IEEE SLT'18 and ICASSP'23, and he has served as a session chair at multiple conferences. Dr. Dimitriadis has published more than 60 papers in peer-reviewed scientific journals and conferences, with over 1500 citations. He received his PhD degree from NTUA in February 2005 with the thesis "Non-Linear Speech Processing, Modulation Models and Applications to Speech Recognition"; his major was digital signal processing with a specialization in speech processing.
► Sunday, 25 October, 14:00–17:30

This tutorial will provide an in-depth survey of the state of the art in spoken language processing in language learning and assessment from a practitioner’s perspective. The first part of the tutorial will discuss in detail the acoustic, speech, and language processing challenges in recognizing and dealing with native and non-native speech from both adults and children from different language backgrounds at scale. The second part of the tutorial will examine the current state of the art in both knowledge-driven and data-driven approaches to automated scoring of such data along various dimensions of spoken language proficiency, be it monologic or dialogic in nature. The final part of the tutorial will look at a hot topic and key challenge facing the field at the moment – that of automatically generating targeted feedback for language learners that can help them improve their overall spoken language proficiency.
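
As a toy illustration of the data-driven scoring approach mentioned above, the sketch below computes a few simple "delivery" features from word-level ASR output and fits a least-squares linear scorer against human scores. The features and model are hypothetical simplifications, not the feature set of any operational scoring engine.

```python
import numpy as np

def delivery_features(words, audio_duration):
    """Toy 'delivery' features from word-level ASR output, where each word is
    a (token, start_sec, end_sec) tuple."""
    speech_time = sum(end - start for _, start, end in words)
    return np.array([
        len(words) / audio_duration,                         # speaking rate
        1.0 - speech_time / audio_duration,                  # pause ratio
        np.mean([end - start for _, start, end in words]),   # mean word duration
    ])

def fit_scorer(feature_matrix, human_scores):
    """Least-squares linear scorer mapping features to proficiency scores."""
    X = np.hstack([feature_matrix, np.ones((len(feature_matrix), 1))])  # bias term
    weights, *_ = np.linalg.lstsq(X, human_scores, rcond=None)
    return weights

def score(weights, features):
    return float(np.append(features, 1.0) @ weights)
```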

The presenters, based at Educational Testing Service R&D in Princeton and San Francisco, USA, have more than 40 years of combined R&D experience in spoken language processing for education, speech recognition, spoken dialog systems and automated speech scoring.

Organizers:
• Vikram Ramanarayanan (Educational Testing Service R&D)
• Klaus Zechner (Educational Testing Service R&D)
• Keelan Evanini (Educational Testing Service R&D)

Vikram Ramanarayanan is a Senior Research Scientist in the Speech and NLP Group of Educational Testing Service R&D, based out of the San Francisco office, where he is also the Office Manager. He also holds an Assistant Adjunct Professor appointment in the Department of Otolaryngology - Head and Neck Surgery at the University of California, San Francisco. His work at ETS on dialog and multimodal systems with applications to language learning and behavioral assessment won the prestigious ETS Presidential Award. Vikram's research interests lie in applying scientific knowledge to interdisciplinary engineering problems in speech, language, and vision, and in turn using engineering approaches to drive scientific understanding. He holds M.S. and Ph.D. degrees in Electrical Engineering from the University of Southern California, Los Angeles, and is a Fellow of the USC Sidney Harman Academy for Polymathic Study and a Senior Member of the IEEE. Vikram’s work has won two Best Paper awards at top international conferences and has resulted in over 75 publications in refereed international journals and conferences and 10 patents filed. Webpage: http://www.vikramr.com/.

Klaus Zechner (Ph.D., Carnegie Mellon University) is a Senior Research Scientist leading a team of speech scientists within the Natural Language Processing and Speech Group in the Research and Development Division of Educational Testing Service (ETS) in Princeton, New Jersey, USA. Since joining ETS in 2002, he has been pioneering research and development of technologies for automated scoring of non-native speech, and since 2011 he has led large annual R&D projects dedicated to the continuous improvement of automated speech scoring technology. He holds around 20 patents on technology related to SpeechRater®, an automated speech scoring system he and his team have been developing at ETS. SpeechRater is currently used operationally as a contributory scoring system, along with human raters, for the TOEFL® iBT Speaking assessment, and as the sole scoring system for the TOEFL® Practice Online (TPO) Speaking assessment and the TOEFL MOOC; it is also licensed by multiple external clients to support English language learning. Klaus Zechner has authored more than 80 peer-reviewed publications in journals, book chapters, conference and workshop proceedings, and research reports. In 2019, a book on automated speaking assessment for which he was the main editor was published by Routledge; it provides an overview of the current state of the art in automated scoring of spontaneous non-native speech. Webpage: https://www.researchgate.net/profile/Klaus_Zechner

Keelan Evanini is a Research Director at Educational Testing Service in Princeton, NJ. His research interests include automated assessment of non-native spoken English for large-scale assessments, automated feedback in computer assisted language learning applications, and spoken dialog systems. He leads a team of research scientists that conducts foundational research into automated speech scoring and spoken dialog technology.  He also leads a team of research engineers that focuses on applied engineering and capability implementation for ETS automated scoring engines.  He received his Ph.D. in Linguistics from the University of Pennsylvania in 2009 under the supervision of Bill Labov, and has worked at ETS Research since then.  He has published over 70 papers in peer-reviewed journals and conference proceedings, has been awarded 9 patents, and is a senior member of the IEEE. Webpage: http://evanini.com/keelan.html