Shehzad Mevawalla(Amazon Alexa)
Abstract:
From the early days of modern ASR research in the 1990s, one of the driving visions of the field has been a computer-based assistant that could accomplish tasks for the user, simply by being spoken to. Today, we are close to achieving that vision, with a whole array of speech-enabled AI agents eager to help users. Amazon’s Alexa pioneered the AI assistant concept for smart speaker devices enabled by far-field ASR. It currently supports billions of customer interactions per week, on over 100 million devices across multiple languages. This keynote will give an overview of the interplay between underlying speech technologies, including wakeword detection, endpointing, speaker identification, and speech recognition that enable Alexa. We highlight the complexities of combining these technologies into a seamless and robust speech-enabled user experience under large production load and real-time constraints. Interesting algorithmic and engineering challenges arise from choices between deployment in the cloud versus on edge devices, and from constraints on latency and memory versus trade-offs in accuracy. Adapting recognition systems to trending topics, changing domain knowledge bases, and to the customer’s personal catalogs adds additional complexity, as does the need to support adaptive conversational behavior (such as normal versus whispered speech). We also dive into the unique data aspects of large-scale deployments like Alexa, where a continuous stream of unlabeled data enables successful applications of weakly supervised learning. Finally, we highlight problems for the speech research community that remain to be solved before the promise of a fully natural, conversational assistant is fully realized.