Hongyi Liu (Amazon), Apurva Abhyankar (Amazon), Yuriy Mishchenko (Amazon), Thibaud Sénéchal (Amazon), Gengshen Fu (Amazon), Brian Kulis (Amazon), Noah Stein (Amazon), Anish Shah (Amazon) and Shiv Naga Prasad Vitaladevuni (Amazon)
As a crucial part of Alexa products, our on-device keyword spotting system detects the wakeword in conversations and initiates subsequent user-device interactions. Convolutional neural networks (CNNs) have been widely used to model the relationship between time and frequency in the audio spectrum. However, it is not obvious how to appropriately leverage the rich descriptive information in device-state metadata (such as player state, device type, and volume) in a CNN architecture. In this paper, we propose to use metadata as an additional input feature to improve the performance of a single CNN keyword-spotting model under different conditions. We design a new network architecture for metadata-aware end-to-end keyword spotting that learns to convert the categorical metadata to a fixed-length embedding, and then uses the embedding to: 1) modulate convolutional feature maps via conditional batch normalization, and 2) contribute to the fully connected layer via feature concatenation. Experiments show that the proposed architecture learns metadata-specific characteristics from combined datasets, and the best candidate achieves an average relative false reject rate (FRR) improvement of $14.63\%$ at the same false accept rate (FAR) compared with a CNN that does not use device-state metadata.
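The two metadata pathways described in the abstract can be illustrated with a minimal NumPy sketch. All sizes, weight initializations, and names below are hypothetical placeholders, not the paper's actual configuration: a lookup table maps a categorical device state to a fixed-length embedding, linear projections of that embedding produce per-channel scale and shift terms for conditional batch normalization of the convolutional feature maps, and the same embedding is concatenated onto the flattened activations feeding the fully connected layer.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes (not from the paper): 4 device-state categories,
# an 8-dim metadata embedding, 16 convolutional channels.
NUM_STATES, EMB_DIM, CHANNELS = 4, 8, 16

# Learned lookup table: categorical metadata -> fixed-length embedding.
embedding_table = rng.normal(size=(NUM_STATES, EMB_DIM))

# Linear projections predicting per-channel scale (gamma) and shift (beta)
# from the metadata embedding -- the "conditional" part of conditional BN.
W_gamma = rng.normal(size=(EMB_DIM, CHANNELS)) * 0.1
W_beta = rng.normal(size=(EMB_DIM, CHANNELS)) * 0.1

def conditional_batch_norm(x, state_id, eps=1e-5):
    """Normalize x of shape (N, C, T, F) per channel, then modulate with
    metadata-dependent gamma/beta instead of fixed learned parameters."""
    emb = embedding_table[state_id]          # (EMB_DIM,)
    gamma = 1.0 + emb @ W_gamma              # (CHANNELS,) scale per channel
    beta = emb @ W_beta                      # (CHANNELS,) shift per channel
    mean = x.mean(axis=(0, 2, 3), keepdims=True)
    var = x.var(axis=(0, 2, 3), keepdims=True)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma[None, :, None, None] * x_hat + beta[None, :, None, None]

# Example: a batch of 3 spectrogram feature maps under device state 2.
feats = rng.normal(size=(3, CHANNELS, 10, 20))
out = conditional_batch_norm(feats, state_id=2)

# Second pathway: concatenate the metadata embedding onto the flattened
# activations that feed the fully connected classifier head.
fc_input = np.concatenate(
    [out.reshape(3, -1), np.tile(embedding_table[2], (3, 1))], axis=1)
print(out.shape, fc_input.shape)
```

In a trained model the embedding table and the gamma/beta projections would be learned jointly with the CNN, so a single model can adapt its normalization statistics to each device state rather than requiring per-condition models.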