Weimin Wang (Amazon), Weiran Wang (Amazon.com), Ming Sun (Amazon.com), and Chao Wang (Amazon)
Acoustic Scene Classification (ASC) is a challenging task, as a single scene may involve multiple events that contain complex sound patterns. For example, a cooking scene may contain several sound sources including silverware clinking, chopping, frying, etc. What complicates ASC further is that different activity classes can have overlapping sound patterns (e.g., both cooking and dishwashing can include silverware clinking). In this paper, we propose a multi-head attention network to model the complex temporal input structure for ASC. The proposed network takes the audio's time-frequency representation as input, and it leverages standard VGG plus LSTM layers to extract a high-level feature representation. Furthermore, it applies multiple attention heads to summarize the various sound-event patterns into a fixed-dimensional representation for final scene classification. The whole network is trained in an end-to-end fashion with backpropagation. Experimental results confirm that our model discovers meaningful sound patterns through the attention mechanism, without using explicit supervision in the alignment. We evaluated our proposed model on the DCASE 2018 Task 5 dataset, and achieved competitive performance on par with the previous winner's results.
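The attention-pooling step described above can be sketched in a few lines of numpy. This is a minimal illustration, not the authors' implementation: each attention head holds one learnable query vector, scores every time frame of the VGG+LSTM feature sequence, and emits a weighted summary; the concatenated head outputs form the fixed-dimensional scene representation. All names and dimensions here are illustrative assumptions.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(x - x.max())
    return e / e.sum()

def multi_head_attention_pool(H, W):
    """Pool a variable-length feature sequence into a fixed-size vector.

    H: (T, D) frame-level features (e.g., VGG+LSTM encoder outputs)
    W: (K, D) one query vector per attention head (learned in practice;
       random here for illustration)
    Returns a (K*D,) vector: concatenation of the K head summaries.
    """
    heads = []
    for w in W:
        alpha = softmax(H @ w)    # (T,) attention weights over time frames
        heads.append(alpha @ H)   # (D,) attention-weighted summary
    return np.concatenate(heads)  # (K*D,) fixed-dimensional representation

# Toy usage: 50 frames, 16-dim features, 4 attention heads.
rng = np.random.default_rng(0)
T, D, K = 50, 16, 4
H = rng.standard_normal((T, D))
W = rng.standard_normal((K, D))
z = multi_head_attention_pool(H, W)
print(z.shape)  # fixed size (K*D,) regardless of sequence length T
```

Because each head can place its weight mass on different frames, different heads can specialize in different sound events (e.g., clinking vs. frying) within the same scene; in the full model this vector would feed a classifier layer and be trained end-to-end.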