Mon-3-4-5 Acoustic Scene Analysis with Multi-head Attention Networks

Weimin Wang (Amazon), Weiran Wang, Ming Sun and Chao Wang (Amazon)
Abstract: Acoustic Scene Classification (ASC) is a challenging task, as a single scene may involve multiple events that contain complex sound patterns. For example, a cooking scene may contain several sound sources, including silverware clinking, chopping, and frying. ASC is further complicated because classes of different activities can have overlapping sound patterns (e.g., both cooking and dishwashing may include silverware clinking). In this paper, we propose a multi-head attention network to model the complex temporal input structures for ASC. The proposed network takes the audio's time-frequency representation as input and uses standard VGG and LSTM layers to extract high-level feature representations. Furthermore, it applies multiple attention heads to summarize the various patterns of sound events into a fixed-dimensional representation for final scene classification. The whole network is trained end-to-end with backpropagation. Experimental results confirm that our model discovers meaningful sound patterns through the attention mechanism, without explicit supervision of the alignment. We evaluated the proposed model on the DCASE 2018 Task 5 dataset and achieved competitive performance, on par with the previous winner's results.
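To make the pooling step concrete, the following is a minimal sketch of how several attention heads could summarize a variable-length sequence of frame features into one fixed-dimensional vector. It is an illustration under stated assumptions, not the authors' implementation: the dot-product scoring, the per-head query vectors (`head_queries`), and the concatenation of head outputs are all hypothetical choices, and the VGG and LSTM layers are replaced by hand-written toy features.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(frames, head_queries):
    """Summarize T frame vectors into one vector of length K * D.

    frames: list of T feature vectors of length D (stand-ins for the
        recurrent-layer outputs; toy values here).
    head_queries: list of K query vectors of length D, one per head
        (an assumed parameterization, not taken from the paper).
    Returns (pooled, weights), where weights[k][t] is the attention
    that head k places on frame t.
    """
    dim = len(frames[0])
    pooled, weights = [], []
    for query in head_queries:
        # Score each frame by its dot product with this head's query.
        scores = [sum(q * h for q, h in zip(query, frame)) for frame in frames]
        alpha = softmax(scores)
        # Weighted average of the frames: this head's summary vector.
        context = [sum(a * frame[d] for a, frame in zip(alpha, frames))
                   for d in range(dim)]
        pooled.extend(context)  # concatenate the heads' summaries
        weights.append(alpha)
    return pooled, weights


# Toy example: three frames, two heads that favor different patterns.
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
head_queries = [[5.0, -5.0], [-5.0, 5.0]]
pooled, weights = attention_pool(frames, head_queries)
```

The length of `pooled` depends only on the number of heads and the feature dimension, not on the number of frames, so a classifier placed on top can accept clips of any duration. Inspecting `weights` shows which frames each head attends to, which is the kind of alignment the abstract describes as emerging without explicit supervision.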