Liwen Zhang (Harbin Institute of Technology), Jiqing Han (Harbin Institute of Technology), and Ziqiang Shi (Fujitsu Research and Development Center)
Convolutional Neural Networks (CNNs) have been widely investigated for Acoustic Scene Classification (ASC), where the convolutional operation extracts useful semantic content from a local receptive field in the input spectrogram within a certain Manhattan distance, i.e., the kernel size. Although stacking multiple convolutional layers can enlarge the receptive field, without explicitly modeling the temporal relations between different receptive fields, the enlarged range remains limited to the vicinity of the kernel. In this paper, we propose a 3D CNN for ASC, named ATReSN-Net, which can capture temporal relations of different receptive fields at arbitrary time-frequency locations by mapping the semantic features obtained from the residual block into a semantic space. The ATReSN module has two primary components: first, a k-NN-based grouper that gathers a semantic neighborhood for each feature point in the feature maps; second, an attentive pooling-based temporal relations aggregator that generates the temporal relations embedding of each feature point and its neighborhood. Experiments showed that ATReSN-Net outperforms most state-of-the-art CNN models. Our code is available at https://github.com/zlw9161/ATReSN-Net.
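The two components of the module described above can be illustrated with a minimal NumPy sketch. This is a hypothetical simplification, not the authors' implementation: feature points are flattened time-frequency positions, the k-NN grouper uses Euclidean distance in the semantic space, and the attentive pooling uses a single scoring vector `w` standing in for a trained attention layer.

```python
import numpy as np

def knn_group(points, k):
    """k-NN grouper: for each semantic feature point, gather a neighborhood.

    points: (N, C) array of semantic features at N time-frequency locations.
    Returns (N, k) indices of the k nearest neighbors in the semantic space.
    """
    # Pairwise squared Euclidean distances between all feature points.
    d2 = ((points[:, None, :] - points[None, :, :]) ** 2).sum(axis=-1)
    np.fill_diagonal(d2, np.inf)  # exclude each point from its own neighborhood
    return np.argsort(d2, axis=1)[:, :k]

def attentive_pool(points, neighbors, w):
    """Attentive pooling aggregator: embed each point with its neighborhood.

    w: (C,) hypothetical scoring vector; the real model learns this mapping.
    Returns (N, C) temporal relations embeddings.
    """
    groups = points[neighbors]                      # (N, k, C) grouped features
    scores = groups @ w                             # (N, k) attention logits
    scores -= scores.max(axis=1, keepdims=True)     # numerical stability
    att = np.exp(scores)
    att /= att.sum(axis=1, keepdims=True)           # softmax over the neighborhood
    return (att[..., None] * groups).sum(axis=1)    # attention-weighted aggregation

rng = np.random.default_rng(0)
feats = rng.normal(size=(16, 8))        # 16 feature points with 8-dim semantics
idx = knn_group(feats, k=4)             # semantic neighborhoods
emb = attentive_pool(feats, idx, rng.normal(size=8))
```

Because the neighborhoods are built from semantic similarity rather than spatial adjacency, the aggregated embedding can relate receptive fields that are far apart in time, which is the key difference from simply stacking more convolutional layers.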