Batuhan Gundogdu (Bogazici University), Bolaji Yusuf (Bogazici University), Mansur Yesilbursa (Bogazici University) and Murat Saraclar (Bogazici University)
A recent task posed by the Zerospeech challenge is the unsupervised learning of the basic acoustic units of an unknown language. Previously, we introduced recurrent sparse autoencoders fine-tuned with corresponding speech segments obtained by unsupervised term discovery. There, clustering was performed on the intermediate layer, whose nodes represent the acoustic unit assignments. In this paper, we extend this system by incorporating vector quantization and an adaptation of winner-take-all networks. This way, symbol continuity can be enforced by excitatory and inhibitory weights along the temporal axis. Furthermore, we use speaker information for speaker-adversarial training of the encoder. The ABX discriminability and low-bitrate results of our approach on the Zerospeech 2020 challenge demonstrate the effect of the enhanced continuity of the encoding brought by the temporal-awareness and sparsity techniques proposed in this work.
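To make the two ingredients named above concrete, the following is a minimal NumPy sketch, not the system described in this paper. It assumes a fixed codebook, per-frame unit scores, and a simple additive bias in which the previous frame's winner is excited and all other units are inhibited. The function names `nearest_code` and `temporal_wta` and the parameters `excite` and `inhibit` are hypothetical and chosen only for illustration.

```python
import numpy as np


def nearest_code(frames, codebook):
    """Plain vector quantization: index of the closest codebook vector per frame."""
    # Squared Euclidean distances between every frame and every code, shape (T, K).
    dists = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)


def temporal_wta(scores, excite=0.5, inhibit=0.5):
    """Winner-take-all over K units per frame with an assumed temporal bias.

    Before each argmax, the previous frame's winner receives +excite and
    every other unit receives -inhibit. This is an illustrative stand-in
    for excitatory and inhibitory weights along the temporal axis.
    """
    num_frames, num_units = scores.shape
    winners = np.empty(num_frames, dtype=int)
    winners[0] = scores[0].argmax()
    for t in range(1, num_frames):
        bias = np.full(num_units, -inhibit)
        bias[winners[t - 1]] = excite
        winners[t] = (scores[t] + bias).argmax()
    return winners


# Toy example: the middle frame weakly prefers unit 1.
scores = np.array([[1.0, 0.0], [0.4, 0.6], [0.9, 0.1]])
independent = temporal_wta(scores, excite=0.0, inhibit=0.0)  # per-frame argmax
smoothed = temporal_wta(scores, excite=0.5, inhibit=0.5)     # biased toward continuity
```

In this toy case the unbiased decision switches symbol for a single frame, while the biased one keeps the same symbol throughout, which is the kind of continuity that also lowers the bitrate of the resulting symbol sequence.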