Vinith Kishore (Samsung Research Institute Bangalore), Nitya Tiwari (Samsung Research Institute Bangalore) and Periyasamy Paramasivam (Samsung)
A deep learning based time-domain single-channel speech enhancement technique using a multilayer encoder-decoder and a temporal convolutional network is proposed for use in applications such as smart speakers and voice assistants. The technique uses an encoder-decoder with convolutional layers to obtain a representation suitable for speech enhancement, and a temporal convolutional network (TCN) based separator between the encoder and decoder to learn long-range dependencies. The technique derives inspiration from speech separation techniques that use a TCN-based separator between a single-layer encoder and decoder. We propose a multilayer encoder-decoder to obtain a noise-independent representation useful for separating clean speech from noise. We present a t-SNE-based analysis of the representations learned by different architectures to select the optimal number of encoder-decoder layers. We evaluate the proposed architectures using an objective measure of speech quality, the scale-invariant source-to-noise ratio (SI-SNR), and the word error rate (WER) obtained on a speech recognition platform. The proposed two-layer encoder-decoder architecture yields a 48% improvement in WER over unprocessed noisy data, and 33% and 44% improvements in WER over two baselines.
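The long-range modelling that the abstract attributes to the TCN separator comes from stacking dilated 1-D convolutions whose dilation doubles at each layer, so the receptive field grows exponentially with depth. The following is a minimal NumPy sketch of that idea, not the paper's implementation: the function names, single-channel setup, and kernel values are illustrative assumptions.

```python
import numpy as np

def dilated_conv1d(x, w, dilation):
    """Causal dilated 1-D convolution of a single-channel signal x
    with kernel w; the input is zero-padded so output length == input length."""
    k = len(w)
    pad = (k - 1) * dilation
    xp = np.concatenate([np.zeros(pad), np.asarray(x, dtype=float)])
    y = np.zeros(len(x))
    for t in range(len(x)):
        # taps at times t, t-dilation, t-2*dilation, ... (indices into xp)
        taps = xp[t + pad - np.arange(k) * dilation]
        y[t] = np.dot(w, taps)
    return y

def tcn_receptive_field(kernel_size, num_layers):
    """Receptive field (in samples) of a stack of dilated convolutions
    with dilations 1, 2, 4, ..., 2**(num_layers - 1)."""
    return 1 + (kernel_size - 1) * sum(2**i for i in range(num_layers))

# An impulse spreads to taps spaced `dilation` samples apart:
x = np.zeros(10)
x[0] = 1.0
print(dilated_conv1d(x, np.array([1.0, 1.0, 1.0]), dilation=2))
# Eight layers of kernel-size-3 convolutions already cover 511 samples:
print(tcn_receptive_field(3, 8))
```

This exponential growth is why a TCN separator can relate samples hundreds of frames apart while each individual layer stays small and cheap.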