Ju Lin (Clemson University), Sufeng Niu (LinkedIn), Adriaan J. van Wijngaarden (Nokia Bell Labs), Jerome L. McClendon (Clemson University), Melissa C. Smith (Clemson University) and Kuang-Ching Wang (Clemson University)
Abstract:
Speech enhancement is an essential component of robust automatic
speech recognition (ASR) systems. Most current speech enhancement methods
are based on neural networks that use either feature mapping or mask learning.
This paper proposes a novel speech enhancement method that integrates time-domain
feature mapping and mask learning into a unified framework using a Generative
Adversarial Network (GAN). The proposed framework processes the received waveform
and decouples speech and noise signals, which are fed into two short-time Fourier
transform (STFT) convolution 1-D layers that map the waveforms to spectrograms in
the complex domain. These speech and noise spectrograms are then used to compute
the speech mask loss. The proposed method is evaluated on the TIMIT data set
under both seen and unseen signal-to-noise ratio conditions. It is shown that the
proposed method outperforms both deep neural network (DNN) based speech
enhancement and the Speech Enhancement Generative Adversarial Network (SEGAN).
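The STFT layers mentioned above can be realized as 1-D convolutions with fixed (non-trainable) kernels: each output channel correlates the windowed waveform frame with one row of the DFT matrix. The sketch below illustrates this idea in NumPy; the frame length, hop size, and window choice are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

def stft_conv1d(x, n_fft=64, hop=32):
    """STFT computed by sliding fixed DFT 'filters' over the waveform,
    mimicking a 1-D convolutional layer with frozen kernels.
    n_fft and hop are illustrative defaults."""
    n_bins = n_fft // 2 + 1
    window = np.hanning(n_fft)
    k = np.arange(n_fft)
    freqs = np.arange(n_bins)[:, None]
    # Real and imaginary DFT kernels, each of shape (n_bins, n_fft),
    # windowed as is common for STFT front ends.
    real_kernels = np.cos(-2 * np.pi * freqs * k / n_fft) * window
    imag_kernels = np.sin(-2 * np.pi * freqs * k / n_fft) * window
    # "Convolution" expressed as strided frame extraction followed by
    # a dot product with the kernel bank.
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop: i * hop + n_fft] for i in range(n_frames)])
    real = frames @ real_kernels.T   # (n_frames, n_bins)
    imag = frames @ imag_kernels.T
    return real + 1j * imag          # complex spectrogram

# Each row of the result equals the rfft of the corresponding windowed frame,
# so a network can backpropagate through this layer like any other convolution.
```

Framing the STFT this way lets the spectrogram computation sit inside the network graph, so losses defined on the complex spectrograms (such as the mask loss above) can be backpropagated to the time-domain front end.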