Yi Zhang (Didi Research America), Chengyun Deng (Didi Chuxing), Shiqian Ma (Didi Chuxing), Yongtao Sha (Didi Chuxing), Hui Song (Didi Chuxing), and Xiangang Li (Didi Chuxing)
Generative adversarial networks (GANs) have become a popular research topic in speech enhancement tasks such as noise suppression. Because the noise suppression model is trained adversarially, GAN-based solutions often yield good performance. In this paper, a convolutional recurrent GAN architecture (CRGAN-EC) is proposed to address both linear and nonlinear echo scenarios. The proposed architecture is trained in the frequency domain and predicts a time-frequency (TF) mask for the target speech. Several metric loss functions are deployed, and their influence on echo cancellation performance is studied. Experimental results suggest that the proposed method outperforms existing methods for unseen speakers in terms of echo return loss enhancement (ERLE) and perceptual evaluation of speech quality (PESQ). Moreover, using multiple metric loss functions provides more freedom to achieve specific goals, e.g., more echo suppression or less distortion.
Index Terms: nonlinear echo cancellation, deep learning, generative adversarial network, convolutional recurrent network
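As an illustration of two quantities named in the abstract, the following minimal sketch shows how a TF mask can be applied to a microphone spectrogram and how ERLE is commonly computed. It is not the implementation used in this work: the function names `apply_tf_mask` and `erle_db` are hypothetical, the mask is assumed to be real-valued in [0, 1], and ERLE is taken in its usual form as the ratio of microphone power to enhanced-output power in dB, measured on far-end single-talk segments.

```python
import numpy as np


def apply_tf_mask(mic_stft, mask):
    """Apply a real-valued TF mask to a complex microphone spectrogram.

    Assumption: the mask lies in [0, 1]; values outside are clipped.
    The exact mask type used by CRGAN-EC is not specified here.
    """
    mask = np.clip(np.asarray(mask, dtype=np.float64), 0.0, 1.0)
    return mask * np.asarray(mic_stft)


def erle_db(mic, enhanced, eps=1e-12):
    """Echo return loss enhancement in dB.

    Computed as 10*log10(mean(mic^2) / mean(enhanced^2)) on time-domain
    signals; intended for segments with far-end speech only.
    eps guards against division by zero for an all-zero output.
    """
    mic = np.asarray(mic, dtype=np.float64)
    enhanced = np.asarray(enhanced, dtype=np.float64)
    mic_power = np.mean(mic ** 2) + eps
    out_power = np.mean(enhanced ** 2) + eps
    return 10.0 * np.log10(mic_power / out_power)
```

Under this definition, attenuating the echo amplitude by a factor of 10 corresponds to an ERLE of about 20 dB; higher ERLE indicates more echo suppression, while PESQ is evaluated separately to capture speech quality.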