Julius Richter (Signal Processing (SP), Universität Hamburg, Germany), Guillaume Carbajal (Signal Processing (SP), Universität Hamburg, Germany), and Timo Gerkmann (Universität Hamburg)
We consider the problem of speech modeling in speech enhancement. Recently, deep generative approaches based on variational autoencoders have been proposed to model speech spectrograms. However, these approaches rely on either hierarchical or temporal dependencies of stochastic latent variables, but not both. In this paper, we propose a generative approach to speech enhancement based on a stochastic temporal convolutional network, which combines hierarchical and temporal dependencies of stochastic latent variables. We evaluate our method on real recordings from different noisy environments. The proposed speech enhancement method outperforms a previous non-sequential approach based on feed-forward fully-connected networks in terms of speech distortion, instrumental speech quality, and intelligibility. At the same time, the computational cost of the proposed generative speech model remains feasible, owing to the inherent parallelism of the convolutional architecture.
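To illustrate the core idea of combining hierarchical and temporal stochastic dependencies, the following numpy sketch shows a stack of dilated causal convolutions where each level emits a stochastic latent variable that conditions the level above it. This is only a schematic of the general stochastic-TCN pattern under assumed dimensions; the function names (`causal_dilated_conv`, `stcn_encode`) and all weight shapes are illustrative, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def causal_dilated_conv(x, w, dilation):
    # x: (T, C_in), w: (K, C_in, C_out); left-pad so tap t only sees t' <= t.
    K = w.shape[0]
    pad = (K - 1) * dilation
    xp = np.pad(x, ((pad, 0), (0, 0)))
    T = x.shape[0]
    out = np.zeros((T, w.shape[2]))
    for t in range(T):
        for k in range(K):
            out[t] += xp[t + pad - k * dilation] @ w[k]
    return out

def stcn_encode(x, weights):
    # Hierarchical + temporal dependencies: deeper levels see wider context
    # via doubled dilation, and each sampled latent conditions the next level.
    h = x
    latents = []
    for level, (w, w_mu, w_logvar) in enumerate(weights):
        h = np.tanh(causal_dilated_conv(h, w, dilation=2 ** level))
        mu, logvar = h @ w_mu, h @ w_logvar
        # Reparameterization trick: z = mu + sigma * eps.
        z = mu + np.exp(0.5 * logvar) * rng.standard_normal(mu.shape)
        latents.append(z)
        h = np.concatenate([h, z], axis=1)  # latent conditions the level above
    return latents

# Toy dimensions (assumed for illustration only).
T, C_in, C_h, C_z, K = 16, 4, 8, 2, 3
x = rng.standard_normal((T, C_in))
weights, c = [], C_in
for _ in range(3):
    weights.append((
        0.1 * rng.standard_normal((K, c, C_h)),
        0.1 * rng.standard_normal((C_h, C_z)),
        0.1 * rng.standard_normal((C_h, C_z)),
    ))
    c = C_h + C_z  # next level's input includes the sampled latent
latents = stcn_encode(x, weights)
print([z.shape for z in latents])
```

All time steps within one convolution can be computed in parallel, which is the source of the computational efficiency the abstract mentions, in contrast to recurrent latent-variable models that must unroll step by step.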