Shogo Seki(Nagoya University), Moe Takada(Nagoya University) and Tomoki Toda(Nagoya University)
Abstract:
This paper proposes a semi-supervised method for enhancing and suppressing self-produced speech, using a variational autoencoder (VAE) to jointly model self-produced speech recorded with air- and body-conductive microphones. In speech enhancement and suppression for self-produced speech, body-conducted signals can be used as an acoustical clue since they are robust against external noise and include self-produced speech predominantly. We have previously developed a semi-supervised method taking an improved source modeling approach called the joint source modeling, which can capture a nonlinear correspondence of air- and body-conducted signals using non-negative matrix factorization (NMF). This allows enhanced and suppressed air-conducted self-produced speech to be prevented from contaminating by the characteristics of body-conducted signals. However, our previous method employs a rank-1 spatial model, which is effective but difficult to consider in more practical situations. Furthermore, joint source modeling depends on the representation capability of NMF. As a result, enhancement and suppression performances are limited. To overcome these limitations, this paper employs a full-rank spatial model and proposes a joint source modeling of air- and body-conducted signals using a VAE, which has shown to represent source signals more accurately than NMF. Experimental results revealed that the proposed method outperformed baseline methods.