Jaejin Cho (Johns Hopkins University), Piotr Zelasko (Johns Hopkins University), Jesus Villalba (Johns Hopkins University), Shinji Watanabe (Johns Hopkins University), and Najim Dehak (Johns Hopkins University)
Abstract:
Zero-shot multi-speaker Text-to-Speech (TTS) generates target
speaker voices given an input text and the corresponding
speaker embedding. In this work, we investigate the effectiveness
of the TTS reconstruction objective to
improve representation learning for speaker verification. We
jointly trained end-to-end Tacotron 2 TTS and speaker embedding
networks in a self-supervised fashion. We hypothesize
that the embeddings will contain minimal phonetic information
since the TTS decoder will obtain that information from
the textual input. TTS reconstruction can also be combined
with speaker classification to further enhance these embeddings.
Once trained, the speaker encoder computes representations for
the speaker verification task, while the rest of the TTS blocks
are discarded. We investigated training TTS from either manual
or ASR-generated transcripts. The latter allows us to train embeddings
on datasets without manual transcripts. We compared
ASR transcripts and Kaldi phone alignments as TTS inputs, and
show that the latter performed better due to their finer resolution.
Unsupervised TTS embeddings improved EER by 2.06%
absolute w.r.t. i-vectors for the LibriTTS dataset. TTS with
speaker classification loss improved EER by 0.28% and 2.88%
absolute w.r.t. a state-of-the-art method using only the speaker
loss on LibriTTS and VoxCeleb1, respectively.
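To make the joint objective described above concrete, the following is a minimal, self-contained PyTorch sketch of one training step: a speaker encoder maps the reference mel spectrogram to an embedding, a toy stand-in for the Tacotron 2 decoder reconstructs the mel from the text plus that embedding, and an optional classification head adds the speaker loss. All module names, dimensions, the toy decoder, and the weighting factor alpha are illustrative assumptions, not the authors' implementation.

    # Minimal sketch of joint TTS-reconstruction + speaker-classification training.
    # Everything here (module names, toy decoder, alpha) is a hypothetical stand-in.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SpeakerEncoder(nn.Module):
        """Maps a mel spectrogram (B, T, n_mels) to a fixed-size speaker embedding."""
        def __init__(self, n_mels=80, emb_dim=256):
            super().__init__()
            self.lstm = nn.LSTM(n_mels, emb_dim, batch_first=True)
            self.proj = nn.Linear(emb_dim, emb_dim)

        def forward(self, mel):
            out, _ = self.lstm(mel)                 # (B, T, emb_dim)
            emb = self.proj(out.mean(dim=1))        # temporal average pooling
            return F.normalize(emb, dim=-1)

    class ToyTTSDecoder(nn.Module):
        """Toy stand-in for the Tacotron 2 decoder: predicts mel frames from
        text tokens conditioned on the speaker embedding (no attention)."""
        def __init__(self, n_tokens=100, n_mels=80, emb_dim=256, hidden=512):
            super().__init__()
            self.token_emb = nn.Embedding(n_tokens, hidden)
            self.rnn = nn.LSTM(hidden + emb_dim, hidden, batch_first=True)
            self.out = nn.Linear(hidden, n_mels)

        def forward(self, text, spk_emb, n_frames):
            tok = self.token_emb(text)                                  # (B, L, hidden)
            spk = spk_emb.unsqueeze(1).expand(-1, tok.size(1), -1)      # broadcast embedding
            out, _ = self.rnn(torch.cat([tok, spk], dim=-1))
            # crude upsampling from token rate to frame rate (the real model uses attention)
            out = F.interpolate(out.transpose(1, 2), size=n_frames).transpose(1, 2)
            return self.out(out)                                        # (B, n_frames, n_mels)

    def joint_loss(spk_enc, decoder, classifier, mel, text, speaker_id=None, alpha=1.0):
        emb = spk_enc(mel)
        mel_hat = decoder(text, emb, n_frames=mel.size(1))
        loss = F.mse_loss(mel_hat, mel)             # TTS reconstruction objective
        if speaker_id is not None and classifier is not None:
            # optional supervised speaker-classification loss on the embedding
            loss = loss + alpha * F.cross_entropy(classifier(emb), speaker_id)
        return loss, emb

    # Example usage with dummy data:
    # spk_enc, dec, cls = SpeakerEncoder(), ToyTTSDecoder(), nn.Linear(256, 10)
    # mel = torch.randn(4, 120, 80); text = torch.randint(0, 100, (4, 30))
    # sid = torch.randint(0, 10, (4,))
    # loss, emb = joint_loss(spk_enc, dec, cls, mel, text, sid)

After training, only the speaker encoder would be kept to extract embeddings for verification; the decoder and classification head are discarded, as the abstract states.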