Tom Kenter (Google, UK), Manish Sharma (Google), and Robert Clark (Google, UK)
The prosody of currently available speech synthesis systems can be unnatural because the systems only have access to the text, possibly enriched by linguistic information such as part-of-speech tags and parse trees. We show that incorporating a BERT model in an RNN-based speech synthesis model, where the BERT model is pretrained on large amounts of unlabeled data and fine-tuned to the speech domain, improves prosody. Additionally, we propose a way of handling arbitrarily long sequences with BERT. Our findings indicate that small BERT models work better than big ones, and that fine-tuning the BERT part of the model is pivotal for getting good results.
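To illustrate how arbitrarily long inputs might be handled, the following is a minimal sketch that runs BERT over overlapping windows and stitches together the central, non-overlapping part of each window's output. The abstract does not specify the mechanism, so the windowing scheme itself, the window and overlap sizes, and the helper long_sequence_embeddings are illustrative assumptions, not the paper's implementation; the HuggingFace transformers API is used for concreteness.

import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")

def long_sequence_embeddings(text, window=128, overlap=32):
    """Embed a token sequence longer than BERT's maximum input length
    (hypothetical sliding-window scheme, not the paper's exact method)."""
    ids = tokenizer.encode(text, add_special_tokens=False)
    stride = window - 2 * overlap  # step between consecutive window starts
    pieces = []
    for start in range(0, len(ids), stride):
        chunk = torch.tensor([ids[start:start + window]])
        hidden = bert(input_ids=chunk).last_hidden_state[0]
        # Keep only the centre of each window, where BERT sees context on
        # both sides; at the edges of the full sequence there is no
        # neighbouring window, so keep the outer tokens as well.
        lo = 0 if start == 0 else overlap
        hi = chunk.shape[1] if start + window >= len(ids) else window - overlap
        pieces.append(hidden[lo:hi])
        if start + window >= len(ids):
            break
    return torch.cat(pieces, dim=0)  # one embedding vector per input token

In the full system, the resulting per-token embeddings would be consumed by the RNN-based synthesis model; note that no torch.no_grad() is used above, since the findings indicate that fine-tuning the BERT part of the model is pivotal, so BERT's weights would be left trainable during TTS training.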