Jason Fong (University of Edinburgh), Jason Taylor (University of Edinburgh) and Simon King (University of Edinburgh)
Accurate pronunciation is an essential requirement for text-to-speech (TTS) systems. Systems trained on raw text exhibit pronunciation errors in output speech due to ambiguous letter-to-sound relations. Without an intermediate phonemic representation, it is difficult to intervene and correct these errors. Retaining explicit control over pronunciation runs counter to the current drive toward end-to-end (E2E) TTS using sequence-to-sequence models. On the one hand, E2E TTS aims to eliminate manual intervention, especially work that requires expert skill, such as phonemic transcription of words in a lexicon. On the other, a system that makes difficult-to-correct pronunciation errors is of little practical use. Some intervention is necessary. We explore how few linguistic features are required to correct pronunciation errors in an otherwise E2E TTS system that accepts graphemic input. We use representation mixing: within each input sequence, the system accepts graphemic input, phonemic input, or a mixture of both. We quantify how little training data needs to be phonemically labelled (that is, how small a lexicon must be written) to ensure control over pronunciation. We find that modest correction is possible with 500 phonemised word types from the LJ Speech dataset, but correction works best when the majority of word types are phonemised with syllable boundaries.
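To make the idea of mixing graphemic and phonemic input within one sequence concrete, the following is a minimal sketch and not the system described here. It assumes a hypothetical toy lexicon (`TOY_LEXICON`), an illustrative phoneme inventory, a "." token standing in for a syllable boundary, and a `{...}` wrapper that keeps phoneme tokens distinct from grapheme tokens; all of these names and conventions are assumptions made for illustration.

```python
from typing import Dict, List

# Hypothetical toy lexicon: word type -> phoneme symbols. The symbols and the
# "." syllable-boundary marker are illustrative assumptions, not real data.
TOY_LEXICON: Dict[str, List[str]] = {
    "read": ["r", "eh", "d"],
    "record": ["r", "eh", ".", "k", "er", "d"],
}


def mix_representations(text: str, lexicon: Dict[str, List[str]]) -> List[str]:
    """Build one input sequence that mixes graphemes and phonemes.

    Words found in the lexicon are replaced by their phoneme tokens, each
    wrapped in braces so it cannot collide with a grapheme token. All other
    words are spelled out one character at a time.
    """
    tokens: List[str] = []
    words = text.lower().split()
    for i, word in enumerate(words):
        if word in lexicon:
            # Phonemic input for this word (assumed wrapper syntax "{ph}").
            tokens.extend("{" + p + "}" for p in lexicon[word])
        else:
            # Graphemic input: extending a list with a string adds its characters.
            tokens.extend(word)
        if i < len(words) - 1:
            tokens.append(" ")  # word boundary
    return tokens


if __name__ == "__main__":
    print(mix_representations("I read it", TOY_LEXICON))
    # ['i', ' ', '{r}', '{eh}', '{d}', ' ', 'i', 't']
```

In a sketch like this, correcting a mispronounced word at synthesis time amounts to adding that word to the lexicon, so only the words that need intervention have to be transcribed.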