Wed-1-3-7 High Quality Streaming Speech Synthesis with Low, Sentence-Length-Independent Latency

Nikolaos Ellinas(Innoetics, Samsung Electronics), Georgios Vamvoukakis(Innoetics, Samsung Electronics), Konstantinos Markopoulos(Innoetics, Samsung Electronics), Aimilios Chalamandaris(Innoetics, Samsung Electronics), Georgia Maniati(Innoetics, Samsung Electronics), Panos Kakoulidis(Innoetics, Samsung Electronics), Spyros Raptis(Innoetics, Samsung Electronics), June Sig Sung(Mobile Communications Business, Samsung Electronics), Hyoungmin Park(Mobile Communications Business, Samsung Electronics) and Pirros Tsiakoulis(Innoetics, Samsung Electronics)
Abstract: This paper presents an end-to-end text-to-speech system with low latency on a CPU, suitable for real-time applications. The system consists of an autoregressive attention-based sequence-to-sequence acoustic model and the LPCNet vocoder for waveform generation. An acoustic model architecture is proposed that adopts modules from both the Tacotron 1 and Tacotron 2 models, while stability is ensured by a recently proposed purely location-based attention mechanism suitable for arbitrary sentence-length generation. During inference, the decoder is unrolled and acoustic features are generated in a streaming manner, allowing for a nearly constant latency that is independent of sentence length. Experimental results show that the acoustic model can produce feature sequences with minimal latency, about 31 times faster than real time on a computer CPU and 6.5 times faster on a mobile CPU, meeting the requirements of real-time applications on both devices. The full end-to-end system can generate speech of almost natural quality, as verified by listening tests.
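The streaming inference described in the abstract can be illustrated with a minimal sketch: the autoregressive decoder is unrolled one step at a time, and completed frames are flushed to the vocoder in fixed-size chunks, so the latency to the first audio chunk depends only on the chunk size, not on the total sentence length. All names, the chunk size, and the toy decoder below are illustrative assumptions, not the paper's actual implementation.

```python
from typing import Iterator, List

CHUNK_FRAMES = 8  # frames handed to the vocoder at a time (illustrative choice)

def decoder_step(prev_frame: List[float]) -> List[float]:
    """Stand-in for one autoregressive decoder step producing one acoustic frame."""
    return [x + 1.0 for x in prev_frame]

def stream_features(n_frames: int) -> Iterator[List[List[float]]]:
    """Yield acoustic frames in chunks as soon as each chunk is complete."""
    frame = [0.0]
    chunk: List[List[float]] = []
    for _ in range(n_frames):
        frame = decoder_step(frame)
        chunk.append(frame)
        if len(chunk) == CHUNK_FRAMES:
            yield chunk  # vocoder (e.g. LPCNet) could start on this chunk now
            chunk = []
    if chunk:
        yield chunk  # flush the final partial chunk

# First-chunk latency is bounded by CHUNK_FRAMES regardless of sentence length:
chunks = list(stream_features(20))
```

Because each chunk is yielded as soon as it is filled, waveform generation can overlap with the remaining decoder steps, which is what keeps the end-to-end latency nearly constant.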