Adding expressive capabilities to speech synthesis.

2019 edition

Marc Freixes

With the spread of smart devices and personal assistants, synthetic voices are increasingly present in our lives. They are highly intelligible and their naturalness has improved greatly in the last few years; however, their expressiveness is still limited in some cases. The main focus of my thesis is to explore ways to add new expressive capabilities to synthetic voices.

Currently, the most popular approaches to speech synthesis rely on corpora of pre-recorded speech. Such is the case of unit selection systems, which choose small fragments of speech (units) from the corpus and join them to compose the message to be spoken. The quality delivered by these systems is very good because they use natural speech, but the expression is restricted to the style of the recorded corpus.
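The selection step described above can be illustrated with a minimal sketch. It assumes a toy corpus in which each unit carries only a phone label and an average pitch, and it uses two hypothetical cost functions (`target_cost` and `join_cost`, both plain pitch differences) that are chosen for illustration and are not taken from the thesis. The best sequence is found with dynamic programming over all candidate units.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:
    phone: str       # phonetic label of the recorded fragment
    pitch_hz: float  # average pitch of the fragment (illustrative feature)


def target_cost(unit, wanted_pitch_hz):
    # Hypothetical cost: how far the unit is from the pitch we want.
    return abs(unit.pitch_hz - wanted_pitch_hz)


def join_cost(left, right):
    # Hypothetical cost: pitch jump at the point where two units are joined.
    return abs(left.pitch_hz - right.pitch_hz)


def select_units(targets, corpus):
    """Pick one corpus unit per target so that the total cost is minimal.

    targets: list of (phone, wanted_pitch_hz) pairs.
    corpus:  list of Unit objects.
    """
    if not targets:
        return []
    candidates = [[u for u in corpus if u.phone == phone] for phone, _ in targets]
    if any(not c for c in candidates):
        raise ValueError("the corpus has no unit for at least one target phone")

    # best[i][j] = (lowest cost of a path ending in candidates[i][j], backpointer)
    best = [[(target_cost(u, targets[0][1]), None) for u in candidates[0]]]
    for i in range(1, len(targets)):
        row = []
        for u in candidates[i]:
            tc = target_cost(u, targets[i][1])
            row.append(min(
                (best[i - 1][k][0] + join_cost(prev, u) + tc, k)
                for k, prev in enumerate(candidates[i - 1])
            ))
        best.append(row)

    # Trace the cheapest path back from the last position to the first.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    path = []
    for i in range(len(targets) - 1, -1, -1):
        path.append(candidates[i][j])
        j = best[i][j][1]
    return path[::-1]
```

Real systems use much richer features and costs, but the structure is the same: a target cost keeps each unit close to what is wanted, and a join cost keeps neighbouring units compatible.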

Starting from a neutral speech corpus, our proposal aims to add expressive capabilities for a storytelling style without having to record new samples. Specifically, we have focused on increasing suspense and on singing. The proposal rests on two basic pillars. On the one hand, the parametrization of the speech in the corpus to allow its transformation, for example to change the tempo, the intonation or even the timbre. On the other hand, the definition of a model to guide the transformation from neutral speech to the desired style.

In the case of suspense, a set of rules for intonation and tempo has been defined from a very small set of sentences uttered by a professional storyteller. The results have been positively rated regarding naturalness and storytelling resemblance; however, suspense arousal could be improved by incorporating timbre transformations.
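As a rough illustration of what a rule acting on intonation and tempo might look like, the sketch below scales a pitch contour and a list of durations. The function name, the two scaling factors and the convention that a pitch value of 0 marks an unvoiced frame are all assumptions made for this example; the actual rules derived from the storyteller's sentences are not reproduced here.

```python
def apply_prosody_rule(f0_hz, durations_s, pitch_scale, tempo_scale):
    """Return a transformed (pitch contour, durations) pair.

    f0_hz:       pitch values in Hz, one per frame; 0 marks an unvoiced frame
                 (an assumed convention for this sketch).
    durations_s: segment durations in seconds.
    pitch_scale: factor applied to voiced pitch values (below 1 lowers the voice).
    tempo_scale: speaking-rate factor (below 1 slows down, so durations grow).
    """
    if pitch_scale <= 0 or tempo_scale <= 0:
        raise ValueError("scaling factors must be positive")
    new_f0 = [f * pitch_scale if f > 0 else 0.0 for f in f0_hz]
    new_durations = [d / tempo_scale for d in durations_s]
    return new_f0, new_durations
```

For instance, calling it with `pitch_scale=0.9` and `tempo_scale=0.8` would lower the voice slightly and slow the delivery; these values are purely illustrative.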

With regard to singing, recording singing samples from the original speaker can be infeasible if the speaker is inaccessible or unable to sing properly. As an alternative, we have taken advantage of approaches that produce singing from speech, following the so-called speech-to-singing conversion. Although the singing obtained by our approach is not as natural as that delivered by a professional singing synthesiser, it could reasonably address occasional singing needs.

Corpus-based approaches are the most widespread methods, especially in commercial speech synthesisers. However, speech can also be generated by simulating the physics involved in human voice production. When we utter a vowel, the vocal folds vibrate, generating an acoustic wave that travels through the vocal tract until it reaches the lips. This process can be simulated on a computer by solving the equations that describe the wave propagation in a 3D model of a vocal tract obtained from Magnetic Resonance Imaging (MRI). In order to add expressivity to the voice generated with numerical simulation, we have started studying the influence of phonation, i.e. how the vocal folds vibrate.
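The 3D simulation itself is beyond the scope of a short example, but the underlying idea, that the shape of the vocal tract determines which frequencies are reinforced, can be illustrated with a much simpler classical approximation: a uniform tube closed at the vocal folds and open at the lips, whose resonances are F_n = (2n - 1) c / (4L). This is not the method used in the thesis; the tube length and speed of sound below are typical textbook values assumed for the example.

```python
def tube_resonances(length_m, n_resonances=3, speed_of_sound=350.0):
    """Resonance frequencies (Hz) of a uniform tube closed at one end.

    length_m:       tube length in metres (about 0.175 m is a common
                    textbook value for an adult vocal tract; assumed here).
    n_resonances:   how many resonances to return.
    speed_of_sound: in m/s (350 m/s is an assumed round value for warm air).
    """
    if length_m <= 0:
        raise ValueError("tube length must be positive")
    return [
        (2 * n - 1) * speed_of_sound / (4.0 * length_m)
        for n in range(1, n_resonances + 1)
    ]


if __name__ == "__main__":
    # Resonances of a 17.5 cm tube, close to those of a neutral vowel.
    print(tube_resonances(0.175))
```

With these values the first three resonances fall near 500, 1500 and 2500 Hz. A realistic vocal tract geometry shifts them, which is what distinguishes one vowel from another.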