Background to Text-to-Speech

Consider what is required for speech-based communication to work. A speaker decides to utter a word, contracts their diaphragm to pull air into their lungs, and, upon exhaling, pushes the air back out through their vocal tract. They then contort their vocal tract in highly specific ways to reach a series of articulatory targets that they have associated with a particular meaning. The flow of air past these orchestrated contortions causes pressure fluctuations at varying frequencies that, upon reaching the listener's eardrums, are processed and understood to represent the same meaning the speaker intended. Magic!

The idea of creating machines to simulate speech dates back at least to the 18th century, when Hungarian inventor Wolfgang von Kempelen created his 'speaking machine' to wow crowds. Speech synthesis has since made tremendous gains and is now employed to solve a variety of real-world problems. While von Kempelen's machine attempted to replicate the anatomy required for speech, modern techniques use computers to work with discrete representations of sound. Over the last decade, improvements to speech synthesis have come in tandem with progress in neural network-based machine learning.

We intend to update this section with a variety of resources to help provide background on Text-to-Speech (TTS). In the meantime, please visit this excellent TTS primer from the NVIDIA NeMo TTS toolkit. If you are interested in more in-depth learning about TTS and speech processing, we recommend the Speech Processing and Speech Synthesis courses on Speech Zone.