Speech Recognition and Text-to-Speech

Although in most cases an IVR system presents prerecorded prompts to the caller and accepts input by way of the dialpad, it is also possible to: a) generate prompts artificially, popularly known as text-to-speech; and b) accept verbal inputs through a speech recognition engine.

While the concept of being able to have an intelligent conversation with a machine is something sci-fi authors have been promising us for many long years, the actual science of this remains complex and error-prone. Despite their amazing capabilities, computers are ill-suited to the task of appreciating the subtle nuances of human speech.

Having said that, it should be noted that over the last 50 years or so, amazing advances have been made in both text-to-speech and speech recognition. A well-designed system created for a very specific purpose can work very well indeed.

Despite what the marketing people will say, your computer still can’t talk to you, and you need to bear this in mind if you are contemplating any sort of system that combines your telephone system with these technologies.


Text-to-Speech

Text-to-speech (also known as speech synthesis) requires that a system be able to artificially construct speech from stored data. While it would be nice if we could simply assign a sound to each letter and have the computer produce those sounds as it reads the letters, written English is not entirely phonetic.
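
To make the problem concrete, here is a minimal Python sketch of the naive one-sound-per-letter approach. The table LETTER_SOUNDS and the helper names are hypothetical, invented purely for illustration; they do not describe how any real synthesis engine works. The sketch shows that such a scheme necessarily gives the letters “ough” the same rendering in “though,” “through,” and “cough,” even though a human reader pronounces them three different ways.

```python
# Hypothetical toy table: exactly one fixed sound per letter.
# Only the letters needed for the example words are included.
LETTER_SOUNDS = {
    "c": "k", "g": "g", "h": "h", "o": "o",
    "r": "r", "t": "t", "u": "u",
}


def naive_pronounce(word):
    """Render a word one letter at a time using the fixed table."""
    return [LETTER_SOUNDS[ch] for ch in word.lower()]


def suffix_renderings(words, suffix):
    """Collect the distinct ways the naive scheme renders a shared suffix."""
    n = len(suffix)
    return {
        tuple(naive_pronounce(w)[-n:])
        for w in words
        if w.lower().endswith(suffix)
    }


words = ["though", "through", "cough"]
renderings = suffix_renderings(words, "ough")

for w in words:
    print(w, "->", "-".join(naive_pronounce(w)))

# A single rendering for all three words: the table cannot tell them apart.
print("distinct renderings of 'ough':", len(renderings))
```

Because the table has no notion of context, it cannot produce different sounds for the same letters, which is part of why speech synthesis engines typically rely on pronunciation dictionaries and context-aware rules rather than a simple letter table.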

While the idea of a speaking computer is very attractive on the surface, in practice it has limited usefulness. More information about integration of text-to-speech with Asterisk can be found in Chapter 18, External Services.

Speech Recognition

As soon as we’ve convinced computers to talk to us, we will naturally want to be able to talk to them.[153] Anyone who has tried to learn a foreign language can begin to recognize the complexity of teaching a computer to understand words; however, speech recognition also has to take into account the fact that before a computer can even attempt the task of understanding the words, it must first convert the audio into a digital format. This challenge is larger than one might at first think. For example, as humans we are naturally able to recognize speech as distinct from, say, the sound of a barking dog or a car horn. For a computer, making that distinction is very complicated. Additionally, for a telephone-based speech recognition system, the audio that is received is always going to be of very low fidelity, and thus the computer will have that much less information to work with.[154]
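
A rough bit of arithmetic shows how little information a telephone call carries. The sketch below assumes a traditional narrowband call sampled at 8,000 Hz with 8 bits per sample (as in G.711) and compares it with CD-quality audio (44,100 Hz, 16 bits, stereo). These figures are assumptions brought in for illustration, and wideband codecs would change them; the function names are likewise illustrative only.

```python
def nyquist_limit_hz(sample_rate_hz):
    """Highest frequency a given sample rate can represent (half the rate)."""
    return sample_rate_hz / 2


def bit_rate_kbps(sample_rate_hz, bits_per_sample, channels=1):
    """Uncompressed bit rate in kilobits per second."""
    return sample_rate_hz * bits_per_sample * channels / 1000


# Assumed formats: narrowband telephony (G.711-style) versus CD audio.
TELEPHONE = {"sample_rate_hz": 8000, "bits_per_sample": 8, "channels": 1}
CD_AUDIO = {"sample_rate_hz": 44100, "bits_per_sample": 16, "channels": 2}

for name, fmt in (("telephone", TELEPHONE), ("CD audio", CD_AUDIO)):
    limit = nyquist_limit_hz(fmt["sample_rate_hz"])
    rate = bit_rate_kbps(**fmt)
    print(f"{name}: up to {limit:.0f} Hz, {rate:.1f} kbit/s")
```

Under these assumptions, the telephone channel can represent frequencies only up to 4,000 Hz and carries 64 kbit/s, against roughly 1,411 kbit/s for CD audio, so a recognition engine on a phone line starts with a small fraction of the raw signal.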

Asterisk does not have speech recognition built in, but there are many third-party speech recognition packages that integrate with Asterisk.

[153] Actually, most of us talk to our computers, but this is seldom polite.

[154] If the speech recognition has to happen from a cell phone in a noisy conference hall, it becomes near-impossible.