Speech Enabled IVR systems in 2022

We’ve come a long way with interactive voice response (IVR) technology, as well as with speech recognition and natural language processing. Today’s call centers should all be taking advantage of speech enabled IVRs that understand natural language.

IVRs with speech recognition allow you to interact with your customers or callers more cheaply and efficiently. An agent-handled call is almost always more expensive than an IVR-handled call. However, not all calls can be completely handled by an IVR (even a sophisticated speech IVR). Often, a caller in a speech recognition IVR may ask to speak with an agent. Even in these situations, the speech recognition IVR can collect information from the caller that is helpful to the agent, reducing the agent's handle time.

Three technologies are generally used to implement modern speech enabled IVRs. The great news is that you don’t need to be an expert on any of these technologies to implement highly sophisticated IVRs, because a number of vendors do all the work for you. However, you should have a basic understanding of these technologies so that you can make the best choice of vendor.

Text-to-Speech (TTS) Processing

The first building block of speech enabled IVR systems is text-to-speech processing (or “TTS”). This technology has come a long way in just a few years. We are currently blessed with a wide range of high-quality TTS providers, including providers that integrate easily with most call center platforms. For example, the open source Asterisk and FreeSWITCH PBXs each come with modules allowing call center administrators to easily enable text-to-speech processing.

In general, text-to-speech processing takes text inputs and produces natural sounding human voice outputs. Previously, generating natural sounding human voices required substantial tuning of voice models and specification of how text should be read by the models. Now, most text-to-speech systems simply require a text input (preferably with punctuation). Text can be read in many different voice styles and languages.

TTS processing is an important building block of speech enabled IVRs because TTS is used to generate the prompts in the IVR. The IVR prompt that asks “How can I help you today?” is generated using TTS processing.

To hear a sample, try out the IBM Watson Text to Speech demo.

A large number of providers integrate with call center platforms to provide excellent text-to-speech processing. Contact us to get recommendations of some of the better providers.

Speech-to-Text (STT) Processing

The second building block of speech enabled IVRs is speech-to-text (STT) processing, essentially the inverse of TTS processing. STT processing converts spoken language (e.g., the words from a caller) into text so that the text can be processed. Speech-to-text processing may also be referred to as automatic speech recognition (ASR). As with TTS processing, speech-to-text processing has come a long way and is now a simple add-on to most call center platforms.

An important aspect of STT processing is how the voice is recorded or streamed for conversion to text. For example, if an IVR is simply asking for one- or two-word answers (like, please say “operator” or “balance”), then the caller's utterance can be recorded and then processed using an STT module. If more words or sentences need to be processed, the voice may be streamed (e.g., via a WebSocket or MRCP connection) to a server for further processing.

You do not need to master these details, however, because a number of excellent vendors make STT processing easy.

Once a caller's voice has been converted to text, the text can be analyzed or manipulated to cause the IVR to perform further processing. One way the text may be analyzed is with Natural Language Processing.
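The hand-off from speech-to-text to text analysis can be pictured as two pluggable stages. The following is only a minimal sketch under stated assumptions: the names `handle_turn`, `TurnResult`, `fake_stt`, and `fake_analyze` are hypothetical, and the two "fake" functions are stand-ins for whatever vendor STT and analysis services a call center actually uses.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical stage signatures; a real deployment would wrap a vendor SDK here.
SpeechToText = Callable[[bytes], str]   # caller audio -> transcript
TextAnalyzer = Callable[[str], str]     # transcript -> next IVR action


@dataclass
class TurnResult:
    transcript: str
    action: str


def handle_turn(audio: bytes, stt: SpeechToText, analyze: TextAnalyzer) -> TurnResult:
    """Run one caller turn: transcribe the audio, then decide what to do next."""
    transcript = stt(audio).strip()
    if not transcript:
        # Nothing usable was recognized, so ask the caller to repeat.
        return TurnResult(transcript="", action="reprompt")
    return TurnResult(transcript=transcript, action=analyze(transcript))


# Stand-ins used only so the sketch runs without a real speech service.
def fake_stt(audio: bytes) -> str:
    # Pretend the audio bytes are already the spoken words.
    return audio.decode("utf-8", errors="ignore")


def fake_analyze(text: str) -> str:
    return "play_balance" if "balance" in text.lower() else "transfer_to_agent"


if __name__ == "__main__":
    print(handle_turn(b"What is my balance", fake_stt, fake_analyze))
```

Keeping the two stages behind simple function signatures is one way to make it easy to swap vendors later without touching the call flow itself.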

Natural Language Processing (NLP)

Natural language processing (NLP) is a branch of artificial intelligence focused on giving computers the ability to understand text and spoken words in much the same way a human does.

For example, in a speech enabled IVR, a software program is configured to capture the caller's speech (e.g., as a recording or as a stream of encoded audio), convert the speech to text (using speech-to-text processing as described above), and then process that text to try to determine the caller's intent. When you call your bank and an IVR prompts you to say what you are calling about, the IVR is likely converting your speech to text and then performing some further processing on the text. In some simple voice IVRs, the IVR may simply be checking whether you said a specific word, like “operator” or “balance”, and then routing you to the appropriate department based on that detected word.
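The simple word-detection case can be sketched in a few lines. This is an illustration only, not any vendor's implementation: the routing table `KEYWORD_ROUTES`, the destination names, and the function `route_by_keyword` are all hypothetical, and the input is assumed to be a transcript already produced by an STT step.

```python
import re

# Hypothetical keyword-to-destination table; the names are illustrative only.
KEYWORD_ROUTES = {
    "operator": "agent_queue",
    "balance": "balance_menu",
}


def route_by_keyword(transcript: str, default: str = "main_menu") -> str:
    """Return the destination for the first known keyword in the transcript."""
    # Lowercase and split into words so "Balance." and "balance" both match.
    words = re.findall(r"[a-z']+", transcript.lower())
    for word in words:
        if word in KEYWORD_ROUTES:
            return KEYWORD_ROUTES[word]
    # No known keyword was spoken, so fall back to a default destination.
    return default


if __name__ == "__main__":
    print(route_by_keyword("Operator, please"))
```

A lookup like this only works when callers say one of the expected words, which is the limitation that free-form speech handling is meant to address.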

But what if the IVR allows you to speak in a more free-form way? What if the IVR prompts you to “tell us what you are calling about”? A user may say a sentence or more of unstructured words (words that don't fit into a simple classification like “operator” or “balance”) and the IVR needs to discern some intent. Natural language processing allows the IVR to perform such processing.

The result of NLP may be a classification of what the caller wants. For example, the caller may say something like “I really need to speak to someone about an unexpected charge.” The NLP engine may classify the caller's intent as “operator” even though the caller never said the word “operator”. As another example, the caller may say something like “This is outrageous, I've been on hold for 20 minutes” and the NLP engine may determine that the caller's sentiment is highly negative and may cause the caller to be connected to an agent trained in de-escalating irate customers.
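To make the idea of intent and sentiment outputs concrete, here is a deliberately simplified sketch. It is not how a real NLP engine works: a production engine would typically use a trained model, whereas this toy version uses hard-coded cue phrases. The cue lists, the queue names, and the `analyze` function are all assumptions made for illustration.

```python
from dataclasses import dataclass

# Hypothetical cue lists standing in for a trained model; a real NLP engine
# would learn these associations from labeled examples, not hard-code them.
INTENT_CUES = {
    "operator": ("speak to someone", "talk to a person", "unexpected charge"),
    "balance": ("how much", "balance", "funds available"),
}
NEGATIVE_CUES = ("outrageous", "ridiculous", "on hold for", "unacceptable")


@dataclass
class Analysis:
    intent: str
    sentiment: str
    destination: str


def analyze(utterance: str) -> Analysis:
    """Classify a transcript into an intent, a sentiment, and a destination."""
    text = utterance.lower()
    # Score each intent by how many of its cue phrases appear in the utterance.
    scores = {
        intent: sum(cue in text for cue in cues)
        for intent, cues in INTENT_CUES.items()
    }
    best = max(scores, key=lambda name: scores[name])
    intent = best if scores[best] > 0 else "unknown"
    sentiment = "negative" if any(cue in text for cue in NEGATIVE_CUES) else "neutral"
    # Irate callers go to a de-escalation queue regardless of the detected intent.
    if sentiment == "negative":
        destination = "deescalation_agent"
    elif intent == "operator":
        destination = "agent_queue"
    elif intent == "balance":
        destination = "balance_menu"
    else:
        destination = "main_menu"
    return Analysis(intent, sentiment, destination)


if __name__ == "__main__":
    print(analyze("I really need to speak to someone about an unexpected charge."))
```

The point of the sketch is the shape of the result: the caller never has to say “operator” for the intent to come back as “operator”, and sentiment can override the routing decision.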

A number of excellent NLP resources are available online for those interested in learning more. For call centers looking to implement NLP in their IVRs, many service providers make adoption of this technology easy. Contact us and we can recommend services that match your needs.