Basics of Conversational AI

What should you know about conversational AI applications? If you're new to conversational AI, be sure to read this article before you throw yourself into creating your own!

When you talk to a machine, it's usually not (yet) the same experience as chatting with a good friend. Even though AI researchers and developers all over the world strive to make conversational AI as human-like as possible, technology still has certain limitations that are reflected in the interaction. On the other hand, as the use of "talking machines" increases, people are getting used to the differences and keep adjusting their expectations for future interactions.

So what can you and your end users generally expect from a conversational voice application, such as one created in Flowstorm? What are the basic principles and general shortcomings?

Different components take care of different tasks.

What happens when you're talking to a person? In a nutshell: when they say something, your ears catch the sound waves. Your brain then analyzes the sound and interprets the meaning of the resulting words; next, it comes up with the most suitable reaction. Finally, your speech organs articulate the message. Conversational systems have these different organs/components, too, although they are a bit more independent:

  • ASR (Automatic Speech Recognition) is like the system's ears (and partially the brain). It's a component that detects the sound wave and transcribes it into words.

  • NLP (Natural Language Processing) algorithms are the brain. They analyze the meaning of the words and, based on the context, generate the most suitable reaction in the form of written text.

  • TTS (Text-To-Speech) is the system's speech organ. It takes the written text and "reads it out loud", in a specific voice.

  • And, following this analogy, the whole architecture interconnecting all the other components would be the body.

It's important to keep this split in mind, especially when analyzing the mistakes of a bot. Problems may originate from any of these components. It's not always the fault of the brain - perhaps the ASR just misheard what the user said!
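To make the split more tangible, here is a minimal Python sketch of a single turn flowing through the three components. All of the function names are illustrative placeholders, not Flowstorm's or any real library's API; real ASR, NLP, and TTS would be external services or models.

```python
# Minimal sketch of one conversational turn: ASR -> NLP -> TTS.
# All functions below are hypothetical placeholders for real components.

def recognize_speech(audio: bytes) -> str:
    """ASR, the 'ears': transcribe a sound wave into words (placeholder)."""
    return "what's the weather like"

def generate_reply(text: str, context: dict) -> str:
    """NLP, the 'brain': pick a suitable reaction from text and context (placeholder)."""
    if "weather" in text:
        return "I can't see outside, but I hope it's sunny!"
    return "Sorry, I didn't catch that."

def synthesize_speech(text: str) -> bytes:
    """TTS, the 'speech organ': read the text out loud in a specific voice (placeholder)."""
    return text.encode("utf-8")  # a real TTS engine would return audio data

def handle_turn(audio_in: bytes, context: dict) -> bytes:
    transcript = recognize_speech(audio_in)           # ears
    reply_text = generate_reply(transcript, context)  # brain
    return synthesize_speech(reply_text)              # speech organ

if __name__ == "__main__":
    handle_turn(b"<audio bytes>", context={})
```

A mistake anywhere in this chain shows up as a strange reply, which is why it pays off to check each step separately: if the ASR returns the wrong transcript, even a perfect "brain" will answer the wrong question.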

Conversations are based on the so-called "turn-taking principle".

This means that the bot and the user take turns in saying stuff. Each line enters the conversation as a whole, at once: much like in written chat, you receive the message only when it is finished. So while you're talking, the bot usually doesn't analyze your input in real time; it waits until you finish. You can imagine the succession like this:

  1. The bot says something, then starts listening.

  2. The user says something.

  3. The bot stops listening, analyzes the response, then says something, and then starts listening.

  4. etc.

Most current voice systems work in the so-called half-duplex (semi-duplex) mode: the bot is either speaking or listening (usually signaling the transition between the two states somehow). The other listening mode is full-duplex, where the bot listens to you even while it is speaking, so you can barge in just by starting to talk.

Full-duplex mode can also be used on certain devices that run Flowstorm apps.
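For illustration, here is a minimal sketch of such a half-duplex loop, with text input and output standing in for audio and, again, purely illustrative helper names rather than an actual Flowstorm API.

```python
# Sketch of a half-duplex turn-taking loop: the bot is either speaking
# or listening, never both, and the user's whole turn is analyzed only
# after it ends. Text I/O stands in for audio here.

def speak(text: str) -> None:
    print(f"BOT: {text}")                  # "speaking" state

def listen() -> str:
    return input("YOU: ").strip().lower()  # "listening" state; returns the whole turn at once

def respond(user_turn: str) -> str:
    return "Goodbye!" if user_turn in ("bye", "quit") else f"You said: {user_turn}"

def run_conversation() -> None:
    reply = "Hi! I'm listening."
    while True:
        speak(reply)                 # 1. the bot says something, then starts listening
        user_turn = listen()         # 2. the user says something
        reply = respond(user_turn)   # 3. the bot analyzes the finished turn and replies
        if reply == "Goodbye!":
            speak(reply)
            break

if __name__ == "__main__":
    run_conversation()
```

In a full-duplex system, the listening would instead run continuously in the background, even while the bot is speaking, so the user could interrupt at any moment.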

The interaction can be multimodal.

A conversational application doesn't have to be just about words. Very often, you can interact via voice, text, or touch; and you can expect not only to hear the bot's voice but also to listen to sounds, see images, watch videos, or even talk to a 3D avatar!
