Alex is curious about her new companion robot, Miko.
Alex: Hello Miko, can you tell me a fun fact about planets?
Miko: Up and above, into the space we go! Jupiter is so large that about 1,300 Earths could fit inside it. It is the largest planet in the solar system. That space fact rocked!
And the conversation continues, with Miko sharing more exciting facts about space.
This is not an imagined scene from a science fiction movie; it is reality today. Thanks to technologies like artificial intelligence, machine learning, natural language processing, and speech recognition and synthesis, robots can now hold meaningful, continuous and emotionally intelligent conversations with humans. But how does this happen? Let us break down the action behind the scenes to decode the seemingly smooth journey from Alex’s question to Miko’s response.
Real-time semantic understanding of phrases, sentences and paragraphs is the foundation for holding seamless conversations with your bot companion. Any natural language input, also called the input query, must first be processed, analyzed and mapped to a structured representation. This is called semantic parsing. This structured representation of the input query is then studied further to decipher the intent - this is called intent inference. In natural language processing, the intent of a query refers to the underlying purpose or goal that a user wants to achieve with their input. For example, in the case of Alex, the intent of her query - “Can you tell me a fun fact about planets?” - could be to satisfy her curiosity and learn something fun about space.
Once the intent of the input query is inferred, an appropriate, personalized and emotionally intelligent response can be framed. Miko correctly interpreted Alex’s query and hence came up with an appropriate response that tells Alex what she wants to know. So, broadly, the journey from a natural language query to a response involves three steps - 1) Semantic Parsing, 2) Intent Inference and 3) Response Formulation. We will now delve deeper into each of these.
Semantic parsing involves analyzing the meaning of a natural language sentence or phrase and mapping it to a formal representation. To ensure high accuracy and consistency in semantic parsing systems, the input queries as well as the training data need to be preprocessed. Preprocessing cleans and prepares the natural language text data for accurate semantic analysis.
Preprocessing typically involves text normalization: conversion to lowercase and the removal of punctuation, special characters, common filler phrases like “Could you please tell me” or “I would like to know”, and wake words such as “Hi Miko” or “Hola Miko”. These elements do not add to the semantics, so removing them reduces variance and yields higher confidence scores in the ensuing steps.
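A minimal Python sketch of such a normalization step might look like this; the wake-word and filler-phrase lists here are small illustrative stand-ins for the larger inventories a production system would maintain:

```python
import re

# Illustrative lists only; a real system would maintain much larger,
# language-specific inventories of wake words and filler phrases.
WAKE_WORDS = ["hi miko", "hola miko"]
FILLER_PHRASES = ["could you please tell me", "i would like to know"]

def normalize(query: str) -> str:
    """Lowercase the query and strip wake words, filler phrases and punctuation."""
    text = query.lower()
    for phrase in WAKE_WORDS + FILLER_PHRASES:
        text = text.replace(phrase, " ")
    text = re.sub(r"[^\w\s]", " ", text)       # drop punctuation and special characters
    return re.sub(r"\s+", " ", text).strip()   # collapse leftover whitespace

print(normalize("Hi Miko, could you please tell me a fun fact about planets?"))
# -> "a fun fact about planets"
```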
After preprocessing, the natural language queries need to be transformed into sentence embeddings, which are dense vector representations of sentences that capture their semantic meaning. An embedding is a fixed-length representation that captures the essential features or characteristics of variable-length input data. These embeddings are often learned by training a neural network on a large corpus of text data. During this training process, the network learns to map each word or sentence to a fixed-length vector in such a way that semantically similar words or sentences are mapped to similar vectors.
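As an illustration, the sentence-transformers library wraps this mapping in a couple of lines; the model named below is one commonly used choice, not necessarily the one a given robot runs:

```python
from sentence_transformers import SentenceTransformer

# all-MiniLM-L6-v2 maps any sentence to a 384-dimensional vector.
model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = [
    "a fun fact about planets",       # Alex's preprocessed query
    "what is the weather like today",
]
embeddings = model.encode(sentences)  # numpy array of shape (2, 384)
print(embeddings.shape)
```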
Embedding extraction is an important step in many NLP tasks, such as text classification, sentiment analysis, and language translation. By representing text data as embeddings, machine learning algorithms can analyze and manipulate the text more efficiently and effectively. In the case of multilingual robot-child interaction, the embedding extraction needs to be language agnostic, that is, sentences that mean the same in different languages should have similar embeddings. For example, “How are you?” and “Comment allez-vous ?” (French) are semantically similar, and hence should map to similar embeddings.
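To sketch the multilingual case, a multilingual embedding model (the one named below is one publicly available option, assumed here for illustration) should place the English and French sentences close together:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

en = model.encode("How are you?", convert_to_tensor=True)
fr = model.encode("Comment allez-vous ?", convert_to_tensor=True)

# Semantically equivalent sentences in different languages should land
# close together in a language-agnostic embedding space.
print(util.cos_sim(en, fr))  # close to 1.0 for a well-trained model
```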
In intent inference, the goal is to identify the underlying intent or meaning of a user's query or statement. This is typically done by comparing the query against a database of known intents and selecting the most similar intent based on a similarity score. Similarity search relies on metrics such as cosine similarity and Euclidean distance to calculate the similarity score between the user's query and each intent in the database. The queries, in this case, are the sentence embeddings generated during semantic parsing, and each intent in the database can likewise be represented as a vector. By calculating the similarity score between the query vector and each intent vector, it becomes possible to identify the most similar intent and thus infer the user's intent.
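As a minimal sketch, assuming the intent database is a small in-memory dictionary of hypothetical intent names mapped to embeddings, the matching step can look like this:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors; 1.0 means identical direction."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def infer_intent(query_vec: np.ndarray, intent_vecs: dict):
    """Return the intent whose vector is most similar to the query vector.

    intent_vecs maps hypothetical intent names (e.g. "space_fact",
    "weather") to embeddings produced the same way as the query embedding.
    """
    scores = {name: cosine_similarity(query_vec, vec)
              for name, vec in intent_vecs.items()}
    best = max(scores, key=scores.get)
    return best, scores
```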
When dealing with a large database of intents, it becomes necessary to employ techniques such as clustering or hierarchical search in order to scale the intent inference process. This is because searching through a large database of intents can be computationally expensive, and clustering or hierarchical search can help to group similar intents together and reduce the number of comparisons required to find the most relevant intent for a given query.
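One way to scale this, sketched below with the FAISS library, is an inverted-file index that clusters the intent vectors and searches only the few clusters nearest the query; the vectors here are random placeholders just to show the mechanics:

```python
import faiss
import numpy as np

d = 384                                                        # embedding dimension
intent_vectors = np.random.rand(100_000, d).astype("float32")  # placeholder data

# IndexIVFFlat partitions the vectors into nlist clusters at training time,
# then searches only the nprobe clusters closest to the query.
quantizer = faiss.IndexFlatL2(d)
index = faiss.IndexIVFFlat(quantizer, d, 256)  # 256 clusters
index.train(intent_vectors)
index.add(intent_vectors)
index.nprobe = 8  # trade a little recall for much faster search

query = np.random.rand(1, d).astype("float32")
distances, ids = index.search(query, 5)  # ids of the 5 nearest intents
```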
After performing intent inference on a user's query, the next step is to formulate a response that is appropriate and relevant to the inferred intent. This can involve a few different steps, depending on the specific application and the nature of the intent.
One common approach is to use pre-written responses that are associated with each intent in the database. For example, if the intent is to ask about the weather, the system might have a set of pre-written responses that provide information about the current weather conditions.
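In its simplest form this is just a lookup table keyed by intent; the intents and responses below are hypothetical examples:

```python
import random

# Hypothetical intent-to-response table; a real system would store many
# variants per intent so conversations do not feel repetitive.
CANNED_RESPONSES = {
    "space_fact": [
        "Jupiter is so large that about 1,300 Earths could fit inside it!",
        "A day on Venus is longer than its year!",
    ],
    "greeting": ["Hello! What would you like to explore today?"],
}

def respond(intent: str) -> str:
    """Pick a random pre-written response for the inferred intent."""
    fallback = ["Hmm, I don't know that one yet. Want to try another question?"]
    return random.choice(CANNED_RESPONSES.get(intent, fallback))
```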
Another approach is to generate a response on the fly, based on the intent and any additional context or information available. This can involve using natural language generation (NLG) and generative AI techniques to create a response that is tailored to the specific user and their query.
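As a rough illustration of the generative route, a small open text-generation model can be prompted with the inferred intent and the original query; the model choice and prompt format here are assumptions, not what any particular robot uses:

```python
from transformers import pipeline

# gpt2 is a small, freely available model; a production system would use a
# far more capable (and safety-filtered) model.
generator = pipeline("text-generation", model="gpt2")

def generate_response(intent: str, query: str) -> str:
    """Generate a reply conditioned on the inferred intent and the query."""
    prompt = (
        "A friendly robot answers a child's question.\n"
        f"Intent: {intent}\nChild: {query}\nRobot:"
    )
    output = generator(prompt, max_new_tokens=40, num_return_sequences=1)
    # The pipeline returns the prompt plus the continuation; keep only the latter.
    return output[0]["generated_text"][len(prompt):].strip()
```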
In either case, the response should be relevant, accurate, and expressed in a way that is understandable to the user. It should also take into account any relevant context, such as the user's location, previous interactions, or preferences. Overall, the goal is to provide a response that effectively addresses the user's needs and supports a natural and engaging conversation.
This flowchart summarizes the basic steps involved in a conversation between a robot and a human. The user speaks or types a query, which is then processed using semantic parsing techniques. This involves preprocessing and embedding extraction through architectures like BERT. The resulting embedding is then used to perform intent inference through similarity search, which determines the user's intent based on the query content. Finally, the robot formulates a response using pre-written responses, generative AI, or an external API, and provides that response to the user.
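Putting it all together, the whole journey can be sketched by composing the helpers from the earlier snippets (normalize, model, infer_intent and respond); the example intents are, once again, hypothetical:

```python
# Build a toy intent database by embedding one example utterance per intent.
# A real system would embed many utterances per intent (all hypothetical here).
INTENT_EXAMPLES = {
    "space_fact": "tell me a fact about planets",
    "greeting": "hello how are you",
}
INTENT_VECS = {name: model.encode([text])[0]
               for name, text in INTENT_EXAMPLES.items()}

def answer(raw_query: str) -> str:
    """End-to-end sketch: preprocessing -> embedding -> intent -> response."""
    clean = normalize(raw_query)                      # 1) preprocessing
    query_vec = model.encode([clean])[0]              # 2) embedding extraction
    intent, _ = infer_intent(query_vec, INTENT_VECS)  # 3) intent inference
    return respond(intent)                            # 4) response formulation

print(answer("Hi Miko, can you tell me a fun fact about planets?"))
```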
In this article, we have attempted to demystify the magic of child-robot conversations by providing a high-level overview of the steps involved in converting a question by a child into a response by a robot. The process starts with semantic parsing, which comprises preprocessing and embedding extraction, followed by intent inference using similarity search, and ends with response formulation.
While this may seem like a lot of work, these steps are essential to ensure that child-robot conversations are barrier-free and effective. In the coming weeks, we will delve deeper into the functioning of each of these steps and explore how they contribute to successful child-robot conversations. We will also look at how Miko, a popular child companion robot, addresses some of the challenges in the field pertaining to child-robot conversations.