Knowledge-Grounded Dialogue Flow Management for Social Robots and Conversational Agents

Page created by Brittany Harris
 
CONTINUE READING
Knowledge-Grounded Dialogue Flow Management for Social Robots and Conversational Agents
Int J Soc Robot manuscript No.
                                        (will be inserted by the editor)

                                        Knowledge-Grounded Dialogue Flow Management for Social
                                        Robots and Conversational Agents
                                        Lucrezia Grassi · Carmine Tommaso Recchiuto · Antonio Sgorbissa
arXiv:2108.02174v1 [cs.RO] 4 Aug 2021

                                        Received: date / Accepted: date

                                        Abstract The article proposes a system for knowledge-       1 Introduction
                                        based conversation designed for Social Robots and other
                                        conversational agents. The proposed system relies on an     Social Robotics is a research field aimed at providing
                                        Ontology for the description of all concepts that may       robots with a brand new set of skills, specifically re-
                                        be relevant conversation topics, as well as their mutual    lated to social behaviour and natural interaction with
                                        relationships. The article focuses on the algorithm for     humans.
                                        Dialogue Management that selects the most appropri-             Social robots can be used in many contexts such
                                        ate conversation topic depending on the user’s input.       as education [1], welcoming guests in hotels [2], cruise
                                        Moreover, it discusses strategies to ensure a conversa-     ships [3], malls [4], and elderly care [5]. A noteworthy
                                        tion flow that captures, as more coherently as possible,    application of Socially Assistive Robots (SARs) is in
                                        the user’s intention to drive the conversation in spe-      the healthcare field: it has been argued that robots can
                                        cific directions while avoiding purely reactive responses   be used to make people feel less lonely and help human
                                        to what the user says. To measure the quality of the        caregivers taking care of elders in care homes [6][7]. In
                                        conversation, the article reports the tests performed       particular, robots may help to cope with caregiver bur-
                                        with 100 recruited participants, comparing five conver-     den, i.e., the stress perceived by formal and informal
                                        sational agents: (i) an agent addressing dialogue flow      caregivers, a relevant problem both in care homes and
                                        management based only on the detection of keywords          domestic environments. This subjective burden is one of
                                        in the speech, (ii) an agent based both on the detec-       the most important predictors for adverse outcomes of
                                        tion of keywords and the Content Classification feature     the care situation for the caregivers themselves, as well
                                        of Google Cloud Natural Language, (iii) an agent that       as for the one who requires care. Clinicians frequently
                                        picks conversation topics randomly, (iv) a human pre-       overlook the caregiver burden problem [8].
                                        tending to be a chatbot, and (v) one of the most famous         Recently, social robots have been recognized as a
                                        chatbots worldwide: Replika. The subjective perception      very important resource by one of the most recent ed-
                                        of the participants is measured both with the SASSI         itorials of Science Robotics [9], which analyzes the po-
                                        (Subjective Assessment of Speech System Interfaces)         tential of robots during the COVID-19 pandemic and
                                        tool, as well as with a custom survey for measuring the     underlines how Social Robotics is a very challenging
                                        subjective perception of coherence.                         area: social interactions require the capability of au-
                                                                                                    tonomously handling people’s knowledge, beliefs, and
                                        Keywords Social Robotics · Conversational Agents ·          emotions. In the last year, COVID-19 threatened the
                                        Knowledge-Grounded Conversation                             life of people with a higher risk for severe illness, i.e.,
                                                                                                    older adults or people with certain underlying medi-
                                        L. Grassi (B) · C.T. Recchiuto · A. Sgoribissa              cal conditions. To slow down the spread of the virus
                                        University of Genoa, DIBRIS, via all’Opera Pia 13
                                        Tel.: +39 010 3532801
                                                                                                    [10], there is the need to reduce social contacts and, as
                                        Fax: +39 010 3532154                                        a consequence, many older adults have been left even
                                        E-mail: lucrezia.grassi@edu.unige.it                        more socially isolated than before. However, in many
                                                                                                    cases, loneliness can spring adverse psychological effects
Knowledge-Grounded Dialogue Flow Management for Social Robots and Conversational Agents
2                                                                                                    Lucrezia Grassi et al.

such as anxiety, psychiatric disorders, depression, and         conversation to provide coherent replies, even when it
decline of cognitive functions [11]. In this scenario, So-      is not able to fully understand what the user is talking
cial Robotics and Artificial Intelligence, in general, may      about (which is likely to happen very often if the user
have a crucial role: conversational robots and virtual          is free to raise any topic).
agents can provide social interactions, without spread-             To increase the impact of the conversational sys-
ing the virus.                                                  tems of Social Robots, [19] argues that they shall be
    In many Social Robotics applications, the main fo-          designed to overcome a number of limitations. Some
cus is on the conversation. Based on its purpose, the           desirable features that are not present in most systems
conversation can be subdivided into:                            are: (1) Breaking the “simple commands only” bar-
 – task-oriented : it is used to give commands to per-          rier; (2) Multiple speech acts; (3) Mixed initiative dia-
   form a task or retrieve information. Examples of             logue;(4) Situated language and the symbol grounding
   task-oriented conversations can be found both dur-           problem; (5) Affective interaction; (6) Motor correlates
   ing the interaction with a robot, i.e., “Go to the           and Non-Verbal Communication; (7) Purposeful speech
   kitchen”, and during the interaction with a smart            and planning; (8) Multi-level learning; (9) Utilization of
   assistant, i.e., “Turn off the light”. In this case, the     online resources and services.
   problem is to understand the semantics of what the               The main contribution of the article is to investigate
   user is saying, to provide the required service.             possible solutions to issue 7, that we aim to achieve by
 – chit-chatting: it has the purpose of gratifying the          properly managing the conversation flow towards the
   user in the medium to long term, keeping him/her             execution of tasks or the exploration of relevant top-
   engaged and interested. For this kind of conversa-           ics, thus ultimately leading to a more engaging interac-
   tion, the most relevant problem, rather than per-            tion with the user. During the conversation, it typically
   fectly grasping the semantics of what the person             happens that a robot may provide answers that have
   says, is to be able to manage the conversation, show-        nothing to do with what the user says: this has a very
   ing the knowledge and competence on a huge num-              negative impact on the “suspension of disbelief” that is
   ber of different topics coherently and entertainingly        required to give the user the impression of genuine in-
   [12].                                                        telligence, and ultimately it generates frustration. Here,
                                                                a key element is the capability to know when to further
    In this scenario, the CARESSES project1 is the first
                                                                explore the current topic or choose the next conversa-
project having the goal of designing SARs that are cul-
                                                                tion topic, coherently with the user’s sentence: if the
turally competent, i.e., able to understand the culture,
                                                                algorithm does not pick a topic coherently, the agent’s
customs, and etiquette of the person they are assisting,
                                                                reply will not be appropriate, even if the system had
while autonomously reconfiguring their way of acting
                                                                the knowledge to interact consistently.
and speaking [13]. CARESSES exploited the humanoid
                                                                    Specifically, we proceed as follows.
robot Pepper2 as the main robotic platform for inter-
acting with people [6], in particular with older people             First, we propose a novel system for knowledge-
in care homes, to make them feel less isolated and re-          based conversation and Dialogue Management that re-
duce the caregiver burden. Under these conditions, the          lies on an Ontology for the description of all relevant
capability of engaging the user to chit-chat with the sys-      concepts that may play a key role in the conversation:
tem becomes of the utmost importance, as it has been            the Ontology is designed to take into account the pos-
shown [7] that this may have a positive impact on qual-         sible cultural differences between different users in a
ity of life [16], negative attitude towards robots [17], and    no-stereotyped way, and it stores chunks of sentences
loneliness [18]. However, for a proper conversation to be       that can be composed in run-time, therefore, enabling
possible, the system shall be able to talk about a huge         the system to talk about the aforementioned concepts
number of topics that may be more or less relevant for          in a culture-aware and engaging way [14][20].
different cultures, by properly managing the flow of the            Second, we test the developed solutions by compar-
                                                                ing five Artificial Conversational Agents during a con-
    1
    http://caressesrobot.org/                                   versation with 100 recruited participants: (i) an agent
    2
    CARESSES architecture has been recently revised to al-
                                                                addressing dialogue flow management based only on the
low for a Cloud-based implementation [14], thus enabling vir-
tually any device equipped with a network interface to per-     detection of keywords, (ii) an agent addressing dialogue
form long-term, culture-aware conversation with the user.       flow management based both on the detection of key-
Currently, the robots Pepper and NAO, the pill-dispenser        words and the Content Classification feature of Google
Pillo, Prof. Einstein, and a virtual character implemented as
                                                                Cloud Natural Language, (iii) an agent that picks ran-
an Android application have been provided with the onboard
functionalities to implement the instructions received by the   dom topics among those present in the Ontology, (iv) a
CARESSES Cloud [15].                                            human pretending to be a chatbot, and (v) one of the
Knowledge-Grounded Dialogue Flow Management for Social Robots and Conversational Agents                               3

most famous chatbots worldwide: Replika. The subjec-         as “I see, very interesting” as it has no previous knowl-
tive perception of the participants is measured both         edge about Pizza. On the other side, the latter may be
with the SASSI (Subjective Assessment of Speech Sys-         more specific, i.e., “Oh, pizza is a very delicious Italian
tem Interfaces) tool [21][22] and a custom survey for        food”, if it had the chance to recall internal or exter-
measuring the subjective perception of coherence in the      nal knowledge about pizza, e.g., using resources on the
conversation.                                                web.
    The remainder of this article is organized as follows.       Building a knowledge-grounded conversational sys-
Section 2 presents an overview of previous works related     tem raises many challenges. For instance, the internal
to Knowledge-Grounded Conversation, the most popu-           knowledge may be static or “expandable”, i.e., updated
lar validated tools to evaluate the User Experience, and     in run-time with new external information retrieved
introduces the concept of “coherence” in Dialogue Man-       from the user utterances, which will be considered as
agement. Eventually, it presents a typical classification    internal knowledge in the next interactions. The exter-
for Artificial Conversational Agents and describes the       nal knowledge in most cases will come from websites:
up-to-date most famous chatbots. An overview of the          however, even if the system can find some documents
CARESSES knowledge-based conversational system is            associated with the current conversation topic by using
given in Section 3. Section 4 describes the experiment       a state-of-the-art information retrieval system, it may
carried on to evaluate the user satisfaction when inter-     be difficult to extract knowledge from the search results
acting with the agents, along with the statistical tools     because this would typically require complex Natural
used to analyse the collected data. Section 5 presents       Language Processing techniques to select the appropri-
the results obtained, discussed in detail in Section 6.      ate knowledge.
Eventually, Section 7 draws the conclusions.                     This is done in [23] using a Generative Transformer
                                                             Memory Network and in [24] using an SVM classifier,
                                                             to name a few. Another complex task is to generate
2 State of the Art                                           appropriate responses that reflect the acquired knowl-
                                                             edge as a consequence of the conversation history. The
This section addresses Knowledge-Grounded Conversa-          problems of knowledge extraction and response gener-
tion, presenting a brief analysis of the related Litera-     ation, among the others, are addressed in [25], where a
ture. Moreover, it gives an overview of the most used        knowledge-grounded multi-turn chatbot model is pro-
tools to evaluate the User Experience during the in-         posed (see also 2.3 for popular chatbots based on this
teraction with conversational agents and introduces a        principle).
new metric exploited in our experiment. Eventually,              A data-driven and knowledge-grounded conversa-
it presents a classification for Artificial Conversational   tion model, where both conversation history and rel-
Agents and describes the up-to-date most famous chat-        evant facts are fed into a neural architecture that fea-
bots.                                                        tures distinct encoders for the two entries, is proposed
                                                             in [26]. Such model aims to produce more appropriate
                                                             responses: the article shows that the outputs generated
2.1 Knowledge-Grounded Conversation                          by a model trained with sentences and facts related
                                                             to the conversation history were evaluated by human
A knowledge-grounded conversational system is a dia-         judges as remarkably more informative (i.e., knowledge-
logue system that can communicate by recalling inter-        able, helpful, specific) with respect to those of a model
nal and external knowledge, similarly to how humans          trained with no facts.
do, typically to increase the engagement of the user dur-        In [27], a large corpus of status-response pairs found
ing chit-chatting. The internal knowledge is composed        on Twitter has been employed to develop a system
of things that the system already knows, while exter-        based on phrase-based Statistical Machine Translation,
nal knowledge refers to the knowledge acquired in run-       able to respond to Twitter status posts. As stated in the
time. To this end, knowledge-grounded systems must           article, data-driven response generation will provide an
not only understand what the user says but also store        important breakthrough in the conversational capabil-
external knowledge during the conversations, and their       ities of a system. Authors claim that when this kind of
responses should be based on the stored knowledge.           approach is used inside a broad dialogue system and is
There are many differences between the replies of a          combined with the dialogue state, it can generate lo-
normal conversational system and those of a knowledge-       cally coherent, purposeful, and more natural dialogue.
grounded system. For example, if the user says “I love           Finally, the problem of knowledge representation
pizza”, the former might provide a general answer such       and knowledge-based chit-chatting, keeping into account
4                                                                                                    Lucrezia Grassi et al.

the cultural identity of the person, is addressed in [20]      utterance, and then act appropriately. However, accord-
and [14] using an Ontology designed by cultural experts        ing to the aforementioned definition, what the System
coupled with a Bayesian network, to avoid rigid rep-           Response Accuracy scale aims to measure is not equiv-
resentations of cultures that may lead to stereotypes.         alent to the broader concept of coherence that has been
Such an approach allows the system to have more con-           given: it measures the quality of the system’s reply to
trol over what the robot says during the conversation:         the user’s input in a purely reactive fashion, without re-
the idea of having a knowledge base designed by experts        lying on the concept of “current topic of conversation”.
is particularly suitable for sensitive situations, i.e. when       For this reason, a coherence measure in the spirit of
dealing with the more fragile population such as older         [31] and [32] has been used to supplement the SASSI
adults or children. Approaches that automatically ac-          questionnaire, which requires users to rate individual
quire knowledge from the Internet may not be optimal           sentences pronounced by the robot at the light of the
in these scenarios.                                            context (details are given in Section 4.2.1). Notice that
                                                               [32] and [33] propose also methods to automatically
                                                               evaluate the coherence, as this is considered an impor-
2.2 User Experience
                                                               tant metrics to evaluate multi-turn conversation in open
                                                               domains: however, in this work we are only interested
In the Literature, tools exist that can be used to eval-
                                                               in evaluating the subjective perception of users, which
uate the User Experience (UX) and the overall qual-
                                                               makes a simple rating mechanism perfectly fitting our
ity of the conversation. The work carried on in [28] re-
                                                               purposes.
views the six main questionnaires for evaluating conver-
sational systems: AttrakDiff, SASSI, SUISQ, MOS-X,
PARADISE, and SUS. Moreover, it assesses the poten-            2.3 Artificial Conversational Agents
tial suitability of these questionnaires to measure vari-
ous UX dimensions.                                             Depending on the emphasis they put on a task-oriented
    As a measure of the quality of the flow of conversa-       conversation or chit-chatting, a typical classification for
tion, we aim to evaluate how “coherent” the system is:         artificial conversational agents is based on their scope:
this shall be done not only taking into account the last
user’s utterance but also the current conversation topic        – Question answering bots: knowledge-based conver-
(i.e., what is referred to as context in [29][30]). To clar-      sational systems that answer to users’ queries by
ify what this means, suppose a conversational system              analysing the underlying information collected from
that replies very accurately to everything the user says:         various sources like Wikipedia, DailyMail, Allen AI
if the user talks about football, the system will reply           science and Quiz Bowl [34];
talking about football, if the user talks about apples,         – Task-oriented bots: conversational systems that as-
the system will reply talking about apples. But what if           sist in achieving a particular task or attempt to solve
the person and the system are talking about football,             a specific problem such as a flight booking or hotel
and the person says something that is not recognized as           reservation [35];
related to anything specific, such as “I will think about       – Social bots: conversational systems that communi-
that”, “I think so”, “It’s so nice to talk with you”, and         cate with users as companions, and possibly en-
so on? A coherent Dialogue Management system shall                tertain or give recommendations to them [36]: re-
be able to understand when it is more appropriate to              cent notable examples are Microsoft Xiaoice [37] and
further explore the current conversation topic (in this           Replika3 .
example, by taking the initiative to ask the person a
question about his/her preferred team or players) or               In the following, we discuss more in detail the agents
lead the dialogue to another topic, coherently with what       belonging to the third class pointing out differences and
the user said.                                                 similarities with out solution, without making a dis-
    Unfortunately, the aforementioned UX tools are not         tinction between robots and chatbots, unless strictly
meant to evaluate a medium-term mixed-initiative dia-          required.
logue, as in our case. The most similar measure to what            For many decades, the development of social bots,
we want to evaluate is the System Response Accuracy            or intelligent dialogue systems that can engage in em-
scale of the SASSI questionnaire: such a questionnaire,        pathetic conversations with humans, is one of the main
described more in detail in Section 4.2.2, has been used       goals of Artificial Intelligence. As stated in [38], early
during our experiments. The System Response Accuracy           conversational systems such as Eliza [39], Parry [40],
is defined as the system’s ability to correctly recognise       3
                                                                  https://help.replika.ai/hc/en-us/articles/
the speech input, correctly interpret the meaning of the       115001070951-What-is-Replika-
Knowledge-Grounded Dialogue Flow Management for Social Robots and Conversational Agents                                5

and Alice [41] were designed to mimic human behaviour        conversation. Unfortunately, we could not try XiaoIce
in a text-based conversation, hence to pass the Tur-         as it is not available in Italy.
ing Test [42] within a controlled scope. Despite their            Replika is presented as a messaging app where users
impressive successes, these systems, which were pre-         answer questions to build a digital library of informa-
cursors to today’s social chatbots, worked well only in      tion about themselves. Its creator, a San Francisco-
constrained environments. In more recent times, among        based startup called Luka, sees see a whole bunch of
the most successful conversational agents for general        possible uses for it: a digital twin to serve as a compan-
use that can act as digital friends and entertainers, Xi-    ion for the lonely, a living memorial of the dead, created
aoice, Replika, Mitsuku, and Insomnobot-3000 are the         for those left behind, or even, one day, a version of our-
most frequently mentioned [43].                              selves that can carry out all the mundane tasks that we
    XiaoIce (“Little Ice” in Chinese) is one of the most     humans have to do, but never want to.
popular chatbots in the world. It is available in 5 coun-         To the best of the authors’ knowledge, there is no
tries (i.e., China, Japan, US, India, and Indonesia) un-     detailed explanation of the implementation of Replika,
der different names (e.g., Rinna in Japan) on more           hence we can only make assumptions on how issues re-
than 40 platforms, including WeChat, QQ, Weibo, and          lated to Dialogue Management are faced. After having
Meipai in China, Facebook Messenger in the United            intensively tried Replika, it seems that the conversation
States and India, and LINE in Japan and Indonesia.           is partitioned into “sessions” in which it has specific
Its primary goal is to be an AI companion with which         competencies, somehow playing a similar role as top-
users form long-term emotional connections: this dis-        ics in XiaoIce and our system. Concerning issue 7 in
tinguishes XiaoIce not only from early chatbots but          Section 1, the chatbot uses a Neural Network to hold
also from other recently developed conversational AI         an ongoing, one-on-one conversation with its users, and
personal assistants such as Apple Siri, Amazon Alexa,        over time, learn how to talk back [44]. The agent is
Google Assistant, and Microsoft Cortana.                     trained on texts from more than 8 million web pages,
     As stated in [37], the topic database of XiaoIce is     from Twitter posts to Reddit forums, and it can re-
periodically updated by collecting popular topics and        spond in a thoughtful and human-like way. From time
related comments and discussions from high-quality In-       to time, it will push the user to have a “session” to-
ternet forums, such as Instagram in the US and the           gether: in this session, it will ask questions regarding
website douban.com in China. To generate responses,          what the user did during the day, what was the best
XiaoIce has a paired database that consists of query-        part of the day, what is the person looking forward to
response pairs collected from two data sources: human        tomorrow, and eventually, the user has to rate his/her
conversational data from the Internet, (e.g., social net-    mood on a scale of 1 to 10. During this session, Rep-
works, public forums, news comments, etc.), and human-       lika takes control of the conversation and insists that
machine conversations generated by XiaoIce and her           the user answers its questions about a specific matter,
users. Even if the data collected from the Internet are      something that may be annoying and frustrating, and
subjected to quality control to remove personally iden-      we avoid by letting the user free to easily switch to an-
tifiable information (PII), messy code, inappropriate        other topic at any time. Replika’s responses are backed
content, spelling mistakes, etc., the knowledge acqui-       by Open AI’s GPT-24 text-generating AI system [45].
sition process is automated and it is not supervised              Mitsuku 5 , or Kuki as her close friends call her, is a
by humans: differently from our system, this solution        chatbot created by Pandorabots6 : an open-source chat-
may not be suitable in sensitive situations, e.g., when      bot framework that allows people to build and publish
dealing with the more fragile population such as older       AI-powered chatbots on the web, mobile applications,
people or children. Regarding issue 7 in Section 1 (i.e.,    and messaging apps like LINE, Slack, WhatsApp, and
the lack of flow in the conversation), the implementa-       Telegram.
tion of XiaoIce addressed this problem by using a Topic           The Pandorabots chatbot framework is based on the
Manager that mimics human behavior of changing top-          Artificial Intelligence Markup Language (AIML) script-
ics during a conversation. It consists of a classifier for   ing language, which developers can use to create con-
deciding at each dialogue turn whether or not to switch      versational bots. The downside is that Pandorabots do
topics and a topic recommendation engine for suggest-        not include machine learning tools that are common
ing a new topic. Topic switching is triggered if XiaoIce     on other chatbot building platforms. A specific AIML
does not have sufficient knowledge about the topic to        file allows users to teach Mitsuku new facts: the user
engage in a meaningful conversation, an approach that         4
                                                                  https://en.wikipedia.org/wiki/GPT-2
our system adopts as well by relying on an Ontology           5
                                                                  https://www.pandorabots.com/mitsuku/
                                                              6
for representing relationships among different topics of          https://home.pandorabots.com/home.html
6                                                                                                 Lucrezia Grassi et al.

should say “Learn” followed by the fact (i.e., “Learn        time, more specific to the context. Authors report also
the sun is hot”). The taught information is emailed to       that issues related to inappropriate and biased language
its creator Steve Worswick, which will personally su-        have still to be solved: to the best of our knowledge, a
pervise the learning process. To address issue 7 in Sec-     publicly available version of this chatbot has not been
tion 1, specific areas of competence of the chatbot are      released yet.
managed by different AIML files, which are responsible           BlenderBot [30] is a very recent open-domain chat-
for maintaining coherence in the responses: AIML files,      bot developed at Facebook AI Research. As like as we
therefore, play a similar role as topics/concepts in the     do, the authors put a stronger emphasis on aspects that
Ontology that are the core elements of our system, but       are not only related to sentence generation: they argue
without the benefit of having a hierarchical structure       that good conversation requires a number of skills to
as a key element for Dialogue Management. Mitsuku is         be blended in a seamless way, including providing en-
a five-time Loebner Prize winner (in 2013, 2016, 2017,       gaging topics of conversation as well as listening and
2018, 2019), and it is available in a web version7 , Face-   showing interest to what the user says, among the oth-
book Messenger, Kik Messenger, and Telegram.                 ers. Current generative models, they argue, tend to pro-
    Insomnobot-3000 8 is the world’s first bot that is       duce dull and repetitive responses (and, we would add,
only available to chat, exclusively via SMS, between         they do not take cultural appropriateness into account),
11pm and 5am regardless of the time zone. The bot was        whereas retrieval models may produce human written
built by the mattress company Casper and, according          utterances that tend to include more vibrant (and cul-
to its creators, it was programmed to sound like a real      turally appropriate) language. To overcome the limita-
person and talk about almost anything. Insomnobot-           tions of purely generative models, BlenderBot includes
3000 cannot learn new things and expand its knowl-           a retrieval step before generation according to a retrieve
edge base: however, it can generate over 2,000 different     and refine model [46], which retrieves initial dialogue
responses, depending on which category and emotion           utterances and/or knowledge from a large knowledge
the keyword falls under. Concerning issue 7 in Section       base. As like Meena, the system is trained to choose
1, when the bot receives a text, it chooses an appro-        the next dialogue utterance given the recent dialogue
priate response by identifying keywords: however, dif-       history, referred to as context: training is performed us-
ferently from the aforementioned systems and our solu-       ing a number of datasets that are key to exhibit differ-
tion, it lacks a more sophisticated mechanism for Dia-       ent skills (i.e., engaging personality, emotional talking,
logue Management to switch in-between different areas        knowledge-grounded conversation), as well as a Blended
of competence or topics to provide contextual replies.       Skill Talk dataset that merges utterances taken from
    In addition to the aforementioned chatbot avail-         the three. Authors claim that their model outperforms
able to the large public, recent research on data-driven     Meena in human evaluations, however, they acknowl-
knowledge-grounded conversation based on sophisticated       edge that the system still suffers from a lack of in-
generative models is worth being mentioned.                  depth knowledge if sufficiently interrogated, a tendency
    Meena [29] is a generative chatbot model trained         to stick to simpler language, and a tendency to repeat
end-to-end on 40B words mined from public domain             often used phrases.
social media conversations, that addresses the problem           DialoGPT [47] is a neural conversational response
of multi-turn conversation in open domains. Accord-          generation model trained on 147M conversation-like ex-
ing to its authors, this chatbot can conduct conversa-       changes extracted from Reddit comment chains over a
tions that are more sensible and specific than existing      period spanning from 2005 through 2017 and it extends
state-of-the-art chatbots. Such improvements are mea-        GPT-2. Authors claim that conversational systems that
sured by a new human evaluation metric, called Sensi-        leverage DialoGPT generate more relevant, contentful,
bleness and Specificity Average (SSA), which captures        and context-consistent responses than strong baseline
important attributes for human conversation. Meena is        systems. As it is fully open-sourced and easy to deploy,
based on the concept of multi-turn contexts to evaluate      users can extend the pre-trained conversational system
the quality of a response: not only the last sentences       to bootstrap training using various datasets. Authors
but the recent conversation history in terms of (con-        claim that the detection and control of toxic output
text, response) pairs is used to train and evaluate the      will be a major focus of future investigation.
network. Experiments show significant improvements in
                                                                 All generative models based on human-human data
SSA with respect to competitors, providing replies that
                                                             offers many advantages in producing human-like utter-
are coherent with what the user said and, at the same
                                                             ances fitting the context, but have the major drawback
    7
        https://chat.kuki.ai                                 that they can learn undesirable features leading to toxic
    8
        https://insomnobot3000.com                           or biased language. Classifiers to filter out inappropri-
Knowledge-Grounded Dialogue Flow Management for Social Robots and Conversational Agents                               7

ate language exist [48], but they still have limitations:     whichever the cultural identity of the user is, to avoid
this issue is particularly important when dealing with        stereotypes (an example related to different kind of bev-
the more frail populations, especially by considering         erages is shown in Figure 2). The system will initially
that classifying language as inappropriate or offensive       guess the user’s beliefs, values, habits, customs, pref-
typically depends on cultural factors, and the devel-         erences based on the culture he/she declared to self-
opment of classifiers working properly in multiple cul-       identify with, but it will then be open to considering
tures may be very challenging. Since we claim that a          choices that may be less likely in a given culture, as the
proper flow of conversation deserves higher attention         user explicitly declares his/her attitude towards them.
than the automatic generation of sentences, we address        According to this principle, the system may initially in-
these problems through an Ontology of conversation            fer that an English person may be more interested to
topics and topic-related chunks of sentences generated        talk about Tea rather than Coffee, and the opposite
by cultural experts, that are then composed in run-time       may be initially inferred for an Italian user. However,
using the hierarchical structure of the knowledge-base        during the conversation, initial assumptions may be re-
to produce a huge variety of sentences. Finally, even if      vised, thus finally leading to a fully personalized repre-
this is not the main objective of this article, it is worth   sentation of the user’s attitude towards all concepts in
reminding that our system may take cultural aspects           the TBox, to be used for conversation.
into account in Dialogue Management, an element that
is completely ignored by all state-of-the-art systems.

3 System Architecture

Even if the focus of this article is on knowledge-based
chit-chatting and dialogue flow management, and not
on cultural adaptation, it is necessary to briefly in-
troduce the structure of the cultural knowledge base
used by the CARESSES conversational system since it
strongly influences the algorithms we implemented.

3.1 Knowledge Representation

In CARESSES, the ability of the companion robot to
                                                              Fig. 1 Knowledge representation architecture for a cultur-
naturally converse with the user has been achieved by         ally competent robot.
creating a framework for cultural knowledge represen-
tation that relies on an Ontology [49] implemented in
OWL2 [50]. According to the Description Logics formal-            To implement this mechanism, the Culture-Specific
ism, concepts (i.e., topics of conversation the system is     ABox layer comprises instances of concepts (with prefix
capable of talking about) and their mutual relations are      EN for “English” in Figure 2) encoding culturally ap-
stored in the terminological box (TBox) of the Ontol-         propriate chunks of sentences to be automatically com-
ogy. Instead, instances of concepts and their associated      posed (Data Property hasSentence) and the probabil-
data (e.g., chunks of sentences automatically composed        ity that the user would have a positive attitude toward
to enable the system to talk about the corresponding          that concept, given that he/she belongs to that cultural
topics) are stored in the assertional box (ABox).             group (Data Property hasLikeliness).
    To deal with representations of the world that may            Eventually, the Person-Specific ABox comprises in-
vary across different cultures [51], the Ontology is or-      stances of concepts (with prefix DS for a user called
ganized into three layers, as shown in Figure 1. The          “Dorothy Smith” in Figure 2) encoding the actual user’s
TBox (layer I) encodes concepts at a generic, culture-        attitude towards a concept (Mrs Dorothy Smith may be
agnostic level, which can be inherited from existing          more familiar with having tea than the average English
upper and domain-specific Ontologies (grey boxes) or          person, hasLikeliness=Very High), sentences to talk
explicitly defined to enable culture-aware conversation       about a topic explicitly taught by the user to the sys-
with the user (white boxes). An important point to be         tem (hasSentence=“You can never get a cup of tea
highlighted is that the TBox should include concepts          large enough or a book long enough to suit me”) or
that are typical of the union all cultures considered,        other knowledge explicitly added during setup (e.g., the
8                                                                                                    Lucrezia Grassi et al.

                                                               user’s name and the town of residence). At the first en-
                                                               counter between the robot and a user, many instances
                                                               of the Ontology will not contain Person-Specific knowl-
                                                               edge: the robot will acquire this awareness at run-time
                                                               either from its perceptual system or during the interac-
                                                               tion with the user, e.g., asking questions. Figure 1 also
                                                               shows that some instances of existing Ontologies that
                                                               are not culture- or person-dependent (dark circles) may
                                                               not change between the two ABox layers: this may re-
                                                               fer, for instance, to technical data such as the serial
                                                               number of the robot or physical quantities, if they need
                                                               to be encoded in the Ontology. For a detailed descrip-
                                                               tion of the terminological box (TBox) and assertional
                                                               box (ABox) of the Ontology, as well as the algorithms
                                                               for cultural adaptation, see [14][20].
                                                                    The Dialogue Tree (DT) (Figure 2), used by the
                                                               chit-chatting system (Section 3.2), is built starting from
                                                               the Ontology structure: each concept of the TBox and
                                                               the corresponding instances of the ABox are mapped
                                                               into a conversation topic, i.e., a node of the tree. The re-
                                                               lation between topics is borrowed from the structure of
                                                               the Ontology: specifically, the Object Property hasTopic
                                                               and the hierarchical relationships among concepts and
                                                               instances are analyzed to define the branches of the
                                                               DT. In the example of Figure 2, the instance of Tea
                                                               for the English culture is connected in the DT to its
                                                               child node GreenTea (which is a subclass of Tea in the
                                                               Ontology), and its sibling MilkTea (since EN MILK is a
                                                               filler of EN TEA for the Object Property hasTopic).
                                                                   As mentioned, each conversation topic has chunks
                                                               of culturally appropriate sentences associated with it
                                                               that are automatically composed and used during the
                                                               conversation. Such sentences can be of different types
                                                               (i.e., positive assertions, negative assertions, different
                                                               kinds of questions, or proposals for activities) and are
                                                               selected in subsequent iterations, also depending on in-
                                                               put received from the user. Typically, when exploring a
                                                               topic, the system may first ask a question to understand
                                                               if the user is familiar with that topic: if so, this may be
                                                               followed by general considerations, proposals for activ-
                                                               ities, or open questions, until the system moves to the
                                                               next topic if the conditions hold. Multiple sentences of
                                                               the same type are present, randomly chosen to reduce
                                                               the chance of repetitions. At any time, the user is free to
                                                               take the initiative to express his/her considerations or
                                                               lead the conversation somewhere else. Sentences may
                                                               contain variables that are instantiated when creating
                                                               the DT. For instance, a hypothetical sentence “Do you
                                                               like $hasName?” encoded in the concept Coffee might
Fig. 2 The three layers of the Ontology: TBox, CS-ABox (for
                                                               be used to automatically produce both “Do you like
the English culture), PS-ABox (for the user Dorothy Smith),
and the Dialogue Tree generated from the Ontology structure.   Coffee?” and “Do you like Espresso?” (being Espresso
                                                               a subclass of Coffee). As further example, the sentence
                                                               “I love $hasName with $hasTopic*hasName” encoded
Knowledge-Grounded Dialogue Flow Management for Social Robots and Conversational Agents                             9

in the concept Movie can be used to produce both “I            BreakfastHabits) to more specific (e.g., the topic
love movies with great actors” and “You know... I love         HavingTeaForBreakfast) or related ones (e.g., the
Bollywood movies with Amitabh Bachchan”, depend-               topic TakingCareOfOneself). This is achieved by
ing on the taxonomy of the knowledge base and ad-              selecting the branches to maximise the probability
ditional rules for sentence composition. In the current        that the user will be interested in the next topic
version used for testing, exploiting the taxonomy of the       (also based on cultural factors). When in a topic,
Ontology allowed us to easily produce a DT with 2,470          the system asks questions, reply with positive or
topics of conversation and 24,033 sentences, with ran-         negative assertions, propose activities or simply en-
dom variations made in run-time.                               courage the user to freely express his/her view, by
    A few additional words are worth spending to clar-         composing chunks of sentences encoded in the on-
ify this concept. Currently, as in the aforementioned ex-      tology, until the topic has been sufficiently explored.
ample, one sentence may contain two variables whose            Probabilities associated with topics have been ini-
value depends on (i) the specific concept that inherited       tially assigned with the help of Transcultural Nurs-
that Data Property from superconcepts in the Ontol-            ing experts as well as looking for information on the
ogy, and (ii) related concepts connected to it through an      Internet about habits, foods, sports, religions, etc. in
Object Property. Then, when manually adding a sen-             different cultures. Please, notice however that these
tence as a Data Property of a given concept, a num-            are only initial guesses, as the system will update
ber of sentences will be inherited and added automat-          probabilities during the conversation depending on
ically depending on the number of (i) its subconcepts          the user’s preferences and attitudes to avoid stereo-
and (ii) their role-fillers. By considering a DT gener-        typed representations.
ated from the Ontology and having a branching factor        2. The second mechanism enables the Dialogue Man-
B = Bs + Bp , where Bs is the branching factor due             agement algorithm to jump to a different topic of
to subconcepts and Bp is the branching factor due to           the DT depending on what the user says (e.g., when
role-fillers, adding a sentence in a concept at a height       talking about BreakfastHabits the person may start
H from DT leaves produces a number of sentence in              talking about a Restaurant he/she loves). After do-
the order of O(H B ); by considering that all sentences        ing so, the system reverts to the first mechanism
are randomly associated in run-time with a prefix (e.g.,       but starting from the new topic: it asks questions,
“You know...”, “I heard that...”) and may be appended          makes positive and negative comments, proposes ac-
to each other, this increases the number of variations         tivities, or allows the person to freely express his/her
even more. For instance, in the current configuration          view about the new topic, and then re-start follow-
with 2,470 concepts, adding a sentence at the root of          ing branches as previously.
the DT will produce a maximum of 2,470 inherited sen-
                                                              With reference to Figure 2, a possible dialogue pat-
tences in the whole Ontology (which however, may lack
                                                           tern between the robot and Mrs Dorothy Smith, using
specificity, as it is unlikely that the same pattern may
                                                           the aforementioned mechanisms, would be the follow-
be easily adapted to talk about everything in the ontol-
                                                           ing (labels 1 and 2 identify the mechanism currently
ogy: i.e., “I love talking about $hasTopic with you” may
                                                           adopted by the system to navigate the DT):
work for all subconcepts, but “I like eating $hasTopic”
                                                              R. (1): Do you usually drink something with your
will not).
                                                           meals?
                                                              Mrs S.: Always! At least I try...
                                                              R. (1): Drinking water is very important for health!
3.2 Chit-Chatting
                                                           Please, tell me more about your drinking preferences.
When the user pronounces a sentence that triggers the         Mrs S.: Well, I mostly drink water during lunch.
start of a conversation, the system switches to the chit-     R. (1): I know that many English persons love tea.
chatting state. In this case, the Dialogue Management      Do  you?
algorithm operates to keep the conversation as engaging       Mrs   S.: Yes, of course! But mostly in the afternoon.
and natural as possible by implementing two different,        R.  (1): You can never get a cup of tea large enough
intertwined methods to navigate the DT based on what       or a book  long enough to suit me. Do you usually have
the user says:                                             milk with  your tea?
                                                              Mrs S.: Yes, and scones, too! But they are not very
 1. The first mechanism implemented by the Dialogue        healthy.
    Management algorithm simply chooses the topic based       R. (2): I love scones with jam and whipped cream!
    on the DT, following the branches from more general       Figure 3 depicts a detailed flow-chart that clarifies
    conversation topics close to the root (e.g., the topic what happens during the interaction with the user.
10                                                                                                  Lucrezia Grassi et al.

    The diagram shows that, as soon as the person says
something, the Dialogue Management algorithm first
checks if the user’s sentence contains some command
that is aimed to trigger an activity (box Perform the
required activity). This task-oriented part of the di-
alogue has been implemented in our experiments with
caring robots during the CARESSES project, but not
in the tests described in this article, and it may rely on
a third-party system for NLP9 or a pre-existing com-
mand interface that is already implemented onboard
the robot (e.g., verbal commands to clean the room in
case of a robotic vacuum cleaner, or to deliver medicines
in case of a robotic pill dispenser).
    If a command is not found, the user’s input is pro-
cessed (box Process user sentence with Dialogue
Management mechanism 1 or 2) by using the two mech-
anisms described so far: as a first step the user’s utter-
ance is analysed to check if something relevant is de-
tected, and possibly used to jump to another topic us-
ing mechanism 2; if no information is found to jump
to another topic, mechanism 1 checks if the current
topic shall be further explored or it is time to follow the
branches of the DT from a parent node to a children
node that is semantically related in the Ontology. Even-
tually, it should be mentioned that activities (e.g., vac-
uum cleaning or pill delivery) may even be proposed by
the Dialogue Management System itself when related            Fig. 3 Flow chart of the basic chit-chatting framework.
to the current topic of discussion (i.e., if the person is
talking about cleaning and the robot has this capabil-
ity). The interaction continues in this way until the user    at evaluating the quality of the conversation: the anal-
explicitly asks to stop it.                                   ysis performed was rather aimed to measure improve-
    This approach to Dialogue Management, despite its         ment in health-related quality of life (SF-36 [16]), neg-
simplicity, reveals to be very effective for chit-chatting.   ative attitude towards robots (NARS [17]), loneliness
It should be reminded that here Dialogue Management           (ULS-8 [18]). Indeed, additional analysis with a wider
is not meant to understand all the nuances of human           population is needed to evaluate the quality of the con-
language when giving a specific command to be ex-             versation produced by our system in terms of issue 7 in
ecuted (i.e., “Put the red ball on the green table”),         Section 1, and motivates the present study. When nav-
but rather it aims at engaging the user in rewarding          igating in the knowledge base following the branches
conversations: an objective that we try to achieve by         of the DT (aforementioned mechanism 1), coherence in
enabling the system to coherently talk about a huge           the flow of conversation is straightforwardly preserved
number of different topics while properly managing the        by the fact that two nodes of the DT are semantically
conversation flow. Please notice that the approach has        close to each other by construction. However, to pre-
been already tested in the CARESSES project with              serve coherence when deciding if it is needed to jump
more than 20 older people of different nationalities,         from one node to another (mechanism 2), proper strate-
for about 18 hours of conversation split into 6 ses-          gies should be implemented to avoid the feeling that
sions: both the quantitative and qualitative analysis         the system is “going off-topic” with respect to what
performed revealed very positive feedback from the par-       the person says: on the one side, the system needs to be
ticipants [6][7]. However, the quantitative analysis per-     enough responsive to promptly move to another topic if
formed with care home residents was not explicitly aimed      the person wants to; on the other side, the system needs
 9
                                                              to avoid jumping from one topic to another, overesti-
    For instance, https://dialogflow.cloud.google.com/,
                                                              mating the desire of the person to talk about something
that in the current implementation also manages small-talk
requests triggered by sentences such as “Hello”, “How are     else when he/she would be happier to further explore
you”, or similar).                                            the current topic of conversation.
Knowledge-Grounded Dialogue Flow Management for Social Robots and Conversational Agents                            11

3.3 Jumping to a different discussion topic

When using the two mechanisms described in the pre-
vious section to navigate the DT, to preserve coher-
ence in the conversation flow the system uses one of
the two following methods: keyword-based topic match-
ing and keyword- and category-based topic matching.
These methods differ in terms of complexity and in
terms of the need to rely on third-party services to an-
alyze the semantic content of sentences.

3.3.1 Keyword-Based Topic Matching

The first, and the simpler, method is exclusively based
on the detection of keywords in the user sentence (key-
words are manually encoded in the Ontology in a corre-
sponding Data Property). To match a topic, at least two
keywords associated with that topic should be detected
in the sentence pronounced by the user, using wildcards
to enable more versatility in keyword matching. The
use of multiple keywords allows the system to differen-
tiate between semantically close topics (i.e., Green Tea
rather than the more general concept of Tea).
    Figure 4 shows a possible implementation of the box
Process user sentence with Dialogue Management              Fig. 4 Flow chart of the keyword-based topic matching
mechanism 1 or 2 in Figure 3 when using the keyword-        method.
based topic matching method:
 – If the user’s sentence contains keywords correspond-     the keyword-based method, to find the most appropri-
   ing to a topic in the Ontology (left part of the di-     ate topic to continue the conversation. The algorithm
   agram), the algorithm returns a topic of the DT          finds the word “interest” matching with the keyword 1
   that matches the keywords. In case more matches          of the topic HOBBY. Then it should ideally check if the
   are found, the algorithm randomly chooses one of         sentence contains a word matching with keyword 2 as-
   the matching topics.                                     sociated with the same topic: however, in this case, it
 – Otherwise:                                               it turns out that no second keyword is needed since a
    – If there are still relevant questions to be asked     wildcard is used in the topic HOBBY to guarantee that
       or assertions to be made about that topic, the       one keyword is sufficient. As a result, the algorithm will
       system stays on the same topic.                      return the only matching node of the DT: the system
    – Otherwise:                                            will start asking the user about his/her hobbies (which
         • If not at the bottom of the DT, the Dialogue     does not look very appropriate). The second limitation
            Management System keeps on exploring the        related to the behaviour of this algorithm arises when
            DT along its branches;                          no keywords are found and we are at the bottom of the
         • Otherwise the algorithm returns a random         DT: in this case, as already mentioned, the algorithm
            topic immediately below the root.               has no information to continue the conversation and
    This basic approach, despite its simplicity, has been   jumps to a random topic close to the root.
successfully exploited during the experimental trial in
care homes, and its capability to provide (almost al-       3.3.2 Keyword- and Category-Based Topic Matching
ways) coherent, knowledge-grounded replies, was con-
firmed in public exhibitions. However, when focusing on     The second algorithm is meant to solve the issues of the
coherence, the approach has obvious limitations. Let’s      previous algorithm, or at least reduce their frequency,
suppose that the user is having a conversation with the     by enabling a more complex understanding of the se-
system and at some point, for some reason, the user         mantic content of the sentence. For this purpose, this
says “My bank account has a high interest”. This sen-       algorithm not only exploits the keywords, but it also
tence, like any other sentence, is provided as input to     takes into account the “category” of the user’s sentence:
12                                                                                                Lucrezia Grassi et al.

                                                             rent Ontology, only 122 topics have not been associated
                                                             with a category, as none was found to be appropriate.
                                                             Hence, the topics mapped to at least a CNL Category11
                                                             are 2,348.
                                                                 Figure 5 shows a possible implementation of the box
                                                             Process user sentence with Dialogue Management
                                                             mechanism 1 or 2 in Figure 3 when using both key-
                                                             words and categories for topic matching. As a first step,
                                                             the algorithm checks whether the sentence contains at
                                                             least 20 tokens: the Content Classification feature of
                                                             CNL does not work if the input does not satisfy this
                                                             requirement. If the sentence is not long enough, it is
                                                             replicated until it contains at least 20 words. Then it
                                                             proceeds as follows:
                                                              – If the user’s sentence contains keywords correspond-
                                                                ing to a topic in the Ontology and CNL returns at
                                                                least a category associated with the sentence (left
                                                                part of the diagram), the algorithm selects the topic(s)
                                                                in the Ontology with the greatest number of cate-
                                                                gories in common: if there is more than one topic, it
                                                                picks the closest to the DT root or, if there are mul-
                                                                tiple topics at the same level, it chooses randomly;
Fig. 5 Flow chart of the keyword- and category-based topic
                                                              – Otherwise, if no keywords match or the sentence
matching method.
                                                                does not have any category:
                                                                 – If there are still relevant questions to be asked
in the current implementation, it uses the third-party              or assertions to be made about that topic, the
services provided by Cloud Natural Language (CNL)                   system stays on the same topic.
API10 , an advanced tool for Sentiment Analysis, En-             – Otherwise:
tity Recognition, and Content Classification, among the               • If not at the bottom of the DT, the Dialogue
others.                                                                  Management System keeps on exploring the
    To exploit the information related to the category                   DT along its branches;
of the sentence, the Content Classification of the topics             • Otherwise:
contained in the Ontology is a fundamental operation                       · If there are keywords or CNL categories
and needs to be performed in a setup phase to create                         associated with the sentence (we already
a mapping between the topics contained in the Ontol-                         know there are not both), the algorithm
ogy and the CNL category hierarchy. Off-line, before                         selects the topic(s) with matching key-
the system starts, the classification procedure is per-                      words or the greatest number of cate-
formed for all the topics: the algorithm puts together                       gories in common: if multiple, it picks
all the sentences associated with each topic (inserted                       the one closer to the root in the DT or
into the Ontology by experts and then automatically                          randomly if they are at the same level.
composed, or added by the users during the conversa-                       · Otherwise, the algorithm returns a ran-
tion through a mechanism not described in this article                       dom topic immediately below the root.
[52]), sends them to CNL, and associates the returned
                                                                 This second method is more complex and compu-
categories to the corresponding topic in the Ontology
                                                             tationally expensive than the previous one (and, in the
and then in the DT (notice that, whilst keywords are
                                                             current implementation, it requires third-party services),
manually encoded in the Ontology, categories are auto-
                                                             but it provides more stability and consistency when se-
matically assigned). On-line, during the conversation,
                                                             lecting the next conversation topic. Also, it diminishes
this mapping will allow the system to find which topics
                                                             the chances of performing random changes of conver-
of the Ontology match best (i.e., have more categories
                                                             sation topic when at the bottom of the DT and no
in common) with the category (if any) of the sentence
                                                             keywords are found: if the sentence has at least one
pronounced by the user according to CNL. In the cur-
                                                             11
                                                                https://cloud.google.com/natural-language/docs/
10
     https://cloud.google.com/natural-language?hl=en         categories
Knowledge-Grounded Dialogue Flow Management for Social Robots and Conversational Agents                             13

category associated, recognized by CNL, there is still a     2. The system exploiting the keyword and category-
chance of finding a coherent topic to jump to in order          based Dialogue Management algorithm;
to continue the conversation.                                3. A system that chooses the next topic randomly (i.e.,
                                                                regardless of what the user says);
                                                             4. A human pretending to be a chatbot (i.e., in a Turing-
4 Materials and Methods                                         test fashion);
                                                             5. Replika.
Experimental tests with recruited participants have been
                                                                 It shall be reminded that the purpose of the exper-
performed for multiple purposes:
                                                             iments is to test the impact of the Dialogue Manage-
 – Evaluate the impact of different solutions for Dia-       ment solutions we proposed, which emphasize the prob-
   logue Management we developed, either keyword-            lem of switching in-between different topics: Replika is
   based or both keyword- and category-based, to im-         expected to perform better than systems 1, 2, and 3
   prove the subjective perception of the system in          in generating individual sentences, and this is true in
   terms of coherence and user satisfaction;                 the case of a human pretending to be a chatbot. Then,
 – Compare our solutions with Replika (one of the most       the purpose of the comparison is to show that, even
   advanced commercial social chatbots) as well as with      in presence of a more advanced approach for produc-
   baselines consisting of a system choosing replies ran-    ing individual sentences, still our solution for Dialogue
   domly, and with a human pretending to be a chat-          Management may have a significantly positive impact
   bot.                                                      on user evaluation.
                                                                 For reasons due to the Covid-19 pandemic, partic-
    Among the most famous chatbots mentioned in Sec-         ipants could not take the tests with a robot, but they
tion 2.3 we chose to use Replika for our experiment as       had to interact with their assigned conversational sys-
it is available as Android12 and IOS application, and        tem (1, 2, 3, 4, or 5) using their computer, by establish-
in a web version. Moreover, as already mentioned, it is      ing a connection to a remote server in our laboratory. It
backed by Open AI’s sophisticated GPT-213 : a trans-         is crucial to mention that the text-based user interface
former machine learning model that uses deep learning        was identical for all the systems: therefore, participants
to generate text output that has been claimed to be          assigned group 4 were not aware that they were inter-
indistinguishable from that of humans [45].                  acting with a human, writing at a terminal, and par-
                                                             ticipants assigned to group 5 were not aware that they
                                                             were interacting with Replika (with a human manually
4.1 Participants and Test Groups
                                                             copying and pasting sentences from Replika to the ter-
                                                             minal and vice-versa).
The participants for this test have been recruited through
                                                                 In total, each session included 20 exchanges (i.e.,
posts on Social Networks (Facebook and Twitter) as
                                                             a pair of utterances) between the participant and the
well as in Robotics and Computer Science classes at the
                                                             system.
University of Genoa, for a total of 100 volunteer par-
ticipants aged between 25 and 65. The only inclusion
criteria are that participants shall be able to read and     4.2 Questionnaire
write in English and to use a PC with a network con-
nection from home. Since the test is anonymous and no        Immediately after the conversation session, participants
personal data is collected, the whole procedure is fully     were asked to fill in a questionnaire, divided into two
compliant with the EU General Data Protection Regu-          parts.
lation. Each time a new participant is recruited, he/she         The first section of the questionnaire required to
is randomly assigned to a group from 1 to 5 (where each      evaluate what we defined as the Coherence of system’s
group corresponds to a different conversation system),       replies (Section 4.2.1 below), for a total of 20 replies,
and given the instructions to perform the test. Based        individually scored, in a 7-point Likert scale, where 1
on the group they were assigned, participants had to         means not Coherent and 7 means Coherent. The sec-
interact with:                                               ond part consisted of the SASSI questionnaire (Section
                                                             4.2.2).
1. The system exploiting the keyword-based Dialogue
   Management algorithm;                                     4.2.1 Coherence Measure
12
    Replika’s Android application has been downloaded more
than 5,000,000 times from the Google Play Store.             The instructions provided to each participant clarified
 13
    https://en.wikipedia.org/wiki/GPT-2                      what is meant by “evaluating the Coherence”. The aim
You can also read