Abstract Meaning Representation for Human-Robot Dialogue

Claire Bonial¹, Lucia Donatelli², Jessica Ervin³, and Clare R. Voss¹
¹U.S. Army Research Laboratory, Adelphi, MD 20783
²Georgetown University, Washington D.C. 20057
³University of Rochester, Rochester, NY 14627
claire.n.bonial.civ@mail.mil

Abstract

In this research, we begin to tackle the challenge of natural language understanding (NLU) in the context of the development of a robot dialogue system. We explore the adequacy of Abstract Meaning Representation (AMR) as a conduit for NLU. First, we consider the feasibility of using existing AMR parsers for automatically creating meaning representations for robot-directed transcribed speech data. We evaluate the quality of output of two parsers on this data against a manually annotated gold-standard data set. Second, we evaluate the semantic coverage and distinctions made in AMR overall: how well does it capture the meaning and distinctions needed in our collaborative human-robot dialogue domain? We find that AMR has gaps that align with linguistic information critical for effective human-robot collaboration in search and navigation tasks, and we present task-specific modifications to AMR to address the deficiencies.

1 Introduction

A central challenge in human-agent collaboration is that robots (or their virtual counterparts) do not have sufficient linguistic or world knowledge to communicate in a timely and effective manner with their human collaborators (Chai et al., 2017; She and Chai, 2017). We address this challenge in ongoing research directed at analyzing robot-directed communication in collaborative human-agent exploration tasks, with the ultimate goal of enabling robots to adapt to domain-specific language.

In this paper, we choose to adopt an intermediate semantic representation, selecting Abstract Meaning Representation (AMR) (Banarescu et al., 2013) in particular, for three reasons: (i) the semantic representation framework abstracts away from surface variation, so the robot need only be trained to process and execute the actions corresponding to semantic elements of the representation; (ii) there are a variety of fairly robust AMR parsers we can employ for this work, enabling us to forego manual annotation of substantial portions of our data and facilitating efficient automatic parsing in a future end-to-end system; and (iii) the structured representation facilitates the interpretation of novel instructions and the grounding of instructions with respect to the robot's current physical surroundings and set of executable actions. The latter motivation is especially important given that our human-robot dialogue is physically situated. This stands in contrast to many other dialogue systems, such as task-oriented chat bots, which do not require establishing and acting upon a shared understanding of the physical environment and often do not require any intermediate semantic representation (see §6 for further comparison to related work).

Our paper is structured as follows: First, we present background both on the corpus of human-robot dialogue we are leveraging (§2) and on AMR (§3). §4 discusses the implementation and results of two AMR parsers on the human-robot dialogue data. §5 assesses the semantic coverage of AMR for the human-robot dialogue data in particular. We then discuss related work that informs the current research in §6. Finally, §7 concludes and presents ideas for future work.

2 Background: Human-Robot Dialogue Corpus
We aim to support NLU within the broader context of ongoing research to develop a human-robot dialogue system (Marge et al., 2016a) to be used onboard a remotely located agent collaborating with humans in search and navigation tasks (e.g., disaster relief). In developing this dialogue system, we are making use of portions of the corpus of human-robot dialogue data collected under this effort (Bonial et al., 2018; Traum et al., 2018).¹ This corpus was collected via a phased 'Wizard-of-Oz' (WoZ) methodology, in which human experimenters perform the dialogue and navigation capabilities of the robot during experimental trials, unbeknownst to participants interacting with the 'robot' (Marge et al., 2016b).

¹This corpus is still being collected and a public release is in preparation.

Specifically, a naïve participant (unaware of the wizards) is tasked with instructing a robot to navigate through a remote, unfamiliar house-like environment, and asked to find and count objects such as shoes and shovels. In reality, the participant is not speaking directly to a robot, but to an unseen Dialogue Manager (DM) Wizard who listens to the participant's spoken instructions and responds with text messages in a chat window or passes a simplified text version of the instructions to a Robot Navigator (RN) Wizard, who joysticks the robot to complete the instructions. Given that the DM acts as an intermediary passing communications between the participant and the RN, the dialogue takes place across multiple conversational floors. The flow of dialogue from participant to DM, DM to RN, and subsequent feedback to the participant can be seen in table 1.

The corpus comprises 20 participants and about 20 hours of audio, with 3,573 participant utterances (continuous speech) totaling 18,336 words, as well as 13,550 words from DM-Wizard text messages. The corpus includes speech transcriptions from participants as well as the speech of the RN-Wizard. These transcriptions are time-aligned with the DM-Wizard text messages passed either to the participant or to the RN-Wizard.

The corpus also includes a dialogue annotation scheme specific to multi-floor dialogue that identifies initiator intent and signals relations between individual utterances pertaining to that intent (Traum et al., 2018). The design of the existing annotation scheme allows for the characterization of distinct information states by way of sets of participants, participant roles, turn-taking and floor-holding, and other factors (Traum and Larsson, 2003). Transaction units (TU) group utterances from multiple participants and floors into units according to the realization of an initiator's intent, such that all utterances involved in an exchange surrounding the successful execution of a command are grouped and annotated for the relations they hold to one another. Relation types (Rel) signal how utterances within the same TU relate to one another in the context of the ultimate goal of the TU (e.g., "ack-done" in table 1, shortened from "acknowledge-done," signals that an utterance acknowledges completion of a previous utterance; for full details on Rel types, see Traum et al. (2018)). Antecedents (Ant) specify which utterance is related to which. An example of a TU may be seen in table 1. It is notable that the existing annotation scheme highlights dialogue structure and does not provide a markup of the semantic content of participant instructions.

3 Background: Abstract Meaning Representation

The Abstract Meaning Representation (AMR) project (Banarescu et al., 2013) has created a manually annotated "semantics bank" of text drawn from a variety of genres. The AMR project annotations are completed on a sentence-by-sentence basis, where each sentence is represented by a rooted directed acyclic graph (DAG). For ease of creation and manipulation, annotators work with the PENMAN representation of the same information (Penman Natural Language Group, 1989). For example:

(w / want-01
    :ARG0 (d / dog)
    :ARG1 (p / pet-01
        :ARG0 (g / girl)
        :ARG1 d))

Figure 1: PENMAN notation of "The dog wants the girl to pet him."

In neo-Davidsonian fashion (Davidson, 1969; Parsons, 1990), AMR introduces variables (or graph nodes) for entities, events, properties, and states. Leaves are labeled with concepts, so that (d / dog) refers to an instance (d) of the concept dog. Relations link entities, so that (w / walk-01 :location (p / park)) means the walking event (w) has the relation location to the entity, park (p). When an entity plays multiple roles in a sentence (e.g., (d / dog) above), AMR employs re-entrancy in graph notation (nodes with multiple parents) or variable re-use in PENMAN notation.
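To make the graph structure concrete, the PENMAN string in figure 1 can be decoded into its triples programmatically. The following is a minimal sketch, assuming the third-party Python penman library (pip install penman), which is not itself part of this work:

    from collections import Counter

    import penman

    AMR = """
    (w / want-01
        :ARG0 (d / dog)
        :ARG1 (p / pet-01
            :ARG0 (g / girl)
            :ARG1 d))
    """

    graph = penman.decode(AMR)  # parse PENMAN text into a graph object
    print(graph.top)            # root variable: 'w'

    # Each triple is (source, role, target), e.g., ('w', ':ARG0', 'd').
    for triple in graph.triples:
        print(triple)

    # Re-entrancy: a variable that is the target of more than one
    # non-instance relation, like 'd' (the dog is both wanter and pettee).
    targets = Counter(t for _, role, t in graph.triples if role != ':instance')
    print([v for v, n in targets.items() if n > 1])  # ['d']

Here the re-use of the variable d in PENMAN notation surfaces as the node d having two incoming role edges in the graph.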
                    Left floor                                  Right floor
#   Participant              DM → Participant           DM → RN               RN     TU   Ant   Rel
1   move forward 3 feet                                                              1
2                            ok                                                      1    1     ack-wilco
3                                                       move forward 3 feet         1    1     trans-r
4                                                                             done   1    3     ack-done
5                            I moved forward 3 feet                                  1    4     trans-l

Table 1: Example of a Transaction Unit (TU) in the existing corpus dialogue annotation, which contains an instruction initiated by the participant, its translation to a simplified form (DM to RN), and the execution of the instruction and acknowledgement of such by the RN. TU, Ant(ecedent), and Rel(ation type) are indicated in the right columns. (Traum et al., 2018)

AMR concepts are either English words (boy), PropBank (Palmer et al., 2005) rolesets (want-01), or special keywords indicating generic entity types: date-entity, world-region, distance-quantity, etc. In addition to the PropBank lexicon of rolesets, which associate argument numbers (ARG0–6) with predicate-specific semantic roles (e.g., ARG0=wanter in ex. 1), AMR uses approximately 100 relations of its own (e.g., :time, :age, :quantity, :destination, etc.).

The representation captures who is doing what to whom like other semantic role labeling (SRL) schemes (e.g., PropBank (Palmer et al., 2005), FrameNet (Baker et al., 1998; Fillmore et al., 2003), VerbNet (Kipper et al., 2008)), but it also represents other aspects of meaning outside of semantic role information, such as fine-grained quantity and unit information and parthood relations. Also distinguishing it from other SRL schemes, a goal of AMR is to capture core facets of meaning while abstracting away from idiosyncratic syntactic structures; thus, for example, "She adjusted the machine" and "She made an adjustment to the machine" share the same AMR. AMR has been widely used to support NLU, generation, and summarization (Liu et al., 2015; Pourdamghani et al., 2016), machine translation, question answering (Mitra and Baral, 2016), information extraction (Pan et al., 2015), and biomedical text mining (Garg et al., 2016; Rao et al., 2017; Wang et al., 2017).

4 Evaluating AMR Parsers on Human-Robot Dialogue Data

To serve as a conduit for NLU in a dialogue system, the ideal semantic representation would have robust parsers, allowing the representation to be implemented efficiently on a large scale. A variety of parsers have been developed for AMR; two parsers using very different approaches are explored in the sections to follow.

4.1 Parsers

To automatically create AMRs for the human-robot dialogue data, we used two off-the-shelf AMR parsers, JAMR² (Flanigan et al., 2014) and CAMR³ (Wang et al., 2015). JAMR was one of the first AMR parsers and uses a two-part algorithm to first identify concepts and then build the maximum spanning connected subgraph of those concepts, adding in the relations. CAMR, in contrast, starts by obtaining a dependency tree (in this case, using the Charniak parser⁴ and the Stanford CoreNLP toolkit (Manning et al., 2014)) and then applies a series of transformations to the dependency tree, ultimately transforming it into an AMR graph. One strength of CAMR is that the dependency parser is independent of the AMR creation, so a dependency parser that is trained on a larger data set, and is therefore more accurate, can be used. Both JAMR and CAMR learn probabilities from training data in order to apply their algorithms to novel sentences.

²https://github.com/jflanigan/jamr
³https://github.com/c-amr/camr
⁴https://github.com/BLLIP/bllip-parser

4.2 Gold Standard Data Set

In order to evaluate parser performance on our data set, we hand-annotated a subset of the participants' speech in the human-robot dialogue corpus to create a gold standard data set. We focus only on participant language because it is the natural language that the robot will ultimately need to process and act on. This selected subset comprises 10% of participant utterances from one phase of the corpus, including 10 subjects. The resulting sample is 137 sentences, equally distributed across the 10 participants, who tend to have unique speech patterns.
Three expert annotators familiar with both the human-robot dialogue data and AMR independently annotated this sample, obtaining inter-annotator agreement (IAA) scores of .82, .82, and .91 using the Smatch metric.⁵

⁵IAA Smatch scores on AMRs are generally between .7 and .8, depending on the complexity of the data (AMR development group communication, 2014).
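Smatch compares two AMRs by searching for the variable mapping that maximizes the number of matching triples, then computing precision, recall, and F-score over those triples. As a minimal sketch, assuming the pip-installable smatch package and its get_amr_match/compute_f functions (our annotation workflow is not tied to this snippet):

    import smatch

    gold = ("(m / move-01 :mode imperative :ARG0 (y / you) :ARG1 y "
            ":direction (f / forward) "
            ":extent (d / distance-quantity :quant 3 :unit (f2 / foot)))")

    # A hypothetical second annotation that omits the inferred :ARG1 re-entrancy.
    other = ("(m / move-01 :mode imperative :ARG0 (y / you) "
             ":direction (f / forward) "
             ":extent (d / distance-quantity :quant 3 :unit (f2 / foot)))")

    # Smatch searches over variable mappings to maximize matching triples.
    match, test_total, gold_total = smatch.get_amr_match(other, gold)
    precision, recall, f_score = smatch.compute_f(match, test_total, gold_total)
    print(round(precision, 2), round(recall, 2), round(f_score, 2))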
After independent annotations, we collaboratively created the gold standard set. Notable choices made during this process include the treatment of "can you" utterances, re-entry of the subject in commands using motion verbs with independent arguments for the mover and thing-moved, and handling of disfluencies; each is described here.

In "can you" utterances, there is an ambiguity as to whether the utterance is a genuine question of ability or a polite request. This difference determines whether the sentence gets annotated with possible-01 (used in AMR to convey both possibility and ability) or just as a command (figure 2). It also determines whether the robot should respond with a statement of ability or perform the action requested. To resolve this ambiguity, we referred back to the full transcripts of the data and inferred based on context. In our sample, only one of these utterances ("can you go that way") was deemed to be a genuine question of ability, while the remaining 13 (e.g., "can you take a picture," "can you turn to your right") were treated as commands. Those that were commands were annotated with :polite +, in order to preserve what we believe to be the speaker's intention in using the modal "can."

(p / possible-01
    :ARG1 (p2 / picture-01
        :ARG0 (y / you))
    :polarity (a / amr-unknown))

(p / picture-01 :mode imperative
    :polite +
    :ARG0 (y / you))

Figure 2: Two different AMR parses for the utterance "can you take a picture" convey two distinct interpretations of the utterance: the top can be read as "Is it possible for you to take a picture?," the bottom as a command for a picturing event.

With commands like "move" or "turn," it is implied that the robot is the agent impelling motion and the thing being moved. Therefore, we used re-entry in those AMRs to infer the implied "you" as both the :ARG0, mover, and the :ARG1, thing-moved (figure 3). This is consistent with AMR's goal of capturing the meaning of an utterance, independent of syntax: all arguments that can be confidently inferred should be included in the AMR, even if implicit.⁶

⁶See "implicit roles" in the AMR guidelines: https://github.com/amrisi/amr-guidelines/blob/master/amr.md

(m / move-01 :mode imperative
    :ARG0 (y / you)
    :ARG1 y
    :direction (f / forward)
    :extent (d / distance-quantity
        :quant 3
        :unit (f2 / foot)))

Figure 3: Gold standard AMR for "move forward 3 feet," showing the inferred argument "you" as :ARG0 (mover) and :ARG1 (thing-moved).

Although the LDC AMR corpus⁷ does not include speech data, AMR does offer guidance for disfluencies: dropping the disfluent portion of an utterance in favor of representing only the speaker's repair utterance.⁸ We followed this general AMR practice and dropped disfluent speech for the surprisingly infrequent cases of disfluency in our gold standard sample.

⁷https://catalog.ldc.upenn.edu/LDC2017T10
⁸https://www.isi.edu/~ulf/amr/lib/amr-dict.html#disfluency

4.3 Results & Error Analysis

Having created a gold standard sample of our data, we ran both JAMR and CAMR on the same sample and obtained the Smatch scores when compared to the gold standard. As seen in Table 2, CAMR performs better on both precision and recall, thus obtaining the higher F-score. However, compared to their self-reported F-scores (0.58 for JAMR and 0.63 for CAMR) on other corpora, both under-perform on the human-robot dialogue data.

          Precision   Recall   F-score
JAMR      0.27        0.44     0.33
CAMR      0.33        0.51     0.40

Table 2: Parser performance on human-robot dialogue data.
Of the errors present in the parser output, many come from improper handling of light verb constructions, imperatives, inferred arguments, and requests phrased as "can you" questions. "Take a picture" is an example of a frequent light verb construction, in which the verb ("take") is not semantically the main predicating element of the sentence. The correct parse is shown in figure 4, followed by the AMR that both parsers consistently create. They both incorrectly place "take" as the top node (using the grasping, caused-motion sense of the verb), when in reality it should be dropped from the AMR completely, according to AMR practice for light verbs. Light verb constructions occur in 60 utterances in our sample, with 59 of those being some variation on "take a picture" or "take a photo."

(p / picture-01 :mode imperative
    :ARG0 (y / you))

(x1 / take-01
    :ARG1 (x3 / picture))

Figure 4: Gold standard AMR for "take a picture" (top), followed by parser output.

Another common error, shown in figure 4, concerns the notation :mode imperative, used to indicate commands, which is present in 127 sentences within our sample. Despite the prevalence of this feature in our gold standard, it is never present in parser output.

Another problematic omission on the parsers' part is a lack of inferred arguments. As discussed earlier, commands like "move forward 3 feet" have an implied "you" in both the :ARG0 and :ARG1 positions. However, the parsers don't include this variable, instead including only concepts that are explicitly mentioned in the sentence (figure 5).

(x1 / move-01
    :direction (x2 / forward)
    :ARG1 (x4 / distance-quantity
        :unit (f / foot)
        :quant 3))

Figure 5: CAMR output for "move forward 3 feet." The distance is mistaken for the :ARG1 thing-moved, and the implied "you"/robot is omitted. Compare with figure 3.

A fourth error made by the parsers was on "can you" requests. These were consistently handled by both parsers as questions of ability, annotated using possible-01, even when (as was usually the case) the utterances were intended as polite commands (parser output is shown in the top parse of figure 2).
4.4 Discussion

The poor performance of both parsers on the human-robot dialogue data is unsurprising given the significant differences between it and the data the parsers were trained on. For both parsers, training data came from the LDC AMR corpus, made up entirely of written text, mostly newswire.⁹ In contrast, the human-robot dialogue data is transcribed from spoken utterances, taken from dialogue that is instructional and goal-oriented. Thus, running the parsers on this data set shows that the differences in domain have significant effects on the parsers and give rise to the systematic errors described above.

⁹https://catalog.ldc.upenn.edu/LDC2017T10

Improvements to the parser output could be obtained even by just adding a few simple heuristics, due to the formulaic nature of our data. Of 137 sample sentences, 25 were "take a picture," so introducing a heuristic specific to that sentence would be a simple way to make several corrections.
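The kind of correction heuristic we have in mind can be illustrated with a short sketch (hypothetical code, not part of our system), covering the two most formulaic fixes: the "take a picture" light verb construction and the missing :mode imperative on utterance-initial motion verbs:

    import re

    PICTURE_AMR = "(p / picture-01 :mode imperative\n    :ARG0 (y / you))"

    COMMAND_VERBS = ("move", "turn", "go", "drive", "face")

    def correct_parse(utterance: str, parser_amr: str) -> str:
        text = utterance.lower().strip()

        # Light verb construction: "take" is not the true predicate, so the
        # parse is replaced wholesale by the gold pattern (figure 4, top).
        if re.fullmatch(r"take a (picture|photo)", text):
            return PICTURE_AMR

        # Utterance-initial motion verb: the parsers never emit
        # :mode imperative, so splice it in after the top concept.
        if text.startswith(COMMAND_VERBS) and ":mode imperative" not in parser_amr:
            return re.sub(r"^\((\S+ / \S+)", r"(\1 :mode imperative",
                          parser_amr, count=1)

        return parser_amr  # otherwise leave the parse unchanged

Applied to the CAMR output in figure 5, the second rule at least recovers the command reading, though the inferred "you" arguments would still require the retraining discussed next.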
To obtain broader improvements, however, it is clear that it will be necessary to retrain the parsers on in-domain data. Given that retraining requires a corpus of hand-annotated data, this gives us an opportunity to examine the current features of AMR in relation to our collaborative human-robot dialogue domain, and to explore possible additions to the annotation scheme to ensure that all elements of meaning essential to our domain have coverage. The findings of this analysis are described in the sections to follow.

5 Evaluating Semantic Coverage & Distinctions of AMR

We assess the adequacy of AMR for its use in an NLU component of a human-robot dialogue system both on theoretical grounds and in light of the results and error analysis presented in §4. To our knowledge, our research is the first to employ AMR to capture the semantics of spoken language; existing corpora are otherwise text-based.¹⁰ Here, we discuss the characteristics of our data relevant to semantic representation and highlight specific challenges and areas of interest that we hope to address with AMR (§5.1); explore how to leverage AMR for these purposes by identifying (§5.2) and remedying (§5.3) gaps in existing AMR; and conclude with discussion (§5.4).

¹⁰Data released at https://amr.isi.edu/download/amr-bank-struct-v1.6.txt (Little Prince) and LDC corpus (footnote 5).

5.1 Challenges of the Data

Our goal in introducing AMR is to bridge the gap between what is annotated currently as dialogue structure (Traum et al., 2018) and the semantic content of the utterances that comprise such dialogue (not included in the current scheme). This goal follows from the understanding that dialogue acts are composed of two primary components: (i) semantic content, identifying the entities, events, and propositions relevant to the dialogue; and (ii) communicative function, identifying the ways an addressee may use semantic content to update the information state (Bunt et al., 2012). The existing dialogue structure annotation scheme of Traum et al. (2018) distinguishes two primary levels of pragmatic meaning important to dialogue that our research aims to maintain. The first, intentional structure (Grosz and Sidner, 1986), is equivalent to a TU¹¹: all utterances that explicate and address an initiator's intent. The second, interactional structure, captures how the information state of participants in the dialogue is updated as the TU is constructed (Traum et al., 2018). These two levels of meaning stand apart from the basic compositional meaning of their associated utterances. We seek to represent these pragmatic levels of meaning and link them to their respective semantic forms.

¹¹Transaction Unit is described in §2.

We also seek to represent the temporal, aspectual, and veridical/modal nature of the robot's actions. Human instructions must be interpreted for actions: the robot may respond to such an instruction by asserting whether or not the instruction is possible given the robot's capabilities and the surrounding physical environment, and the robot may also communicate whether it is in the process of completing or has completed the desired instruction. Such information is implicitly linked to the intentional and interactional structure of the dialogue. For example, the act of giving a command implies that an event (if it occurs) will happen in the future; the act of asserting that an event has occurred signals that the event is past.

The representation of space and of the specific parameters that contribute to the robot's understanding of how human language maps onto its physical environment is also of interest to our work here.¹² As its capabilities are presented to the participant, the 'robot' in this research is capable of performing low-level actions with specific endpoints or goals: for example, "move five feet forward," "face the doorway on your left," and "get closer to that orange object." This robot cannot successfully perform instructions that have no clear endpoint: "move forward" and "turn" will trigger clarification requests for further specification. In our planned system, however, we would ultimately like to give the robot an instruction such as "Robot, explore this space and tell me if anyone has been here recently." If the robot can learn to decompose an instruction such as explore into smaller actions, as well as how to identify signs of previous inhabitants, such instructions may become feasible.

¹²Though this mapping is outside the scope of our work, we see AMR as contributing substantially richer semantic forms to the NL planning research in robotics of others, such as Howard et al. (2014), that entails such grounding, the process of assigning physical meanings to NL expressions.

Finally, much of the semantic content of the data in our experiment must be situated within the dialogue context to be properly interpreted. A command of "Do that again" is ambiguous in terms of which action it refers back to. Similarly, a negative command such as "No, go to the doorway on the left" negates specific information contained in a previous command. Implicit in natural language input, as well, are subtle event-event temporal relations, such as "Move forward and take a picture" (sequential actions) and "Turn around and take a picture every 45 degrees" (simultaneous actions).

5.2 Gaps in AMR

We focus on three elements of meaning crucial to human-robot dialogue and currently lacking in AMR: (i) speaker intended meaning (as differing from the compositional meaning of the speaker's utterances);¹³ (ii) tense and aspect (and veridicality, by association); and (iii) spatial parameters necessary for the robot to successfully execute instructions within its physical environment.

¹³Speaker vs. compositional meaning is arguably annotated inconsistently in the current AMR corpus; see https://www.isi.edu/~ulf/amr/lib/amr-dict.html#pragmatics for some specific guidelines, but note that the released annotations seem to differ from this guidance in places.

The goal of the AMR project is to represent meaning, but whether such meaning is purely semantic or also captures a speaker's intended meaning is not specified. Here, we attempt to strike a balance between capturing speaker intended meaning and overspecifying utterances. We do this to stay faithful to existing experimental dialogue annotation practices, and to enable the robot to generalize the connections between such intention and underlying semantic content.
To illustrate the need for adding tense and aspect information to existing AMRs, compare the following utterances: (i) "move forward five feet" (uttered by the human participant); (ii) "moving forward five feet..." (sent via text by the robot to signal initiation of the instruction); and (iii) "I moved forward five feet" (sent by the robot upon completion of the action). Although distinctions between these three utterances are critical for our domain, AMR represents all three the same way:¹⁴

¹⁴Utterance (i) would additionally annotate (m / move) with :mode imperative to signal a command. As noted in §4, parsers rarely capture this information.

(m / move-01
    :ARG1 (r / robot)
    :extent (d / distance-quantity
        :quant 5
        :unit (f2 / foot))
    :direction (f / forward))

Figure 6: Without tense and aspect representation, current AMR conflates commands to move forward and assertions of ongoing and/or completed motion.

Spatial parameters of actions are represented in current AMR through the use of predicate-specific ARG numbers, as outlined in the PropBank rolesets, or with the use of AMR relations, such as :path and :destination. Whether a relation or an argument number is used, and which argument number, is specific to a predicate and therefore inconsistent across motion relations. We aim to make the representation of these parameters more consistent and to enrich them with information about which are required, which are optional, and which might have assumed, default interpretations.

5.3 Proposed Refinements

We leverage the existing dialogue annotations to extract intentional and interactional meaning and, where relevant, map these annotations to the proposed refinements described below.

Speech Acts. We add a higher level of pragmatic meaning to the propositional content represented in the AMR through frames that correspond to speech acts (Austin, 1975; Searle, 1969). We use the model of vocatives in existing AMR as our guide.¹⁵ We also make use of the existing dialogue structure annotation scheme for our corpus, which identifies dialogue types similar to speech acts. Using this scheme as a starting point, we create 36 unique speech acts and corresponding AMR frames: this consists of six general dialogue types (command, assert, request, question, evaluate, express) with 5 to 12 subtypes each (e.g., move, send-image). An example of a speech act template for command:move may be seen in figure 7.¹⁶ Notably, by adding a layer of speaker-intended meaning to the content of the proposition itself, we are able to capture participant roles within the speech act (:ARG0 and :ARG1 of command-02). Future work will be able to reference these roles and model how participant relations evolve over the course of the discourse (Allwood et al., 2000).

¹⁵https://www.isi.edu/~ulf/amr/lib/amr-dict.html#vocative
¹⁶Annotation of tense and aspect, and the need for the extra relations, is explained in what follows.

(c / command-02
    :ARG0-commander
    :ARG1-impelled agent
    :ARG2 (g / go-02 :completable +
        :ARG0-goer
        :ARG1-extent
        :ARG3-start point
        :ARG4-end point
        :path
        :direction
        :time (a / after
            :op1 (n / now))))

Figure 7: Speech act template for command:move. Arguments and relations in italics are filled in from context and the utterance.

Tense and Aspect. We also adopt the annotation scheme proposed by Donatelli et al. (2018) for augmenting AMR with tense and aspect. This scheme identifies the temporal nature of an event, relative to the dialogue act in which it is expressed, as past, present, or future. The aspectual nature of the event can be specified as atelic (:ongoing -/+), telic and hypothetical (:ongoing -, :completable +), telic and in progress (:ongoing +, :complete -), or telic and complete (:ongoing -, :complete +). A telic and hypothetical event representation can be seen in figure 7. This tense/aspect annotation scheme is specific to AMR and coarse-grained in nature, but through its use of existing AMR relations (:time, before, after, now, :op), it can be adapted to finer-grained temporal relations in future work.
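As a concrete illustration, the following sketch (our reading of the scheme, expressed in hypothetical Python form rather than any official notation) assigns the annotations that would distinguish the three utterances conflated in figure 6:

    ANNOTATIONS = {
        # (i) command: future, telic and hypothetical
        "move forward five feet":
            {":time": "(a / after :op1 (n / now))",
             ":ongoing": "-", ":completable": "+"},
        # (ii) feedback during execution: present, telic and in progress
        "moving forward five feet...":
            {":time": "(n / now)",
             ":ongoing": "+", ":complete": "-"},
        # (iii) feedback upon completion: past, telic and complete
        "I moved forward five feet":
            {":time": "(b / before :op1 (n / now))",
             ":ongoing": "-", ":complete": "+"},
    }

    def augment(base_amr: str, utterance: str) -> str:
        """Append tense/aspect relations to the shared base AMR of figure 6."""
        extra = "".join(f"\n    {rel} {val}"
                        for rel, val in ANNOTATIONS[utterance].items())
        return base_amr.rstrip()[:-1] + extra + ")"

With these additions, the three previously identical AMRs become distinct, so the robot can tell a command apart from a progress report or a completion report.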
Spatial Parameters. As seen in figure 7, all arguments of go-02, as well as additional relations, are made use of in the template. These positions correspond to parameters of the action, needed so the robot may carry out the action successfully. Having a template for each domain-relevant speech act will allow us to specify required and optional parameters for robot operationalization. In figure 7, this is either :ARG1 or :ARG4, given that the robot currently needs an endpoint for each action; if both arguments are empty, this ought to trigger a request for further information by the robot, as sketched below. Note also that the same command:move AMR (including go-02) would be implemented for all realizations of commands for movement (e.g., move, go, drive), whereas these would receive distinct AMRs, headed by each individual predicate, under current practices.
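A sketch of how such a template could drive this clarification behavior (hypothetical data structures, not our implementation):

    ENDPOINT_SLOTS = (":ARG1", ":ARG4")  # extent or end point: one is required

    def missing_endpoint(filled_slots: dict) -> bool:
        """True if the instruction under-specifies where the motion ends."""
        return not any(filled_slots.get(slot) for slot in ENDPOINT_SLOTS)

    # "move forward" fills only :direction, so clarification is triggered.
    parsed = {":ARG0": "you", ":direction": "forward"}
    if missing_endpoint(parsed):
        print("How far should I move?")

    # "move forward 3 feet" fills :ARG1 (extent) and is executable as-is.
    parsed = {":ARG0": "you", ":direction": "forward", ":ARG1": "3 feet"}
    assert not missing_endpoint(parsed)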
5.4 Discussion

The refinements we present to the existing AMR aim to mimic a conservative learning process: we seek to provide just enough pragmatic meaning to assist robot understanding of natural language, but we do not provide specification to the point that there will be over-fitting in the form of one-to-one mappings of semantic content and pragmatic effect. As research continues and robot capabilities expand, we expect to augment the robot's linguistic knowledge based on general patterns in the current annotation scheme.

6 Related Work

6.1 Semantic Representation

There is a long-standing tradition of research in semantic representation within NLP and AI, as well as in theoretical linguistics and philosophy (see Schubert (2015) for an overview). In this body of research, there are a variety of options that could be used within dialogue systems for NLU. However, for many of these representations, there are no existing automatic parsers, limiting their feasibility for larger-scale implementation. A notable exception is combinatory categorial grammar (CCG) (Steedman and Baldridge, 2009); CCG parsers have already been incorporated in some current dialogue systems (Chai et al., 2014). Although promising, CCG parses closely mirror the input language, so systems making use of CCG parses still face the challenge of a great deal of linguistic variability that can be associated with a single intent. Again, in abstracting away from surface variation, AMR may offer more regular, consistent parses in comparison to CCG. Universal Conceptual Cognitive Annotation (UCCA) (Abend and Rappoport, 2013), which also abstracts away from syntactic idiosyncrasies, and its corresponding parser (Hershcovich et al., 2017) merit future investigation.

6.2 NLU in Dialogue Systems

Task-oriented spoken dialogue systems have been an active area of research since the early 1990s. Broadly, the architecture of such systems includes (i) automatic speech recognition (ASR) to recognize an utterance, (ii) an NLU component to identify the user's intent, and (iii) a dialogue manager to interact with the user and achieve the intended task (Bangalore et al., 2006). The meaning representation within such systems has, in the past, been predefined frames for particular subtasks (e.g., flight inquiry), with slots to be filled (e.g., destination city) (Issar and Ward, 1993). In such approaches, the meaning representation was crafted for a specific application, making generalizability to new domains difficult if not impossible. Current approaches still model NLU as a combination of intent and dialogue act classification and slot tagging, but many have begun to incorporate recurrent neural networks (RNNs) and some multi-task learning for both NLU and dialogue state tracking (Hakkani-Tür et al., 2016; Chen et al., 2016), the latter of which allows the system to take advantage of information from the discourse context to achieve improved NLU. Substantial challenges to these systems include working in domains with intents that have a large number of possible values for each slot and accommodation of out-of-vocabulary slot values (i.e., operating in a domain with a great deal of linguistic variability).
Thus, a primary challenge today and in the past is representing the meaning of an utterance in a form that can exploit the constraints of a particular domain but also remain portable across domains and robust despite linguistic variability. We see AMR as promising because the parsers are domain-independent (and can be retrained), and the representation itself is flexible enough for the addition of some domain-specific constraints. Furthermore, since AMR abstracts away from syntactic variability to represent only core elements of meaning, some of the variability in the input language can be "tamed," giving systems more systematic input. With the proposed addition of speech acts to AMR described in §5.3, the augmented AMRs also facilitate dialogue state tracking.

Although human-robot dialogue systems often leverage a similar architecture to that of the spoken dialogue systems described above, human-robot dialogue introduces the challenge of physically situated dialogue and the necessity of symbol and action grounding, which generally incorporates computer vision. Few systems are tackling all of these challenges at this point (but see Chai et al. (2017)). The preliminary human-robot dialogue system developed under the umbrella of this project, and where this research might fit into that system, is described in the next section.

7 Conclusions & Future Work

Overall, we find results to be mixed on the feasibility of AMR for NLU within a human-robot dialogue system. On the one hand, AMR is attractive given that there are a variety of relatively robust parsers available for it, making implementation on a larger scale feasible. However, our evaluation of two parsers on the human-robot dialogue data demonstrates that retraining on domain-relevant data is necessary, and this will require a certain amount of manual annotation. Furthermore, our assessment of the distinctions made in AMR reveals gaps that must be addressed for effective use in collaborative human-robot search and navigation. Nonetheless, these AMR refinements are tractable and may also be valuable to the broader community.

Thus, we have several paths forward in ongoing and future work. First, we plan to use heuristics and manual corrections to CAMR parser output to create a larger in-domain training set following existing AMR guidelines. We plan to combine this training set with other AMRs from various human-agent dialogue data sets being annotated in parallel with this work. In addition to expanding the training set for dialogue, this will allow us to explore the extent to which our findings with respect to AMR gaps may also apply to other human-agent dialogue domains.

Second, we will consider how to implement AMR into the existing, preliminary dialogue system called 'Scout Bot,' which has been developed as part of our larger research project (Lukin et al., 2018). For NLU, Scout Bot makes use of the NPCEditor (Leuski and Traum, 2011), a statistical classifier that learns a mapping from inputs to outputs. Currently, the NPCEditor relies on string divergence measures to associate an instruction with either a text version to be sent forward to the RN-Wizard or a clarification question to be returned to the participant. However, some of the challenging cases we analyzed in §5.1 suggest that an intermediate semantic representation will be needed within the NLU phase. Specifically, because the instructions must be grounded within physical surroundings and with respect to an executable set of robot actions, a semantic representation provides the structure needed to interpret novel instructions as well as to ground instructions in novel physical contexts. Error analysis has demonstrated that the current Scout Bot system, by simply learning an association between an input string and a particular set of executed actions, cannot generalize to unseen, novel input instructions (e.g., "Turn left 100 degrees," as opposed to a more typical number of degrees like 90) and fails to interpret instructions with respect to the current physical surroundings (e.g., the destination of "Move to the door on the left" will be interpreted differently depending on where the robot is facing). The structure of the semantic representation provided by AMR will allow the system to interpret 100 degrees as a novel extent of turning, and will allow destination slots like "door on the left" to be grounded to a location in the current physical context with the help of the robot's sensors.
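To illustrate (with a hypothetical mapping, not Scout Bot's actual code) why the structured representation generalizes where string matching fails, note that the novel extent "100 degrees" fills the same slot as a typical "90 degrees":

    import re

    def turn_parameters(amr: str) -> dict:
        """Pull direction and extent out of a turn command's AMR string."""
        direction = re.search(r":direction \(\w+ / (\w+)\)", amr)
        quant = re.search(r":quant (\d+)", amr)
        return {"action": "turn",
                "direction": direction.group(1) if direction else None,
                "degrees": int(quant.group(1)) if quant else None}

    # Assumed AMR rendering of "turn left 100 degrees".
    AMR = """(t / turn-01 :mode imperative
        :ARG0 (y / you)
        :direction (l / left)
        :extent (q / quantity :quant 100 :unit (d / degree)))"""

    print(turn_parameters(AMR))
    # {'action': 'turn', 'direction': 'left', 'degrees': 100}

A destination slot like "door on the left" would be handled analogously, except that its value would be passed to the robot's sensors for grounding rather than read off as a number.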
Thus, in future iterations of the dialogue system incorporating AMR, we will retrain or reformulate the NPCEditor to take the automatic AMR parses as input and to output the in-domain AMR templates described in §5.3. The Dialogue Manager will either act upon these templates with a response/question to the participant or pass the domain-specific AMR along to be mapped to the behavior specification of the robot for execution. Specific steps on this research trajectory will include (i) development of graph-to-graph transformations to map parser output to the domain-refined AMRs, and (ii) an assessment of how well the domain-refined AMRs map to a specific robot planning and behavior specification, which will facilitate determining what other refinements may be necessary to effectively bridge from natural language instructions to robot execution.
References

Omri Abend and Ari Rappoport. 2013. Universal Conceptual Cognitive Annotation (UCCA). In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 228–238.

Jens Allwood, David Traum, and Kristiina Jokinen. 2000. Cooperation, dialogue and ethics. International Journal of Human-Computer Studies, 53(6):871–914.

John Langshaw Austin. 1975. How to Do Things with Words, volume 88. Oxford University Press.

Collin F. Baker, Charles J. Fillmore, and John B. Lowe. 1998. The Berkeley FrameNet project. In Proc. of the 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics - Volume 1, pages 86–90. Association for Computational Linguistics.

Laura Banarescu, Claire Bonial, Shu Cai, Madalina Georgescu, Kira Griffitt, Ulf Hermjakob, Kevin Knight, Philipp Koehn, Martha Palmer, and Nathan Schneider. 2013. Abstract Meaning Representation for sembanking. In Proceedings of the 7th Linguistic Annotation Workshop and Interoperability with Discourse, pages 178–186.

Srinivas Bangalore, Dilek Hakkani-Tür, and Gokhan Tur. 2006. Introduction to the special issue on spoken language understanding in conversational systems. Speech Communication, 48(3):233–238.

Claire Bonial, Stephanie Lukin, Ashley Foots, Cassidy Henry, Matthew Marge, Kimberly Pollard, Ron Artstein, David Traum, and Clare R. Voss. 2018. Human-robot dialogue and collaboration in search and navigation. In Proceedings of the Annotation, Recognition and Evaluation of Actions (AREA) Workshop at LREC 2018.

Harry Bunt, Jan Alexandersson, Jae-Woong Choe, Alex Chengyu Fang, Koiti Hasida, Volha Petukhova, Andrei Popescu-Belis, and David R. Traum. 2012. ISO 24617-2: A semantically-based standard for dialogue annotation. In LREC, pages 430–437.

Joyce Y. Chai, Rui Fang, Changsong Liu, and Lanbo She. 2017. Collaborative language grounding toward situated human-robot dialogue. AI Magazine, 37(4):32.

Joyce Y. Chai, Lanbo She, Rui Fang, Spencer Ottarson, Cody Littley, Changsong Liu, and Kenneth Hanson. 2014. Collaborative effort towards common ground in situated human-robot dialogue. In Proceedings of the 2014 ACM/IEEE International Conference on Human-Robot Interaction, pages 33–40. ACM.

Yun-Nung Chen, Dilek Hakkani-Tür, Gökhan Tür, Jianfeng Gao, and Li Deng. 2016. End-to-end memory networks with knowledge carryover for multi-turn spoken language understanding. In INTERSPEECH, pages 3245–3249.

Donald Davidson. 1969. The individuation of events. In Essays in Honor of Carl G. Hempel, pages 216–234. Springer.

Lucia Donatelli, Michael Regan, William Croft, and Nathan Schneider. 2018. Annotation of tense and aspect semantics for sentential AMR. In Proceedings of the Joint Workshop on Linguistic Annotation, Multiword Expressions and Constructions, Santa Fe, New Mexico, USA. Association for Computational Linguistics.

Charles J. Fillmore, Christopher R. Johnson, and Miriam R. L. Petruck. 2003. Background to FrameNet. International Journal of Lexicography, 16(3):235–250.

Jeffrey Flanigan, Sam Thomson, Jaime Carbonell, Chris Dyer, and Noah A. Smith. 2014. A discriminative graph-based parser for the Abstract Meaning Representation. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1426–1436.

Sahil Garg, Aram Galstyan, Ulf Hermjakob, and Daniel Marcu. 2016. Extracting biomolecular interactions using semantic parsing of biomedical text. In Proc. of AAAI, Phoenix, Arizona, USA.

Barbara J. Grosz and Candace L. Sidner. 1986. Attention, intentions, and the structure of discourse. Computational Linguistics, 12(3):175–204.

Dilek Hakkani-Tür, Gökhan Tür, Asli Celikyilmaz, Yun-Nung Chen, Jianfeng Gao, Li Deng, and Ye-Yi Wang. 2016. Multi-domain joint semantic frame parsing using bi-directional RNN-LSTM. In INTERSPEECH, pages 715–719.

Daniel Hershcovich, Omri Abend, and Ari Rappoport. 2017. A transition-based directed acyclic graph parser for UCCA. arXiv preprint arXiv:1704.00552.

Thomas Howard, Stephanie Tellex, and Nicholas Roy. 2014. A natural language planner interface for mobile manipulators. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA 2014).

Sunil Issar and Wayne Ward. 1993. CMU's robust spoken language understanding system. In Third European Conference on Speech Communication and Technology.

Karin Kipper, Anna Korhonen, Neville Ryant, and Martha Palmer. 2008. A large-scale classification of English verbs. Language Resources and Evaluation, 42(1):21–40.
Anton Leuski and David Traum. 2011. NPCEditor: Creating virtual human dialogue using information retrieval techniques. AI Magazine, 32(2):42–56.

Fei Liu, Jeffrey Flanigan, Sam Thomson, Norman Sadeh, and Noah A. Smith. 2015. Toward abstractive summarization using semantic representations. In Proc. of NAACL.

Stephanie M. Lukin, Felix Gervits, Cory J. Hayes, Anton Leuski, Pooja Moolchandani, John G. Rogers III, Carlos Sanchez Amaro, Matthew Marge, Clare R. Voss, and David Traum. 2018. ScoutBot: A dialogue system for collaborative navigation. arXiv preprint arXiv:1807.08074.

Christopher Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven Bethard, and David McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 55–60.

Matthew Marge, Claire Bonial, Brendan Byrne, Taylor Cassidy, A. William Evans, Susan G. Hill, and Clare Voss. 2016a. Applying the Wizard-of-Oz technique to multimodal human-robot dialogue. In Proc. of RO-MAN.

Matthew Marge, Claire Bonial, Kimberly A. Pollard, Ron Artstein, Brendan Byrne, Susan G. Hill, Clare Voss, and David Traum. 2016b. Assessing agreement in human-robot dialogue strategies: A tale of two wizards. In International Conference on Intelligent Virtual Agents, pages 484–488. Springer.

Arindam Mitra and Chitta Baral. 2016. Addressing a question answering challenge by combining statistical methods with inductive rule learning and reasoning. In Proc. of AAAI, pages 2779–2785.

Martha Palmer, Daniel Gildea, and Paul Kingsbury. 2005. The Proposition Bank: An annotated corpus of semantic roles. Computational Linguistics, 31(1):71–106.

Xiaoman Pan, Taylor Cassidy, Ulf Hermjakob, Heng Ji, and Kevin Knight. 2015. Unsupervised entity linking with Abstract Meaning Representation. In Proc. of HLT-NAACL, pages 1130–1139.

Terence Parsons. 1990. Events in the Semantics of English, volume 5. MIT Press, Cambridge, MA.

Penman Natural Language Group. 1989. The Penman user guide. Technical report, Information Sciences Institute.

Nima Pourdamghani, Kevin Knight, and Ulf Hermjakob. 2016. Generating English from Abstract Meaning Representations. In Proc. of INLG, pages 21–25.

Sudha Rao, Daniel Marcu, Kevin Knight, and Hal Daumé III. 2017. Biomedical event extraction using Abstract Meaning Representation. In Proc. of BioNLP, pages 126–135, Vancouver, Canada.

Lenhart K. Schubert. 2015. Semantic representation. In AAAI, pages 4132–4139.

John Rogers Searle. 1969. Speech Acts: An Essay in the Philosophy of Language, volume 626. Cambridge University Press.

Lanbo She and Joyce Chai. 2017. Interactive learning of grounded verb semantics towards human-robot communication. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1634–1644, Vancouver, Canada. Association for Computational Linguistics.

Mark Steedman and Jason Baldridge. 2009. Combinatory categorial grammar. Nontransformational Syntax: A Guide to Current Models, 9:13–67. Blackwell, Oxford.

David Traum, Cassidy Henry, Stephanie Lukin, Ron Artstein, Felix Gervits, Kimberly Pollard, Claire Bonial, Su Lei, Clare Voss, Matthew Marge, Cory Hayes, and Susan Hill. 2018. Dialogue structure annotation for multi-floor interaction. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan. European Language Resources Association (ELRA).

David R. Traum and Staffan Larsson. 2003. The information state approach to dialogue management. In Current and New Directions in Discourse and Dialogue, pages 325–353. Springer.

Chuan Wang, Nianwen Xue, and Sameer Pradhan. 2015. A transition-based algorithm for AMR parsing. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 366–375.

Yanshan Wang, Sijia Liu, Majid Rastegar-Mojarad, Liwei Wang, Feichen Shen, Fei Liu, and Hongfang Liu. 2017. Dependency and AMR embeddings for drug-drug interaction extraction from biomedical literature. In Proc. of ACM-BCB, pages 36–43, New York, NY, USA.