Interactive Task and Concept Learning from Natural Language Instructions and GUI Demonstrations*
Toby Jia-Jun Li¹, Marissa Radensky², Justin Jia¹, Kirielle Singarajah¹, Tom M. Mitchell¹, Brad A. Myers¹
¹Carnegie Mellon University, Pittsburgh, PA
²Amherst College, Amherst, MA

Abstract

Natural language programming is a promising approach to enable end users to instruct new tasks for intelligent agents. However, our formative study found that end users would often use unclear, ambiguous or vague concepts when naturally instructing tasks in natural language, especially when specifying conditionals. Existing systems have limited support for letting users teach agents new concepts or explain unclear concepts. In this paper, we describe a new multi-modal domain-independent approach that combines natural language programming and programming-by-demonstration to allow users to first naturally describe tasks and associated conditions at a high level, and then collaborate with the agent to recursively resolve any ambiguities or vagueness through conversations and demonstrations. Users can also define new procedures and concepts by demonstrating and referring to contents within GUIs of existing mobile apps. We demonstrate this approach in PUMICE, an end-user-programmable agent that implements it. A lab study with 10 users showed its usability.

Introduction

The goal of end user development (EUD) is to empower users with little or no programming expertise to program (Paterno and Wulf 2017). Among many EUD applications, a particularly useful one would be task automation, through which users program intelligent agents to perform tasks on their behalf (Maes 1994). To support such EUD activities, a major challenge is to help non-programmers to specify conditional structures in programs. Many common tasks involve conditional structures, yet they are difficult for non-programmers to correctly specify using existing EUD techniques due to the great distance between how end users think about the conditional structures, and how they are represented in programming languages (Pane, Myers, and others 2001).

According to Green and Petre's cognitive dimensions of notations (Green and Petre 1996), the closer the programming world is to the problem world, the easier the problem-solving ought to be. This closeness of mapping is usually low in conventional and EUD programming languages, as they require users to think about their tasks very differently from how they would think about them in familiar contexts (Pane, Myers, and others 2001), making programming particularly difficult for end users who are not familiar with programming languages and "computational thinking" (Wing 2006). To address this issue, the concept of natural programming (Myers et al. 2017; Myers, Pane, and Ko 2004) has been proposed to create techniques and tools that match more closely the ways users think.

Natural language programming is a promising technique for bridging the gap between user mental models of tasks and programming languages (Mihalcea, Liu, and Lieberman 2006). It should have a low learning barrier for end users, under the assumption that the majority of end users can already communicate procedures and structures for familiar tasks through natural language conversations (Lieberman and Liu 2006; Pane, Myers, and others 2001). Speech is also a natural input modality for users to describe desired program behaviors (Oviatt 1999b). However, existing natural language programming systems are not adequate for supporting end user task automation in domain-general tasks. Some prior systems (e.g., (Price et al. 2000)) directly translate user instructions in natural language into conventional programming languages like Java. This approach requires users to use unambiguous language with fixed structures similar to those in conventional programming languages. Therefore, it does not match the user's existing mental model of tasks, imposing significant learning barriers and high cognitive demands on end users.

Other natural language programming approaches (e.g., (Azaria, Krishnamurthy, and Mitchell 2016; Fast et al. 2018; Kate, Wong, and Mooney 2005; Srivastava, Labutov, and Mitchell 2017)) restricted the problem space to specific task domains, so that they could constrain the space and the complexity of target program statements in order to enable the understanding of flexible user utterances. Such restrictions are due to the limited capabilities of existing natural language understanding techniques – they do not yet support robust understanding of utterances across diverse domains without extensive training data and structured prior knowledge within each domain.

* To be presented at the AAAI-20 Workshop on Intelligent Process Automation (IPA-20). An earlier version of this paper was published and presented at ACM UIST 2019.
Another difficult problem in natural language programming is supporting the instruction of concepts. In our study (details below in the Formative Study section), we found that end users often refer to ambiguous or vague concepts (e.g., cold weather, heavy traffic) when naturally instructing a task. Moreover, even if a concept may seem clear to a human, an agent may still not understand it due to the limitations in its natural language understanding techniques and pre-defined ontology.

In this paper, we address the research challenge of enabling end users to augment domain-independent task automation scripts with conditional structures and new concepts through a combination of natural language programming and programming by demonstration (PBD). To support programming for tasks in diverse domains, we leverage the graphical user interfaces (GUIs) of existing third-party mobile apps as a medium, where procedural actions are represented as sequences of GUI operations, and declarative concepts can be represented through references to GUI contents. This approach supports EUD for a wide range of tasks, provided that these tasks can be performed with one or more existing third-party mobile apps.

We took a user-centered design approach, first studying how end users naturally describe tasks with conditionals in natural language in the context of mobile apps, and what types of tasks they are interested in automating. Based on insights from this study, we designed and implemented an end-user-programmable conversational agent named PUMICE¹ (Li et al. 2019) that allows end users to program tasks with flexible conditional structures and new concepts across diverse domains through spoken natural language instructions and demonstrations.

¹PUMICE is a type of volcanic rock. It is also an acronym for Programming in a User-friendly Multimodal Interface through Conversations and Examples.

PUMICE extends our previous SUGILITE (Li, Azaria, and Myers 2017) system. A key novel aspect of PUMICE's design is that it allows users to first describe the desired program behaviors and conditional structures naturally in natural language at a high level, and then collaborate with an intelligent agent through multi-turn conversations to explain and to define any ambiguities, concepts and procedures in the initial description as needed in a top-down fashion. Users can explain new concepts by referring to either previously defined concepts, or to the contents of the GUIs of third-party mobile apps. Users can also define new procedures by demonstrating using third-party apps (Li, Azaria, and Myers 2017). Such an approach facilitates effective program reuse in automation authoring, and provides support for a wide range of application domains, which are two major challenges in prior EUD systems. The results from the motivating study suggest that this paradigm is not only feasible, but also natural for end users, which was supported by our summative lab usability study.

We build upon recent advances in natural language processing (NLP) to allow PUMICE's semantic parser to learn from users' flexible verbal expressions when describing desired program behaviors. Through PUMICE's mixed-initiative conversations with users, an underlying persistent knowledge graph is dynamically updated with new procedural (i.e., actions) and declarative (i.e., concepts and facts) knowledge introduced by users, allowing the semantic parser to improve its understanding of user utterances over time. This structure also allows for effective reuse of user-defined procedures and concepts at a fine granularity, reducing user effort in EUD.

PUMICE presents a multi-modal interface, through which users interact with the system using a combination of demonstrations, pointing, and spoken commands. Users may use any modality that they choose, so they can leverage their prior experience to minimize necessary training (MacLellan et al. 2018; Laird et al. 2017). This interface also provides users with guidance through a mix of visual aids and verbal directions through various stages in the process to help users overcome common challenges and pitfalls identified in the formative study, such as the omission of else statements, the difficulty in finding correct GUI objects for defining new concepts, and the confusion in specifying proper data descriptions for target GUI objects. A summative lab usability study with 10 participants showed that users with little or no prior programming expertise could use PUMICE to program automation scripts for 4 tasks derived from real-world scenarios. Participants also found PUMICE easy and natural to use.

This paper presents the following three primary contributions:

1. A formative study showing the characteristics of end users' natural language instructions for tasks with conditional structures in the context of mobile apps.

2. A multi-modal conversational approach for the EUD of task automation motivated by the aforementioned formative study, with the following major advantages:

(a) The top-down conversational structure allows users to naturally start with describing the task and its conditionals at a high level, and then recursively clarify ambiguities, explain unknown concepts and define new procedures through conversations.

(b) The agent learns new procedural and declarative knowledge through explicit instructions from users, and stores them in a persistent knowledge graph, facilitating effective reusability and generalizability of learned knowledge.

(c) The agent learns concepts and procedures in various task domains while having a low learning barrier through its multi-modal approach that supports references and demonstrations using the contents of third-party apps' GUIs.

3. The PUMICE system: an implementation of this approach, along with a user study evaluating its usability.

Background and Related Work

This research builds upon prior work from many different sub-disciplines across human-computer interaction (HCI), software engineering (SE), and natural language processing (NLP).
In this section, we focus on related work on three topics: (1) natural language programming; (2) programming by demonstration; and (3) the multi-modal approach that combines natural language inputs with demonstrations.

Natural Language Programming

PUMICE uses natural language as the primary modality for users to program task automation. The idea of using natural language inputs for programming has been explored for decades (Ballard and Biermann 1979; Biermann 1983). In the NLP and AI communities, this idea is also known as learning by instruction (Azaria, Krishnamurthy, and Mitchell 2016; Lieberman and Maulsby 1996).

The foremost challenge in supporting natural language programming is dealing with the inherent ambiguities and vagueness in natural language (Vadas and Curran 2005). To address this challenge, one prior approach was to constrain the structures and expressions in the user's language to similar formulations of conventional programming languages (e.g., (Ballard and Biermann 1979; Price et al. 2000)), so that user inputs can be directly translated into programming statements. This approach is not adequate for EUD, as it has a high learning barrier for users without programming expertise.

Another approach for handling ambiguities and vagueness in natural language inputs is to seek user clarification through conversations. For example, Iris (Fast et al. 2018) asks follow-up questions and presents possible options through conversations when the initial user input is incomplete or unclear. This approach lowers the learning barrier for end users, as it does not require them to clearly define everything up front. It also allows users to form complex commands by combining multiple natural language instructions in conversational turns under the guidance of the system. PUMICE also adopts the use of multi-turn conversations as a key strategy in handling ambiguities and vagueness in user inputs. However, a key difference between PUMICE and other conversational instructable agents is that PUMICE is domain-independent. All conversational instructable agents need to map the user's inputs onto existing concepts, procedures and system functionalities supported by the agent, and to have natural language understanding mechanisms and training data in each task domain. Because of this constraint, existing agents limit their supported tasks to one or a few pre-defined domains, such as data science (Fast et al. 2018), email processing (Azaria, Krishnamurthy, and Mitchell 2016; Srivastava, Labutov, and Mitchell 2017), or database queries (Kate, Wong, and Mooney 2005).

PUMICE supports learning concepts and procedures from existing third-party mobile apps regardless of the task domains. End users can create new concepts with PUMICE by referencing relevant information shown in app GUIs, and define new procedures by demonstrating with existing apps. This approach allows PUMICE to support a wide range of tasks from diverse domains as long as the corresponding mobile apps are available. This approach also has a low learning barrier because end users are already familiar with the capabilities of mobile apps and how to use them. In comparison, with prior instructable agents, it is often unclear what concepts, procedures and functionalities already exist to be used as "building blocks" for developing new ones.

Programming by Demonstration

PUMICE uses the programming by demonstration (PBD) technique to enable end users to define concepts by referring to the contents of GUIs of third-party mobile apps, and teach new procedures through demonstrations with those apps. PBD is a natural way of supporting EUD with a low learning barrier (Cypher and Halbert 1993; Lieberman 2001). Many domain-specific PBD tools have been developed in the past in various domains, such as text editing (e.g., (Lau et al. 2001)), photo editing (e.g., (Grabler et al. 2009)), web scraping (e.g., (Chasins, Mueller, and Bodik 2018)), smart home control (e.g., (Li et al. 2017)) and robot control (e.g., (Argall et al. 2009)).

PUMICE supports domain-independent PBD by using the GUIs of third-party apps for task automation and data extraction. Similar approaches have also been used in prior systems. For example, SUGILITE (Li, Azaria, and Myers 2017), KITE (Li and Riva 2018) and APPINITE (Li et al. 2018a) use mobile app GUIs, CoScripter (Leshed et al. 2008), d.mix (Hartmann et al. 2007), Vegemite (Lin et al. 2009) and PLOW (Allen et al. 2007) use web interfaces, and HILC (Intharah, Turmukhambetov, and Brostow 2019) and Sikuli (Yeh, Chang, and Miller 2009) use desktop GUIs. Compared to those, PUMICE is the only one that can learn concepts as generalized knowledge, and the only one that supports creating conditionals from natural language instructions. Sikuli (Yeh, Chang, and Miller 2009) allows users to create conditionals in a scripting language, which is not suitable for end users without programming expertise.

The Multi-Modal Approach

A central challenge in PBD is generalization. A PBD agent should go beyond literal record-and-replay macros, and be able to perform similar tasks in new contexts (Cypher and Halbert 1993; Lieberman 2001). This challenge is also a part of the program synthesis problem. An effective way of addressing this challenge is through multi-modal interaction (Oviatt 1999b). Demonstrations can clearly communicate what the user does, but not why the user does this and how the user wants to do this in different contexts. On the other hand, natural language instructions can often reflect the user's underlying intent (why) and preferences (how), but they are usually ambiguous or unclear. This is where grounding natural language instructions with concrete GUI demonstrations can help.

This mutual disambiguation approach (Oviatt 1999a) in multi-modal interaction has been proposed and used in many previous systems. This approach leverages repetition in a different modality for mediation (Mankoff, Abowd, and Hudson 2000). Particularly for PBD generalization, SUGILITE (Li, Azaria, and Myers 2017) and PLOW (Allen et al. 2007) use natural language inputs to identify parameterization in demonstrations, APPINITE (Li et al. 2018a) uses natural language explanations of intents to resolve the "data description" (Cypher and Halbert 1993) for demonstrated actions, and MARS (Chen, Martins, and Feng 2019) similarly used multi-layer specifications including the textual descriptions of user intents to disambiguate program synthesis results from user-provided input-output examples in the data science domain.
PUMICE builds upon this prior work, and extends the multi-modal approach to support learning concepts involved in demonstrated tasks. The learned concepts can also be generalized to new task domains, as described in later sections. The prior multi-modal PBD systems also use demonstration as the main modality. In comparison, PUMICE uses natural language conversation as the main modality, and uses demonstration for grounding unknown concepts, values, and procedures after they have been broken down and explained in conversations.

Formative Study

We took a user-centered approach (Myers et al. 2016) for designing a natural end-user development system (Myers et al. 2017; Radensky, Li, and Myers 2018). We first studied how end users naturally communicate tasks with declarative concepts and control structures in natural language for various tasks in the mobile app context through a formative study on Amazon Mechanical Turk with 58 participants (41 of whom were non-programmers; 38 men, 19 women, 1 non-binary person).

Each participant was presented with a graphical description of an everyday task for a conversational agent to complete in the context of mobile apps. All tasks had distinct conditions so that each task should be performed differently under different conditions, such as playing different genres of music based on the time of the day. Each participant was assigned to one of 9 tasks. To avoid biasing the language used in the responses, we used the Natural Programming Elicitation method (Myers et al. 2016) by showing graphical representations of the tasks with limited text in the prompts. Participants were asked how they would verbally instruct the agent to perform the tasks, so that the system could understand the differences among the conditions and what to do in each condition. Each participant was first trained using an example scenario and the corresponding example verbal instructions.

To study whether having mobile app GUIs can affect users' verbal instructions, we randomly assigned participants into one of two groups. For the experimental group, participants instructed agents to perform the tasks while looking at relevant app GUIs. Each participant was presented with a mobile app screenshot with arrows pointing to the screen component that contained the information pertinent to the task condition. Participants in the control group were not shown app GUIs. At the end of each study session, we also asked the participants to come up with another task scenario of their own where an agent should perform differently in different conditions.

The participants' responses were analyzed by two independent coders using open coding (Strauss and Corbin 1990). The inter-rater agreement (Cohen 1960) was κ = 0.87, suggesting good agreement. 19% of responses were excluded from the analysis for quality control due to the lack of effort in the responses, question misunderstandings, and blank responses.

We report the most relevant findings which motivated the design of PUMICE next.

App GUI Grounding Reduces Unclear Concept Usage

We analyzed whether each user's verbal instruction for the task provided a clear definition of the conditions in the task. In the control group (instructing without seeing app screenshots), 33% of the participants used ambiguous, unclear or vague concepts in the instructions, such as "If it is daytime, play upbeat music..." which is ambiguous as to when the user considers it to be "daytime." This is despite the fact that the example instructions they saw had clearly defined conditions.

Interestingly, for the experimental group, where each participant was provided an app screenshot displaying specific information relevant to the task's condition, fewer participants (9%) used ambiguous or vague concepts (this difference is statistically significant with p < 0.05), while the rest clearly defined the condition (e.g., "If the current time is before 7 pm..."). These results suggest that end users naturally use ambiguous and vague concepts when verbally instructing task logic, but showing users relevant mobile app GUIs with concrete instances of the values can help them ground the concepts, leading to fewer ambiguities and vagueness in their descriptions. The implication is that a potentially effective approach to avoiding unclear utterances for agents is to guide users to explain them in the context of app GUIs.

Unmet User Expectation of Common Sense Reasoning

We observed that participants often expected and assumed the agent to have the capability of understanding and reasoning with common sense knowledge when instructing tasks. For example, one user said, "if the day is a weekend". The agent would therefore need to understand the concept of "weekend" (i.e., how to know today's day of the week, and what days count as "weekend") to resolve this condition. Similarly, when a user talked about "sunset time", he expected the agent to know what it meant, and how to find out its value.

However, the capability for common sense knowledge and reasoning is very limited in current agents, especially across many diverse domains, due to the spotty coverage and unreliable inference of existing common sense knowledge systems. Managing user expectation and communicating the agent's capability is also a long-standing unsolved challenge in building interactive intelligent systems (Lieberman et al. 2004). A feasible workaround is to enable the agent to ask users to ground new concepts to existing contents in apps when they come up, and to build up knowledge of concepts over time through its interaction with users.

Frequent Omission of Else Statements

In the study, despite all provided example responses containing else statements, 18% of the 39 descriptions from
users omitted an else statement when it was expected. "Play upbeat music until 8pm every day," for instance, may imply that the user desires an alternative genre of music to be played at other times. Furthermore, 33% omitted an else statement when a person would be expected to infer an else statement, such as: "If a public transportation access point is more than half a mile away, then request an Uber," which implies using public transportation otherwise. This might be a result of the user's expectation of common sense reasoning capabilities. The user omits what they expect the agent can infer to avoid prolixity, similar to patterns in human-human conversations (Grice et al. 1975).

These findings suggest that end users will often omit appropriate else statements in their natural language instructions for conditionals. Therefore, the agent should proactively ask users about alternative situations in conditionals when appropriate.

PUMICE

Motivated by the formative study results, we designed the PUMICE agent that supports understanding ambiguous natural language instructions for task automation by allowing users to recursively define any new, ambiguous or vague concepts in a multi-level top-down process.

Example Scenario

This section shows an example scenario to illustrate how PUMICE works. Suppose a user starts teaching the agent a new task automation rule by saying, "If it's hot, order a cup of Iced Cappuccino." We also assume that the agent has no prior knowledge about the relevant task domains (weather and coffee ordering). Due to the lack of domain knowledge, the agent does not understand "it's hot" and "order a cup of Iced Cappuccino". However, the agent can recognize the conditional structure in the utterance (the parse for Utterance 0 in Figure 1) and can identify that "it's hot" should represent a Boolean expression while "order a cup of Iced Cappuccino" represents the action to perform if the condition is true.

PUMICE's semantic parser can mark unknown parts in user utterances using typed resolve...() functions, as marked in the yellow highlights in the parse for Utterance 0 in Figure 1. The PUMICE agent then proceeds to ask the user to further explain these concepts. It asks, "How do I tell whether it's hot?" since it has already figured out that "it's hot" should be a function that returns a Boolean value. The user answers "It is hot when the temperature is above 85 degrees Fahrenheit.", as shown in Utterance 2 in Figure 1. PUMICE understands the comparison (as shown in the parse for Utterance 2 in Figure 1), but does not know the concept of "temperature", only knowing that it should be a numeric value comparable to 85 degrees Fahrenheit. Hence it asks, "How do I find out the value for temperature?", to which the user responds, "Let me demonstrate for you."
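To make the typed parse concrete, the following is a minimal Python sketch of the structure described above; the class names are hypothetical illustrations of the idea of typed placeholders, not PUMICE's actual DSL:

    # Hypothetical sketch: the parse of Utterance 0, "If it's hot, order a cup
    # of Iced Cappuccino." Unknown parts become typed resolve placeholders
    # that the agent later asks the user to explain.
    from dataclasses import dataclass

    @dataclass
    class ResolveBoolean:      # unknown Boolean concept, e.g. "it's hot"
        utterance: str

    @dataclass
    class ResolveProcedure:    # unknown procedure, e.g. the ordering action
        utterance: str

    @dataclass
    class Conditional:         # "if <condition> then <action>"
        condition: object
        action: object

    utterance_0 = Conditional(
        condition=ResolveBoolean("it's hot"),
        action=ResolveProcedure("order a cup of Iced Cappuccino"),
    )
    print(utterance_0)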
Here the user can demonstrate the procedure of finding the current temperature by opening the weather app on the phone, and pointing at the current reading of the weather. To assist the user, PUMICE uses a visualization overlay to highlight any GUI objects on the screen that fit into the comparison (i.e., those that display a value comparable to 85 degrees Fahrenheit). The user can choose from these highlighted objects (see Figure 2 for an example). Through this demonstration, PUMICE learns a reusable procedure query_Temperature() for getting the current value for the new concept temperature, and stores it in a persistent knowledge graph so that it can be used in other tasks. PUMICE confirms with the user every time it learns a new concept or a new rule, so that the user is aware of the current state of the system, and can correct any errors (see the Error Recovery and Backtracking section for details).

For the next phase, PUMICE has already determined that "order a cup of Iced Cappuccino" should be an action triggered when the condition "it's hot" is true, but does not know how to perform this action (also known as intent fulfillment in chatbots (Li and Riva 2018)). To learn how to perform this action, it asks, "How do I order a cup of Iced Cappuccino?", to which the user responds, "I can demonstrate." The user then proceeds to demonstrate the procedure of ordering a cup of Iced Cappuccino using the existing app for Starbucks (a coffee chain). From the user demonstration, PUMICE can figure out that "Iced Cappuccino" is a task parameter, and can learn the generalized procedure order_Starbucks() for ordering any item in the Starbucks app, as well as a list of available items to order in the Starbucks app by looking through the Starbucks app's menus, using the underlying SUGILITE framework (Li, Azaria, and Myers 2017) for processing the task recording.

Finally, PUMICE asks the user about the else condition by saying, "What should I do if it's not hot?" Suppose the user says "Order a cup of Hot Latte," then the user will not need to demonstrate again because PUMICE can recognize "hot latte" as an available parameter for the known order_Starbucks() procedure.

Design Features

In this section, we discuss several of PUMICE's key design features in its user interactions, and how they were motivated by results of the formative study.

Support for Concept Learning

In the formative study, we identified two main challenges with regard to concept learning. First, users often naturally use intrinsically unclear or ambiguous concepts when instructing intelligent agents (e.g., "register for easy courses", where the definition of "easy" depends on the context and the user preference). Second, users expect agents to understand common-sense concepts that the agents may not know. To address these challenges, we designed and implemented the support for concept learning in PUMICE. PUMICE can detect and learn three kinds of unknown components in user utterances: procedures, Boolean concepts, and value concepts. Because PUMICE's support for procedure learning is unchanged from the underlying SUGILITE mechanisms (Li et al. 2018a; Li, Azaria, and Myers 2017), in this section, we focus on discussing how PUMICE learns Boolean concepts and value concepts.

Figure 1: Example structure of how PUMICE learns the concepts and procedures in the command "If it's hot, order a cup of Iced Cappuccino." The numbers indicate the order of utterances. The screenshot on the right shows the conversational interface of PUMICE. In this interactive parsing process, the agent learns how to query the current temperature, how to order any kind of drink from Starbucks, and the generalized concept of "hot" as "a temperature (of something) is greater than another temperature".

When encountering an unknown or unclear concept in the utterance parsing result, PUMICE first determines the concept type based on the context. If the concept is used as a condition (e.g., "if it is hot"), then it should be of Boolean type. Similarly, if a concept is used where a value is expected (e.g., "if the current temperature is above 70°F" or "set the AC to the current temperature"), then it will be marked as a value concept. Both kinds of concepts are represented as typed resolve() functions in the parsing result (shown in Figure 1), indicating that they need to be further resolved down the line. This process is flexible. For example, if the user clearly defines a condition without introducing unknown or unclear concepts, then PUMICE will not need to ask follow-up questions for concept resolution.

PUMICE recursively executes each resolve() function in the parsing result in a depth-first fashion. After a concept is fully resolved (i.e., all concepts used for defining it have been resolved), it is added to a persistent knowledge graph (details in the System Implementation section), and a link to it replaces the resolve() function. From the user's perspective, when a resolve() function is executed, the agent asks a question to prompt the user to further explain the concept. When resolving a Boolean concept, PUMICE asks, "How do I know whether [concept name]?" For resolving a value concept, PUMICE asks, "How do I find out the value of [concept name]?"
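The recursive resolution process can be sketched as a small depth-first loop; this is an illustrative reconstruction, where ask_user, parse_answer, and knowledge_graph are hypothetical stand-ins for PUMICE's dialog, parser, and storage components:

    # Sketch of depth-first resolution: each placeholder triggers a clarifying
    # question, the answer is parsed into an expression of the same type, any
    # new placeholders inside that expression are resolved recursively, and
    # the fully resolved concept is persisted for later reuse.
    PROMPTS = {
        "boolean": "How do I know whether {}?",
        "value": "How do I find out the value of {}?",
        "procedure": "How do I {}?",
    }

    def resolve(placeholder, ask_user, parse_answer, knowledge_graph):
        answer = ask_user(PROMPTS[placeholder.kind].format(placeholder.utterance))
        expression = parse_answer(answer, expected_type=placeholder.kind)
        expression.children = [
            resolve(child, ask_user, parse_answer, knowledge_graph)
            if child.is_placeholder else child
            for child in expression.children
        ]
        knowledge_graph.add(placeholder.utterance, expression)
        return expression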
To explain a new Boolean concept, the user may verbally refer to another Boolean concept (e.g., "traffic is heavy" means "commute takes a long time") or may describe a Boolean expression (e.g., "the commute time is longer than 30 minutes"). When describing the Boolean expression, the user can use flexible words (e.g., colder, further, more expensive) to describe the relation (i.e., greater than, less than, and equal to). As explained previously, if any new Boolean or value concepts are used in the explanation, PUMICE will recursively resolve them. The user can also use more than one unknown value concept, such as "if the price of an Uber is greater than the price of a Lyft" (Uber and Lyft are both popular ridesharing apps).

Similar to Boolean concepts, the user can refer to another value concept when explaining a value concept. When a value concept is concrete and available in a mobile app, the user can also demonstrate how to query the value through app GUIs. The formative study suggested that this multi-modal approach is effective and feasible for end users. After users indicate that they want to demonstrate, PUMICE switches to the home screen of the phone, and prompts the user to demonstrate how to find out the value of the concept.

To help the user with value concept demonstrations, PUMICE highlights possible items on the current app GUI if the type of the target concept can be inferred from the type of the constant value, or using the type of the value concept to which it is being compared (see Figure 2). For example, in the aforementioned "commute time" example, PUMICE knows that "commute time" should be a duration, because it is comparable to the constant value "30 minutes". Once the user finds the target value in an app, they can long-press on the target value to select it and indicate it as the target value. PUMICE uses an interaction proxy overlay (Zhang et al. 2017) for recording, so that it can record all values visible on the screen, not limited to the selectable or clickable ones. PUMICE can extract these values from the GUI using the screen scraping mechanism in the underlying SUGILITE framework (Li, Azaria, and Myers 2017). Once the target value is selected, PUMICE stores the procedure of navigating to the screen where the target value is displayed and finding the target value on the screen into its persistent knowledge graph as a value query, so that this query can be used whenever the underlying value is needed. After the value concept demonstration, PUMICE switches back to the conversational interface and continues to resolve the next concept if needed.
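A rough sketch of this type-directed highlighting: infer the target type from the constant in the comparison, then keep only the on-screen strings that parse as that type. The regular expressions below are illustrative, not PUMICE's actual extraction rules:

    import re

    TYPE_PATTERNS = {
        "duration": re.compile(r"^\d+(\.\d+)?\s*(min|minutes|hr|hours)\b", re.I),
        "temperature": re.compile(r"^-?\d+\s*°?\s*(F|C)\b", re.I),
        "money": re.compile(r"^\$\d+(\.\d{2})?$"),
    }

    def infer_type(constant_text):
        return next((t for t, p in TYPE_PATTERNS.items()
                     if p.match(constant_text)), None)

    def candidate_values(screen_texts, constant_text):
        # e.g. "30 minutes" -> duration, so only durations get highlighted
        pattern = TYPE_PATTERNS[infer_type(constant_text)]
        return [t for t in screen_texts if pattern.match(t)]

    print(candidate_values(["35 min", "12.5 mi", "Fastest route"], "30 minutes"))
    # -> ['35 min']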
Concept Generalization and Reuse

Once concepts are learned, another major challenge is to generalize them so that they can be reused correctly in different contexts and task domains. This is a key design goal of PUMICE. It should
be able to learn concepts at a fine granularity, and reuse parts of existing concepts as much as possible to avoid asking users to make redundant demonstrations. In our previous work on generalization for PBD, we focused on generalizing procedures, specifically learning parameters (Li, Azaria, and Myers 2017) and intents for underlying operations (Li et al. 2018a). We have already deployed these existing generalization mechanisms in PUMICE, but in addition, we also explored the generalization of Boolean concepts and value concepts.

When generalizing Boolean concepts, PUMICE assumes that the Boolean operation stays the same, but the arguments may differ. For example, for the concept "hot" in Figure 1, it should still mean that a temperature (of something) is greater than another temperature. But the two in comparison can be different constants, or from different value queries. For example, suppose after the interactions in Figure 1, the user instructs a new rule "if the oven is hot, start the cook timer." PUMICE can recognize that "hot" is a concept that has been instructed before in a different context, so it asks "I already know how to tell whether it is hot when determining whether to order a cup of Iced Cappuccino. Is it the same here when determining whether to start the cook timer?" After responding "No", the user can instruct how to find out the temperature of the oven, and the new threshold value for "hot" either by instructing a new value concept, or using a constant value.

The generalization mechanism for value concepts works similarly. PUMICE allows value concepts sharing the same name to have different query implementations for different task contexts. For example, following the "if the oven is hot, start the cook timer" example, suppose the user defines "hot" for this new context as "The temperature is above 400 degrees." PUMICE realizes that there is already a value concept named "temperature", so it will ask "I already know how to find out the value for temperature using the Weather app. Should I use that for determining whether the oven is hot?", to which the user can say "No" and then demonstrate querying the temperature of the oven using the corresponding app (assuming the user has a smart oven with an in-app display of its temperature).

This mechanism allows concepts like "hot" to be reused at three different levels: (1) exactly the same (e.g., the temperature of the weather is greater than 85°F); (2) different threshold (e.g., the temperature of the weather is greater than x); and (3) different value query (e.g., the temperature of something else is greater than x).
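The three reuse levels can be pictured with a small sketch in which the comparison operation is fixed while the threshold and value query vary per context; the structure and names are hypothetical:

    import operator
    from dataclasses import dataclass, field

    OPS = {">": operator.gt, "<": operator.lt, "==": operator.eq}

    @dataclass
    class BooleanConcept:
        name: str                                     # e.g. "hot"
        op: str = ">"                                 # operation stays the same
        contexts: dict = field(default_factory=dict)  # context -> (query, threshold)

        def bind(self, context, value_query, threshold):
            self.contexts[context] = (value_query, threshold)

        def evaluate(self, context):
            value_query, threshold = self.contexts[context]
            return OPS[self.op](value_query(), threshold)

    hot = BooleanConcept("hot")
    # same operation, different value query and threshold per context:
    hot.bind("order coffee", value_query=lambda: 88.0, threshold=85.0)        # weather
    hot.bind("start cook timer", value_query=lambda: 350.0, threshold=400.0)  # oven
    print(hot.evaluate("order coffee"), hot.evaluate("start cook timer"))
    # -> True False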
Figure 2: The user teaches the value concept "commute time" by demonstrating querying the value in Google Maps. The red overlays highlight all durations it was able to identify on the Google Maps GUI.

Error Recovery and Backtracking

Like all other interactive EUD systems, it is crucial for PUMICE to properly handle errors, and to backtrack from errors in speech recognition, semantic parsing, generalizations, and inferences of intent (Li et al. 2018b). We iteratively tested early prototypes of PUMICE with users through early usability testing, and developed the following mechanisms to support error recovery and backtracking in PUMICE.

To mitigate semantic parsing errors, we implemented a mixed-initiative mechanism where PUMICE can ask users about components within the parsed expression if the parsing result is considered incorrect by the user. Because parsing result candidates are all typed expressions in PUMICE's internal functional domain-specific language (DSL) as a conditional, Boolean, value, or procedure, PUMICE can identify components in a parsing result that it is less confident about by comparing the top candidate with the alternatives and confidence scores, and ask the user about them.

For example, suppose the user defines a Boolean concept "good restaurant" with the utterance "the rating is better than 2". The parser is uncertain about the comparison operator in this Boolean expression, since "better" can mean either "greater than" or "less than" depending on the context. It will ask the user "I understand you are trying to compare the value concept 'rating' and the value '2', should 'rating' be greater than, or less than '2'?" The same technique can also be used to disambiguate other parts of the parsing results, such as the argument of resolve() functions (e.g., determining whether the unknown procedure should be "order a cup of Iced Cappuccino" or "order a cup" for Utterance 0 in Figure 1).
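The comparison-operator case might look like the following sketch, where the parser's top two candidates disagree only on the operator and their confidence scores are close; the threshold and data structures are illustrative, not PUMICE's actual implementation:

    def disambiguate(candidates, ask_user, margin=0.1):
        # candidates: (parse, confidence) pairs, sorted by confidence
        (top, top_score), (alt, alt_score) = candidates[0], candidates[1]
        if top_score - alt_score >= margin:
            return top                           # confident enough to keep
        if [k for k in top if top[k] != alt[k]] == ["op"]:
            choice = ask_user(
                "I understand you are trying to compare the value concept "
                f"'{top['left']}' and the value '{top['right']}', should "
                f"'{top['left']}' be greater than, or less than '{top['right']}'?")
            top["op"] = ">" if "greater" in choice else "<"
        return top

    # "the rating is better than 2" yields two near-tied candidate parses:
    candidates = [({"left": "rating", "op": ">", "right": "2"}, 0.51),
                  ({"left": "rating", "op": "<", "right": "2"}, 0.49)]
    print(disambiguate(candidates, ask_user=lambda q: "greater than"))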
PUMICE also provides an "undo" function to allow the user to backtrack to a previous conversational state in case of incorrect speech recognition, incorrect generalization, or when the user wants to modify a previous input. Users can either say that they want to go back to the previous state, or click on an "undo" option in PUMICE's menu (activated from the option icon on the top right corner on the screen shown in Figure 1).
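One way to picture the "undo" mechanism is as a stack of conversational-state snapshots, as in this sketch (the state contents are hypothetical):

    import copy

    class Conversation:
        def __init__(self):
            self.state = {"pending_concepts": [], "learned": {}}
            self.history = []

        def apply_turn(self, update):
            self.history.append(copy.deepcopy(self.state))  # snapshot first
            update(self.state)

        def undo(self):
            if self.history:
                self.state = self.history.pop()

    conv = Conversation()
    conv.apply_turn(lambda s: s["learned"].update(hot="temperature > 85"))
    conv.undo()        # e.g. the utterance was misrecognized
    print(conv.state)  # -> {'pending_concepts': [], 'learned': {}}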
System Implementation

We implemented the PUMICE agent as an Android app. The app was developed and tested on a Google Pixel 2 XL phone running Android 8.0. PUMICE does not require root access to the phone, and should run on any phone running Android 6.0 or higher. PUMICE is open-sourced on GitHub².

²https://github.com/tobyli/Sugilite_development
Semantic Parsing

We built the semantic parser for PUMICE using the SEMPRE framework (Berant et al. 2013). The parser runs on a remote Linux server, and communicates with the PUMICE client through an HTTP RESTful interface. It uses the Floating Parser architecture, which is a grammar-based approach that provides more flexibility without requiring hand-engineering of lexicalized rules like synchronous CFG or CCG based semantic parsers (Pasupat and Liang 2015). This approach also provides more interpretable results and requires less training data than neural network approaches (e.g., (Yin et al. 2018; Yin and Neubig 2017)). The parser parses user utterances into expressions in a simple functional DSL we created for PUMICE.

A key feature we added to PUMICE's parser is allowing typed resolve() functions in the parsing results to indicate unknown or unclear concepts and procedures. This feature adds interactivity to the traditional semantic parsing process. When a resolve() function is called at runtime, the front-end PUMICE agent asks the user to verbally explain, or to demonstrate how to fulfill, this resolve() function. If an unknown concept or procedure is resolved through verbal explanation, the parser can parse the new explanation into an expression of its original type in the target DSL (e.g., an explanation for a Boolean concept is parsed into a Boolean expression), and replace the original resolve() function with the new expression. The parser also adds relevant utterances for existing concepts and procedures, and visible text labels from demonstrations on third-party app GUIs, to its set of lexicons, so that it can understand user references to that existing knowledge and to in-app contents. PUMICE's parser was trained on rich features that associate lexical and syntactic patterns (e.g., unigrams, bigrams, skipgrams, part-of-speech tags, named entity tags) of user utterances with semantics and structures of the target DSL, over a small number of training examples (n = 905) that were mostly collected and enriched from the formative study.
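The lexicon-extension idea — registering learned concept names and text labels scraped from demonstrated screens so that later utterances can reference them — can be sketched as follows (the API is a hypothetical illustration; the real parser is built on SEMPRE):

    class Lexicon:
        def __init__(self):
            self.entries = {}                    # phrase -> (category, target)

        def add_concept(self, phrase, category, name):
            self.entries[phrase.lower()] = (category, name)

        def add_screen_labels(self, labels, app):
            for text in labels:                  # e.g. menu items seen in a demo
                self.entries[text.lower()] = ("in_app_content", (app, text))

        def lookup(self, phrase):
            return self.entries.get(phrase.lower())

    lexicon = Lexicon()
    lexicon.add_concept("hot", "boolean_concept", "hot")
    lexicon.add_screen_labels(["Iced Cappuccino", "Hot Latte"], app="Starbucks")
    print(lexicon.lookup("hot latte"))
    # -> ('in_app_content', ('Starbucks', 'Hot Latte'))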
                                                                 eters, and possible alternatives for each parameter. Each
Demonstration Recording and Replaying

PUMICE uses our open-sourced SUGILITE (Li, Azaria, and Myers 2017) framework to support its demonstration recording and replaying. SUGILITE provides action recording and replaying capabilities on third-party Android apps using the Android Accessibility API. SUGILITE also provides support for parameterization of sequences of actions (e.g., identifying "Iced Cappuccino" as a parameter and "Hot Latte" as an alternative value in the example in Figure 1), and support for handling minor GUI changes in apps. Through SUGILITE, PUMICE operates well on most native Android apps, but may have problems working with web apps and apps with special graphic engines (e.g., games). It currently does not support recording gestural and sensory inputs (e.g., rotating the phone) either.
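The parameterization behavior can be illustrated with a sketch: if the text the user tapped during the demonstration also appears in a set of alternatives scraped from the app (e.g., a menu), it becomes a parameter instead of a literal. The names here are hypothetical, not SUGILITE's actual data model:

    from dataclasses import dataclass

    @dataclass
    class Action:
        verb: str           # e.g. "click"
        target_text: str    # the GUI text operated on

    def parameterize(script, alternatives):
        params = {}
        for i, action in enumerate(script):
            if action.target_text in alternatives:
                params[f"param_{i}"] = {"demonstrated": action.target_text,
                                        "alternatives": sorted(alternatives)}
                script[i] = Action(action.verb, f"{{param_{i}}}")
        return script, params

    script = [Action("click", "Menu"), Action("click", "Iced Cappuccino")]
    script, params = parameterize(script, {"Iced Cappuccino", "Hot Latte"})
    print(script[1].target_text, params["param_1"]["alternatives"])
    # -> {param_1} ['Hot Latte', 'Iced Cappuccino']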
Knowledge Representations

PUMICE maintains two kinds of knowledge representations: a continuously refreshing UI snapshot graph representing third-party app contexts for demonstration, and a persistent knowledge base for storing learned procedures and concepts.

Figure 3: An example showing how PUMICE parses the user's demonstrated action and verbal reference to an app's GUI content into a SET_VALUE statement with a query over the UI snapshot graph when resolving a new value concept "current temperature".

The purpose of the UI snapshot graph is to support understanding the user's references to app GUI contents in their verbal instructions. The UI snapshot graph mechanism used in PUMICE was extended from APPINITE (Li et al. 2018a). For every state of an app's GUI, a UI snapshot graph is constructed to represent all visible and invisible GUI objects on the screen, including their types, positions, accessibility labels, text labels, various properties, and spatial relations among them. We used a lightweight semantic parser from the Stanford CoreNLP (Manning et al. 2014) to extract types of structured data (e.g., temperature, time, date, phone number) and named entities (e.g., city names, people's names). When handling the user's references to app GUI contents, PUMICE parses the original utterances into queries over the current UI snapshot graph (example in Figure 3). This approach allows PUMICE to generate flexible queries for value concepts and procedures that accurately reflect user intents, and which can be reliably executed in future contexts.
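A toy version of such a query: GUI objects become nodes with typed properties and spatial-relation edges, and a reference like "the current temperature" compiles into a query that filters on those attributes. The schema here is illustrative, not APPINITE's or PUMICE's actual graph format:

    nodes = [
        {"id": "n1", "text": "Pittsburgh", "data_type": "city"},
        {"id": "n2", "text": "68°F", "data_type": "temperature"},
        {"id": "n3", "text": "Humidity 47%", "data_type": "percentage"},
    ]
    relations = [("n2", "below", "n1")]   # spatial relations between objects

    def query(nodes, relations, data_type=None, below_text=None):
        def matches(n):
            if data_type and n["data_type"] != data_type:
                return False
            if below_text:
                anchors = {a for s, r, a in relations
                           if s == n["id"] and r == "below"}
                return any(m["id"] in anchors and m["text"] == below_text
                           for m in nodes)
            return True
        return [n["text"] for n in nodes if matches(n)]

    # "current temperature" -> the temperature value shown below the city name
    print(query(nodes, relations, data_type="temperature", below_text="Pittsburgh"))
    # -> ['68°F']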
The persistent knowledge base stores all procedures, concepts, and facts PUMICE has learned from the user. Procedures are stored as SUGILITE (Li, Azaria, and Myers 2017) scripts, with the corresponding trigger utterances, parameters, and possible alternatives for each parameter. Each Boolean concept is represented as a set of trigger utterances, Boolean expressions with references to the value concepts or constants involved, and contexts (i.e., the apps and actions used) for each Boolean expression. Similarly, the structure for each stored value concept includes its triggering utterances, demonstrated value queries for extracting target values from app GUIs, and contexts for each value query.
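Following that description, the stored records might be structured roughly as follows; the field names are hypothetical:

    from dataclasses import dataclass, field

    @dataclass
    class StoredValueConcept:
        name: str
        utterances: list = field(default_factory=list)     # trigger utterances
        value_queries: dict = field(default_factory=dict)  # context -> query

    @dataclass
    class StoredBooleanConcept:
        name: str
        utterances: list = field(default_factory=list)     # trigger utterances
        expressions: dict = field(default_factory=dict)    # context -> expression

    knowledge_base = {
        "temperature": StoredValueConcept(
            "temperature", ["temperature"],
            {"weather": "query_Temperature()"}),
        "hot": StoredBooleanConcept(
            "hot", ["it is hot", "it's hot"],
            {"order coffee": "query_Temperature() > 85"}),
    }
    print(knowledge_base["hot"].expressions["order coffee"])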
User Study

We conducted a lab study to evaluate the usability of PUMICE. In each session, a user completed 4 tasks. For each task, the user instructed PUMICE to create a new task automation, with the required conditionals and new concepts. We used a task-based method to specifically test the usability of PUMICE's design, since the motivation for the design derives from the formative study results. We did not use a control condition, as we could not find other tools that can feasibly support users with little programming expertise to complete the target tasks.
User Study

We conducted a lab study to evaluate the usability of PUMICE. In each session, a user completed 4 tasks. For each task, the user instructed PUMICE to create a new task automation, with the required conditionals and new concepts. We used a task-based method to specifically test the usability of PUMICE's design, since the motivation for the design derives from the formative study results. We did not use a control condition, as we could not find other tools that could feasibly support users with little programming expertise in completing the target tasks.

Participants

We recruited 10 participants (5 women, 5 men, ages 19 to 35) for our study. Each study session lasted 40 to 60 minutes. We compensated each participant $15 for their time. 6 participants were students at two local universities, and the other 4 held various technical, administrative or managerial jobs. All participants were experienced smartphone users who had been using smartphones for at least 3 years. 8 out of 10 participants had some prior experience of interacting with conversational agents like Siri, Alexa and Google Assistant.

We asked the participants to report their programming experience on a five-point scale from “never programmed” to “experienced programmer”. Among our participants, there was 1 who had never programmed, 5 who had only used end-user programming tools (e.g., Excel functions, Office macros), 1 novice programmer with experience equivalent to 1-2 college-level programming classes, 1 programmer with 1-2 years of experience, and 2 programmers with more than 3 years of experience. In our analysis, we will label the first two groups “non-programmers” and the last three groups “programmers”.
Procedure

At the beginning of each session, the participant received a 5-minute tutorial on how to use PUMICE. In the tutorial, the experimenter demonstrated an example of teaching PUMICE to check the bus schedule when “it is late”, where “late” was defined as “current time is after 8pm” through a conversation with PUMICE. The experimenter then showed how to demonstrate to PUMICE finding out the current time using the Clock app.

Following the tutorial, the participant was provided a Google Pixel 2 phone with PUMICE and relevant third-party apps installed. The experimenter showed the participant the available apps, and made sure that the participant understood the functionality of each third-party app. We did this because the underlying assumption of the study (and the design of PUMICE) is that users are familiar with the third-party apps, so we are testing whether they can successfully use PUMICE, not the apps. Then, the participant received 4 tasks in random order. We asked participants to keep trying until they were able to correctly execute the automation and were happy with the resulting actions of the agent. We also checked the scripts at the end of each study session to evaluate their correctness.

After completing the tasks, the participant filled out a post-survey to report the perceived usefulness, ease of use and naturalness of interactions with PUMICE. We ended each session with a short informal interview with the participant about their experiences with PUMICE.

Tasks

We assigned 4 tasks to each participant. These tasks were designed by combining common themes observed in users' proposed scenarios from the formative study. We ensured that these tasks (1) covered key PUMICE features (i.e., concept learning, value query demonstration, procedure demonstration, concept generalization, procedure generalization and “else” condition handling); (2) involved only app interfaces that most users are familiar with; and (3) used conditions that we could control so that we could test the correctness of the scripts (we controlled the temperature, the traffic condition, and the room price by manipulating the GPS location of the phone).

In order to minimize biasing users' utterances, we used the Natural Programming Elicitation method (Myers et al. 2016). Task descriptions were provided in the form of graphics, with minimal text descriptions that could not be directly used in user instructions (see Figure 4 for an example).

Figure 4: The graphical prompt used for Task 1 – A possible user command can be “Order Iced coffee when it's hot outside, otherwise order hot coffee when the weather is cold.”

Task 1  In this task, the user instructs PUMICE to order iced coffee when the weather is hot, and order hot coffee otherwise (Figure 4). We pre-taught PUMICE the concept of “hot” in the task domain of turning on the air conditioner, so the user needs to use the concept generalization feature to generalize the existing concept “hot” to the new domain of coffee ordering. The user also needs to demonstrate ordering iced coffee using the Starbucks app, and to provide “order hot coffee” as the alternative for the “else” operation. The user does not need to demonstrate again for ordering hot coffee, as it can be automatically generalized from the previous demonstration of ordering iced coffee.
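To summarize what a successful Task 1 instruction yields, here is a hedged sketch of the learned automation written as plain Java. PUMICE actually stores this as a SUGILITE script with trigger utterances and demonstrated value queries; the method names, the temperature threshold, and the helper stubs are our illustrative assumptions:

    // Illustrative sketch of the Task 1 automation; not PUMICE's actual
    // script format. Helper methods stand in for demonstrated GUI actions.
    public class Task1Sketch {
        public void orderCoffee() {
            // "hot" reuses a Boolean concept originally taught in the
            // air-conditioner domain, generalized here to coffee ordering.
            double outsideTemperature = queryWeatherApp(); // demonstrated value query
            boolean isHot = outsideTemperature > 85;       // threshold is illustrative

            if (isHot) {
                // Demonstrated once in the Starbucks app; the drink name is
                // a parameter, so alternative values can be substituted.
                orderWithStarbucksApp("Iced Coffee");
            } else {
                // Generalized automatically from the iced-coffee demonstration.
                orderWithStarbucksApp("Hot Coffee");
            }
        }

        // Stubs standing in for replaying recorded GUI action sequences.
        private double queryWeatherApp() { return 90.0; }
        private void orderWithStarbucksApp(String drink) { /* replay actions */ }
    }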
Task 2  In this task, the user instructs PUMICE to set an alarm for 7:00am if the traffic is heavy on their commuting route. We pre-stored “home” and “work” locations in Google Maps. The user needs to define “heavy traffic” as prompted by PUMICE by demonstrating how to find out the estimated commute time, and explaining that “heavy traffic” means that the commute takes more than 30 minutes. The user then needs to demonstrate setting a 7:00am alarm using the built-in Clock app.
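Tying this task back to the knowledge representation described earlier, the “heavy traffic” concept might instantiate the illustrative structures from that sketch roughly as follows (again, the names and values are our assumptions, not PUMICE's stored form):

    import java.util.List;

    // Illustrative fragment building on the earlier sketch classes; shows
    // how Task 2's "heavy traffic" concept might be stored.
    class Task2Sketch {
        static BooleanConcept heavyTraffic() {
            ValueQuery mapsQuery = new ValueQuery();
            mapsQuery.appContext = "Google Maps";
            mapsQuery.demonstratedQuery = "estimated travel time of the saved commute route";

            ValueConcept commuteTime = new ValueConcept();
            commuteTime.triggerUtterances = List.of("commute time");
            commuteTime.queries = List.of(mapsQuery);

            BooleanExpression overThirty = new BooleanExpression();
            overThirty.operator = ">";
            overThirty.left = commuteTime;
            overThirty.rightConstant = 30; // minutes, per the user's explanation
            overThirty.context = "Google Maps";

            BooleanConcept concept = new BooleanConcept();
            concept.triggerUtterances = List.of("heavy traffic");
            concept.expressions = List.of(overThirty);
            return concept;
        }
    }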
Task 3  In this task, the user instructs PUMICE to choose between making a hotel reservation and requesting an Uber to go home, depending on whether the hotel price is cheap. The user should verbally define “cheap” as “room price is below $100”, and demonstrate how to find out the hotel price using the Marriott (a hotel chain) app. The user also needs to demonstrate making the hotel reservation using the Marriott app, specify “request an Uber” as the action for the “else” condition, and demonstrate how to request an Uber using the Uber app.

Task 4  In this task, the user instructs PUMICE to order a pepperoni pizza if there is enough money left in the food budget. The user needs to define the concept of “enough budget”, demonstrate finding out the balance of the budget using the Spending Tracker app, and demonstrate ordering a pepperoni pizza using the Papa Johns (a pizza chain) app.

Figure 5: The average task completion times for each task. The error bars show one standard deviation in each direction.
Results

All participants were able to complete all 4 tasks. The total time for the tasks ranged from 19.4 minutes to 25 minutes across the 10 participants. Figure 5 shows the overall average task completion time of each task, as well as the comparison between the non-programmers and the programmers. The average total time-on-task for programmers (22.12 minutes, SD=2.40) was slightly shorter than that for non-programmers (23.06 minutes, SD=1.57), but the difference was not statistically significant.

Most of the issues encountered by participants came from the Google Cloud speech recognition system used in PUMICE, which would sometimes misrecognize the user's voice input or cut off the user early. These errors were handled by the participants using the “undo” feature in PUMICE. Some participants also encountered parsing errors. PUMICE's current semantic parser has limited capability in resolving pronoun references (e.g., for the utterance “it takes longer than 30 minutes to get to work”, the parser would treat the referent as the literal word “it” rather than recognizing that “the time it takes to get to work” is greater than 30 minutes). Those errors were also handled by participants through undoing and rephrasing. One participant ran into the “confusion of Boolean operator” problem in Task 2 when she used the phrase “commute [time is] worse than 30 minutes”, which the parser initially recognized incorrectly as “commute is less than 30 minutes.” She was able to correct this using the mixed-initiative mechanism, as described in the Error Recovery and Backtracking section.

Overall, no participant had major problems with the multi-modal interaction approach or the top-down recursive concept resolution conversational structure, which was encouraging. However, all participants had received a tutorial with an example task demonstrated, and we had also emphasized in the tutorial that they should try to use concepts that can be found in mobile apps in their explanations of new concepts. These factors might have contributed to our participants' success.

In a post-survey, we asked participants to rate statements about the usability, naturalness and usefulness of PUMICE on a 7-point Likert scale from “strongly disagree” to “strongly agree”. PUMICE scored on average 6.2 on “I feel PUMICE is easy to use”, 6.1 on “I find my interactions with PUMICE natural”, and 6.9 on “I think PUMICE is a useful tool for automating tasks on smartphones,” indicating that our participants were generally satisfied with their experience using PUMICE.

Discussion

In the informal interview after completing the tasks, participants praised PUMICE for its naturalness and low learning barriers. Non-programmers were particularly impressed by the multi-modal interface. For example, P7 (who was a non-programmer) said: “Teaching PUMICE feels similar to explaining tasks to another person...[PUMICE's] support for demonstration is very easy to use since I'm already familiar with how to use those apps.” Participants also considered PUMICE's top-down interactive concept resolution approach very useful, as it does not require them to define everything clearly upfront.

Participants were excited about the usefulness of PUMICE. P6 said, “I have an Alexa assistant at home, but I only use them for tasks like playing music and setting alarms. I tried some more complicated commands before, but they all failed. If it had the capability of PUMICE, I would definitely use it to teach Alexa more tasks.” They also proposed many usage scenarios based on their own needs in the interview, such as paying off the credit card balance early when it has reached a certain amount, automatically closing background apps when the available phone memory is low, monitoring promotions for gifts saved in the wish list when anniversaries approach, and setting reminders for events in mobile games.

Several concerns were also raised by our participants. P4 commented that PUMICE should “just know” how to find out weather conditions without requiring her to teach it, since “all other bots know how to do it”, indicating the need for a hybrid approach that combines EUD with pre-programmed common functionalities. P5 said that teaching the agent could be too time-consuming unless the task was very repetitive, since he could just “do it with 5 taps.” Several users also expressed privacy concerns after learning that PUMICE can see all screen contents during demonstrations, while one user, on the other hand, suggested having PUMICE observe him at all times so that it can learn things in the background.

Limitations and Future Work

The current version of PUMICE has no semantic understanding of the information involved in tasks, which prevents it from dealing with implicit parameters (e.g., “when it snows” means “the current weather condition is snowing”) and un-