Deep Interpretable Models of Theory of Mind For Human-Agent Teaming

Ini Oguntola, Dana Hughes, and Katia Sycara

Abstract— When developing AI systems that interact with humans, it is essential to design both a system that can understand humans, and a system that humans can understand. Most deep network based agent-modeling approaches are 1) not interpretable and 2) only model external behavior, ignoring internal mental states, which potentially limits their capability for assistance, interventions, discovering false beliefs, etc. To this end, we develop an interpretable modular neural framework for modeling the intentions of other observed entities. We demonstrate the efficacy of our approach with experiments on data from human participants on a search and rescue task in Minecraft, and show that incorporating interpretability can significantly increase predictive performance under the right conditions.

I. INTRODUCTION

Human intelligence is remarkable not just for the way it allows us to navigate environments individually, but also for how it operates socially, engaging with other intelligent actors. Humans naturally build high-level models of those around them, and are able to make inferences about their beliefs, desires and intentions (BDI) [1]. These inferences allow people to anticipate the behaviors of others, use these predictions to condition their own behavior, and then anticipate potential responses.

In both psychology and machine learning this is referred to as theory of mind (ToM) [2], [3], [4], which aims to model not only the external behavior of other entities but their internal mental states as well. The developmental psychology literature has found that children as young as 4 years old have already developed a ToM, a crucial ability in human social interaction [5]. ToM can enable discovery of false or incomplete beliefs and knowledge, and can thus facilitate interventions to correct false beliefs. Therefore, work in enabling agents to develop ToM is a crucial step not only in developing more effective multi-agent AI systems but also in developing AI systems that interact with humans, both cooperatively and competitively [4].

With a few exceptions [6], most agent-modeling approaches in the reinforcement learning and imitation learning literature largely ignore these internal mental states, usually focusing only on reproducing the external behavior [7], [8]. This limits their ability to reason in a deeper way about the entities that they interact with. While prior work has explored providing agents with models of some aspect of a human's mental state, such as reward [6] or rationality [9], more complex models incorporating multiple properties of mental state (e.g., beliefs over the environment, desires, personality characteristics, etc.) in non-toy environments remain largely unexplored.

While modeling the ToM of a human is a very challenging task for an artificial AI agent, understanding the reasoning of such an agent is even more challenging. This paper focuses on developing human-interpretable ToM models. If a ToM model can both infer human mental states and produce human-interpretable explanations for its inferences, this 1) develops trust with the humans it interacts with, and 2) better enables the agent to choose and justify interventions. In addition, interpretable ToM models would allow system designers to better understand the reasoning and inferences of such an agent observing and giving advice to humans executing a task.

We perform experiments with human trajectories obtained from simulated search and rescue tasks in a Minecraft environment, and find that enforcing interpretability can also increase predictive accuracy under the right conditions.

The primary contributions of this paper are the following:
   • We design a modular framework to enable AI agents to have a theory of mind model of a human.1
   • We present a method of combining neural and non-differentiable components within our framework.
   • We extend this framework for interpretable intent prediction with a novel application of concept whitening [10].
   • We present experimental results with human participants and provide both qualitative and quantitative interpretability analyses.

* All authors are with the School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA (ioguntol, danahugh, katia)@cs.cmu.edu
* This work is supported by the Defense Advanced Research Projects Agency (DARPA) under Contract No. HR001120C0036, and by the AFRL/AFOSR award FA9550-18-1-0251.
1 Our method is general and supports both humans and artificial agents as observed entities. However, since our experimental evaluation is on human data, we only refer to a human as the entity being observed.

II. RELATED WORK

A. Theory of Mind

Theory of mind approaches to agent-modeling have been shown to make inferences that align with those of human observers [2]. Early approaches infer human goals and beliefs in the context of planning with pre-programmed, non-scalable approaches [11], [12]. Bayesian approaches to ToM [2] are also difficult to scale to larger environments. Rabinowitz et al. used neural networks to learn latent representations of "mental states", but only for artificial agents in small toy environments [3]. Other work concurrent to this paper
has explored theory of mind modeling of human state and
behavior with graph-theoretic approaches for scenarios such
as autonomous driving [13].
   This paper presents a neural ToM approach that supports
reasoning about humans, and provides experimental results
within a complex, realistic environment. In addition, we
explicitly incorporate interpretability into our approach while
still maintaining the performance and scalability benefits of
neural approaches.

B. Imitation Learning

Theory of mind is closely related to imitation learning and inverse reinforcement learning in that both attempt to model other agents, and both can be applied to model either human or artificial agents [14]. Although most imitation learning methods do not consider interpretability, there are approaches such as InfoGAIL that take a heuristic approach towards interpretability by using information-theoretic losses to enforce structure on the latent space and discover modes in observed behavior [15]. Others use human-readable programmatic policies for agent-modeling rather than black-box deep networks [16].

The approach presented in this paper falls under the umbrella of behavioral cloning [17], [18], with additional constraints for modeling internal states via theory of mind and for interpretable learning.

C. Deep Interpretability

Saliency-based methods [19], [20], [21], [22], [23] are the most popular means of providing post-hoc explanations, and aim to highlight the most important input features by assigning importance weights. However, many common saliency methods have been found to be independent of both the model and the data, and often produce similar explanations for all inputs, which is undesirable [24], [25]. While saliency maps have also been applied in reinforcement learning contexts [26], [27], operating on low-level input features (e.g. raw pixels) may not coincide with human ideas of explanation.

Concept-vector post-hoc methods focus on explanations based on higher-level concepts rather than low-level input features [28], [29], [30], but these also have their own drawbacks. In general, most post-hoc explanations are based on assumptions about the latent space that may not hold; for instance, the implicit assumptions that 1) concepts are (linearly) separable in the latent space, and that 2) each axis in the latent space represents only a single concept.

Alternatively, others have devised approaches such as concept whitening that focus on learning models that are interpretable by design [25], [31], [10], i.e. approaches that shape the latent space during training. These methods have been used in classification tasks, e.g. in image classification for aligning the latent space with predefined human-interpretable concepts. In contrast to these approaches, this paper develops a variation of concept whitening [10] for modeling decision-making processes (e.g. theory of mind, imitation learning, reinforcement learning).

Fig. 1: Modular theory of mind (ToM) framework.

III. NEURAL TOM FRAMEWORK

A. Purpose

It is important to note that the ToM framework is used by an observer to infer the mental states of an observed entity, and is not specifically meant to define external behavior. The action predictions produced by the ToM model can be treated as a policy to forecast an observed entity's future behavior, and imitation learning can provide a training signal for the ToM model, but our purpose is not to train an agent to perform a task.

B. Overview

We developed a modular theory of mind framework for an observing agent to use to infer the mental state of an observed human. The modules in our network impose an inductive bias that reflects the BDI model of agency in folk psychology [1]. Modularizing the framework in this way allows for combining heuristic and data-driven components (e.g., neural networks). Specifically, we model the decision-making process of an observed entity as follows (a minimal code sketch of this pipeline follows the list):
   1) Observations from the environment update a belief state (belief model)
   2) Given the current trajectory, calculate an intent (desire model)
   3) Finally, we predict/generate an action given the belief state and intent (action model)
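To make this structure concrete, the sketch below shows how the three modules might compose at inference time. It is a minimal Python illustration; the class and method names (BeliefModel, DesireModel, ActionModel, ToMModel, step) are placeholders rather than the authors' implementation.

from dataclasses import dataclass

class BeliefModel:
    def update(self, belief, observation):
        """Return the new belief state given the previous belief and an observation."""
        raise NotImplementedError

class DesireModel:
    def intent(self, belief):
        """Return an intent (or distribution over intents) given the current belief state."""
        raise NotImplementedError

class ActionModel:
    def act(self, belief, intent):
        """Return the predicted action given a belief state and an intent."""
        raise NotImplementedError

@dataclass
class ToMModel:
    belief_model: BeliefModel
    desire_model: DesireModel
    action_model: ActionModel

    def step(self, belief, observation):
        # 1) Observations update the belief state (belief model)
        belief = self.belief_model.update(belief, observation)
        # 2) Infer an intent from the belief state (desire model)
        intent = self.desire_model.intent(belief)
        # 3) Predict an action given belief and intent (action model)
        action = self.action_model.act(belief, intent)
        return belief, intent, action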
C. Combining Differentiable and Heuristic Components

If all three models are differentiable (e.g. neural networks), we can train this pipeline end to end with imitation learning methods such as behavioral cloning [18] or GAIL [32], [15]. However, often the action alone is not a strong enough signal to train a model that generalizes well, especially when the overall goal is to model variation in mental states from different entities (e.g. theory of mind).
We can mitigate this difficulty by imposing additional structure on the pipeline. The simplest ways to do this are to replace one or more of the models with rule-based models, and/or to impose structural constraints on the input/output space of these models. For instance, given a planning task we can structure the belief state as a grid/graph of locations, and use a rule-based belief model to update this belief state given an observation. Additionally, we could replace the action model with A* search [33], and structure the intents from the desire model as locations of subgoals.

In the setup described above, the belief and action models are rule-based, and the desire model is the sole trainable component. The rule-based belief model does not pose any issue (one can think of it as simply preprocessing the input observation). However, we cannot optimize for the final output in any gradient-based way, as the output of the desire model is the input to the non-differentiable action model, which produces the final output. And unless we have ground-truth intents, we cannot train the desire model in a directly supervised manner.

Given belief state b ∈ B, observed action a ∈ A, a set of intents I, and non-differentiable action model f : B × I → A, we want to learn a desire model g to model a conditional distribution p(z | b) such that

   E[a | b] = Ez∼g(b)[f(b, z) | b]

where z ∈ I is the intent. However, we may not have access to any samples from such a distribution (i.e. no ground truth z ∈ I for given b, a pairs).

Alternatively, we can learn to model the distribution p(z | b, a) with an "inverse action model" h. The density of this distribution is proportional to

   p(z | b, a) ∝ p(a, z | b) = p(a | b, z) · p(z | b)   (1)

Because we have direct access to f, we can sample from p(a | b, z), and thus given some kind of prior p(z | b) we can sample from p(z | b, a) to learn h.

Once we have learned an inverse action model h, then for each belief-action pair (b, a) we can simply use h to sample intents from p(z | b, a), and use these sampled intents to train the desire model g in a supervised manner.

Fig. 2: Encoder-decoder architecture used for the desire and inverse-action models in our experiments, inspired by U-Nets for image segmentation [34]. Blue indicates convolutional layers, green indicates pooling, brown indicates up-sampling, and yellow is a final linear layer. A more detailed description can be found in the appendix.

D. Training

Training is done in two stages: the first stage trains the inverse action model, and the second stage trains the desire model. In each stage, once we gather the necessary data, we train using stochastic gradient descent (SGD).

1) Training Inverse Action Model: To train the inverse action model, we first collect belief states by sequentially running observations from human trajectories through the rule-based belief model and storing the resulting belief states at each timestep. These trajectories can be from human participants or potentially even from artificial agents trained to perform the task.

Then for each belief state b, we sample an intent z given some prior p(z | b), and create a set of b, z pairs. Next, for each belief-intent pair, we generate an action a = f(b, z), creating a dataset of belief-action-intent triples. Finally, we train the inverse action model on this dataset to predict intents given beliefs and actions.

Pseudocode for this training process is provided in Algorithm 1.

Algorithm 1: Training inverse action model
   Input: set of human trajectories T = {τ1, τ2, . . .}, belief model s, action model f
   Dbaz ← ∅
   for τ ∈ T do
       o1, . . . , on ← τ
       b0 ← Uniform
       for t = 1, . . . , n do
           bt ← s(bt−1, ot)
           for i = 1, . . . , m do
               zt ∼ p(z | bt)
               a = f(bt, zt)
               Dbaz = Dbaz ∪ {(bt, a, zt)}
           end
       end
   end
   Initialize neural network parameters θh
   Use SGD to train h(b, a | θh) on Dbaz
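Algorithm 1 can be rendered roughly as follows in Python. This is a sketch under the assumption that the belief model, intent prior, and action model are available as plain callables, that beliefs are tensors, and that intents and actions are encoded as integer indices (m sampled intents per belief state); the names are illustrative, not the authors' code.

import torch

def generate_inverse_action_data(trajectories, belief_model, intent_prior, action_model,
                                 uniform_belief, m=4):
    """Roll observations through the rule-based belief model, sample intents from the
    prior p(z | b), and label them with the non-differentiable action model f (Algorithm 1)."""
    dataset = []  # (belief, action, intent) triples
    for observations in trajectories:
        b = uniform_belief
        for o in observations:
            b = belief_model(b, o)          # b_t <- s(b_{t-1}, o_t)
            for _ in range(m):              # m sampled intents per belief state
                z = intent_prior(b)         # z_t ~ p(z | b_t), a flattened (x, y) cell index
                a = action_model(b, z)      # a = f(b_t, z_t), e.g. first step of an A* plan
                dataset.append((b, a, z))
    return dataset

def train_inverse_action_model(h, dataset, epochs=10, lr=1e-3):
    """Fit h(z | b, a) with SGD; h is assumed to output per-cell intent log-probabilities
    of shape (batch, X*Y), so the negative log-likelihood is a natural loss."""
    optimizer = torch.optim.SGD(h.parameters(), lr=lr)
    for _ in range(epochs):
        for b, a, z in dataset:
            log_probs = h(b.unsqueeze(0), torch.as_tensor([a]))
            loss = torch.nn.functional.nll_loss(log_probs, torch.as_tensor([z]))
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return h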
2) Training Desire Model: To train the desire model, we first collect belief states by running the observations from human trajectories through the belief model. We also store the corresponding observed actions for each belief state.

We can then generate belief-intent pairs for the desire model by sampling intents from our inverse action model z ∼ h(b, a) for each belief-action pair. The target intents are formed by combining a) the probability distribution from the inverse action model over the previous belief state, and if available b) the next realized intent (from the future, in a post-hoc manner).

Finally, we train the desire model on this data to predict intent given belief.

Pseudocode for this training process is provided in Algorithm 2.
Algorithm 2: Training desire model
   Input: set of human trajectories T = {τ1, τ2, . . .}, belief model s, inverse action model h
   Dbz ← ∅
   for τ ∈ T do
       (o1, a1), . . . , (on, an) ← τ
       b0 ← Uniform
       for t = 1, . . . , n do
           bt ← s(bt−1, ot)
           zt ∼ h(bt, at)
           Dbz = Dbz ∪ {(bt, zt)}
       end
   end
   Initialize neural network parameters θg
   Use SGD to train g(b | θg) on Dbz
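In the same spirit, Algorithm 2 might look as follows in Python. Again this is a sketch: it assumes the trained inverse action model outputs per-cell intent log-probabilities, and it samples a single target intent per step rather than mixing in the next realized intent as described above.

import torch

def train_desire_model(g, trajectories, belief_model, inverse_action_model,
                       uniform_belief, epochs=10, lr=1e-3):
    """Algorithm 2 sketch: label each belief state with an intent sampled from the
    inverse action model h(z | b, a), then fit the desire model g(z | b) by SGD."""
    # Stage 1: build the (belief, intent) dataset from observed trajectories.
    dataset = []
    for observations, actions in trajectories:
        b = uniform_belief
        for o, a in zip(observations, actions):
            b = belief_model(b, o)                               # b_t <- s(b_{t-1}, o_t)
            with torch.no_grad():
                log_probs = inverse_action_model(b.unsqueeze(0), torch.as_tensor([a]))
                z = torch.distributions.Categorical(logits=log_probs).sample()   # z_t ~ h(b_t, a_t)
            dataset.append((b, z))

    # Stage 2: supervised training of the desire model.
    optimizer = torch.optim.SGD(g.parameters(), lr=lr)
    for _ in range(epochs):
        for b, z in dataset:
            loss = torch.nn.functional.nll_loss(g(b.unsqueeze(0)), z.view(1))
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return g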
IV. CONCEPT WHITENING

Concept whitening (CW) is a mechanism introduced by Chen et al. [10] for modifying neural network layers to increase interpretability. Broadly, it aims to enforce structure on the latent space by aligning the axes with predefined human-interpretable concepts. While this technique was developed for the purpose of image classification, here we adapt the idea in the context of intent prediction with the desire model. By explicitly defining a set of concepts that can serve as "explanations" for intent predictions, we can use concept whitening to allow for interpretability via identification of the most important concepts for any prediction.

We also note that although we consider concept whitening in a context that can broadly be categorized as behavioral cloning, our approach to interpretable agent-modeling is framework agnostic and could potentially be applied to other reinforcement learning and imitation learning contexts.

A. Technical Details

Given latent representation Z ∈ R^{n×d}, let ZC ∈ R^{n×d} be the mean-centered latent representation. We can calculate the ZCA-whitening matrix W ∈ R^{d×d} as in [35], and thus decorrelate and standardize the data via the whitening operation ψ:

   ψ(Z) = WZC = W(Z − µ1⊤)   (2)

where µ = (1/n) Σ_{i=1}^{n} zi is the latent sample mean.

Now say we are given concepts c1, . . . , ck that can be characterized by corresponding auxiliary datasets Xc1, . . . , Xck, and assume we have an orthogonal matrix Q ∈ R^{d×d} such that the data from Xcj has high activation on the j-th axis (i.e. column qj). Then the concept-whitened representation is given by:

   Ẑ = Q⊤WZC   (3)

Training alternates between optimizing for the main objective (i.e. with the network's final output) and optimizing the orthogonal matrix Q for concept alignment. To optimize Q, we maximize the following objective:

   max_{q1,...,qk}  Σ_{j=1}^{k} (1/nj) Σ_{xcj ∈ Xcj} qj⊤ ẑxcj   (4)

where ẑxcj denotes the concept-whitened latent representation in the model on a data sample from concept cj. Orthogonality can be maintained when optimization is performed via gradient descent and curvilinear search on the Stiefel manifold [36].

A more detailed description of concept whitening and the optimization algorithm can be found in [10].
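As a concrete illustration of equations (2)-(4), the NumPy sketch below whitens a batch of latent vectors and scores concept alignment along each axis of Q. It is a simplification for exposition: it recomputes the whitening matrix from the given batch rather than maintaining the running IterNorm estimate of [35], and it only evaluates the alignment objective rather than performing the Stiefel-manifold update of [36].

import numpy as np

def zca_whiten(Z, eps=1e-5):
    """Mean-center Z (n x d) and return (Z_C, W), where W is the ZCA whitening matrix,
    so that psi(Z) = W Z_C is decorrelated and standardized (eq. 2)."""
    mu = Z.mean(axis=0)
    Z_c = Z - mu
    cov = Z_c.T @ Z_c / Z.shape[0]
    eigvals, U = np.linalg.eigh(cov)
    W = U @ np.diag(1.0 / np.sqrt(eigvals + eps)) @ U.T    # ZCA: W = U diag(lambda)^(-1/2) U^T
    return Z_c, W

def concept_whitened(Z, Q):
    """Concept-whitened representation Z_hat = Q^T W Z_C (eq. 3); Q is orthogonal (d x d)."""
    Z_c, W = zca_whiten(Z)
    return (Q.T @ W @ Z_c.T).T                             # one row z_hat per sample

def concept_alignment_objective(concept_batches, Q):
    """Objective of eq. (4): for each concept c_j, the mean activation of its auxiliary
    samples X_{c_j} along the j-th axis of the concept-whitened space."""
    total = 0.0
    for j, X_cj in enumerate(concept_batches):
        Z_hat = concept_whitened(X_cj, Q)
        total += Z_hat[:, j].mean()                        # (1/n_j) * sum of q_j^T z_hat_x
    return total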
B. Concept Whitening for Intent Prediction

We can adapt this idea to the context of explanatory concepts for intent prediction. Specifically, we consider the desire model (Fig. 2) and insert a concept whitening layer (for more detail see the appendix and Fig. 3).

First we define a set of concepts C = {c1, . . . , ck}; these concepts should correspond to appropriate human-interpretable reasons or "explanations" for intent prediction given the problem domain. We also must be able to identify a subset of timesteps from our trajectories where each concept applies, either directly from the trajectory data, or from external labels.

Recall that the desire model's inputs are belief states, which we can generate sequentially by passing the observations from each trajectory timestep through the belief model. Then for each concept cj we consider only the belief states from the timesteps where cj is known to apply, and aggregate them into an auxiliary dataset Bcj.

Then training alternates between:
   1) Optimizing for intent prediction, given a belief state and a ground truth intent
   2) Concept-aligning the CW orthogonal matrix Q by maximizing the activation along axis j for each auxiliary dataset Bcj

Pseudocode for this process is provided in Algorithm 3.

V. EXPERIMENTS

A. Task

We consider a simulated search and rescue task in a Minecraft environment. The scenario simulates a damaged building after a disaster, with areas of the building layout perturbed with collapsed rubble, wall openings, and fires. There are 34 injured victims within the building who will die if left untreated. For convenience and simplicity, victims are represented as blocks. Out of these victims, 10 are critically injured and will expire after 5 minutes. These critical victims take 15 seconds to triage and are worth 30 points. Other victims are considered "non-critical", but will expire after 10 minutes. Non-critical victims take 7.5 seconds to triage and are worth 10 points. The goal of the task is to earn as many points as possible within a 10-minute mission.
Fig. 3: Desire model with concept whitening.

Algorithm 3: Training desire model with concept whitening
   D ← ∅
   for τ ∈ T do
       (o1, a1), . . . , (on, an) ← τ
       b0 ← Uniform
       for t = 1, . . . , n do
           bt ← BM(bt−1, ot)
           zt ∼ IAM(bt, at)
           D = D ∪ {(bt, zt)}
       end
   end
   for e = 1, . . . , num_epochs do
       Train DM on D with gradient descent
       if e mod 5 = 0 then
           for j = 1, . . . , k do
               Maximize activation of Bcj on the j-th column of Q (see [10])
           end
       end
   end
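The alternating schedule in Algorithm 3 could be implemented along the following lines. This is a sketch in which desire_model.cw_layer and its align_concept method are hypothetical hooks standing in for the concept whitening layer and its Stiefel-manifold rotation update from [10], [36].

import torch

def train_desire_model_with_cw(desire_model, dataset, concept_loaders,
                               epochs=50, align_every=5, lr=1e-3):
    """Alternate between (1) SGD on the intent-prediction objective over (belief, intent)
    pairs and (2) aligning the CW rotation Q so that concept j's auxiliary belief states
    B_{c_j} activate the j-th latent axis."""
    optimizer = torch.optim.SGD(desire_model.parameters(), lr=lr)
    for epoch in range(1, epochs + 1):
        # (1) main objective: predict the target intent from the belief state
        for b, z in dataset:
            loss = torch.nn.functional.nll_loss(desire_model(b.unsqueeze(0)), z.view(1))
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        # (2) every few epochs, rotate Q toward the concept-alignment objective (eq. 4)
        if epoch % align_every == 0:
            with torch.no_grad():
                for j, concept_batches in enumerate(concept_loaders):
                    for b_batch in concept_batches:
                        desire_model.cw_layer.align_concept(b_batch, axis=j)   # hypothetical hook
    return desire_model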
B. Human Data For Training and Evaluation

All experiments are performed using a set of 75 trajectories previously collected from human participants [37]. Prior to each mission, participants were given information on the task and the original building layout. However, the knowledge conditions of certain participants were manipulated by partially withholding information. Some participants were not informed of the cost-benefit tradeoffs (i.e. the knowledge that critical victims take 15 seconds to rescue and are worth 30 points and non-critical victims take 7.5 seconds to rescue and are worth 10 points). Knowledge was also manipulated via a beep signal that activated whenever the participant was near a room with a victim (1 beep for a non-critical victim, 2 beeps for a critical victim); certain participants were not told the meaning of the signal.

Participants were assessed under 3 knowledge conditions:
   1) No knowledge of critical-victim tradeoff, no knowledge of signal meaning
   2) Knowledge of critical-victim tradeoff but not of signal meaning
   3) Knowledge of both critical-victim tradeoff and signal meaning

C. Intent Prediction

We represent intents as (x, y) positions the participant intends to navigate towards. Specifically, we consider victims, doors, and room openings as locations-of-interest, which frames the intent prediction task as predicting either the next room to be visited or the next victim to be triaged. The predictions are accumulated at each timestep (∼0.5 seconds per timestep) between visits of locations-of-interest, and then their mode is evaluated against the ground truth. We evaluate on a held-out test set of 20% of participant trajectories.

D. Concepts

We defined a set of 10 concepts related to mission timer, knowledge condition, and field of view (see Table I). We consider 3 subsets:
   • Concept Set I is the full concept set
   • Concept Set II omits the knowledge condition concepts
   • Concept Set III omits both knowledge condition and mission timer concepts

TABLE I: Concepts

   Concept                                        Concept Sets
   Mission timer between 0-3 minutes              I, II
   Mission timer between 3-5 minutes              I, II
   Mission timer between 5-8 minutes              I, II
   Mission timer > 8 minutes                      I, II
   Knowledge condition 1 (no triage, no signal)   I
   Knowledge condition 2 (triage, no signal)      I
   Knowledge condition 3 (triage, signal)         I
   Door / opening in field of view                I, II, III
   Non-critical victim in field of view           I, II, III
   Critical victim in field of view               I, II, III
The field of view and mission timer concepts were labeled directly from the data; the knowledge condition concepts were labeled with external knowledge of the condition for each participant trajectory.

E. Results

We compare the accuracy of ToM model intent predictions under 3 methods: training without CW, training from scratch with CW, and transfer learning by initializing a CW model with the weights of a pretrained non-CW model. Results are provided in Table II, where we see that introducing concept whitening for interpretability actually results in increased accuracy of the model.

TABLE II: Intent Prediction Performance

   Training Method    Intent Prediction Accuracy
   Without CW         0.730
   CW                 0.840
   CW + Transfer      0.841

F. Concept Ablation

We also tested the effect of concept selection on performance (Table III). In particular, we omitted the knowledge condition (KC) concepts and/or the mission timer concepts, tested concept-whitened ToM models both with and without transfer, and found noticeably diminished performance.

Compared to the non-CW model, CW with reduced concept sets resulted in worse performance, and while transfer from the non-CW model somewhat mitigated this effect, we still see a significant drop from the performance with the full concept set. This demonstrates the importance of good concept selection for the resulting performance of a concept-whitened ToM model.

TABLE III: Varying Concept Sets

   Training Method    Concept Set    Acc.
   Without CW         N/A            0.730
   CW                 III            0.412
   CW                 II             0.692
   CW                 I              0.840
   CW + Transfer      III            0.549
   CW + Transfer      II             0.779
   CW + Transfer      I              0.841

VI. INTERPRETABILITY ANALYSIS

A. Qualitative Analysis

We can estimate the concept importance for each prediction via the activation for each column of the CW orthogonal matrix Q, given by:

   aj = qj⊤ ẑb   (5)

where ẑb is the concept-whitened latent representation for belief state b.
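For a single prediction, these activations can be read off and ranked directly; a short sketch follows (the cw_activations hook that returns ẑb for a given belief state is hypothetical):

import torch

def concept_importances(desire_model, belief, concept_names):
    """Rank concepts by their activation a_j = q_j^T z_hat_b for one prediction (eq. 5)."""
    with torch.no_grad():
        z_hat = desire_model.cw_activations(belief.unsqueeze(0)).squeeze(0)  # hypothetical hook
    # e.g. [("Critical victim in field of view", 2.3), ("Mission timer between 3-5 minutes", 0.7), ...]
    return sorted(zip(concept_names, z_hat.tolist()), key=lambda kv: -kv[1])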
We can examine the activation vectors a = [a1 . . . ak] for different types of intent predictions by the learned model (CW + transfer, full concept set). The mean normalized activations for non-critical victims, critical victims, and doors / openings are visualized in Fig. 4.

These largely line up with intuition; unsurprisingly, the presence of an intent in the field of view is an important concept for the model's prediction of said intent. We also see variability in the importance of different mission time intervals for different intent predictions, and similarly for knowledge condition.

B. Quantitative Analysis

We also attempt to quantitatively assess how well the concept activation vectors characterize the ToM model's intent predictions. Intuitively, we should be able to deduce information about what the model's prediction will be with high accuracy, using only concept activation vectors as input.

This can be framed as a classification problem where, given an activation vector, we predict the type of the corresponding predicted intent as one of: non-critical victim, critical victim, or door / opening. Rather than use a complex model, we can learn a decision tree or SVM, and use the accuracy as a proxy for the quality of our concept activations. As shown in Table IV, we achieve relatively high accuracies with simple models, which suggests that our learned concept activations are a good characterization of the "decision making process".

TABLE IV: Classifying Activations as Intents

   Model            Accuracy
   Decision Tree    0.93
   SVM              0.92
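This probing setup can be reproduced in spirit with a few lines of scikit-learn; the sketch below assumes the activation vectors and the corresponding intent-type labels have already been collected from the trained model.

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

def probe_activations(X, y):
    """X: one concept activation vector a = [a_1 ... a_k] per prediction (n x k array).
    y: type of the corresponding predicted intent
       ("non-critical victim", "critical victim", or "door / opening")."""
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
    scores = {}
    for name, clf in [("Decision Tree", DecisionTreeClassifier()), ("SVM", SVC())]:
        clf.fit(X_train, y_train)
        # accuracy serves as a proxy for how well the activations explain the predictions
        scores[name] = accuracy_score(y_test, clf.predict(X_test))
    return scores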
VII. CONCLUSION

We have presented a modular ToM approach for reasoning about humans that can allow for both neural and heuristic components. Our approach explicitly incorporates interpretability while still maintaining the performance and scalability advantages of neural approaches. We move beyond simple toy environments and apply our framework to a more complex, realistic setting, and our experimental results demonstrate that enforcing interpretability can also increase predictive accuracy.

The natural extension of this work is in exploring the benefits of ToM and interpretability in assistance and interventions. A particularly interesting direction would be to explore counterfactuals; that is, examining how intent prediction changes given a change in concept activation, and then finding the closest belief state that could result in the given change. Approaching this through interpretable concept activations rather than in the belief space could facilitate interventions to warn about or correct human errors when working in Human + AI teams.

We hope that this work can serve as a starting point for working towards social intelligence in AI systems that are designed to interact with humans.
Fig. 4: Mean concept activations for different intent prediction types. (a) Intent prediction of non-critical victim: the presence of a non-critical victim in the field-of-view is the most activated concept. (b) Intent prediction of critical victim: we see zero activation for the mission timer above 5 minutes (which corresponds with critical victims expiring), and the presence of a critical victim or room opening in the field-of-view is a common reason for predicting intent to triage a critical victim. (c) Intent prediction of opening: the presence of an opening in the field of view is the most highly activated concept, and compared to the other mission timer concepts, the last 2 minutes sees the timer become a more important reason for predicting intent to go towards a door or opening.

APPENDIX

A. Implementation Details

We consider Minecraft environments that can effectively be collapsed to 2D representations. The specification for each of the framework components is given below.

Observations: Observations are represented as X × Y grids, where each (x, y) coordinate contains one of K different block types.

Belief States: Each belief state b is represented by an X × Y × K grid, where the value at (x, y, k) represents the probability of the block at position (x, y) having block type k.

Belief Model: We use a rule-based belief model that aggregates observations into our belief state with probability 1, and decays probabilities over time toward a uniform distribution over block types by b → (b + ε)/(1 + Kε) after each timestep, where b is a belief state grid, K is the number of block types, and ε is a forgetfulness hyperparameter we set to 0.01.
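This update can be written directly as an array operation; below is a small NumPy sketch (the observation-merging step and the grid shapes are assumptions about the implementation):

import numpy as np

def belief_update(belief, observation_onehot, observed_mask, eps=0.01):
    """Rule-based belief update: overwrite observed cells with the observation (probability 1),
    then decay every cell toward uniform via b <- (b + eps) / (1 + K * eps).
    belief, observation_onehot: (X, Y, K) arrays; observed_mask: (X, Y) boolean array."""
    K = belief.shape[-1]
    b = np.where(observed_mask[..., None], observation_onehot, belief)  # aggregate observation
    return (b + eps) / (1.0 + K * eps)   # each cell's distribution drifts toward 1/K over time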
Intents: We represent each intent as an (x, y) position the player intends to navigate towards.

Intent Prior: When generating data to train the inverse action model, for each belief state b, we sample an intent (x, y) given some prior p(x, y | b), and create a set of b, (x, y) pairs. We specifically use the prior p(x, y | b) = 1/db(x, y) if we believe a victim or door is at position (x, y), and 0 otherwise, where db(x, y) is the L1 distance of point (x, y) from the player's position.
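Sampling from this prior might look as follows (a sketch; how victim/door cells are read off the belief grid and how the player position is tracked are assumptions):

import numpy as np

def sample_intent(belief, player_xy, goal_block_ids):
    """Sample an intent (x, y) with probability proportional to 1 / d_b(x, y), restricted to
    cells believed to contain a victim or door. goal_block_ids: indices of those block types."""
    X, Y, _ = belief.shape
    likely_block = belief.argmax(axis=-1)          # most likely block type per cell
    weights = np.zeros((X, Y))
    for x in range(X):
        for y in range(Y):
            if likely_block[x, y] in goal_block_ids and (x, y) != tuple(player_xy):
                d = abs(x - player_xy[0]) + abs(y - player_xy[1])   # L1 distance
                weights[x, y] = 1.0 / d
    p = weights.ravel() / weights.sum()
    idx = np.random.choice(X * Y, p=p)
    return divmod(idx, Y)                          # flattened index back to (x, y)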
Action Model: We use A* search as our action model, A*(b, (x, y)) = a, where b is a belief state, (x, y) represents the intent, and a is an action from the discrete action set of: left turn, right turn, toggle door, toggle lever, triage, or None.

B. Neural Architectures

Inverse Action Model: The inverse action model takes as input a belief state b and an action a. It outputs an X × Y grid of log-probabilities for the intent at each grid cell. It is designed as an encoder-decoder model, inspired by image-segmentation approaches, and uses the following architecture (Fig. 2); a condensed sketch follows the list:
   • 3 encoder blocks, each consisting of a convolutional layer, followed by max-pooling, ReLU activation and batch norm
   • A bottleneck layer, where the downsampled input is concatenated with action a and passed through a fully-connected layer
   • 3 decoder blocks, each consisting of:
       – a deconvolutional upsampling layer
       – a residual connection with the output of the corresponding encoder block
       – a convolutional layer, followed by ReLU activation and batch norm
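The sketch below is one plausible PyTorch rendering of this encoder-decoder; the channel counts, kernel sizes, skip pairing, and the assumption that the grid sides are divisible by 8 are guesses for illustration, not the authors' exact configuration.

import torch
import torch.nn as nn
import torch.nn.functional as F

def _enc_block(cin, cout):
    # convolution -> max-pooling -> ReLU -> batch norm, as described above
    return nn.Sequential(nn.Conv2d(cin, cout, 3, padding=1), nn.MaxPool2d(2),
                         nn.ReLU(), nn.BatchNorm2d(cout))

class EncDecIntentModel(nn.Module):
    """Encoder-decoder over a (K, X, Y) belief grid that outputs an X*Y grid of intent
    log-probabilities. With an action embedding at the bottleneck this plays the role of
    the inverse action model; with action_dim=0 it is the desire model variant, where the
    bottleneck batch norm would be swapped for a concept whitening layer."""
    def __init__(self, K, X, Y, action_dim=0, c1=32, c2=64, c3=128):
        super().__init__()
        self.e1, self.e2, self.e3 = _enc_block(K, c1), _enc_block(c1, c2), _enc_block(c2, c3)
        self.gx, self.gy = X // 8, Y // 8                 # spatial size after 3 poolings
        flat = c3 * self.gx * self.gy
        self.fc = nn.Linear(flat + action_dim, flat)      # bottleneck (optionally + action)
        self.norm = nn.BatchNorm1d(flat)                  # replaced by a CW layer when training CW
        self.up1 = nn.ConvTranspose2d(c3, c2, 2, stride=2)
        self.d1 = nn.Sequential(nn.Conv2d(c2, c2, 3, padding=1), nn.ReLU(), nn.BatchNorm2d(c2))
        self.up2 = nn.ConvTranspose2d(c2, c1, 2, stride=2)
        self.d2 = nn.Sequential(nn.Conv2d(c1, c1, 3, padding=1), nn.ReLU(), nn.BatchNorm2d(c1))
        self.up3 = nn.ConvTranspose2d(c1, c1, 2, stride=2)
        self.d3 = nn.Sequential(nn.Conv2d(c1, c1, 3, padding=1), nn.ReLU(), nn.BatchNorm2d(c1))
        self.head = nn.Conv2d(c1, 1, 1)                   # final linear layer over the grid

    def forward(self, belief, action=None):
        # belief: (B, K, X, Y); action: optional (B, action_dim) embedding or one-hot
        s1 = self.e1(belief)
        s2 = self.e2(s1)
        s3 = self.e3(s2)
        h = s3.flatten(1)
        if action is not None:
            h = torch.cat([h, action], dim=1)
        h = self.norm(self.fc(h)).view(-1, s3.shape[1], self.gx, self.gy)
        h = self.d1(self.up1(h) + s2)                     # residual connection to encoder block 2
        h = self.d2(self.up2(h) + s1)                     # residual connection to encoder block 1
        h = self.d3(self.up3(h))                          # back to the full X x Y resolution
        return F.log_softmax(self.head(h).flatten(1), dim=1)   # per-cell intent log-probabilities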
Desire Model: The desire model takes as input a belief state b and a character embedding c. It outputs an X × Y grid of log-probabilities for the intent at each grid cell. Its architecture (Fig. 2) is identical to that of the inverse action model, except without concatenating the action in the bottleneck layer. When training concept whitening, we replace batch normalization after the bottleneck layer with a concept whitening layer.
REFERENCES

[1] M. Georgeff, B. Pell, M. Pollack, M. Tambe, and M. Wooldridge, "The belief-desire-intention model of agency," in International Workshop on Agent Theories, Architectures, and Languages. Springer, 1998, pp. 1–10.
[2] C. Baker, R. Saxe, and J. Tenenbaum, "Bayesian theory of mind: Modeling joint belief-desire attribution," in Proceedings of the Annual Meeting of the Cognitive Science Society, vol. 33, 2011.
[3] N. Rabinowitz, F. Perbet, F. Song, C. Zhang, S. A. Eslami, and M. Botvinick, "Machine theory of mind," in International Conference on Machine Learning. PMLR, 2018, pp. 4218–4227.
[4] F. Cuzzolin, A. Morelli, B. Cirstea, and B. J. Sahakian, "Knowing me, knowing you: Theory of mind in AI," Psychological Medicine, vol. 50, no. 7, pp. 1057–1061, 2020.
[5] J. W. Astington and M. J. Edward, "The development of theory of mind in early childhood," Encyclopedia on Early Childhood Development, vol. 14, pp. 1–7, 2010.
[6] R. Choudhury, G. Swamy, D. Hadfield-Menell, and A. D. Dragan, "On the utility of model learning in HRI," in 2019 14th ACM/IEEE International Conference on Human-Robot Interaction (HRI). IEEE, 2019, pp. 317–325.
[7] J. N. Foerster, R. Y. Chen, M. Al-Shedivat, S. Whiteson, P. Abbeel, and I. Mordatch, "Learning with opponent-learning awareness," in Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems, AAMAS 2018, Stockholm, Sweden, July 10-15, 2018, E. André, S. Koenig, M. Dastani, and G. Sukthankar, Eds. International Foundation for Autonomous Agents and Multiagent Systems, Richland, SC, USA / ACM, 2018, pp. 122–130.
[8] Y. Wen, Y. Yang, R. Luo, J. Wang, and W. Pan, "Probabilistic recursive reasoning for multi-agent reinforcement learning," in 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019.
[9] R. Shah, N. Gundotra, P. Abbeel, and A. Dragan, "On the feasibility of learning, rather than assuming, human biases for reward inference," in International Conference on Machine Learning. PMLR, 2019, pp. 5670–5679.
[10] Z. Chen, Y. Bei, and C. Rudin, "Concept whitening for interpretable image recognition," Nature Machine Intelligence, vol. 2, no. 12, pp. 772–782, 2020.
[11] J. Oh, F. Meneguzzi, and K. Sycara, "Probabilistic plan recognition for proactive assistant agents," Plan, Activity, and Intent Recognition. Elsevier, Amsterdam, The Netherlands, vol. 10, p. 23, 2014.
[12] J. Oh, F. Meneguzzi, K. Sycara, and T. J. Norman, "Prognostic normative reasoning," Engineering Applications of Artificial Intelligence, vol. 26, no. 2, pp. 863–872, 2013.
[13] R. Chandra, A. Bera, and D. Manocha, "StylePredict: Machine theory of mind for human driver behavior from trajectories," arXiv preprint arXiv:2011.04816, 2020.
[14] J. Jara-Ettinger, "Theory of mind as inverse reinforcement learning," Current Opinion in Behavioral Sciences, vol. 29, pp. 105–110, 2019.
[15] Y. Li, J. Song, and S. Ermon, "InfoGAIL: Interpretable imitation learning from visual demonstrations," in Proceedings of the 31st International Conference on Neural Information Processing Systems, 2017, pp. 3815–3825.
[16] A. Verma, V. Murali, R. Singh, P. Kohli, and S. Chaudhuri, "Programmatically interpretable reinforcement learning," in International Conference on Machine Learning. PMLR, 2018, pp. 5045–5054.
[17] M. Bain and C. Sammut, "A framework for behavioural cloning," in Machine Intelligence 15, 1995, pp. 103–129.
[18] F. Torabi, G. Warnell, and P. Stone, "Behavioral cloning from observation," in Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI 2018, July 13-19, 2018, Stockholm, Sweden, J. Lang, Ed. ijcai.org, 2018, pp. 4950–4957.
[19] K. Simonyan, A. Vedaldi, and A. Zisserman, "Deep inside convolutional networks: Visualising image classification models and saliency maps," in 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14-16, 2014, Workshop Track Proceedings, Y. Bengio and Y. LeCun, Eds., 2014.
[20] D. Smilkov, N. Thorat, B. Kim, F. Viégas, and M. Wattenberg, "SmoothGrad: Removing noise by adding noise," arXiv preprint arXiv:1706.03825, 2017.
[21] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra, "Grad-CAM: Visual explanations from deep networks via gradient-based localization," in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 618–626.
[22] M. T. Ribeiro, S. Singh, and C. Guestrin, ""Why should I trust you?" Explaining the predictions of any classifier," in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016, pp. 1135–1144.
[23] S. M. Lundberg and S. Lee, "A unified approach to interpreting model predictions," in Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, I. Guyon, U. von Luxburg, S. Bengio, H. M. Wallach, R. Fergus, S. V. N. Vishwanathan, and R. Garnett, Eds., 2017, pp. 4765–4774.
[24] J. Adebayo, J. Gilmer, M. Muelly, I. J. Goodfellow, M. Hardt, and B. Kim, "Sanity checks for saliency maps," in Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, December 3-8, 2018, Montréal, Canada, S. Bengio, H. M. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, Eds., 2018, pp. 9525–9536.
[25] C. Rudin, "Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead," Nature Machine Intelligence, vol. 1, no. 5, pp. 206–215, 2019.
[26] R. Iyer, Y. Li, H. Li, M. Lewis, R. Sundar, and K. Sycara, "Transparency and explanation in deep reinforcement learning neural networks," in Proceedings of the 2018 AAAI/ACM Conference on AI, Ethics, and Society, 2018, pp. 144–150.
[27] R. M. Annasamy and K. Sycara, "Towards better interpretability in deep Q-networks," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, no. 01, 2019, pp. 4561–4569.
[28] B. Kim, M. Wattenberg, J. Gilmer, C. Cai, J. Wexler, F. Viegas, et al., "Interpretability beyond feature attribution: Quantitative testing with concept activation vectors (TCAV)," in International Conference on Machine Learning. PMLR, 2018, pp. 2668–2677.
[29] B. Zhou, Y. Sun, D. Bau, and A. Torralba, "Interpretable basis decomposition for visual explanation," in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 119–134.
[30] A. Ghorbani, J. Wexler, J. Y. Zou, and B. Kim, "Towards automatic concept-based explanations," in Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, H. M. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. B. Fox, and R. Garnett, Eds., 2019, pp. 9273–9282.
[31] P. W. Koh, T. Nguyen, Y. S. Tang, S. Mussmann, E. Pierson, B. Kim, and P. Liang, "Concept bottleneck models," in International Conference on Machine Learning. PMLR, 2020, pp. 5338–5348.
[32] J. Ho and S. Ermon, "Generative adversarial imitation learning," in NIPS, 2016.
[33] P. E. Hart, N. J. Nilsson, and B. Raphael, "A formal basis for the heuristic determination of minimum cost paths," IEEE Transactions on Systems Science and Cybernetics, vol. 4, no. 2, pp. 100–107, 1968.
[34] O. Ronneberger, P. Fischer, and T. Brox, "U-Net: Convolutional networks for biomedical image segmentation," in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2015, pp. 234–241.
[35] L. Huang, Y. Zhou, F. Zhu, L. Liu, and L. Shao, "Iterative normalization: Beyond standardization towards efficient whitening," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 4874–4883.
[36] Z. Wen and W. Yin, "A feasible method for optimization with orthogonality constraints," Mathematical Programming, vol. 142, no. 1, pp. 397–434, 2013.
[37] L. Huang, J. Freeman, N. Cooke, M. Cohen, X. Yin, J. Clark, M. Wood, V. Buchanan, C. Carrol, F. Scholcover, A. Mudigonda, L. Thomas, A. Teo, M. Freiman, J. Colonna-Romano, L. Lapujade, and K. Tatapudi, "Using humans' theory of mind to study artificial social intelligence in Minecraft search and rescue," in (to be submitted to the) Journal of Cognitive Science, 2021.