Evaluating Explanations for Reading Comprehension with Realistic Counterfactuals


Xi Ye    Rohan Nair    Greg Durrett
Department of Computer Science
The University of Texas at Austin
{xiye,rnair,gdurrett}@cs.utexas.edu

Abstract

Token-level attributions have been extensively studied to explain model predictions for a wide range of classification tasks in NLP (e.g., sentiment analysis), but such explanation techniques are less explored for machine reading comprehension (RC) tasks. Although the transformer-based models used here are identical to those used for classification, the underlying reasoning these models perform is very different and different types of explanations are required. We propose a methodology to evaluate explanations: an explanation should allow us to understand the RC model's high-level behavior with respect to a set of realistic counterfactual input scenarios. We define these counterfactuals for several RC settings, and by connecting explanation techniques' outputs to high-level model behavior, we can evaluate how useful different explanations really are. Our analysis suggests that pairwise explanation techniques are better suited to RC than token-level attributions, which are often unfaithful in the scenarios we consider. We additionally propose an improvement to an attention-based attribution technique, resulting in explanations which better reveal the model's behavior.¹

¹ Code and data available at https://github.com/xiye17/EvalQAExpl

1 Introduction

Interpreting the behavior of black-box neural models for NLP has garnered interest for its many possible benefits (Lipton, 2018). A range of post-hoc explanation techniques have been proposed, including textual explanations (Hendricks et al., 2016) and token-level attributions (Ribeiro et al., 2016; Sundararajan et al., 2017; Guan et al., 2019; De Cao et al., 2020). These formats can be applied to many domains, including sentiment analysis (Guan et al., 2019; De Cao et al., 2020), visual recognition (Simonyan et al., 2013), and natural language inference (Camburu et al., 2018; Thorne et al., 2019). However, both approaches have been roundly criticized. An explanation may not be faithful to the computation of the original model (Wu and Mooney, 2018; Hase and Bansal, 2020; Wiegreffe et al., 2020; Jacovi and Goldberg, 2020b), and may even mislead users (Rudin, 2019). More critically, token attributions in particular do not have a consistent and meaningful social attribution (Miller, 2019; Jacovi and Goldberg, 2020a): that is, when a user of the system looks at the explanation, they do not draw a correct conclusion from it, making it hard to use for downstream tasks.

Our focus in this work is how to evaluate explanations for reading comprehension in terms of their ability to reveal the high-level behavior of models. That is, rather than an explanation saying "this word was important", we want to draw a conclusion like "the model picked out these two words and compared them"; this statement can be evaluated for faithfulness, and it helps a user draw meaningful conclusions about the system. We approach this evaluation from the perspective of simulatability (Hase and Bansal, 2020): can we predict how the system will behave on new or modified examples? Doing so for RC models is challenging due to the complex nature of the task, which fundamentally involves a correspondence between a question and a supporting text context.

Our core technique is to assess how well various explanations can support or reject hypotheses about the model's behavior (i.e., simulate the model) on realistic counterfactuals, which are perturbations of original data points (Figure 1). These resemble several prior "stress tests" used to evaluate models, including counterfactual sets (Kaushik et al., 2020), contrast sets (Gardner et al., 2020), and checklists (Ribeiro et al., 2020). We first manually curate these sets to answer questions like: if different facts were shown in the context, how would the model behave? If different amounts of text or other incorrect paragraphs were retrieved by an upstream retrieval system, would the model still get the right answer? Then, we evaluate techniques for simulating the model's behavior given explanations like token attributions. That is, using the explanations, can we recover the answers to these questions and give usable insights about the QA system?
[Figure 1. Base example D0: "Are Super High Me and All in This Tea both documentaries?" with context "Super High Me is a 2008 documentary film about smoking. All in This Tea is a 2007 documentary film." Prediction: YES. Counterfactual examples: D1 replaces the first "documentary" with "romance"; D2 replaces the second; D3 replaces both. The model predicts YES on all three, i.e., it still predicts YES even if the documentary tokens are replaced. Explanations: Integrated Gradients (Sundararajan et al., 2017) and DiffMask (De Cao et al., 2020) both highlight the "documentary" tokens, suggesting RoBERTa looks at documentary; our LAtAttr indicates documentary barely contributes.]
Figure 1: A motivating example and explanations generated by several methods. We profile the model behaviors
with the predictions on realistic counterfactual inputs, which suggest the model does not truly base its prediction
on the two movies being documentaries. We can evaluate explanations by seeing whether they can be used in
combination with heuristics to derive this same conclusion about model behavior.

We investigate two paradigms of explanation techniques, token attribution-based (Simonyan et al., 2013; Ribeiro et al., 2016; De Cao et al., 2020) and feature interaction-based (Tsang et al., 2020; Hao et al., 2020), which attribute decisions to sets of tokens or to pairwise/higher-order interactions. We show that token-level attribution is not sufficient for analyzing QA, which naturally involves more complex reasoning over multiple clues. For both paradigms, we devise methods to bridge from these explanations to high-level conclusions about counterfactual behavior.

We apply our methodology to automatically evaluate and compare a series of explanation techniques on two types of questions from HotpotQA (Yang et al., 2018), questions from adversarial SQuAD (Rajpurkar et al., 2016), and on a synthetic QA setting. For each concrete high-level hypothesis we formulate, we automatically assess the extent to which our low-level explanation techniques can usefully produce the same answer as our hand-crafted counterfactuals. Our experimental results show moderate success of this approach overall, and that explanations in the form of feature interactions better align with model behavior. We further propose a modification to an existing interaction technique from Hao et al. (2020) and show improved performance on our datasets.

We summarize our main contributions as follows: (1) We propose a framework for evaluating explanations based on model simulation on realistic counterfactuals. (2) We describe a technique for connecting low-level attributions (token-level or higher-order) with high-level model hypotheses. (3) We improve an attention-based pairwise attribution technique with a simple but effective fix, leading to strong empirical results. (4) We analyze a set of QA tasks and show that our approach can derive meaningful conclusions on each.

2 Motivation

We start by going through a detailed example of how to use our methodology to compare several attribution techniques. Figure 1 shows an example of a multi-hop yes/no question from HotpotQA. The QA model correctly answers yes in this case. Given the original example, the explanations produced using IntGrad (Sundararajan et al., 2017) and DiffMask (De Cao et al., 2020) (explained in Section 4) both assign high attribution scores to the two documentary tokens appearing in the context: a user of the system is likely to impute that the model is comparing these two values, as it is natural to assume the model is using the highlighted information correctly. By contrast, our pairwise attribution approach primarily attributes the prediction to interactions with the question, suggesting that the interactions related to documentary do not matter.

We manually curate a set of contrastive examples to test this hypothesis. If the model truly recognizes that both movies are documentaries, then replacing either or both of the documentary tokens with romance should change the prediction. To verify this, we perturb the original example to obtain another three examples (left side of Figure 1). These four examples together form a contrastive local neighborhood (Ribeiro et al., 2016; Kaushik et al., 2020; Gardner et al., 2020) consisting of realistic counterfactuals.²

² One could argue that these counterfactuals are not entirely realistic: a romance film about smoking is fairly unlikely to occur. However, generating perfect counterfactuals is an extremely hard problem (Qin et al., 2019), requiring deep world knowledge of what scenarios make sense or what properties are true of certain entities. Nevertheless, we believe that these examples are realistic enough that robust models should still behave well on them.
However, contrary to what the token attribution-based techniques suggest, the model always predicts "yes" for every example in the neighborhood, disputing that the model is following the right reasoning process. Although our pairwise attribution seemed at first glance much less plausible than that generated by the other techniques, our explanation was in fact better from the perspective of simulating the model's behavior on these new examples.

Our main assumption in this work can be stated as follows: an explanation should describe model behavior with respect to realistic counterfactuals, not just look plausible. Past work has evaluated along plausibility criteria (Lei et al., 2016; Strout et al., 2019; Thorne et al., 2019), but as we see from this example, faithful explanations (Subramanian et al., 2020; Jacovi and Goldberg, 2020b,a) are better aligned with our goal of simulatability. We argue that a good explanation is one that aligns with the model's high-level behaviors, and from which we can understand how the model generalizes to new data.

Discussion: Realistic Counterfactuals  Many counterfactual modifications are possible: past work has looked at injecting non-meaningful triggers (Wallace et al., 2019), deleting chunks of content (Ribeiro et al., 2016), or evaluating interpolated input points as in IntGrad, all of which violate assumptions about the input distribution. In RC, masking out a fact in the question often turns the question into a nonsensical one.³ Focusing on realistic counterfactuals, by contrast, illuminates fundamental problems with our RC models' reasoning capabilities (Jia and Liang, 2017; Chen and Durrett, 2019; Min et al., 2019; Jiang and Bansal, 2019). This is the same motivation as that behind contrast sets (Gardner et al., 2020), but our work focuses on benchmarking explanations, not models themselves.

³ The exception is in adversarial settings; however, many adversarial attacks do not actually draw on real-world threat models (Athalye et al., 2018), so we consider these less important.

3 Evaluation Protocol

We seek to formalize the reasoning we undertook on Figure 1. Using the model's explanation on a "base" data point, can we predict the model's behavior on perturbed instances of that point?

Definitions  Given an original example D0 (e.g., the top example in Figure 1), we construct a set of perturbations {D1, ..., Dk} (e.g., the three counterfactual examples in Figure 1), which together with D0 form a local neighborhood D. These perturbations are realistic inputs derived from existing datasets or which we construct.

We formulate a hypothesis H about the neighborhood. In Figure 1, H is the question "is the model comparing the target properties?" (documentary in this case). Based on the model's behavior on the set D, we can derive a high-level behavioral label z corresponding to the truth of H. We form our local neighborhood to check the answer empirically and compute a ground truth for z. Since the model always predicts "yes" in this neighborhood, we label the set D with z = 0 (the model is not comparing the properties). We label D with z = 1 when the model does predict "no" for some perturbations.

Procedure  Our approach is as follows:
1. Formulate a hypothesis H about the model.
2. Collect realistic counterfactuals D to answer it empirically for some base examples.
3. Use the explanation of each base example to predict z. That is, learn the mapping D0 → z based on the explanation of D0 so we can simulate the model on D without observing the perturbations.

Note that this third step only uses the explanation of the base data point: explanations should let us draw conclusions about new counterfactuals without having to run inference on them.

Simulation  In our experiments on HotpotQA and SQuAD, we compute a scalar factor f for each explanation representing the importance of a specific part of the input (e.g., the "documentary" tokens in Figure 1), which we believe should correlate with model predictions on the counterfactuals. If an explanation assigns higher importance to this information, it suggests that the model will actually change its behavior on these new examples.

Given this factor, we construct a simple classifier where we predict z = 1 if the factor f is above a threshold.
We expect that factors extracted from better explanations will better indicate the model's behavior. Hence, we evaluate each explanation using the best simulation accuracy it can achieve and the AUC score (S-ACC and S-AUC).⁴

⁴ We do not collect large enough datasets to train a simulation model, but given larger collections of counterfactuals, this is another approach one could take.

Our evaluation resembles the human evaluation in Hase and Bansal (2020), which asks human raters to predict the model's decision given an example together with its explanation, and which also reports simulatability. Our method differs in that (1) we predict the behavior on unseen counterfactuals given the explanation of a single base data point, and (2) we automatically extract a factor to predict model behavior instead of asking humans to do so.
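To make the protocol concrete, the sketch below (our illustration, not code from the released repository) scores a set of extracted factors against the ground-truth labels z; `factors` and `labels` are assumed inputs holding one scalar f and one label z per neighborhood.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def simulation_scores(factors, labels):
    """Score how well a scalar factor f predicts the behavioral label z."""
    factors = np.asarray(factors, dtype=float)
    labels = np.asarray(labels, dtype=int)

    # S-ACC: best accuracy of the one-threshold rule "predict z = 1 iff f > t",
    # sweeping t over the observed factor values (plus one value below them all).
    candidates = np.concatenate(([factors.min() - 1.0], factors))
    s_acc = max(((factors > t).astype(int) == labels).mean() for t in candidates)

    # S-AUC: threshold-free measure of how well f ranks z = 1 neighborhoods
    # above z = 0 neighborhoods.
    s_auc = roc_auc_score(labels, factors)
    return s_acc, s_auc
```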
4 Explanation Techniques

Compared to classification tasks like sentiment analysis, QA is much more fundamentally about interaction between input features, especially between a question and a context. This work directly compares feature interaction explanations with the token attribution techniques that are more common for other tasks.⁵

⁵ A potentially even more meaningful format would be a program actually approximating the model's behavior in a detailed way, as has been explored in the context of reinforcement learning (Verma et al., 2018; Bastani et al., 2018). Prior work does not really show how to effectively build this type of explanation for QA at this time, although some techniques like anchors (Ribeiro et al., 2018) have been explored before.

4.1 Token Attribution-Based

These techniques all return scores si for each token i in both the question and context that are fed into the QA system.

Integrated Gradients (IntGrad) (Sundararajan et al., 2017) computes an attribution for each token by integrating the gradients of the prediction with respect to the token embeddings over the path from a baseline input (typically mask or pad tokens) towards the designated input. Although it is a common technique, recent work has raised concerns about the effectiveness of IntGrad methods for NLP tasks, as interpolated word embeddings do not correspond to real input values (Harbecke, 2021).
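As a minimal sketch of the method (a standard Riemann approximation of the path integral; the `forward_from_embeds` wrapper around the QA model is an assumption of ours, not part of any particular library):

```python
import torch

def integrated_gradients(forward_from_embeds, input_embeds, baseline_embeds, steps=50):
    """Integrated gradients over token embeddings (Sundararajan et al., 2017).

    forward_from_embeds: assumed callable mapping embeddings (1, n, d) to a
        scalar prediction score (e.g., the logit of the predicted answer).
    input_embeds / baseline_embeds: (1, n, d) tensors; the baseline is
        typically the embeddings of mask or pad tokens.
    Returns one attribution score per token, shape (n,).
    """
    total_grads = torch.zeros_like(input_embeds)
    for alpha in torch.linspace(0.0, 1.0, steps):
        # Interpolate between the baseline and the real input.
        point = baseline_embeds + alpha * (input_embeds - baseline_embeds)
        point = point.detach().requires_grad_(True)
        score = forward_from_embeds(point)
        total_grads += torch.autograd.grad(score, point)[0]
    # Riemann approximation of the path integral, scaled by the input delta.
    attributions = (input_embeds - baseline_embeds) * total_grads / steps
    return attributions.sum(dim=-1).squeeze(0)  # pool over the embedding dim
```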
Differentiable Mask (DiffMask) (De Cao et al., 2020) learns to mask out a subset of the input tokens for a given example while maintaining a distribution over answers as close to the original distribution as possible. This mask is learned in a differentiable fashion: a shallow neural model (a linear layer) is trained to recognize which tokens to discard.

4.2 Feature Interaction-Based

These techniques all return scores sij for each pair of tokens (i, j) in both the question and context that are fed into the QA system.

Archipelago (Tsang et al., 2020) measures non-additive feature interaction. Similar to DiffMask, Archip is also implicitly based on unrealistic counterfactuals which remove tokens. Given a subset of tokens, Archip defines the contribution of the interaction as the prediction obtained from masking out all the other tokens, leaving only a very small fraction of the input. Applying this definition to a complex task like QA can result in a completely nonsensical input.

Attention Attribution (AtAttr) (Hao et al., 2020) uses attention specifically to derive pairwise explanations. However, it avoids the pitfalls of directly inspecting attention (Serrano and Smith, 2019; Wiegreffe and Pinter, 2019) by running an integrated gradients procedure over all the attention links within the transformer, yielding attribution scores for each link. The attribution scores directly reflect the contribution of particular attention links, making this method able to describe pairwise interactions.

Concretely, define the h-head attention matrix over input D with n tokens as A = [A1, ..., Al], where Ai ∈ R^(h×n×n) holds the attention scores for each layer. We can obtain the attribution score for each entry in the attention matrix A as:

    ATTR(A) = A ⊙ ∫₀¹ ∂F(D, αA)/∂A dα,    (1)

where F(D, αA) is the transformer model that takes as input the tokens and a matrix specifying the attention scores for each layer. We then sum up the attention attributions across all heads and layers to obtain the pairwise interaction between tokens (i, j), i.e., sij = Σm Σn ATTR(A)mnij.
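The following sketch shows this computation under our reading of Equation (1). The `forward_with_attn` wrapper, which forces the model's attention matrices to a given tensor and returns the prediction score, is assumed; building it requires hooks into the attention modules, which we elide here.

```python
import torch

def attention_attribution(forward_with_attn, attn, steps=20):
    """AtAttr-style pairwise attribution (Equation 1), approximated by a
    Riemann sum. attn has shape (layers, heads, n, n)."""
    total_grads = torch.zeros_like(attn)
    for alpha in torch.linspace(0.0, 1.0, steps):
        # AtAttr scales the attention scores of *all* layers simultaneously.
        scaled = (alpha * attn).detach().requires_grad_(True)
        score = forward_with_attn(scaled)
        total_grads += torch.autograd.grad(score, scaled)[0]
    attr = attn * total_grads / steps   # ATTR(A) = A * (integral of dF/dA)
    return attr.sum(dim=(0, 1))         # pool over layers and heads -> s_ij
```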
4.3 Layer-wise Attention Attribution

We propose a new technique, LAtAttr, to improve upon AtAttr. The AtAttr approach simultaneously scales all attention scores when computing the attribution, which could be problematic. Since the attention scores of higher layers are determined by the attention scores of lower layers, forcibly setting all the attention scores and computing gradients at the same time may distort the gradients for the attribution of lower-level links, hence producing inaccurate attributions. When applying the IntGrad approach in other contexts, we typically assume the independence of input features (e.g., pixels of an image or tokens of an utterance), an assumption which does not hold here.

[Figure 2 diagram: transformer layers l1, ..., ln with their attention masks; at step i, only layer i's mask is interpolated via IG(0, Ai), while the other masks (marked Ã) are computed as usual.]
Figure 2: Steps of our Layer-wise Attention Attribution approach, where we only intervene on a single layer at each step. For instance, to compute the attribution of the attention mask at layer 2, we only intervene on the attention mask A2, and leave the other attention masks to be computed as usual (marked with a tilde).

To address this issue, we propose a simple fix, namely applying the IntGrad method layer by layer. As in Figure 2, to compute the attribution for attention links of layer i, we only change the attention scores at layer i:

    ATTR(Ai) = Ai ⊙ ∫₀¹ ∂F/i(D, αAi)/∂Ai dα.    (2)

F/i(D, αAi) denotes that we only intervene on the attention masks at layer i while leaving the other attention masks to be computed naturally by the model. We pool to obtain the final attribution for pairwise interactions as sij = Σm Σn ATTR(A)mnij.
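A sketch of the layer-by-layer variant follows; the assumed `forward_with_layer_attn(i, attn_i)` wrapper forces only layer i's attention to the given tensor and lets the other layers react naturally, mirroring F/i above.

```python
import torch

def layerwise_attention_attribution(forward_with_layer_attn, attn, steps=20):
    """LAtAttr sketch (Equation 2): integrate one layer's attention at a time.
    attn has shape (layers, heads, n, n)."""
    n = attn.shape[-1]
    pairwise = torch.zeros(n, n)
    for i in range(attn.shape[0]):
        total_grads = torch.zeros_like(attn[i])
        for alpha in torch.linspace(0.0, 1.0, steps):
            # Intervene on layer i only; attention above layer i is recomputed.
            scaled = (alpha * attn[i]).detach().requires_grad_(True)
            score = forward_with_layer_attn(i, scaled)
            total_grads += torch.autograd.grad(score, scaled)[0]
        attr_i = attn[i] * total_grads / steps   # ATTR(A_i)
        pairwise += attr_i.sum(dim=0)            # pool over heads, sum layers
    return pairwise
```

Compared with the AtAttr sketch above, the only change is which attention scores are interpolated at each step; the cost grows roughly linearly with the number of layers.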
This technique does not necessarily satisfy the Completeness axiom commonly used in this line of work (Sundararajan et al., 2017). Since our ultimate goal is a downstream empirical evaluation, we set aside any theoretical analysis of this technique for now.

5 Experiments: Real QA Datasets

We evaluate our attribution methods (Section 4) following our stated evaluation protocol (Section 3) on the HotpotQA dataset (Yang et al., 2018) and the SQuAD dataset (Rajpurkar et al., 2016), specifically leveraging examples from adversarial SQuAD (Jia and Liang, 2017).

5.1 Hotpot Yes-No Questions

We first study a subset of comparison yes/no questions, a challenging format despite the binary answer space (Clark et al., 2019). Typically, a yes-no comparison question requires comparing the properties of two entities (Figure 1). We base our experiments on a RoBERTa (Liu et al., 2019) QA model achieving 77.2 F1 on the development set in the distractor setting, comparable to other strong RoBERTa-based models (Tu et al., 2020; Groeneveld et al., 2020).

Hypothesis & Counterfactuals  The hypothesis H we investigate is as in Section 2: the model compares the entities' properties as indicated by the question. For instance, for the question Are A and B of the same nationality, the properties are the nationalities of "A" and "B"; for the question Are A and B both ice plants, the properties are their plant species. As in the motivating example, we construct the counterfactuals by replacing the properties in the context with one another if the two properties are different, or with similar hand-selected properties (e.g., "documentary" → "romance", "American" → "English") if the two are the same, producing three additional perturbations D1, D2, D3 for each base example D0. We set z = 0 (the hypothesis does not hold) if for each perturbed example Di ∈ D, the model predicts the same answer as for the original example, indicating a failure to compare the properties. We set z = 1 if the model's prediction does change. The authors annotated perturbations for 50 randomly selected (D, z) pairs in total, forming a total of 200 counterfactual instances. More details of the annotation process and concrete examples can be found in the Appendix.
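The replacement rule can be summarized with a small sketch (an illustration only: the released counterfactuals were written by hand, and all argument names below are ours):

```python
def make_property_counterfactuals(fact_a, fact_b, prop_a, prop_b, fallback):
    """Sketch of the perturbation rule for yes-no comparison questions.

    fact_a / fact_b: the context sentence describing each entity.
    prop_a / prop_b: the property mention inside each sentence.
    fallback: a hand-selected substitute (e.g., "romance" for "documentary")
        used when the two properties are the same; picking it is a manual step.
    """
    # Swap the two properties if they differ; otherwise use the substitute.
    sub_a = prop_b if prop_a != prop_b else fallback
    sub_b = prop_a if prop_a != prop_b else fallback
    d1 = (fact_a.replace(prop_a, sub_a), fact_b)                         # perturb entity A
    d2 = (fact_a, fact_b.replace(prop_b, sub_b))                         # perturb entity B
    d3 = (fact_a.replace(prop_a, sub_a), fact_b.replace(prop_b, sub_b))  # perturb both
    return [d1, d2, d3]
```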
Connecting Explanation and Hypothesis  To make a judgment about z, we extract a factor f based on the importance of a set of property tokens P. For token attribution-based methods, we define f as the sum of the attributions si of each token in P: Σ(i∈P) si. For feature interaction-based methods producing pairwise attributions sij, we compute f by pooling the scores of all the interactions related to the property tokens, i.e., Σ(i∈P ∨ j∈P) sij.

Now we predict z = 1 if the factor f is above a threshold, and evaluate the capability of the factor to indicate the model's high-level behavior using the best simulation accuracy it can achieve (S-ACC) and the AUC score (S-AUC).
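In code, the factor extraction amounts to the pooling below (our sketch; `prop_idx` holds the token indices of P):

```python
import numpy as np

def property_factor(prop_idx, token_scores=None, pair_scores=None):
    """Extract the factor f for the property tokens P (indices prop_idx).

    token_scores: per-token attributions s_i (token attribution methods).
    pair_scores:  (n, n) pairwise attributions s_ij (interaction methods).
    """
    if pair_scores is not None:
        # Pool every interaction touching a property token: i in P or j in P.
        mask = np.zeros(pair_scores.shape, dtype=bool)
        mask[prop_idx, :] = True
        mask[:, prop_idx] = True
        return pair_scores[mask].sum()
    # Token attribution methods: sum s_i over the property tokens.
    return np.asarray(token_scores)[prop_idx].sum()
```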
Approach     Yes-No            Bridge
             S-ACC   S-AUC     S-ACC   S-AUC
Majority     52.0    −         56.0    −
Conf         64.0    49.8      66.0    65.9
IntGrad      72.0    75.2      72.0    77.9
DiffMask     66.0    60.2      68.0    62.3
Archip       56.0    53.2      62.0    57.5
AtAttr       66.0    63.6      72.0    79.1
LAtAttr      84.0    87.9      78.0    81.7

Table 1: Results on HotpotQA Yes-No type and Bridge questions. Our approach can better predict the model behavior on realistic counterfactuals, surpassing attribution-based methods.

Results  First, we show that using explanations can indeed illustrate the model's behavior. As shown in Table 1, our approach (LAtAttr) is the best in this setting, achieving a simulation accuracy of 84%. That is, with a properly set threshold, we can successfully predict whether the model's predictions change when perturbing the properties in the original example 84% of the time. The explanations therefore give us the ability to simulate our model's behavior better than the other methods here. Our approach also improves substantially over the vanilla AtAttr method.

Token attribution-based approaches obtain an accuracy around 72%. This indicates that token attribution-based methods are not effective in the HotpotQA setting, which engages interaction between tokens more intensively.

In this setting, DiffMask performs poorly, typically because it assigns high attribution to many tokens, since it determines which tokens need to be kept rather than distinguishing fine-grained importance (examples in the Appendix). It is possible that other heuristics, or models learned on large numbers of perturbations, could more meaningfully extract predictions from this technique.
5.2 Hotpot Bridge Questions

We also evaluate the explanation approaches on so-called bridge questions from the HotpotQA dataset, described in Yang et al. (2018). Figure 3 shows an example explanation for a bridge question. From the attribution scores we find that the most salient connection is between the span "what government position" in the question and the span "United States Ambassador" in the context. This attribution directly highlights the reasoning shortcut (Jia and Liang, 2017; Chen and Durrett, 2019; Min et al., 2019; Jiang and Bansal, 2019) the model is using, where it disregards the second part of the question. If we inject an additional sentence, "Hillary Clinton is an American politician, who served as the United States secretary of the state from 2009 to 2013", into the context, the model will be misled and predict "United States secretary" as the new answer. This sentence could easily have been part of another document retrieved in the retrieval stage, so we consider its inclusion to be a realistic counterfactual.

[Figure 3: Explanations generated by our approach for a bridge-type question from HotpotQA. The prediction can mostly be attributed to the primary question, indicating the model is taking the reasoning shortcut, and the prediction can be flipped with an adversarial sentence.]

We further define the primary question, i.e., the primary part (containing Wh- words) of the entire question (e.g., "What government position is held by the woman" in Figure 3), following the decomposition principle from Min et al. (2019).

Hypothesis & Counterfactuals  The hypothesis H we investigate is: the model is using correct reasoning and not a shortcut driven by the primary question part.

We construct counterfactuals following the same idea applied in our example. For a given question, we add an adversarial sentence based on the primary part (containing Wh- words) of the question so as to alter the model's prediction. The added adversarial sentence contains context leading to a spurious answer to only the primary question, but does not change the gold answer (refer to the Appendix for examples). We do this twice, yielding a set D = {D0, D1, D2} consisting of the base example and two perturbations. We define the label of D to be z = 0 in the case that the model's prediction does change when attacked, and z = 1 otherwise.

We randomly sample 50 base data points from the development set; two of our authors each write an adversarial sentence for each, forming 150 data points in total.
Connecting Explanation and Hypothesis  For this setting, we use a factor describing the importance of the primary question normalized by the importance of the entire question. Namely, let P = {pi} be the set of tokens in the primary question, and Q = {qi} be the set of tokens in the entire question. We define the factor f as the importance of P normalized by the importance of Q, where the importance calculation is the same as in Section 5.1. A higher factor means the model is more heavily relying on only the primary question, and hence the prediction has a better chance of being attacked.
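As a short sketch (ours), with per-token importance computed as in Section 5.1:

```python
def primary_question_factor(token_scores, primary_idx, question_idx):
    """Importance of the primary question part P normalized by that of the
    whole question Q (P is a subset of Q)."""
    importance_p = sum(token_scores[i] for i in primary_idx)
    importance_q = sum(token_scores[i] for i in question_idx)
    # A ratio near 1 means the model leans almost entirely on the primary
    # part, i.e., it is more likely to be using the reasoning shortcut.
    return importance_p / importance_q
```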
Results  According to the simulation AUC scores in Table 1, feature interaction-based techniques again outperform token attribution approaches. Our approach achieves a simulation accuracy of 78%, substantially higher than any other result.
5.3 SQuAD Adversarial

Hypothesis & Counterfactuals  Our hypothesis H is: the model can resist adversarial attacks of the addSent variety (Jia and Liang, 2017). For each of the original examples D0 from part of the SQuAD-Adv development set, Jia and Liang (2017) create 5 adversarial attacks, which are paraphrased and filtered by Turkers to give 0 to 5 valid attacks for each example, yielding our set D. We define the label of D to be z = 1 if the model resists all the adversarial attacks posed on D0 (i.e., the predictions for D are the same). To ensure the behavior is more precisely profiled by the counterfactuals, we only keep the base examples with more than 3 valid attacks, resulting in a total of 276 (D, z) pairs (1,506 data points).
Connecting Explanation and Hypothesis  We use a factor f indicating the importance of the essential keywords extracted from the question using POS tags (proper nouns and numbers). E.g., for the question "What Florida stadium was considered for Super Bowl 50", we extract "Florida", "Super Bowl", and "50". If the model considers all the essential keywords mentioned in the question, it should not be fooled by distractors with irrelevant information. We show a set of illustrative examples in the Appendix. We compute the importance scores in the same way described in Section 5.1.

In addition to the scores provided by various explanation techniques, we also use the model's confidence in the original prediction as a baseline.
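A sketch of the keyword extraction follows; the POS criterion (proper nouns and numbers) is as described above, while the use of spaCy and the `en_core_web_sm` pipeline is our assumption for illustration:

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # any POS-tagging pipeline would do

def essential_keywords(question):
    """Extract essential question keywords: proper nouns and numbers."""
    doc = nlp(question)
    return [tok.text for tok in doc if tok.pos_ in ("PROPN", "NUM")]

# essential_keywords("What Florida stadium was considered for Super Bowl 50")
# -> ["Florida", "Super", "Bowl", "50"]  (token-level; "Super Bowl" spans two tokens)
```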
Approach    S-ACC   S-AUC
Majority    52.1    −
Conf        58.3    57.8
IntGrad     61.6    61.1
DiffMask    57.6    53.6
Archip      58.6    56.2
AtAttr      68.4    72.5
LAtAttr     70.0    72.1

Table 2: Simulation accuracy and AUC scores for the SQuAD adversarial setting, assessing whether the model changes its prediction on an example when attacked.

Results  We show results in Table 2. The best approaches (AtAttr and LAtAttr) achieve a simulation accuracy around 70%, 10% above the performance based on confidence. This shows the model is indeed over-confident in its predictions; our assumption about robustness, together with our technique, can successfully expose the vulnerability in some of the model predictions.

There is room to improve on these results; our simple heuristic cannot perfectly connect the explanations to the model behavior in all cases. We note that there are other, orthogonal approaches (Kamath et al., 2020) to calibrate the confidence of QA models' predictions by looking at statistics of the adversarial examples; here, our judgment is made purely based on the original example, and does not exploit learning to refine our heuristic.

5.4 Discussion and Limitations

Our explanations can reveal known dataset biases and reasoning shortcuts in HotpotQA, without having to perform a detailed manual analysis. This confirms the utility of our explanations: model designers can look at them, either manually or automatically, and determine how robust the model is going to be when faced with counterfactuals.

Our analysis also highlights limitations of current explanation techniques, and sheds light on future research directions on this topic. In our experiments, we observed other nontrivial behaviors of the QA model in the Hotpot setting. For instance, we created counterfactuals by permuting the order of the paragraphs constituting the context, which often gave rise to different predictions. This observation indicates the model's prediction may also be impacted by biases in positional embeddings (e.g., the answer tends to occur in the first retrieved paragraph), which cannot be indicated by current attribution methods. We believe this is a useful avenue for future investigation. By first thinking of what kinds of counterfactuals and what kinds of behaviors we want to explain, we can motivate new explanation techniques.
6 Synthetic Dataset

We have evaluated our explanations' faithfulness and the extent to which they help simulate model behavior. We now use a synthetic setting to evaluate plausibility, i.e., whether these explanations can successfully attribute the model's predictions to the rationales that humans would perceive. It is impossible to know what a QA model is doing on real data; therefore, we create a synthetic dataset and ensure via symmetry that there are no reasoning shortcuts, so a model generalizing on this dataset must be doing some form of correct reasoning.

We show a concrete example of this data in Figure 4, with details of the dataset construction and model in the Appendix. Clues external to the relevant parts of the context cannot provide any information relevant to the question, and given that our model generalizes perfectly, a plausibility evaluation is justified in this case.

[Figure 4: Two synthetic examples. Each consists of a context listing entity-relation pairs (e.g., "E0 R0, E1 R1, E2 R0"), a question over two entities (e.g., "? E0 E2"), and an answer; the ground-truth rationale tokens are underlined in the original figure.]
Figure 4: Two examples of our synthetic data with ground-truth rationales underlined. In the first example, the context describes entities E0/E1/E2 as associated with relations R0/R1/R0, respectively; the first question asks whether E0 and E2 exhibit the same relation; the answer is yes. Only these tokens are provided to the model.

We do not need to construct counterfactuals for our evaluation on this dataset.
Results  We assess whether an explanation aligns well with model behavior using the F1 score between the ground-truth rationales (6 tokens, excluding the special start and end tokens) and the top-6 important tokens picked by the explanation. The ground-truth rationale in Figure 4 is underlined; the model should consider these tokens to determine the answer.

Rand   IntGrad   DiffMask   Archipelago   AtAttr   LAtAttr
0.45   0.55      0.64       0.69          0.55     0.67

Table 3: The F1 scores between the models' top-6 highlighted tokens and ground-truth rationales. Our approach is substantially better than AtAttr.
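The scoring can be sketched as follows (our illustration; per-token scores from pairwise methods are pooled first):

```python
def rationale_f1(token_scores, gold_indices, k=6):
    """F1 between the k highest-scoring tokens and the gold rationale tokens."""
    top_k = sorted(range(len(token_scores)),
                   key=lambda i: token_scores[i], reverse=True)[:k]
    overlap = len(set(top_k) & set(gold_indices))
    if overlap == 0:
        return 0.0
    precision = overlap / k
    recall = overlap / len(gold_indices)
    return 2 * precision * recall / (precision + recall)
```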
In general, feature interaction-based approaches performed better at recovering the ground-truth explanations than token attribution-based approaches. The best method in this setting is Archip, achieving an F1 score of 0.69. Our LAtAttr approach is also effective in this setting and performs on par with Archip.

Surprisingly, the simple synthetic case here turns out not to be so simple. This might be due to the complexity of our task: despite being synthetic, it requires true multi-hop reasoning, which modern models still struggle to learn (Jiang and Bansal, 2019; Khashabi et al., 2019; Trivedi et al., 2020). This dataset exposes the need for better explanation techniques for this sort of reasoning and how it emerges.

7 Related Work

We focus on several prominent token attribution techniques, but there are other related methods as well, including Shapley values (Štrumbelj and Kononenko, 2014; Lundberg and Lee, 2017), contextual decomposition (Jin et al., 2020), and hierarchical explanations (Chen et al., 2020). These formats can also be evaluated using our framework if connected to model behavior with a proper heuristic. Other work explores "concept-based" explanations (Mu and Andreas, 2020; Bau et al., 2017; Yeh et al., 2019). These provide another pathway towards building explanations of high-level behavior; however, they have been explored primarily for image recognition tasks and cannot be directly applied to QA, where defining these sorts of "concepts" is challenging.

Probing techniques aim to discover what intermediate representations have been learned in neural models (Tenney et al., 2019; Conneau et al., 2018; Hewitt and Liang, 2019; Voita and Titov, 2020). Internal representations could potentially be used to predict behavior on contrast sets similar to this work; however, this cannot be done heuristically, and larger datasets are needed to explore this.

Other work on evaluating explanations is primarily based on how explanations can assist humans in predicting model decisions for a given example (Doshi-Velez and Kim, 2017; Chandrasekaran et al., 2018; Nguyen, 2018; Hase and Bansal, 2020); we are the first to consider building contrast sets for this. Similar ideas have been used in other contexts (Kaushik et al., 2020; Gardner et al., 2020), but we are focused on the evaluation of explanations rather than general model evaluation.

8 Conclusion

We have presented an evaluation technique based on realistic counterfactuals to evaluate explanations for RC models. We show that our evaluation method distinguishes which explanations truly give us insight about high-level model behavior. Feature interaction-based techniques perform the best in our analysis, especially our LAtAttr method. We advocate that future research conduct such quantitative evaluation based on realistic counterfactuals when developing novel explanation techniques.

Acknowledgments

Thanks to Eunsol Choi, Jifan Chen, Jiacheng Xu, Qiaochu Chen, and everyone in the UT TAUR lab for helpful discussions. This work was partially supported by NSF Grant IIS-1814522, NSF Grant SHF-1762299, a gift from Arm, a gift from Salesforce Inc, and an equipment grant from NVIDIA.
References

Anish Athalye, Nicholas Carlini, and David Wagner. 2018. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. In Proceedings of the 35th International Conference on Machine Learning.

Osbert Bastani, Yewen Pu, and Armando Solar-Lezama. 2018. Verifiable reinforcement learning via policy extraction. arXiv preprint arXiv:1805.08328.

David Bau, Bolei Zhou, Aditya Khosla, Aude Oliva, and Antonio Torralba. 2017. Network dissection: Quantifying interpretability of deep visual representations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6541–6549.

Oana-Maria Camburu, Tim Rocktäschel, Thomas Lukasiewicz, and Phil Blunsom. 2018. e-SNLI: Natural language inference with natural language explanations. In Advances in Neural Information Processing Systems.

Arjun Chandrasekaran, Viraj Prabhu, Deshraj Yadav, Prithvijit Chattopadhyay, and Devi Parikh. 2018. Do explanations make VQA models more predictable to a human? In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing.

Hanjie Chen, Guangtao Zheng, and Yangfeng Ji. 2020. Generating hierarchical explanations on text classification via feature interaction detection. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics.

Jifan Chen and Greg Durrett. 2019. Understanding dataset design choices for multi-hop reasoning. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT).

Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. 2019. BoolQ: Exploring the surprising difficulty of natural yes/no questions. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers).

Alexis Conneau, German Kruszewski, Guillaume Lample, Loïc Barrault, and Marco Baroni. 2018. What you can cram into a single $&!#* vector: Probing sentence embeddings for linguistic properties. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).

Nicola De Cao, Michael Sejr Schlichtkrull, Wilker Aziz, and Ivan Titov. 2020. How do decisions emerge across layers in neural models? Interpretation with differentiable masking. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics.

Finale Doshi-Velez and Been Kim. 2017. Towards a rigorous science of interpretable machine learning. arXiv preprint arXiv:1702.08608.

Matt Gardner, Yoav Artzi, Victoria Basmov, Jonathan Berant, Ben Bogin, Sihao Chen, Pradeep Dasigi, Dheeru Dua, Yanai Elazar, Ananth Gottumukkala, Nitish Gupta, Hannaneh Hajishirzi, Gabriel Ilharco, Daniel Khashabi, Kevin Lin, Jiangming Liu, Nelson F. Liu, Phoebe Mulcaire, Qiang Ning, Sameer Singh, Noah A. Smith, Sanjay Subramanian, Reut Tsarfaty, Eric Wallace, Ally Zhang, and Ben Zhou. 2020. Evaluating models' local decision boundaries via contrast sets. In Findings of the Association for Computational Linguistics: EMNLP 2020.

Dirk Groeneveld, Tushar Khot, Mausam, and Ashish Sabharwal. 2020. A simple yet strong pipeline for HotpotQA. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP).

Chaoyu Guan, Xiting Wang, Quanshi Zhang, Runjin Chen, Di He, and Xing Xie. 2019. Towards a deep and unified understanding of deep neural models in NLP. In Proceedings of the 36th International Conference on Machine Learning.

Yaru Hao, Li Dong, Furu Wei, and Ke Xu. 2020. Self-attention attribution: Interpreting information interactions inside transformer. arXiv preprint arXiv:2004.11207.

David Harbecke. 2021. Explaining natural language processing classifiers with occlusion and language modeling. arXiv preprint arXiv:2101.11889.

Peter Hase and Mohit Bansal. 2020. Evaluating explainable AI: Which algorithmic explanations help users predict model behavior? In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics.

Lisa Anne Hendricks, Zeynep Akata, Marcus Rohrbach, Jeff Donahue, Bernt Schiele, and Trevor Darrell. 2016. Generating visual explanations. In European Conference on Computer Vision (ECCV).

John Hewitt and Percy Liang. 2019. Designing and interpreting probes with control tasks. In Empirical Methods in Natural Language Processing.

Zachary C. Lipton. 2018. The mythos of model interpretability: In machine learning, the concept of interpretability is both important and slippery. Queue, 16(3):31–57.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv, abs/1907.11692.

Scott Lundberg and Su-In Lee. 2017. A unified approach to interpreting model predictions. arXiv preprint arXiv:1705.07874.

Tim Miller. 2019. Explanation in artificial intelligence: Insights from the social sciences. Artificial Intelligence, 267:1–38.
   in Natural Language Processing (EMNLP).                  Insights from the social sciences. Artificial intelli-
                                                            gence, 267:1–38.
Alon Jacovi and Y. Goldberg. 2020a. Aligning faithful
  interpretations with their social attribution. ArXiv,
                                                          Sewon Min, Eric Wallace, Sameer Singh, Matt Gard-
  abs/2006.01067.
                                                            ner, Hannaneh Hajishirzi, and Luke Zettlemoyer.
Alon Jacovi and Yoav Goldberg. 2020b. Towards faith-        2019. Compositional questions do not necessitate
  fully interpretable NLP systems: How should we de-        multi-hop reasoning. In Proceedings of the 57th An-
  fine and evaluate faithfulness? In Proceedings of the     nual Meeting of the Association for Computational
  58th Annual Meeting of the Association for Compu-         Linguistics.
  tational Linguistics.
                                                          Jesse Mu and Jacob Andreas. 2020.       Composi-
Robin Jia and Percy Liang. 2017. Adversarial exam-           tional explanations of neurons. arXiv preprint
  ples for evaluating reading comprehension systems.         arXiv:2006.14032.
  In acl.
                                                          Dong Nguyen. 2018. Comparing automatic and human
Yichen Jiang and Mohit Bansal. 2019. Avoiding rea-          evaluation of local explanations for text classifica-
  soning shortcuts: Adversarial evaluation, training,       tion. In Proceedings of the 2018 Conference of the
  and model development for multi-hop QA. In Pro-           North American Chapter of the Association for Com-
  ceedings of the 57th Annual Meeting of the Associa-       putational Linguistics: Human Language Technolo-
  tion for Computational Linguistics.                       gies, Volume 1 (Long Papers).
Xisen Jin, Zhongyu Wei, Junyi Du, Xiangyang Xue,
  and Xiang Ren. 2020. Towards hierarchical impor-        Lianhui Qin, Antoine Bosselut, Ari Holtzman, Chandra
  tance attribution: Explaining compositional seman-        Bhagavatula, Elizabeth Clark, and Yejin Choi. 2019.
  tics for neural sequence models. In International         Counterfactual story reasoning and generation. In
  Conference on Learning Representations.                   Proceedings of the 2019 Conference on Empirical
                                                            Methods in Natural Language Processing and the
Amita Kamath, Robin Jia, and Percy Liang. 2020. Se-         9th International Joint Conference on Natural Lan-
 lective question answering under domain shift. In          guage Processing (EMNLP-IJCNLP).
 Proceedings of the 58th Annual Meeting of the Asso-
 ciation for Computational Linguistics.                   Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and
                                                            Percy Liang. 2016. SQuAD: 100,000+ questions for
Divyansh Kaushik, Eduard Hovy, and Zachary Lipton.          machine comprehension of text. In Proceedings of
  2020. Learning the difference that makes a differ-        the 2016 Conference on Empirical Methods in Natu-
  ence with counterfactually-augmented data. In Inter-      ral Language Processing, pages 2383–2392, Austin,
  national Conference on Learning Representations.          Texas. Association for Computational Linguistics.
Daniel Khashabi, Erfan Sadeqi Azer, Tushar Khot,
  Ashish Sabharwal, and Dan Roth. 2019. On the            Marco Tulio Ribeiro, Sameer Singh, and Carlos
  possibilities and limitations of multi-hop reason-       Guestrin. 2016. “Why should I trust you?” Explain-
  ing under linguistic imperfections. arXiv preprint       ing the predictions of any classifier. In Proceedings
  arXiv:1901.02522.                                        of the 22nd ACM SIGKDD international conference
                                                           on knowledge discovery and data mining.
Tao Lei, Regina Barzilay, and Tommi Jaakkola. 2016.
  Rationalizing neural predictions. In Proceedings of     Marco Tulio Ribeiro, Sameer Singh, and Carlos
  the 2016 Conference on Empirical Methods in Nat-         Guestrin. 2018. Anchors: High-precision model-
  ural Language Processing, pages 107–117, Austin,         agnostic explanations. In AAAI, volume 18, pages
  Texas. Association for Computational Linguistics.        1527–1535.
Marco Tulio Ribeiro, Tongshuang Wu, Carlos Guestrin,      Michael Tsang, Sirisha Rambhatla, and Yan Liu. 2020.
 and Sameer Singh. 2020. Beyond accuracy: Behav-            How does this interaction affect me? interpretable at-
 ioral testing of NLP models with CheckList. In Pro-        tribution for feature interactions. In Proceedings of
 ceedings of the 58th Annual Meeting of the Associa-        the Conference on Advances in Neural Information
 tion for Computational Linguistics.                        Processing Systems (NeurIPS).
Cynthia Rudin. 2019. Stop explaining black box ma-        Ming Tu, Kevin Huang, Guangtao Wang, Jing Huang,
  chine learning models for high stakes decisions and       Xiaodong He, and Bowen Zhou. 2020. Select, an-
  use interpretable models instead. Nature Machine          swer and explain: Interpretable multi-hop reading
  Intelligence, 1.                                          comprehension over multiple documents. In Pro-
                                                            ceedings of the Association for the Advancement of
Sofia Serrano and Noah A. Smith. 2019. Is attention
                                                            Artificial Intelligence (AAAI).
  interpretable? In Proceedings of the 57th Annual
  Meeting of the Association for Computational Lin-       Abhinav Verma, Vijayaraghavan Murali, Rishabh
  guistics.                                                 Singh, Pushmeet Kohli, and Swarat Chaudhuri.
Karen Simonyan, Andrea Vedaldi, and Andrew Zisser-          2018. Programmatically interpretable reinforce-
  man. 2013. Deep inside convolutional networks: Vi-        ment learning. In Proceedings of the 35th Interna-
  sualising image classification models and saliency        tional Conference on Machine Learning.
  maps. arXiv preprint arXiv:1312.6034.                   Elena Voita and Ivan Titov. 2020.        Information-
Julia Strout, Ye Zhang, and Raymond Mooney. 2019.           theoretic probing with minimum description length.
   Do Human Rationales Improve Machine Explana-             In Proceedings of the 2020 Conference on Empirical
   tions? In Proceedings of the 2019 ACL Workshop           Methods in Natural Language Processing (EMNLP).
   BlackboxNLP: Analyzing and Interpreting Neural
   Networks for NLP, pages 56–62, Florence, Italy. As-    Eric Wallace, Shi Feng, Nikhil Kandpal, Matt Gardner,
   sociation for Computational Linguistics.                  and Sameer Singh. 2019. Universal adversarial trig-
                                                             gers for attacking and analyzing NLP. In Proceed-
Erik Štrumbelj and Igor Kononenko. 2014. Explaining          ings of the 2019 Conference on Empirical Methods
   prediction models and individual predictions with         in Natural Language Processing and the 9th Inter-
   feature contributions. Knowledge and information          national Joint Conference on Natural Language Pro-
   systems, 41(3):647–665.                                   cessing (EMNLP-IJCNLP).

Sanjay Subramanian, Ben Bogin, Nitish Gupta, Tomer        Sarah Wiegreffe, Ana Marasović, and Noah A. Smith.
  Wolfson, Sameer Singh, Jonathan Berant, and Matt          2020. Measuring association between labels and
  Gardner. 2020. Obtaining faithful interpretations         free-text rationales. ArXiv, abs/2010.12762.
  from compositional neural networks. arXiv preprint
  arXiv:2005.00724.                                       Sarah Wiegreffe and Yuval Pinter. 2019. Attention is
                                                            not not explanation. In Proceedings of the 2019 Con-
Mukund Sundararajan, Ankur Taly, and Qiqi Yan. 2017.        ference on Empirical Methods in Natural Language
 Axiomatic attribution for deep networks. arXiv             Processing and the 9th International Joint Confer-
 preprint arXiv:1703.01365.                                 ence on Natural Language Processing (EMNLP-
                                                            IJCNLP).
Ian Tenney, Patrick Xia, Berlin Chen, Alex Wang,
   Adam Poliak, R Thomas McCoy, Najoung Kim,              Jialin Wu and Raymond Mooney. 2018. Faithful multi-
   Benjamin Van Durme, Sam Bowman, Dipanjan Das,             modal explanation for visual question answering.
   and Ellie Pavlick. 2019. What do you learn from
   context? probing for sentence structure in contextu-   Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Ben-
   alized word representations. In International Con-       gio, William W. Cohen, Ruslan Salakhutdinov, and
   ference on Learning Representations.                     Christopher D. Manning. 2018. HotpotQA: A
                                                            dataset for diverse, explainable multi-hop question
James Thorne,       Andreas Vlachos,         Christos       answering. In Conference on Empirical Methods in
  Christodoulopoulos, and Arpit Mittal. 2019.               Natural Language Processing (EMNLP).
  Generating token-level explanations for natural
  language inference. In Proceedings of the 2019          Chih-Kuan Yeh, Been Kim, Sercan Ö. Arik, C. Li,
  Conference of the North American Chapter of the           P. Ravikumar, and T. Pfister. 2019. On concept-
  Association for Computational Linguistics: Human          based explanations in deep neural networks. ArXiv,
  Language Technologies, Volume 1 (Long and Short           abs/1910.07969.
  Papers), pages 963–969, Minneapolis, Minnesota.
  Association for Computational Linguistics.
Harsh Trivedi, Niranjan Balasubramanian, Tushar
  Khot, and Ashish Sabharwal. 2020. Is multihop QA
  in DiRe condition? measuring and reducing discon-
  nected reasoning. In Proceedings of the 2020 Con-
  ference on Empirical Methods in Natural Language
  Processing (EMNLP).
A   Details of Hotpot Yes-No Counterfactuals

Figure 5 shows several examples that illustrate our process of generating counterfactuals for the Hotpot Yes-No setting.

Most Hotpot Yes-No questions follow one of two templates: "Are A and B both __?" (Figure 5, abc) and "Are A and B of the same __?" (Figure 5, def). We define the property tokens associated with each question as the tokens in the context that match the blank in the template; that is, the values of the property that A and B are being compared on. For example, in Figure 5a, French and German are the property tokens, as the property of interest is the national origin.

To construct a neighborhood for a base data point, we take the following steps (a programmatic sketch of this procedure is given after Figure 5):

1. Manually extract the property tokens in the context

2. Replace each property token with the two substitutes, forming a set of four counterfactuals exhibiting nonidentical ground truths

When the properties associated with the two entities differ from each other, we directly use the extracted properties as the substitutes (Figure 5, abf); otherwise, we add a new property candidate of the same class (Figure 5, cde).

We annotated randomly sampled examples from the Hotpot Yes-No questions. We skipped several examples that compared abstract concepts with no explicit property tokens. For instance, we skipped the question "Are both Yangzhou and Jiangyan District considered coastal cities?", whose given context does not explicitly state whether the cities are coastal cities. We looked through 61 examples in total and obtained annotations for 50, so such discarded examples constitute a relatively small fraction of the dataset. Overall, this resulted in 200 counterfactual instances. We found that the prediction of a RoBERTa QA model changed on 52% of the base data points when perturbed.

B   Details of Hotpot Bridge Counterfactuals

Figure 6 shows several examples of our annotations for generating counterfactuals for Hotpot Bridge examples. Specifically, we view bridge questions as consisting of two single-hop questions: the primary part (marked in Figure 6) and the secondary part. The primary part is the main body of the question, whereas the secondary part is usually a clause used to link the bridge entity (Min et al., 2019).

We construct our neighborhoods as follows:

1. Manually decompose the questions into the primary and secondary parts

2. Make up adversarial sentences to confuse the model. For a given base example, an adversarial sentence typically provides a spurious answer to the primary question, but does not change the gold answer.

Two of the authors each wrote a single adversarial sentence for 50 of the Hotpot Bridge examples, yielding 150 counterfactual instances in total. The adversarial sentences altered 56% of the predictions on the base examples (a sketch of this flip-rate computation is given after Figure 6).

C   Details of Synthetic Dataset

Our dataset is generated using templates, with 20 entities (E0 through E19) and 20 relations (R0 through R19). We place 3 or 4 entities in each context. We randomly inject <mask> tokens between entity-relation pairs (we never inject <mask> within an entity-relation pair) to prevent the model from learning spurious correlations with the positional embeddings.

We create training/validation sets of 200,000/10,000 examples, respectively, and train a 2-layer, 12-head transformer model on this task, achieving 100% accuracy on the training set and over 98% accuracy on the validation set.
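For concreteness, the following is a minimal Python sketch of this generation process. The chain-of-facts layout, the helper names, and the bound on the number of injected <mask> tokens are illustrative assumptions on our part; the appendix does not pin down the exact template format.

import random

ENTITIES = [f"E{i}" for i in range(20)]    # E0 ... E19
RELATIONS = [f"R{i}" for i in range(20)]   # R0 ... R19

def make_context(num_entities=4, max_masks=3, rng=random):
    # Chain consecutive entities with random relations; this chain
    # layout is an assumption, not the paper's exact fact format.
    entities = rng.sample(ENTITIES, num_entities)
    facts = []
    for head, tail in zip(entities, entities[1:]):
        facts.append(f"{head} {rng.choice(RELATIONS)} {tail}")
    # Inject <mask> tokens only BETWEEN facts, never inside one,
    # so no fact is tied to a fixed absolute position.
    tokens = []
    for fact in facts:
        tokens.append(fact)
        tokens.extend(["<mask>"] * rng.randint(0, max_masks))
    return " ".join(tokens)

random.seed(0)
print(make_context())  # e.g. "E12 R7 E3 <mask> E3 R0 E18 <mask> <mask> ..."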
(a) Question:    Were Ulrich Walter and Léopold Eyharts both from Germany?
    Context:     Léopold Eyharts (born April 28, 1957) is a Brigadier General in the French Air Force, an engineer and ESA astronaut.
                 Prof. Dr. Ulrich Hans Walter (born February 9, 1954) is a German physicist/engineer and a former DFVLR astronaut.
    Substitutes: French, German

(b) Question:    Are both Aloinopsis and Eriogonum ice plants?
    Context:     Aloinopsis is a genus of ice plants from South Africa.
                 Eriogonum is the scientific name for a genus of flowering plants in the family Polygonaceae. The genus is found in North America and is known as wild buckwheat.
    Substitutes: ice, flowering

(c) Question:    Were Frank R. Strayer and Krzysztof Kieślowski both Directors?
    Context:     Frank R. Strayer (September 21, 1891 - February 3, 1964) was an actor, film writer, and director. He was active from the mid-1920s until the early 1950s.
                 Krzysztof Kieślowski (27 June 1941 - 13 March 1996) was a Polish art-house film director and screenwriter.
    Substitutes: director, producer

(d) Question:    Were Scott Derrickson and Ed Wood of the same nationality?
    Context:     Scott Derrickson (born July 16, 1966) is an American director, screenwriter and producer.
                 Edward Davis Wood Jr. (October 10, 1924 - December 10, 1978) was an American filmmaker, actor, writer, producer, and director.
    Substitutes: American, English

(e) Question:    Are the movies "Monsters, Inc." and "Mary Poppins" both by the same company?
    Context:     Mary Poppins is a 1964 American musical-fantasy film directed by Robert Stevenson and produced by Walt Disney, with songs written and composed by the Sherman Brothers.
                 Monsters, Inc. is a 2001 American computer-animated comedy film produced by Pixar Animation Studios and distributed by Walt Disney Pictures.
    Substitutes: Walt Disney, Universal

(f) Question:    Are Steve Perry and Dennis Lyxzén both members of the same band?
    Context:     Stephen Ray Perry (born January 22, 1949) is an American singer, songwriter and record producer. He is best known as the lead singer of the rock band Journey.
                 Dennis Lyxzén (born June 19, 1972) is a musician best known as the lead vocalist for Swedish hardcore punk band Refused.
    Substitutes: Journey, Refused

Figure 5: Examples (contexts are truncated for brevity) of our annotations on Hotpot Yes-No base data points. We find the property tokens in the context, and build realistic counterfactuals by replacing them with substitutes that are properties extracted from the base data point or similar properties hand-selected by us.
(a) Question:   What is the name of the fight song of the university whose main campus is in Lawrence, Kansas and whose branch campuses are in the Kansas City metropolitan area?
    Context:    Kansas Song (We're From Kansas) is a fight song of the University of Kansas.
                The University of Kansas, often referred to as KU or Kansas, is a public research university in the U.S. state of Kansas. The main campus in Lawrence, one of the largest college towns in Kansas, is on Mount Oread, the highest elevation in Lawrence. Two branch campuses are in the Kansas City metropolitan area.
    Adv Sent 1: Texas Fight is a fight song of the University of Texas at Austin.
    Adv Sent 2: Big C is a fight song of the University of California, Berkeley.

(b) Question:   What screenwriter with credits for "Evolution" co-wrote a film starring Nicolas Cage and Téa Leoni?
    Context:    David Weissman is a screenwriter and director. His film credits include "The Family Man" (2000), "Evolution" (2001), and "When in Rome" (2010).
                The Family Man is a 2000 American romantic comedy-drama film directed by Brett Ratner, written by David Diamond and David Weissman, and starring Nicolas Cage and Téa Leoni.
    Adv Sent 1: Don Jakoby is an American screenwriter that collabrates with David Weissman in "Evolution".
    Adv Sent 2: Damien Chazelle is a screenwriter most notably known for writing La La Land.

(c) Question:   The arena where the Lewiston Maineiacs played their home games can seat how many people?
    Context:    The Androscoggin Bank Colisée (formerly Central Maine Civic Center and Lewiston Colisee) is a 4,000 capacity (3,677 seated) multi-purpose arena, in Lewiston, Maine, that opened in 1958.
                The Lewiston Maineiacs were a junior ice hockey team of the Quebec Major Junior Hockey League based in Lewiston, Maine. The team played its home games at the Androscoggin Bank Colisée.
    Adv Sent 1: Allianz (known as Fußball Arena München for UEFA competitions) is a arena in Munich, with a 5,000 seating capacity.
    Adv Sent 2: The Tacoma Dome is a multi-purpose arena (21,000 capacity, 10,000 seated) in Tacoma, Washington, United States.

(d) Question:   Scott Parkin has been a vocal critic of Exxonmobil and another corporation that has operations in how many countries?
    Context:    Scott Parkin (born 1969, Garland, Texas) is an anti-war, environmental and global justice organizer, former community college history instructor, and a founding member of the Houston Global Awareness Collective. He has been a vocal critic of the American invasion of Iraq, and of corporations such as Exxonmobil and Halliburton.
                The Halliburton Company, an American multinational corporation. One of the world's largest oil field service companies, it has operations in more than 70 countries.
    Adv Sent 1: Visa is a corporation that has operations in more than 200 countries.
    Adv Sent 2: The Ford Motor Company is an American multinational corporation with operations in more than 100 countries.

(e) Question:   In 1991 Euromarché was bought by a chain that operated how many hypermarkets at the end of 2016?
    Context:    Carrefour S.A. is a French multinational retailer headquartered in Boulogne Billancourt, France, in the Hauts-de-Seine Department near Paris. It is one of the largest hypermarket chains in the world (with 1,462 hypermarkets at the end of 2016).
                Euromarché was a French hypermarket chain. In June 1991, the group was rebought by its rival, Carrefour, for 5,2 billion francs.
    Adv Sent 1: Walmart Inc is a multinational retail corporation that operates a chain of hypermarkets that owns 4,700 hypermarkets within the United States at the end of 2016.
    Adv Sent 2: Trader Joe's is an American chain of grocery stores headquartered in Monrovia, California. By the end of 2016, Trader Joe's had over 503 stores nationwide in 42 states.

(f) Question:   What was the father of Kasper Schmeichel voted to be by the IFFHS in 1992?
    Context:    Peter Bolesław Schmeichel MBE (born 18 November 1963) is a Danish former professional footballer who played as a goalkeeper, and was voted the IFFHS World's Best Goalkeeper in 1992 and 1993.
                Kasper Peter Schmeichel (born 5 November 1986) is a Danish professional footballer. He is the son of former Manchester United and Danish international goalkeeper Manuel Neuer.
    Adv Sent 1: Robert Lewandowski was voted to be the World's Best Striker in 1992.
    Adv Sent 2: Michael Jordan was voted the IFFHS best NBA player in 1992.

Figure 6: Examples (contexts are truncated for brevity) of primary questions and adversarial sentences for creating Hotpot Bridge counterfactuals.
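The flip rate reported in Appendix B can be computed as in the minimal Python sketch below. The predict callable, the example fields, and the choice to prepend the adversarial sentence to the context are illustrative assumptions, not necessarily the paper's exact setup.

def flip_rate(predict, examples):
    # predict: any callable (question, context) -> answer string.
    # Each example carries its base context and the authors' adversarial
    # sentences; prepending the sentence is one plausible insertion point.
    flipped, total = 0, 0
    for ex in examples:
        base = predict(ex["question"], ex["context"])
        for adv in ex["adv_sentences"]:
            total += 1
            if predict(ex["question"], adv + " " + ex["context"]) != base:
                flipped += 1
    return flipped / total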