Explaining NLP Models via Minimal Contrastive Editing (MICE)
Alexis Ross†   Ana Marasović†‡   Matthew E. Peters†

†Allen Institute for Artificial Intelligence, Seattle, WA, USA
‡Paul G. Allen School of Computer Science and Engineering, University of Washington
{alexisr,anam,matthewp}@allenai.org

Abstract

Humans have been shown to give contrastive explanations, which explain why an observed event happened rather than some other counterfactual event (the contrast case). Despite the influential role that contrastivity plays in how humans explain, this property is largely missing from current methods for explaining NLP models. We present MINIMAL CONTRASTIVE EDITING (MICE), a method for producing contrastive explanations of model predictions in the form of edits to inputs that change model outputs to the contrast case. Our experiments across three tasks—binary sentiment classification, topic classification, and multiple-choice question answering—show that MICE is able to produce edits that are not only contrastive, but also minimal and fluent, consistent with human contrastive edits. We demonstrate how MICE edits can be used for two use cases in NLP system development—debugging incorrect model outputs and uncovering dataset artifacts—and thereby illustrate that producing contrastive explanations is a promising research direction for model interpretability.

[Figure 1: An example MICE edit for a multiple-choice question from the RACE dataset. MICE generates contrastive explanations in the form of edits to inputs that change model predictions to target (contrast) predictions. The edit (bolded in red) is minimal and fluent, and it changes the model's prediction from "by train" to the contrast prediction "on foot" (highlighted in gray).]

1   Introduction

Cognitive science and philosophy research has shown that human explanations are contrastive (Miller, 2019): People explain why an observed event happened rather than some counterfactual event called the contrast case. This contrast case plays a key role in modulating what explanations are given. Consider Figure 1. When we seek an explanation of the model's prediction "by train," we seek it not in absolute terms, but in contrast to another possible prediction (i.e., "on foot"). Additionally, we tailor our explanation to this contrast case. For instance, we might explain why the prediction is "by train" and not "on foot" by saying that the writer discusses meeting Ann at the train station instead of at Ann's home on foot; such information is captured by the edit (bolded red) that results in the new model prediction "on foot." For a different contrast prediction, such as "by car," we would provide a different explanation. In this work, we propose to give contrastive explanations of model predictions in the form of targeted minimal edits, as shown in Figure 1, that cause the model to change its original prediction to the contrast prediction.

Given the key role that contrastivity plays in human explanations, making model explanations contrastive could make them more user-centered and thus more useful for their intended purposes, such as debugging and exposing dataset biases (Ribera and Lapedriza, 2019)—purposes which require that humans work with explanations (Alvarez-Melis et al., 2019). However, many currently popular instance-based explanation methods produce

highlights—segments of input that support a prediction (Zaidan et al., 2007; Lei et al., 2016; Chang et al., 2019; Bastings et al., 2019; Yu et al., 2019; DeYoung et al., 2020; Jain et al., 2020; Belinkov and Glass, 2019) that can be derived through gradients (Simonyan et al., 2014; Smilkov et al., 2017; Sundararajan et al., 2017), approximations with simpler models (Ribeiro et al., 2016), or attention (Wiegreffe and Pinter, 2019; Sun and Marasović, 2021). These methods are not contrastive, as they leave the contrast case undetermined; they do not tell us what would have to be different for a model to have predicted a particular contrast label.[1]

[1] Free-text rationales (Narang et al., 2020) can be contrastive if human justifications are collected by asking "why... instead of...," which is not the case with current benchmarks (Camburu et al., 2018; Rajani et al., 2019; Zellers et al., 2019).

As an alternative approach to NLP model explanation, we introduce MINIMAL CONTRASTIVE EDITING (MICE)—a two-stage approach to generating contrastive explanations in the form of targeted minimal edits (as shown in Figure 1). Given an input, a fixed PREDICTOR model, and a contrast prediction, MICE generates edits to the input that change the PREDICTOR's output from the original prediction to the contrast prediction. We formally define our edits and describe our approach in §2.

We design MICE to produce edits with properties motivated by human contrastive explanations. First, we desire edits to be minimal, altering only small portions of input, a property which has been argued to make explanations more intelligible (Alvarez-Melis et al., 2019; Miller, 2019). Second, MICE edits should be fluent, resulting in text natural for the domain and ensuring that any changes in model predictions are not driven by inputs falling out of the distribution of naturally occurring text. Our experiments (§3) on three English-language datasets, IMDB, NEWSGROUPS, and RACE, validate that MICE edits are indeed contrastive, minimal, and fluent.

We also analyze the quality of MICE edits (§4) and show how they may be used for two use cases in NLP system development. First, we show that MICE edits are comparable in size and fluency to human edits on the IMDB dataset. Next, we illustrate how MICE edits can facilitate debugging individual model predictions. Finally, we show how MICE edits can be used to uncover dataset artifacts learned by a powerful PREDICTOR model.[2]

[2] Our code and trained EDITOR models are publicly available at https://github.com/allenai/mice.

2   MICE: Minimal Contrastive Editing

This section describes our proposed method, MINIMAL CONTRASTIVE EDITING, or MICE, for explaining NLP models with contrastive edits.

2.1   MICE Edits as Contrastive Explanations

Contrastive explanations are answers to questions of the form Why p and not q? They explain why the observed event p happened instead of another event q, called the contrast case.[3] A long line of research in the cognitive sciences and philosophy has found that human explanations are contrastive (Van Fraassen, 1980; Lipton, 1990; Miller, 2019). Human contrastive explanations have several hallmark characteristics. First, they cite contrastive features: features that result in the contrast case when they are changed in a particular way (Chin-Parker and Cantelon, 2017). Second, they are minimal in the sense that they rarely cite the entire causal chain of a particular event, but select just a few relevant causes (Hilton, 2017). In this work, we argue that a minimal edit to a model input that causes the model output to change to the contrast case has both these properties and can function as an effective contrastive explanation. We first give an illustration of contrastive explanations humans might give and then show how minimal contrastive edits offer analogous contrastive information.

[3] Related work also calls it the foil (Miller, 2019).

As an example, suppose we want to explain why the answer to the question "Q: Where can you find a clean pillow case that is not in use?" is "A: the drawer."[4] If someone asks why the answer is not "C1: on the bed," we might explain: "E1: Because only the drawer stores pillow cases that are not in use." However, E1 would not be an explanation of why the answer is not "C2: in the laundry hamper," since both drawers and laundry hampers store pillow cases that are not in use. For contrast case C2, we might instead explain: "E2: Because only laundry hampers store pillow cases that are not clean." We cite different parts of the original question depending on the contrast case.

[4] Inspired by an example in Talmor et al. (2019): Question: "Where would you store a pillow case that is not in use?" Choices: "drawer, kitchen cupboard, bedding store, england."

In this work, we propose to offer contrastive explanations in the form of minimal edits that result in the contrast case as model output. Such edits are effective contrastive explanations because, by construction, they highlight contrastive features.

[Figure 2: An overview of MICE, our two-stage approach to generating edits. In Stage 1 (§2.3), we train the EDITOR to make edits targeting specific predictions from the PREDICTOR. In Stage 2 (§2.4), we make contrastive edits with the EDITOR model from Stage 1 such that the PREDICTOR changes its output to the contrast prediction.]

For example, a contrastive edit of the original question for contrast case C1 would change it to: "Where can you find a clean pillow case that is in use?"; the information provided by this edit—that it is whether or not the pillow case is in use that determines whether the answer is "the drawer" or "on the bed"—is analogous to the information provided by E1. Similarly, a contrastive edit for contrast case C2 that changed the question to "Where can you find a dirty pillow case that is not in use?" provides analogous information to E2.

2.2   Overview of MICE

We define a contrastive edit to be a modification of an input instance that causes a PREDICTOR model (whose behavior is being explained) to change its output from its original prediction for the unedited input to a given target (contrast) prediction. Formally, for textual inputs, given a fixed PREDICTOR f, input x = (x_1, x_2, ..., x_N) of N tokens, original prediction f(x) = y_p, and contrast prediction y_c ≠ y_p, a contrastive edit is a mapping e : (x_1, ..., x_N) → (x'_1, ..., x'_M) such that f(e(x)) = y_c.
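Concretely, this definition is an acceptance test on candidate edits. A minimal sketch in Python, with a hypothetical `predict` function standing in for the fixed PREDICTOR f:

```python
from typing import Callable, List

def is_contrastive_edit(
    predict: Callable[[List[str]], str],  # fixed PREDICTOR f: tokens -> label
    original_tokens: List[str],           # x = (x_1, ..., x_N)
    edited_tokens: List[str],             # e(x) = (x'_1, ..., x'_M)
    contrast_label: str,                  # y_c, which must differ from y_p
) -> bool:
    """Return True iff the edit flips f's output to the contrast label."""
    assert predict(original_tokens) != contrast_label, "y_c must differ from y_p"
    return predict(edited_tokens) == contrast_label
```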
We propose MICE, a two-stage approach to generating contrastive edits, illustrated in Figure 2. In Stage 1, we prepare a highly-contextualized EDITOR model to associate edits with given end-task labels (i.e., labels for the task of the PREDICTOR) such that the contrast label y_c is not ignored in MICE's second stage. Intuitively, we do this by masking the spans of text that are "important" for the given target label (as measured by the PREDICTOR's gradients) and training our EDITOR to reconstruct these spans of text given the masked text and target label as input. In Stage 2 of MICE, we generate contrastive edits e(x) using the EDITOR model from Stage 1. Specifically, we generate candidate edits e(x) by masking different percentages of x and giving masked inputs with the prepended contrast label y_c to the EDITOR; we use binary search to find optimal masking percentages and beam search to keep track of candidate edits that result in the highest probability of the contrast label p(y_c | e(x)) given by the PREDICTOR.

2.3   Stage 1: Fine-tuning the EDITOR

In Stage 1 of MICE, we fine-tune the EDITOR to infill masked spans of text in a targeted manner. Specifically, we fine-tune a pretrained model to infill masked spans given masked text and a target end-task label as input. In this work, we use the TEXT-TO-TEXT TRANSFER TRANSFORMER (T5) model (Raffel et al., 2020) as our pretrained EDITOR, but any model suitable for span infilling can in principle be the EDITOR in MICE. The addition of the target label allows the highly-contextualized EDITOR to condition its predictions on both the masked context and the given target label such that the contrast label is not ignored in Stage 2. What to use as target labels during Stage 1 depends on who the end-users of MICE are. The end-user could be: (1) a model developer who has access to the labeled data used to train the predictor, or (2) lay-users, domain experts, or other developers without access to the labeled data. In the former case, we could use the gold labels as targets, and in the latter case, we could use the labels predicted by the PREDICTOR. Therefore, during fine-tuning, we experiment with using both gold labels and the original predictions

y_p of our PREDICTOR model as target labels. To provide target labels, we prepend them to inputs to the EDITOR. For more information about how these inputs are formatted, see Appendix B. Results in Table 2 show that fine-tuning with target labels results in better edits than fine-tuning without them.

The above procedure allows our EDITOR to condition its infilled spans on both the context and the target label. But this still leaves open the question of where to mask our text. Intuitively, we want to mask the tokens that contribute most to the PREDICTOR's predictions, since these are the tokens that are most strongly associated with the target label. We propose to use gradient attribution (Simonyan et al., 2014) to choose tokens to mask. For each instance, we take the gradient of the predicted logit for the target label with respect to the embedding layers of f and take the ℓ1 norm across the embedding dimension. We then mask the n_1% of tokens with the highest gradient norms. We replace consecutive masked tokens (i.e., spans) with sentinel tokens, following Raffel et al. (2020). Results in Table 1 show that gradient-based masking outperforms random masking.
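To make the masking step concrete, the following is a minimal sketch of gradient-based span masking, assuming a Hugging Face-style classification PREDICTOR that accepts inputs_embeds; function and variable names are ours, and the released implementation may differ:

```python
import torch

def gradient_mask_positions(model, input_ids, target_label_idx, mask_frac):
    """Rank tokens by the l1 norm of d(logit_target)/d(embedding) and
    return the positions of the top mask_frac fraction of tokens."""
    embed_layer = model.get_input_embeddings()
    embeds = embed_layer(input_ids).detach().requires_grad_(True)
    logits = model(inputs_embeds=embeds).logits        # (1, num_labels)
    logits[0, target_label_idx].backward()
    scores = embeds.grad.abs().sum(dim=-1).squeeze(0)  # l1 norm per token
    n_mask = max(1, int(mask_frac * scores.numel()))
    return set(scores.topk(n_mask).indices.tolist())

def apply_sentinel_masks(tokens, mask_positions):
    """Replace each run of consecutive masked tokens (a span) with one
    T5 sentinel token <extra_id_0>, <extra_id_1>, ... (Raffel et al., 2020)."""
    out, sentinel, i = [], 0, 0
    while i < len(tokens):
        if i in mask_positions:
            out.append(f"<extra_id_{sentinel}>")
            sentinel += 1
            while i in mask_positions:  # swallow the whole masked span
                i += 1
        else:
            out.append(tokens[i])
            i += 1
    return out
```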
2.4   Stage 2: Making Edits with the EDITOR

In the second stage of our approach, we use our fine-tuned EDITOR to make edits using beam search (Reddy, 1977). In each round of edits, we mask consecutive spans of n_2% of tokens in the original input, prepend the contrast prediction to the masked input, and feed the resulting masked instance to the EDITOR; the EDITOR then generates m edits. The masking procedure during this stage is gradient-based, as in Stage 1.

In one round of edits, we conduct a binary search with s levels over values of n_2 between n_2 = 0% and n_2 = 55% to efficiently find a value of n_2 that is large enough to result in the contrast prediction while also modifying only minimal parts of the input. After each round of edits, we get f's predictions on the edited inputs, order them by contrast prediction probabilities, and update the beam to store the top b edited instances. As soon as an edit e* = e(x) is found that results in the contrast prediction, i.e., f(e*) = y_c, we stop the search procedure and return this edit. For generation, we use a combination of top-k (Fan et al., 2018) and top-p (nucleus) sampling (Holtzman et al., 2020).[5]

[5] We use this combination because we observed in preliminary experiments that it led to good results.
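The Stage 2 search can be summarized in a short sketch. The helpers predict_probs (the PREDICTOR's label distribution) and make_candidates (gradient-based masking of a fraction of tokens followed by m sampled infillings from the EDITOR) are hypothetical stand-ins for the components described above:

```python
def stage2_edit_search(x, y_c, predict_probs, make_candidates,
                       b=3, s=4, max_rounds=3, max_frac=0.55):
    """Sketch of MICE Stage 2. Hypothetical helpers:
    predict_probs(text) -> {label: probability} from the PREDICTOR f;
    make_candidates(text, frac, y_c) -> m EDITOR infillings of `text`
    with `frac` of its tokens masked (gradient-based, as in Stage 1)."""
    beam = [x]
    for _ in range(max_rounds):
        pool, flipped = [], []
        for text in beam:
            lo, hi = 0.0, max_frac
            for _ in range(s):                 # binary search over n_2
                frac = (lo + hi) / 2.0
                found = False
                for cand in make_candidates(text, frac, y_c):
                    probs = predict_probs(cand)
                    pool.append((probs[y_c], cand))
                    if max(probs, key=probs.get) == y_c:
                        found = True           # f(e(x)) = y_c
                        flipped.append((frac, cand))
                # flip found: try a smaller mask; otherwise mask more
                hi, lo = (frac, lo) if found else (hi, frac)
        if flipped:
            # stop as soon as a round yields a prediction-flipping edit,
            # preferring the smallest masking percentage tried
            return min(flipped)[1]
        pool.sort(key=lambda pc: pc[0], reverse=True)
        beam = [cand for _, cand in pool[:b]]  # top-b by p(y_c | e(x))
    return None                                # no contrastive edit found
```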
3   Evaluation

This section presents empirical findings that MICE produces minimal and fluent contrastive edits.

3.1   Experimental Setup

Tasks   We evaluate MICE on three English-language datasets: IMDB, a binary sentiment classification task (Maas et al., 2011), a 6-class version of the 20 NEWSGROUPS topic classification task (Lang, 1995), and RACE, a multiple-choice question-answering task (Lai et al., 2017).[6]

[6] We create this 6-class version by mapping the 20 existing subcategories to their respective larger categories—i.e., "talk.politics.guns" and "talk.religion.misc" → "talk." We do this in order to make the label space smaller. The resulting classes are: alt, comp, misc, rec, sci, and talk.

PREDICTORS   MICE can be used to make contrastive edits for any differentiable PREDICTOR model, i.e., any end-to-end neural model. In this paper, for each task, we train a PREDICTOR model f built on ROBERTA-LARGE (Liu et al., 2019) and fix it during evaluation. The test accuracies of our PREDICTORS are 95.9%, 85.3%, and 84% for IMDB, NEWSGROUPS, and RACE, respectively. For training details, see Appendix A.1.

EDITORS   Our EDITORS build on the base version of T5. For fine-tuning our EDITORS (Stage 1), we use the original training data used to train the PREDICTORS. We randomly split the data 75%/25% for fine-tuning/validation and fine-tune until the validation loss stops decreasing (for a max of 10 epochs) with n_1% of tokens masked, where n_1 is a randomly chosen value in [20, 55]. For more details, see Appendix A.2. In Stage 2, for each instance, we set the label with the second-highest predicted probability as the contrast prediction. We set beam width b = 3, consider s = 4 search levels during binary search over n_2 in each edit round, and run our search for a max of 3 edit rounds. For each n_2, we sample m = 15 generations from our fine-tuned EDITORS with p = 0.95, k = 30.[7]

[7] We tune these hyperparameters on a 50-instance subset of the IMDB validation set prior to evaluation. We note that for larger values of n_2, the generations produced by the T5 EDITORS sometimes degenerate; see Appendix C for details.
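These decoding hyperparameters map directly onto a standard Hugging Face generate call. A sketch (the checkpoint and input format are illustrative; see Appendix B of the paper for the exact format used):

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-base")
editor = T5ForConditionalGeneration.from_pretrained("t5-base")  # fine-tuned EDITOR in practice

# Masked input with the contrast label prepended (format illustrative).
masked = "label: negative. input: A fun, stylized movie with <extra_id_0> moments."
batch = tokenizer(masked, return_tensors="pt")

# m = 15 samples with combined top-k / top-p (nucleus) sampling, as in Stage 2.
generations = editor.generate(
    **batch, do_sample=True, top_k=30, top_p=0.95,
    num_return_sequences=15, max_new_tokens=64,
)
candidates = tokenizer.batch_decode(generations, skip_special_tokens=False)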

                                                          3843
                  IMDB                            NEWSGROUPS                      RACE
MICE VARIANT      Flip Rate↑  Minim.↓  Fluen.≈1   Flip Rate↑  Minim.↓  Fluen.≈1   Flip Rate↑  Minim.↓  Fluen.≈1
*PRED + GRAD      1.000       0.173    0.981      0.992       0.261    0.968      0.915       0.331    0.981
*GOLD + GRAD      1.000       0.185    0.979      0.992       0.271    0.966      0.945       0.335    0.979
PRED + RAND       1.000       0.257    0.958      0.968       0.378    0.928      0.799       0.440    0.953
GOLD + RAND       1.000       0.302    0.952      0.965       0.370    0.929      0.801       0.440    0.955
NO-FINETUNE       0.995       0.360    0.960      0.941       0.418    0.938      –           –        –

Table 1: Efficacy of the MICE procedure. We evaluate MICE edits on three metrics (described in §3.1): flip rate (↑ higher is better), minimality (↓ lower is better), and fluency (≈1 closer to 1 is better). We report mean values for minimality and fluency. * marks full MICE variants; others explore ablations. For each property (i.e., column), the best value across MICE variants is bolded. We experiment with the PREDICTOR's predictions (PRED) and gold labels (GOLD) as target labels during Stage 1. Across datasets, our GRAD MICE procedure achieves a high flip rate with small and fluent edits.
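The minimality and fluency metrics reported in Table 1 (defined under Metrics in §3.1 below) are straightforward to reproduce. A sketch assuming Hugging Face transformers, with a deliberately simple, slow one-token-at-a-time masking loop for the fluency score:

```python
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

def minimality(original: str, edited: str) -> float:
    """Word-level Levenshtein distance, normalized by the original length."""
    a, b = original.split(), edited.split()
    prev = list(range(len(b) + 1))
    for i, wa in enumerate(a, 1):
        curr = [i]
        for j, wb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (wa != wb)))   # substitution
        prev = curr
    return prev[-1] / max(len(a), 1)

def masked_lm_loss(model, tokenizer, text: str) -> float:
    """Average T5 loss over N copies of the input, each with one token
    masked (Salazar et al., 2020); a simplified reference version."""
    ids = tokenizer(text, return_tensors="pt").input_ids[0]
    mask_id = tokenizer.convert_tokens_to_ids("<extra_id_0>")
    end_id = tokenizer.convert_tokens_to_ids("<extra_id_1>")
    losses = []
    for i in range(len(ids) - 1):                        # skip the final </s>
        inp = ids.clone()
        target = inp[i].item()
        inp[i] = mask_id
        labels = torch.tensor([[mask_id, target, end_id]])
        with torch.no_grad():
            losses.append(model(input_ids=inp.unsqueeze(0),
                                labels=labels).loss.item())
    return sum(losses) / len(losses)

def fluency(model, tokenizer, original: str, edited: str) -> float:
    """Ratio of edited to original masked LM loss; 1.0 means no change."""
    return (masked_lm_loss(model, tokenizer, edited)
            / masked_lm_loss(model, tokenizer, original))
```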

Metrics   We evaluate MICE on the test sets of the three datasets. The RACE and NEWSGROUPS test sets contain 4,934 and 7,307 instances, respectively.[8] For IMDB, we randomly sample 5K of the 25K instances in the test set for evaluation because of the computational demands of evaluation.[9]

[8] For the NEWSGROUPS test set, there are 7,307 instances remaining after filtering out empty strings.

[9] A single contrastive edit is expensive and takes an average of ≈ 15 seconds per IMDB instance (≈ 230 tokens). Calculating the fluency metric adds an additional average of ≈ 16.5 seconds per IMDB instance. For more details, see Section 5.

For each dataset, we measure the following three properties: (1) flip rate: the proportion of instances for which an edit results in the contrast label; (2) minimality: the "size" of the edit as measured by the word-level Levenshtein distance between the original and edited input, which is the minimum number of deletions, insertions, or substitutions required to transform one into the other. We report a normalized version of this metric with a range from 0 to 1—the Levenshtein distance divided by the number of words in the original input; (3) fluency: a measure of how similarly distributed the edited output is to the original data. We evaluate fluency by comparing masked language modeling loss on both the original and edited inputs using a pretrained model. Specifically, given the original N-length sequence, we create N copies, each with a different token replaced by a mask token, following Salazar et al. (2020). We then take a pretrained T5-BASE model and compute the average loss across these N copies. We compute this loss value for both the original input and the edited input and report their ratio—i.e., edited / original. We aim for a value of 1.0, which indicates equivalent losses for the original and edited texts. When MICE finds multiple edits, we report metrics for the edit with the smallest value for minimality.

3.2   Results

Results are shown in Table 1. Our proposed GRAD MICE procedure (upper part of Table 1) achieves a high flip rate across all three tasks. This is the outcome regardless of whether predicted target labels (first row, 91.5–100% flip rate) or gold target labels (second row, 94.5–100% flip rate) are used for fine-tuning in Stage 1. We observe a slight improvement from using the gold labels for the RACE PREDICTOR, which may be explained by the fact that it is less accurate (with a training accuracy of 89.9%) than the IMDB and NEWSGROUPS classifiers.

MICE achieves a high flip rate while its edits remain small and result in fluent text. In particular, MICE on average changes 17.3–33.1% of the original tokens when predicted labels are used in Stage 1 and 18.5–33.5% with gold labels. Fluency is close to 1.0, indicating no notable change in masked language modeling loss after the edit—i.e., edits fall in the distribution of the original data. We achieve the best results across metrics on the IMDB dataset, as expected since IMDB is a binary classification task with a small label space. These results demonstrate that MICE presents a promising research direction for the generation of contrastive explanations; however, there is still room for improvement, especially for more challenging tasks such as RACE.

In the rest of this section, we provide results from several ablation experiments.

Fine-tuning vs. No Fine-tuning   We investigate the effect of fine-tuning (Stage 1) with a baseline that skips Stage 1 altogether. For this NO-FINETUNE baseline variant of MICE, we use the vanilla pretrained T5-BASE as our EDITOR. As shown in Table 1, the NO-FINETUNE variant underperforms all other (two-stage) variants of MICE for the IMDB and NEWSGROUPS datasets.[10] Fine-tuning particularly improves the minimality of edits, while leaving the flip rate high. We hypothesize that this effect is due to the effectiveness of Stage 2 of MICE at finding contrastive edits: Because we iteratively generate many candidate edits using beam search, we are likely to find a prediction-flipping edit. Fine-tuning allows us to find such an edit at a lower masking percentage.

[10] We leave RACE out from our evaluation with the NO-FINETUNE baseline because we observe that the pretrained T5 model does not generate text formatted as span infills; we hypothesize that this model has not been trained to generate infills for masked inputs formatted as multiple-choice inputs.

IMDB Condition
Stage 1     Stage 2     Flip Rate↑   Minim.↓   Fluen.≈1
No Label    No Label    0.994        0.369     0.966
No Label    Label       0.997        0.362     0.967
Label       No Label    0.999        0.327     0.968
Label       Label       1.000        0.173     0.981

Table 2: Effect of using target end-task labels during the two stages of PRED+GRAD MICE on the IMDB dataset. When end-task labels are provided, they are original PREDICTOR labels during Stage 1 and contrast labels during Stage 2. The best values for each property (column) are bolded. Using end-task labels during both Stage 1 (EDITOR fine-tuning) and Stage 2 (making edits) of MICE outperforms all other conditions.

Gradient vs. Random Masking   We study the impact of using gradient-based masking in Stage 1 of the MICE procedure with a RAND variant, which masks spans of randomly chosen tokens. As shown in the middle part of Table 1, gradient-based masking outperforms random masking when using both predicted and gold labels across all three tasks and metrics, suggesting that the gradient-based attribution used to mask text during Stage 1 of MICE is an important part of the procedure. The differences are especially notable for RACE, which is the most challenging task according to our metrics.

Targeted vs. Un-targeted Infilling   We investigate the effect of using target labels in both stages of MICE by experimenting with removing target labels during Stage 1 (EDITOR fine-tuning) and Stage 2 (making edits). As shown in Table 2, we observe that giving target labels to our EDITORS during both stages of MICE improves edit quality. Fine-tuning EDITORS without labels in Stage 1 ("No Label") leads to worse flip rate, minimality, and fluency than does fine-tuning EDITORS with labels ("Label"). Minimality is particularly affected, and we hypothesize that using target end-task labels in both stages provides signal that allows the EDITOR in Stage 2 to generate prediction-flipping edits at lower masking percentages.

4   Analysis of Edits

In this section, we compare MICE edits with human contrastive edits. Then, we turn to a key motivation for this work: the potential for contrastive explanations to assist in NLP system development. We show how MICE edits can be used to debug incorrect predictions and uncover dataset artifacts.

4.1   Comparison with Human Edits

We ask whether the contrastive edits produced by MICE are minimal and fluent in a meaningful sense. In particular, we compare these two metrics for MICE edits and human contrastive edits. We work with the IMDB contrast set created by Gardner et al. (2020), which consists of original test inputs and human-edited inputs that cause a change in true label. We report metrics on the subset of this contrast set for which the human-edited inputs result in a change in model prediction for our IMDB PREDICTOR; this subset consists of 76 instances. The flip rate of MICE edits on this subset is 100%. The mean minimality values of human and MICE edits are 0.149 (human) and 0.179 (MICE), and the mean fluency values are 1.01 (human) and 0.949 (MICE). The similarity of these values suggests that MICE edits are comparable to human contrastive edits along these dimensions.

We also ask to what extent human edits overlap with MICE edits. For each input, we compute the overlap between the original tokens changed by humans and the original tokens edited by MICE. The mean number of overlapping tokens, normalized by the number of original tokens edited by humans, is 0.298. Thus, while there is some overlap between MICE and human contrastive edits, they generally change different parts of the text.[11] This analysis suggests that there may exist multiple informative contrastive edits for a single input. Future work can investigate and compare the different kinds of insight that can be obtained through human and model-driven contrastive edits.

[11] MICE edits explain PREDICTORS' behavior and therefore need not be similar to human edits, which are designed to change gold labels.
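A set-based approximation of this overlap statistic (the paper's exact token alignment may differ):

```python
def edit_overlap(original: str, human_edited: str, mice_edited: str) -> float:
    """Fraction of the original tokens changed by the human edit that are
    also changed by the MICE edit. Changed tokens are approximated as
    original tokens absent from the edited text."""
    orig_tokens = original.split()
    human_changed = {w for w in orig_tokens if w not in set(human_edited.split())}
    mice_changed = {w for w in orig_tokens if w not in set(mice_edited.split())}
    if not human_changed:
        return 0.0
    return len(human_changed & mice_changed) / len(human_changed)
```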
4.2   Use Case 1: Debugging Incorrect Outputs

Here, we illustrate how MICE edits can be used to debug incorrect model outputs.

IMDB    Original pred y_p = positive    Contrast pred y_c = negative

An interesting pairing of stories, this little flick manages to bring together seemingly different characters and story lines all in the backdrop of WWII and succeeds in tying them together without losing the audience. I was impressed by the depth portrayed by the different characters and also by how much I really felt I understood them and their motivations, even though the time spent on the development of each character was very limited. The outstanding acting abilities of the individuals involved with this picture are easily noted. A fun, stylized movie with a slew of comic moments and a bunch more head shaking events. ~~7/10~~ **4/10**

RACE    Question: Mark went up in George's plane ___.
(a) twice   (b) only once   (c) several times   (d) once or twice
Original pred y_p = (a) twice    Contrast pred y_c = (b) only once

When George was thirty-five, he bought a small plane and learned to fly it. He soon became very good and made his plane do all kinds of tricks. George had a friend, whose name was Mark. One day George offered to take Mark up in his plane. Mark thought, "I've traveled in a big plane several times, but I've never been in a small one, so I'll go." They went up, and George flew around for half an hour and did all kinds of tricks in the air. When they came down again, Mark was glad to be back safely, and he said to his friend in a shaking voice, "Well, George, thank you very much for those two ~~trips~~ **tricks** in your plane." George was very surprised and said, "Two ~~trips?~~ **tricks.**" "Yes, that's my first and my last time, George," ~~answered~~ **said** Mark.

Table 3: Examples of edits produced by MICE. Insertions are marked in bold (**...**); deletions are struck through (~~...~~). y_p is the PREDICTOR's original prediction, and y_c the contrast prediction. True labels for original inputs are underlined in the original figure; here, the true labels are "positive" (IMDB) and "(b) only once" (RACE).

Consider the RACE input in Table 3, for which the RACE PREDICTOR gives an incorrect prediction. In this case, a model developer may want to understand why the model got the answer wrong. This setting naturally gives rise to a contrastive question, i.e., Why did the model predict the wrong choice ("twice") instead of the correct one ("only once")?

The MICE edit shown offers insight into this question: Firstly, it highlights which part of the paragraph has an influence on the model prediction—the last few sentences. Secondly, it reveals that a source of confusion is Mark's joke about having traveled in George's plane twice, as changing Mark's dialogue from talking about a "first and...last" trip to a single trip results in a correct model prediction.

MICE edits can also be used to debug model capabilities by offering hypotheses about "bugs" present in models: For instance, the edit in Table 3 might prompt a developer to investigate whether this PREDICTOR lacks non-literal language understanding capabilities. In the next section, we show how insight from individual MICE edits can be used to uncover a bug in the form of a dataset-level artifact learned by a model. In Appendix D, we further analyze the debugging utility of MICE edits with a PREDICTOR designed to contain a bug.

4.3   Use Case 2: Uncovering Dataset Artifacts

Manual inspection of some edits for IMDB suggests that the IMDB PREDICTOR has learned to rely heavily on numerical ratings. For instance, in the IMDB example in Table 3, the MICE edit results in a negative prediction from the PREDICTOR even though the edited text is overwhelmingly positive. We test this hypothesis by investigating whether numerical tokens are more likely to be edited by MICE.

We analyze the edits produced by MICE (GOLD + GRAD) described in §3.1. We limit our analysis to a subset of the 5K instances for which the edit produced by MICE has a minimality value below 0.05, as we are interested in finding simple artifacts driving the predictions of the IMDB PREDICTOR; this subset has 902 instances. We compute three metrics for each unique token, i.e., type t:

    p(t)   = #occurrences(t) / #all_tokens,
    p_r(t) = #removals(t) / #all_removals,
    p_i(t) = #insertions(t) / #all_insertions,

and report the tokens with the highest values for the ratios p_r(t)/p(t) and p_i(t)/p(t). Intuitively, these tokens are removed/inserted at a higher rate than expected given the frequency with which they appear in the original IMDB inputs. We exclude tokens that occur

              y_c = positive               y_c = negative
         Removed       Inserted        Removed      Inserted
         4/10          excellent       10/10        awful
         ridiculous    enjoy           8/10         disappointed
         horrible      amazing         7/10         1
         4             entertaining    9            4
         predictable   10              enjoyable    annoying

Table 4: Top 5 IMDB tokens edited by MICE at a higher rate than expected given their original frequency (§4.3). Results are separated by contrast predictions.
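Given the (original, edited) pairs, these statistics reduce to a few token counts. A sketch that approximates removals and insertions with token multiset differences:

```python
from collections import Counter

def artifact_token_ratios(originals, editeds, min_count=1):
    """Compute p(t), p_r(t), p_i(t) over a corpus of (original, edited)
    pairs and rank tokens by p_r(t)/p(t) and p_i(t)/p(t)."""
    occ, rem, ins = Counter(), Counter(), Counter()
    for orig, edit in zip(originals, editeds):
        o, e = Counter(orig.split()), Counter(edit.split())
        occ.update(o)
        rem.update(o - e)   # tokens removed by the edit
        ins.update(e - o)   # tokens inserted by the edit
    n_tok, n_rem, n_ins = sum(occ.values()), sum(rem.values()), sum(ins.values())

    def ranked(counts, total):
        return sorted(
            ((t, (c / total) / (occ[t] / n_tok))
             for t, c in counts.items() if occ[t] >= min_count),
            key=lambda tc: tc[1], reverse=True)

    return ranked(rem, n_rem), ranked(ins, n_ins)
```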
controlled text generation methods to generate targeted counterfactuals and explores their use as test cases and augmented examples in the context of classification. Another concurrent work by Wu et al. (2021) presents POLYJUICE, a general-purpose, untargeted counterfactual generator. Very recent work by Sha et al. (2021), introduced after the submission of MICE, proposes a method for targeted contrastive editing for Q&A that selects answer-related tokens, masks them, and generates new tokens. Our work differs from these works in our novel framework for efficiently finding minimal edits (MICE Stage 2) and our use of edits as explanations.

Connection to Adversarial Examples   Adversarial examples are minimally edited inputs that cause models to incorrectly change their predictions despite no change in true label (Jia and Liang, 2017; Ebrahimi et al., 2018; Pal and Tople, 2020). Recent methods for generating adversarial examples also preserve fluency (Zhang et al., 2019; Li et al., 2020b; Song et al., 2020);[15] however, adversarial examples are designed to find erroneous changes in model outputs, while contrastive edits place no such constraint on model correctness. Thus, current approaches to generating adversarial examples, which can exploit semantics-preserving operations (Ribeiro et al., 2018) such as paraphrasing (Iyyer et al., 2018) or word replacement (Alzantot et al., 2018; Ren et al., 2019; Garg and Ramakrishnan, 2020), cannot be used to generate contrastive edits.

[15] Song et al. (2020) propose a method to produce fluent semantic collisions, which they call the "inverse" of adversarial examples.

Connection to Style Transfer   The goal of style transfer is to generate minimal edits to inputs that result in a target style (sentiment, formality, etc.) (Fu et al., 2018; Li et al., 2018; Goyal et al., 2020). Most existing approaches train an encoder to learn a style-agnostic latent representation of inputs and train attribute-specific decoders to generate text reflecting the content of inputs but exhibiting a different target attribute (Fu et al., 2018; Li et al., 2018; Goyal et al., 2020). Recent works by Wu et al. (2019) and Malmi et al. (2020) adopt two-stage approaches that first identify where to make edits and then make them using pretrained language models. Such approaches can only be applied to generate contrastive edits for classification tasks with well-defined "styles," which excludes more complex tasks such as question answering.

7   Conclusion

We argue that contrastive edits, which change the output of a PREDICTOR to a given contrast prediction, are effective explanations of neural NLP models. We propose MINIMAL CONTRASTIVE EDITING (MICE), a method for generating such edits. We introduce evaluation criteria for contrastive edits that are motivated by human contrastive explanations—minimality and fluency—and show that MICE edits for the IMDB, NEWSGROUPS, and RACE datasets are contrastive, fluent, and minimal. Through qualitative analysis of MICE edits, we show that they have utility for robust and reliable NLP system development.

8   Broader Impact Statement

MICE is intended to aid the interpretation of NLP models. As a model-agnostic explanation method, it has the potential to impact NLP system development across a wide range of models and tasks. In particular, MICE edits can benefit NLP model developers by facilitating debugging and exposing dataset artifacts, as discussed in §4. As a consequence, they can also benefit downstream users of NLP models by facilitating access to less biased and more robust systems.

While the focus of our work is on interpreting NLP models, there are potential misuses of MICE that involve other applications. Firstly, malicious actors might employ MICE to generate adversarial examples; for instance, they may aim to generate hate speech that is minimally edited such that it fools a toxic language classifier. Secondly, naively applying MICE for data augmentation could plausibly lead to less robust and more biased models: Because MICE edits are intended to expose issues in models, straightforwardly using them as additional training examples could reinforce existing artifacts and biases present in data. To mitigate this risk, we encourage researchers exploring data augmentation to carefully think about how to select and label edited instances.

We also encourage researchers to develop more efficient methods of generating minimal contrastive edits. As discussed in §5, a limitation of MICE is its computational demand. Therefore, we recommend that future work focus on creating methods that require less compute.

References

David Alvarez-Melis, Hal Daumé III, Jennifer Wortman Vaughan, and Hanna Wallach. 2019. Weight of evidence as a basis for human-oriented explanations. In Workshop on Human-Centric Machine Learning at the 33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Moustafa Alzantot, Yash Sharma, Ahmed Elgohary, Bo-Jhang Ho, Mani Srivastava, and Kai-Wei Chang. 2018. Generating natural language adversarial examples. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2890–2896, Brussels, Belgium. Association for Computational Linguistics.

Jasmijn Bastings, Wilker Aziz, and Ivan Titov. 2019. Interpretable neural predictions with differentiable binary variables. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2963–2977, Florence, Italy. Association for Computational Linguistics.

Yonatan Belinkov and James Glass. 2019. Analysis methods in neural language processing: A survey. Transactions of the Association for Computational Linguistics, 7:49–72.

Oana-Maria Camburu, Tim Rocktäschel, Thomas Lukasiewicz, and Phil Blunsom. 2018. e-SNLI: Natural language inference with natural language explanations. In Advances in Neural Information Processing Systems, volume 31, pages 9539–9549. Curran Associates, Inc.

Shiyu Chang, Yang Zhang, Mo Yu, and Tommi Jaakkola. 2019. A game theoretic approach to class-wise selective rationalization. In Advances in Neural Information Processing Systems, volume 32, pages 10055–10065. Curran Associates, Inc.

Seth Chin-Parker and Julie A. Cantelon. 2017. Contrastive constraints guide explanation-based category learning. Cognitive Science, 41(6):1645–1655.

Susanne Dandl, Christoph Molnar, Martin Binder, and Bernd Bischl. 2020. Multi-objective counterfactual explanations. In Parallel Problem Solving from Nature – PPSN XVI, pages 448–469, Cham. Springer International Publishing.

Jay DeYoung, Sarthak Jain, Nazneen Fatema Rajani, Eric Lehman, Caiming Xiong, Richard Socher, and Byron C. Wallace. 2020. ERASER: A benchmark to evaluate rationalized NLP models. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4443–4458. Association for Computational Linguistics.

Javid Ebrahimi, Anyi Rao, Daniel Lowd, and Dejing Dou. 2018. HotFlip: White-box adversarial examples for text classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 31–36, Melbourne, Australia. Association for Computational Linguistics.

Angela Fan, Mike Lewis, and Yann Dauphin. 2018. Hierarchical neural story generation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 889–898, Melbourne, Australia. Association for Computational Linguistics.

Amir Feder, Nadav Oved, Uri Shalit, and Roi Reichart. 2021. CausaLM: Causal model explanation through counterfactual language models. Computational Linguistics, pages 1–54.

Zhenxin Fu, Xiaoye Tan, Nanyun Peng, Dongyan Zhao, and Rui Yan. 2018. Style transfer in text: Exploration and evaluation. In AAAI Conference on Artificial Intelligence.

Matt Gardner, Yoav Artzi, Victoria Basmov, Jonathan Berant, Ben Bogin, Sihao Chen, Pradeep Dasigi, Dheeru Dua, Yanai Elazar, Ananth Gottumukkala, Nitish Gupta, Hannaneh Hajishirzi, Gabriel Ilharco, Daniel Khashabi, Kevin Lin, Jiangming Liu, Nelson F. Liu, Phoebe Mulcaire, Qiang Ning, Sameer Singh, Noah A. Smith, Sanjay Subramanian, Reut Tsarfaty, Eric Wallace, Ally Zhang, and Ben Zhou. 2020. Evaluating models' local decision boundaries via contrast sets. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 1307–1323. Association for Computational Linguistics.

Matt Gardner, Joel Grus, Mark Neumann, Oyvind Tafjord, Pradeep Dasigi, Nelson F. Liu, Matthew Peters, Michael Schmitz, and Luke S. Zettlemoyer. 2017. AllenNLP: A deep semantic natural language processing platform.

Siddhant Garg and Goutham Ramakrishnan. 2020. BAE: BERT-based adversarial examples for text classification. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6174–6181. Association for Computational Linguistics.

Navita Goyal, Balaji Vasan Srinivasan, N. Anandhavelu, and Abhilasha Sancheti. 2020. Multi-dimensional style transfer for partially annotated data using language models as discriminators. ArXiv, arXiv:2010.11578.

Yash Goyal, Ziyan Wu, Jan Ernst, Dhruv Batra, Devi Parikh, and Stefan Lee. 2019. Counterfactual visual explanations. In Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 2376–2384, Long Beach, California, USA. PMLR.

Lisa Anne Hendricks, Ronghang Hu, Trevor Darrell, and Zeynep Akata. 2018. Grounding visual explanations. In Computer Vision – ECCV 2018, pages 269–286, Cham. Springer International Publishing.

Denis Hilton. 2017. Social Attribution and Explanation. Oxford University Press.

Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. 2020. The curious case of neural text degeneration. In International Conference on Learning Representations.

Mohit Iyyer, John Wieting, Kevin Gimpel, and Luke Zettlemoyer. 2018. Adversarial example generation with syntactically controlled paraphrase networks. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1875–1885, New Orleans, Louisiana. Association for Computational Linguistics.

Alon Jacovi and Yoav Goldberg. 2020. Aligning faithful interpretations with their social attribution. ArXiv, arXiv:2006.01067.

Alon Jacovi, Swabha Swayamdipta, Shauli Ravfogel, Yanai Elazar, Yejin Choi, and Yoav Goldberg. 2021. Contrastive explanations for model interpretability. ArXiv, arXiv:2103.01378.

Sarthak Jain, Sarah Wiegreffe, Yuval Pinter, and Byron C. Wallace. 2020. Learning to faithfully rationalize by construction. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4459–4473. Association for Computational Linguistics.

Robin Jia and Percy Liang. 2017. Adversarial examples for evaluating reading comprehension systems. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2021–2031, Copenhagen, Denmark. Association for Computational Linguistics.

Amir-Hossein Karimi, Gilles Barthe, Borja Balle, and Isabel Valera. 2020. Model-agnostic counterfactual explanations for consequential decisions. In Proceedings of the 23rd International Conference on Artificial Intelligence and Statistics (AISTATS).

Divyansh Kaushik, Eduard Hovy, and Zachary Lipton. 2020. Learning the difference that makes a difference with counterfactually-augmented data. In International Conference on Learning Representations.

Guokun Lai, Qizhe Xie, Hanxiao Liu, Yiming Yang, and Eduard Hovy. 2017. RACE: Large-scale ReAding comprehension dataset from examinations. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 785–794, Copenhagen, Denmark. Association for Computational Linguistics.

Ken Lang. 1995. NewsWeeder: Learning to filter netnews. In Proceedings of the Twelfth International Conference on Machine Learning, pages 331–339.

Tao Lei, Regina Barzilay, and Tommi Jaakkola. 2016. Rationalizing neural predictions. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 107–117, Austin, Texas. Association for Computational Linguistics.

Chuanrong Li, Lin Shengshuo, Zeyu Liu, Xinyi Wu, Xuhui Zhou, and Shane Steinert-Threlkeld. 2020a. Linguistically-informed transformations (LIT): A method for automatically generating contrast sets. In Proceedings of the Third BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP, pages 126–135, Online. Association for Computational Linguistics.

Juncen Li, Robin Jia, He He, and Percy Liang. 2018. Delete, retrieve, generate: a simple approach to sentiment and style transfer. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1865–1874, New Orleans, Louisiana. Association for Computational Linguistics.

Linyang Li, Ruotian Ma, Qipeng Guo, Xiangyang Xue, and Xipeng Qiu. 2020b. BERT-ATTACK: Adversarial attack against BERT using BERT. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6193–6202. Association for Computational Linguistics.

Weixin Liang, James Zou, and Zhou Yu. 2020. ALICE: Active learning with contrastive natural language explanations. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 4380–4391, Online. Association for Computational Linguistics.

Peter Lipton. 1990. Contrastive explanation. Royal Institute of Philosophy Supplement, 27:247–266.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. ArXiv, arXiv:1907.11692.

Arnaud Van Looveren and Janis Klaise. 2019. Interpretable counterfactual explanations guided by prototypes. ArXiv, arXiv:1907.02584.

Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. 2011. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 142–150, Portland, Oregon, USA. Association for Computational Linguistics.

Nishtha Madaan, Inkit Padhi, Naveen Panwar, and Diptikalyan Saha. 2021. Generate your counterfactuals: Towards controlled counterfactual generation for text. In Proceedings of the AAAI Conference on Artificial Intelligence.

Eric Malmi, Aliaksei Severyn, and Sascha Rothe. 2020. Unsupervised text style transfer with padded masked language models. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 8671–8680. Association for Computational Linguistics.

Tim Miller. 2019. Explanation in artificial intelligence: Insights from the social sciences. Artificial Intelligence, 267:1–38.

Sharan Narang, Colin Raffel, Katherine Lee, Adam Roberts, Noah Fiedel, and Karishma Malkan. 2020. WT5?! Training text-to-text models to explain their predictions. arXiv:2004.14546.

Bijeeta Pal and Shruti Tople. 2020. To transfer or not to transfer: Misclassification attacks against transfer learned text classifiers. arXiv:2001.02438.

Judea Pearl. 2009. Causality: Models, Reasoning and Inference, 2nd edition. Cambridge University Press, USA.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67.

Nazneen Fatema Rajani, Bryan McCann, Caiming Xiong, and Richard Socher. 2019. Explain yourself! Leveraging language models for commonsense reasoning. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4932–4942, Florence, Italy. Association for Computational Linguistics.

D. Raj Reddy. 1977. Speech understanding systems: A summary of results of the five-year research effort.

Shuhuai Ren, Yihe Deng, Kun He, and Wanxiang Che. 2019. Generating natural language adversarial examples through probability weighted word saliency. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1085–1097, Florence, Italy. Association for Computational Linguistics.

Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. 2016. "Why should I trust you?": Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.

Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. 2018. Semantically equivalent adversarial rules for debugging NLP models. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 856–865, Melbourne, Australia. Association for Computational Linguistics.

Mireia Ribera and Àgata Lapedriza. 2019. Can we do better explanations? A proposal of user-centered explainable AI. In ACM IUI Workshop.

Chris Russell. 2019. Efficient search for diverse coherent explanations. In Proceedings of the Conference on Fairness, Accountability, and Transparency.

Julian Salazar, Davis Liang, Toan Q. Nguyen, and Katrin Kirchhoff. 2020. Masked language model scoring. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 2699–2712. Association for Computational Linguistics.

Lei Sha, Patrick Hohenecker, and Thomas Lukasiewicz. 2021. Controlling text edition by changing answers of specific questions. arXiv:2105.11018.

Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. 2014. Deep inside convolutional networks: Visualising image classification models and saliency maps. In 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14–16, 2014, Workshop Track Proceedings.

Jacob Sippy, Gagan Bansal, and Daniel S. Weld. 2020. Data staining: A method for comparing faithfulness of explainers. In 2020 ICML Workshop on Human Interpretability in Machine Learning (WHI 2020).

Daniel Smilkov, Nikhil Thorat, Been Kim, Fernanda Viégas, and Martin Wattenberg. 2017. SmoothGrad: removing noise by adding noise. In ICML Workshop on Visualization for Deep Learning.

Congzheng Song, Alexander Rush, and Vitaly Shmatikov. 2020. Adversarial semantic collisions. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 4198–4210, Online. Association for Computational Linguistics.

Kaiser Sun and Ana Marasović. 2021. Effective attention sheds light on interpretability. In Findings of ACL.

Mukund Sundararajan, Ankur Taly, and Qiqi Yan. 2017. Axiomatic attribution for deep networks. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 3319–3328. JMLR.org.

Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant. 2019. CommonsenseQA: A question answering challenge targeting commonsense knowledge. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4149–4158, Minneapolis, Minnesota. Association for Computational Linguistics.

Damien Teney, Ehsan Abbasnejad, and Anton van den Hengel. 2020. Learning what makes a difference from counterfactual examples and gradient supervision. In Proceedings of the European Conference on Computer Vision (ECCV).
Bas C. Van Fraassen. 1980. The Scientific Image. Oxford University Press.

Jesse Vig, Sebastian Gehrmann, Yonatan Belinkov, Sharon Qian, Daniel Nevo, Yaron Singer, and Stuart Shieber. 2020. Investigating gender bias in language models using causal mediation analysis. In Advances in Neural Information Processing Systems, volume 33, pages 12388–12401. Curran Associates, Inc.

Sandra Wachter, Brent D. Mittelstadt, and Chris Russell. 2017. Counterfactual explanations without opening the black box: Automated decisions and the GDPR. European Economics: Microeconomics & Industrial Organization eJournal.

Sarah Wiegreffe and Yuval Pinter. 2019. Attention is not not explanation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 11–20, Hong Kong, China. Association for Computational Linguistics.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. 2020. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45, Online. Association for Computational Linguistics.

Tongshuang Wu, Marco Túlio Ribeiro, Jeffrey Heer, and Daniel S. Weld. 2021. Polyjuice: Automated, general-purpose counterfactual generation. arXiv:2101.00288.

Xing Wu, Tao Zhang, Liangjun Zang, Jizhong Han, and Songlin Hu. 2019. Mask and infill: Applying masked language model for sentiment transfer. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI-19, pages 5271–5277. International Joint Conferences on Artificial Intelligence Organization.

Linyi Yang, Eoin Kenny, Tin Lok James Ng, Yi Yang, Barry Smyth, and Ruihai Dong. 2020. Generating plausible counterfactual explanations for deep transformers in financial text classification. In Proceedings of the 28th International Conference on Computational Linguistics, pages 6150–6160, Barcelona, Spain (Online). International Committee on Computational Linguistics.

Mo Yu, Shiyu Chang, Yang Zhang, and Tommi Jaakkola. 2019. Rethinking cooperative rationalization: Introspective extraction and complement control. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4094–4103, Hong Kong, China. Association for Computational Linguistics.

Omar Zaidan, Jason Eisner, and Christine Piatko. 2007. Using "annotator rationales" to improve machine learning for text categorization. In Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Proceedings of the Main Conference, pages 260–267, Rochester, New York. Association for Computational Linguistics.

Rowan Zellers, Yonatan Bisk, Ali Farhadi, and Yejin Choi. 2019. From recognition to cognition: Visual commonsense reasoning. In CVPR.

Huangzhao Zhang, Hao Zhou, Ning Miao, and Lei Li. 2019. Generating fluent adversarial examples for natural languages. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5564–5569, Florence, Italy. Association for Computational Linguistics.