Exploring the Efficacy of Automatically Generated Counterfactuals for Sentiment Analysis

Page created by Sidney Frank
 
CONTINUE READING
Exploring the Efficacy of Automatically Generated Counterfactuals for
                            Sentiment Analysis
            Linyi Yang 1,2,3,4 , Jiazheng Li 2 , Pádraig Cunningham 2 , Yue Zhang 3,4

                              Barry Smyth 1,2 , Ruihai Dong 1,2
               1
              The Insight Centre for Data Analytics, University College Dublin
                  2
                    School of Computer Science, University College Dublin
                        3
                          School of Engineering, Westlake University
        4
          Institute of Advanced Technology, Westlake Institute for Advanced Study
     {linyi.yang, ruihai.dong, barry.smyth}@insight-centre.org
                           {padraig.cunningham}@ucd.ie
                           {jiazheng.li}@ucdconnect.ie
                           {yue.zhang}@westlake.edu.cn

                     Abstract                              et al., 2020a). As an example, in the sentence
                                                          “Nolan’s films always shock people, thanks to his
   While state-of-the-art NLP models have been             superb directing skills”, the most influential word
   achieving the excellent performance of a wide           for the prediction of a positive sentiment should be
   range of tasks in recent years, important ques-
                                                          “superb” instead of “Nolan” or “film”. The issue of
   tions are being raised about their robustness
   and their underlying sensitivity to systematic          spurious patterns also partially affects the out-of-
   biases that may exist in their training and test        domain (OOD) generalization of the models trained
   data. Such issues come to be manifest in per-           on independent, identical distribution (IID) data,
   formance problems when faced with out-of-               leading to performance decay under distribution
   distribution data in the field. One recent so-          shift (Quionero-Candela et al., 2009; Sugiyama
   lution has been to use counterfactually aug-            and Kawanabe, 2012; Ovadia et al., 2019).
   mented datasets in order to reduce any reliance
   on spurious patterns that may exist in the orig-       Researchers have recently found that such con-
   inal data. Producing high-quality augmented         cerns about model performance decay and social
   data can be costly and time-consuming as it         bias in NLP come about out-of-domain because
   usually needs to involve human feedback and         of a sensitivity to semantically spurious signals
   crowdsourcing efforts. In this work, we pro-        (Gardner et al., 2020), and recent studies have un-
   pose an alternative by describing and evalu-        covered a problematic tendency for gender bias in
   ating an approach to automatically generating
                                                       sentiment analysis (Zmigrod et al., 2019; Maudslay
   counterfactual data for the purpose of data aug-
   mentation and explanation. A comprehensive          et al., 2019; Lu et al., 2020). To this end, one of
   evaluation on several different datasets and us-    the possible solutions is data augmentation with
   ing a variety of state-of-the-art benchmarks        counterfactual examples (Kaushik et al., 2020) to
   demonstrate how our approach can achieve sig-       ensure that models learn real causal associations
   nificant improvements in model performance          between the input text and labels. For example,
   when compared to models training on the orig-       a sentiment-flipped counterfactual of last exam-
   inal data and even when compared to models          ple could be “Nolan’s movies always bore peo-
   trained with the benefit of human-generated
                                                       ple, thanks to his poor directorial skills.”. When
   augmented data.
                                                       added to the original set of training data, such kinds
1 Introduction                                         of counterfactually augmented data (CAD) have
                                                       shown their benefits on learning real causal asso-
Deep neural models have recently made remark- ciations and improving the model robustness in
able advances on sentiment analysis (Devlin et al., recent studies (Kaushik et al., 2020, 2021; Wang
2018; Liu et al., 2019; Yang et al., 2019; Xie et al., and Culotta, 2021). Unlike gradient-based adver-
2020). However, their implementation in practical      sarial examples (Wang and Wan, 2019; Zhang et al.,
applications still encounters significant challenges. 2019; Zang et al., 2020), which cannot provide a
Of particular concern, these models tend to learn in- clear boundary between positive and negative in-
tended behavior that is often associated with spuri- stances to humans, counterfactuals could provide
ous patterns (artifacts) (Jo and Bengio, 2017; Slack “human-like” logic to show a modification to the

                                                       306
               Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics
             and the 11th International Joint Conference on Natural Language Processing, pages 306–316
                         August 1–6, 2021. ©2021 Association for Computational Linguistics
input that makes a difference to the output classifi-   formance benefit tends to decrease, regardless of
cation (Byrne, 2019).                                   whether CAD is controlled manually or automat-
   Recent attempts for generating counterfactual        ically. Second, we introduce a novel masked lan-
examples (also known as minimal pairs) rely on          guage model for helping improve the fluency and
human-in-the-loop systems. Kaushik et al. (2020)        grammar correctness of the generated CAD. Third,
proposed a human-in-the-loop method to generate         we add a fine-tuned model as a discriminator for
CAD by employing human annotators to generate           automatically evaluating the edit-distance, using
sentiment-flipped reviews. The human labeler is         data generated with minimal and fluent edits (same
asked to make minimal and faithful edits to produce     requirements for human annotators in Kaushik et al.
counterfactual reviews. Similarly, Srivastava et al.    (2020)) to ensure the quality of generated counter-
(2020) presented a framework to leverage strong         factuals. Experimental results show that it leads to
prior (human) knowledge to understand the possi-        significant prediction benefits using both hold-out
ble distribution shifts for a specific machine learn-   tests and generalization tests.
ing task; they use human commonsense reasoning             To the best of our knowledge, we are the first to
as a source of information to build a more robust       automatically generate counterfactuals for use as
model against spurious patterns. Although useful        augmented data to improve the robustness of neural
for reducing sensitivity to spurious correlations,      classifiers, which can outperform existing, state-of-
collecting enough high-quality human annotations        the-art, human-in-the-loop approaches. We will
is costly and time-consuming.                           release our code and datasets on GitHub 1 .
   The theory behind the ability of CAD to improve        2   Related Work
model robustness in sentiment analysis is discussed
by Kaushik et al. (2021), where researchers present     This work mainly touches on three important ar-
a theoretical characterization of the impact of noise   eas: approaches to evaluation that go beyond tradi-
in causal and non-causal features on model gener-       tional accuracy measures (Bender and Koller, 2020;
alization. However, methods for automatically gen-      Warstadt et al., 2020), the importance of counterfac-
erating CAD have received less attention. The only      tuals in eXplainable AI (XAI) (Byrne, 2019; Keane
existing approach (Wang and Culotta, 2021) has          and Smyth, 2020), and out-of-domain generaliza-
been tested on the logistic regression model only,      tion in sentiment analysis (Kim and Hovy, 2004;
despite the fact that recent state-of-the-art methods   Zhang et al., 2018; Zhang and Zhang, 2019).
for sentiment classification are driven by neural          There has been an increasing interest in the role
models. Also, their automatically generated CAD         of Robustness Causal Thinking in ML, often by
cannot produce competitive performance compared         leveraging human feedback. Recently, some of the
to human-generated CAD. We believe that their           standard benchmark datasets have been challenged
method does not sufficiently leverage the power of      (Gardner et al., 2020; Ribeiro et al., 2020), in which
pre-trained language models and fails to generate       the model performance is significantly lower on
fluent and effective CAD. In addition, the relation-    contrast sets than on original test sets; a difference
ships between out-of-domain generalization and          of up to 25% in some cases. Researchers propose
sensitivity to spurious patterns were not explicitly    counterfactual data augmentation approaches for
investigated by Wang and Culotta (2021).                building robust models (Maudslay et al., 2019; Zmi-
   To address these issues, we use four benchmark       grod et al., 2019; Lu et al., 2020), and find that
datasets (IMDB movie reviews as hold-out test           spurious correlations threaten the model’s validity
while Amazon, Yelp, and Twitter datasets for out-       and reliability. In an attempt to address this prob-
of-domain generalization test) to further explore       lem, Kaushik et al. (2020) explore opportunities
the efficacy of CAD for sentiment analysis. First,      for developing human-in-the-loop systems by us-
we conduct a systematic comparison of several           ing crowd-sourcing to generate counterfactual data
different state-of-the-art models (Wang and Cu-         from original data, for data augmentation. Teney
lotta, 2021). This reveals how large Transformer-       et al. (2020) shows the continuous effectiveness of
based models (Vaswani et al., 2017) with larger         CAD in computer vision (CV) and NLP.
parameter sizes may improve the resilience of ma-          The idea of generating Counterfactuals in XAI
chine learning models. Specifically, we have found           1
                                                               https://github.com/lijiazheng99/Counterfactuals-for-
that for increasing parameter spaces, CAD’s per-          Sentiment-Analysis

                                                    307
Human Human-generated                                                       Sentiment
    Kaushik, Hovy, and
                      Annotators Counterfactuals           Our Method
    Lipton (2020)                                                                                  Dictionary
                                                          Our Method                             Sentiment
                                                                                                 Dictionary         Hierarchical
                                                                                                                    RM-CT
                                                                                               Identify causalHierarchical
                                                                             Classifiers
                                                                                               terms
                                                                                         Identify    with SCD
                                                                                                  causal      RM-CT
    Original Sampled    The approved counterfactual is                  Classifiers                                 Hierarchical
                                                           Original dataset              terms with MLM Hierarchical
    Dataset Data points added to the original dataset    Original dataset                                     REP-CTREP-CT
    Wang and Culotta
    (2021)                                                              MoverScore          Classifiers
                                  Replace causal                              MoverScore           Classifiers
               Identify likely
               causal features
                                  features with          Counterfactually MoverScore is used to control the minimal edits of
                                                                            the automatically generated counterfactuals
                                  antonyms                         dataset MoverScore is used to control the minimal edits
                                                           Counterfactually
                                                         augmented
    Original                     Based on                  augmented dataset of the automatically generated CAD.
    Dataset                      PyDictionary

Figure 1: Overview of previous CAD methods are shown on the left side, while the pipeline of our method is shown
on the right. Hierarchical RM-CT (removing the casual terms) and Hierarchical REP-CT (replacing the casual
terms) are our methods for automatically generating CAD, respectively. SCD denotes sampling and sensitivity of
contextual decomposition. Sentiment Dictionary refers to the opinion lexicon published by (Hu and Liu, 2004).

also shares important conceptual features with our                 their attendant accuracy issues using a hold-out test
work. Since human counterfactual explanations are                  methodology. For this reason, we designed an indi-
minimal in the sense that they select a few relevant               rect method for evaluating the robustness of models,
causes (Byrne, 2019; Keane and Smyth, 2020) as is                  by comparing the performance of models trained on
the requirement of minimal edits in our generation                 original and augmented data using out-of-domain
process. This has been explored more in the field of               data. The prediction benefit for out-of-domain data
CV (Goyal et al., 2019; Kenny and Keane, 2021),                    should provide some evidence about whether a
but investigated less in NLP. Recent work (Jacovi                  model’s sensitivity to spurious patterns has been
and Goldberg, 2020) highlight explanations of a                    successfully mitigated. The resulting counterfactu-
given causal format, and Yang et al. (2020a) gener-                als can be used for data augmentation and can also
ate counterfactuals for explaining the prediction of               provide contrastive explanations for classifiers, and
financial text classification. We propose a similar                important and desirable consideration for the re-
but different research question, that is, whether the              cent move towards more XAI (Ribeiro et al., 2016;
automatically generated counterfactual can be used                 Lundberg and Lee, 2017; Lipton, 2018; Pedreschi
for data augmentation to build more robust mod-                    et al., 2019; Slack et al., 2020b).
els, which has not been considered by the previous
methods in XAI (Pedreschi et al., 2019; Slack et al.,              3     Detailed Implementation
2020b; Yang et al., 2020b; Ding et al., 2020).                   We propose a new approach for automatically gen-
   In the case of Sentiment Analysis, most of the                erating counterfactuals to enhance the robustness
previous works report experiments using a hold-                  of sentiment analysis models by inverting the sen-
out test on the IID dataset (Liu, 2012; Yang et al.,             timent of causally important terms according to
2016; Johnson and Zhang, 2017). The current state-               Algorithm 1 and based on the following stages:
of-the-art methods make use of large pre-trained
                                                                       1. The identification of genuine causal terms us-
language models (e.g., BERT (Devlin et al., 2018),
                                                                          ing self-supervised contextual decomposition
RoBERTa (Liu et al., 2019) and SMART-RoBERTa
                                                                          (Section 3.1).
(Jiang et al., 2020)) for calculating input represnta-
tions. It has been shown that these methods can                        2. Generating counterfactual samples by (a) RM-
suffer from spurious patterns (Kaushik et al., 2020;                      CT (removing causal terms) and (b) REP-CT
Wang and Culotta, 2021). Very recently, Wang and                          (replacing the causal terms) (Section 3.2).
Culotta (2021) provide a starting point for explor-
                                                                       3. Selecting the human-like counterfactuals us-
ing the efficacy of automatically generated CAD
                                                                          ing MoverScore. (Zhao et al., 2019) (Section
for sentiment analysis, but it is still based on IID
                                                                          3.3).
hold-out tests only. However, spurious patterns
in the training and test sets could be tightly cou-                  The end result will be a set of counterfactuals
pled, which may limit the possibility of observing                 that can be used to augment an existing dataset.

                                                             308
3.1   Identifying Causal Terms                                       Algorithm 1 Generating plausible counterfactual
 To identify causally important terms, we propose                     instances.
 a hierarchical method, based on the sampling and                     Input: Test document D(n) = {P1 , P2 , ..., Pn }, with corre-
                                                                      sponding ground-truth labels Y, pre-trained Mask Language
 sensitivity of contextual decomposition technique                    Model MLM, fine-tuned transformer classifier C, Positive
 from Jin et al. (2019), by incrementally remov-                      Word Dictionaries POS, Negative Word Dictionaries NEG.
                                                                      (pos and neg are predicates for positive and negative labels)
 ing words from a sentence in order to evaluate                                                            (k)     (k)    (k)
                                                                      Output: Plausible counterfactual Dcf = {Drep , Drm }
 the model’s sensitivity to these words. Significant
                                                                           1: for Pk in D(n) do
 changes in model outputs suggest the removal of                           2:     for S (i) , Yi in Pk do
 important terms. For example, removing the word                           3:         Sb(i) ← w ∈ S (i) | (w ∈ P OS ∧ Yi = pos)
“best” from “The movie is the best that I have ever                                                 ∨ (w ∈ N EG ∧ Yi = neg)
                                                                                        (i)
                                                                                      Ssorted ← sort Sb(i) , key = φ(w, Sb(i) ) (eq.1)
                                                                                                                               
 seen.”, is likely to alter a model’s sentiment predic-                    4:
                                                                                        (i)         (i)
 tion more than the removal of other words from the                        5:         Srm ← Ssorted [1 :]
                                                                                        (i)          (i)
 sentence; thus “best” is an important word with                           6:         Srep ← Ssorted
                                                                                                      (i)
 respect to this sentence’s sentiment. In a similar                        7:         for w ∈ Srep do
                                                                                                                  (i) (i) 
                                                                           8:              Wp ← M LM Smask(w) , Srep
 way, phrases beginning with negative pronouns will
                                                                           9:              Wc ← {w ∈ Wp | (w ∈ P OS ∧ Yi ! =
 likely be important; for instance, “not satisfy you”                         pos) ∨ (w ∈ N EG ∧ Yi ! = neg)
 is important in “This movie could not satisfy you”.                      10:
                                                                                              (i)                                 
                                                                                           Srep (w) ← sort Wc , key = φ(w, Wc ) [0]
    Given a word (or phrase starting with negative                        11:          end for
                                                                                         (k)          (k)     (i)
 limitations) w in the sentence s, the importance of                      12:          Prm ← Prm + Srm
                                                                                         (k)          (k)     (i)
                                                                          13:          Prep ← Prep + Srep
 w can be calculated as in Equation 1 where s β \p                        14:     end for
                                                                                    (n)          (n)      (k)
 denotes the sentence that resulting after masking                        15:     Drm ← Drm + Prm
                                                                                    (n)          (n)      (k)
 out a single word (or a negative phrase as above).                       16:     Drep ← Drep + Prep
                                                                          17: end for
 We use l (s β \p; b s) to represent the model predic-                                   (n)
                                                                          18: return Drm , Drep
                                                                                                   (n)

 tion after replacing the masked-out context, while
 sbβ is a input sequence sampled from the input s. \p
                                                                                                         (i)
 indicates the operation of masking out the phrase p                      to create a new sentence Srep . For each word w we
 in a input document D from the training set. The                         use a masked language model (MLM) to generate
 specific candidate causal terms found by this mask-                      a set of plausible replacements, Wp (line 8), and
 ing operation vary for different prediction models.                      a subset of these, Wc , as replacement candidates
                                                                          if their sentiment is different from the sentiment
                                                                          of S (i) , which is given by Yi (line 9). Here we are
                                                         
                      l (s β ; sbβ ) − l (s β \p; sbβ )
  φ(w, b
       s) = Esβ                                               (1)         using the BERT-base-uncased as the pre-trained
                                 l (s β ; sbβ )
                                                                          MLM for SVM and BiLSTM models 1 . The size
 3.2   Generating Human-like Counterfactuals                              of candidate substitutions found by MLM output
This approach and the scoring function in Equation                        is set to 100 for all models.Then, Wc is sorted in
1 is used in Algorithm 1 in two ways, to generate                         descending order of importance using Equation 1
two types of plausible counterfactuals. First, it is                      and the most important candidate is selected and
                                                                                                   (i)
used to identify words to remove from a sentence to                       used to replace w in Srep (line 10).
produce a plausible counterfactual. This is referred                         Algorithm 1 continues in this fashion to gen-
to as RM-CT and is performed by lines 3–5 in Al-                          erate counterfactual sentences using RM-CT and
gorithm 1; for a sentence S (i) , it’s correctly labeled                  REP-CT for each sentence in each paragraph of
sentiment words are identified (line 3), and sorted                       the target document 2 . It returns two counterfac-
based on Equation 1 (line 4) with classifier C, and                       tual documents, which correspond to documents
the most important of these words is removed from                         produced from the RM-CT and REP-CT sentences;
                   (i)
S (i) to produce Srm (line 5).                                            see lines 15–18.
   Second, the REP-CT technique instead replaces                             The above approach is not guaranteed to always
each causally important sentiment word in S (i)                           generate counterfactuals. Typically, reviews that
                                                                             1
with an alternative word that has an opposing sen-                              For Transformers-based models, we use their own pre-
timent polarity (lines 6-11 in Algorithm 1). To do                        trained MLM (e.g., RoBERTa and XLNet) as the generator.
                                                                              2
                                                                                Generating one counterfactual edit for an IMDB instance
this the words in S (i) are each considered for re-                       takes an average of ≈ 3.4 seconds based on the RoBERTa-
placement in order of their importance (lines 6 & 7)                      Large model.

                                                                    309
cannot be transformed into plausible counterfac-            State-of-the-art Models                 SST-2   IMDB
                                                             SMART-RoBERTa (Jiang et al., 2020)      97.5    96.3
 tuals contain spurious associations that interfere          RoBERTa-Large (Liu et al., 2019)        96.7    96.3
 with the model’s predictions. For example, in our           RTC-attention (Zhang and Zhang, 2019)   90.3    88.7
 method, the negative review “The film is pretty             Bi-LSTM                                 86.7    86.0
 bad, and her performance is overacted” will be             Table 1: The performance of state-of-the-art models in
 first modified as “The film is pretty good, and her        sentiment analysis.
 performance is lifelike”. The revised review’s pre-
 diction will remain negative. Meanwhile, the word
“her” will be identified as a potential causal term. To     4.1    In-domain Data
 alleviate this problem, we further conduct the sub-      We first adopt two of the most popular benchmark
 stitution of synonyms for those instances that have      datasets – SST-2 and IMDB (Maas et al., 2011) –
 been already modified with antonym substitution          to show the recent advances on sentiment analysis
 by using causal terms. As an example, we will con-       with the benefit of pre-trained models. However,
 tinue replacing the word “her” with “their” until        we mainly focus on the robustness of various mod-
 the prediction has been flipped; see also Zmigrod        els for sentiment analysis in this work, rather than
 et al. (2019) for related ideas.                         in-domain accuracy. Hence, following Wang and
   In conclusion, then, the final augmented dataset       Culotta (2021) and Kaushik et al. (2020), we per-
 that is produced of three parts: (1) counterfactuals     form binary sentiment classification experiments
 generated by RM-CT; (2) counterfactuals generated        on the IMDB dataset sampled from Maas et al.
 by REP-CT; (3) adversarial examples generated by         (2011) that contains 1707 training, 245 validation,
 synonym substitutions.                                   and 488 testing examples with challenge dataset
                                                          (paired counterfactuals).
 3.3    Ensuring Minimal Changes                            4.2    Challenge Data
When generating plausible counterfactuals, it is          Based on the in-domain IMDB data, Kaushik et al.
desirable to make minimal changes so that the             (2020) employ crowd workers not to label docu-
resulting counterfactual is as similar as possible        ments, but to revise movie review to reverse its
to the original instance (Miller, 2019; Keane and         sentiment, without making any gratuitous changes.
Smyth, 2020). To evaluate this for the approach           We directly use human-generated counterfactuals
described we use the MoverScore (Zhao et al.,             by Kaushik et al. (2020) as our challenge data, en-
2019) – an edit-distance scoring metric originally        forcing a 50:50 class balance.
designed for machine translation – which confirms
that the MoverScore for the automatic CAD in-               4.3    Out-of-domain Data
stances is marginally higher when compared to             We also evaluate our method on different out-of-
human-generated counterfactuals, indicated greater        domain datasets, including Amazon reviews (Ni
similarity between counterfactuals and their orig-        et al., 2019) from six genres: beauty, fashion, ap-
inal instances. The MoverScore between human-             pliances, gift cards, magazines, and software, a
generated counterfactuals and original reviews is         Yelp review dataset, and the Semeval-2017 Twitter
0.74 on average (minimum value of 0.55) and our           dataset (Rosenthal et al., 2017). These have all
augmented data results in a slightly higher average       been sampled to provide a 50:50 label split. The
score than human-generated data for all models.           size of the training data has been kept the same for
The generated counterfactuals and synonym sub-            all methods, and the results reported are the aver-
stitutions that achieve a MoverScore above 0.55           age from five runs to facilitate a direct comparison
are combined with the original dataset for training       with baselines (Kaushik et al., 2020, 2021).
robust classifiers.
                                                            5     Results and Discussions
 4     Datasets                                           We first describe the performance of the cur-
                                                          rent state-of-the-art methods on sentiment analysis
 Our evaluation uses three different kinds of             based on the SST-2 and IMDB benchmark datasets.
 datasets, in-domain data, challenge data, and out-       Next, we will discuss the performance benefits by
 of-domain data.                                          using our automatically generated counterfactuals

                                                      310
Models                 Parameter       Training / Testing data           AC: (Our method)
                                               O/O CF/O CF/CF O/CF               C/O AC/O C/CF        AC/CF
         SVM(TF-IDF)        -                  80.0 58.3        91.2    51.0     83.7 84.8     87.3   86.1
         Bi-LSTM            0.2M               79.3 62.5        89.1    55.7     81.5 82.2     92.0   88.5
         Transformer-based Models
         BERT [ICLR,2021]   110M               87.4   80.4     90.8       82.2   88.5   90.6   95.1   92.2
         WWM-BERT-Large 335M                   91.2   86.9     96.9       93.0   91.0   91.8   95.3   94.1
         XLNet-Large        340M               95.3   90.8     98.0       93.9   93.9   94.9   96.9   95.5
         RoBERTa-Large      355M               93.4   91.6     96.9       93.0   93.6   94.1   96.7   94.3

Table 2: The accuracy of various models for sentiment analysis using different datasets, including the human-
generated counterfactual data and counterfactual samples generated by our pipeline. O denotes the original IMDB
review dataset, CF represents the human-revised counterfactual samples, C denotes the combined dataset con-
sisting of original and human-revised dataset, and AC denotes the original dataset combined with automatically
generated counterfactuals. C and AC contain the same size of training samples (3.4K).

  Original Samples             Original          Robust        The SVM model for sentiment analysis is from
                               superb:0.213      0.627
  Nolan’s film...superb                                        scikit-learn and uses TF-IDF (Term Frequency-
                               film:0.446        0.019
  directing skills (POS)
                               Nolan:0.028       0.029         Inverse Document Frequency) scores, while the
  It’s a poor film, but I      poor:-0.551       -0.999        Transformer-based models are built based on the
  must give it to the lead     film:-0.257       -7e-7
  actress in this one (NEG)    actress:-0.02     -1e-6         Pytorch-Transformer package 4 . We keep the pre-
                                                               diction models the same as Kaushik et al. (2020),
Table 3: Less sensitivity to spurious patterns has been        except for Naive Bayes, which has been abandoned
shown in the robust BERT-base-uncased model.                   due to its high-variance performance shown in our
                                                               experiments.
on an in-domain test. We further compare our                      In the following experiments, we only care about
method, human-label method, and two state-of-the-              whether the robustness of models has been im-
art style-transfer methods (Sudhakar et al., 2019;             proved when training on the augmented dataset
Madaan et al., 2020) in terms of the model robust-             (original data & CAD). Different counterfactual ex-
ness on generalization test. Notably, we provide               amples have been generated for different models in
an ablation study lastly to discuss the influence of           terms of their own causal terms in practice, while
edit-distance for performance benefits.                        the hyper-parameters for different prediction mod-
                                                               els are all identified using a grid search conducted
5.1   State-of-the-art Models                                  over the validation set.
As the human-generated counterfactuals (Kaushik
                                                                   5.2   Comparison with Original Data
et al., 2020) are sampled from Maas et al. (2011),
the results in Table 1 cannot be directly compared                 On the Influence of Spurious Patterns. As
with Table 2 3 . As shown in Table 1, by comparing                 shown in Table 2, we find that the linear
BiLSTM to Transformer-base methods, it can be                      model (SVM) trained on the original and chal-
seen that remarkable advances in sentiment analy-                  lenge (human-generated counterfactuals) data can
sis have been achieved in recent years. On SST-2,                  achieve 80% and 91.2% accuracy testing on the
SMART-RoBERTa (Jiang et al., 2020) outperforms                     IID hold-out data, respectively. However, the ac-
Bi-LSTM by 10.8% (97.5% vs. 86.7%) accuracy,                       curacy of the SVM model trained on the original
where a similar improvement is observed on IMDB                    set when testing on the challenge data drops dra-
(96.3% vs. 86.0%).                                                 matically (91.2% vs. 51%), and vice versa (80%
   According to the results, we select the following               vs. 58.3%). Similar findings were reported by
models for our experiments, which covers a spec-                   Kaushik et al. (2020), where a similar pattern was
trum of statistical, neural and pre-trained neural                 observed in the Bi-LSTM model and BERT-base
methods: SVM (Suykens and Vandewalle, 1999),                       model. This provides further evidence supporting
Bi-LSTM (Graves and Schmidhuber, 2005), BERT-                      the idea that the spurious association in machine
Base (Devlin et al., 2018), RoBERTa-Large (Liu                     learning models is harmful to the performance on
et al., 2019), and XLNet-Large (Yang et al., 2019).                the challenge set for sentiment analysis.
  3                                                                  4
    We can only get the human-generated counterfactual ex-             https://github.com/huggingface/
amples (Kaushik et al., 2020) sampled from the IMDB dataset.       pytorch-transformers

                                                             311
On the Benefits of Robust BERT. As shown                Out-of-domain Test using
                                                           Different Training Data            SVM BERT
in Table 3, we also test whether the sensitivity to                  Accuracy on Amazon Reviews
spurious patterns has been eliminated in the robust        Orig & CAD (Our Method) (3.4k) 78.6      84.7
BERT model. We notice that the correlations of the         Orig & CAD (By Human) (3.4k)       79.3  83.3
                                                           Orig. & (Sudhakar et al., 2019)    64.0  77.2
real causal association “superb” and “poor” are            Orig. & (Madaan et al., 2020)      74.3  71.3
improved from 0.213 to 0.627 and -0.551 to -0.999,         Orig. (3.4k)                       74.5  80.0
respectively. While the correlation of spurious as-           Accuracy on Semeval 2017 Task B (Twitter)
                                                           Orig & CAD (Our Method) (3.4k) 69.7      83.8
sociation “film” is decreased from 0.446 to 0.019          Orig & CAD (By Human) (3.4k)       66.8  82.8
and -0.257 to -7e-7 on positive and the negative           Orig. & (Sudhakar et al., 2019)    59.4  72.5
samples, respectively. This shows that the model           Orig. & (Madaan et al., 2020)      62.8  79.3
                                                           Orig. (3.4k)                       63.1  72.6
trained with our CAD data does provide robustness                       Accuracy on Yelp Reviews
against spurious patterns.                                 Orig & CAD (Our Method) (3.4k) 85.5      87.9
                                                           Orig & CAD (By Human) (3.4k)       85.6  86.6
   On the Influence of Model Size. Previous                Orig. & (Sudhakar et al., 2019)    69.4  84.5
works (Kaushik et al., 2021; Wang and Culotta,             Orig. & (Madaan et al., 2020)      81.3  78.8
2021) have not investigated the performance bene-          Orig. (3.4k)                       81.9  84.3
fits on larger pre-trained models. While we further   Table 4: Out-of-domain test accuracy of SVM and
conduct experiments on various Transformer-based      BERT-base-uncased models trained on the original
models with different parameter sizes to explore      (Orig.) IMDB review only, Counterfactually Aug-
whether the larger transformer-based models can       mented Data (CAD) combining with original data, and
still enjoy the performance benefits of CAD (Table    sentiment-flipped style-transfer examples.
2). We observe that although the test result can
increase with the parameter size increasing (best
for 94.9% using XLNet), the performance benefits
                                                        human-generated (CF) data, it may be because the
brought by human-generated CAD and the auto-
                                                        training and test sets of the human-generated (CF)
generated CAD declines continuously with the pa-
                                                        data are generated by the same group of labelers.
rameter size increase. For example, the BERT-base-
uncased model trained on the auto-generated com-         Robustness in the Generalization Test. We ex-
bined dataset can receive 3.2% (90.6% vs. 87.4%)      plore how our approach makes prediction models
improvement on accuracy while performance in-         more robust out-of-domain in Table 4. For direct
creases only 0.6% (91.8% vs. 91.2%) on accuracy       comparison between our method and the human-
for WWM-BERT-Large. It suggests that larger pre-      generated method, we adopt the fine-tuned BERT-
trained Transformer models may be less sensitive      base model trained with the augmented dataset
to spurious patterns.                                 (original & automatically revised data). The fine-
                                                      tuned model is directly tested for out-of-domain
5.3   Comparison with Human CAD                       data without any adjustment. As shown in Table 4,
                                                      only our method and the human-label method can
Robustness in the In-domain Test. We can see
                                                      outperform the BERT model trained on the original
that all of the models trained on automatic CAD
                                                      data with average 6.5% and 5.3% accuracy im-
– shown as AC in the Table 2 – can outperform
                                                      provements, respectively. Our method also offers
the human-generated CAD varying with the mod-
                                                      performance benefits over three datasets even when
els (AC/O vs. C/O) as follows: SVM (+1.1%),
                                                      compared to the human-label method on BERT.
Bi-LSTM (+0.7%), BERT-base-uncased (+2.1%),
BERT-Large (+0.8%), XLNet-Large (+1.0%), and               Neural Method vs. Statistical Method. As
RoBERTa-Large (+0.5%) when testing on the orig-         shown in Table 4, the performance of the SVM
inal data. If we adopt the automatic CAD (AC), we       model with automatic CAD is more robust than
note a distinct improvement in Table 2 across all       other automated methods (Sudhakar et al., 2019;
models trained on the challenge data in terms of        Madaan et al., 2020) across all datasets. However,
11.3% in average (AC/O vs. CF/O), whereas the           the human-labeled CAD can improve Amazon re-
human-generated CAD can achieve 10.2% accu-             views’ accuracy compared to our method using
racy improvement (C/O vs. CF/O) in average. It          the SVM model by 0.7%. It indicates that human-
is noteworthy that the human-generated CAD can          generated data may lead to more performance ben-
slightly outperform our method when testing on the      efits on a statistical model.

                                                  312
Types of Algorithms                 Examples
                                     Ori: Some films just simply should not be remade. This is one of them. In and of
  Hierarchical RM-CT:                itself it is not a bad film.
  Remove negative limitations        Rev: Some films just simply should be remade. This is one of them. In and of itself it
                                     is a bad film.
  Hierarchical RE-CT:                Ori: It is badly directed, badly acted and boring.
  Replacing the causal terms         Rev: It is well directed, well acted and entertaining.
                                     Ori: This movie is so bad, it can only be compared to the all-time worst “comedy”:
                                     Police Academy 7. No laughs throughout the movie.
  Combined method:
                                     Rev: This movie is so good, it can only be compared to the all-time best “comedy”:
                                     Police Academy 7. Laughs throughout the movie.

Table 5: Most prominent categories of edits for flipping the sentiment performed by our algorithms, namely hier-
archical RM-CT and hierarchical REP-CT.

5.4   Comparison with Automatic Methods                      Training Data            IMDB     Out-of-domain Test
                                                             BERT-base-uncased        Orig.    Amazon Twitter Yelp
Automatic CAD vs. Style-transfer Methods. As                 Orig. & CAD ↑ (3.4K)     90.6     84.7     83.8    87.9
shown in Table 4, the style-transfer results are             Orig. & CAD ↓ (3.4K)     87.1     79.5     73.8    79.0
                                                             Orig. (1.7K)             87.4     80.0     72.6    84.3
consistent with Kaushik et al. (2021). We find
that the sentiment-flipped instances generated by Table 6: Ablation study on the influence of the edit-
style-transfer methods degrade the test accuracy    distance controlled by the threshold of MoverScore. ↑
for all models on all kinds of datasets, whereas    indicates the CAD (1.7K) above the threshold, while ↓
our method has achieved the best performance for    denotes the CAD (1.7K) below the threshold.
all settings. It suggests that our method have its
absolute advantage for data augmentation in senti- et al., 2020). As shown in Table 5, we flipped the
ment analysis when compared to the state-of-the- model’s prediction by replacing the causal terms in
art style-transfer models.                          the phrase “badly directed, badly acted and boring”
   Our Methods vs. Implausible CAD. The au- to “well directed, well acted and entertaining”, or
thors of the only existing approach for automati- removing “No laughs throughout the movie.” to
cally generating CAD (Wang and Culotta, 2021) “Laughs throughout the movie” for a movie review.
report that their methods are not able to match the    We also noticed that our method may face the
performance of human-generated CAD. Our meth- challenge when handling more complex reviews.
ods consistently outperform human-labeled meth- For example, the sentence “Watch this only if some-
ods on both In-domain and Out-of-domain tests. one has a gun to your head ... maybe.” is an ap-
To further provide quantitative evidence of the in- parent negative review for a human. However, our
fluence of the edit-distance in automatic CAD, we   algorithm is hard to flip the sentiment of such re-
demonstrate an ablation study in Table 6. The re- views with no explicit casual terms. The technique
sult shows that the quality of the generated CAD, on sarcasm and irony detection may have benefits
which is ignored in the previous work Wang and      for dealing with this challenge.
Culotta (2021), is crucial when training the robust
classifiers. In particular, the BERT model fine- 6 Conclusion
tuned with implausible CAD (below the threshold)
can receive comparable negative results with the    We proposed a new framework to automatically
style-transfer samples, alongside the performance   generate counterfactual augmented data (CAD) for
decrease on all datasets, except for Twitter.       enhancing the robustness of sentiment analysis
                                                    models. By combining the automatically generated
5.5 Case Study and Limitations                      CAD with the original training data, we can pro-
The three most popular kinds of edits are shown     duce more robust classifiers. We further show that
in Table 5. These are, negation words removal, our methods can achieve better performance even
sentiment words replacement, and the combination    when compared to models trained with human-
of these. It can be observed from these examples    generated counterfactuals. More importantly, our
that we ensure the edits on original samples should evaluation based on several datasets has demon-
be minimal and fluent as was required previously    strated that models trained on the augmented data
with human-annotated counterfactuals (Kaushik       (original & automatic CAD) appear to be less af-

                                                       313
fected by spurious patterns and generalize better           Matt Gardner, Yoav Artzi, Victoria Basmov, Jonathan
to out-of-domain data. This suggests there exists            Berant, Ben Bogin, Sihao Chen, Pradeep Dasigi,
                                                             Dheeru Dua, Yanai Elazar, Ananth Gottumukkala,
a significant opportunity to explore the use of the
                                                             et al. 2020. Evaluating models’ local decision
CAD in a range of tasks (e.g., natural language in-          boundaries via contrast sets. In Proceedings of the
ference, natural language understanding, and social          2020 Conference on Empirical Methods in Natural
bias correction.).                                           Language Processing: Findings, pages 1307–1323.
                                                            Yash Goyal, Ziyan Wu, Jan Ernst, Dhruv Batra, Devi
Impact Statement                                              Parikh, and Stefan Lee. 2019. Counterfactual visual
                                                              explanations. In ICML.
Although the experiments in this paper are con-
ducted only in the sentiment classification task,           Alex Graves and Jürgen Schmidhuber. 2005. Frame-
this study could be a good starting point to investi-         wise phoneme classification with bidirectional lstm
                                                              and other neural network architectures. Neural net-
gate the efficacy of automatically generated CAD
                                                              works, 18(5-6):602–610.
for building robust systems in many NLP tasks, in-
cluding Natural Language Inference (NLI), Named             Minqing Hu and Bing Liu. 2004. Mining and summa-
Entity Recognition (NER), Question Answering                  rizing customer reviews. In Proceedings of the tenth
                                                              ACM SIGKDD international conference on Knowl-
(QA) system, etc.                                             edge discovery and data mining, pages 168–177.

Acknowledgment                                              Alon Jacovi and Yoav Goldberg. 2020. Aligning faith-
                                                              ful interpretations with their social attribution. arXiv
We would like to thank Eoin Kenny and Prof. Mark              preprint arXiv:2006.01067.
Keane from Insight Centre for their helpful advice
                                                            Haoming Jiang, Pengcheng He, Weizhu Chen, Xi-
and discussion during this work. Also, we would               aodong Liu, Jianfeng Gao, and Tuo Zhao. 2020.
like to thank the anonymous reviewers for their               SMART: Robust and efficient fine-tuning for pre-
insightful comments and suggestions to help im-               trained natural language models through principled
prove the paper. This publication has emanated                regularized optimization. In Proceedings of the 58th
                                                              Annual Meeting of the Association for Computa-
from research conducted with the financial support            tional Linguistics, pages 2177–2190, Online. Asso-
of Science Foundation Ireland under Grant number              ciation for Computational Linguistics.
12/RC/2289 P2.
                                                            Xisen Jin, Zhongyu Wei, Junyi Du, Xiangyang Xue,
                                                              and Xiang Ren. 2019. Towards hierarchical impor-
                                                              tance attribution: Explaining compositional seman-
References                                                    tics for neural sequence models. In International
Emily M. Bender and Alexander Koller. 2020. Climb-            Conference on Learning Representations.
  ing towards NLU: On meaning, form, and under-
                                                            Jason Jo and Yoshua Bengio. 2017. Measuring the ten-
  standing in the age of data. In Proceedings of the
                                                               dency of cnns to learn surface statistical regularities.
  58th Annual Meeting of the Association for Compu-
                                                               arXiv preprint arXiv:1711.11561.
  tational Linguistics, pages 5185–5198, Online. As-
  sociation for Computational Linguistics.                  Rie Johnson and Tong Zhang. 2017. Deep pyramid
                                                              convolutional neural networks for text categoriza-
Ruth MJ Byrne. 2019. Counterfactuals in explain-              tion. In Proceedings of the 55th Annual Meeting of
  able artificial intelligence (xai): evidence from hu-       the Association for Computational Linguistics (Vol-
  man reasoning. In Proceedings of the 28th Inter-            ume 1: Long Papers), pages 562–570.
  national Joint Conference on Artificial Intelligence,
  pages 6276–6282. AAAI Press.                              Divyansh Kaushik, Eduard Hovy, and Zachary Lipton.
                                                              2020. Learning the difference that makes a differ-
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and                 ence with counterfactually-augmented data. In Inter-
   Kristina Toutanova. 2018. Bert: Pre-training of deep       national Conference on Learning Representations.
   bidirectional transformers for language understand-
   ing. arXiv preprint arXiv:1810.04805.                    Divyansh Kaushik, Amrith Setlur, Eduard Hovy, and
                                                              Zachary C Lipton. 2021. Explaining the efficacy of
Xiao Ding, Dingkui Hao, Yuewei Zhang, Kuo Liao,               counterfactually augmented data. In International
  Zhongyang Li, Bing Qin, and Ting Liu. 2020. HIT-            Conference on Learning Representations.
  SCIR at SemEval-2020 task 5: Training pre-trained
  language model with pseudo-labeling data for coun-        Mark T Keane and Barry Smyth. 2020. Good counter-
  terfactuals detection. In Proceedings of the Four-         factuals and where to find them: A case-based tech-
  teenth Workshop on Semantic Evaluation, pages              nique for generating counterfactuals for explainable
  354–360, Barcelona (online). International Commit-         ai (xai). In International Conference on Case-Based
  tee for Computational Linguistics.                         Reasoning (ICCBR).

                                                      314
Eoin M Kenny and Mark T Keane. 2021. On generat-                the 2019 Conference on Empirical Methods in Nat-
  ing plausible counterfactual and semi-factual expla-          ural Language Processing and the 9th International
  nations for deep learning. In AAAI.                           Joint Conference on Natural Language Processing
                                                                (EMNLP-IJCNLP), pages 188–197.
Soo-Min Kim and Eduard Hovy. 2004. Determining
  the sentiment of opinions. In COLING 2004: Pro-             Yaniv Ovadia, Emily Fertig, Jie Ren, Zachary Nado,
  ceedings of the 20th International Conference on              David Sculley, Sebastian Nowozin, Joshua Dillon,
  Computational Linguistics, pages 1367–1373.                   Balaji Lakshminarayanan, and Jasper Snoek. 2019.
                                                                Can you trust your model’s uncertainty? evaluating
Zachary C Lipton. 2018. The mythos of model inter-              predictive uncertainty under dataset shift. In Ad-
  pretability. Queue, 16(3):31–57.                              vances in Neural Information Processing Systems,
                                                                pages 13991–14002.
Bing Liu. 2012. Sentiment analysis and opinion min-
  ing. Synthesis lectures on human language technolo-         Dino Pedreschi, Fosca Giannotti, Riccardo Guidotti,
  gies, 5(1):1–167.                                             Anna Monreale, Salvatore Ruggieri, and Franco
                                                                Turini. 2019. Meaningful explanations of black box
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Man-             ai decision systems. In Proceedings of the AAAI
  dar Joshi, Danqi Chen, Omer Levy, Mike Lewis,                 Conference on Artificial Intelligence, volume 33,
  Luke Zettlemoyer, and Veselin Stoyanov. 2019.                 pages 9780–9784.
  Roberta: A robustly optimized bert pretraining ap-
  proach. arXiv preprint arXiv:1907.11692.                    Joaquin Quionero-Candela, Masashi Sugiyama, Anton
                                                                Schwaighofer, and Neil D Lawrence. 2009. Dataset
Kaiji Lu, Piotr Mardziel, Fangjing Wu, Preetam Aman-            shift in machine learning. The MIT Press.
  charla, and Anupam Datta. 2020. Gender bias in
  neural natural language processing. In Logic, Lan-          Marco Tulio Ribeiro, Sameer Singh, and Carlos
  guage, and Security, pages 189–202. Springer.                Guestrin. 2016. ” why should i trust you?” explain-
                                                               ing the predictions of any classifier. In Proceed-
Scott M Lundberg and Su-In Lee. 2017. A unified                ings of the 22nd ACM SIGKDD international con-
  approach to interpreting model predictions. In Ad-           ference on knowledge discovery and data mining,
  vances in neural information processing systems,             pages 1135–1144.
  pages 4765–4774.
                                                              Marco Tulio Ribeiro, Tongshuang Wu, Carlos Guestrin,
Andrew L. Maas, Raymond E. Daly, Peter T. Pham,                and Sameer Singh. 2020. Beyond accuracy: Be-
  Dan Huang, Andrew Y. Ng, and Christopher Potts.              havioral testing of NLP models with CheckList. In
  2011. Learning word vectors for sentiment analy-             Proceedings of the 58th Annual Meeting of the Asso-
  sis. In Proceedings of the 49th Annual Meeting of            ciation for Computational Linguistics, pages 4902–
  the Association for Computational Linguistics: Hu-           4912, Online. Association for Computational Lin-
  man Language Technologies, pages 142–150, Port-              guistics.
  land, Oregon, USA. Association for Computational
  Linguistics.                                                Sara Rosenthal, Noura Farra, and Preslav Nakov. 2017.
                                                                Semeval-2017 task 4: Sentiment analysis in twitter.
Aman Madaan, Amrith Setlur, Tanmay Parekh, Barn-                In Proceedings of the 11th international workshop
 abas Poczos, Graham Neubig, Yiming Yang, Ruslan                on semantic evaluation (SemEval-2017), pages 502–
 Salakhutdinov, Alan W Black, and Shrimai Prabhu-               518.
 moye. 2020. Politeness transfer: A tag and generate
 approach. In Proceedings of the 58th Annual Meet-            Dylan Slack, Sophie Hilgard, Emily Jia, Sameer Singh,
 ing of the Association for Computational Linguistics,          and Himabindu Lakkaraju. 2020a. Fooling lime and
 pages 1869–1881, Online. Association for Computa-              shap: Adversarial attacks on post hoc explanation
 tional Linguistics.                                            methods. In Proceedings of the AAAI/ACM Confer-
                                                                ence on AI, Ethics, and Society, pages 180–186.
Rowan Hall Maudslay, Hila Gonen, Ryan Cotterell,
  and Simone Teufel. 2019. It’s all in the name: Mit-         Dylan Slack, Sophie Hilgard, Sameer Singh, and
  igating gender bias with name-based counterfactual            Himabindu Lakkaraju. 2020b. How much should i
  data substitution. In Proceedings of the 2019 Con-            trust you? modeling uncertainty of black box expla-
  ference on Empirical Methods in Natural Language              nations. arXiv preprint arXiv:2008.05030.
  Processing and the 9th International Joint Confer-
  ence on Natural Language Processing (EMNLP-                 Megha Srivastava, Tatsunori Hashimoto, and Percy
  IJCNLP), pages 5270–5278.                                    Liang. 2020. Robustness to spurious correlations via
                                                               human annotations. In International Conference on
Tim Miller. 2019. Explanation in artificial intelligence:      Machine Learning, pages 9109–9119. PMLR.
  Insights from the social sciences. Artificial Intelli-
  gence, 267:1–38.                                            Akhilesh Sudhakar, Bhargav Upadhyay, and Arjun Ma-
                                                                heswaran. 2019. “transforming” delete, retrieve,
Jianmo Ni, Jiacheng Li, and Julian McAuley. 2019.               generate approach for controlled text style transfer.
   Justifying recommendations using distantly-labeled           In Proceedings of the 2019 Conference on Empirical
   reviews and fine-grained aspects. In Proceedings of          Methods in Natural Language Processing and the

                                                        315
9th International Joint Conference on Natural Lan-             American chapter of the association for computa-
  guage Processing (EMNLP-IJCNLP), pages 3260–                   tional linguistics: human language technologies,
  3270.                                                          pages 1480–1489.

Masashi Sugiyama and Motoaki Kawanabe. 2012. Ma-               Yuan Zang, Fanchao Qi, Chenghao Yang, Zhiyuan Liu,
 chine learning in non-stationary environments: In-              Meng Zhang, Qun Liu, and Maosong Sun. 2020.
 troduction to covariate shift adaptation. MIT press.            Word-level textual adversarial attacking as combina-
                                                                 torial optimization. In Proceedings of the 58th An-
Johan AK Suykens and Joos Vandewalle. 1999. Least                nual Meeting of the Association for Computational
  squares support vector machine classifiers. Neural             Linguistics, pages 6066–6080.
  processing letters, 9(3):293–300.
                                                               Huangzhao Zhang, Hao Zhou, Ning Miao, and Lei Li.
Damien Teney, Ehsan Abbasnedjad, and Anton van den               2019. Generating fluent adversarial examples for
  Hengel. 2020. Learning what makes a difference                 natural languages. In Proceedings of the 57th An-
  from counterfactual examples and gradient supervi-             nual Meeting of the Association for Computational
  sion. arXiv preprint arXiv:2004.09034.                         Linguistics, pages 5564–5569.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob               Yuan Zhang and Yue Zhang. 2019. Tree communica-
  Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz                  tion models for sentiment analysis. In Proceedings
  Kaiser, and Illia Polosukhin. 2017. Attention is all           of the 57th Annual Meeting of the Association for
  you need. In NIPS.                                             Computational Linguistics, pages 3518–3527.
Ke Wang and Xiaojun Wan. 2019. Automatic genera-               Yue Zhang, Qi Liu, and Linfeng Song. 2018. Sentence-
  tion of sentimental texts via mixture adversarial net-         state lstm for text representation. In Proceedings
  works. Artificial Intelligence, 275:540–558.                   of the 56th Annual Meeting of the Association for
                                                                 Computational Linguistics (Volume 1: Long Papers),
Zhao Wang and Aron Culotta. 2021. Robustness to spu-             pages 317–327.
  rious correlations in text classification via automati-
  cally generated counterfactuals. In AAAI.                    Wei Zhao, Maxime Peyrard, Fei Liu, Yang Gao, Chris-
                                                                 tian M Meyer, and Steffen Eger. 2019. Moverscore:
Alex Warstadt, Alicia Parrish, Haokun Liu, Anhad Mo-            Text generation evaluating with contextualized em-
  hananey, Wei Peng, Sheng-Fu Wang, and Samuel R                 beddings and earth mover distance. In Proceedings
  Bowman. 2020. Blimp: The benchmark of linguis-                 of the 2019 Conference on Empirical Methods in
  tic minimal pairs for english. Transactions of the As-        Natural Language Processing and the 9th Interna-
  sociation for Computational Linguistics, 8:377–392.            tional Joint Conference on Natural Language Pro-
                                                                 cessing (EMNLP-IJCNLP), pages 563–578.
Qizhe Xie, Zihang Dai, Eduard Hovy, Thang Luong,
  and Quoc Le. 2020. Unsupervised data augmenta-               Ran Zmigrod, Sebastian J Mielke, Hanna Wallach, and
  tion for consistency training. Advances in Neural              Ryan Cotterell. 2019. Counterfactual data augmen-
  Information Processing Systems, 33.                            tation for mitigating gender stereotypes in languages
                                                                 with rich morphology. In Proceedings of the 57th
Linyi Yang, Eoin Kenny, Tin Lok James Ng, Yi Yang,
                                                                 Annual Meeting of the Association for Computa-
  Barry Smyth, and Ruihai Dong. 2020a. Generating
                                                                 tional Linguistics, pages 1651–1661.
  plausible counterfactual explanations for deep trans-
  formers in financial text classification. In Proceed-
  ings of the 28th International Conference on Com-
  putational Linguistics, pages 6150–6160.

Xiaoyu Yang, Stephen Obadinma, Huasha Zhao, Qiong
  Zhang, Stan Matwin, and Xiaodan Zhu. 2020b.
  SemEval-2020 task 5: Counterfactual recognition.
  In Proceedings of the Fourteenth Workshop on Se-
  mantic Evaluation, pages 322–335, Barcelona (on-
  line). International Committee for Computational
  Linguistics.

Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Car-
  bonell, Russ R Salakhutdinov, and Quoc V Le. 2019.
  Xlnet: Generalized autoregressive pretraining for
  language understanding. In Advances in neural in-
  formation processing systems, pages 5753–5763.

Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He,
  Alex Smola, and Eduard Hovy. 2016. Hierarchi-
  cal attention networks for document classification.
  In Proceedings of the 2016 conference of the North

                                                         316
You can also read