Natural Language Rationales with Full-Stack Visual Reasoning: From Pixels to Semantic Frames to Commonsense Graphs

Ana Marasović†♦  Chandra Bhagavatula†  Jae Sung Park♦†  Ronan Le Bras†  Noah A. Smith♦†  Yejin Choi♦†

†Allen Institute for Artificial Intelligence
♦Paul G. Allen School of Computer Science & Engineering, University of Washington, Seattle, WA, USA

{anam,chandrab,jamesp,ronanlb}@allenai.org
{jspark96,nasmith,yejin}@cs.washington.edu

Abstract

Natural language rationales could provide intuitive, higher-level explanations that are easily understandable by humans, complementing the more broadly studied lower-level explanations based on gradients or attention weights. We present the first study focused on generating natural language rationales across several complex visual reasoning tasks: visual commonsense reasoning, visual-textual entailment, and visual question answering. The key challenge of accurate rationalization is comprehensive image understanding at all levels: not just explicit content at the pixel level, but contextual content at the semantic and pragmatic levels. We present RATIONALEVT TRANSFORMER, an integrated model that learns to generate free-text rationales by combining pretrained language models with object recognition, grounded visual semantic frames, and visual commonsense graphs. Our experiments show that the base pretrained language model benefits from visual adaptation and that free-text rationalization is a promising research direction to complement model interpretability for complex visual-textual reasoning tasks.

1 Introduction

Explanatory models based on natural language rationales could provide intuitive, higher-level explanations that are easily understandable by humans (Miller, 2019). In Figure 1, for example, the natural language rationale given in free text provides a much more informative and conceptually relevant explanation of the given QA problem compared to the non-linguistic explanations that are often provided as localized visual highlights on the image. The latter, while pertinent to what the vision component of the model was attending to, cannot provide the full scope of rationales for such complex reasoning tasks as illustrated in Figure 1. Indeed, explanations for higher-level conceptual reasoning can be best conveyed through natural language, as has been studied in the literature on (visual) NLI (Do et al., 2020; Camburu et al., 2018), (visual) QA (Wu and Mooney, 2019; Rajani et al., 2019), arcade games (Ehsan et al., 2019), fact checking (Atanasova et al., 2020), image classification (Hendricks et al., 2018), motivation prediction (Vondrick et al., 2016), algebraic word problems (Ling et al., 2017), and self-driving cars (Kim et al., 2018).

Figure 1: An illustrative example showing that explaining higher-level conceptual reasoning cannot be well conveyed only through the attribution of raw input features (individual pixels or words); we need natural language.

In this paper, we present the first focused study on generating natural language rationales across several complex visual reasoning tasks: visual commonsense reasoning, visual-textual entailment, and visual question answering. Our study aims to complement the more broadly studied lower-level explanations such as attention weights and gradients in deep neural networks (Simonyan et al., 2014; Zhang et al., 2017; Montavon et al., 2018, among others). Because free-text rationalization is a challenging research question, we assume the gold answer for a given instance is given and scope our investigation to justifying the gold answer.

The key challenge in our study is that accurate rationalization requires comprehensive image understanding at all levels: not just the basic content at the pixel level (recognizing "waitress", "pancakes", "people" at the table in Figure 1), but the contextual content at the semantic level (understanding the structural relations among objects and entities through action predicates such as "delivering" and "pointing to") as well as at the pragmatic level (understanding that the "intent" of the pointing action is to tell the waitress who ordered the pancakes).

We present RATIONALEVT TRANSFORMER, an integrated model that learns to generate free-text rationales by combining pretrained language models based on GPT-2 (Radford et al., 2019) with visual features. Besides commonly used features derived from object detection (Fig. 2a), we explore two new types of visual features to enrich base models with semantic and pragmatic knowledge: (i) visual semantic frames, i.e., the primary activity and the entities engaged in it detected by a grounded situation recognizer (Fig. 2b; Pratt et al., 2020), and (ii) commonsense inferences inferred from an image and an optional event predicted from a visual commonsense graph (Fig. 2c; Park et al., 2020).1

We report comprehensive experiments with careful analysis using three datasets with human rationales: (i) visual question answering in VQA-E (Li et al., 2018), (ii) visual-textual entailment in E-SNLI-VE (Do et al., 2020), and (iii) an answer justification subtask of visual commonsense reasoning in VCR (Zellers et al., 2019a). Our empirical findings demonstrate that while free-text rationalization remains a challenging task, newly emerging state-of-the-art models support rationale generation as a promising research direction to complement model interpretability for complex visual-textual reasoning tasks. In particular, we find that integration of richer semantic and pragmatic visual knowledge can lead to generating rationales with higher visual fidelity, especially for tasks that require higher-level concepts and richer background knowledge.

Our code, model weights, and the templates used for human evaluation are publicly available.2

1 Figures 2a–2c are taken and modified from Zellers et al. (2019a), Pratt et al. (2020), and Park et al. (2020), respectively.
2 https://github.com/allenai/visual-reasoning-rationalization
2 Rationale Generation with RATIONALEVT TRANSFORMER

Our approach to visual-textual rationalization is based on augmenting GPT-2's input with the output of external vision models that enable different levels of visual understanding.

2.1 Background: Conditional Text Generation

GPT-2's backbone architecture is a decoder-only Transformer (Vaswani et al., 2017) that is pretrained with the conventional language modeling (LM) likelihood objective.3 This makes it more suitable for generation tasks than models trained with the masked LM objective (BERT; Devlin et al., 2019).4

We build on pretrained LMs because their capabilities make free-text rationalization of complex reasoning tasks conceivable. They strongly condition on the preceding tokens, produce coherent and contentful text (See et al., 2019), and, importantly, capture some commonsense and world knowledge (Davison et al., 2019; Petroni et al., 2019).

To induce conditional text generation behavior, Radford et al. (2019) propose to add the context tokens (e.g., question and answer) before a special token that marks the start of generation. But for visual-textual tasks, rationale generation has to be conditioned not only on textual context, but also on an image.

3 Sometimes referred to as density estimation, or left-to-right or autoregressive LM (Yang et al., 2019).
4 See Appendix §A.1 for other details of GPT-2.
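To make this conditioning scheme concrete, the following minimal sketch (ours, not the authors' released implementation) serializes the textual context before a begin-of-rationale marker and lets GPT-2 continue the sequence; the special-token names and sampling settings are illustrative assumptions.

```python
# Minimal sketch of text-only conditional rationale generation with GPT-2.
# Special-token names (<question>, <answer>, <rationale>) are illustrative
# assumptions, not the exact tokens used in the released code.
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.add_special_tokens(
    {"additional_special_tokens": ["<question>", "<answer>", "<rationale>"]}
)
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.resize_token_embeddings(len(tokenizer))  # make room for the new tokens

# Context tokens (question and gold answer) are placed before the special
# token that signals the start of rationale generation.
context = (
    "<question> Why is person1 pointing at person2? "
    "<answer> He is telling person3 that person2 ordered the pancakes. "
    "<rationale>"
)
input_ids = tokenizer.encode(context, return_tensors="pt")
output_ids = model.generate(
    input_ids, max_length=input_ids.shape[1] + 30, do_sample=True, top_p=0.9
)
print(tokenizer.decode(output_ids[0][input_ids.shape[1]:]))
```

Before fine-tuning, the continuation is generic; the sections below describe how the input is extended with visual information so that the generated rationale can be grounded in the image.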
(a) Object Detection   (b) Grounded Situation Recognition   (c) Visual Commonsense Graph

Figure 2: An illustration of outputs of external vision models that we use to visually adapt GPT-2.

|                  | Object Detector                                              | Grounded Situation Recognizer                      | Visual Commonsense Graphs                                                              |
| Understanding    | Basic                                                        | Semantics                                          | Pragmatics                                                                             |
| Model name       | Faster R-CNN (Ren et al., 2015)                              | JSL (Pratt et al., 2020)                           | VISUALCOMET (Park et al., 2020)                                                        |
| Backbone         | ResNet-50 (He et al., 2016)                                  | RetinaNet (Lin et al., 2017), ResNet-50, LSTM      | GPT-2 (Radford et al., 2019)                                                           |
| Pretraining data | ImageNet (Deng et al., 2009)                                 | ImageNet, COCO                                     | OpenWebText (Gokaslan and Cohen, 2019)                                                 |
| Finetuning data  | COCO (Lin et al., 2014)                                      | SWiG (Pratt et al., 2020)                          | VCG (Park et al., 2020)                                                                |
| UNIFORM          | "non-person" object labels                                   | top activity and its roles                         | top-5 before, after, intent inferences                                                 |
| HYBRID           | Faster R-CNN's object boxes' representations and coordinates | JSL's role boxes' representations and coordinates  | VISUALCOMET's embedding for special tokens that signal the start of before, after, intent inference |

Table 1: Specifications of external vision models and their outputs that we use as features for visual adaptation.

2.2 Outline of Full-Stack Visual Understanding

We first outline the types of visual information and the associated external models that lead to full-stack visual understanding. Specifications of these models and the features that we use appear in Table 1.

Recognition-level understanding of an image begins with identifying the objects present within it. To this end, we use an object detector that predicts objects present in an image, their labels (e.g., "cup" or "chair"), bounding boxes, and the boxes' hidden representations (Fig. 2a).

The next step of recognition-level understanding is capturing relations between objects. A computer vision task that aims to describe such relations is situation recognition (Yatskar et al., 2016). We use a model for grounded situation recognition (Fig. 2b; Pratt et al., 2020) that predicts the most prominent activity in an image (e.g., "surfing"), the roles of entities engaged in the activity (e.g., "agent" or "tool"), the roles' bounding boxes, and the boxes' hidden representations.

The object detector and situation recognizer focus on recognition-level understanding. But visual understanding also requires attributing mental states such as beliefs, intents, and desires to the people participating in an image. In order to achieve this, we use the output of VISUALCOMET (Fig. 2c; Park et al., 2020), another GPT-2-based model that generates commonsense inferences, i.e., events before and after as well as people's intents, given an image and a description of an event present in the image.

2.3 Fusion of Visual and Textual Input

We now describe how we format the outputs of the external models (§2.2; Table 1) to augment GPT-2's input with visual information.

We explore two ways of extending the input. The first approach adds a vision model's textual output (e.g., object labels such as "food" and "table") before the textual context (e.g., question and answer). Since everything is textual, we can directly embed each token using GPT-2's embedding layer, i.e., by summing the corresponding token, segmentation, and position embeddings.5 We call this kind of input fusion UNIFORM.

5 The segment embeddings were (to the best of our knowledge) first introduced in Devlin et al. (2019) to separate input elements from different sources, in addition to the special separator tokens.
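For illustration, the sketch below shows one way to assemble a UNIFORM input, with object labels prepended to the question and answer; the segment-id convention here is our assumption rather than the released preprocessing code.

```python
# Sketch of UNIFORM fusion: the vision model's *textual* output (here, object
# labels) is prepended to the textual context, and every token is embedded by
# summing GPT-2's token, segment, and position embeddings. Segment-id values
# are illustrative assumptions.
import torch
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

vision_text = "person table food"   # object labels from the detector
context = "What is the waitress doing? She is delivering pancakes."

vision_ids = tokenizer.encode(vision_text)
context_ids = tokenizer.encode(" " + context)

input_ids = torch.tensor([vision_ids + context_ids])
# Segment ids separate vision-derived tokens (0) from the task context (1).
segment_ids = torch.tensor([[0] * len(vision_ids) + [1] * len(context_ids)])
# Object labels have no natural order among themselves, so they all receive
# position zero; the textual context keeps ordinary increasing positions.
position_ids = torch.tensor(
    [[0] * len(vision_ids) + list(range(1, len(context_ids) + 1))]
)

# The input representation of each token is then
#   token_embedding + segment_embedding + position_embedding,
# exactly as for text-only GPT-2, so no new fusion machinery is required.
print(input_ids.shape, segment_ids.shape, position_ids.shape)
```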
This is the simplest way to extend the input, but it is prone to propagation of errors from the external vision models. Therefore, we also explore using vision models' embeddings for regions-of-interest (RoI) in the image that show relevant entities.6 For each RoI, we sum its visual embedding (described later) with the three GPT-2 embeddings (token, segment, position) for a special "unk" token and pass the result to the following GPT-2 blocks.7 After all RoI embeddings, each following token (question, answer, rationale, separator tokens) is embedded similarly, by summing the three GPT-2 embeddings and a visual embedding of the entire image. We call this kind of input fusion HYBRID.

We train and evaluate our models with different fusion types and visual features separately to analyze where the improvements come from. We provide details of feature extraction in App. §A.4.

Visual Embeddings  We build visual embeddings from bounding boxes' hidden representations (the feature vector prior to the output layer) and the boxes' coordinates (the top-left and bottom-right coordinates, and the fraction of image area covered). We project bounding boxes' feature vectors as well as their coordinate vectors to the size of GPT-2 embeddings. We sum the projected vectors and apply layer normalization. We take a different approach for VISUALCOMET embeddings, since they are not related to regions-of-interest of the input image (see §2.2). In this case, as visual embeddings, we use the VISUALCOMET embeddings that signal the start of generating before, after, and intent inferences, and since there is no representation of the entire image, we do not add it to the question, answer, rationale, and separator tokens.

6 The entire image is also a region-of-interest.
7 Visual embeddings and object labels do not have a natural sequential order among each other, so we assign position zero to them.
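A minimal PyTorch sketch of this projection, reconstructed from the description above (dimensions and module names are assumptions, not the released code):

```python
# Sketch of the HYBRID visual embedding described above: project an RoI's
# hidden feature vector and its box coordinates to GPT-2's embedding size,
# sum them, and apply layer normalization. All dimensions are assumptions.
import torch
import torch.nn as nn

class VisualEmbedding(nn.Module):
    def __init__(self, roi_dim=2048, coord_dim=5, gpt2_dim=768):
        super().__init__()
        self.roi_proj = nn.Linear(roi_dim, gpt2_dim)      # box hidden features
        self.coord_proj = nn.Linear(coord_dim, gpt2_dim)  # x1, y1, x2, y2, area fraction
        self.layer_norm = nn.LayerNorm(gpt2_dim)

    def forward(self, roi_features, coords):
        return self.layer_norm(self.roi_proj(roi_features) + self.coord_proj(coords))

embed = VisualEmbedding()
roi_features = torch.randn(1, 3, 2048)       # 3 regions of interest
coords = torch.rand(1, 3, 5)                 # normalized box coords + area fraction
visual_embeds = embed(roi_features, coords)  # shape (1, 3, 768)

# Each RoI embedding is then summed with GPT-2's token, segment, and position
# embeddings for a special "unk" token before entering the transformer blocks.
```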
3 Experiments

For all experiments, we visually adapt and fine-tune the original GPT-2 with 117M parameters. We train our models using the language modeling loss computed on rationale tokens.8

8 See Table 7 (§A.5) for hyperparameter specifications.
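As an illustration, the sketch below computes the loss only over rationale tokens by masking the context positions, following the common HuggingFace convention that label positions set to -100 are ignored; the exact serialization and training loop in the released code may differ.

```python
# Sketch: compute the LM loss only on rationale tokens by setting the labels
# of context positions to -100, which the HuggingFace cross-entropy loss
# ignores. The context format here is an illustrative assumption.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

context = "question: What is the waitress doing? answer: Delivering pancakes."
rationale = " She is holding a plate of pancakes in front of the table."

context_ids = tokenizer.encode(context)
rationale_ids = tokenizer.encode(rationale)

input_ids = torch.tensor([context_ids + rationale_ids])
labels = torch.tensor([[-100] * len(context_ids) + rationale_ids])

loss = model(input_ids, labels=labels).loss  # LM loss over rationale tokens only
loss.backward()
```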
Tasks and Datasets  We consider the three tasks and datasets shown in Table 2. Models for VCR and VQA are given a question about an image, and they predict the answer from a candidate list. Models for visual-textual entailment are given an image (that serves as a premise) and a textual hypothesis, and they predict an entailment label between them. The key difference among the three tasks is the level of required visual understanding.

| Dataset                   | Task                                               | Train    | Dev     | Expected Visual Understanding                   |
| VCR (Zellers et al., 2019a)   | visual commonsense reasoning (question answering) | 212,923  | 26,534  | higher-order cognitive, commonsense, recognition |
| E-SNLI-VE (Do et al., 2020)   | visual-textual entailment                          | 511,112† | 17,133† | higher-order cognitive, recognition              |
| ¬NEUTRAL                      |                                                    | 341,095  | 13,670  |                                                  |
| VQA-E (Li et al., 2018)       | visual question answering                          | 181,298  | 88,488  | recognition                                      |

Table 2: Specifications of the datasets we use for rationale generation. Do et al. (2020) re-annotate the SNLI-VE dev and test splits due to the high labelling error of the neutral class (Vu et al., 2018). Given the remaining errors in the training split, we generate rationales only for entailment and contradiction examples. †Do et al. (2020) report 529,527 and 17,547 training and validation examples, but the available data with explanations is smaller.

We report here the main observations about how the datasets were collected; details are in the Appendix §A.2. Foremost, only VCR rationales are human-written for a given problem instance. Rationales in VQA-E are extracted from image captions relevant to question-answer pairs (Goyal et al., 2017) using a constituency parse tree. To create a dataset for explaining visual-textual entailment, E-SNLI-VE, Do et al. (2020) combined the SNLI-VE dataset (Xie et al., 2019) for visual-textual entailment and the E-SNLI dataset (Camburu et al., 2018) for explaining textual entailment.

We notice that this methodology introduced a data collection artifact for entailment cases. To illustrate this, consider the example in Figure 3. In visual-textual entailment, the premise is the image. Therefore, there is no reason to expect that a model will build a rationale around a word that occurs in the textual premise it has never seen ("jumping"). We will test whether models struggle with entailment cases.

Hypothesis: A dog plays with a tennis ball.
Label: Entailment.
Rationale: A dog jumping is how he plays.
Textual premise: A brown dog is jumping after a tennis ball.

Figure 3: An illustrative example of the entailment artifact in E-SNLI-VE.

Human Evaluation  For evaluating our models, we follow Camburu et al. (2018), who show that BLEU (Papineni et al., 2002) is not reliable for the evaluation of rationale generation, and hence use human evaluation.9 We believe that other automatic sentence similarity measures are also likely unsuitable for a similar reason: multiple rationales could be plausible, although not necessarily paraphrases of each other (e.g., in Figure 4 both the generated and human rationales are plausible, but they are not strict paraphrases).10 Future work might consider newly emerging learned evaluation measures, such as BLEURT (Sellam et al., 2020), that could learn to capture non-trivial semantic similarities between sentences beyond surface overlap.

We use Amazon Mechanical Turk to crowdsource human judgments of generated rationales according to different criteria. Our instructions are provided in the Appendix §A.6. For VCR, we randomly sample one QA pair for each movie in the development split of the dataset, resulting in 244 examples for human evaluation. For VQA and E-SNLI-VE, we randomly sample 250 examples from their development splits.11 We did not use any of these samples to tune any of our hyperparameters. Each generation was evaluated by 3 crowdworkers. The workers were paid ∼$13 per hour.

Baselines  The main objectives of our evaluation are to assess whether (i) the proposed visual features help GPT-2 generate rationales that better support a given answer or entailment label (visual plausibility), and whether (ii) models that generate more plausible rationales are less likely to mention content that is irrelevant to a given image (visual fidelity). As a result, a text-only GPT-2 approach represents a meaningful baseline to compare to.

In light of work exposing predictive data artifacts (e.g., Gururangan et al., 2018), we estimate the effect of artifacts by reporting the difference between the visual plausibility of the text-only baseline and the plausibility of its rationales assessed without looking at the image (textual plausibility). If both are high, then there are problematic lexical cues in the datasets. Finally, we report the estimated plausibility of human rationales to gauge what has been solved and what is next.12

3.1 Visual Plausibility

We ask workers to judge whether a rationale supports a given answer or entailment label in the context of the image (visual plausibility). They could select a label from {yes, weak yes, weak no, no}. We later merge weak yes and weak no into yes and no, respectively. We then calculate the ratio of yes labels for each rationale and report the average ratio in a sample.13

We compare the text-only GPT-2 with its visual adaptations in Table 3. We observe that GPT-2's visual plausibility benefits from some form of visual adaptation for all tasks. The improvement is most visible for VQA-E, followed by VCR, and then E-SNLI-VE (all). We suspect that the minor improvement for E-SNLI-VE is caused by the entailment-data artifact. Thus, we also report the visual plausibility for entailment and contradiction cases separately. The results for contradiction hypotheses follow the trend observed for VCR and VQA-E. In contrast, visual adaptation does not help rationalization of entailed hypotheses. These findings, together with the fact that we have already discarded neutral hypotheses due to the high error rate, raise concern about the E-SNLI-VE dataset.

9 This is based on a low inter-annotator BLEU score between three human rationales for the same NLI example.
10 In Table 8 (§A.5), we report automatic captioning measures for the best RATIONALEVT TRANSFORMER for each dataset. These results should be used only for reproducibility and not as measures of rationale plausibility.
11 The size of the evaluation sample is a general problem of generation evaluation, since human evaluation is crucial but expensive. Still, we evaluate ∼2.5 more instances per each of 24 dataset-model combinations than related work (Camburu et al., 2018; Do et al., 2020; Narang et al., 2020), and each instance is judged by 3 workers.
12 Plausibility of human-written rationales is estimated from our evaluation samples.
13 We follow related work (Camburu et al., 2018; Do et al., 2020; Narang et al., 2020) in using yes/no judgments. We introduced the weak labels because they help in evaluating cases with a slight deviation from a clear-cut judgment.
| Model                                | VCR   | E-SNLI-VE Contradiction | E-SNLI-VE Entailment | E-SNLI-VE All | VQA-E |
| Baseline                             | 53.14 | 46.85                   | 46.76                | 46.80         | 47.20 |
| UNIFORM: Object labels               | 54.92 | 58.56                   | 36.45                | 46.27         | 54.40 |
| UNIFORM: Situation frames            | 56.97 | 59.16                   | 38.13                | 47.47         | 50.93 |
| UNIFORM: VISCOMET text inferences    | 60.93 | 53.75                   | 29.26                | 40.13         | 53.47 |
| HYBRID: Object regions               | 47.40 | 60.96                   | 34.05                | 46.00         | 59.07 |
| HYBRID: Situation roles regions      | 47.95 | 51.95                   | 37.65                | 44.00         | 63.33 |
| HYBRID: VISCOMET embeddings          | 59.84 | 48.95                   | 32.13                | 39.60         | 54.93 |
| Human (estimate)                     | 87.16 | 80.78                   | 76.98                | 78.67         | 66.53 |

Table 3: Visual plausibility of random samples of generated and human (gold) rationales. Our baseline is text-only GPT-2. The best model is boldfaced.

Henceforth, we report entailment and contradiction separately, and focus on contradiction when discussing results. We illustrate rationales produced by RATIONALEVT TRANSFORMER in Figure 4, and provide additional analyses in the Appendix §B.

3.2 Effect of Visual Features

We motivate the different visual features with varying levels of visual understanding (§2.2). We reflect on our assumptions about them in light of the visual plausibility results in Table 3. We observe that VISUALCOMET, designed to help attribute mental states, indeed results in the most plausible rationales for reasoning in VCR, which requires higher-order cognitive and commonsense understanding. We propose situation frames to understand relations between objects, which in turn can result in better recognition-level understanding. Our results show that situation frames are the second-best option for VCR and the best for VQA, which supports our hypothesis. The best option for E-SNLI-VE (contradiction) is HYBRID fusion of objects, although UNIFORM situation fusion is comparable. Moreover, VISUALCOMET is less helpful for E-SNLI-VE compared to objects and situation frames. This suggests that visual-textual entailment in E-SNLI-VE is perhaps focused on recognition-level understanding more than anticipated.

One fusion type does not dominate across datasets (see an overview in Table 9 in the Appendix §B). We hypothesize that the source domain of the pretraining dataset of the vision models as well as their precision can influence which type of fusion works better. A similar point was recently raised by Singh et al. (2020). Future work might consider carefully combining both fusion types and multiple visual features.

| Model                           | VCR   | E-SNLI-VE (contrad.) | E-SNLI-VE (entail.) | VQA-E |
| Baseline                        | 70.63 | 74.47                | 46.28               | 68.27 |
| UNIFORM: Object labels          | 75.14 | 77.48                | 37.17               | 64.80 |
| UNIFORM: Situation frames       | 74.18 | 72.37                | 38.61               | 61.73 |
| UNIFORM: VISCOMET text          | 73.91 | 71.17                | 32.37               | 69.73 |
| HYBRID: Object regions          | 69.26 | 69.97                | 33.81               | 68.53 |
| HYBRID: Situation roles regions | 69.81 | 62.46                | 38.61               | 69.47 |
| HYBRID: VISCOMET embd.          | 81.15 | 74.77                | 32.37               | 76.53 |
| Human (estimate)                | 90.71 | 79.58                | 74.34               | 64.27 |

Table 4: Plausibility of random samples of human (gold) and generated rationales assessed without looking at the image (textual plausibility). The best model is boldfaced.

3.3 Textual Plausibility

It has been shown that powerful pretrained LMs can reason about textual input well in current benchmarks (e.g., Zellers et al., 2019b; Khashabi et al., 2020). In our case, that would be illustrated by a high plausibility of generated rationales in an evaluation setting where workers are instructed to ignore images (textual plausibility).

We report textual plausibility in Table 4. Text-only GPT-2 achieves high textual plausibility (relative to the human estimate) for all tasks (except the entailment part of E-SNLI-VE), demonstrating good reasoning capabilities of GPT-2 when the context image is ignored for the plausibility assessment. This result also verifies our hypothesis that generating a textually plausible rationale is easier for models than producing a visually plausible rationale.
For example, GPT-2 can likely produce many statements that contradict "the woman is texting" (see Figure 4), but producing a visually plausible rationale requires conquering another challenge: capturing what is present in the image.

If both the textual and visual plausibility of the text-only GPT-2 were high, that would indicate there are lexical cues in the datasets that allow models to ignore the context image. The decrease in plausibility once the image is shown (cf. Tables 3 and 4) confirms that the text-only baseline is not able to generate visually plausible rationales by fitting lexical cues.

We notice another interesting result: the textual plausibility of visually adapted models is higher than the textual plausibility of the text-only GPT-2. The following three insights together suggest why this could be the case: (i) the gap between the textual plausibility of generated and human rationales shows that generating textually plausible rationales is not solved, (ii) visual models produce rationales that are more visually plausible than the text-only baseline, and (iii) visually plausible rationales are usually textually plausible (see examples in Figure 4).

Figure 4: RATIONALEVT TRANSFORMER generations for VCR (top), E-SNLI-VE (contradiction; middle), and VQA-E (bottom). We use the best model variant for each dataset (according to the results in Table 3).

3.4 Plausibility of Human Rationales

The best performing models for VCR and E-SNLI-VE (contradiction) are still notably behind the estimated visual plausibility of human-written rationales (see Table 3). Moreover, the plausibility of human rationales is similar when evaluated in the context of the image (visual plausibility) and without the image (textual plausibility) because (i) data annotators produce visually plausible rationales since they have accurate visual understanding, and (ii) visually plausible rationales are usually textually plausible. These results show that generating visually plausible rationales for VCR and E-SNLI-VE is still challenging even for our best models.

In contrast, we seem to be closing the gap for VQA-E. In addition, due in part to the automatic extraction of rationales, the human rationales in VQA-E suffer from a notably lower estimate of plausibility.

3.5 Visual Fidelity

We investigate further whether the visual plausibility improvements come from better visual understanding. We ask workers to judge if the rationale mentions content unrelated to the image, i.e., anything that is not directly visible and is unlikely to be present in the scene in the image. They could select a label from {yes, weak yes, weak no, no}. We later merge weak yes and weak no into yes and no, respectively. We then calculate the ratio of no labels for each rationale. The final fidelity score is the average ratio in a sample.14

14 We also study assessing fidelity from phrases that are extracted from a rationale (see Appendix §B).
Figure 5: The relation between visual plausibility (§3.1) and visual fidelity (§3.5). We denote UNIFORM fusion
with (U) and HYBRID fusion with (H).

| Model                        | VCR Plaus. | VCR Fid. | VCR r | E-SNLI-VE (contradict.) Plaus. | Fid.  | r    | E-SNLI-VE (entail.) Plaus. | Fid.  | r    | VQA-E Plaus. | Fid.  | r    |
| Baseline                     | 53.14      | 61.07    | 0.68  | 46.85                          | 44.74 | 0.53 | 46.76                      | 74.34 | 0.50 | 47.20        | 52.40 | 0.61 |
| UNIFORM: Object labels       | 54.92      | 60.25    | 0.73  | 58.56                          | 58.56 | 0.55 | 36.45                      | 67.39 | 0.58 | 54.40        | 63.47 | 0.54 |
| UNIFORM: Situation frames    | 56.97      | 62.43    | 0.78  | 59.16                          | 66.07 | 0.37 | 38.13                      | 72.90 | 0.51 | 50.93        | 61.07 | 0.53 |
| UNIFORM: VISCOMET text       | 60.93      | 70.22    | 0.62  | 53.75                          | 55.26 | 0.45 | 29.26                      | 73.14 | 0.49 | 53.47        | 64.27 | 0.66 |
| HYBRID: Object regions       | 47.40      | 54.37    | 0.67  | 60.96                          | 61.86 | 0.40 | 34.05                      | 74.34 | 0.31 | 59.07        | 69.87 | 0.53 |
| HYBRID: Situ. roles regions  | 47.95      | 54.92    | 0.66  | 51.95                          | 56.16 | 0.45 | 37.65                      | 70.50 | 0.59 | 63.33        | 71.47 | 0.62 |
| HYBRID: VISCOMET embd.       | 59.84      | 72.81    | 0.72  | 48.95                          | 54.05 | 0.48 | 32.13                      | 81.53 | 0.41 | 54.93        | 60.27 | 0.59 |
| Human (estimate)             | 87.16      | 91.67    | 0.58  | 80.78                          | 68.17 | 0.28 | 76.98                      | 88.49 | 0.43 | 66.53        | 89.20 | 0.35 |

Table 5: Visual plausibility (Table 3; §3.1), visual fidelity (§3.5), and Pearson's r that measures linear correlation between the visual plausibility and fidelity. Our baseline is text-only GPT-2. The best model is boldfaced and the second best underlined.

Figure 5 illustrates the relation between visual fidelity and plausibility. For each dataset (except the entailment part of E-SNLI-VE), we observe that visual plausibility increases as visual fidelity increases. We verify this with Pearson's r and show moderate linear correlation in Table 5. This shows that models that generate more visually plausible rationales are less likely to mention content that is irrelevant to a given image.
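For concreteness, the sketch below shows how per-rationale plausibility and fidelity scores of this form can be aggregated and correlated with Pearson's r; the worker labels are made up for illustration and do not reproduce the reported numbers.

```python
# Sketch: aggregate 3 worker judgments per rationale into plausibility and
# fidelity scores and measure their linear correlation with Pearson's r.
# The label data below is made up for illustration only.
import numpy as np
from scipy.stats import pearsonr

def score(judgments, positive):
    """Fraction of a rationale's judgments that count as positive."""
    return np.array([
        sum(label in positive for label in labels) / len(labels)
        for labels in judgments
    ])

# Each inner list holds one rationale's 3 worker labels ("weak yes"/"weak no"
# are merged into "yes"/"no" before this step).
plausibility_labels = [["yes", "yes", "no"], ["no", "no", "no"], ["yes", "yes", "yes"]]
fidelity_labels = [["no", "no", "yes"], ["yes", "yes", "no"], ["no", "no", "no"]]

plausibility = score(plausibility_labels, positive={"yes"})  # ratio of "yes"
fidelity = score(fidelity_labels, positive={"no"})           # ratio of "no"

print("visual plausibility:", plausibility.mean())
print("visual fidelity:", fidelity.mean())
print("Pearson's r:", pearsonr(plausibility, fidelity)[0])
```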
4 Related Work

Rationale Generation  Applications of rationale generation (see §1) can be categorized as text-only, vision-only, or visual-textual. Our work belongs to the final category, where we are the first to try to generate rationales for VCR (Zellers et al., 2019a). The bottom-up top-down attention (BUTD) model (Anderson et al., 2018) has been proposed to incorporate rationales with visual features for VQA-E and E-SNLI-VE (Li et al., 2018; Do et al., 2020). Compared to BUTD, we use a pretrained decoder and propose a wider range of visual features to tackle comprehensive image understanding.

Conditional Text Generation  Pretrained LMs have played a pivotal role in open-ended text generation and conditional text generation. For the latter, some studies trained an LM from scratch conditioned on metadata (Zellers et al., 2019c) or desired attributes of text (Keskar et al., 2019), while others fine-tuned an already pretrained LM on commonsense knowledge (Bhagavatula et al., 2020) or text attributes (Ziegler et al., 2019a). Our work belongs to the latter group, with a focus on conditioning on comprehensive image understanding.

Visual-Textual Language Models  There is a surge of work that proposes visual-textual pretraining of LMs by predicting masked image regions and tokens (Tan and Bansal, 2019; Lu et al., 2019; Chen et al., 2019, to name a few).
We construct the input elements of our models following the VL-BERT architecture (Su et al., 2020). Despite their success, these models are not suitable for generation due to pretraining with the masked LM objective. Zhou et al. (2020) aim to address that, but they pretrain their decoder from scratch using 3M images with weakly associated captions (Sharma et al., 2018). This makes their decoder arguably less powerful compared to LMs that are pretrained with remarkably more (and more diverse) data, such as GPT-2. Ziegler et al. (2019b) augment GPT-2 with a feature vector for the entire image and evaluate this model on image paragraph captioning. Some works extend pretrained LMs to learn video representations from sequences of visual features and words, and show improvements in video captioning (Sun et al., 2019a,b). Our work is based on fine-tuning GPT-2 with features that come from visual object recognition, grounded semantic frames, and visual commonsense graphs. The latter two features have not yet been explored in this line of work.

5 Discussion and Future Directions

Rationale Definition  The term interpretability is used to refer to multiple concepts. Due to this, criteria for explanation evaluation depend on one's definition of interpretability (Lipton, 2016; Doshi-Velez and Kim, 2017; Jacovi and Goldberg, 2020). In order to avoid problems arising from ambiguity, we reflect on our definition. We follow Ehsan et al. (2018), who define AI rationalization as a process of generating rationales of a model's behavior as if a human had performed the behavior.

Jointly Predicting and Rationalizing  We narrow our focus to improving generation models and assume gold labels for the end-task. Future work can extend our model to an end-to-end (Narang et al., 2020) or a pipeline model (Camburu et al., 2018; Rajani et al., 2019; Jain et al., 2020) for producing both predictions and natural language rationales. We expect that the explain-then-predict setting (Camburu et al., 2018) is especially relevant for rationalization of commonsense reasoning. In this case, relevant information is not in the input, but inferred from it, which makes extractive explanatory methods based on highlighting parts of the input unsuitable. A rationale generation model brings relevant information to the surface, which can be passed to a prediction model. This makes rationales intrinsic to the model, and tells the user what the prediction should be based on. Kumar and Talukdar (2020) highlight that this approach resembles post-hoc methods when the label and rationale are produced jointly (the end-to-end predict-then-explain setting). Thus, all but the pipeline predict-then-explain approach are suitable extensions of our models. A promising line of work trains end-to-end models for joint rationalization and prediction from weak supervision (Latcinnik and Berant, 2020; Shwartz et al., 2020), i.e., without human-written rationales.

Limitations  Natural language rationales are easily understood by lay users, who consequently feel more convinced and willing to use the model (Miller, 2019; Ribera and Lapedriza, 2019). Their limitation is that they can be used to persuade users that the model is reliable when it is not (Bansal et al., 2020), an ethical issue raised by Herman (2017). This relates to the pipeline predict-then-explain setting, where a predictor model and a post-hoc explainer model are completely independent. However, there are other settings where generated rationales are intrinsic to the model by design (end-to-end predict-then-explain, and both end-to-end and pipeline explain-then-predict). As such, generated rationales are more associated with the reasoning process of the model. We recommend that future work develop rationale generation in these settings, and aim for sufficiently faithful models as recommended by Jacovi and Goldberg (2020) and Wiegreffe and Pinter (2019).

6 Conclusions

We present RATIONALEVT TRANSFORMER, an integration of a pretrained text generator with semantic and pragmatic visual features. These features can improve the visual plausibility and fidelity of generated rationales for visual commonsense reasoning, visual-textual entailment, and visual question answering. This represents progress in tackling an important, but still relatively unexplored, research direction: rationalization of complex reasoning for which explanatory approaches based solely on highlighting parts of the input are not suitable.

Acknowledgments

The authors thank Sarah Pratt for her assistance with the grounded situation recognizer, Amandalynne Paullada, members of the AllenNLP team, and anonymous reviewers for helpful feedback.
This research was supported in part by NSF               Finale Doshi-Velez and Been Kim. 2017. Towards A
(IIS1524371, IIS-1714566), DARPA under the                  Rigorous Science of Interpretable Machine Learn-
                                                            ing. arXiv:1702.08608.
CwC program through the ARO (W911NF-15-1-
0543), DARPA under the MCS program through               Upol Ehsan, Brent Harrison, Larry Chan, and Mark O.
NIWC Pacific (N66001-19-2-4031), and gifts from            Riedl. 2018. Rationalization: A Neural Machine
Allen Institute for Artificial Intelligence.               Translation Approach to Generating Natural Lan-
                                                           guage Explanations. In AIES.

                                                         Upol Ehsan, Pradyumna Tambwekar, Larry Chan,
References                                                 Brent Harrison, and Mark O. Riedl. 2019. Auto-
Peter Anderson, Xiaodong He, Chris Buehler, Damien         mated rationale generation: a technique for explain-
Teney, Mark Johnson, Stephen Gould, and Lei Zhang. 2018. Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering. In CVPR.

Pepa Atanasova, Jakob Grue Simonsen, Christina Lioma, and Isabelle Augenstein. 2020. Generating Fact Checking Explanations. In ACL.

Gagan Bansal, Tongshuang Wu, Joyce Zhu, Raymond Fok, Besmira Nushi, Ece Kamar, Marco Túlio Ribeiro, and Daniel S. Weld. 2020. Does the Whole Exceed its Parts? The Effect of AI Explanations on Complementary Team Performance. arXiv:2006.14779.

Chandra Bhagavatula, Ronan Le Bras, Chaitanya Malaviya, Keisuke Sakaguchi, Ari Holtzman, Hannah Rashkin, Doug Downey, Scott Yih, and Yejin Choi. 2020. Abductive Commonsense Reasoning. In ICLR.

Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A Large Annotated Corpus for Learning Natural Language Inference. In EMNLP.

Oana-Maria Camburu, Tim Rocktäschel, Thomas Lukasiewicz, and Phil Blunsom. 2018. e-SNLI: Natural Language Inference with Natural Language Explanations. In NeurIPS.

Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, and Jingjing Liu. 2019. UNITER: Learning UNiversal Image-TExt Representations. In ECCV.

Joe Davison, Joshua Feldman, and Alexander Rush. 2019. Commonsense Knowledge Mining from Pretrained Models. In EMNLP.

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Fei-Fei Li. 2009. ImageNet: A Large-Scale Hierarchical Image Database. In CVPR.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In NAACL.

Virginie Do, Oana-Maria Camburu, Zeynep Akata, and Thomas Lukasiewicz. 2020. e-SNLI-VE-2.0: Corrected Visual-Textual Entailment with Natural Language Explanations. arXiv:2004.03744.

Upol Ehsan, Pradyumna Tambwekar, Larry Chan, Brent Harrison, and Mark O. Riedl. 2019. Automated Rationale Generation: A Technique for Explainable AI and its Effects on Human Perceptions. In IUI.

Sam Gehman, Suchin Gururangan, Maarten Sap, Yejin Choi, and Noah A. Smith. 2020. RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models. In EMNLP (Findings).

Aaron Gokaslan and Vanya Cohen. 2019. OpenWebText Corpus.

Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. 2017. Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering. In CVPR.

Suchin Gururangan, Swabha Swayamdipta, Omer Levy, Roy Schwartz, Samuel Bowman, and Noah A. Smith. 2018. Annotation Artifacts in Natural Language Inference Data. In NAACL.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep Residual Learning for Image Recognition. In CVPR.

Lisa Anne Hendricks, Ronghang Hu, Trevor Darrell, and Zeynep Akata. 2018. Grounding Visual Explanations. In ECCV.

Bernease Herman. 2017. The Promise and Peril of Human Evaluation for Model Interpretability. In Symposium on Interpretable Machine Learning @ NeurIPS.

Alon Jacovi and Yoav Goldberg. 2020. Towards Faithfully Interpretable NLP Systems: How Should We Define and Evaluate Faithfulness? In ACL.

Sarthak Jain, Sarah Wiegreffe, Yuval Pinter, and Byron C. Wallace. 2020. Learning to Faithfully Rationalize by Construction. In ACL.

Nitish Shirish Keskar, Bryan McCann, Lav R. Varshney, Caiming Xiong, and Richard Socher. 2019. CTRL: A Conditional Transformer Language Model for Controllable Generation. arXiv:1909.05858.

Daniel Khashabi, Tushar Khot, Ashish Sabharwal, Oyvind Tafjord, Peter Clark, and Hannaneh Hajishirzi. 2020. UnifiedQA: Crossing Format Boundaries With a Single QA System. In EMNLP (Findings).

Jinkyu Kim, Anna Rohrbach, Trevor Darrell, John F. Canny, and Zeynep Akata. 2018. Textual Explanations for Self-Driving Vehicles. In ECCV.
Sawan Kumar and Partha Talukdar. 2020. NILE: Natural Language Inference with Faithful Natural Language Explanations. In ACL.

Veronica Latcinnik and Jonathan Berant. 2020. Explaining Question Answering Models through Text Generation. arXiv:2004.05569.

Qing Li, Qingyi Tao, Shafiq R. Joty, Jianfei Cai, and Jiebo Luo. 2018. VQA-E: Explaining, Elaborating, and Enhancing Your Answers for Visual Questions. In ECCV.

Tsung-Yi Lin, Priyal Goyal, Ross B. Girshick, Kaiming He, and Piotr Dollár. 2017. Focal Loss for Dense Object Detection. In ICCV.

Tsung-Yi Lin, Michael Maire, Serge J. Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. 2014. Microsoft COCO: Common Objects in Context. In ECCV.

Wang Ling, Dani Yogatama, Chris Dyer, and Phil Blunsom. 2017. Program Induction by Rationale Generation: Learning to Solve and Explain Algebraic Word Problems. In ACL.

Zachary Chase Lipton. 2016. The Mythos of Model Interpretability. In WHI.

Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. 2019. ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks. In NeurIPS.

Tim Miller. 2019. Explanation in Artificial Intelligence: Insights from the Social Sciences. Artificial Intelligence, 267:1–38.

Grégoire Montavon, Wojciech Samek, and Klaus-Robert Müller. 2018. Methods for Interpreting and Understanding Deep Neural Networks. Digital Signal Processing, 73:1–15.

Sharan Narang, Colin Raffel, Katherine J. Lee, Adam Roberts, Noah Fiedel, and Karishma Malkan. 2020. WT5?! Training Text-to-Text Models to Explain their Predictions. arXiv:2004.14546.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a Method for Automatic Evaluation of Machine Translation. In ACL.

Jae Sung Park, Chandra Bhagavatula, Roozbeh Mottaghi, Ali Farhadi, and Yejin Choi. 2020. Visual Commonsense Graphs: Reasoning about the Dynamic Context of a Still Image. In ECCV.

Fabio Petroni, Tim Rocktäschel, Sebastian Riedel, Patrick Lewis, Anton Bakhtin, Yuxiang Wu, and Alexander Miller. 2019. Language Models as Knowledge Bases? In EMNLP.

Sarah Pratt, Mark Yatskar, Luca Weihs, Ali Farhadi, and Aniruddha Kembhavi. 2020. Grounded Situation Recognition. In ECCV.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language Models are Unsupervised Multitask Learners.

Nazneen Fatema Rajani, Bryan McCann, Caiming Xiong, and Richard Socher. 2019. Explain Yourself! Leveraging Language Models for Commonsense Reasoning. In ACL.

Shaoqing Ren, Kaiming He, Ross B. Girshick, and Jian Sun. 2015. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39:1137–1149.

Mireia Ribera and Àgata Lapedriza. 2019. Can We Do Better Explanations? A Proposal of User-Centered Explainable AI. In ACM IUI Workshop.

Anna Rohrbach, Atousa Torabi, Marcus Rohrbach, Niket Tandon, Christopher Joseph Pal, Hugo Larochelle, Aaron C. Courville, and Bernt Schiele. 2016. Movie Description. International Journal of Computer Vision, 123:94–120.

Abigail See, Aneesh Pappu, Rohun Saxena, Akhila Yerukola, and Christopher D. Manning. 2019. Do Massively Pretrained Language Models Make Better Storytellers? In CoNLL.

Thibault Sellam, Dipanjan Das, and Ankur Parikh. 2020. BLEURT: Learning Robust Metrics for Text Generation. In ACL.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural Machine Translation of Rare Words with Subword Units. In ACL.

Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. 2018. Conceptual Captions: A Cleaned, Hypernymed, Image Alt-text Dataset For Automatic Image Captioning. In ACL.

Emily Sheng, Kai-Wei Chang, Premkumar Natarajan, and Nanyun Peng. 2019. The Woman Worked as a Babysitter: On Biases in Language Generation. In EMNLP-IJCNLP.

Vered Shwartz, Peter West, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. 2020. Unsupervised Commonsense Question Answering with Self-Talk. In EMNLP.

Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. 2014. Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps. In ICLR Workshop Track.

Amanpreet Singh, Vedanuj Goswami, and Devi Parikh. 2020. Are We Pretraining It Right? Digging Deeper into Visio-Linguistic Pretraining. arXiv:2004.08744.

Weijie Su, Xizhou Zhu, Yue Cao, Bin Li, Lewei Lu, Furu Wei, and Jifeng Dai. 2020. VL-BERT: Pre-training of Generic Visual-Linguistic Representations. In ICLR.
Chen Sun, Fabien Baradel, Kevin Murphy, and Cordelia Schmid. 2019a. Contrastive Bidirectional Transformer for Temporal Representation Learning. arXiv:1906.05743.

Chen Sun, Austin Myers, Carl Vondrick, Kevin Murphy, and Cordelia Schmid. 2019b. VideoBERT: A Joint Model for Video and Language Representation Learning. In ICCV.

Hao Hao Tan and Mohit Bansal. 2019. LXMERT: Learning Cross-Modality Encoder Representations from Transformers. In EMNLP.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention Is All You Need. In NeurIPS.

Carl Vondrick, Deniz Oktay, Hamed Pirsiavash, and Antonio Torralba. 2016. Predicting Motivations of Actions by Leveraging Text. In CVPR.

Hoa Trong Vu, Claudio Greco, Aliia Erofeeva, Somayeh Jafaritazehjan, Guido Linders, Marc Tanti, Alberto Testoni, Raffaella Bernardi, and Albert Gatt. 2018. Grounded Textual Entailment. In COLING.

Eric Wallace, Shi Feng, Nikhil Kandpal, Matt Gardner, and Sameer Singh. 2019. Universal Adversarial Triggers for Attacking and Analyzing NLP. In EMNLP-IJCNLP.

Sarah Wiegreffe and Yuval Pinter. 2019. Attention is not not Explanation. In EMNLP.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, and Jamie Brew. 2019. HuggingFace's Transformers: State-of-the-Art Natural Language Processing. arXiv:1910.03771.

Jialin Wu and Raymond Mooney. 2019. Faithful Multimodal Explanation for Visual Question Answering. In BlackboxNLP @ ACL.

Ning Xie, Farley Lai, Derek Doran, and Asim Kadav. 2019. Visual Entailment: A Novel Task for Fine-Grained Image Understanding. arXiv:1901.06706.

Zhilin Yang, Zihang Dai, Yiming Yang, Jaime G. Carbonell, Ruslan Salakhutdinov, and Quoc V. Le. 2019. XLNet: Generalized Autoregressive Pretraining for Language Understanding. In NeurIPS.

Mark Yatskar, Luke Zettlemoyer, and Ali Farhadi. 2016. Situation Recognition: Visual Semantic Role Labeling for Image Understanding. In CVPR.

Peter Young, Alice Lai, Micah Hodosh, and Julia Hockenmaier. 2014. From Image Descriptions to Visual Denotations: New Similarity Metrics for Semantic Inference over Event Descriptions. TACL, 2:67–78.

Rowan Zellers, Yonatan Bisk, Ali Farhadi, and Yejin Choi. 2019a. From Recognition to Cognition: Visual Commonsense Reasoning. In CVPR.

Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. 2019b. HellaSwag: Can a Machine Really Finish Your Sentence? In ACL.

Rowan Zellers, Ari Holtzman, Hannah Rashkin, Yonatan Bisk, Ali Farhadi, Franziska Roesner, and Yejin Choi. 2019c. Defending Against Neural Fake News. In NeurIPS.

Jianming Zhang, Zhe Lin, Jonathan Brandt, Xiaohui Shen, and Stan Sclaroff. 2017. Top-Down Neural Attention by Excitation Backprop. International Journal of Computer Vision, 126:1084–1102.

Luowei Zhou, Hamid Palangi, Lefei Zhang, Houdong Hu, Jason J. Corso, and Jianfeng Gao. 2020. Unified Vision-Language Pre-Training for Image Captioning and VQA. In AAAI.

Daniel M. Ziegler, Nisan Stiennon, Jeffrey Wu, Tom B. Brown, Alec Radford, Dario Amodei, Paul Christiano, and Geoffrey Irving. 2019a. Fine-Tuning Language Models from Human Preferences. arXiv:1909.08593.

Zachary M. Ziegler, Luke Melas-Kyriazi, Sebastian Gehrmann, and Alexander M. Rush. 2019b. Encoder-Agnostic Adaptation for Conditional Language Generation. arXiv:1908.06938.
A   Experimental Setup

A.1   Details of GPT-2

Input to GPT-2 is text that is split into subtokens, also known as wordpieces or subwords (Sennrich et al., 2016). Each subtoken embedding is added to a so-called positional embedding that signals the order of the subtokens in the sequence to the transformer blocks. GPT-2's pretraining corpus is the OpenWebText corpus (Gokaslan and Cohen, 2019), which consists of 8 million Web documents extracted from URLs shared on Reddit. Pretraining on this corpus has caused degenerate and biased behaviour in GPT-2 (Sheng et al., 2019; Wallace et al., 2019; Gehman et al., 2020, among others). Our models likely have the same issues since they are built on GPT-2.
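For concreteness, the snippet below is a minimal sketch of this subtokenization step, assuming the HuggingFace transformers GPT-2 tokenizer; the example sentence is arbitrary and the snippet is illustrative rather than our exact preprocessing code.

from transformers import GPT2Tokenizer

# Byte-pair-encoding (BPE) tokenizer matching the 117M GPT-2 checkpoint.
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

text = "Two people are dining in a restaurant."
subtokens = tokenizer.tokenize(text)                    # BPE subtokens (wordpieces)
input_ids = tokenizer.convert_tokens_to_ids(subtokens)

# Inside the model, the embedding of each id is added to the positional
# embedding at the same index, which encodes the subtoken's position.
positions = list(range(len(input_ids)))
print(subtokens)
print(input_ids, positions)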
A.2   Details of Datasets with Human Rationales

We obtain the data from the following links:
   • https://visualcommonsense.com/download/
   • https://github.com/virginie-do/e-SNLI-VE
   • https://github.com/liqing-ustc/VQA-E

Answers in VCR are full sentences, and in VQA single words or short phrases. All annotations in VCR are authored by crowdworkers in a single data collection phase. Rationales in VQA-E are extracted from relevant image captions for question-answer pairs in VQA v2 (Goyal et al., 2017) using a constituency parse tree. The overall quality of VQA-E rationales is rated 4.23/5.0 by human judges.

The e-SNLI-VE dataset is constructed through a series of additions and changes to the SNLI dataset for textual entailment (Bowman et al., 2015). The SNLI dataset was collected by using captions in Flickr30k (Young et al., 2014) as textual premises and crowdsourcing hypotheses; the captions tend to be literal scene descriptions. The e-SNLI dataset (Camburu et al., 2018) adds crowdsourced explanations to SNLI. The SNLI-VE dataset (Xie et al., 2019) for visual-textual entailment is constructed from SNLI by replacing textual premises with the corresponding Flickr30k images. Finally, Do et al. (2020) combine SNLI-VE and e-SNLI to produce a dataset for explaining visual-textual entailment. They re-annotate the dev and test splits due to the high labelling error of the neutral class in SNLI-VE reported by Vu et al. (2018).

A.3   Details of External Vision Models

In Table 6, we report the sources of the images that were used to train the external vision models and of the images in the end-task datasets.

A.4   Details of Input Elements

Object Detector   For UNIFORM fusion, we use labels for objects other than people, because the person label occurs in every VCR example. We use only a single instance of each object label, because repeating the same label gives the model no new information. The maximum number of subtokens for the merged object labels is determined by merging all object labels, tokenizing them into subtokens, and setting the maximum to the length at the ninety-ninth percentile calculated from the VCR training set. For HYBRID fusion, we use the hidden representations of all objects, because they differ across detections of objects with the same label. These representations come from the feature vector prior to the output layer of the detection model. The maximum number of objects is set to the object count at the ninety-ninth percentile calculated from the VCR training set.
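As an illustration of how such a percentile-based length cap can be computed, here is a minimal sketch; the two toy examples stand in for the full VCR training set, and this is not our exact preprocessing code.

import numpy as np
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

# Toy stand-in for the VCR training set: each example carries its detected object labels.
vcr_train = [
    {"object_labels": ["person", "person", "chair", "cup", "chair"]},
    {"object_labels": ["person", "dog", "frisbee"]},
]

def merged_label_length(object_labels):
    # Keep a single instance per label and drop "person", which occurs in every VCR example.
    unique_labels = sorted(set(object_labels) - {"person"})
    return len(tokenizer.tokenize(" ".join(unique_labels)))

lengths = [merged_label_length(ex["object_labels"]) for ex in vcr_train]
# With the real training set, this yields the cap reported in Table 7 (30 subtokens).
max_merged_object_labels_length = int(np.percentile(lengths, 99))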
Situation Recognizer   For UNIFORM fusion, we consider only the best verb, because the top verbs are often semantically similar (e.g., eating and dining; see Figure 13 in Pratt et al. (2020) for more examples). We define a structured format for the output of the situation recognizer, in which the predicted verb and each role-value pair are wrapped in special separator tokens. For example, the situation predicted from the first image in Figure 4 is mapped to a structure containing the verb dining, the agent people, and the place restaurant. We set the maximum situation length to the length at the ninety-ninth percentile calculated from the VCR training set.

VisualComet   The input to VisualComet is an image, question, and answer for VCR and VQA-E, and only the image for e-SNLI-VE. Unlike situation frames, the top-k VisualComet inferences are diverse. We merge the top-5 before, after, and intent inferences. We calculate the length of the merged inferences in number of subtokens and set the maximum VisualComet length to the length at the ninety-ninth percentile calculated from the VCR training set.
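To make the structured situation format more concrete, here is a minimal sketch of serializing a predicted frame into a flat string; the separator-token names are placeholders chosen for illustration, not our exact special tokens.

def situation_to_string(verb, roles):
    # Wrap the best verb and each role-value pair in special separator tokens
    # (placeholder names); the resulting string is fed to the generator as text.
    parts = ["<|b_situ|>", "<|b_verb|>", verb, "<|e_verb|>"]
    for role, value in roles.items():
        parts += [f"<|b_{role}|>", value, f"<|e_{role}|>"]
    parts.append("<|e_situ|>")
    return " ".join(parts)

# E.g., the situation predicted for the first image in Figure 4:
print(situation_to_string("dining", {"agent": "people", "place": "restaurant"}))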
Dataset        Image Source
COCO           Flickr
e-SNLI-VE      Flickr (SNLI; Bowman et al., 2015)
ImageNet       different search engines
SWiG           Google Search (imSitu; Yatskar et al., 2016)
VCG, VCR       movie clips (Rohrbach et al., 2016), Fandango†
VQA-E          Flickr (COCO)

Table 6: Image sources. † https://www.youtube.com/user/movieclips

A.5   Training Details

We use the original GPT-2 version with 117M parameters. It consists of 12 layers with 12 attention heads each, and a model dimension of 768. We report the other hyperparameters in Table 7. All of them are chosen manually due to the reliance on human evaluation. In Table 8, for reproducibility, we report captioning measures of the best RationaleVT Transformer variants. Our implementation uses the HuggingFace transformers library (Wolf et al., 2019; https://github.com/huggingface/transformers).

A.6   Crowdsourcing Human Evaluation

We perform human evaluation of the generated rationales through crowdsourcing on the Amazon Mechanical Turk platform. Here is the full set of guidelines provided to workers:
   • First, you will be shown (i) a Question, (ii) an Answer (presumed correct), and (iii) a Rationale. You'll have to judge if the rationale supports the answer.
   • Next, you will be shown the same question, answer, and rationale, together with an associated image. You'll have to judge if the rationale supports the answer in the context of the given image.
   • You'll judge the grammaticality of the rationale. Please ignore the absence of periods, punctuation, and case.
   • Next, you'll have to judge if the rationale mentions persons, objects, locations, or actions unrelated to the image—i.e., things that are not directly visible and are unlikely to be present in the scene in the image.
   • Finally, you'll pick the NOUNS, NOUN PHRASES, and VERBS from the rationale that are unrelated to the image.

We also provide the following additional tips:
   • Please ignore minor grammatical errors—e.g., case sensitivity, missing periods, etc.
   • Please ignore gender mismatch—e.g., if the image shows a male but the rationale mentions a female.
   • Please ignore inconsistencies between the person and object detections in the QUESTION/ANSWER and those in the image—e.g., if a pile of papers is labeled as a laptop in the image. Do not ignore such inconsistencies for the rationale.
   • When judging the rationale, think about whether it is plausible.
   • If the rationale just repeats the answer, it is not considered a valid justification for the answer.

B   Additional Results

We provide the following additional results that complement the discussion in Section 3:
   • a comparison between UNIFORM and HYBRID fusion in Table 9,
   • an investigation of fine-grained visual fidelity in Table 11,
   • additional analysis of RationaleVT Transformer to support future developments.

Fine-Grained Visual Fidelity   At the time of running the human evaluation, we did not know whether judging visual fidelity is a hard task for workers. To help them focus on the relevant parts of a given rationale and to make their judgments more comparable, we give workers a list of nouns, noun phrases, and verb phrases with negation but without adjuncts, and ask them to pick the phrases that are unrelated to the image. For each rationale, we calculate the ratio of relevant nouns to all extracted nouns. We call this “entity fidelity” because the extracted nouns are mostly concrete (as opposed to abstract).
Computing Infrastructure    Quadro RTX 8000 GPU
Model implementation        https://github.com/allenai/visual-reasoning-rationalization

Hyperparameter                                          Assignment
number of epochs                                        5
batch size                                              32
learning rate                                           5e-5
max question length                                     19
max answer length                                       23
max rationale length                                    50
max merged object labels length                         30
max situation's structured description length           17
max VisualComet merged text inferences length           148
max input length                                        93, 98, 123, 102, 112, 241
max objects embeddings number                           28
max situation role embeddings number                    7
dimension of object and situation role embeddings       2048
decoding                                                greedy

Table 7: Hyperparameters for RationaleVT Transformer. Lengths are calculated in numbers of subtokens, including the special separator tokens for a given input type (e.g., the begin and end separator tokens for a question). We calculate the maximum input length by summing the maximum lengths of the input elements for each model separately. A training epoch takes ∼30 minutes for the models with shorter maximum input lengths and ∼2 hours for the model with the longest input.
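As a rough illustration of the configuration in Table 7, the sketch below loads the 117M GPT-2 checkpoint with HuggingFace transformers and decodes greedily; the prompt string is a placeholder, the optimizer choice is an assumption, and the training loop itself is omitted.

import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")   # 117M GPT-2: 12 layers, 12 heads, d_model = 768
model = GPT2LMHeadModel.from_pretrained("gpt2")

# Optimization settings from Table 7 (training loop not shown; AdamW is an assumption).
num_epochs, batch_size, learning_rate = 5, 32, 5e-5
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

# Greedy decoding of a rationale continuation, capped at 50 generated subtokens.
context = "question <sep> answer <sep>"             # placeholder, not our actual input format
context_ids = tokenizer.encode(context, return_tensors="pt")
output_ids = model.generate(context_ids,
                            max_length=context_ids.shape[1] + 50,
                            do_sample=False)        # greedy decoding, as in Table 7
print(tokenizer.decode(output_ids[0]))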

Similarly, we calculate “entity detail fidelity” from the noun phrase judgments and “action fidelity” from the verb phrase judgments. Results in Table 11 show a close relation between the overall fidelity judgment and entity fidelity. Furthermore, for the case where the top two models have close fidelity (the VisualComet models for VCR), the fine-grained analysis shows where the difference comes from (in this case, from action fidelity). Despite the possible advantages of fine-grained fidelity, we observe that it is less correlated with plausibility than the overall fidelity.
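A minimal sketch of how these per-rationale ratios can be computed from the worker selections follows; the input lists are hypothetical stand-ins for the extracted phrases and for the subset a worker marked as unrelated.

def fidelity(extracted_phrases, marked_unrelated):
    # Share of extracted phrases that were NOT marked as unrelated to the image.
    if not extracted_phrases:
        return None  # undefined if nothing was extracted from the rationale
    unrelated = set(marked_unrelated)
    related = [p for p in extracted_phrases if p not in unrelated]
    return len(related) / len(extracted_phrases)

# Hypothetical worker annotations for one rationale.
nouns, unrelated_nouns = ["table", "waiter", "menu"], ["menu"]
noun_phrases, unrelated_noun_phrases = ["a crowded restaurant", "a red menu"], ["a red menu"]
verb_phrases, unrelated_verb_phrases = ["is taking an order"], []

entity_fidelity = fidelity(nouns, unrelated_nouns)                       # 2/3
entity_detail_fidelity = fidelity(noun_phrases, unrelated_noun_phrases)  # 1/2
action_fidelity = fidelity(verb_phrases, unrelated_verb_phrases)         # 1.0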
                                                               prove generations such that they repeat textual con-
Additional Analysis   We ask workers to judge the grammaticality of rationales. We instruct them to ignore some mistakes, such as the absence of periods and mismatched gender (see §A.6). Table 10 shows that the ratio of grammatical rationales is high for all model variants.

We measure the similarity of generated and gold rationales to the question (hypothesis) and answer. Results in Tables 12–13 show that generated rationales repeat the question (hypothesis) more than human rationales do. We also observe that the gold rationales in e-SNLI-VE are notably more repetitive than the human rationales in the other datasets.
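As a concrete, if simplistic, illustration, one way to quantify such repetition is the fraction of rationale tokens that also appear in the question or answer; this token-overlap proxy is an assumption for illustration and not necessarily the similarity measure reported in Tables 12–13.

def token_overlap(rationale, context):
    # Fraction of rationale tokens that also occur in the context
    # (question/hypothesis or answer); a crude repetition proxy.
    rationale_tokens = rationale.lower().split()
    context_tokens = set(context.lower().split())
    if not rationale_tokens:
        return 0.0
    return sum(t in context_tokens for t in rationale_tokens) / len(rationale_tokens)

print(token_overlap("He is smiling because he just won the game.",
                    "Why is he smiling?"))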
In Figure 6, we show that the length of generated rationales is similar for plausible and implausible rationales, with the exception of e-SNLI-VE, for which implausible rationales tend to be longer than plausible ones. We also show that plausible rationales tend to rationalize slightly shorter textual contexts in VCR (question and answer) and e-SNLI-VE (hypothesis).

Finally, in Figure 7, we show that there is more variation across the {yes, weak yes, weak no, no} labels for our models than for human rationales.

In summary, future developments should improve generations such that they repeat the textual context less, handle long textual contexts, and produce generations that humans will find more plausible with high certainty.