Beyond VQA: Generating Multi-word Answer and Rationale to Visual Questions
Noname manuscript No.
(will be inserted by the editor)
Beyond VQA: Generating Multi-word Answer and Rationale
to Visual Questions
Radhika Dua · Sai Srinivas Kancheti · Vineeth N Balasubramanian
arXiv:2010.12852v1 [cs.CV] 24 Oct 2020
Received: date / Accepted: date
Radhika Dua
Indian Institute of Technology, Hyderabad, India
E-mail: radhikadua1997@gmail.com

Sai Srinivas Kancheti
Indian Institute of Technology, Hyderabad, India
E-mail: sai.srinivas.kancheti@gmail.com

Vineeth N Balasubramanian
Indian Institute of Technology, Hyderabad, India
E-mail: vineethnb@iith.ac.in

Abstract Visual Question Answering (VQA) is a multi-modal task that aims to measure high-level visual understanding. Contemporary VQA models are restrictive in the sense that answers are obtained via classification over a limited vocabulary (in the case of open-ended VQA), or via classification over a set of multiple-choice-type answers. In this work, we present a completely generative formulation where a multi-word answer is generated for a visual query. To take this a step forward, we introduce a new task: ViQAR (Visual Question Answering and Reasoning), wherein a model must generate the complete answer and a rationale that seeks to justify the generated answer. We propose an end-to-end architecture to solve this task and describe how to evaluate it. We show that our model generates strong answers and rationales through qualitative and quantitative evaluation, as well as through a human Turing Test.

Keywords Visual Commonsense Reasoning · Visual Question Answering · Model with Generated Explanations

1 Introduction

Visual Question Answering (VQA) (Thomason et al. 2018; Lu et al. 2019; Storks et al. 2019; Jang et al. 2017; Lei et al. 2018) is a vision-language task that has seen a lot of attention in recent years. In general, the VQA task consists of either open-ended or multiple-choice answers to a question about an image. An increasing number of models are devoted to obtaining the best possible performance on benchmark VQA datasets, which intend to measure visual understanding based on visual questions. Most existing works perform VQA by using an attention mechanism and combining features from the two modalities to predict answers. However, answers in existing VQA datasets and models are largely one-word answers (average length is 1.1 words), which gives existing models the freedom to treat answer generation as a classification task. For the open-ended VQA task, the top-K answers are chosen, and models perform classification over this vocabulary.

However, many questions which require commonsense reasoning cannot be answered in a single word. A textual answer for a sufficiently complicated question may need to be a sentence. For example, a question of the type "What will happen..." usually cannot be answered completely using a single word. Fig 2 shows examples of such questions where multi-word answers are required (the answers and rationales in this figure are generated by our model in this work). Current VQA systems are not well-suited for questions of this type. To reduce this gap, the Visual Commonsense Reasoning (VCR) task (Zellers et al. 2018; Lu et al. 2019; Dua et al. 2019; Zheng et al. 2019; Talmor et al. 2018; Lin et al. 2019a) was recently proposed, which requires a greater level of visual understanding and an ability to reason about the world. More interestingly, the VCR dataset features multi-word answers, with an average answer length of 7.55 words. However, VCR is still a classification task, where the correct answer is chosen from a set of four answers. Models which solve classification tasks simply need to pick an answer in the case of VQA, or an answer and a rationale for
[Fig. 1: an input image with the question "Why are person3 and person4 sitting at the table?", the four answer choices from the VCR dataset ((a) person3 and person4 are having tea; (b) they are eating dinner; (c) person3 and person4 are waiting for breakfast; (d) person3 and person4 are celebrating a birthday, highlighted as correct), and other possible multi-word answers such as "person3 and person4 are in a party", "they are having drinks in a birthday party", "person3 and person4 are attending a friend's birthday party", and "they are talking to each other at a party".]
Fig. 1 An example from the VCR dataset (Zellers et al. 2018) shows that there can be many correct multi-word answers to a question, which makes the classification setting restrictive. The highlighted option is the correct option present in the VCR dataset; the rest are examples of plausible correct answers.
VCR. However, when multi-word answers are required for a visual question, options are not sufficient, since the same 'correct' answer can be paraphrased in a multitude of ways, each having the same semantic meaning but differing in grammar. Fig 1 shows an image from the VCR dataset, where the first highlighted answer is the correct one among a set of four options provided in the dataset. The remaining three answers in the figure are included by us here (not in the dataset) as other plausible correct answers. Existing VQA models are fundamentally limited to picking the right option, rather than answering in a more natural manner. Moreover, since the number of possible 'correct' options in multi-word answer settings can be large (as evidenced by Fig 1), we propose that for richer answers, one would need to move away from the traditional classification setting, and instead let the model generate the answer to a given question. We hence propose a new task which takes a generative approach to multi-word VQA in this work.

Humans, when answering questions, often use a rationale to justify the answer. In certain cases, humans answer directly from memory (perhaps through associations) and then provide a post-hoc rationale, which could help improve the answer too, thus suggesting an interplay between an answer and its rationale. Following this cue, we also propose to generate a rationale along with the answer, which serves two purposes: (i) it helps justify the generated answer to end-users; and (ii) it helps generate a better answer. Going beyond contemporary efforts in VQA, we hence propose, for the first time to the best of our knowledge, an approach that automatically generates both a multi-word answer and an accompanying rationale that also serves as a textual justification for the answer. We term this task Visual Question Answering and Reasoning (ViQAR) and propose an end-to-end methodology to address this task. This task is especially important in critical AI applications such as VQA in the medical domain, where simply answering questions about medical images is not sufficient. Instead, generating a rationale to justify the generated answer will help patients and doctors trust the model's outcome and also understand the exact cause of the disease.

In addition to formalizing this new task, we provide a simple yet reasonably effective model consisting of four sequentially arranged recurrent networks to address this challenge. The model can be seen as having two parts: a generation module (GM), which comprises the first two sequential recurrent networks, and a refinement module (RM), which comprises the final two sequential recurrent networks. The GM first generates an answer, using which it generates a rationale that explains the answer. The RM generates a refined answer based on the rationale generated by the GM. The refined answer is further used to generate a refined rationale. Our overall model design is motivated by the way humans think about answers to questions, wherein the answer and rationale are often mutually dependent on each other (one could motivate the other and also refine the other). We seek to model this dependency by first generating an answer-rationale pair and then using them as priors to regenerate a refined answer and rationale. We train our model on the VCR dataset, which contains open-ended visual questions along with answers and rationales. Considering this is a generative task, we evaluate our methodology by comparing the generated answer/rationale with the ground-truth answer/rationale on correctness and goodness of the generated content using generative language metrics, as well as by human Turing Tests.

Our main contributions in this work can be summarized as follows: (i) we propose a new task, ViQAR, that seeks to open up a new dimension of Visual Question Answering tasks by moving to a completely generative paradigm; (ii) we propose a simple and effective model based on generation and iterative refinement for ViQAR (which could serve as a baseline for the community); (iii) considering that generative models in general can be difficult to evaluate, we provide a discussion on how to evaluate such models, as well as study a comprehensive list of evaluation metrics for this task; (iv) we conduct a suite of experiments which show promise
[Fig. 2: three image examples with generated outputs. (1) Question: "Why is person1 covering his face?" Generated answer: "he is trying to avoid getting burned"; generated reason: "there is a fire right in front of him". (2) Question: "What is person8 doing right now?" Generated answer: "person8 is giving a speech"; generated reason: "person8 is standing in front of a microphone". (3) Question: "Why is person2 wearing tie1?" Generated answer: "person2 is wearing a tie because he is at a wedding"; generated reason: "people wear ties to formal events".]
Fig. 2 Given an image and a question about the image, we generate a natural language answer and a reason that explains why the answer was generated. The images shown above are examples of outputs that our proposed model generates. These examples also illustrate the kind of visual questions for which a single-word answer is insufficient. Contemporary VQA models handle even these kinds of questions only in a classification setting, which is limiting.
of the proposed model for this task, and also perform ablation studies of various choices and components to study the effectiveness of the proposed methodology on ViQAR. We believe that this work could lead to further efforts on common-sense answer and rationale generation in vision tasks in the near future. To the best of our knowledge, this is the first such effort of automatically generating a multi-word answer and rationale to a visual question, instead of picking answers from a predefined list.

2 Related Work

In this section, we review earlier efforts from multiple perspectives that may be related to this work: Visual Question Answering, Visual Commonsense Reasoning, and Image Captioning in general.

Visual Question Answering (VQA): VQA (Antol et al. 2015; Goyal et al. 2016; Jabri et al. 2016; Selvaraju et al. 2020) refers to the task of answering questions related to an image. VQA and its variants have been the subject of much research work recently. A lot of recent work has focused on varieties of attention-based models, which aim to 'look' at the relevant regions of the image in order to answer the question (Anderson et al. 2017; Lu et al. 2016b; Yu et al. 2017a; Xu and Saenko 2015; Yi et al. 2018; Shih et al. 2015; Chen et al. 2015; Yang et al. 2015). Other recent work has focused on better multimodal fusion methods (Kim et al. 2018; 2016; Fukui et al. 2016; Yu et al. 2017b), the incorporation of relations (Norcliffe-Brown et al. 2018; Li et al. 2019; Santoro et al. 2017), the use of multi-step reasoning (Cadène et al. 2019; Gan et al. 2019; Hudson and Manning 2018), and neural module networks for compositional reasoning (Andreas et al. 2016; Johnson et al. 2017; Chen et al. 2019; Hu et al. 2017). Visual Dialog (Das et al. 2018; Zheng et al. 2019) extends VQA but requires an agent to hold a meaningful conversation with humans in natural language based on visual questions.

The efforts closest to ours are those that provide justifications along with answers (Li et al. 2018b; Hendricks et al. 2016; Li et al. 2018a; Park et al. 2018; Wu et al. 2019b; Rajani and Mooney 2017), each of which however also answers a question as a classification task (and not in a generative manner), as described below. (Li et al. 2018b) create the VQA-E dataset, which has an explanation along with the answer to the question. (Wu et al. 2019b) provide relevant captions to aid in solving VQA, which can be thought of as weak justifications. More recent efforts (Park et al. 2018; Patro et al. 2020) attempt to provide visual and textual explanations to justify the predicted answers. Datasets have also been proposed for VQA in the recent past to test visual understanding (Zhu et al. 2015; Goyal et al. 2016; Johnson et al. 2016); e.g., the Visual7W dataset (Zhu et al. 2015) contains a richer class of questions about an image, with textual and visual answers. However, all these aforementioned efforts continue to focus on answering a question as a classification task (often in one word, such as Yes/No), followed by simple explanations. We, however, focus in this work on generating multi-word answers with a corresponding multi-word rationale, which has not been done before.

Visual Commonsense Reasoning (VCR): VCR (Zellers et al. 2018) is a recently introduced vision-language dataset which involves choosing a correct answer (among four provided options) for a given question about the image, and then choosing a rationale (among four provided options) that justifies the answer. The task associated with the dataset aims to test
for visual commonsense understanding, and provides images, questions and answers of a higher complexity than other datasets such as CLEVR (Johnson et al. 2016). The dataset has attracted a few methods over the last year (Zellers et al. 2018; Lu et al. 2019; Dua et al. 2019; Zheng et al. 2019; Talmor et al. 2018; Lin et al. 2019a;b; Wu et al. 2019a; Ayyubi et al. 2019; Brad 2019; Yu et al. 2019), each of which however follows the dataset's task and treats this as a classification problem. None of these efforts attempt to answer and reason using generated sentences.

Image Captioning and Visual Dialog: One could also consider the task of image captioning (Xu et al. 2015; You et al. 2016; Lu et al. 2016a; Anderson et al. 2017; Rennie et al. 2016), where natural language captions are generated to describe an image, as being close to our objective. However, image captioning is more a global description of an image than a question-answering problem that may be tasked with answering a question about understanding of a local region in the image.

In contrast to all the aforementioned efforts, our work, ViQAR, focuses on automatic complete generation of the answer, and of a rationale, given a visual query. This is a challenging task, since the generated answers must be correct (with respect to the question asked), be complete, be natural, and also be justified with a well-formed rationale. We now describe the task, and our methodology for addressing this task.

3 ViQAR: Task Description

Let V be a given vocabulary of size |V|, and let A = (a_1, a_2, ..., a_{l_a}) ∈ V^{l_a} and R = (r_1, r_2, ..., r_{l_r}) ∈ V^{l_r} represent answer sequences of length l_a and rationale sequences of length l_r respectively. Let I ∈ R^D represent the image representation, and Q ∈ R^B the feature representation of a given question. We also allow the use of an image caption, if available, in this framework, given by a feature representation C ∈ R^B. Our task is to compute a function F : R^D × R^B × R^B → V^{l_a} × V^{l_r} that maps the input image, question and caption features to a large space of generated answers A and rationales R, as given below:

F(I, Q, C) = (A, R)    (1)

Note that the formalization of this task is different from other tasks in this domain, such as Visual Question Answering (Agrawal et al. 2015) and Visual Commonsense Reasoning (Zellers et al. 2018). The VQA task can be formulated as learning a function G : R^D × R^B → C, where C is a discrete, finite set of choices (classification setting). Similarly, the Visual Commonsense Reasoning task provided in (Zellers et al. 2018) aims to learn a function H : R^D × R^B → C_1 × C_2, where C_1 is the set of possible answers, and C_2 is the set of possible reasons. The generative task proposed here in ViQAR is harder to solve when compared to VQA and VCR. One can divide ViQAR into two sub-tasks:

– Answer Generation: Given an image, its caption, and a complex question about the image, a multi-word natural language answer is generated: (I, Q, C) → A
– Rationale Generation: Given an image, its caption, a complex question about the image, and an answer to the question, a rationale to justify the answer is generated: (I, Q, C, A) → R

We also study variants of the above sub-tasks (such as when captions are not available) in our experiments. Our experiments suggest that the availability of captions helps performance on the proposed task, expectedly. We now present a methodology built using known basic components to study and show that the proposed, seemingly challenging, new task can be solved with existing architectures. In particular, our methodology is based on the understanding that the answer and rationale can help each other, and hence needs an iterative refinement procedure to handle such a multi-word, multi-output task. We consider the simplicity of the proposed solution an aspect of our solution by design, more than a limitation, and hope that the proposed architecture will serve as a baseline for future efforts on this task.

4 Proposed Methodology

We present an end-to-end, attention-based, encoder-decoder architecture for answer and rationale generation which is based on an iterative refinement procedure. The refinement in our architecture is motivated by the observation that answers and rationales can influence one another mutually. Thus, knowing the answer helps in the generation of a rationale, which in turn can help in the generation of a more refined answer. The encoder part of the architecture generates the features from the image, question and caption. These features are used by the decoder to generate the answer and rationale for a question.

Feature Extraction: We use spatial image features as proposed in (Anderson et al. 2017), which are termed bottom-up image features. We consider a fixed number of regions for each image, and extract a set of k features, V, as defined below:

V = {v_1, v_2, ..., v_k}, where v_i ∈ R^D    (2)

We use BERT (Devlin et al. 2019) representations to obtain fixed-size (B) embeddings for the question and caption, Q ∈ R^B and C ∈ R^B respectively. The question and caption are projected into a common feature space T ∈ R^L given by:

T = g(W_t^T (tanh(W_q^T Q) ⊕ tanh(W_c^T C)))    (3)
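The projection in Eqn 3 can be sketched in plain Python as below. The dimensions (B, L, D), the random weights, and the choice g = tanh are illustrative assumptions, not the trained parameters; note also that since ⊕ denotes concatenation, the sketch gives W_t a 2L × L shape so the shapes compose.

```python
import math
import random

random.seed(0)

def linear(x, w):
    """Multiply a vector x (length n) by an n-by-m weight matrix w."""
    n, m = len(w), len(w[0])
    return [sum(x[i] * w[i][j] for i in range(n)) for j in range(m)]

def tanh_vec(x):
    return [math.tanh(v) for v in x]

def rand_mat(n, m):
    return [[random.uniform(-1, 1) for _ in range(m)] for _ in range(n)]

B, L, D = 4, 3, 5  # illustrative: BERT embedding, common-space, image-feature sizes

Wq, Wc = rand_mat(B, L), rand_mat(B, L)  # project question/caption to R^L
Wt = rand_mat(2 * L, L)                  # maps the concatenated pair back to R^L

Q = [random.uniform(-1, 1) for _ in range(B)]  # stand-in BERT question embedding
C = [random.uniform(-1, 1) for _ in range(B)]  # stand-in BERT caption embedding

# Eqn 3: T = g(Wt^T (tanh(Wq^T Q) + concat + tanh(Wc^T C))), with g = tanh here
concat = tanh_vec(linear(Q, Wq)) + tanh_vec(linear(C, Wc))
T = tanh_vec(linear(concat, Wt))

# Eqn 4: F = mean image feature concatenated with T
k = 2
V = [[random.uniform(-1, 1) for _ in range(D)] for _ in range(k)]
V_bar = [sum(v[d] for v in V) / k for d in range(D)]
F = V_bar + T
```

F then serves as the common input feature vector to all four LSTMs described below.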
Fig. 3 The decoder of our proposed architecture: Given an image and a question on the image, the model must generate an
answer to the question and a rationale to justify why the answer is correct.
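The four-stage decoder flow shown in Fig. 3 can be sketched as below. The `decode` stub stands in for a full stacked LSTM sub-module and is purely illustrative; only the wiring between AG, RG, AR, and RR follows the paper.

```python
def decode(stage, context, steps):
    """Stand-in for one stacked LSTM sub-module: consumes a context
    representation, unrolls for `steps` time steps, and returns the
    generated tokens plus a summary representation (here, just a tag)."""
    tokens = [f"{stage}_tok{t}" for t in range(steps)]
    return tokens, f"rep({stage})"

def viqar_forward(F, la, lr):
    # Generation module (GM): answer generator, then rationale generator
    A1, rep_a1 = decode("AG", F, la)            # unrolls la steps
    R1, rep_r1 = decode("RG", (F, rep_a1), lr)  # conditioned on A1's representation
    # Refinement module (RM): answer refiner, then rationale refiner
    A2, rep_a2 = decode("AR", (F, rep_r1), la)  # conditioned on R1's representation
    R2, _ = decode("RR", (F, rep_a2), lr)       # conditioned on A2's representation
    return A1, R1, A2, R2

A1, R1, A2, R2 = viqar_forward(F="fused features", la=3, lr=5)
# Total unrolled time steps across the four stages: 2*la + 2*lr
assert len(A1) + len(R1) + len(A2) + len(R2) == 2 * 3 + 2 * 5
```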
where g is a non-linear function, ⊕ indicates concatenation, and W_t ∈ R^{L×L}, W_q ∈ R^{B×L} and W_c ∈ R^{B×L} are learnable weight matrices of the layers (we use two linear layers in our implementation in this work).

Let the mean of the extracted spatial image features (as in Eqn 2) be denoted by V̄ ∈ R^D. These are concatenated with the projected question and caption features to obtain F:

F = V̄ ⊕ T    (4)

We use F as the common input feature vector to all the LSTMs in our architecture.

Architecture: Fig. 3 shows our end-to-end architecture to address ViQAR. As stated earlier, our architecture has two modules: generation (GM) and refinement (RM). The GM consists of two sequential, stacked LSTMs, henceforth referred to as the answer generator (AG) and the rationale generator (RG) respectively. The RM seeks to refine the generated answer as well as the rationale, and is an important part of the proposed solution, as seen in our experimental results. It also consists of two sequential, stacked LSTMs, which we denote as the answer refiner (AR) and the rationale refiner (RR).

Each sub-module (presented inside dashed lines in the figure) is a complete LSTM. Given an image, question, and caption, the AG sub-module unrolls for l_a time steps to generate an answer. The hidden state of the Language and Attention LSTMs after l_a time steps is a representation of the generated answer. Using the representation of the generated answer from AG, the RG sub-module unrolls for l_r time steps to generate a rationale and obtain its representation. Then the AR sub-module uses the features from RG to generate a refined answer. Lastly, the RR sub-module uses the answer features from AR to generate a refined rationale. Thus, a refined answer is generated after l_a + l_r time steps, and a refined rationale is generated after l_a further time steps. The complete architecture runs in 2l_a + 2l_r time steps.

The LSTMs: The two layers of each stacked LSTM (Hochreiter and Schmidhuber 1997) are referred to as the Attention-LSTM (L_a) and the Language-LSTM (L_l) respectively. We denote h^a_t and x^a_t as the hidden state and input of the Attention-LSTM at time step t respectively. Analogously, h^l_t and x^l_t denote the hidden state and input of the Language-LSTM at time t. Since the four LSTMs are identical in operation, we describe the attention and sequence generation modules of one of the sequential LSTMs below in detail.

Spatial Visual Attention: We use a soft, spatial-attention model, similar to (Anderson et al. 2017) and (Lu et al. 2016a), to compute attended image features V̂. Given the combined input features F and the previous hidden states h^a_{t−1}, h^l_{t−1}, the current hidden state of the Attention-LSTM is given by:

x^a_t ≡ h_p ⊕ h^l_{t−1} ⊕ F ⊕ π_t    (5)
h^a_t = L_a(x^a_t, h^a_{t−1})    (6)

where π_t = W_e^T 1_t is the embedding of the input word, W_e ∈ R^{|V|×E} is the weight of the embedding layer, and 1_t is the one-hot representation of the input at time t. h_p is the hidden representation of the previous LSTM (answer or rationale, depending on the current LSTM).

The hidden state h^a_t and the visual features V are used by the attention module (implemented as a two-layered MLP in this work) to compute the normalized set of attention weights α_t = {α_{1t}, α_{2t}, ..., α_{kt}} (where α_{it} is the normalized weight of image feature v_i), as below:

y_{i,t} = W_{ay}^T (tanh(W_{av}^T v_i + W_{ah}^T h^a_t))    (7)
α_t = softmax(y_{1t}, y_{2t}, ..., y_{kt})    (8)

In the above equations, W_{ay} ∈ R^{A×1}, W_{av} ∈ R^{D×A} and W_{ah} ∈ R^{H×A} are weights learned by the attention MLP, H is the hidden size of the LSTM, and A is the hidden size of the attention MLP.
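The attention computation of Eqns 7 and 8, together with the attended feature it produces, can be sketched in plain Python as below; the tiny dimensions and random weights are illustrative assumptions.

```python
import math
import random

random.seed(1)

def matvec(w, x):
    """w is an m-by-n matrix as nested lists; returns w @ x."""
    return [sum(w[i][j] * x[j] for j in range(len(x))) for i in range(len(w))]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

D, H, A, k = 4, 3, 2, 5  # image-feature, LSTM-hidden, attention-hidden sizes; k regions

rand = lambda m, n: [[random.uniform(-1, 1) for _ in range(n)] for _ in range(m)]
W_av, W_ah = rand(A, D), rand(A, H)                  # attention MLP weights
W_ay = [random.uniform(-1, 1) for _ in range(A)]

v = [[random.uniform(-1, 1) for _ in range(D)] for _ in range(k)]  # region features
h_a = [random.uniform(-1, 1) for _ in range(H)]      # Attention-LSTM hidden state

# Eqn 7: one scalar score per image region
scores = []
for vi in v:
    hidden = [math.tanh(a + b) for a, b in zip(matvec(W_av, vi), matvec(W_ah, h_a))]
    scores.append(sum(wy * hh for wy, hh in zip(W_ay, hidden)))

# Eqn 8: normalize the region scores into attention weights
alpha = softmax(scores)

# Attended feature: the alpha-weighted sum of the region features
v_hat = [sum(alpha[i] * v[i][d] for i in range(k)) for d in range(D)]
```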
The attended image feature vector V̂_t = Σ_{i=1}^{k} α_{it} v_i is the weighted sum of all visual features.

Sequence Generation: The attended image features V̂_t, together with T and h^a_t, are inputs to the Language-LSTM at time t. We then have:

x^l_t ≡ h_p ⊕ V̂_t ⊕ h^a_t ⊕ T    (9)
h^l_t = L_l(x^l_t, h^l_{t−1})    (10)
y_t = W_{lh}^T h^l_t + b_{lh}    (11)
p_t = softmax(y_t)    (12)

where h_p is the hidden state of the previous LSTM, h^l_t is the output of the Language-LSTM, and p_t is the conditional probability over words in V at time t. The word at time step t is generated by a single-layered MLP with learnable parameters W_{lh} ∈ R^{H×|V|} and b_{lh} ∈ R^{|V|×1}. The attention MLP parameters W_{ay}, W_{av} and W_{ah}, and the embedding layer's parameters W_e, are shared across all four LSTMs. (We reiterate that although the architecture is based on well-known components, the aforementioned design decisions were obtained after significant study.)

Loss Function: For a better understanding of our approach, Fig. 4 presents a high-level illustration of our proposed generation-refinement model.

[Fig. 4: F → AG → RG → AR → RR]
Fig. 4 High-level illustration of our proposed Generation-Refinement model

Let A1 = (a_{11}, a_{12}, ..., a_{1l_a}), R1 = (r_{11}, r_{12}, ..., r_{1l_r}), A2 = (a_{21}, a_{22}, ..., a_{2l_a}) and R2 = (r_{21}, r_{22}, ..., r_{2l_r}) be the generated answer, generated rationale, refined answer and refined rationale sequences respectively, where a_{ij} and r_{ij} are discrete random variables taking values from the common vocabulary V. Given the common input F, our objective is to maximize the likelihood P(A1, R1, A2, R2 | F), given by:

P(A1, R1, A2, R2 | F) = P(A1, R1 | F) P(A2, R2 | F, A1, R1)
                      = P(A1 | F) P(R1 | F, A1) P(A2 | F, A1, R1) P(R2 | F, A1, R1, A2)    (13)

In our model design, each term in the RHS of Eqn 13 is computed by a distinct LSTM. Hence, minimizing the sum of the losses of the four LSTMs becomes equivalent to maximizing the joint likelihood. Our overall loss is the sum of four cross-entropy losses, one for each LSTM, as given below:

L = − ( Σ_{t=1}^{l_a} log p_t^{θ_1} + Σ_{t=1}^{l_r} log p_t^{θ_2} + Σ_{t=1}^{l_a} log p_t^{θ_3} + Σ_{t=1}^{l_r} log p_t^{θ_4} )    (14)

where θ_i represents the i-th sub-module LSTM, p_t is the conditional probability of the t-th word in the input sequence as calculated by the corresponding LSTM, l_a indicates the ground-truth answer length, and l_r the ground-truth rationale length. Other loss formulations, such as a weighted average of the cross-entropy terms, did not perform better than a simple sum; we tried weights from {0.0, 0.25, 0.5, 0.75, 1.0} for the loss terms. More implementation details are provided in Section 5.

5 Experiments and Results

In this section, we describe the dataset used for this work and our implementation details, and present the results of the proposed method and its variants.

Dataset: Considering there has been no dataset explicitly built for this new task, we study the performance of the proposed method on the recent VCR (Zellers et al. 2018) dataset, which has all the components needed for our approach. We train our proposed architecture on VCR, which contains ground-truth answers and ground-truth rationales that allow us to compare our generated answers and rationales against them. We also show in Section 6 how a model trained on the VCR dataset can be used to give a rationale for images from Visual7W (Zhu et al. 2015), a VQA dataset with no ground-truth rationale.

VCR is a large-scale dataset that consists of 290k triplets of questions, answers, and rationales over 110k unique movie scene images. For our method, we also use the captions provided by the authors of VCR as input (we later perform an ablation study without the captions). At inference, captions are generated using (Vinyals et al. 2014) (trained on the provided captions) and used as input to our model. Since the test set of the dataset is not provided, we randomly split the train set into a train-train split (202,923 samples) and a train-val split (10,000 samples), while using the validation set as our test data (26,534 samples). All our baselines are also compared in the same setting for fair comparison.

Dataset | Avg. A length | Avg. Q length | Avg. R length | Complexity
VCR     | 7.55          | 6.61          | 16.2          | High
VQA-E   | 1.11          | 6.1           | 11.1          | Low
VQA-X   | 1.12          | 6.13          | 8.56          | Low
Table 1 Statistical comparison of VCR with the VQA-E and VQA-X datasets. The VCR dataset is highly complex as it is made up of complex subjective questions.

VQA-E (Li et al. 2018b) and VQA-X (Park et al. 2018) are competing datasets that contain explanations along with question-answer pairs. Table 1 shows
the high-level analysis of the three datasets. Since VQA-E and VQA-X are derived from VQA-2, many of the questions can be answered in one word (a yes/no answer or a number). In contrast, VCR asks open-ended questions and has longer answers. Since our task aims to generate rich answers, the VCR dataset provides a richer context for this work. CLEVR (Johnson et al. 2016) is another VQA dataset; it measures logical reasoning capabilities by asking questions that can be answered when a certain sequential reasoning is followed. This dataset however does not contain reasons/rationales on which we can train. Also, we do not perform a direct evaluation on CLEVR because our model is trained on real-world natural images, while CLEVR is a synthetic shapes dataset.

In order to study our method further, we also study the transfer of our learned model to another challenging dataset, Visual7W (Zhu et al. 2015), by generating an answer/rationale pair for visual questions in Visual7W (please see Section 6 for more details). Visual7W is a large-scale visual question answering (VQA) dataset which has multi-modal answers, i.e., visual 'pointing' answers and textual 'telling' answers.

Implementation Details: We use spatial image features generated from (Anderson et al. 2017) as our image input. Fixed-size BERT representations of questions and captions are used. The hidden size of all LSTMs is set to 1024, and the hidden size of the attention MLP is set to 512. We trained using the ADAM optimizer with a decaying learning rate starting from 4e−4, using a batch size of 64. Dropout is used as a regularizer. Our code is publicly available at radhikadua123.github.io/ViQAR.

Evaluation Metrics: We use multiple objective evaluation metrics to evaluate the goodness of the answers and rationales generated by our model. Since our task is generative, evaluation is done by comparing our generated sentences with ground-truth sentences to assess their semantic correctness as well as structural soundness. To this end, we use a combination of multiple existing evaluation metrics. Word-overlap-based metrics such as METEOR (Lavie and Agarwal 2007), CIDEr (Vedantam et al. 2014) and ROUGE (Lin 2004) quantify the structural closeness of the generated sentences to the ground truth. While such metrics give a sense of the structural correctness of the generated sentences, they may however be insufficient for evaluating generation tasks, since there could be many valid generations which are correct but do not share the same words as a single ground-truth answer. Hence, in order to measure how close the generation is to the ground truth in meaning, we additionally use embedding-based metrics (which calculate the cosine similarity between sentence embeddings for generated and ground-truth sentences), including SkipThought cosine similarity (Kiros et al. 2015), Vector Extrema cosine similarity (Forgues and Pineau 2014), Universal Sentence Encoder (Cer et al. 2018), InferSent (Conneau et al. 2017) and BERTScore (Zhang et al. 2019). We use a comprehensive suite of all the aforementioned metrics to study the performance of our model. Subjective studies are presented later in this section.

Classification Accuracy: We also evaluate the performance of our model on the classification task. For every question, there are four answer choices and four rationale choices provided in the VCR dataset. We compute the similarity scores between each of the options and our generated answer/rationale, and choose the option with the highest similarity score. Accuracy percentages for answer classification, rationale classification and overall answer+rationale classification (denoted as Overall) are reported in Table 2. Only samples that correctly predict both answers and rationales are counted for overall answer+rationale classification accuracy. The results show the difficulty of the ViQAR task, expounding the need for opening up this problem to the community.

Results: Qualitative Results: Fig 5 shows examples of images and questions where the proposed model generates a meaningful answer with a supporting rationale. Qualitative results indicate that our model is capable of generating answer-rationale pairs for complex subjective questions starting with 'Why', 'What', 'How', etc. Given the question "What is person8 doing right now?", the generated rationale "person8 is standing in front of a microphone" actively supports the generated answer "person8 is giving a speech", justifying the choice of a two-stage generation-refinement architecture. For completeness of understanding, we also present a few examples on which our model fails to generate the semantically correct answer in Fig. S2. However, even on these results, we observe that our model does generate answers and rationales that are grammatically correct and complete (where the rationale supports the generated answer). Improving the semantic correctness of the generations will be an important direction of future work. More qualitative results are presented in the supplementary material owing to space constraints.

Quantitative Results: Quantitative results on the suite of evaluation metrics stated earlier are shown in Table 3. Since this is a new task, there are no known methods to compare against. We hence compare our model against a basic two-stage LSTM that generates the answer and reason independently as a baseline (called Baseline in Table 3), and a VQA-based method (Anderson et al. 2017) that extracts multi-modal features to
Metrics      Q+I+C                       Q+I                         Q+C
             Answer  Rationale  Overall  Answer  Rationale  Overall  Answer  Rationale  Overall
Infersent    34.90   31.78      11.91    34.73   31.47      11.68    30.50   27.99      9.17
USE          34.56   30.81      11.13    34.7    30.57      11.17    30.15   27.57      8.56
Table 2 Quantitative results on the VCR dataset. Accuracy percentage for answer classification, rationale classification and overall answer-rationale classification is reported.
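The embedding-based scores used in this work (Infersent, USE) come down to a cosine similarity between fixed-size sentence vectors. The sketch below illustrates only that computation: the toy word vectors and the averaging "encoder" are stand-ins for a real sentence encoder such as USE or InferSent, not the paper's implementation.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Toy word vectors standing in for a real sentence encoder (hypothetical values).
WORD_VECS = {
    "person8": [1.0, 0.0, 0.0],
    "is":      [0.0, 1.0, 0.0],
    "giving":  [0.5, 0.5, 0.0],
    "a":       [0.0, 0.5, 0.5],
    "speech":  [0.9, 0.1, 0.2],
    "talking": [0.8, 0.2, 0.1],
}

def embed(sentence):
    """Embedding-average sentence vector; unknown words map to zeros."""
    vecs = [WORD_VECS.get(w, [0.0, 0.0, 0.0]) for w in sentence.lower().split()]
    n = max(len(vecs), 1)
    return [sum(col) / n for col in zip(*vecs)]

generated = "person8 is giving a speech"
reference = "person8 is talking"
score = cosine_similarity(embed(generated), embed(reference))
```

Two semantically close sentences with little word overlap can still receive a high score this way, which is exactly why these metrics complement the word overlap-based ones.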
Question: What does person2 think of person1?
Answer: person2 is worried about person1
Reason: person2 is looking down at person1, and person1 is stressed
Answer: person2 thinks person1 is a strange guy
Reason: person2 is looking at person1 with a suspicious look

Question: Why is person4 wearing a tie?
Answer: he is at a business meeting
Reason: business attire is a suit and tie
Answer: he is a businessman
Reason: business men wear suits and ties to work

Fig. 5 (Best viewed in color) Sample qualitative results for ViQAR from our proposed Generation-Refinement architecture. Blue box = question about image; Green = Ground truth answer and rationale; Red = Answer and rationale generated by our architecture. (Object regions shown on the image are for the reader's understanding and are not given as input to the model.)
Question: Where are person1 person2 person3 going?
Answer: person1 person2 person3 are going into the library inside a churge.
Reason: person1 person2 person3 are walking past the church's stain glass window toward a room filled with books.
Answer: they are walking to their next class
Reason: they are walking in a hallway of a building

Fig. 6 (Best viewed in color) Challenging inputs for which our model fails to generate the semantically correct answer, but the generated answer and rationale still show grammatical correctness and completeness, showing the promise of the overall idea. Blue box = question about the image; Green = Ground truth; Red = Results generated by our architecture. (Object regions shown on the image are for the reader's understanding and are not given as input to the model.)
generate answers and rationales independently (called VQA-Baseline in Table 3). (Comparison with other standard VQA models is not relevant in this setting, since we perform a generative task, unlike existing VQA models.) We show results on three variants of our proposed Generation-Refinement model: Q+I+C (when question, image and caption are given as inputs), Q+I (question and image alone as inputs), and Q+C (question and caption alone as inputs). Evidently, our Q+I+C variant performed the most consistently across all the evaluation metrics. (The qualitative results in Fig 5 were obtained using the Q+I+C setting.) Importantly, our model outperforms both baselines, including the VQA-based one, on every single evaluation metric, showing the utility of the proposed architecture.

Human Turing Test: In addition to the study of the objective evaluation metrics, we also performed a human Turing test on the generated answers and rationales. 30 human evaluators were each presented with 50 randomly sampled image-question pairs, each containing an answer to the question and its rationale. The test aims to measure how humans score the generated sentences w.r.t. ground truth sentences. Sixteen of the fifty questions had ground truth answers and rationales, while the rest were generated by our proposed model. For each sample, the evaluators had to give a rating of 1 to 5 for five different criteria, with 1 being very poor and 5 being very good. The results are presented in Table 4. Despite the higher number of generated answer-rationales judged by human users, the answers and rationales produced by our method were deemed to be fairly correct grammatically. The evaluators also agreed that our answers were relevant to the
Metrics VQA-Baseline (Anderson et al. 2017) Baseline Q+I+C Q+I Q+C
Univ Sent Encoder CS 0.419 0.410 0.455 0.454 0.440
Infersent CS 0.370 0.400 0.438 0.442 0.426
Embedding Avg CS 0.838 0.840 0.846 0.853 0.845
Vector Extrema CS 0.474 0.444 0.493 0.483 0.475
Greedy Matching Score 0.662 0.633 0.672 0.661 0.657
METEOR 0.107 0.095 0.116 0.104 0.103
Skipthought CS 0.430 0.359 0.436 0.387 0.385
RougeL 0.259 0.206 0.262 0.232 0.236
CIDEr 0.364 0.158 0.455 0.310 0.298
F-BERTScore 0.877 0.860 0.879 0.867 0.868
Table 3 Quantitative evaluation on VCR dataset; CS = cosine similarity; we compare against a basic two-stage LSTM model
and a VQA model (Anderson et al. 2017) as baselines; remaining columns are proposed model variants
question and the generated rationales are acceptably relevant to the generated answer.

6 Discussions and Analysis

Ablation Studies on Refinement Module: We evaluated the performance of the following variations of our proposed generation-refinement architecture M: (i) M − RM, where the refinement module is removed; and (ii) M + RM, where a second refinement module is added, i.e. the model has one generation module and two refinement modules (to see if further refinement of answer and rationale helps). Table 5 shows the quantitative results.

                                      #Refinement Modules
Metrics                               0       1       2
Univ Sent Encoder                     0.453   0.455   0.430
Infersent                             0.434   0.438   0.421
Embedding Avg Cosine Similarity       0.85    0.846   0.840
Vector Extrema Cosine Similarity      0.482   0.493   0.462
Greedy Matching Score                 0.659   0.672   0.639
METEOR                                0.101   0.116   0.090
Skipthought Cosine Similarity         0.384   0.436   0.375
RougeL                                0.234   0.262   0.198
CIDEr                                 0.314   0.455   0.197
F-BertScore                           0.868   0.879   0.861
Table 5 Comparison of the proposed Generation-Refinement architecture with variations in the number of Refinement modules

We observe that our proposed model, which has one refinement module, has the best results. Adding additional refinement modules causes the performance to go down. We hypothesize that the additional parameters (in a second Refinement module) make it harder for the network to learn. Removal of the refinement module also causes performance to drop, supporting our claim on the usefulness of a Refinement module too. We also studied the classification accuracy in these variations, and observed that the 1-refinement model (the original version of our method), with 11.9% accuracy, outperforms the 0-refinement model (11.64%) and the 2-refinement model (10.94%) for the same reasons. Fig. 7 provides a few qualitative results with and without the refinement module, supporting our claim. More results are presented in the supplementary.

Transfer to Other Datasets: We also studied whether the proposed model, trained on the VCR dataset, can provide answers and rationales to visual questions in other VQA datasets (which do not have ground truth rationale provided). To this end, we tested our trained model on the widely used Visual7W (Zhu et al. 2015) dataset without any additional training. Fig 8 presents qualitative results for the ViQAR task on the Visual7W dataset. We also perform a Turing test on the generated answers and rationales to evaluate the model's performance on Visual7W. Thirty human evaluators were each presented with twenty-five hand-picked image-question pairs, each of which contains a generated answer to the question and its rationale. The results, presented in Table 6, show that our algorithm generalizes reasonably well to the other VQA dataset and generates answers and rationales relevant to the image-question pair, without any explicit training for this dataset. This adds a promising dimension to this work. More results are presented in the supplementary.

More results, including qualitative examples, for various settings are included in the Supplementary Section. Since ViQAR is a completely generative task, objective evaluation is a challenge, as in any other generative methods. For comprehensive evaluation, we use a suite of objective metrics typically used in related vision-language tasks. We perform a detailed analysis in the Supplementary material and show that even our successful results (qualitatively speaking) may have low scores on objective evaluation metrics at times, since generated sentences may not match a ground truth sentence word-by-word. We hope that opening up this dimension of generated explanations will only motivate a better metric in the near future.

7 Conclusion

In this paper, we propose ViQAR, a novel task for generating a multi-word answer and a rationale given an image and a question. Our work aims to go beyond classical VQA by moving to a completely generative
Criteria Generated Ground-truth
Mean ± std Mean ± std
How well-formed and grammatically correct is the answer? 4.15 ± 1.05 4.40 ± 0.87
How well-formed and grammatically correct is the rationale? 3.53 ± 1.26 4.26 ± 0.92
How relevant is the answer to the image-question pair? 3.60 ± 1.32 4.08 ± 1.03
How well does the rationale explain the answer with respect to the image-question pair? 3.04 ± 1.36 4.05 ± 1.10
Irrespective of the image-question pair, how well does the rationale explain the answer? 3.46 ± 1.35 4.13 ± 1.09
Table 4 Results of human Turing test performed with 30 people who rated samples consisting of question, corresponding
answer and rationale on 5 criteria. For each criterion, a rating of 1 to 5 was given. Results show mean score and standard
deviation for each criterion for generated and ground truth samples.
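The "mean ± std" entries in Table 4 are plain per-criterion aggregates over the 1-to-5 ratings. A minimal sketch of that aggregation follows; the ratings below are illustrative values only, not the study's actual data.

```python
from statistics import mean, stdev

# Illustrative ratings on a 1-5 scale (hypothetical, not the study data).
ratings = {
    "answer_grammar":    [5, 4, 4, 3, 5, 4, 2, 5],
    "rationale_grammar": [4, 3, 4, 2, 5, 3, 4, 3],
}

def summarize(scores):
    """Return the 'mean ± std' summary format used in Tables 4 and 6."""
    return f"{mean(scores):.2f} ± {stdev(scores):.2f}"

for criterion, scores in ratings.items():
    print(criterion, summarize(scores))
```

Note that `stdev` is the sample standard deviation; whether the paper reports sample or population standard deviation is not stated, so this is one reasonable reading.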
Question: Where are they at?
Generation Module: Answer: they are in a library. Reason: there are shelves of books behind them.
Generation-Refinement: Answer: they are in a liquor store. Reason: there are shelves of liquor bottles on the shelves.

Question: What if person1 refused to shake the hand of person2?
Generation Module: Answer: person1 would push it off with 1. Reason: person1 is not wearing a shirt and person2 is not.
Generation-Refinement: Answer: he would be angry. Reason: 1 is angry and is not paying attention to 2.

Question: Why is person2 putting her hand on person1?
Generation Module: Answer: person2 is dancing. Reason: 2 is holding 1's hand and is smiling.
Generation-Refinement: Answer: she wants to kiss him. Reason: she is looking at him with a smile on her face.

Question: What are person1, person2, person3, person4, and person5 doing here?
Generation Module: Answer: they are studying a class. Reason: they are all sitting in a circle and there is a teacher in front of them.
Generation-Refinement: Answer: they are all to attend a funeral. Reason: they are all wearing black.

Fig. 7 Qualitative results for our model with (Generation-Refinement, in green) and without (Generation Module alone, in red) the Refinement module
Question: What are the men doing?
Answer: Person2 and Person1 are cooking for food
Reason: They are holding a table and the table has food

Question: Why are the people all dressed up?
Answer: They are celebrating a costume party
Reason: They are all dressed in nice dresses and in fancy clothes

Question: Where is this taking place?
Answer: City street. it is located in a city
Reason: There are buildings in the background and there is taxi cars

Question: Where is this at?
Answer: She is at the beach
Reason: There are surfboards in the background and lots of surfboards everywhere

Fig. 8 Qualitative results on the Visual7W dataset (note that there is no rationale provided in this dataset, and all above rationales were generated by our model)
Criteria Generated
Mean ± std
How well-formed and grammatically correct is the answer? 3.98 ± 1.08
How well-formed and grammatically correct is the rationale? 3.80 ± 1.04
How relevant is the answer to the image-question pair? 4.11 ± 1.17
How well does the rationale explain the answer with respect to the image-question pair? 3.83 ± 1.23
Irrespective of the image-question pair, how well does the rationale explain the answer? 3.83 ± 1.28
Table 6 Results of the Turing test on Visual7W dataset performed with 30 people who had to rate samples consisting of a
question and its corresponding answer and rationales on five criteria. For each criterion, a rating of 1 to 5 was given. The table
gives the mean score and standard deviation for each criterion for the generated samples.
paradigm. To solve ViQAR, we present an end-to-end generation-refinement architecture which is based on the observation that answers and rationales are dependent on one another. We showed the promise of our model on the VCR dataset both qualitatively and quantitatively, and our human Turing test showed results comparable to the ground truth. We also showed that this model can be transferred to tasks without ground truth rationale. We hope that our work will open up a broader discussion around generative answers in VQA and other deep neural network models in general.

Acknowledgements We acknowledge MHRD and DST, Govt of India, as well as Honeywell India for partial support of this project through the UAY program, IITH/005. We also thank Japan International Cooperation Agency and IIT Hyderabad for the generous compute support.

References

Agrawal A, Lu J, Antol S, Mitchell M, Zitnick CL, Parikh D, Batra D (2015) VQA: Visual question answering. ICCV
Anderson P, He X, Buehler C, Teney D, Johnson M, Gould S, Zhang L (2017) Bottom-up and top-down attention for image captioning and visual question answering. CVPR
Andreas J, Rohrbach M, Darrell T, Klein D (2016) Neural module networks. CVPR, pp 39–48
Antol S, Agrawal A, Lu J, Mitchell M, Batra D, Zitnick CL, Parikh D (2015) VQA: Visual question answering. In: ICCV
Ayyubi HA, Tanjim MM, Kriegman DJ (2019) Enforcing reasoning in visual commonsense reasoning. ArXiv abs/1910.11124
Brad F (2019) Scene graph contextualization in visual commonsense reasoning. In: ICCV 2019
Cadène R, Ben-younes H, Cord M, Thome N (2019) MUREL: Multimodal relational reasoning for visual question answering. CVPR, pp 1989–1998
Cer D, Yang Y, Kong S, Hua N, Limtiaco N, John RS, Constant N, Guajardo-Cespedes M, Yuan S, Tar C, Sung YH, Strope B, Kurzweil R (2018) Universal sentence encoder. ArXiv
Chen K, Wang J, Chen L, Gao H, Xu W, Nevatia R (2015) ABC-CNN: An attention based convolutional neural network for visual question answering. CoRR abs/1511.05960
Chen W, Gan Z, Li L, Cheng Y, Wang WWJ, Liu J (2019) Meta module network for compositional visual reasoning. ArXiv abs/1910.03230
Conneau A, Kiela D, Schwenk H, Barrault L, Bordes A (2017) Supervised learning of universal sentence representations from natural language inference data. CoRR
Das A, Kottur S, Gupta K, Singh A, Yadav D, Lee S, Moura JMF, Parikh D, Batra D (2018) Visual dialog. IEEE TPAMI 41
Devlin J, Chang MW, Lee K, Toutanova K (2019) BERT: Pre-training of deep bidirectional transformers for language understanding. In: NAACL-HLT
Dua D, Wang Y, Dasigi P, Stanovsky G, Singh S, Gardner M (2019) DROP: A reading comprehension benchmark requiring discrete reasoning over paragraphs. In: NAACL-HLT
Forgues G, Pineau J (2014) Bootstrapping dialog systems with word embeddings
Fukui A, Park DH, Yang D, Rohrbach A, Darrell T, Rohrbach M (2016) Multimodal compact bilinear pooling for visual question answering and visual grounding. In: EMNLP
Gan Z, Cheng Y, Kholy AE, Li L, Liu J, Gao J (2019) Multi-step reasoning via recurrent dual attention for visual dialog. In: ACL
Goyal Y, Khot T, Summers-Stay D, Batra D, Parikh D (2016) Making the V in VQA matter: Elevating the role of image understanding in visual question answering. CVPR
Hendricks LA, Akata Z, Rohrbach M, Donahue J, Schiele B, Darrell T (2016) Generating visual explanations. In: ECCV
Hochreiter S, Schmidhuber J (1997) Long short-term memory. Neural Computation 9:1735–1780
Hu R, Andreas J, Rohrbach M, Darrell T, Saenko K (2017) Learning to reason: End-to-end module networks for visual question answering. ICCV, pp 804–813
Hudson DA, Manning CD (2018) Compositional attention networks for machine reasoning. ArXiv abs/1803.03067
Jabri A, Joulin A, van der Maaten L (2016) Revisiting visual question answering baselines. In: ECCV
Jang Y, Song Y, Yu Y, Kim Y, Kim G (2017) TGIF-QA: Toward spatio-temporal reasoning in visual question answering. CVPR, pp 1359–1367
Johnson J, Hariharan B, van der Maaten L, Fei-Fei L, Zitnick CL, Girshick RB (2016) CLEVR: A diagnostic dataset for compositional language and elementary visual reasoning. CVPR
Johnson J, Hariharan B, van der Maaten L, Hoffman J, Fei-Fei L, Zitnick CL, Girshick RB (2017) Inferring and executing programs for visual reasoning
Kim J, On KW, Lim W, Kim J, Ha J, Zhang B (2016) Hadamard product for low-rank bilinear pooling. CoRR abs/1610.04325
Kim JH, Jun J, Zhang BT (2018) Bilinear attention networks. ArXiv abs/1805.07932
Kiros R, Zhu Y, Salakhutdinov R, Zemel RS, Urtasun R, Torralba A, Fidler S (2015) Skip-thought vectors. In: NIPS
Lavie A, Agarwal A (2007) METEOR: An automatic metric for MT evaluation with high levels of correlation with human judgments. In: WMT@ACL
Lei J, Yu L, Bansal M, Berg TL (2018) TVQA: Localized, compositional video question answering. In: EMNLP
Li L, Gan Z, Cheng Y, Liu J (2019) Relation-aware graph attention network for visual question answering. ICCV, pp 10312–10321
Li Q, Fu J, Yu D, Mei T, Luo J (2018a) Tell-and-answer: Towards explainable visual question answering using attributes and captions. In: EMNLP
Li Q, Tao Q, Joty SR, Cai J, Luo J (2018b) VQA-E: Explaining, elaborating, and enhancing your answers for visual questions. In: ECCV
Lin BY, Chen X, Chen J, Ren X (2019a) KagNet: Knowledge-aware graph networks for commonsense reasoning. ArXiv
Lin CY (2004) ROUGE: A package for automatic evaluation of summaries. In: ACL
Lin J, Jain U, Schwing AG (2019b) TAB-VCR: Tags and attributes based VCR baselines. In: NeurIPS
Lu J, Xiong C, Parikh D, Socher R (2016a) Knowing when to look: Adaptive attention via a visual sentinel for image captioning. CVPR
Lu J, Yang J, Batra D, Parikh D (2016b) Hierarchical question-image co-attention for visual question answering. In: NIPS
Lu J, Batra D, Parikh D, Lee S (2019) ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. ArXiv
Norcliffe-Brown W, Vafeias E, Parisot S (2018) Learning conditioned graph structures for interpretable visual question answering. In: NeurIPS
Park DH, Hendricks LA, Akata Z, Rohrbach A, Schiele B, Darrell T, Rohrbach M (2018) Multimodal explanations: Justifying decisions and pointing to the evidence. CVPR
Patro BN, Pate S, Namboodiri VP (2020) Robust explanations for visual question answering. ArXiv abs/2001.08730
Rajani NF, Mooney RJ (2017) Ensembling visual explanations for VQA
Rennie SJ, Marcheret E, Mroueh Y, Ross J, Goel V (2016) Self-critical sequence training for image captioning. CVPR
Santoro A, Raposo D, Barrett DGT, Malinowski M, Pascanu R, Battaglia PW, Lillicrap TP (2017) A simple neural network module for relational reasoning. In: NIPS
Selvaraju RR, Tendulkar P, Parikh D, Horvitz E, Ribeiro MT, Nushi B, Kamar E (2020) SQuINTing at VQA models: Interrogating VQA models with sub-questions. ArXiv abs/2001.06927
Shih KJ, Singh S, Hoiem D (2015) Where to look: Focus regions for visual question answering. CVPR, pp 4613–4621
Storks S, Gao Q, Chai JY (2019) Commonsense reasoning for natural language understanding: A survey of benchmarks, resources, and approaches. ArXiv
Talmor A, Herzig J, Lourie N, Berant J (2018) CommonsenseQA: A question answering challenge targeting commonsense knowledge. In: NAACL-HLT
Thomason J, Gordon D, Bisk Y (2018) Shifting the baseline: Single modality performance on visual navigation and QA. In: NAACL-HLT
Vedantam R, Zitnick CL, Parikh D (2014) CIDEr: Consensus-based image description evaluation. CVPR
Vinyals O, Toshev A, Bengio S, Erhan D (2014) Show and tell: A neural image caption generator. CVPR
Wu A, Zhu L, Han Y, Yang Y (2019a) Connective cognition network for directional visual commonsense reasoning. In: NeurIPS
Wu J, Hu Z, Mooney RJ (2019b) Generating question relevant captions to aid visual question answering. In: ACL
Xu H, Saenko K (2015) Ask, attend and answer: Exploring question-guided spatial attention for visual question answering. In: ECCV
Xu K, Ba J, Kiros R, Cho K, Courville AC, Salakhutdinov R, Zemel RS, Bengio Y (2015) Show, attend and tell: Neural image caption generation with visual attention. In: ICML
Yang Z, He X, Gao J, Deng L, Smola AJ (2015) Stacked attention networks for image question answering. CVPR, pp 21–29
Yi K, Wu J, Gan C, Torralba A, Kohli P, Tenenbaum JB (2018) Neural-symbolic VQA: Disentangling reasoning from vision and language understanding. In: NeurIPS
You Q, Jin H, Wang Z, Fang C, Luo J (2016) Image captioning with semantic attention. CVPR
Yu D, Fu J, Mei T, Rui Y (2017a) Multi-level attention networks for visual question answering. CVPR
Yu W, Zhou J, Yu W, Liang X, Xiao N (2019) Heterogeneous graph learning for visual commonsense reasoning. In: NeurIPS
Yu Z, Yu J, Fan J, Tao D (2017b) Multi-modal factorized bilinear pooling with co-attention learning for visual question answering. ICCV, pp 1839–1848
Zellers R, Bisk Y, Farhadi A, Choi Y (2018) From recognition to cognition: Visual commonsense reasoning. In: CVPR
Zhang T, Kishore V, Wu F, Weinberger KQ, Artzi Y (2019) BERTScore: Evaluating text generation with BERT. ArXiv abs/1904.09675
Zheng Z, Wang W, Qi S, Zhu SC (2019) Reasoning visual dialogs with structural and partial observations. In: CVPR
Zhu Y, Groth O, Bernstein MS, Fei-Fei L (2015) Visual7W: Grounded question answering in images. CVPR
Supplementary Material: Beyond VQA: Generating Multi-word Answer and Rationale to Visual Questions

In this supplementary section, we provide additional supporting information including:

– Additional qualitative results of our model (extension of Section 5 of the main paper)
– More results on transferring our model to the VQA task (extension of results in Section 6 of the main paper)
– More results on the impact of the Refinement module in our model, and a study on the effect of adding a refinement module to VQA-E
– A discussion on existing objective evaluation metrics for this task, and the need to go beyond (extension of Section 6 of our paper)
– A study, using our attention maps, on the interpretability of our results

S1 Additional Qualitative Results

In addition to the qualitative results presented in Sec 5 of the main paper, Fig S1 presents more qualitative results from our proposed model on the VCR dataset for the ViQAR task. We observe that our model is capable of generating answer-rationale pairs to complex subjective questions of the type: Explanation (why, how come), Activity (doing, looking, event), Temporal (happened, before, after, etc.), Mental (feeling, thinking, love, upset), Scene (where, time) and Hypothetical sentences (if, would, could). For completeness of understanding, we also show a few examples on which our model fails to generate a good answer-rationale pair in Fig S2. As stated earlier in Section 5, even on these results, we observe that our model does generate both answers and rationales that are grammatically correct and complete. Improving the semantic correctness of the generations will be an important direction of future work.

S2 More Qualitative Results on Transfer to VQA Task

As stated in Section 6 of the main paper, we also studied whether the proposed model, trained on the VCR dataset, can provide answers and rationales to visual questions in standard VQA datasets (which do not have ground truth rationale provided). Figure S3 presents additional qualitative results for the ViQAR task on the Visual7W dataset. We observe that our algorithm generalizes reasonably well to the other VQA dataset and generates answers and rationales relevant to the image-question pair, without any explicit training for this dataset. This adds a promising dimension to this work.

S3 Impact of Refinement Module

Figure S4 provides a few examples to qualitatively compare the model with and without the refinement module, in continuation to the discussion in Section 6. We observe that the model without the refinement module fails to generate answers and rationale for complex image-question pairs. However, our proposed Generation-Refinement model is capable of generating a meaningful answer with a supporting explanation. Hence the addition of the refinement module to our model is useful to generate answer-rationale pairs to complex questions.

We also performed a study on the effect of adding refinement to VQA models, particularly to VQA-E (Li et al. 2018b), which can be considered close to our work since it provides explanations as a classification problem (it classifies the explanation among a list of options, unlike our model, which generates the explanation). To add refinement, we pass the last hidden state of the LSTM that generates the explanation, along with the joint representation, to another classification module. However, we did not observe improvement in classification accuracy when the refinement module is added for such a model. This may be attributed to the fact that the VQA-E dataset consists largely of one-word answers to visual questions. We infer that the interplay of answer and rationale, which is important to generate a better answer and provide justification, is more useful in multi-word answer settings, which is the focus of this work.

S4 On Objective Evaluation Metrics for Generative Tasks: A Discussion

Since ViQAR is a completely generative task, objective evaluation is a challenge, as in any other generative methods. Hence, for comprehensive evaluation, we use a suite of several well-known objective evaluation metrics to measure the performance of our method quantitatively. There are various reasons why our approach may seem to give relatively lower scores than typical results for these scores on other language processing tasks. Such evaluation metrics measure the similarity between the generated and ground-truth sentences. For our task, there may be multiple correct answers and rationales, and each of them can be expressed in numerous ways. Figure S5 shows a few examples of images and questions along with their corresponding ground-truth, generated answer-rationale pair, and corresponding evaluation metric scores. We observe that generated answers and rationales are relevant to the image-question pair but may be different from the ground-truth answer-rationale pair. Hence, the evaluation metrics reported here have low scores even when the results
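As a concrete illustration of why word overlap-based metrics can under-score valid generations, the sketch below computes an LCS-based, ROUGE-L-style F-score. This is an illustrative re-implementation, not the official ROUGE toolkit: a generated answer that is semantically acceptable but worded differently from the single reference receives only partial credit.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, tok_a in enumerate(a):
        for j, tok_b in enumerate(b):
            if tok_a == tok_b:
                dp[i + 1][j + 1] = dp[i][j] + 1
            else:
                dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j])
    return dp[-1][-1]

def rouge_l_f(generated, reference, beta=1.2):
    """ROUGE-L-style F-score combining LCS precision and recall."""
    gen, ref = generated.split(), reference.split()
    lcs = lcs_length(gen, ref)
    if lcs == 0:
        return 0.0
    prec, rec = lcs / len(gen), lcs / len(ref)
    return (1 + beta ** 2) * prec * rec / (rec + beta ** 2 * prec)

# A plausible paraphrase still scores well below 1.0 against a single reference:
paraphrase_score = rouge_l_f("they are walking to their next class",
                             "they are going to the library")
```

Only a word-for-word match reaches 1.0, which is exactly the limitation discussed above for open-ended answer and rationale generation.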