Attacking Visual Language Grounding with Adversarial Examples: A Case Study on Neural Image Captioning

Hongge Chen1*, Huan Zhang2,3*, Pin-Yu Chen3, Jinfeng Yi4, and Cho-Jui Hsieh2

1 MIT, Cambridge, MA 02139, USA
2 UC Davis, Davis, CA 95616, USA
3 IBM Research, NY 10598, USA
4 JD AI Research, Beijing, China

chenhg@mit.edu, ecezhang@ucdavis.edu, pin-yu.chen@ibm.com, yijinfeng@jd.com, chohsieh@ucdavis.edu

* Hongge Chen and Huan Zhang contribute equally to this work.

arXiv:1712.02051v2 [cs.CV] 22 May 2018
Abstract

Visual language grounding is widely studied in modern neural image captioning systems, which typically adopt an encoder-decoder framework consisting of two principal components: a convolutional neural network (CNN) for image feature extraction and a recurrent neural network (RNN) for language caption generation. To study the robustness of language grounding to adversarial perturbations in machine vision and perception, we propose Show-and-Fool, a novel algorithm for crafting adversarial examples in neural image captioning. The proposed algorithm provides two evaluation approaches, which check whether neural image captioning systems can be misled to output some randomly chosen captions or keywords. Our extensive experiments show that our algorithm can successfully craft visually similar adversarial examples with randomly targeted captions or keywords, and that the adversarial examples can be made highly transferable to other image captioning systems. Consequently, our approach leads to new robustness implications for neural image captioning and novel insights in visual language grounding.

1 Introduction

In recent years, language understanding grounded in machine vision and perception has made remarkable progress in natural language processing (NLP) and artificial intelligence (AI), in tasks such as image captioning and visual question answering. Image captioning is a multimodal learning task and has been used to study the interaction between language and vision models (Shekhar et al., 2017). It takes an image as an input and generates a language caption that best describes its visual contents, and has many important applications, such as developing image search engines with complex natural language queries, building AI agents that can see and talk, and promoting equal web access for people who are blind or visually impaired. Modern image captioning systems typically adopt an encoder-decoder framework composed of two principal modules: a convolutional neural network (CNN) as an encoder for image feature extraction and a recurrent neural network (RNN) as a decoder for caption generation. This CNN+RNN architecture includes popular image captioning models such as Show-and-Tell (Vinyals et al., 2015), Show-Attend-and-Tell (Xu et al., 2015) and NeuralTalk (Karpathy and Li, 2015).

Recent studies have highlighted the vulnerability of CNN-based image classifiers to adversarial examples: adversarial perturbations to benign images can be easily crafted to mislead a well-trained classifier while remaining visually indistinguishable to humans (Szegedy et al., 2014; Goodfellow et al., 2015). In this study, we investigate a more challenging problem in the visual language grounding domain: evaluating the robustness of multimodal RNNs in the form of a CNN+RNN architecture, using neural image captioning as a case study. Note that crafting adversarial examples in image captioning tasks is strictly harder than in well-studied image classification tasks, for the following reasons: (i) class attack vs. caption attack: unlike classification tasks where the class labels are well defined, the output of image captioning is a set of top-ranked captions. Simply treating different captions as distinct classes results in an enormous number of classes that can even exceed the number of training images.
In addition, semantically similar captions can be expressed in different ways and hence should not be viewed as different classes; and (ii) CNN vs. CNN+RNN: attacking RNN models is significantly less well-studied than attacking CNN models. The CNN+RNN architecture is unique and beyond the scope of adversarial examples in CNN-based image classifiers.

In this paper, we tackle the aforementioned challenges by proposing a novel algorithm called Show-and-Fool. We formulate the process of crafting adversarial examples in neural image captioning systems as optimization problems with novel objective functions designed for the CNN+RNN architecture. Specifically, our objective function is a linear combination of the distortion between benign and adversarial examples and carefully designed loss functions. The proposed Show-and-Fool algorithm provides two approaches to crafting adversarial examples in neural image captioning under different scenarios:

1. Targeted caption method: Given a targeted caption, craft adversarial perturbations to any image such that its generated caption matches the targeted caption.

2. Targeted keyword method: Given a set of keywords, craft adversarial perturbations to any image such that its generated caption contains the specified keywords. The captioning model has the freedom to make sentences with the target keywords in any order.

As an illustration, Figure 1 shows an adversarial example crafted by Show-and-Fool using the targeted caption method. The adversarial perturbations are visually imperceptible, yet they successfully mislead Show-and-Tell into generating the targeted captions. Interestingly, and perhaps surprisingly, our results pinpoint the Achilles' heel of the language and vision models used in the tested image captioning systems. Moreover, the adversarial examples in neural image captioning highlight the inconsistency in visual language grounding between humans and machines, suggesting a possible weakness of current machine vision and perception machinery.

Figure 1: Adversarial examples crafted by Show-and-Fool using the targeted caption method. The target captioning model is Show-and-Tell (Vinyals et al., 2015), the original images are selected from the MSCOCO validation set, and the targeted captions are randomly selected from the top-1 inferred captions of other validation images.

Below we highlight our major contributions:

• We propose Show-and-Fool, a novel optimization-based approach to crafting adversarial examples in image captioning. We provide two types of adversarial examples, targeted caption and targeted keyword, to analyze the robustness of neural image captioners. To the best of our knowledge, this is the very first work on crafting adversarial examples for image captioning.

• We propose powerful and generic loss functions that can craft adversarial examples and evaluate the robustness of encoder-decoder pipelines in the form of a CNN+RNN architecture. In particular, our loss designed for the targeted keyword attack only requires the adversarial caption to contain a few specified keywords; we allow the neural network to make meaningful sentences with these keywords on its own.

• We conduct extensive experiments on the MSCOCO dataset. Experimental results show that our targeted caption method attains a 95.8% attack success rate when crafting adversarial examples with randomly assigned captions. In addition, our targeted keyword attack yields an even higher success rate. We also show that attacking CNN+RNN models is inherently different from, and more challenging than, only attacking CNN models.
• We also show that Show-and-Fool can produce highly transferable adversarial examples: an adversarial image generated to fool Show-and-Tell can also fool other image captioning models, leading to new robustness implications for neural image captioning systems.

2 Related Work

In this section, we review the existing work on visual language grounding, with a focus on neural image captioning. We also review related work on adversarial attacks on CNN-based image classifiers; due to space limitations, we defer this second part to the supplementary material.

Visual language grounding represents a family of multimodal tasks that bridge visual and natural language understanding. Typical examples include image and video captioning (Karpathy and Li, 2015; Vinyals et al., 2015; Donahue et al., 2015b; Pasunuru and Bansal, 2017; Venugopalan et al., 2015), visual dialog (Das et al., 2017; De Vries et al., 2017), visual question answering (Antol et al., 2015; Fukui et al., 2016; Lu et al., 2016; Zhu et al., 2017), visual storytelling (Huang et al., 2016), natural question generation (Mostafazadeh et al., 2017, 2016), and image generation from captions (Mansimov et al., 2016; Reed et al., 2016). In this paper, we focus on studying the robustness of neural image captioning models, and believe that the proposed method also sheds light on robustness evaluation for other visual language grounding tasks that use a similar multimodal RNN architecture.

Many image captioning methods based on deep neural networks (DNNs) adopt a multimodal RNN framework that first uses a CNN model as the encoder to extract a visual feature vector, followed by an RNN model as the decoder for caption generation. Representative works under this framework include (Chen and Zitnick, 2015; Devlin et al., 2015; Donahue et al., 2015a; Karpathy and Li, 2015; Mao et al., 2015; Vinyals et al., 2015; Xu et al., 2015; Yang et al., 2016; Liu et al., 2017a,b), which differ mainly in the underlying CNN and RNN architectures and in whether attention mechanisms are used. Other lines of research generate image captions using semantic information or via a compositional approach (Fang et al., 2015; Gan et al., 2017; Tran et al., 2016; Jia et al., 2015; Wu et al., 2016; You et al., 2016).

The recent work of (Shekhar et al., 2017) touched upon the robustness of neural image captioning for language grounding by showing its insensitivity to one-word (foil word) changes in the language caption, which corresponds to the untargeted attack category of adversarial examples. In this paper, we focus on the more challenging targeted attack setting, which requires fooling the captioning model and forcing it to generate pre-specified captions or keywords.

3 Methodology of Show-and-Fool

3.1 Overview of the Objective Functions

We now formally introduce our approaches to crafting adversarial examples for neural image captioning. The problem of finding an adversarial example for a given image I can be cast as the following optimization problem:

    min_δ   c · loss(I + δ) + ‖δ‖₂²
    s.t.    I + δ ∈ [−1, 1]ⁿ.                                         (1)

Here δ denotes the adversarial perturbation to I, and ‖δ‖₂² = ‖(I + δ) − I‖₂² is an ℓ₂ distance metric between the original image and the adversarial image. loss(·) is an attack loss function which takes different forms in different attack settings; we give the explicit expressions in Sections 3.2 and 3.3. The term c > 0 is a pre-specified regularization constant. Intuitively, with larger c the attack is more likely to succeed, but at the price of higher distortion δ. In our algorithm, we use a binary search strategy to select c. The box constraint I + δ ∈ [−1, 1]ⁿ ensures that the adversarial example lies within a valid image space.

For the purpose of efficient optimization, we convert the constrained minimization problem in (1) into an unconstrained minimization problem by introducing two new variables y ∈ Rⁿ and w ∈ Rⁿ such that

    y = arctanh(I)   and   w = arctanh(I + δ) − y,

where arctanh denotes the inverse hyperbolic tangent function, applied element-wise. Since tanh(yᵢ + wᵢ) ∈ [−1, 1], the transformation automatically satisfies the box constraint. Consequently, the constrained optimization problem in (1) is equivalent to

    min_{w ∈ Rⁿ}   c · loss(tanh(w + y)) + ‖tanh(w + y) − tanh(y)‖₂².      (2)
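To make the reparameterized objective in (2) concrete, the following is a minimal sketch of the optimization loop in TensorFlow. It is our illustration, not the released Show-and-Fool implementation: caption_loss stands for any differentiable attack loss from Sections 3.2 or 3.3, and the clipping before arctanh is a numerical-stability detail we add.

```python
import tensorflow as tf

def show_and_fool_attack(image, caption_loss, c=1.0, steps=1000, lr=0.005):
    """Sketch of Eq. (2): optimize the perturbation in tanh-space with ADAM.

    `image` is a tensor scaled to [-1, 1]; `caption_loss` is assumed to be a
    differentiable function mapping the perturbed image to an attack loss.
    """
    y = tf.atanh(tf.clip_by_value(image, -0.9999, 0.9999))  # y = arctanh(I)
    w = tf.Variable(tf.zeros_like(image))                   # w = 0 means no perturbation
    opt = tf.keras.optimizers.Adam(learning_rate=lr)

    for _ in range(steps):
        with tf.GradientTape() as tape:
            adv = tf.tanh(w + y)                             # I + delta, always in [-1, 1]
            distortion = tf.reduce_sum(tf.square(adv - tf.tanh(y)))
            obj = c * caption_loss(adv) + distortion         # objective of Eq. (2)
        grads = tape.gradient(obj, [w])
        opt.apply_gradients(zip(grads, [w]))

    return tf.tanh(w + y)                                    # adversarial image
```

In practice this routine would be wrapped in the binary search over c described in Section 4.1.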
In the following sections, we present our designed loss functions for the different attack settings.

3.2 Targeted Caption Method

A targeted caption is denoted by

    S = (S_1, S_2, ..., S_t, ..., S_N),

where S_t indicates the index of the t-th word in the vocabulary list V, S_1 is the start symbol and S_N is the end symbol. N is the length of caption S, which is not fixed but does not exceed a predefined maximum caption length. To encourage the neural image captioning system to output the targeted caption S, one needs to ensure that the log probability of the caption S conditioned on the image I + δ attains the maximum value among all possible captions, that is,

    log P(S | I + δ) = max_{S′ ∈ Ω} log P(S′ | I + δ),                     (3)

where Ω is the set of all possible captions. It is also common to apply the chain rule to the joint probability, which gives

    log P(S′ | I + δ) = Σ_{t=2}^{N} log P(S′_t | I + δ, S′_1, ..., S′_{t−1}).

In neural image captioning networks, P(S′_t | I + δ, S′_1, ..., S′_{t−1}) is usually computed by an RNN/LSTM cell f, with its hidden state h_{t−1} and input S′_{t−1}:

    z_t = f(h_{t−1}, S′_{t−1})   and   p_t = softmax(z_t),                 (4)

where z_t := [z_t^(1), z_t^(2), ..., z_t^(|V|)] ∈ R^|V| is a vector of the logits (unnormalized probabilities) for each possible word in the vocabulary. The vector p_t represents a probability distribution on V with each coordinate p_t^(i) defined as

    p_t^(i) := P(S′_t = i | I + δ, S′_1, ..., S′_{t−1}).

Following the definition of the softmax function,

    P(S′_t | I + δ, S′_1, ..., S′_{t−1}) = exp(z_t^(S′_t)) / Σ_{i ∈ V} exp(z_t^(i)).

Intuitively, to maximize the targeted caption's probability, we can directly use its negative log probability (5) as a loss function. The inputs of the RNN are the first N − 1 words of the targeted caption (S_1, S_2, ..., S_{N−1}):

    loss_{S,log-prob}(I + δ) = − log P(S | I + δ)
                             = − Σ_{t=2}^{N} log P(S_t | I + δ, S_1, ..., S_{t−1}).      (5)

Applying (5) to (2), the formulation of the targeted caption method given a targeted caption S is

    min_{w ∈ Rⁿ}   c · loss_{S,log-prob}(tanh(w + y)) + ‖tanh(w + y) − tanh(y)‖₂².

Alternatively, using the definition of the softmax function,

    log P(S′ | I + δ) = Σ_{t=2}^{N} [ z_t^(S′_t) − log( Σ_{i ∈ V} exp(z_t^(i)) ) ]
                      = Σ_{t=2}^{N} z_t^(S′_t) − constant,                 (6)

so (3) can be simplified as

    log P(S | I + δ) ∝ Σ_{t=2}^{N} z_t^(S_t) = max_{S′ ∈ Ω} Σ_{t=2}^{N} z_t^(S′_t).

Instead of making each z_t^(S_t) as large as possible, it is sufficient to require the target word S_t to attain the largest (top-1) logit (or probability) among all the words in the vocabulary at position t. In other words, we aim to minimize the difference between the maximum logit excluding S_t, denoted by max_{k ∈ V, k ≠ S_t} {z_t^(k)}, and the logit of S_t, denoted by z_t^(S_t). We also apply a ramp function on top of this difference to obtain the final loss function:

    loss_{S,logits}(I + δ) = Σ_{t=2}^{N−1} max{ −ε, max_{k ≠ S_t} {z_t^(k)} − z_t^(S_t) },      (7)

where ε > 0 is a confidence level accounting for the gap between max_{k ≠ S_t} {z_t^(k)} and z_t^(S_t). When z_t^(S_t) > max_{k ≠ S_t} {z_t^(k)} + ε, the corresponding term in the summation is kept at −ε and does not contribute to the gradient of the loss function, encouraging the optimizer to focus on minimizing the other terms where z_t^(S_t) is not yet large enough.
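To illustrate how the ramp loss (7) can be computed from the decoder outputs, the sketch below is our illustration, not the paper's released code; it assumes the logits z_t for all positions have been collected into a single tensor by teacher-forcing the targeted caption.

```python
import tensorflow as tf

def targeted_caption_logits_loss(logits, target_ids, eps=1.0):
    """Sketch of the ramp loss in Eq. (7).

    logits:     float tensor [N, |V|], logits z_t obtained by feeding the
                targeted caption to the RNN (teacher forcing).
    target_ids: int tensor [N], the targeted caption word indices S_t.
    eps:        confidence margin (epsilon), default 1 as in Section 4.1.
    """
    vocab_size = tf.shape(logits)[-1]
    target_logit = tf.gather(logits, target_ids, batch_dims=1)         # z_t^(S_t)
    # Mask the target word with a large negative value before the maximum,
    # so max_other is the best non-target logit at each position.
    mask = tf.one_hot(target_ids, vocab_size, on_value=-1e9, off_value=0.0)
    max_other = tf.reduce_max(logits + mask, axis=-1)                   # max_{k != S_t} z_t^(k)
    hinge = tf.maximum(-eps, max_other - target_logit)                  # ramp on the gap
    # The paper sums over intermediate positions t = 2, ..., N-1.
    return tf.reduce_sum(hinge[1:-1])
```

For the log-prob variant (5), the hinge term would simply be replaced by the per-position cross-entropy of the targeted word.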
Applying the loss (7) to (2), the final formulation of the targeted caption method given a targeted caption S is

    min_{w ∈ Rⁿ}   c · Σ_{t=2}^{N−1} max{ −ε, max_{k ≠ S_t} {z_t^(k)} − z_t^(S_t) } + ‖tanh(w + y) − tanh(y)‖₂².

We note that (Carlini and Wagner, 2017) reported that in CNN-based image classification, using logits in the attack loss function can produce better adversarial examples than using probabilities, especially when the target network deploys gradient masking schemes such as defensive distillation (Papernot et al., 2016b). Therefore, we provide both logit-based and probability-based attack loss functions for neural image captioning.

3.3 Targeted Keyword Method

In addition to generating an exact targeted caption by perturbing the input image, we offer an intermediate option that aims at generating captions with specific keywords, denoted by K := {K_1, ..., K_M} ⊂ V. Intuitively, finding an adversarial image that generates a caption with specific keywords might be easier than generating an exact caption, as we allow more degrees of freedom in caption generation. However, since we need to ensure a valid and meaningful inferred caption, finding an adversarial example with specific keywords in its caption is difficult from an optimization perspective. Our targeted keyword method can be used to investigate the generalization capability of a neural captioning system given only a few keywords.

In our method, we do not require a target keyword K_j, j ∈ [M], to appear at a particular position. Instead, we want a loss function that allows K_j to become the top-1 prediction (plus a confidence margin ε) at any position. Therefore, we propose to use the minimum of the hinge-like loss terms over all t ∈ [N] as an indication of K_j appearing at some position as the top-1 prediction, leading to the following loss function:

    loss_{K,logits}(I + δ) = Σ_{j=1}^{M} min_{t ∈ [N]} { max{ −ε, max_{k ≠ K_j} {z_t^(k)} − z_t^(K_j) } }.      (8)

We note that the loss functions in (4) and (5) require an input S′_{t−1} to predict z_t for each t ∈ {2, ..., N}. For the targeted caption method, we use the targeted caption S as the input of the RNN. In contrast, for the targeted keyword method we no longer know the exact targeted sentence, but only require the presence of the specified keywords in the final caption. To bridge this gap, we use the originally inferred caption S⁰ = (S⁰_1, ..., S⁰_N) from the benign image as the initial input to the RNN. Specifically, after minimizing (8) for T iterations, we run inference on I + δ, set the RNN input S¹ to its current top-1 prediction, and continue this process. With this iterative optimization process, the desired keywords are expected to gradually appear in the top-1 prediction.

Another challenge that arises in the targeted keyword method is the problem of "keyword collision". When the number of keywords M ≥ 2, more than one keyword may have a large value of max_{k ≠ K_j} {z_t^(k)} − z_t^(K_j) at the same position t. For example, if dog and cat are the top-2 predictions for the second word in a caption, the caption can either start with "A dog ..." or "A cat ...". In this case, even if the loss (8) is very small, a caption containing both dog and cat can hardly be generated, since only one word is allowed to appear at a given position. To alleviate this problem, we define a gate function g_{t,j}(x) which masks off all the other keywords when some keyword becomes top-1 at position t:

    g_{t,j}(x) = { A,  if argmax_{i ∈ V} z_t^(i) ∈ K \ {K_j}
                 { x,  otherwise,

where A is a predefined value that is significantly larger than common logit values. Then (8) becomes

    Σ_{j=1}^{M} min_{t ∈ [N]} { g_{t,j}( max{ −ε, max_{k ≠ K_j} {z_t^(k)} − z_t^(K_j) } ) }.      (9)

The log-prob loss for the targeted keyword method is discussed in the supplementary material.
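A sketch of the gated keyword loss (9) is given below. It is our illustration under the assumption that the logits along the current caption are available as one tensor; gate_value plays the role of the constant A, and keyword_ids is an int32 tensor of the keyword indices.

```python
import tensorflow as tf

def targeted_keyword_logits_loss(logits, keyword_ids, eps=1.0, gate_value=1e4):
    """Sketch of the gated keyword loss in Eq. (9).

    logits:      float tensor [N, |V|], logits z_t along the current caption.
    keyword_ids: int32 tensor [M], vocabulary indices of the keywords K_j.
    gate_value:  the constant A, assumed much larger than typical logits.
    """
    vocab_size = tf.shape(logits)[-1]
    top1 = tf.argmax(logits, axis=-1, output_type=tf.int32)             # current top-1 word per position
    total = 0.0
    for j in range(int(keyword_ids.shape[0])):
        kj = keyword_ids[j]
        mask = tf.one_hot(kj, vocab_size, on_value=-1e9, off_value=0.0)
        max_other = tf.reduce_max(logits + mask, axis=-1)                # max_{k != K_j} z_t^(k)
        target_logit = tf.gather(logits, kj, axis=-1)                    # z_t^(K_j)
        hinge = tf.maximum(-eps, max_other - target_logit)               # hinge term per position t
        # Gate g_{t,j}: if a *different* keyword already won position t, mask this term.
        other_keyword_top1 = tf.logical_and(
            tf.reduce_any(tf.equal(top1[:, None], keyword_ids[None, :]), axis=-1),
            tf.not_equal(top1, kj))
        gated = tf.where(other_keyword_top1, gate_value * tf.ones_like(hinge), hinge)
        total += tf.reduce_min(gated)                                    # min over positions t
    return total
```

The iterative re-inference of the RNN input described above (every T iterations) would sit in the outer attack loop around this loss.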
4 Experiments

4.1 Experimental Setup and Algorithms

We performed extensive experiments to test the effectiveness of our Show-and-Fool algorithm and to study the robustness of image captioning systems under different problem settings. In our experiments¹, we use the pre-trained TensorFlow implementation² of Show-and-Tell (Vinyals et al., 2015) with Inception-v3 as the CNN for visual feature extraction. Our testbed is the Microsoft COCO (Lin et al., 2014) (MSCOCO) dataset. Although some more recent neural image captioning systems can achieve better performance than Show-and-Tell, they share a similar framework that uses a CNN for feature extraction and an RNN for caption generation, and Show-and-Tell is the vanilla version of this CNN+RNN architecture. Indeed, we find that the adversarial examples on Show-and-Tell are transferable to other image captioning models such as Show-Attend-and-Tell (Xu et al., 2015) and NeuralTalk2³, suggesting that the attention mechanism and the choice of CNN and RNN architectures do not significantly affect the robustness. We also note that since Show-and-Fool is, to the best of our knowledge, the first work on crafting adversarial examples for neural image captioning, there is no other method for comparison.

¹ Our source code is available at: https://github.com/huanzhang12/ImageCaptioningAttack
² https://github.com/tensorflow/models/tree/master/research/im2txt
³ https://github.com/karpathy/neuraltalk2

We use ADAM to minimize our loss functions and set the learning rate to 0.005. The number of iterations is set to 1,000. All the experiments are performed on a single Nvidia GTX 1080 Ti GPU. For the targeted caption and targeted keyword methods, we perform a binary search over 5 trials to find the best c: initially c = 1, and c is increased by a factor of 10 until a successful adversarial example is found. Then, we choose a new c to be the average of the largest c where an adversarial example can be found and the smallest c where an adversarial example cannot be found. We fix ε = 1 except for the transferability experiments. For each experiment, we randomly select 1,000 images from the MSCOCO validation set. We use BLEU-1 (Papineni et al., 2002), BLEU-2, BLEU-3, BLEU-4, ROUGE (Lin, 2004) and METEOR (Lavie and Agarwal, 2005) scores to evaluate the correlation between the inferred captions and the targeted captions. These scores are widely used in the NLP community and are adopted by image captioning systems for quality assessment. Throughout this section, we use the logits losses (7) and (9); the results using the log-prob loss (5) are similar and are reported in the supplementary material.
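One plausible reading of this c-selection procedure is sketched below; run_attack is a hypothetical helper (not the paper's API) that runs the 1,000-iteration ADAM attack at a given c and reports whether it succeeded. The exact averaging rule may differ slightly from the paper's wording; this sketch uses a standard bisection between the last failing and the smallest succeeding c.

```python
def binary_search_c(run_attack, c_init=1.0, factor=10.0, trials=5):
    """Sketch of a binary search over the regularization constant c (Sec. 4.1).

    run_attack(c) is assumed to return (success, adversarial_image).
    """
    c = c_init
    c_fail, c_success = None, None
    best = None
    for _ in range(trials):
        success, adv = run_attack(c)
        if success:
            best, c_success = adv, c
        else:
            c_fail = c
        if c_success is None:
            c *= factor                      # no success yet: keep increasing c
        elif c_fail is None:
            c /= factor                      # never failed: try a smaller c for less distortion
        else:
            c = 0.5 * (c_fail + c_success)   # bisect between failure and success
    return best
```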
4.2 Targeted Caption Results

Unlike the image classification task where all possible labels are predefined, the space of possible captions in a captioning system is almost infinite. However, the captioning system is only able to output relevant captions learned from the training set. For instance, the captioning model cannot generate a passive-voice sentence if it was never trained on such sentences. Therefore, we need to ensure that the targeted caption lies in the space of captions that the system can possibly generate. To address this issue, we use the generated caption of a randomly selected image (other than the image under investigation) from the MSCOCO validation set as the targeted caption S. The use of a generated caption as the targeted caption excludes the effect of out-of-domain captioning, and ensures that the target caption is within the output space of the captioning network.

Here we use the logits loss (7) plus an ℓ₂ distortion term (as in (2)) as our objective function. A successful adversarial example is found if the inferred caption after adding the adversarial perturbation δ is exactly the same as the targeted caption. In our setting, 1,000 ADAM iterations take about 38 seconds per image. The overall success rate and average distortion of the adversarial perturbation δ are shown in Table 1. Among all the tested images, our method attains a 95.8% attack success rate. Moreover, our adversarial examples have small ℓ₂ distortions and are visually identical to the original images, as displayed in Figure 1. We also examine the failed adversarial examples and summarize their statistics in Table 2. We find that their generated captions, albeit not entirely identical to the targeted caption, are in fact highly correlated with the desired one. Overall, the high success rate and low ℓ₂ distortion of adversarial examples clearly show that Show-and-Tell is not robust to targeted adversarial perturbations.

Table 1: Summary of the targeted caption method (Section 3.2) and the targeted keyword method (Section 3.3) using the logits loss. The ℓ₂ distortion of the adversarial noise ‖δ‖₂ is averaged over successful adversarial examples. For comparison, we also include CNN-based attack methods (Section 4.5).

    Experiments         Success Rate    Avg. ‖δ‖₂
    targeted caption        95.8%          2.213
    1-keyword               97.1%          1.589
    2-keyword               97.5%          2.363
    3-keyword               96.0%          2.626
    C&W on CNN              22.4%          2.870
    I-FGSM on CNN           34.5%         15.596

Table 2: Statistics of the 4.2% failed adversarial examples using the targeted caption method and the logits loss (7). All correlation scores are computed using the top-5 inferred captions of an adversarial image and the targeted caption (a higher score means better targeted attack performance).

    c                  1       10      10²     10³     10⁴
    ℓ₂ Distortion    1.726    3.400   7.690   16.03   23.31
    BLEU-1            .567     .725    .679    .701    .723
    BLEU-2            .420     .614    .559    .585    .616
    BLEU-3            .320     .509    .445    .484    .514
    BLEU-4            .252     .415    .361    .402    .417
    ROUGE             .502     .664    .629    .638    .672
    METEOR            .258     .407    .375    .403    .399

4.3 Targeted Keyword Results

In this task, we use (9) as our loss function and choose the number of keywords M ∈ {1, 2, 3}. We run an inference step on I + δ every T = 5 iterations and use the top-1 caption as the input of the RNN/LSTM. As in Section 4.2, for each image the targeted keywords are selected from the caption generated by a randomly selected validation set image. To exclude common words like "a", "the" and "and", we look up each word in the targeted sentence and only select nouns, verbs, adjectives or adverbs. We say an adversarial image is successful when its caption contains all specified keywords. The overall success rate and average distortion are shown in Table 1. Compared to the targeted caption method, the targeted keyword method achieves an even higher success rate (at least 96% for the 3-keyword case and at least 97% for the 1-keyword and 2-keyword cases). Figure 2 shows an adversarial example crafted by our targeted keyword method with three keywords: "dog", "cat" and "frisbee". Using Show-and-Fool, the top-1 caption of a cake image becomes "A dog and a cat are playing with a frisbee" while the adversarial image remains visually indistinguishable from the original one.

Figure 2: An adversarial example (‖δ‖₂ = 1.284) of a cake image crafted by the Show-and-Fool targeted keyword method with three keywords: "dog", "cat" and "frisbee".

When M = 2 or 3, even if we cannot find an adversarial image yielding all specified keywords, we might end up with a caption that contains some of the keywords (partial success). For example, when M = 3, Table 3 shows the number of keywords appearing in the captions (M′) for the failed examples where not all 3 targeted keywords are found. These results clearly show that the 4% failed examples are still partially successful: the generated captions contain about 1.5 targeted keywords on average.

Table 3: Percentage of partial success with different c among the 4.0% failed images that do not contain all 3 targeted keywords.

    c       Avg. ‖δ‖₂    M′ ≥ 1     M′ = 2    Avg. M′
    1          2.49       72.4%      34.5%      1.07
    10         5.40       82.7%      37.9%      1.21
    10²       12.95       93.1%      58.6%      1.52
    10³       24.77       96.5%      51.7%      1.48
    10⁴       29.37      100.0%      58.6%      1.59

4.4 Transferability of Adversarial Examples

It has been shown that in image classification tasks, adversarial examples found for one machine learning model may also be effective against another model, even if the two models have different architectures (Papernot et al., 2016a; Liu et al., 2017c). However, unlike image classification where correct labels are made explicit, two different image captioning systems may generate quite different, yet semantically similar, captions for the same benign image. In image captioning, we say an adversarial example is transferable when the adversarial image found on model A with a target sentence S_A generates a similar (rather than exact) sentence S_B on model B.

In our setting, model A is Show-and-Tell, and we choose Show-Attend-and-Tell (Xu et al., 2015) as model B. The major differences between Show-and-Tell and Show-Attend-and-Tell are the addition of attention units in the LSTM network for caption generation, and the use of the last convolutional layer (rather than the last fully-connected layer) feature maps for feature extraction. We use Inception-v3 as the CNN architecture for both models and train them on the MSCOCO 2014 dataset. However, their CNN parameters differ due to the fine-tuning process.
Figure 3: A highly transferable adversarial example (‖δ‖₂ = 15.226) crafted by the Show-and-Fool targeted caption method on Show-and-Tell; it transfers to Show-Attend-and-Tell, yielding similar adversarial captions.

To investigate the transferability of adversarial examples in image captioning, we first use the targeted caption method to find adversarial examples for 1,000 images on model A with different c and ε, and then transfer the successful adversarial examples (those which generate the exact target captions on model A) to model B. The captions generated by model B are recorded for transferability analysis. The transferability of adversarial examples depends on two factors: the intrinsic difference between the two models even when the same benign image is used as the input, i.e., model mismatch, and the transferability of the adversarial perturbations.

To measure the mismatch between Show-and-Tell and Show-Attend-and-Tell, we generate captions for the same set of 1,000 original images from both models, and report their mutual BLEU, ROUGE and METEOR scores in Table 4 under the mis column. To evaluate the effectiveness of transferred adversarial examples, we measure the scores for two sets of captions: (i) the captions of the original images and the captions of the transferred adversarial images, both generated by Show-Attend-and-Tell (shown under column ori in Table 4); and (ii) the targeted captions used to generate adversarial examples on Show-and-Tell, and the captions of the transferred adversarial images on Show-Attend-and-Tell (shown under column tgt in Table 4). Small values of ori suggest that the transferred adversarial images make Show-Attend-and-Tell generate captions significantly different from those of the original images. Large values of tgt suggest that the transferred adversarial images generate adversarial captions on Show-Attend-and-Tell similar to those on the Show-and-Tell model.

Table 4: Transferability of adversarial examples from Show-and-Tell to Show-Attend-and-Tell, using different ε and c. ori indicates the scores between the generated captions of the original images and of the transferred adversarial images on Show-Attend-and-Tell. tgt indicates the scores between the targeted captions on Show-and-Tell and the generated captions of the transferred adversarial images on Show-Attend-and-Tell. A smaller ori or a larger tgt value indicates better transferability. mis measures the differences between captions generated by the two models given the same benign image (model mismatch). When C = 1000 and ε = 10, tgt is close to mis, indicating that the discrepancy between adversarial captions on the two models is mostly bounded by model mismatch, and the adversarial perturbation is highly transferable.

                        ε = 1                           ε = 5                           ε = 10
            C=10       C=100      C=1000      C=10       C=100      C=1000      C=10       C=100      C=1000
           ori  tgt   ori  tgt   ori  tgt    ori  tgt   ori  tgt   ori  tgt    ori  tgt   ori  tgt   ori  tgt    mis
  BLEU-1  .474 .395  .384 .462  .347 .484   .441 .429  .368 .488  .337 .527   .431 .421  .360 .485  .339 .534   .649
  BLEU-2  .337 .236  .230 .331  .186 .342   .300 .271  .212 .343  .175 .389   .287 .266  .204 .342  .174 .398   .521
  BLEU-3  .256 .154  .151 .224  .114 .254   .220 .184  .135 .254  .103 .299   .210 .185  .131 .254  .102 .307   .424
  BLEU-4  .203 .109  .107 .172  .077 .198   .170 .134  .093 .197  .068 .240   .162 .138  .094 .197  .066 .245   .352
  ROUGE   .463 .371  .374 .438  .336 .465   .429 .402  .359 .464  .329 .502   .421 .398  .351 .463  .328 .507   .604
  METEOR  .201 .138  .139 .180  .118 .201   .177 .157  .131 .199  .110 .228   .172 .157  .127 .202  .110 .232   .300
  ‖δ‖₂      3.268      4.299      4.474       7.756      10.487     10.952      15.757     21.696     21.778

We find that increasing c or ε helps to enhance transferability at the cost of larger (but still acceptable) distortion. When C = 1,000 and ε = 10, Show-and-Fool achieves the best transferability results: tgt is close to mis, indicating that the discrepancy between adversarial captions on the two models is mostly bounded by the intrinsic model mismatch rather than by the transferability of the adversarial perturbations, and implying that the adversarial perturbations are easily transferable. In addition, the adversarial examples generated by our method can also fool NeuralTalk2. When c = 10⁴ and ε = 10, the average ℓ₂ distortion, BLEU-4 and METEOR scores between the original and transferred adversarial captions are 38.01, 0.440 and 0.473, respectively.
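The ori, tgt and mis entries are ordinary caption-similarity scores. Purely as an illustration (the paper does not specify its evaluation code), a mutual BLEU-n score between two captions could be computed with NLTK as follows:

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def mutual_bleu(reference_caption, candidate_caption, n=1):
    """Illustrative BLEU-n between two captions (e.g., for the ori/tgt/mis columns).

    Both captions are plain strings; this is a sketch, not the paper's evaluation code.
    """
    reference = [reference_caption.lower().split()]
    candidate = candidate_caption.lower().split()
    weights = tuple(1.0 / n for _ in range(n))          # uniform n-gram weights for BLEU-n
    return sentence_bleu(reference, candidate, weights=weights,
                         smoothing_function=SmoothingFunction().method1)

# Example: BLEU-1 between a targeted caption and a transferred adversarial caption.
score = mutual_bleu("a dog and a cat are playing with a frisbee",
                    "a dog is playing with a frisbee in a field", n=1)
```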
The high transferability of adversarial examples crafted by Show-and-Fool also indicates the problem of common robustness leakage between different neural image captioning models.

4.5 Attacking Image Captioning vs. Attacking Image Classification

In this section we show that attacking image captioning models is inherently more challenging than attacking image classification models. In the classification task, a targeted attack usually becomes harder as the number of labels increases, since an attack method needs to change the classification prediction to a specific label out of all the possible labels. In a targeted attack on image captioning, if we treat each caption as a label, we need to change the original label to a specific one out of an almost infinite number of possible labels, corresponding to a nearly zero volume in the search space. This constraint forces us to develop non-trivial methods that are significantly different from those designed for attacking image classification models.

To verify that the two tasks are inherently different, we conducted additional experiments on attacking only the CNN module, using two state-of-the-art image classification attacks on the ImageNet dataset. Our experimental setup is as follows. Each selected ImageNet image has a label corresponding to a WordNet synset ID. We randomly selected 800 images from the ImageNet dataset such that their synsets have at least one word in common with Show-and-Tell's vocabulary, while ensuring that the Inception-v3 CNN (Show-and-Tell's CNN) classifies them correctly. Then, we performed the Iterative Fast Gradient Sign Method (I-FGSM) (Kurakin et al., 2017) and Carlini and Wagner's (C&W) attack (Carlini and Wagner, 2017) on these images. The attack target labels are randomly chosen, and their synsets also have at least one word in common with Show-and-Tell's vocabulary. Both I-FGSM and C&W achieve a 100% targeted attack success rate on the Inception-v3 CNN. These adversarial examples were then employed to attack the Show-and-Tell model. An attack is considered successful if any word in the targeted label's synset, or in its hypernyms up to 5 levels, is present in the resulting caption. For example, for the chain of hypernyms 'broccoli' ⇒ 'cruciferous vegetable' ⇒ 'vegetable, veggie, veg' ⇒ 'produce, green goods, green groceries, garden truck' ⇒ 'food, solid food', we include 'broccoli', 'cruciferous', 'vegetable', 'veggie' and all other following words. Note that this criterion of success is much weaker than the criterion used in the targeted caption method, since a caption containing the targeted image's hypernyms does not necessarily convey a meaning similar to the targeted image's captions. To achieve higher attack success rates, we allow relatively larger distortions and set ε∞ = 0.3 (maximum ℓ∞ distortion) in I-FGSM and κ = 10, C = 100 in C&W. However, as shown in Table 1, the attack success rates are only 34.5% for I-FGSM and 22.4% for C&W, which are much lower than the success rates of our methods despite the larger distortions. This result further confirms that performing targeted attacks on neural image captioning requires a careful design (as proposed in this paper), and that attacking image captioning systems is not a trivial extension of attacking image classifiers.
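To make the hypernym-based success criterion concrete, the sketch below (our illustration, using NLTK's WordNet interface; not the paper's evaluation code) collects the words that count as a hit for a given target synset and checks a generated caption against them.

```python
from nltk.corpus import wordnet as wn

def acceptable_words(synset, max_levels=5):
    """Words counted as a hit under the Sec. 4.5 criterion (sketched): all lemma
    words of the target synset and of its hypernyms up to `max_levels`."""
    words, frontier = set(), {synset}
    for _ in range(max_levels + 1):
        next_frontier = set()
        for s in frontier:
            for lemma in s.lemma_names():
                words.update(lemma.lower().replace("_", " ").split())
            next_frontier.update(s.hypernyms())
        frontier = next_frontier
    return words

def caption_hits_target(caption, synset, max_levels=5):
    """True if any acceptable word appears in the generated caption."""
    return bool(set(caption.lower().split()) & acceptable_words(synset, max_levels))

# Example with the 'broccoli' hypernym chain discussed above.
print(caption_hits_target("a plate of broccoli and meat on a table",
                          wn.synset("broccoli.n.01")))
```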
References

Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. 2015. VQA: Visual question answering. In Proceedings of the IEEE International Conference on Computer Vision, pages 2425–2433.

Nicholas Carlini and David Wagner. 2017. Towards evaluating the robustness of neural networks. In IEEE Symposium on Security and Privacy, pages 39–57.

Pin-Yu Chen, Yash Sharma, Huan Zhang, Jinfeng Yi, and Cho-Jui Hsieh. 2018. EAD: Elastic-net attacks to deep neural networks via adversarial examples. AAAI.

Pin-Yu Chen, Huan Zhang, Yash Sharma, Jinfeng Yi, and Cho-Jui Hsieh. 2017. ZOO: Zeroth order optimization based black-box attacks to deep neural networks without training substitute models. In Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security (AISec@CCS), pages 15–26.

Xinlei Chen and C. Lawrence Zitnick. 2015. Mind's eye: A recurrent visual representation for image caption generation. In CVPR, pages 2422–2431.

Abhishek Das, Satwik Kottur, Khushi Gupta, Avi Singh, Deshraj Yadav, José M. F. Moura, Devi Parikh, and Dhruv Batra. 2017. Visual dialog. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 2.

Harm De Vries, Florian Strub, Sarath Chandar, Olivier Pietquin, Hugo Larochelle, and Aaron Courville. 2017. GuessWhat?! Visual object discovery through multi-modal dialogue. In CVPR.

Jacob Devlin, Hao Cheng, Hao Fang, Saurabh Gupta, Li Deng, Xiaodong He, Geoffrey Zweig, and Margaret Mitchell. 2015. Language models for image captioning: The quirks and what works. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pages 100–105.

Jeff Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Trevor Darrell, and Kate Saenko. 2015a. Long-term recurrent convolutional networks for visual recognition and description. In CVPR, pages 2625–2634.

Jeffrey Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, and Trevor Darrell. 2015b. Long-term recurrent convolutional networks for visual recognition and description. In CVPR, pages 2625–2634.

Hao Fang, Saurabh Gupta, Forrest Iandola, Rupesh K. Srivastava, Li Deng, Piotr Dollár, Jianfeng Gao, Xiaodong He, Margaret Mitchell, John C. Platt, et al. 2015. From captions to visual concepts and back. In CVPR, pages 1473–1482.

Akira Fukui, Dong Huk Park, Daylen Yang, Anna Rohrbach, Trevor Darrell, and Marcus Rohrbach. 2016. Multimodal compact bilinear pooling for visual question answering and visual grounding. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 457–468.

Zhe Gan, Chuang Gan, Xiaodong He, Yunchen Pu, Kenneth Tran, Jianfeng Gao, Lawrence Carin, and Li Deng. 2017. Semantic compositional networks for visual captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5630–5639.

Ian J. Goodfellow, Jonathon Shlens, and Christian Szegedy. 2015. Explaining and harnessing adversarial examples. ICLR; arXiv preprint arXiv:1412.6572.

Ting-Hao Kenneth Huang, Francis Ferraro, Nasrin Mostafazadeh, Ishan Misra, Aishwarya Agrawal, Jacob Devlin, Ross Girshick, Xiaodong He, Pushmeet Kohli, Dhruv Batra, et al. 2016. Visual storytelling. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), pages 1233–1239.

Xu Jia, Efstratios Gavves, Basura Fernando, and Tinne Tuytelaars. 2015. Guiding the long-short term memory model for image caption generation. In Computer Vision (ICCV), 2015 IEEE International Conference on, pages 2407–2415. IEEE.

Andrej Karpathy and Fei-Fei Li. 2015. Deep visual-semantic alignments for generating image descriptions. In CVPR, pages 3128–3137.

Alexey Kurakin, Ian Goodfellow, and Samy Bengio. 2017. Adversarial machine learning at scale. ICLR; arXiv preprint arXiv:1611.01236.

Alon Lavie and Abhaya Agarwal. 2005. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the EMNLP 2011 Workshop on Statistical Machine Translation, pages 65–72.

Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out: Proceedings of the ACL-04 Workshop, volume 8. Barcelona, Spain.

Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. 2014. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pages 740–755. Springer.

Chenxi Liu, Junhua Mao, Fei Sha, and Alan L. Yuille. 2017a. Attention correctness in neural image captioning. In AAAI, pages 4176–4182.

Feng Liu, Tao Xiang, Timothy M. Hospedales, Wankou Yang, and Changyin Sun. 2017b. Semantic regularisation for recurrent image annotation. CVPR.

Yanpei Liu, Xinyun Chen, Chang Liu, and Dawn Song. 2017c. Delving into transferable adversarial examples and black-box attacks. ICLR; arXiv preprint arXiv:1611.02770.

Jiasen Lu, Jianwei Yang, Dhruv Batra, and Devi Parikh. 2016. Hierarchical question-image co-attention for visual question answering. In Advances in Neural Information Processing Systems (NIPS), pages 289–297.

Elman Mansimov, Emilio Parisotto, Jimmy Lei Ba, and Ruslan Salakhutdinov. 2016. Generating images from captions with attention. ICLR; arXiv preprint arXiv:1511.02793.

Junhua Mao, Wei Xu, Yi Yang, Jiang Wang, and Alan L. Yuille. 2015. Deep captioning with multimodal recurrent neural networks (m-RNN). ICLR; arXiv preprint arXiv:1412.6632.

Seyed-Mohsen Moosavi-Dezfooli, Alhussein Fawzi, Omar Fawzi, and Pascal Frossard. 2017. Universal adversarial perturbations. In CVPR.

Nasrin Mostafazadeh, Chris Brockett, Bill Dolan, Michel Galley, Jianfeng Gao, Georgios Spithourakis, and Lucy Vanderwende. 2017. Image-grounded conversations: Multimodal context for natural question and response generation. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 462–472.

Nasrin Mostafazadeh, Ishan Misra, Jacob Devlin, Margaret Mitchell, Xiaodong He, and Lucy Vanderwende. 2016. Generating natural questions about an image. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1802–1813.

Nicolas Papernot, Patrick McDaniel, and Ian Goodfellow. 2016a. Transferability in machine learning: From phenomena to black-box attacks using adversarial samples. arXiv preprint arXiv:1605.07277.

Nicolas Papernot, Patrick McDaniel, Xi Wu, Somesh Jha, and Ananthram Swami. 2016b. Distillation as a defense to adversarial perturbations against deep neural networks. In Security and Privacy (SP), 2016 IEEE Symposium on, pages 582–597. IEEE.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In Annual Meeting of the Association for Computational Linguistics (ACL), pages 311–318.

Ramakanth Pasunuru and Mohit Bansal. 2017. Multi-task video captioning with video and entailment generation. In Annual Meeting of the Association for Computational Linguistics (ACL), pages 1273–1283.

Scott Reed, Zeynep Akata, Xinchen Yan, Lajanugen Logeswaran, Bernt Schiele, and Honglak Lee. 2016. Generative adversarial text to image synthesis. In International Conference on Machine Learning, pages 1060–1069.

Ravi Shekhar, Sandro Pezzelle, Yauhen Klimovich, Aurelie Herbelot, Moin Nabi, Enver Sangineto, Raffaella Bernardi, et al. 2017. FOIL it! Find one mismatch between image and language caption. In Annual Meeting of the Association for Computational Linguistics (ACL).

Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. 2014. Intriguing properties of neural networks. ICLR; arXiv preprint arXiv:1312.6199.

Kenneth Tran, Xiaodong He, Lei Zhang, and Jian Sun. 2016. Rich image captioning in the wild. In Computer Vision and Pattern Recognition Workshops (CVPRW), 2016 IEEE Conference on, pages 434–441. IEEE.

Subhashini Venugopalan, Huijuan Xu, Jeff Donahue, Marcus Rohrbach, Raymond J. Mooney, and Kate Saenko. 2015. Translating videos to natural language using deep recurrent neural networks. In NAACL-HLT, pages 1494–1504.

Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. 2015. Show and tell: A neural image caption generator. In CVPR, pages 3156–3164.

Qi Wu, Chunhua Shen, Lingqiao Liu, Anthony Dick, and Anton van den Hengel. 2016. What value do explicit high level concepts have in vision to language problems? In CVPR, pages 203–212.

Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron C. Courville, Ruslan Salakhutdinov, Richard S. Zemel, and Yoshua Bengio. 2015. Show, attend and tell: Neural image caption generation with visual attention. In ICML, pages 2048–2057.

Zhilin Yang, Ye Yuan, Yuexin Wu, William W. Cohen, and Ruslan Salakhutdinov. 2016. Review networks for caption generation. In NIPS, pages 2361–2369.

Quanzeng You, Hailin Jin, Zhaowen Wang, Chen Fang, and Jiebo Luo. 2016. Image captioning with semantic attention. In CVPR, pages 4651–4659.

Linchao Zhu, Zhongwen Xu, Yi Yang, and Alexander G. Hauptmann. 2017. Uncovering the temporal context for video question answering. International Journal of Computer Vision, 124(3):409–421.
Supplementary Material

6   Related Work on Adversarial Attacks to CNN-based Image Classifiers

Despite the remarkable progress, CNNs have been shown to be vulnerable to adversarial examples (Szegedy et al., 2014; Goodfellow et al., 2015; Carlini and Wagner, 2017). In image classification, an adversarial example is an image that is visually indistinguishable from the original image but causes a CNN model to misclassify. Depending on the objective, adversarial attacks fall into two categories: untargeted attacks and targeted attacks. In the literature, a successful untargeted attack refers to finding an adversarial example that is close to the original example but yields a different class prediction. For a targeted attack, a target class is specified, and the adversarial example is considered successful when the predicted class matches the target class. Surprisingly, adversarial examples can also be crafted even when the parameters of the target CNN model are unknown to the attacker (Liu et al., 2017c; Chen et al., 2017). In addition, adversarial examples crafted from one image classification model can be made transferable to other models (Liu et al., 2017c; Papernot et al., 2016a), and there exists a universal adversarial perturbation that leads to misclassification of natural images with high probability (Moosavi-Dezfooli et al., 2017).

Without loss of generality, two factors contribute to crafting adversarial examples in image classification: (i) a distortion metric between the original and adversarial examples that regularizes visual similarity, with the L∞, L2 and L1 distortions being popular choices (Kurakin et al., 2017; Carlini and Wagner, 2017; Chen et al., 2018); and (ii) an attack loss function accounting for the success of adversarial examples. For finding adversarial examples in neural image captioning, the distortion metric can be identical, but the attack loss function used in image classification is invalid, since the number of possible captions easily outnumbers the number of image classes and captions with similar meanings should not be treated as different classes. One of our major contributions is the design of novel attack loss functions that handle the CNN+RNN architectures in neural image captioning tasks.

7   More Adversarial Examples with Logits Loss

Figure 4 shows another successful example of the targeted caption method. Figures 5, 6 and 7 show three adversarial examples generated by the proposed 3-keyword method. The adversarial examples generated by our methods have small L2 distortions and are visually indistinguishable from the original images. One advantage of the logits loss is that it helps bypass defensive distillation by overcoming the gradient vanishing problem. To see this, the partial derivative of the softmax function

    p^{(j)} = \exp(z^{(j)}) \Big/ \sum_{i \in V} \exp(z^{(i)})

is given by

    \frac{\partial p^{(j)}}{\partial z^{(j)}} = p^{(j)} \left(1 - p^{(j)}\right),    (10)

which vanishes as p^{(j)} → 0 or p^{(j)} → 1. The defensive distillation method (Papernot et al., 2016b) uses a large distillation temperature during training and removes it at inference time. This makes the inference probability p^{(j)} close to 0 or 1 and thus leads to a vanishing gradient problem. However, by using the proposed logits loss (7), before the word at position t in the target sentence S reaches the top-1 probability we have

    \frac{\partial}{\partial z_t^{(S_t)}} \mathrm{loss}_{S,\mathrm{logits}}(I + \delta) = -1.    (11)

It is evident that the gradient with respect to z_t^{(S_t)} is now a constant: it equals -1 when z_t^{(S_t)} < \max_{k \neq S_t} z_t^{(k)} + \epsilon, and 0 otherwise.
Figure 4: Adversarial example (‖δ‖2 = 2.977) of an elephant image crafted by the Show-and-Fool targeted caption method with the target caption "A black and white photo of a group of people".

Figure 5: Adversarial example (‖δ‖2 = 2.979) of a clock image crafted by the Show-and-Fool targeted keyword method with three keywords: "meat", "white" and "topped".

Figure 6: Adversarial example (‖δ‖2 = 1.188) of a giraffe image crafted by the Show-and-Fool targeted keyword method with three keywords: "soccer", "group" and "playing".

Figure 7: Adversarial example (‖δ‖2 = 1.178) of a bus image crafted by the Show-and-Fool targeted keyword method with three keywords: "tub", "bathroom" and "sink".

8   Targeted Caption Results with Log Probability Loss

In this experiment, we use the log probability loss (5) plus an L2 distortion term (as in (2)) as our objective function. As in the previous experiments, a successful adversarial example is found if the inferred caption after adding the adversarial perturbation δ exactly matches the targeted caption. The overall success rate and average distortion of the adversarial perturbation δ are shown in Table 5. Among all the tested images, our log-prob loss attains a 95.4% success rate, about the same as the logits loss. Moreover, similar to the logits loss, the adversarial examples generated with the log-prob loss also yield small L2 distortions. In Table 6, we summarize the statistics of the failed adversarial examples: their generated captions, though not entirely identical to the targeted caption, are still highly relevant to the target captions.

In our experiments, the log probability loss performs similarly to the logits loss, since our target model is undefended and the gradient vanishing problem of the softmax is not significant. However, when evaluating the robustness of a general image captioning model, we recommend the logits loss, as it does not suffer from potentially vanishing gradients and can reveal the intrinsic robustness of the model.
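As a point of reference, the following minimal sketch (plain NumPy; the array shapes, vocabulary indices and the weighting constant c are hypothetical) shows one way to assemble the objective described above, i.e., the log-prob loss over the target caption plus a squared-L2 distortion term. The precise weighting between the two terms follows Eq. (2) in the main text; placing c on the attack loss is an assumption consistent with the role of c in Tables 6 and 7.

    import numpy as np

    def targeted_caption_logprob_loss(probs, target_ids):
        # probs: (N, |V|) array; probs[t] is the decoder's word distribution at position t.
        # target_ids: length-N sequence of vocabulary indices of the target words S_t.
        return -sum(np.log(probs[t, w]) for t, w in enumerate(target_ids))

    def attack_objective(probs, target_ids, delta, c=10.0):
        # Attack loss weighted by c, plus the squared L2 distortion of the perturbation delta.
        # c = 10.0 is an arbitrary illustrative value, not the setting used in the experiments.
        return c * targeted_caption_logprob_loss(probs, target_ids) + np.sum(delta ** 2)

In the actual attack, this objective is minimized over δ with gradient-based optimization, while the success check compares the re-inferred caption on I + δ against the target caption.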
Figure 8: A highly transferable adversarial example of a biking image (‖δ‖2 = 12.391) crafted from Show-and-Tell using the targeted caption method, which transfers to Show-Attend-and-Tell and yields a similar adversarial caption.

Figure 9: A highly transferable adversarial example of a snowboarding image (‖δ‖2 = 14.320) crafted from Show-and-Tell using the targeted caption method, which transfers to Show-Attend-and-Tell and yields a similar adversarial caption.

Figure 10: A highly transferable adversarial example of a desk image (‖δ‖2 = 12.810) crafted from Show-and-Tell using the targeted caption method, which transfers to Show-Attend-and-Tell and yields a similar adversarial caption.

Table 5: Summary of the targeted caption method and the targeted keyword method using the log-prob loss. The L2 distortion ‖δ‖2 is averaged over successful adversarial examples.

    Experiments        Success Rate   Avg. ‖δ‖2
    targeted caption   95.4%          1.858
    1-keyword          99.2%          1.311
    2-keyword          96.9%          2.023
    3-keyword          95.7%          2.120

9   Targeted Keyword Results with Log Probability Loss

Similar to the logits loss, the log-prob loss does not require a particular position for the target keywords K_j, j ∈ [M]. Instead, it encourages K_j to become the top-1 prediction at its most probable position:

    \mathrm{loss}_{K,\text{log-prob}} = -\sum_{j=1}^{M} \log\Big(\max_{t \in [N]} p_t^{(K_j)}\Big).    (12)

To tackle the "keyword collision" problem, we also employ a gate function g'_{t,j} that prevents a keyword from being placed at a position where the most probable word is already another keyword:

    g'_{t,j}(x) = \begin{cases} 0, & \text{if } \arg\max_{i \in V} p_t^{(i)} \in K \setminus \{K_j\} \\ x, & \text{otherwise.} \end{cases}

The loss function (12) then becomes:

    \mathrm{loss}_{K',\text{log-prob}} = -\sum_{j=1}^{M} \log\Big(\max_{t \in [N]} g'_{t,j}\big(p_t^{(K_j)}\big)\Big).    (13)

In our method, the initial input is the caption S^0 originally inferred from the benign image; after minimizing (13) for T iterations, we run inference on I + δ, set the RNN's input S^1 to its current top-1 prediction, and repeat this procedure until all the targeted keywords are found or the maximum number of iterations is reached. With this iterative optimization process, the probabilities of the desired keywords gradually increase and finally become the top-1 predictions.

The overall success rate and average distortion are shown in Table 5. Table 7 summarizes the number of keywords (M') appearing in the captions of the failed examples when M = 3, i.e., the examples for which not all 3 targeted keywords are found. They account for only 4.3% of all the tested images. Table 7 clearly shows that when c is properly chosen, more than 90% of the failed examples contain at least 1 targeted keyword, and more than 60% contain 2 targeted keywords. This result verifies that even the failed examples are reasonably good attacks.
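A minimal sketch of the gated keyword loss (13) is given below (plain NumPy; the probability array and keyword indices are hypothetical, and the small constant added inside the logarithm is only for numerical safety, not part of Eq. (13)). The iterative re-inference step described above, i.e., re-running the RNN on I + δ and feeding its current top-1 caption back as input, would wrap around repeated minimization of this loss.

    import numpy as np

    def gated_keyword_logprob_loss(probs, keyword_ids):
        # probs: (N, |V|) array of per-position word probabilities (hypothetical input).
        # keyword_ids: vocabulary indices of the M target keywords K_1, ..., K_M.
        top1 = probs.argmax(axis=1)              # most probable word at each position t
        loss = 0.0
        for j, k in enumerate(keyword_ids):
            others = [kk for jj, kk in enumerate(keyword_ids) if jj != j]
            # Gate g'_{t,j}: zero out positions whose current top-1 word is another keyword.
            gated = np.where(np.isin(top1, others), 0.0, probs[:, k])
            # Encourage K_j to become top-1 at its most probable admissible position.
            loss -= np.log(gated.max() + 1e-12)  # 1e-12 only for numerical stability
        return loss

The gate is what resolves keyword collisions: a position already claimed by another keyword contributes nothing to the maximum, so K_j is pushed toward a different position.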
Table 6: Statistics of the 4.6% failed adversarial examples using the targeted caption method and log-prob loss (5). All correlation scores are computed using the top-5 inferred captions of an adversarial image and the targeted caption (a higher score indicates better targeted attack performance).

    c               1       10      10²     10³     10⁴
    L2 Distortion   1.503   2.637   5.085   11.15   19.69
    BLEU-1          .650    .792    .775    .802    .800
    BLEU-2          .521    .690    .671    .711    .701
    BLEU-3          .416    .595    .564    .622    .611
    BLEU-4          .354    .515    .485    .542    .531
    ROUGE           .616    .764    .746    .776    .772
    METEOR          .362    .493    .469    .511    .498

Table 7: Percentage of partial success using the log-prob loss with different c, over the 4.3% failed images that do not contain all 3 targeted keywords.

    c      Avg. ‖δ‖2   M' ≥ 1   M' = 2   Avg. M'
    1      2.22        69.7%    27.3%    0.97
    10     5.03        87.9%    57.6%    1.45
    10²    10.98       93.9%    63.6%    1.58
    10³    18.52       93.9%    57.6%    1.52
    10⁴    26.04       90.9%    60.6%    1.52

Table 8: Transferability of adversarial examples from Show-and-Tell to Show-Attend-and-Tell, using different c. Unlike Table 4, the adversarial examples in this table are found using the log-prob loss and there is no parameter ε. Similarly, a smaller ori or a larger tgt value indicates better transferability.

                  C=10           C=100          C=1000
               ori     tgt    ori     tgt    ori     tgt    mis
    BLEU-1     .540    .391   .442    .435   .374    .500   .657
    BLEU-2     .415    .224   .297    .280   .217    .357   .529
    BLEU-3     .335    .143   .218    .193   .137    .268   .430
    BLEU-4     .280    .101   .170    .142   .095    .207   .357
    ROUGE      .525    .364   .430    .411   .362    .474   .609
    METEOR     .240    .132   .179    .162   .135    .209   .303
    ‖δ‖2           2.433          4.612           10.88

10   Transferability of Adversarial Examples with Log Probability Loss

Similar to the experiments in Section 4.4, to assess the transferability of adversarial examples, we first use the targeted caption method with the log-prob loss to find adversarial examples for 1,000 images in the Show-and-Tell model (model A) with different c. We then transfer the successful adversarial examples, i.e., the examples that generate the exact target captions on model A, to the Show-Attend-and-Tell model (model B). The captions generated by model B are recorded for transferability analysis. The results for transferability using the log-prob loss are summarized in Table 8. The definitions of tgt, ori and mis are the same as those in Table 4. Compared with Table 4 (C = 1000, ε = 10), the log probability loss shows inferior ori and tgt values, indicating that the additional parameter ε in the logits loss helps improve transferability.

11   Attention on Original and Transferred Adversarial Images

Figures 11, 12 and 13 show the original and adversarial images' attentions over time. On the original images, the Show-Attend-and-Tell model's attentions align well with human perception. However, the transferred adversarial images obtained on the Show-and-Tell model yield significantly misaligned attentions.

Figure 11: Original and transferred adversarial image's attention over time on Figure 8. The highlighted area shows the attention change as the model generates each word. (a) Original image of Figure 8, generated caption: "a woman sitting on a bicycle with a dog ." (b) Adversarial image of Figure 8, generated caption: "a white and white slice of pizza on a table ."
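To make the ori/tgt comparison of Section 10 concrete, the sketch below scores one hypothetical transferred example with a crude unigram-overlap proxy; Table 8 itself reports the standard BLEU, ROUGE and METEOR metrics, which this proxy does not reproduce, and all three captions shown here are invented for illustration.

    def unigram_precision(candidate, reference):
        # Fraction of candidate words that also appear in the reference caption.
        cand, ref = candidate.lower().split(), set(reference.lower().split())
        return sum(w in ref for w in cand) / max(len(cand), 1)

    caption_B_adv  = "a group of people playing soccer on a field"   # model B on I + delta
    caption_B_orig = "a man riding a bike down a street"             # model B on the benign image
    target_caption = "a group of people playing a game of soccer"    # attack target set on model A

    ori = unigram_precision(caption_B_adv, caption_B_orig)  # smaller ori: caption moved away from the original
    tgt = unigram_precision(caption_B_adv, target_caption)  # larger tgt: caption moved toward the target
    print(ori, tgt)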
                                                           model generates each word.