Catching Out-of-Context Misinformation with Self-supervised Learning

 

                                                              Shivangi Aneja1              Christoph Bregler2                   Matthias Nießner1
                                                                       1                                                    2
                                                                           Technical University of Munich                       Google AI

                                         Figure 1: Our method takes as input an image and two captions from different sources, and we predict whether the image has
                                         been used out of context. Here, both captions align with the same object(s) in the image; i.e., Donald Trump and Angela Merkel
                                         on the left and President Obama and Dr. Fauci on the right. If the two captions are semantically different from each other
                                         (right), we consider them as out of context; otherwise not (left). Our method automatically detects such image use.

                                                                 Abstract                                    1. Introduction

                                            Despite the recent attention to DeepFakes and other                 In recent years, the computer vision community as well
                                         forms of image manipulations, one of the most prevalent             as the general public has focused on new misuses of media ma-
                                         ways to mislead audiences is the use of unaltered images            nipulations including so-called DeepFakes [13, 14, 30, 33]
                                         in a new but false context. To address these challenges             and how they aid the spread of misinformation in news and
                                         and support fact-checkers, we propose a new method that             social media platforms. At the same time, researchers have
                                         automatically detects out-of-context image and text pairs.          developed impressive media forensic methods to automat-
                                         Our core idea is a self-supervised training strategy where          ically detect these manipulations [37, 28, 23, 48, 52, 11,
                                         we only need images with matching (and non-matching)                1, 2, 22, 43, 3, 10]. However, despite the importance of
                                         captions from different sources. At train time, our method          DeepFakes and other visual manipulation methods, one of
                                         learns to selectively align individual objects in an image          the most prevalent ways to mislead audiences is the use of
                                         with textual claims, without explicit supervision. At test          unaltered images in a new but false context [15].
                                         time, we check for a given text pair if both texts correspond          The danger of out-of-context images is that little techni-
                                         to the same object(s) in the image but semantically convey dif-         cal expertise is required, as one can simply take an image
                                         ferent descriptions, which allows us to make fairly accu-           from a different event and create highly convincing but po-
                                         rate out-of-context predictions. Our method achieves 82%            tentially misleading messages. At the same time, it is ex-
                                         out-of-context detection accuracy. To facilitate training our       tremely challenging to detect misinformation based on out-
                                         method, we created a large-scale dataset of 200K images             of-context images given that the visual content by itself is
                                         which we match with 450K textual captions from a variety            not manipulated; only the image-text combination creates
                                         of news websites, blogs, and social media posts; i.e., for          misleading or false information. In order to detect these
                                         each image, we obtained several captions.                           out-of-context images, several online fact-checking initia-
                                                                                                             tives have been launched by news rooms and independent

Figure 2: Out-of-context images: examples from our dataset, circulated on social media and online news platforms. The top
row (red) shows the false captions and bottom row (green) shows the true captions along with year published.

organizations, most of them currently being part of the In-            with textual descriptions to determine whether the image
ternational Factchecking Network [34]. However, they all               appears to be out-of-context with respect to textual claims.
heavily rely on manual human efforts to verify each post               Our method then learns to selectively align individual ob-
factually, and to determine if a fact-checking claim should            jects in an image with textual claims, without explicit out-
be labelled as an “out-of-context image”. Thus, automated              of-context supervision. At test time, we are able to correlate
techniques can aid and speed up verification of potentially            these alignment predictions between the two captions for
false claims for fact checkers.                                        the input image. If both texts correspond to same object(s)
   Seminal works along these lines focus on predicting the             but their meaning is semantically different, we infer that the
truthfulness of a claim based on certain evidence like sub-            image is used out-of-context.
ject, context, social network spread, prior history, etc. [45,             While this training strategy does not rely on manual
40]. However, these methods are limited only to linguistics            out-of-context annotations (except for benchmarking), we
domain with a focus on textual metadata in order to pre-               still require images with corresponding matching (and non-
dict the factuality of the claim. More recently, few meth-             matching) captions. To this end, we created a large-scale
ods [50, 53] have been proposed to detect the truth value of           dataset of over 200K images which we matched with 450K
claims about images by using image metadata, user com-                 textual captions from a variety of news websites, blogs, and
ments, URL domains, reliability of the source, etc. as addi-           social media posts. For each image, we obtained several
tional supervision. However, their performance is still lim-           captions, which we provided with the sources where we
ited, partly due to non-availability of large-scale datasets for       crawled them from. We further manually annotated a subset
the task. Additionally, we argue that the task of verifying            of 2,230 triplet pairs (an image and 2 captions) for bench-
the veracity of a claim is a difficult problem to solve (even          marking purposes only. In the end, our methods signifi-
for humans) and requires prior knowledge, background, and              cantly improves over alternatives, reaching over 82% de-
domain expertise.                                                      tection accuracy.
   To automatically address the out-of-context problem, we                In summary, our contributions are as follows:
propose a new data-driven method that takes as input an                   • To the best of our knowledge, this paper proposes the
image and two text captions. As output, we predict whether                  first automated method to detect out-of-context use of
the two captions referenced to the image are conflicting-                   images.
image-captions, and hence indicate an out-of-context case.
The core idea of our method is a self-supervised training                 • We introduce a self-supervised training strategy for
strategy where we only need images with matching and                        accurate out-of-context prediction while only using
non-matching captions from different sources; however, we                   matched vs. non-matched image captions.
do not need to rely on specific out-of-context annotations
which would be potentially difficult to annotate in large                 • We created a large dataset of 200K images which we
numbers. Using only matches vs non-matches as loss func-                    matched with 450K textual captions from a variety of
tion, we are able to learn co-occurrence patterns of images                 news websites, blogs, and social media posts.

Figure 3: High-level overview of our dataset: (left) category-wise frequency distribution of the images; (right) word cloud
representation of captions and claims from the dataset.

2. Related Work                                                     ing number of false claims about images, a few meth-
                                                                    ods [19, 50, 39, 46, 53, 20] have been proposed recently. For
2.1. Fake News & Rumor Detection                                    instance, Khattar et al. [20] learn a variational autoencoder
    Fake news and rumor detection methods have a long his-          based on shared embedding space (textual and visual) with
tory [35, 21, 24, 25, 51, 36, 26, 27] and with the advent of        binary classifier to detect fake news. Jin et al. [19] use at-
deep learning, these techniques have accelerated with suc-          tention based RNNs to fuse multiple modalities to detect ru-
cess. Most fake news and rumor detection methods focus              mors/fake claims. Since these techniques are supervised in
on posts shared on microblogging platforms like Twitter.            nature, they require huge amounts of labelled data to work,
Kwon et al. [21] analyzed structural, temporal, and linguis-        which is difficult to obtain especially for false claims. We,
tic aspects of the user tweets and modelled them using SVM          however, propose a self-supervised method to achieve the
to detect the spread of rumors. Liu et al. [24] proposed an         goal.
algorithm to debunk rumors in real-time. Ma et al. [27]
examined propagation patterns in tweets and applied tree-           3. Out-of-Context Detection Dataset
structured recursive neural networks for rumor representa-
tion learning and classification. Wani et al. [47] evaluated           Our dataset is based on images from news articles and
a variety of text classification methods on Covid-19 Fake           social-media posts. We gather images from a wide-variety
news detection dataset [31].                                        of articles (Fig. 3), with special focus on topics where mis-
                                                                    information spread is prominent.
2.2. Automated Fact-Checking
                                                                    3.1. Dataset Collection
    In recent years, several automated fact-checking tech-
niques [45, 40, 16, 6, 29, 5, 41] have been developed to               We gathered our dataset from two primary sources,
reduce the manual fact-checking overhead. For instance,             news websites1 and fact-checking websites. We collect our
Wang et al. [45] created a dataset of short statements from         dataset in two steps: (1) First, using publicly available
several political speeches and designed a technique to detect       news channel API’s, such as from the New York Times [4],
fake claims by analyzing linguistic patterns in the speeches.       we scraped images along with the corresponding captions.
Vasileva et al. [41] proposed a technique to estimate check-        (2) We then reverse-searched these images using Google’s
worthiness of claims from political debates. Atanasova et           Cloud Vision API to find different contexts across the web
al. [6] propose a multi-task learning technique to classify         in which image is shared. This collection pipeline is ex-
veracity of claim and generate fact-checked explanations at         plained in Fig. 4. Thus, we obtain several captions per
the same time based on ruling comments.                             image with varying context that we use later to train our
                                                                    models; cf. 2. Note that we do not consider digitally-
2.3. Verifying Claims about Images                                  altered/fake images, our focus here is to detect misuse of
   Both fake news detection and automated fact-checking             real photographs. Finally, as we currently aim to detect
techniques are extremely important to combat the spread             conflicting-image-captions in the English language only, we
of misinformation and there are ample methods available             ignore captions from other languages.
to tackle this challenge. However, these methods tar-                  1 New York Times, CNN, Reuters, ABC, PBS, NBCLA, AP News,
get only textual claims and therefore cannot be directly            Sky News, Telegraph, Time, DenverPost, Washington Post, CBC News,
applied to claims about images. To detect the increas-              Guardian, Herald Sun, Independent, CS Gazette,

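To make the collection step concrete, the following is a minimal sketch of the reverse-search-and-scrape loop described above. It is an illustration under our assumptions rather than the released pipeline: the scrape_caption_for helper is hypothetical, and we assume the google-cloud-vision and langdetect packages are available.

```python
# Illustrative sketch of the two-step collection in Sec. 3.1 (not the released code).
from google.cloud import vision
from langdetect import detect


def reverse_search_captions(image_path, scrape_caption_for):
    """Find pages that reuse an image and collect their (English) captions.

    scrape_caption_for(page_url) is a hypothetical HTML scraper that returns
    the caption text closest to the matched image on that page.
    """
    client = vision.ImageAnnotatorClient()
    with open(image_path, "rb") as f:
        image = vision.Image(content=f.read())

    # Web detection returns pages that embed (near-)identical copies of the image.
    web = client.web_detection(image=image).web_detection

    captions = []
    for page in web.pages_with_matching_images:
        caption = scrape_caption_for(page.url)  # hypothetical helper
        if caption and detect(caption) == "en":  # keep English captions only
            captions.append({"url": page.url, "caption": caption})
    return captions
```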
4. Proposed Method
                                                                        We consider a setting where for an image we have a set
                                                                     of associated captions; however, neither mapping for these
                                                                     captions with the objects in the image nor labels are pro-
                                                                     vided that classify which caption is out-of-context. We no-
                                                                     tice that in typical use cases of out-of-context images, dif-
                                                                     ferent captions often describe the same object(s) but with a
                                                                     different meaning. For example, Fig. 2 shows several fact-
Figure 4: Data collection: given a source image (left), we           checked examples where the two captions mean something
find other websites that use the same image with different           very different but they describe the same parts of the image.
captions (using Google Vision API’s reverse search). We              Our goal is to take advantage of these patterns to detect sce-
then scraped the caption from these websites by using text           narios and identify images used out-of-context.
from the caption tag if present, otherwise the alt at-                                 4.1. Image-Text Matching Model (Training)
tribute corresponding to the image. We collect 2-4 captions          4.1. Image-Text Matching Model (Training)
per image.                                                               The main idea of our method is a self-supervised training
                                                                     strategy leveraging co-occurrences of an image and its ob-
                                                                     jects with several associated captions; i.e., we propose train-
3.2. Data Sources & Statistics                                       ing an image and text based model based only on matching
   We obtained our images primarily from the News chan-              and non-matching captions. To this end, we formulate a
nels and fact-checking website (Snopes). We scraped im-              scoring function to align objects in the image with the cap-
ages on a wide variety of topics ranging from politics, cli-         tion. Intuitively, an image-caption pair should have a high
mate change, environment, etc (see Fig. 3). For images               matching score if visual correspondences for the caption are
scraped from New York Times, we used publicly available              present in the image, and a low score if the caption is unre-
Article Search developer API [4], and for other new sources,         lated to the image. In order to infer this correlation, we first
we wrote our custom scrapers. For images gathered from               utilize a pre-trained Mask-RCNN [17] to detect bounding
news channels, we scraped corresponding image captions               boxes (objects) in the image.
as textual claim, and for Snopes, we scraped text written in             For each detected bounding box, we then feed these ob-
the header, under the Fact Checks section of the             ject regions to the Object Encoder, for which we use a
website. In total, we obtain 200K unique training images             ResNet-50 backbone from a pre-trained Mask-RCNN fol-
and 2,230 validation images; see Tab. 1.                             lowed by RoIAlign, with our average pooling and two fully-
                                                                     connected layers applied on the top. As a result, for each
                                                                     object, we obtain a 300-dimensional embedding vector.
      Split    Primary Source     No. of Images    Context Annotation
      Train    News Outlets¹      200K             ✗
      Val      NYT, Snopes        2,230            ✓

      Table 1: Statistics of our out-of-context dataset.

   In parallel, we feed a given caption (either matching Cmatch or non-matching Crand) into a pre-trained sentence embedding model. Specifically, we utilize the Universal Sentence Encoder (USE) [8], which is based on a state-of-the-art transformer [42] architecture and outputs a 512-dim vector. We then process this vector with our Text Encoder
                                                                     (ReLU followed by one FC layer), which outputs a 300-
    Training Set: For training, we used images scraped               dimensional embedding vector for each caption (to match
from the news websites. We consider several news sources1            the dimension of the object embeddings).
to first gather the images and then scrape the captions corre-           We then compare the visual and language embeddings
sponding to these images from different websites. In total,          using a dot product between the i-th bounding box embed-
we gathered around 200K images with 450K captions.                   ding bi and the caption embedding c as a measure of sim-
    Validation Set: For validation, we used the images from          ilarity between image region i and caption C. The final
the fact-checking website Snopes and the news website New            image-caption score SIC is obtained through a max func-
York Times. We collected 2,230 images with two captions              tion:
per image. We build an in-house annotation tool to manu-
ally annotate these pairs with out-of-context labels. On average, it takes around 55 seconds to annotate every pair, and

$$S_{IC} = \max_{i=1}^{N}\left(b_i^{\top} c\right), \qquad N = \#\text{bboxes}. \tag{1}$$
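For clarity, a minimal PyTorch sketch of the scoring function in Eq. 1; the tensor shapes follow the 300-dim object and caption embeddings described above, and the function name is ours:

```python
import torch


def image_caption_score(box_embeddings: torch.Tensor, caption_embedding: torch.Tensor) -> torch.Tensor:
    """Eq. 1: S_IC = max_i (b_i^T c) over the N detected bounding boxes.

    box_embeddings:    (N, 300) object embeddings b_i
    caption_embedding: (300,)   caption embedding c
    """
    scores = box_embeddings @ caption_embedding  # (N,) dot products b_i^T c
    return scores.max()                          # max over boxes -> S_IC
```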

we spent 35 hours in total to annotate the entire validation            Our objective is to obtain higher scores for aligned
set. We ensured an equal distribution of both out-of-context         image-text pairs (i.e., if an image appeared with the text ir-
and not-out-of-context images in the val split.                      respective of the context) than misaligned image-text pairs

Figure 5: Train time: first, a Mask-RCNN [17] backbone detects up to 10 bounding boxes in the image whose regions are
embedded through our Object Encoder (pre-trained ResNet-50 backbone followed by RoIAlign, with average pooling and
two FC layers applied on top) providing a fixed-size embedding (300-dim) for each object. In parallel, two captions – one
that appeared with the image in a post Cmatch (matching caption) and another caption sampled randomly from dataset Crand
(non-matching caption) – are encoded using the Universal Sentence Encoder model (USE) [8]. Those sentence embeddings
are then passed to a shared Text Encoder (ReLU followed by one FC layer) that embeds them in the same multi-modal space.
Similarities between object-caption pairs are computed with inner products (grayscale shows the score magnitude) and finally
reduced to scores following Eq. 1.
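A minimal sketch of the max-margin objective used to train this setup (Eq. 2 in the text), shown here for a single image; the margin value and function name are our placeholders, not the authors' released code:

```python
import torch


def max_margin_loss(s_match: torch.Tensor, s_rand: torch.Tensor, margin: float = 1.0) -> torch.Tensor:
    """Hinge between the matched and random image-caption scores (Eq. 2),
    each computed with image_caption_score above; averaged over a batch in practice."""
    return torch.clamp(s_rand - s_match + margin, min=0.0)
```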

(i.e., some randomly-chosen text which did not appear with the image). We train the model with the max-margin loss (Eq. 2) on the image-caption scores obtained above (Eq. 1). Note that we keep the weights of the Mask-RCNN [17] backbone and the USE [8] model frozen, using these models only for feature extraction.

$$L = \frac{1}{N}\sum_{i}^{N} \max\!\left(0,\; \left(S_{IC}^{r} - S_{IC}^{m}\right) + \mathrm{margin}\right), \tag{2}$$

where S_IC^r denotes the image-caption score of the random caption and S_IC^m denotes the image-caption score for the matching caption. We refer to this model as the Image-Text Matching Model; the training setup is visualized in Fig. 5. Note that during training, we do not aim to detect out-of-context images; we only focus on learning accurate image-caption alignments.

4.2. Out-of-Context Detection Model (Test Time)

    The resulting Image-Text Matching model obtained from training, as described above, provides an accurate representation of how likely a caption aligns with an image. In addition, as we explicitly model the object-caption relationship, the max operator in Eq. 1 implicitly gives us a strong signal as to which object(s) were selected to make that decision, thus providing spatial knowledge. At test time, we duplicate this pipeline: for a given image, we run it for two captions, which may or may not be semantically similar. The Image-Caption1-Caption2 (I, C1, C2) triplet is given as input to the final part of our detection model, which predicts whether the image is used out of context with respect to the captions. Based on the evidence that out-of-context pairs correspond to the same object(s) in the image (see Fig. 2), we propose a simple handcrafted rule to detect such images: if the two captions align with the same object(s) in the image but semantically convey different meanings, then the image with its two captions is classified as having conflicting-image-captions. More specifically, we make use of the pre-trained model as follows:

(1) Using the Image-Text Matching model, we first compute the visual correspondences of the objects in the image for both captions. For every image-caption pair {IC_j}, we select the top k bounding boxes B_IC_j(k) by ranking them on the basis of the scores S_IC_j obtained using Eq. 1; a high score indicates strong alignment of the caption with the object:

$$B_{IC_j}(k) = \left(\mathrm{desc}\!\left(S_{IC_j}\right)_{i=1}^{N}\right)[:k] \tag{3}$$

(2) We leverage a state-of-the-art SBERT [44] model that is trained on a Sentence Textual Similarity (STS) task. The SBERT model takes two captions C1, C2 as input and outputs a similarity score S_sim in the range [0, 1] indicating the semantic similarity between the two captions (a higher score indicates the same context):

$$S_{sim} = \mathrm{STS}(C_1, C_2) \tag{4}$$
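As an illustration of step (2), the caption-similarity score can be obtained from any STS-trained sentence embedding model; the sketch below uses the sentence-transformers package as a stand-in for the SBERT model of [44], so the model name and cosine-based score are our assumptions:

```python
from sentence_transformers import SentenceTransformer, util

# A publicly available STS-trained model, used here as a stand-in for SBERT [44].
_sts_model = SentenceTransformer("all-MiniLM-L6-v2")


def caption_similarity(c1: str, c2: str) -> float:
    """Approximate S_sim = STS(C1, C2) from Eq. 4 (higher = same context)."""
    e1, e2 = _sts_model.encode([c1, c2], convert_to_tensor=True)
    # Cosine similarity; for related captions this typically falls in [0, 1].
    return float(util.cos_sim(e1, e2))
```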

Figure 6: Test time: we take as input an image and two captions; we then use the trained Image-Text Matching model where
we first pick the top k bounding boxes (based on scores given by Eq. 1) and check the overlap for each of the k boxes for
caption1 with each of the k boxes from caption2. If IoU for any such pair > threshold t, we infer that image regions overlap.
Similarly, we compute the textual overlap Ssim using the pre-trained sentence similarity model SBERT [44]; if Ssim < threshold t, the two captions are semantically different, implying that the image is used out of context.
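Putting both checks together, the decision rule of Eq. 5 can be sketched as below, with t = 0.5 as in our experiments; the (x1, y1, x2, y2) box format and function names are assumptions of this illustration:

```python
def iou(box_a, box_b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-8)


def is_out_of_context(top_boxes_c1, top_boxes_c2, s_sim, t=0.5):
    """Eq. 5: out-of-context if the two captions ground to overlapping image
    regions (any cross-pair IoU > t) but are semantically different (S_sim < t)."""
    regions_overlap = any(iou(a, b) > t for a in top_boxes_c1 for b in top_boxes_c2)
    return regions_overlap and s_sim < t
```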

   As a result, SBERT provides the semantic similarity be-            fed as input to the model but combined with text using self-
tween two captions, Ssim . In order to compute the visual             attention module. (3) Bbox, where only the detected objects
mapping of the two captions with the image, we use the                are fed as input to model instead of full image. For this
IoU overlap of the top k boxes for the two captions. We use           experiment, we use the ILSVRC 2012-pre-trained ResNet-
a threshold t = 0.5 for all our experiments, both for IoU             18 [18] and MS-COCO pre-trained Mask-RCNN [17] back-
overlap and text overlap. Finally, if the visual overlap be-          bone to encode images and one-layer LSTM model to en-
tween image regions for the two captions is over a certain            code text. Words are embedded using Glove [32] pre-
threshold IoU (BIC1 (k), BIC2 (k)) > t and the captions are           trained embeddings. Tab. 2 shows that using object-level
semantically different (Ssim < t), we classify them as out-           features (given by bounding boxes) gives the best perform-
of-context (OOC). A detailed explanation is given in Fig. 6           ing model. This is not surprising given that object regions
and is as follows:                                                    provide a richer feature representation for the entities in the
caption compared to full image; but we also significantly outperform a self-attention alternative.

$$\mathrm{OOC} =
\begin{cases}
\text{True}, & \text{if } \mathrm{IoU}\left(B_{IC_1}(k),\, B_{IC_2}(k)\right) > t \ \text{ and } \ S_{sim}(C_1, C_2) < t\\
\text{False}, & \text{otherwise}
\end{cases} \tag{5}$$

      Img Features.      Object IoU    Match Acc.
      Bbox (GT)          0.36          0.89
      Full-Image         0.11          0.63
      Self-Attention     0.16          0.78
      Bbox (Pred)        0.27          0.88

Table 2: Ablation of different settings for visual grounding in our self-supervised training setting (no loss on IoU).

5. Results

5.1. Visual Grounding of Objects

Quantitative Results: Our model is trained in a self-supervised fashion only with matching and non-matching captions. To quantitatively evaluate how well our model
learns the visual grounding of objects, we utilize the Ref-
COCO dataset [49] which has ground truth associations of
the captions with the object bounding boxes. Note, how-               Qualitative Results: We visualize grounding scores in
ever, that this evaluation is not our final task, but gives im-       Fig. 7 obtained by applying our image-text matching model
portant insights into our model design. We experiment with            to the image-caption pairs from the validation split. The
three different model settings: (1) Full-Image, where an im-          results indicate that our self-supervised matching strategy
age is fed as input to the model and directly combined with           learns sufficiently good alignment between objects and cap-
the text embedding. (2) Self-Attention, where an image is             tions to perform out-of-context image detection.

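For reference, a sketch of how the Object IoU column in Tab. 2 can be computed on RefCOCO-style ground truth (select the highest-scoring box for the caption and compare it to the annotated box); the variable names are ours and the iou helper from the earlier test-time sketch is reused:

```python
def grounding_iou(box_embeddings, caption_embedding, pred_boxes, gt_box):
    """Pick the box with the highest object-caption score (Eq. 1) and
    measure its IoU against the ground-truth box of the referred object."""
    scores = box_embeddings @ caption_embedding  # (N,) scores b_i^T c
    best = int(scores.argmax())                  # index of the selected object
    return iou(pred_boxes[best], gt_box)         # iou() as defined in the test-time sketch
```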
Figure 7: Qualitative results of visual grounding of captions with the objects in the image. We show object-caption scores
for two conflicting captions per image; only scores for the top 3 bounding boxes per caption are shown to avoid clutter. For
every image, the caption on the top (green border) is the true caption and the caption at the bottom is the false caption (red
border). Positive scores (green) indicate high association of object with the caption and negative scores (red) indicates low
visual association.

5.2. Out-of-Context Evaluation                                         Text Embed     Match Acc.        Context Acc.
                                                                                                      k=1 k=2 k=3
Which is the best Text Embedding? To evaluate the
effect of different text embeddings, we experiment with:               Glove [32]         0.81       0.61    0.77    0.80
(1) Pre-trained word embeddings including Glove [32] and               FastText [7]       0.82       0.63    0.78    0.81
FastText [7] embedded via one-layer LSTM model and (2)                  USE [8]           0.81       0.73    0.81    0.82
the Transformer based Sentence embeddings proposed by
USE [8]. The results in Tab. 3 show that even though the          Table 3: Ablations with different pre-trained text embed-
match accuracy for all the methods is roughly the same            dings and different values of k, where k indicates how many
(82%), using sentence embeddings boosts our final out-            top scoring bounding boxes we are using per image.
of-context image detection accuracy of the model by 10%
(from 63% to 73%) by using only top 1 bounding box.
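For completeness, the USE sentence embeddings compared above can be loaded from TensorFlow Hub; a minimal sketch, assuming the publicly released universal-sentence-encoder module (the projection to 300 dimensions is done by our Text Encoder and is omitted here):

```python
import tensorflow_hub as hub

# Publicly released Universal Sentence Encoder [8]; outputs 512-dim embeddings.
use = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")


def encode_captions(captions):
    """Embed a list of caption strings into a (len(captions), 512) array."""
    return use(captions).numpy()
```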
How much Training Data is needed? Next, we analyze                age of training data size (w.r.t. our full data of 200K); see
the effect of the available training data corpus for self-        Tab. 4. For these experiments, we use pretrained USE [8]
supervised training. We experiment with different percent-        embedding with one-layer FC model to encode text. We ob-

served that a larger training set improves the performance            proposed specifically for Rumor/Fake News Classification;
of the model significantly. For instance, training with full-         however, EmbraceNet [9] is a generic multi-modal classi-
dataset (200k images) improves the out-of-context detec-              fication method. Since neither of these methods perform
tion accuracy by an absolute 6% (from 67% to 73%) com-                self-supervised out-of-context image detection using ob-
pared to a model trained on only 10% of the dataset (20K im-             ject features (using bounding boxes), a direct comparison
ages), thus benefiting from the large diversity in the dataset.       is not feasible, and we need to adapt these methods for
We also notice that a higher match accuracy leads to bet-             our task. Following our training setup (Sec. 4.1), we first
ter out-of-context detection accuracy. This suggests that             train these models for the binary task of image-text match-
our proposed self-supervised learning strategy (trained with          ing with the network architecture and losses proposed in
match vs no-match loss) effectively helps to improve out-             their original papers. During test time, we then use Grad-
of-context detection accuracy.                                        CAM [38] to construct bounding boxes around activated
                                                                      image regions and perform out-of-context detection as de-
  Dataset          Images      Match      Context Acc.                scribed in Sec. 4.2. The results in Tab. 6 show that our
  Size                         Acc.     k=1 k=2 k=3                   model outperforms previous fake news detection methods
                                                                      for out-of-context detection by a large margin of 13% (from
  Ours (10%)        20K        0.78     0.67     0.79    0.81         60% to 73%) only by using k = 1 bounding box. Over-
  Ours (20%)        40K        0.80     0.69     0.80    0.82         all, we achieve up to 82% out-of-context image detection
  Ours (50%)        100K       0.81     0.70     0.80    0.82         accuracy.
  Ours (100%)       200K       0.81     0.73     0.81    0.82

Table 4: Ablation with variations in number of training im-                  Method         Match Acc.       Context Acc.
ages w.r.t. our full corpus of 200K images. Using all avail-                                               k=1 k=2 k=3
able data achieves the best performance.                                   EANN [46]            0.60       0.56   0.58    0.58
                                                                         EmbraceNet [9]         0.68       0.57   0.63    0.70
   In addition, we examine whether adding images from                     Jin et al. [19]       0.75       0.60   0.67    0.72
RefCOCO [49] can further improve performance. For                              Ours             0.81       0.73   0.81    0.82
these experiments, we mix training images from our dataset
(200K images) and RefCoco [49] (50K images), resulting
in total 250K images. We run this analysis for all the text           Table 6: To compare against state of the art, we adapt three
embeddings; Tab. 5 shows that our model slightly improves             baseline methods with our train/test strategy. In the end,
(2-3%) on our out-of-context detection task.                          our method outperforms all other methods, resulting in 82%
                                                                      out-of-context detection accuracy.
 Image              Text        Match       Context Acc.
 Source                         Acc.      k=1 k=2 k=3
                                                                      6. Conclusions
 Ours           Glove [32]       0.81     0.61    0.77    0.80
 Combined                        0.82     0.64    0.78    0.81            We have introduced an automated method to detect out-
 Ours           FastText [7]     0.82     0.63    0.78    0.81        of-context images with respect to textual descriptions. Our
 Combined                        0.83     0.66    0.78    0.81        method takes as input an image and two captions, and
                                                                      we are able to predict out-of-context use, reaching up to
 Ours            USE [8]         0.81     0.73    0.81    0.82        82% detection accuracy. The key idea of our approach
 Combined                        0.82     0.76    0.82    0.82        is a self-supervised training strategy, which allows learn-
                                                                      ing strong localization features based only on matched vs
Table 5: Ablation with different text embeddings evaluating           non-matched captions, but without the need of explicit out-
the use of only our dataset (Ours) for training against feed-         of-context annotations. In order to provide the necessary
ing in another 50k images from RefCoco [49] (Combined),               training data, we further created a new dataset composed
which slightly improves performance.                                  of over 200K images for which we matched 450K cap-
                                                                      tions. Overall, we believe that our method is an important
Comparison with alternative Baselines: Finally, we com-               stepping stone towards addressing misinformation in online
pare our best performing model with other baselines, in               news and social media platforms, thus supporting and scal-
particular, methods that work on rumor detection. Most                ing up fact-checking work. In particular, we hope that our
other fake news detection methods are supervised, where               new dataset, which we will publish along with this work,
the model takes an image and a caption as input and pre-              will lay a foundation to continue research along these lines
dicts the class label. EANN [46] and Jin et al. [19] were             to help online journalism and improve social media.

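To illustrate how the baselines are adapted, an activation map from Grad-CAM [38] can be turned into a bounding box by simple thresholding; a minimal numpy sketch, where the relative threshold of 0.5 and the box format are our choices for illustration:

```python
import numpy as np


def activation_to_bbox(cam: np.ndarray, rel_threshold: float = 0.5):
    """Convert a (H, W) activation map (e.g., from Grad-CAM [38]) into a single
    (x1, y1, x2, y2) box around the strongly activated region."""
    mask = cam >= rel_threshold * cam.max()
    ys, xs = np.where(mask)
    if len(xs) == 0:
        return None
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())
```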
Acknowledgments                                                          [14] Faceswap gan, 2018.
                                                                         [15] Lisa Fazio. Out-of-context photos are a powerful low-tech
   This work is supported by a TUM-IAS Rudolf Mößbauer                       form of misinformation, 2020.
Fellowship and the ERC Starting Grant Scan2CAD
                                                                         [16] Maram Hasanain, Reem Suwaileh, T. Elsayed, A. Barrón-
(804724). We would also like to thank New York Times                          Cedeño, and Preslav Nakov. Overview of the clef-2019
Team for providing API support for dataset collection,                        checkthat! lab: Automatic identification and verification of
Google Cloud for GCP credits, Farhan Abid for dataset an-                     claims. task 2: Evidence and factuality. In CLEF, 2019.
notation and Angela Dai for video voiceover.                             [17] K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask r-cnn.
                                                                              In 2017 IEEE International Conference on Computer Vision
References                                                                    (ICCV), pages 2980–2988, 2017.
                                                                         [18] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning
 [1] Darius Afchar, Vincent Nozick, Junichi Yamagishi, and Isao
                                                                              for image recognition. In 2016 IEEE Conference on Com-
     Echizen. Mesonet: a compact facial video forgery detection
                                                                              puter Vision and Pattern Recognition (CVPR), pages 770–
     network, Dec 2018.
                                                                              778, 2016.
 [2] Shruti Agarwal, Hany Farid, Yuming Gu, Mingming He,
                                                                         [19] Zhiwei Jin, Juan Cao, Han Guo, Yongdong Zhang, and Jiebo
     Koki Nagano, and Hao Li. Protecting World Leaders Against
                                                                              Luo. Multimodal fusion with recurrent neural networks for
     Deep Fakes. In Proceedings of the IEEE Conference on
                                                                              rumor detection on microblogs. In Proceedings of the 25th
     Computer Vision and Pattern Recognition (CVPR) Work-
                                                                              ACM International Conference on Multimedia, MM ’17,
     shops, page 8, Long Beach, CA, June 2019. IEEE.
                                                                              page 795–816, New York, NY, USA, 2017. Association for
 [3] Shivangi Aneja and Matthias Nießner. Generalized zero and
                                                                              Computing Machinery.
     few-shot transfer for facial forgery detection, 2020.
                                                                         [20] Dhruv Khattar, Jaipal Singh Goud, Manish Gupta, and Va-
 [4] NYT Developer API. Nyt developer api, 2020.
                                                                              sudeva Varma. Mvae: Multimodal variational autoencoder
 [5] Pepa Atanasova, Preslav Nakov, Lluı́s Màrquez i Villodre, A.            for fake news detection. The World Wide Web Conference,
     Barrón-Cedeño, Georgi Karadzhov, Tsvetomila Mihaylova,                 2019.
     Mitra Mohtarami, and James R. Glass. Automatic fact-
                                                                         [21] Sejeong Kwon, Meeyoung Cha, K. Jung, Wei Chen, and Y.
     checking using context and discourse information. Journal
                                                                              Wang. Prominent features of rumor propagation in online
     of Data and Information Quality (JDIQ), 11:1 – 27, 2019.
                                                                              social media. 2013 IEEE 13th International Conference on
 [6] Pepa Atanasova, Jakob Grue Simonsen, Christina Lioma,
                                                                              Data Mining, pages 1103–1108, 2013.
     and Isabelle Augenstein. Generating fact checking explana-
                                                                         [22] Lingzhi Li, Jianmin Bao, Ting Zhang, Hao Yang, Dong
     tions. In Proceedings of the 58th Annual Meeting of the As-
                                                                              Chen, Fang Wen, and Baining Guo. Face x-ray for more gen-
     sociation for Computational Linguistics, pages 7352–7364,
                                                                              eral face forgery detection. In Proceedings of the IEEE/CVF
     Online, July 2020. Association for Computational Linguis-
                                                                              Conference on Computer Vision and Pattern Recognition,
     tics.
                                                                              pages 5001–5010, 2020.
 [7] P. Bojanowski, E. Grave, Armand Joulin, and Tomas
     Mikolov. Enriching word vectors with subword information.           [23] Yuezun Li and Siwei Lyu. Exposing deepfake videos by de-
     Transactions of the Association for Computational Linguis-               tecting face warping artifacts, 2018.
     tics, 5:135–146, 2017.                                              [24] X. Liu, A. Nourbakhsh, Quanzhi Li, R. Fang, and S. Shah.
 [8] Daniel Cer, Yinfei Yang, Sheng-yi Kong, Nan Hua,                         Real-time rumor debunking on twitter. In CIKM ’15, 2015.
     Nicole Limtiaco, Rhomni St. John, Noah Constant, Mario              [25] J. Ma, Wei Gao, P. Mitra, Sejeong Kwon, Bernard J. Jansen,
     Guajardo-Cespedes, Steve Yuan, Chris Tar, Brian Strope,                  K. Wong, and Meeyoung Cha. Detecting rumors from mi-
     and Ray Kurzweil. Universal sentence encoder for English.                croblogs with recurrent neural networks. In IJCAI, 2016.
     In Proceedings of the 2018 Conference on Empirical Meth-            [26] J. Ma, Wei Gao, and K. Wong. Detect rumor and stance
     ods in Natural Language Processing: System Demonstra-                    jointly by neural multi-task learning. Companion Proceed-
     tions, pages 169–174, Brussels, Belgium, Nov. 2018. Asso-                ings of the The Web Conference 2018, 2018.
     ciation for Computational Linguistics.                              [27] Jing Ma, Wei Gao, and K. Wong. Rumor detection on twit-
 [9] Jun-Ho Choi and Jong-Seok Lee. Embracenet: A robust deep                 ter with tree-structured recursive neural networks. In ACL,
     learning architecture for multimodal classification. Informa-            2018.
     tion Fusion, 51:259–270, 2019.                                      [28] Huy H. Nguyen, Junichi Yamagishi, and Isao Echizen.
[10] Davide Cozzolino, Andreas Rössler, Justus Thies, Matthias               Capsule-forensics: Using capsule networks to detect forged
     Nießner, and Luisa Verdoliva. Id-reveal: Identity-aware                  images and videos. ICASSP 2019 - 2019 IEEE Interna-
     deepfake video detection, 2020.                                          tional Conference on Acoustics, Speech and Signal Process-
[11] Davide Cozzolino, Justus Thies, Andreas Rössler, Christian              ing (ICASSP), May 2019.
     Riess, Matthias Nießner, and Luisa Verdoliva. Forensictrans-        [29] Wojciech Ostrowski, Arnav Arora, Pepa Atanasova, and
     fer: Weakly-supervised domain adaptation for forgery detec-              Isabelle Augenstein. Multi-hop fact checking of political
     tion. arXiv preprint arXiv:1812.02510, 2018.                             claims, 2020.
[12] Michal Danilk. Python langdetect, 2020.                             [30] Britt Paris and Joan Donovan. Deepfakes and cheap fakes.
[13] Deepfakes, 2017.                                                         In United States of America: Data and Society, 2019.

[31] Parth Patwa, Shivam Sharma, Srinivas PYKL, Vineeth                  [44] Bin Wang and C.C. Jay Kuo. SBERT-WK: A sentence em-
     Guptha, Gitanjali Kumari, Md Shad Akhtar, Asif Ekbal,                    bedding method by dissecting bert-based word models. IEEE
     Amitava Das, and Tanmoy Chakraborty. Fighting an info-                   ACM Trans. Audio Speech Lang. Process., 28:2146–2157,
     demic: Covid-19 fake news dataset, 2020.                                 2020.
[32] Jeffrey Pennington, Richard Socher, and Christopher D Man-          [45] William Yang Wang. “liar, liar pants on fire”: A new bench-
     ning. Glove: Global vectors for word representation. In                  mark dataset for fake news detection. In Proceedings of the
     EMNLP, volume 14, pages 1532–1543, 2014.                                 55th Annual Meeting of the Association for Computational
[33] Ivan Petrov, Daiheng Gao, Nikolay Chervoniy, Kunlin Liu,                 Linguistics (Volume 2: Short Papers), pages 422–426, Van-
     Sugasa Marangonda, Chris Umé, Mr. Dpfks, Carl Shift                     couver, Canada, July 2017. Association for Computational
     Facenheim, Luis RP, Jian Jiang, Sheng Zhang, Pingyu Wu,                  Linguistics.
     Bo Zhou, and Weiming Zhang. Deepfacelab: A simple, flex-            [46] Yaqing Wang, Fenglong Ma, Zhiwei Jin, Ye Yuan, Guangxu
     ible and extensible face swapping framework, 2020.                       Xun, Kishlay Jha, Lu Su, and Jing Gao. Eann: Event adver-
[34] Poynter. Poynter : The international fact-checking network,              sarial neural networks for multi-modal fake news detection.
     2020.                                                                    In Proceedings of the 24th ACM SIGKDD International Con-
[35] Vahed Qazvinian, Emily Rosengren, Dragomir R. Radev,                     ference on Knowledge Discovery, KDD ’18, page 849–857,
     and Q. Mei. Rumor has it: Identifying misinformation in                  New York, NY, USA, 2018. Association for Computing Ma-
     microblogs. In EMNLP, 2011.                                              chinery.
[36] N. Ruchansky, Sungyong Seo, and Y. Liu. Csi: A hybrid               [47] Apurva Wani, Isha Joshi, Snehal Khandve, Vedangi Wagh,
     deep model for fake news detection. Proceedings of the 2017              and Raviraj Joshi. Evaluating deep learning approaches for
     ACM on Conference on Information and Knowledge Man-                      covid19 fake news detection, 2021.
     agement, 2017.                                                      [48] Xin Yang, Yuezun Li, and Siwei Lyu. Exposing deep fakes
[37] A. Rössler, D. Cozzolino, L. Verdoliva, C. Riess, J. Thies,             using inconsistent head poses. ICASSP 2019 - 2019 IEEE
     and M. Niessner. Faceforensics++: Learning to detect ma-                 International Conference on Acoustics, Speech and Signal
     nipulated facial images. In 2019 IEEE/CVF International                  Processing (ICASSP), May 2019.
     Conference on Computer Vision (ICCV), pages 1–11, 2019.             [49] Licheng Yu, Patrick Poirson, Shan Yang, C. Alexander Berg,
[38] Ramprasaath R. Selvaraju, Michael Cogswell, Abhishek                     and L. Tamara Berg. Modeling context in referring expres-
     Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Ba-                    sions. ECCV, 2016.
     tra. Grad-cam: Visual explanations from deep networks via           [50] Daniel Zhang, Lanyu Shang, Biao Geng, Shuyue Lai, Ke Li,
     gradient-based localization. In ICCV, pages 618–626. IEEE                Hongmin Zhu, Tanvir Amin, and Dong Wang. Fauxbuster: A
     Computer Society, 2017.                                                  content-free fauxtography detector using social media com-
[39] Lanyu Shang, Yang Zhang, Daniel Zhang, and D. Wang.                      ments. In Proceedings of IEEE BigData 2018, 2018.
     Fauxward: a graph neural network approach to fauxtogra-             [51] Zhe Zhao. Spotting icebergs by the tips: Rumor and persua-
     phy detection using social media comments. Social Network                sion campaign detection in social media. 2017.
     Analysis and Mining, 10:1–16, 2020.
                                                                         [52] Peng Zhou, Xintong Han, Vlad I. Morariu, and Larry S.
[40] James Thorne, Andreas Vlachos, Christos Christodoulopou-
                                                                              Davis. Two-stream neural networks for tampered face de-
     los, and Arpit Mittal. FEVER: a large-scale dataset for fact
                                                                              tection, Jul 2017.
     extraction and VERification. In Proceedings of the 2018
                                                                         [53] Dimitrina Zlatkova, Preslav Nakov, and Ivan Koychev. Fact-
     Conference of the North American Chapter of the Associa-
                                                                              checking meets fauxtography: Verifying claims about im-
     tion for Computational Linguistics: Human Language Tech-
                                                                              ages. In Proceedings of the 2019 Conference on Empirical
     nologies, Volume 1 (Long Papers), pages 809–819, New Or-
                                                                              Methods in Natural Language Processing and the 9th Inter-
     leans, Louisiana, June 2018. Association for Computational
                                                                              national Joint Conference on Natural Language Processing
     Linguistics.
                                                                              (EMNLP-IJCNLP), pages 2099–2108, Hong Kong, China,
[41] Slavena Vasileva, Pepa Atanasova, Lluı́s Màrquez, Alberto
                                                                              Nov. 2019. Association for Computational Linguistics.
     Barrón-Cedeño, and Preslav Nakov. It takes nine to smell
     a rat: Neural multi-task learning for check-worthiness pre-
     diction. In Proceedings of the International Conference on
     Recent Advances in Natural Language Processing (RANLP
     2019), pages 1229–1239, Varna, Bulgaria, Sept. 2019. IN-
     COMA Ltd.
[42] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszko-
     reit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser,
     and Illia Polosukhin. Attention is all you need. In Proceed-
     ings of the 31st International Conference on Neural Infor-
     mation Processing Systems, NIPS’17, page 6000–6010, Red
     Hook, NY, USA, 2017. Curran Associates Inc.
[43] L. Verdoliva. Media forensics and deepfakes: An overview.
     IEEE Journal of Selected Topics in Signal Processing,
     14(5):910–932, 2020.

Appendix

   In this supplemental document, we first explain the terminology used in our main paper; see Section A. In Section B, we document details regarding dataset cleanup and annotation. We then explain the experimental details and metrics used for evaluation in our main paper in Section C. Finally, in Section D, we perform experiments on RefCOCO [49] to evaluate which configuration is most effective for learning better object-caption groundings.

A. Definitions

  • Caption: The fact-checking community uses multiple terms for the accompanying text that describes what is in an image, including “caption” and “claim”. In this paper, we only use the word caption for these textual image descriptions; we do not use the word claim.

  • Conflicting-Image-Captions: Describing falsely what is supported by an image with the aim of misleading audiences is labelled by fact-checkers with terms such as “misleading context”, “out-of-context”, “transformed context”, “context re-targeting”, “re-contextualization”, “misrepresentation”, and possibly others. A news article or social media post that is in question shows the image with only one caption, and some fact-checking articles likewise show only a single instance of a false caption; most fact-checks, however, show at least two captions, usually one correct caption from the original dissemination of the image together with the false caption. We describe an image as out-of-context relative to two contradicting captions, and we describe these two captions as conflicting-image-captions. However, we do not aim to determine which of the two captions is false or true, or whether both captions are false. This is left to future research.

B. Dataset Cleanup & Annotation

   We build a web-based annotation tool based on Python Flask and a MySQL database to clean up and label the samples of our dataset. We first use Python’s langdetect [12] library, which is based on Google’s language detection, to remove non-English captions. We then clean up the captions by removing image/source credits using our web tool. Finally, we annotate 2,230 pairs from our Validation Set. For every image, up to 4 caption pairs with diverse vocabulary are annotated. Duplicate caption pairs are removed from the Validation split during annotation; however, we did not remove duplicates from the training set. Several samples from our dataset along with the corresponding captions from different sources are shown in Fig. 8. A snapshot of the annotation tool is shown in Fig. 9.
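As a reference, the snippet below is a minimal sketch of such a language filter using the langdetect library; the function name and data layout are illustrative only and do not reflect the exact cleanup script.

```python
# Minimal sketch of the language filter described above; the function name and
# data layout are illustrative, not the exact cleanup script.
from langdetect import detect, DetectorFactory
from langdetect.lang_detect_exception import LangDetectException

DetectorFactory.seed = 0  # make langdetect's internal sampling deterministic


def keep_english_captions(captions):
    """Return only the captions that langdetect classifies as English."""
    english = []
    for caption in captions:
        try:
            if detect(caption) == "en":
                english.append(caption)
        except LangDetectException:
            continue  # skip captions the detector cannot classify (e.g., empty)
    return english
```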
C. Setup & Evaluation Metrics

Experimental Setup: Our dataset consists of 200K training and 2,230 validation images. Every image is associated with one or more captions in the training set and with exactly two captions in the validation set. During training, we do not make use of out-of-context labels; we train our model only with the match vs. no-match loss and evaluate out-of-context image detection using the algorithm proposed in the main paper. For all experiments, we use a threshold value of t = 0.5 both for the IoU overlap of image regions and for textual similarity.
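To make the role of this threshold concrete, the sketch below shows one way the two t = 0.5 checks could be combined at test time; the box IoU helper and the caption-similarity score are generic placeholders rather than the exact components of our pipeline.

```python
# Illustrative combination of the two t = 0.5 checks at test time. The inputs
# are the top-scoring bounding box for each caption (from the grounding model)
# and a caption-similarity score in [0, 1]; both are placeholders here.
IOU_THRESHOLD = 0.5
SIM_THRESHOLD = 0.5


def box_iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-8)


def is_out_of_context(box_caption1, box_caption2, caption_similarity):
    """Flag OOC if both captions ground on the same region yet differ in meaning."""
    same_object = box_iou(box_caption1, box_caption2) >= IOU_THRESHOLD
    semantically_different = caption_similarity < SIM_THRESHOLD
    return same_object and semantically_different
```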
Hyperparameters: We run all our experiments on an Nvidia GeForce RTX 2080 Ti card. We use the Adam optimizer with a learning rate of 1e-3, decay the learning rate when the validation loss plateaus for 5 consecutive epochs, and apply early stopping with a patience of 10 successive epochs on the validation loss. For the other baselines, we use their default hyperparameters.
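The PyTorch snippet below sketches this optimization setup with standard components; the model, the training/validation helpers, the decay factor, and the maximum epoch count are not specified above and appear here only as placeholders.

```python
# Sketch of the optimization setup with standard PyTorch components. `model`,
# `train_one_epoch`, and `evaluate` are placeholders; the decay factor and
# MAX_EPOCHS are not specified above and are shown with example values.
import torch

MAX_EPOCHS = 100  # example value

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
# Decay the learning rate once the validation loss plateaus for 5 epochs.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.1, patience=5
)

best_val_loss, epochs_without_improvement = float("inf"), 0
for epoch in range(MAX_EPOCHS):
    train_one_epoch(model, optimizer)   # placeholder training loop
    val_loss = evaluate(model)          # placeholder validation loop
    scheduler.step(val_loss)
    if val_loss < best_val_loss:
        best_val_loss, epochs_without_improvement = val_loss, 0
    else:
        epochs_without_improvement += 1
        if epochs_without_improvement >= 10:   # early stopping, patience 10
            break
```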
Quantitative Results: In the main paper, we evaluate our method using the three metrics explained below:

(1) Match Accuracy quantifies how well a caption grounds with the objects in the image. To compute it numerically, we pair every image I in the dataset with a matching caption C_m and a random caption C_r and compute scores for both captions with the image. A higher score for the matching caption (s_m) than for the random caption (s_r), i.e., s_m > s_r, indicates a correct prediction.

(2) Object IoU evaluates how well the caption grounds with the correct object in the image. For example, if the caption is “a woman buying groceries”, the caption should ground with that particular “woman” in the image. For this, we select the object with the maximum score and compare it with the GT object (provided in the dataset) using IoU overlap. Note that we do not have these annotations for our dataset; hence, we evaluate this metric on the RefCOCO [49] dataset.

(3) Out-of-Context (OOC) Accuracy evaluates the model on the out-of-context classification task described in the main paper. We collected an equal number of samples for both classes (Out-of-Context and Not-Out-of-Context) for a fair evaluation.
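For clarity, the following sketch shows how Match Accuracy can be computed from these pairwise scores; score_caption stands in for the learned image-caption scoring model, and the dataset iteration is a placeholder.

```python
# Sketch of the Match Accuracy metric: the matching caption should score
# higher than a randomly drawn one. `score_caption(image, caption)` stands in
# for the learned image-caption scoring model; `dataset` yields
# (image, matching_caption) pairs and `all_captions` is the caption pool.
import random


def match_accuracy(dataset, all_captions, score_caption):
    correct = 0
    for image, matching_caption in dataset:
        random_caption = random.choice(all_captions)
        s_m = score_caption(image, matching_caption)  # score of the true match
        s_r = score_caption(image, random_caption)    # score of a random caption
        correct += int(s_m > s_r)
    return correct / len(dataset)
```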
D. Additional Experiments

   We visualize the object-caption groundings for pairs that are not out of context from the validation set in Fig. 10.

Figure 8: Dataset: images and captions from the train set (top row); samples from the validation set (bottom row).

Figure 9: Annotation tool: annotators can zoom into the image with a single left click (Col. 2). Navigation links to the original posts (Col. 3) are provided for verification. Next, annotators select context labels from the drop-down and update the annotations.

D.1. How to Encode Text?

   We evaluate which of the two text encodings (word-level or text-level) learns better object-caption groundings. For word-level encoding, we compute a fixed-size embedding for every word in the caption and combine it with every detected object during the experiments. For text-level encoding, a single embedding is generated for the entire caption and combined with the detected objects in the image. Results are shown in Table 7. We notice that encoding the entire caption as a single embedding consistently outperforms word-level encodings, since text-level embeddings contain richer information about object attributes such as relative positions, thereby learning better visual grounding. Hence, we used text embeddings for all our experiments in the main paper.
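The snippet below contrasts the two granularities, assuming GloVe vectors are available as a word-to-vector dictionary; mean pooling is shown as one plausible way to obtain a caption-level embedding and is not necessarily the pooling used in our experiments.

```python
# Sketch of the two encoding granularities from Table 7. `glove` is assumed to
# be a dict mapping lowercase words to NumPy vectors (e.g., loaded from the
# GloVe text files); mean pooling is one plausible way to obtain a single
# caption-level embedding and is not necessarily the pooling we used.
import numpy as np

EMBED_DIM = 300  # assuming 300-d GloVe vectors


def word_level_encoding(caption, glove):
    """One embedding per word; each is later paired with every detected object."""
    return [glove[w] for w in caption.lower().split() if w in glove]


def text_level_encoding(caption, glove):
    """A single embedding for the entire caption."""
    vectors = word_level_encoding(caption, glove)
    if not vectors:
        return np.zeros(EMBED_DIM)
    return np.mean(vectors, axis=0)
```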
D.2. Do Object-Detector Features Help?

   We examine whether the same backbone network (ResNet-50 [18]) trained on different tasks (image classification vs. object detection) helps learn better object-level features and, ultimately, better object-caption groundings. Results are shown in Table 8. We notice that the Mask-RCNN pre-trained backbone performs better than the ImageNet pre-trained backbone, verifying that object-detector features help learn better object-caption groundings. Hence, we used the Mask-RCNN pre-trained backbone as the image encoder for our experiments in the main paper.
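The torchvision snippet below illustrates the two backbone initializations being compared; the exact feature-extraction pipeline in our experiments (e.g., how per-object features are pooled) differs, so this is only a sketch of where the weights come from.

```python
# The two backbone initializations compared in Table 8, via torchvision.
# This only shows the source of the pre-trained weights, not the full
# per-object feature-extraction pipeline.
import torchvision

# (a) ResNet-50 pre-trained on ImageNet classification.
imagenet_backbone = torchvision.models.resnet50(pretrained=True)

# (b) The same ResNet-50 trunk taken from a Mask-RCNN detector pre-trained on
#     COCO, so its features are tuned for object detection.
mask_rcnn = torchvision.models.detection.maskrcnn_resnet50_fpn(pretrained=True)
detection_backbone = mask_rcnn.backbone.body  # IntermediateLayerGetter over ResNet-50
```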

Figure 10: Visual grounding of captions with the objects in the image for samples that are not out of context. We show object-caption scores for two captions per image; only the scores for the top 3 bounding boxes per caption are shown to avoid clutter. Positive scores (green) indicate a strong association between an object and the caption, and negative scores (red) indicate a weak visual association.

   Method         Embed. Type    Object IoU    Match Acc.
   Bbox (GT)      Word           0.33          0.85
                  Text           0.36          0.89
   Bbox (Pred)    Word           0.20          0.80
                  Text           0.27          0.88

Table 7: Ablation study of word vs. sentence embeddings on the RefCOCO dataset [49]. Text is embedded using GloVe [32] pre-trained embeddings. GT indicates ground-truth bounding boxes used for the experiments; Pred indicates bounding boxes predicted by Mask-RCNN.

   Object Features     Bbox    Object IoU    Match Acc.
   ImageNet            GT      0.39          0.90
   Mask-RCNN [17]              0.45          0.92
   ImageNet            Pred    0.32          0.88
   Mask-RCNN [17]              0.38          0.91

Table 8: We evaluate different backbones for our self-supervised training setting. (GT) indicates ground-truth boxes; (Pred) indicates bounding boxes predicted by the pre-trained Mask-RCNN. On average, we have 12.6 bboxes per image (random chance = 0.07).
