An animated picture says at least a thousand words: Selecting Gif-based Replies in Multimodal Dialog


Xingyao Wang, University of Michigan, xingyaow@umich.edu
David Jurgens, University of Michigan, jurgens@umich.edu

arXiv:2109.12212v1 [cs.CL] 24 Sep 2021

Abstract

Online conversations include more than just text. Increasingly, image-based responses such as memes and animated gifs serve as culturally recognized and often humorous responses in conversation. However, while NLP has broadened to multimodal models, conversational dialog systems have largely focused only on generating text replies. Here, we introduce a new dataset of 1.56M text-gif conversation turns and introduce a new multimodal conversational model PEPE THE KING PRAWN for selecting gif-based replies. We demonstrate that our model produces relevant and high-quality gif responses and, in a large randomized control trial of multiple models replying to real users, we show that our model replies with gifs that are significantly better received by the community.

[Figure 1: Gif responses in conversation, like the one shown (PizzaMagic: "Ahhhhh!!! The EMNLP deadline is in 24 hours!!" answered by CasualModel with a gif), are embodied dialog that use visual imagery to convey reactions and emotions. This paper develops a system to select the appropriate gif response to messages. (PDF best viewed with Adobe Acrobat)]
1   Introduction

Conversations are central to many online social platforms. While most conversations are text-based, computer-mediated dialog also affords alternative forms of communication, such as emoji or stickers like bitmoji, that allow users to express themselves (Tang and Hew, 2019; Konrad et al., 2020). Increasingly, these visual forms of communication have become common in social media (Bourlai and Herring, 2014; Highfield and Leaver, 2016), with a notable use of the reaction gif (Bakhshi et al., 2016; Miltner and Highfield, 2017). These gifs are short video sequences that depict a particular scene and sometimes contain text that acts as a meta-commentary (Eppink, 2014). As a result, conversations become multimodal, with individuals replying to one another using combinations of text and gifs (Figure 1). While conversational AI systems have been developed in a purely text-based setting, such systems do not capture the full multimodal behavior seen online. Here, we study multimodal conversation by introducing new dialog models for selecting gif replies in conversation.

Conversation analysis is central to NLP and multiple approaches have analyzed this dialog structure (Jurafsky et al., 1998; Pareti and Lando, 2018; Cohn et al., 2019) and developed conversational agents to engage with people (e.g., Fang et al., 2018; Xu et al., 2020; Hong et al., 2020). Recent work has focused on generating open domain social chatbots that engage in sustained conversations in a natural way (Ram et al., 2018). Because many of these systems are designed to support voice-based dialog, they overlook non-textual forms of interaction used in social media conversations. In parallel, multimodal NLP systems have been developed for image data, often focusing on image-to-text tasks such as image captioning (Melas-Kyriazi et al., 2018; Sharma et al., 2018) and visual question answering (Antol et al., 2015; Huang et al., 2019; Khademi, 2020). More recent work has focused on the reverse text-to-image direction, such as generating an image from a description (Niu et al., 2020; Ramesh et al., 2021). Our work unites these two strands of research by integrating image-based communication into conversational agents.

Our paper offers three main contributions. First,
we propose the new task of selecting gif responses in multimodal conversation analysis and introduce a new dataset of 1,562,701 real-world conversation turns with gif replies. Second, we introduce a new model PEPE THE KING PRAWN that fuses image and text-based features to select a relevant gif response. In in-house experiments, we show that our model substantially outperforms strong baseline models at selecting the exact gif used in real data and, in a manual test of the quality of the best responses, achieves an nDCG of 0.8145 on the annotated test set. Third, in a real-world test, we deploy our model as a part of a large-scale randomized controlled trial and show that the gif replies produced by our model are more highly voted by the community. Data, code, and models are available at https://github.com/xingyaoww/gif-reply.

2   GIF Communications

Gifs have been widely adopted in communication as a natural form of embodied speech where the visual imagery conveys emotions or a reaction as a response (Bakhshi et al., 2016; Tolins and Samermit, 2016). These gifs commonly come from widely-known cultural products, such as movies or television shows, which provide common knowledge for how they could be interpreted (Eppink, 2014; Miltner and Highfield, 2017). However, a single gif may have multiple interpretations, depending on the context, cultural knowledge of its content, and the viewer (Jiang et al., 2017). As a result, a single gif can serve multiple functions in communication (Tolins and Samermit, 2016).

Gifs have grown in their use through increasing affordances by platforms like Tumblr, Reddit, Imgur, and Twitter that allow gifs to be natively displayed like text in conversation threads (Jiang et al., 2018). Further, gif-based keyboards have been introduced that allow users to search for gifs that have been tagged with keywords or other metadata (Griggio et al., 2019). Yet, these technologies require that gif data be prepared with sufficient tags to be searchable, or that there be sufficient usage data for collaborative filtering recommendations (Jiang et al., 2018, p.9). As a result, there is a clear gap in identifying appropriate response gifs directly from the text, which this work fills.

3   Data

Despite the widespread use of gifs, no standard dataset exists for text and gif replies. Further, although platforms like Twitter support gif replies, these gifs are not canonicalized to identify which responses correspond to the same gif. Therefore, we construct a new dataset for this task by collecting responses, matching their images, and augmenting this data with metadata about the gif, where possible. A visual description of the whole procedure can be found in Appendix Figure 7.

3.1   Gif Response Data

Gifs have many uses (Miltner and Highfield, 2017) and so we use a two-step approach to collect data that focuses specifically on those likely to be used in conversation. First, gif responses are collected from Twitter by identifying all replies to English-language tweets containing animated_gif as embedded media. Tweets were collected from a ∼10% sample of Twitter from March 13th, 2019 to Jan 24th, 2020, totaling 42,096,566 tweets with a gif that we were able to retrieve. Twitter does not canonicalize its gifs, so two separate gif files may actually have the same imagery. Further, these files may not be identical due to small differences such as color variations or aspect ratios. To identify uses of the reference gifs, we use Average Hash from the imagehash library to create low-dimensional representations of each gif where hash distance corresponds to perceptual distance. Since gifs are animated and may contain varying scenes, we compute the hash for the first, middle, and final frames, concatenating these into a single hash. Two gifs are considered the same if (i) they have identical hashes or (ii) their Hamming distance is < 10 and gifs with that hash have been used more than 500 times on Twitter. This latter condition was selected after manual evaluation of thresholds to trade off between increasing the size of the training data and reducing potential noise caused by matching error. A visual example of this process can be found in Appendix Figure 8.
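As an illustrative sketch of this deduplication step (not our released implementation), the following assumes gifs are opened with Pillow and hashed with the imagehash library; the thresholds mirror the values above.

    from PIL import Image
    import imagehash

    def gif_hash(path, hash_size=8):
        # Average-hash the first, middle, and final frames of the gif.
        gif = Image.open(path)
        n_frames = getattr(gif, "n_frames", 1)
        hashes = []
        for idx in (0, n_frames // 2, n_frames - 1):
            gif.seek(idx)
            hashes.append(imagehash.average_hash(gif.convert("RGB"), hash_size=hash_size))
        return hashes  # the per-frame hashes together represent the gif

    def same_gif(hashes_a, hashes_b, times_used, max_dist=10, min_uses=500):
        # Hamming distance between concatenated hashes = sum of per-frame distances.
        dist = sum(a - b for a, b in zip(hashes_a, hashes_b))
        return dist == 0 or (dist < max_dist and times_used > min_uses)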
Not all gif responses in the Twitter data are conversational or appropriate for wider re-use. Therefore, we filter these responses to only those gifs whose imagery matches gifs hosted by the Giphy website, which is the backend for many gif-based keyboards. Giphy contains a wide collection of gifs that are curated to remove content inappropriate for general use (e.g., violent or sexual imagery). Gifs on the platform are categorized (e.g., "reaction" or "celebrities") and we identify 28 categories containing 972 keywords likely to contain gifs used in conversation. A total of 2,095,993 gifs linked to those keywords were ultimately retrieved and stored as image hashes. Additional details of categories and keywords are in Appendix B.

After matching image hashes to filter replies, we identify 115,586 unique gifs, referred to as reference gifs, and 1,562,701 tweet replies using one of these gifs, which together form our dataset. Figure 2 shows these gifs' frequency in the data; much like words, a handful of gifs receive widespread use, while a long tail of gifs are rarely used.

[Figure 2: The frequency distribution of gifs in our data roughly follows a log-normal distribution, with a few gifs used often, while a long tail of gifs are used relatively infrequently.]

3.2   Gif Metadata

We augment our gif data with information about their content. Some gifs have text that transcribes what a person is saying in the gif's scene or is a meta-commentary on the content. This text is extracted using paddleOCR (Du et al., 2020). Since some gifs are long enough to contain multiple utterances, we run OCR on four frames sampled from each quartile of the gif's length. Roughly 50% (58,020) of gifs contain at least one extracted word from the selected frames, with a mean of 5.5 extracted words per gif across the dataset.

Second, some gif repositories like Giphy allow users to tag gifs with information on their content or theme, e.g., "face palm" or "movie." We collect tags for the 115K reference gifs used in Twitter, obtaining 39,651 unique tags. These user-generated tags were moderately noisy due to orthographic variations like spelling, capitalization, and spacing. Therefore, we merge tags by (i) lower-casing the text and (ii) performing a manual merge for similar word forms (e.g., "excited" and "exciting"). To minimize noise, we retain only tags that have been used with at least five gifs and where those gifs have been used at least 1000 times in total; this process removes many low-frequency tags that are either overly-specific or idiosyncratic in their use.

Finally, we performed a manual inspection of all remaining tags to remove tags that are too general (e.g., "emotion") and retain only noun, adjective, and verb tags (words or multi-word expressions) that describe specific emotions or actions. A total of 241 unique tags were retained (Appendix C). 6.0% of gifs have at least one tag associated with them (mean 1.9 tags). However, these tagged gifs account for 38.7% of the replies in our dataset, suggesting tags are only available for more-popular gifs. Our dataset represents roughly an order of magnitude more data and more tags than the closest related dataset of Chen et al. (2017), which contained 23K gifs with 17 manually-curated emotions.
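A minimal sketch of the frequency-based tag filter, assuming tags have already been lower-cased and manually merged (the dictionary layout below is illustrative rather than our released code):

    from collections import Counter

    def filter_tags(gif_tags, gif_use_counts, min_gifs=5, min_total_uses=1000):
        # gif_tags: {gif_id: set of normalized tags}
        # gif_use_counts: {gif_id: number of times the gif appears as a reply}
        gifs_per_tag = Counter()
        uses_per_tag = Counter()
        for gif_id, tags in gif_tags.items():
            for tag in tags:
                gifs_per_tag[tag] += 1
                uses_per_tag[tag] += gif_use_counts.get(gif_id, 0)
        return {tag for tag in gifs_per_tag
                if gifs_per_tag[tag] >= min_gifs
                and uses_per_tag[tag] >= min_total_uses}

The surviving tags are then manually inspected, as described above, to arrive at the 241 final tags.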
4   Gif Reply Models

We introduce a series of models for producing a gif response in conversation. Each model will select a gif from the 115K gifs in our dataset as a response to a text-based message. This task is related to but distinct from work on image-text matching (Lee et al., 2018), which aims to find an image describing a piece of text, or text-to-image generation (e.g., Wen et al., 2015; Xu et al., 2018), which generates an image from a text description. Here, we aim to select gifs that reflect natural continuations or reactions to a message in a dialog, akin to how gifs are used in social media. For all models, additional details on the training procedures and hyperparameters are provided in Appendix A. The three models that follow use varying degrees of information about the gifs and text to select a response.

4.1   Tag-based Predictions

The first model uses tags as a shared representation for characterizing gifs and text. Analogous to how object tags are used as anchor points for image-text matching (Li et al., 2020) and pivot languages are used in machine translation (Cheng et al., 2017), we use tags to bridge information between the text in a tweet and the visual content of a gif. Here, each gif becomes associated with a set of tags describing its conversational functions and, for each text, we predict the set of tags for gif responses to it—in essence, predicting what types of responses are most appropriate. We describe both of these processes next and how gifs are ultimately selected.

Estimating Gif Tags  Only 6.0% of the gifs in our data have associated tags. Therefore we train a neural model to predict tags, using known tags as training data. To capture any changes in emotion or imagery across the gif, we make separate predictions for four frames sampled across the gif (the same used in §3.2). Each frame is passed through an EfficientNet-based (Tan and Le, 2019) GIF encoder, shown in Figure 3, to extract a low-dimensional feature vector from each frame. These frame embeddings are fused using the attention mechanism from a transformer encoder layer. The output of the transformer feeds into a fully connected layer, which is trained as a multi-label classifier using binary cross-entropy to predict which tags should be present.

Predicting Response Tags for Text  For each message, we predict the k-hot distribution of tags for a gif response by training a BERTweet model (Nguyen et al., 2020), which has been pre-trained on a large corpus of Twitter data (shown as "Tweet Encoder" in Figure 3). The model, with an additional fully connected layer, is trained as a multi-label classifier using binary cross-entropy, using the tags for the gifs used in reply (if known).

Tag-based Gif Selection  At inference time, given a message, we use the text-to-tag model to predict a k-hot distribution over tags. Then, we select the gif whose estimated tag distribution is closest in Euclidean distance.
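As a sketch of this selection step, where the model objects are hypothetical stand-ins for the trained tag predictors described above:

    import numpy as np

    def select_gif_by_tags(message, tweet_tag_model, gif_tag_matrix, gif_ids):
        # tweet_tag_model.predict is assumed to return a vector of 241 tag
        # probabilities for the message (the BERTweet-based classifier above);
        # gif_tag_matrix holds one row of estimated tag probabilities per gif.
        predicted_tags = tweet_tag_model.predict(message)        # shape: (241,)
        dists = np.linalg.norm(gif_tag_matrix - predicted_tags, axis=1)
        return gif_ids[int(np.argmin(dists))]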
4.2   CLIP variant

The second model uses an end-to-end training approach based on the architecture of OpenAI CLIP (Radford et al., 2021). The architecture features two encoders, one for text and one for images. During training, the encoders are updated using a contrastive loss that maximizes the cosine similarity of paired image-text representations and minimizes the cosine similarity of random pairs of images and texts. We replicate the CLIP architecture and training procedure, using BERTweet to encode text and EfficientNet (Tan and Le, 2019) to encode a composite image of four frames from the gif (compared with BERT and ResNet in their implementation). While originally designed to select an image for a text description, our model is trained to select a gif reply for a text message—a more challenging task than the image retrieval task used in the original CLIP setup, as the message may not contain words describing elements of the gif. At inference time, given a tweet, we use the trained tweet encoder to extract its representation and compute its cosine similarity with each encoded representation for our gifs. The gif with the highest cosine similarity is returned as the best response.
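A minimal sketch of the symmetric contrastive objective used in this CLIP-style setup (PyTorch; the temperature value is illustrative, not the tuned hyperparameter from Appendix A):

    import torch
    import torch.nn.functional as F

    def contrastive_loss(text_emb, gif_emb, temperature=0.07):
        # text_emb, gif_emb: (batch, dim) embeddings of paired tweets and gif replies.
        text_emb = F.normalize(text_emb, dim=-1)
        gif_emb = F.normalize(gif_emb, dim=-1)
        logits = text_emb @ gif_emb.t() / temperature   # pairwise cosine similarities
        targets = torch.arange(logits.size(0), device=logits.device)
        # Matched tweet-gif pairs sit on the diagonal; other pairs act as negatives.
        return 0.5 * (F.cross_entropy(logits, targets)
                      + F.cross_entropy(logits.t(), targets))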
4.3   PEPE THE KING PRAWN

Our final model, KING PRAWN[1] (referred to as "PEPE"), selects gif responses by using a richer set of multimodal features to create a gif representation. Rather than encode the gif solely from its image content, we use a multimodal encoder that captures (i) any text it might have, (ii) the types of objects present in the gif, and (iii) object regions as visual features. We encode these gif aspects using an OSCAR transformer (Li et al., 2020) to create a unified representation, shown in Figure 3 (bottom). Object names and region-of-interest feature vectors are extracted using a pre-trained bottom-up attention model (Anderson et al., 2018).

As input to the OSCAR encoder, the captions of each of the gif's four frames are concatenated together with an "[INTER_FRAME_SEP]" separator token. We filter object areas detected by the bottom-up attention model (Anderson et al., 2018), keeping all objects with probability >0.5. We then concatenate object names together with the same inter-frame separator between names of different frames. Together, the caption text, object names, and image-region features are fed into the OSCAR transformer encoder to generate a GIF feature vector; the transformer is initialized with the default OSCAR weights. We use BERTweet to encode text. The entire PEPE model is trained end-to-end using a contrastive loss, similar to the CLIP model.

[1] KING PRAWN refers to "selecKting INteresting Gifs for Personal RespAWNses." In this crazy muppet-name-land-grab world we live in, our only regret is that we couldn't get "Pepino Rodrigo Serrano Gonzales" to fit as a bacronym, which we leave to future work.
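A sketch of how the OSCAR-side inputs could be assembled from the per-frame OCR captions and detector outputs; the data layout below is assumed for illustration only:

    INTER_FRAME_SEP = "[INTER_FRAME_SEP]"

    def build_oscar_inputs(frame_captions, frame_detections, min_prob=0.5):
        # frame_captions: one OCR-extracted caption string per sampled frame
        # frame_detections: per frame, a list of (object_name, probability, feature_vector)
        caption_text = f" {INTER_FRAME_SEP} ".join(frame_captions)
        per_frame_names, region_features = [], []
        for detections in frame_detections:
            kept = [(name, feat) for name, prob, feat in detections if prob > min_prob]
            per_frame_names.append(" ".join(name for name, _ in kept))
            region_features.extend(feat for _, feat in kept)
        object_text = f" {INTER_FRAME_SEP} ".join(per_frame_names)
        return caption_text, object_text, region_features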
5   Evaluation

We initially evaluate the methods in two ways. First, we use traditional classification-based evaluation, testing whether the models can reproduce the observed gif replies. However, some messages could have multiple valid gif responses. Therefore, as a second test, we evaluate the model in a retrieval setting, measuring whether its most-probable responses are of good quality for a message.
[Figure 3: The different encoder modules used to construct the models in §4. Top: an EfficientNet GIF encoder (a shared-weight EfficientNet-b0 over four selected frames, fused by a transformer encoder layer) and a BERTweet-base tweet encoder, each producing a 1 x 512 feature vector for downstream tasks. Bottom: the Oscar GIF encoder, which feeds the OCR-extracted caption, the extracted object names (separated by [INTER_FRAME_SEP]), and the extracted object feature vectors from bottom-up attention into an Oscar multimodal transformer to produce a 1 x 512 feature vector.]

Experimental Setup  Models are trained and tested on a dataset containing 1,562,701 Tweet-GIF pairs associated with 115,586 unique gifs, where 605,063 tweet-gif pairs are associated with at least one tag. Using the finalized 241 unique tags as classes for multi-label classification, we split the dataset, stratifying on tags, using the iterative train-test split method provided by the scikit-multilearn library (Sechidis et al., 2011; Szymański and Kajdanowicz, 2017) to create an 80:10:10 train, dev, and test split used to train the models described in §4. Following BERTweet (Nguyen et al., 2020), we preprocess tweets in our dataset using the NLTK TweetTokenizer for tokenization and the emoji package to translate emotion icons, and we convert mentions and links to the special "@USER" and "HTTPURL" tokens.
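As a sketch of this setup, assuming a binary tweet-by-tag indicator matrix and using the scikit-multilearn splitter (an 80/20 split followed by a 50/50 split of the remainder yields 80:10:10); the toy arrays below stand in for the real data:

    import re
    import numpy as np
    import emoji
    from nltk.tokenize import TweetTokenizer
    from skmultilearn.model_selection import iterative_train_test_split

    def normalize_tweet(text, tokenizer=TweetTokenizer()):
        text = re.sub(r"https?://\S+", "HTTPURL", text)   # mask links
        text = re.sub(r"@\w+", "@USER", text)              # mask mentions
        text = emoji.demojize(text)                        # translate emotion icons
        return " ".join(tokenizer.tokenize(text))

    n_tweets, n_tags = 1000, 241                           # toy sizes for illustration
    X = np.arange(n_tweets).reshape(-1, 1)                 # tweet row indices
    Y = (np.random.rand(n_tweets, n_tags) < 0.05).astype(int)   # toy tag matrix

    X_train, y_train, X_rest, y_rest = iterative_train_test_split(X, Y, test_size=0.2)
    X_dev, y_dev, X_test, y_test = iterative_train_test_split(X_rest, y_rest, test_size=0.5)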
Annotated Data  To test whether each model's predictions are valid responses, we annotate the ten most-probable gif predictions for a subset of the tweets in our test data. Many tweets in our test set require substantial context to understand due to having few tokens, linking to URLs that provide extra knowledge, or mentioning other users in directed communication. These factors suggest social context or general knowledge aids the recipient's understanding of the gif's intentions. While the model can still benefit from training on such examples, judging the appropriateness of a response is difficult without access to the social context. Therefore, to reduce interpretation ambiguity, we annotate only tweets without URLs or user mentions and having at least 10 tokens. This process selects tweets with sufficient content to judge appropriateness independent of the larger social context.

Two annotators (the authors) were shown a list of potential gif responses for a tweet and asked to judge whether each was an appropriate gif response (a binary rating). Gifs were selected from the ten most-probable replies for each system and collectively shown in random order to prevent knowing which system generated each reply. A total of 2,500 gif-tweet pairings were annotated. Annotators attained a Krippendorff's α of 0.462; while moderate agreement, this value is expected given known differences in how people interpret and value gif responses based on their familiarity with its content, message interpretation, and life experience (Jiang et al., 2018). We follow the evaluation setup from other retrieval-based dialog systems (e.g., Yu et al., 2021; Kumar and Callan, 2020) and use normalized Discounted Cumulative Gain (nDCG), which measures whether more appropriate gif responses are ranked higher. A gif's appropriateness score is the sum of the annotators' ratings.
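A minimal sketch of nDCG over a model's ranked top-10 replies, using the summed annotator ratings as graded relevance:

    import numpy as np

    def ndcg_at_k(relevances, k=10):
        # relevances: annotator-score sums for the model's replies, in ranked order.
        rel = np.asarray(relevances, dtype=float)[:k]
        if rel.sum() == 0:
            return 0.0
        discounts = 1.0 / np.log2(np.arange(2, rel.size + 2))
        dcg = float(np.sum(rel * discounts))
        ideal_dcg = float(np.sum(np.sort(rel)[::-1] * discounts))
        return dcg / ideal_dcg

    # e.g., ndcg_at_k([2, 0, 1, 2, 0, 0, 1, 0, 0, 0])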
Results  The PEPE model was able to identify relevant and good-quality gif responses, as shown by its performances on the test data (Table 1) and annotated data (Table 2). Performance on the test set is expected to be low, given the challenge of identifying the exact gif used for a tweet when multiple possible gifs are likely to be equally valid. However, the PEPE model is still able to identify the exact gif (out of 115K) in its top 10 predictions for 3% of the data, substantially outperforming all other models.
Model                    Top-1       Top-5       Top-10
Tag-based                0.000000    0.000092    0.000119
Random                   0.000020    0.000059    0.000158
CLIP variant             0.000488    0.001669    0.002783
Distribution sampling    0.000996    0.005098    0.009780
PEPE                     0.005375    0.018723    0.030918

Table 1: Models' precision-at-k on selecting the exact gif used as a response for a tweet in the test set; this performance is an underestimate of each model, as many model-predicted gifs may be appropriate.

Model                    nDCG
Random                   0.3273
Tag-based                0.4526
Distribution sampling    0.4969
CLIP variant             0.5934
PEPE                     0.8145

Table 2: Models' nDCG scores at proposing appropriate gif replies, measured from annotations on the top 10 most probable gif replies of each model.

Model                            nDCG
PEPE                             0.8145
PEPE without object names        0.7665
PEPE without caption             0.7559
PEPE without object features     0.7533

Table 3: Results for ablated versions of PEPE where specific input is removed (cf. Table 2) show that all input forms contribute to the ability to select replies.

Performance on the annotated data (Table 2) provides a more realistic assessment of whether models can generate high-quality replies, as it measures whether the models' replies themselves were good. The PEPE model attains substantially higher performance (p

and evaluating the model's resulting gifs on the same test instances.

The ablated model performances, shown in Table 3, reveal that each input is useful for selecting gifs. Object features capture visual information about what specifically is present in the gif (beyond the discrete names of what is present, e.g., "person" or "building") and show that multimodality is important for high performance: predicting replies just from a gif's caption and categorized content is insufficient. Similarly, the caption of a gif (if present) is important, as the text can help make explicit the intended interpretation of a gif.

6   Field Experiment

To test the generalizability of our models and quality of their responses, we conduct a large-scale randomized controlled trial (RCT) on Imgur.
[Table 4: Model-selected replies to messages (paraphrased for privacy); each reply in the original PDF is a gif image linked to Giphy. Parent tweets shown: "That wonderful feeling you get when you arrive to a business dinner that you're supposedly paying for...and realize you've forgotten your credit card" and "I'm convinced some of y'all don't get laid"; reply columns: Tag-based, CLIP variant, PEPE, Dist. Samp., Random.]

Comments on Imgur receive a score from community upvotes and downvotes, and we use this score in our experiments to evaluate quality.

Our experiment focuses on generating gif-based replies to top-level text comments (comments made directly to the post). This setup mirrors the conversational data our models were trained on. Imgur supports several ways of filtering its stream of posts. To ensure that our replies have sufficient visibility, we select posts that have already received 10 comments and appear in the "most viral" sorting. From these posts, we reply to the top-rated text comment. The RCT runs from 8 AM to 8 PM (local time), making at most 10 replies per hour.
Not all topics or comments are suitable for automated responses and great care was taken to prevent potential harm to the community. Through multiple rounds of testing which replies would be responded to, we curated a list of keywords that could lead to potentially controversial replies, such as terms about religion or race (full list in Appendix D). Any comment containing a token or lemma matching a word on this list is excluded and not replied to. As a further safeguard, experimenters monitored all replies to remove any that were deemed inappropriate. See the Ethics Section (§9) for a longer discussion of safeguards.
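A sketch of the token/lemma exclusion check, using spaCy for lemmatization; the blocklist entries here are illustrative stand-ins for the Appendix D list:

    import spacy

    nlp = spacy.load("en_core_web_sm")
    BLOCKLIST = {"religion", "race"}   # illustrative; the full list is in Appendix D

    def is_safe_to_reply(comment: str) -> bool:
        # Exclude a comment if any token or its lemma matches the blocklist.
        for token in nlp(comment.lower()):
            if token.text in BLOCKLIST or token.lemma_ in BLOCKLIST:
                return False
        return True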
The field experiment consists of five arms, corresponding to the three trained models and the two baseline models. During each trial, one model is selected and generates a response; the trained model replies with the most probable gif.[4]

[4] Due to a bug, early experimental trials for the CLIP and PEPE models used the tenth most-probable gif; however, using the ratings in the annotated data, a t-test of the difference in quality for most- and tenth-most probable gifs showed no statistically-significant difference in quality for both models (p>0.1). Therefore, we include this data in our results.

Not all models are equally likely to perform well and so, to make the most use of our trial budget, we use Thompson sampling (Russo et al., 2018) to randomly select which arm of the trial to use. Thompson sampling builds a probability model for the estimated reward of each arm (here, the score a reply receives) and samples from the model such that higher-rewarding arms are sampled more frequently. As a result, this method can provide tighter estimates for the reward of the most useful arms. Scores on Imgur have a skewed distribution, with few comments receiving very high scores and most receiving near the default score (1). Therefore, we use Poisson Thompson sampling. Some comments may be downvoted to receive scores below zero, so for simplicity, we truncate these scores to 0.

We initialize the reward estimates for our experiment by selecting one of the five models in a round-robin manner to reply to an Imgur comment for 3 days. These initial scores act as priors for Thompson sampling to update Poisson distributions for each model. In the trial, we choose a model by sampling from the updated distributions, using all previous days' scores as the prior. The experiment ran from April 15th, 2021 to August 30th, 2021, and the models generated a total of 8,369 replies.
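A minimal sketch of Poisson Thompson sampling with a conjugate Gamma prior over each arm's mean score (the prior parameters and counts below are illustrative):

    import numpy as np

    rng = np.random.default_rng(0)

    def choose_arm(arm_stats, alpha0=1.0, beta0=1.0):
        # arm_stats: {model_name: (total_truncated_score, n_replies)} from earlier days.
        # The posterior over each arm's Poisson rate is Gamma(alpha0 + total, beta0 + n).
        sampled = {
            name: rng.gamma(alpha0 + total, 1.0 / (beta0 + n))
            for name, (total, n) in arm_stats.items()
        }
        return max(sampled, key=sampled.get)

    # Example: pick the arm (model) to use for the next reply.
    arm = choose_arm({"pepe": (420, 300), "clip": (310, 290), "random": (250, 280)})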
To evaluate the results of the RCT, we construct a Negative Binomial regression on the dependent variable of the score received for a model's reply, truncating negative scores to zero. The Negative Binomial was chosen instead of Poisson due to over-dispersion in the score variable. The models are treated as a categorical variable, using the random model as a reference. Since the score will depend, in part, on the attention received by the parent post and comment (higher-rated comments are displayed first), we include linear effects for the post and parent comment. Finally, we include five text-related variables to control for the content of the parent comment: the topic distribution (Appendix Table 9) from a 10-topic model (dropping one topic due to collinearity), the sentiment and subjectivity of the message estimated using the TextBlob library, the length of the comment, and whether the comment contained a question.

[Figure 4: Negative Binomial regression coefficients for each model (Tag-based, Dist. Samp., PEPE, CLIP variant) on predicting a gif reply's score, using the random-gif model as the reference category; bars show standard error and *** denotes significance at 0.01.]
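A sketch of this regression using statsmodels; the column names and toy data are assumed placeholders for the controls described above (topic proportions omitted for brevity):

    import numpy as np
    import pandas as pd
    import statsmodels.api as sm
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(0)
    df = pd.DataFrame({                       # toy stand-in for the RCT reply log
        "score": rng.poisson(2, 500),
        "model": rng.choice(["random", "pepe", "clip", "tag", "dist"], 500),
        "post_score": rng.normal(size=500),
        "parent_score": rng.normal(size=500),
        "sentiment": rng.normal(size=500),
        "subjectivity": rng.random(500),
        "length": rng.integers(1, 100, 500),
        "is_question": rng.integers(0, 2, 500),
    })
    df["score"] = df["score"].clip(lower=0)   # truncate downvoted replies to zero

    fit = smf.glm(
        "score ~ C(model, Treatment('random')) + post_score + parent_score"
        " + sentiment + subjectivity + length + is_question",
        data=df,
        family=sm.families.NegativeBinomial(),
    ).fit()
    print(fit.summary())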
6.2   Results

The field experiment demonstrates that the PEPE model is able to generate significantly higher-scoring responses. Figure 4 shows the Negative Binomial regression coefficients for the three models and the empirical distribution baseline, with the random gif model as a reference; full regression results are shown in Appendix Table 6. The PEPE model substantially outperforms all other models (p

and the reply (Madden, 2018, p.29). Further, we observed that, when the model's reply truly seemed random, some users replied saying they upvoted solely because they enjoyed the gif.

As a follow-up experiment, we tested whether models could be getting higher (or lower) scores by repeatedly picking the same gifs that are skewed towards a positive or negative reaction. Figure 5 shows the score distribution for the top ten most frequently

[Figure 5: Score distributions for most-frequently used gifs; panels show the (a) Tag-based, (b) CLIP variant, and (c) PEPE models, with Giphy gif IDs on the y-axis and GIF Reply Score on the x-axis.]

0.73 for all models). The most-used gifs for each model had average scores that were positive, but the distributions for each gif show that some uses were occasionally downvoted. This high variance in scores indicates that a gif's intrinsic qualities are not solely responsible for the received score and, instead, appropriate use in context plays a significant part in community reception.

We examined whether models relied on the same set of gifs. Figure 6 shows the distribution of gif
uses by each model, indicating that the tag-based model relied frequently on a small set of gifs. However, the PEPE and CLIP variant models were substantially more varied, indicating they draw from the long tail of possible gifs.

[Figure 6: Gif use frequency by each model (Tag-based, CLIP variant, Pepe, Dist. Samp., random), shown as frequency-vs-rank log-scaled with first-order line fit (jitter added for separation).]

Do any of our models spark more subsequent conversation? We fit a separate Negative Binomial regression on the total number of comments made to our reply, using the same IVs as the score regression and including the reply's score itself as another IV. This model (Appendix Table 8) shows that both the distributional-sampling baseline and PEPE models produced replies that led to fewer subsequent comments (p

existing messages from a large social media corpus as potential replies and rank these to select a response. Our work mirrors models that use neural networks for ranking (e.g., Yan et al., 2016; Inaba and Takahashi, 2016; Penha and Hauff, 2021); however, we note that many recent knowledge-grounded and open domain models use encoder-decoder methods to improve versatility and applicability (e.g., Ghazvininejad et al., 2018; Gao et al., 2019; Zhou et al., 2020). Generative approaches are likely inappropriate for gif-based conversation as gifs are more akin to mimetic artifacts that build on cultural knowledge (Eppink, 2014), making synthesizing a new gif from scratch likely less effective.

All three models used here rely on joint embedding spaces for gif and text. Multiple approaches in NLP have been proposed to align these representations (Kiros et al., 2014; Wang et al., 2016), often for particular applications such as visual question answering (Antol et al., 2015). Recent work has focused on embedding these media with a single encoder that takes both text and images as input (e.g., Wang et al., 2019; Chen et al., 2020), in contrast to our model that uses separate image and text encoders (Figure 3); these multimodal encoders are prohibitively computationally expensive to use in our setting during inference time, as the model would need to be run on each gif (and message) to rank replies, compared with our model that only
9   Ethics

The interactive nature of the RCT necessitated a close consideration of ethical issues (Thieltges et al., 2016). Prior to beginning the RCT, the study team obtained IRB approval to interact with users. While necessary in the legal sense, IRB approval is not sufficient to justify the ethical grounds of the study. The primary risks of the study are if the automated models respond with an inappropriate gif or respond to a message that is not suitable for an automated response (e.g., discussing the death of a loved one or making an offensive statement). These risks were mitigated in multiple ways throughout the dataset construction and field experiment.

First, the selection criteria for which comments we reply to were designed to only reply to content that was already deemed appropriate by the community. By selecting only posts that had received sufficient upvotes to be called "viral" and were already receiving comments, we mitigate the risk of engaging in topics or conversations that are inappropriate according to the norms of the Imgur community, as such posts would be removed by moderators or would have received sufficient downvotes to stay in obscurity.

Second, by focusing on the top-voted comment to these posts, we again reply to content that has already been deemed high-quality by the community. This comment-level criterion substantially lowers the risk of our models commenting on inappropriate comments (e.g., a comment insulting another user), as these comments are readily downvoted by the community prior to our intervention.

Third, we employed extensive filtering to avoid replying to any comment containing a potentially sensitive topic, e.g., a discussion of race or trauma (keywords are listed in Appendix D). The initial set of keywords was developed through examining potentially sensitive topics and then iteratively added to by simulating which messages our RCT would reply to and examining whether it would be appropriate. During the field RCT, experimenters continuously monitored the comments to ensure no harm was being done. Ultimately, only three comments were removed during the initial two days, which was due to a bug in the lemmatization; these comments should have been filtered out by our earlier criteria, were removed quickly, and we did not observe any notable response from the community.

Fourth, one risk is replying with an inappropriate gif, which is mitigated by the use of Giphy to seed our initial gifs. As this platform is curated and does not host objectively offensive gifs (e.g., overly-violent content), our initial gif set is relatively free of objectionable gifs. Because our model learns directly from gifs' frequency of use, unless objectively offensive gifs are widely used, they are unlikely to be deployed from our RCT; we speculate that few objectively offensive gifs are widely used and, in practice, we have not identified any during the study period or when examining hundreds of random gifs in our data (or used in the RCT).

Finally, one risk is that by learning gif responses from observed data, our models may reinforce cultural stereotypes that are encoded in the gifs themselves (Erinn, 2019), e.g., the association of African American individuals with strong emotions. While our gif data is relatively clean of overtly offensive gifs, we acknowledge that our model likely does inadvertently perpetuate some of these latent biases in the data. However, the success of our model suggests a future mitigation strategy for platforms suggesting gifs: as biases become known, our approach can be used to suggest less-biased gifs as potential responses to mitigate future harm.

Acknowledgments

We thank the reviewers, area chairs, and senior area chairs for their thoughtful comments and feedback. We also thank the Blablablab for helpful feedback, for letting us deploy PEPE to the group's Slack, and for putting up with the ridiculous gif replies, and Imgur for being a wonderful community. This material is based upon work supported by the National Science Foundation under Grant No. 2007251.

References

Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. 2018. Bottom-up and top-down attention for image captioning and visual question answering. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, pages 6077–6086. IEEE Computer Society.

Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. 2015. VQA: Visual question answering. In 2015 IEEE International Conference on Computer Vision, ICCV 2015, Santiago, Chile, December 7-13, 2015, pages 2425–2433. IEEE Computer Society.
Saeideh Bakhshi, David A. Shamma, Lyndon Kennedy, Yale Song, Paloma de Juan, and Joseph Jofish Kaye. 2016. Fast, cheap, and good: Why animated gifs engage us. In Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems, San Jose, CA, USA, May 7-12, 2016, pages 575–586. ACM.

Michael S Bernstein, Margaret Levi, David Magnus, Betsy Rajala, Debra Satz, and Charla Waeiss. 2021. ESR: Ethics and society review of artificial intelligence research. ArXiv preprint, abs/2106.11521.

Fumihiro Bessho, Tatsuya Harada, and Yasuo Kuniyoshi. 2012. Dialog system using real-time crowdsourcing and Twitter large-scale corpus. In Proceedings of the 13th Annual Meeting of the Special Interest Group on Discourse and Dialogue, pages 227–231, Seoul, South Korea. Association for Computational Linguistics.

Elli Bourlai and Susan C Herring. 2014. Multimodal communication on tumblr: “i have so many feels!”. In Proceedings of the 2014 ACM Conference on Web Science, pages 171–175.

Weixuan Chen, Ognjen Oggi Rudovic, and Rosalind W Picard. 2017. Gifgif+: Collecting emotional animated gifs with clustered multi-task learning. In 2017 Seventh International Conference on Affective Computing and Intelligent Interaction (ACII), pages 510–517. IEEE.

Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, and Jingjing Liu. 2020. UNITER: Universal image-text representation learning. In ECCV.

Yong Cheng, Qian Yang, Yang Liu, Maosong Sun, and Wei Xu. 2017. Joint training for pivot-based neural machine translation. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI 2017, Melbourne, Australia, August 19-25, 2017, pages 3974–3980. ijcai.org.

Michelle Cohn, Chun-Yen Chen, and Zhou Yu. 2019. A large-scale user study of an Alexa Prize chatbot: Effect of TTS dynamism on perceived quality of social dialog. In Proceedings of the 20th Annual SIGdial Meeting on Discourse and Dialogue, pages 293–306, Stockholm, Sweden. Association for Computational Linguistics.

Yuning Du, Chenxia Li, Ruoyu Guo, Xiaoting Yin, Weiwei Liu, Jun Zhou, Yifan Bai, Zilin Yu, Yehua Yang, Qingqing Dang, et al. 2020. PP-OCR: A practical ultra lightweight OCR system. ArXiv preprint, abs/2009.09941.

Jason Eppink. 2014. A brief history of the gif (so far). Journal of Visual Culture, 13(3):298–306.

Wong Erinn. 2019. Digital blackface: How 21st century internet language reinforces racism.

Hao Fang, Hao Cheng, Maarten Sap, Elizabeth Clark, Ari Holtzman, Yejin Choi, Noah A. Smith, and Mari Ostendorf. 2018. Sounding board: A user-centric and content-driven social chatbot. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Demonstrations, pages 96–100, New Orleans, Louisiana. Association for Computational Linguistics.

Jianfeng Gao, Michel Galley, and Lihong Li. 2019. Neural Approaches to Conversational AI: Question Answering, Task-oriented Dialogues and Social Chatbots. Now Foundations and Trends.

Marjan Ghazvininejad, Chris Brockett, Ming-Wei Chang, Bill Dolan, Jianfeng Gao, Wen-tau Yih, and Michel Galley. 2018. A knowledge-grounded neural conversation model. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18), the 30th Innovative Applications of Artificial Intelligence (IAAI-18), and the 8th AAAI Symposium on Educational Advances in Artificial Intelligence (EAAI-18), New Orleans, Louisiana, USA, February 2-7, 2018, pages 5110–5117. AAAI Press.

Carla F Griggio, Joanna Mcgrenere, and Wendy E Mackay. 2019. Customizations and expression breakdowns in ecosystems of communication apps. Proceedings of the ACM on Human-Computer Interaction, 3(CSCW):1–26.

Tim Highfield and Tama Leaver. 2016. Instagrammatics and digital methods: Studying visual social media, from selfies and gifs to memes and emoji. Communication Research and Practice, 2(1):47–62.

Chung Hoon Hong, Yuan Liang, Sagnik Sinha Roy, Arushi Jain, Vihang Agarwal, Ryan Draves, Zhizhuo Zhou, William Chen, Yujian Liu, Martha Miracky, et al. 2020. Audrey: A personalized open-domain conversational bot. In Alexa Prize Proceedings.

Pingping Huang, Jianhui Huang, Yuqing Guo, Min Qiao, and Yong Zhu. 2019. Multi-grained attention with object-level grounding for visual question answering. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3595–3600, Florence, Italy. Association for Computational Linguistics.

Michimasa Inaba and Kenichi Takahashi. 2016. Neural utterance ranking model for conversational dialogue systems. In Proceedings of the 17th Annual Meeting of the Special Interest Group on Discourse and Dialogue, pages 393–403, Los Angeles. Association for Computational Linguistics.

Jialun “Aaron” Jiang, Jed R Brubaker, and Casey Fiesler. 2017. Understanding diverse interpretations of animated GIFs. In Proceedings of the 2017 CHI Conference Extended Abstracts on Human Factors in Computing Systems, pages 1726–1732.
Jialun “Aaron” Jiang, Casey Fiesler, and Jed R Brubaker. 2018. “The Perfect One”: Understanding communication practices and challenges with animated GIFs. Proceedings of the ACM on Human-Computer Interaction, 2(CSCW):1–20.

Daniel Jurafsky, Elizabeth Shriberg, Barbara Fox, and Traci Curl. 1998. Lexical, prosodic, and syntactic cues for dialog acts. In Discourse Relations and Discourse Markers.

Mahmoud Khademi. 2020. Multimodal neural graph memory networks for visual question answering. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7177–7188, Online. Association for Computational Linguistics.

Ryan Kiros, Ruslan Salakhutdinov, and Richard S Zemel. 2014. Unifying visual-semantic embeddings with multimodal neural language models. ArXiv preprint, abs/1411.2539.

Artie Konrad, Susan C Herring, and David Choi. 2020. Sticker and emoji use in Facebook Messenger: Implications for graphicon change. Journal of Computer-Mediated Communication, 25(3):217–235.

Vaibhav Kumar and Jamie Callan. 2020. Making information seeking easier: An improved pipeline for conversational search. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 3971–3980, Online. Association for Computational Linguistics.

Kuang-Huei Lee, X. Chen, G. Hua, H. Hu, and Xiaodong He. 2018. Stacked cross attention for image-text matching. ArXiv preprint, abs/1803.08024.

Xiujun Li, Xi Yin, Chunyuan Li, Pengchuan Zhang, Xiaowei Hu, Lei Zhang, Lijuan Wang, Houdong Hu, Li Dong, Furu Wei, et al. 2020. Oscar: Object-semantics aligned pre-training for vision-language tasks. In European Conference on Computer Vision, pages 121–137. Springer.

John Savery Madden. 2018. The Phenomenological Exploration of Animated GIF Use in Computer-Mediated Communication. Ph.D. thesis, University of Oklahoma.

Luke Melas-Kyriazi, Alexander Rush, and George Han. 2018. Training for diversity in image paragraph captioning. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 757–761, Brussels, Belgium. Association for Computational Linguistics.

Kate M Miltner and Tim Highfield. 2017. Never gonna GIF you up: Analyzing the cultural significance of the animated GIF. Social Media + Society, 3(3):2056305117725223.

Dat Quoc Nguyen, Thanh Vu, and Anh Tuan Nguyen. 2020. BERTweet: A pre-trained language model for English tweets. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 9–14, Online. Association for Computational Linguistics.

Tianrui Niu, Fangxiang Feng, Lingxuan Li, and Xiaojie Wang. 2020. Image synthesis from locally related texts. In Proceedings of the 2020 International Conference on Multimedia Retrieval, pages 145–153.

Silvia Pareti and Tatiana Lando. 2018. Dialog intent structure: A hierarchical schema of linked dialog acts. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan. European Language Resources Association (ELRA).

Gustavo Penha and Claudia Hauff. 2021. On the calibration and uncertainty of neural learning to rank models for conversational search. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 160–170, Online. Association for Computational Linguistics.

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. 2021. Learning transferable visual models from natural language supervision. ArXiv preprint, abs/2103.00020.

Ashwin Ram, Rohit Prasad, Chandra Khatri, Anu Venkatesh, Raefer Gabriel, Qing Liu, Jeff Nunn, Behnam Hedayatnia, Ming Cheng, Ashish Nagar, et al. 2018. Conversational AI: The science behind the Alexa Prize. ArXiv preprint, abs/1801.03604.

Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Mark Chen, Rewon Child, Vedant Misra, Pamela Mishkin, Gretchen Krueger, Sandhini Agarwal, and Ilya Sutskever. 2021. DALL-E: Creating images from text. https://openai.com/blog/dall-e/.

Daniel J Russo, Benjamin Van Roy, Abbas Kazerouni, Ian Osband, and Zheng Wen. 2018. A tutorial on Thompson sampling. Foundations and Trends® in Machine Learning, 11(1):1–96.

Konstantinos Sechidis, Grigorios Tsoumakas, and Ioannis Vlahavas. 2011. On the stratification of multi-label data. Machine Learning and Knowledge Discovery in Databases, pages 145–158.

Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. 2018. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2556–2565, Melbourne, Australia. Association for Computational Linguistics.
Piotr Szymański and Tomasz Kajdanowicz. 2017. A network perspective on stratification of multi-label data. In First International Workshop on Learning with Imbalanced Domains: Theory and Applications, pages 22–35. PMLR.

Mingxing Tan and Quoc V. Le. 2019. EfficientNet: Rethinking model scaling for convolutional neural networks. In Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, volume 97 of Proceedings of Machine Learning Research, pages 6105–6114. PMLR.

Ying Tang and Khe Foon Hew. 2019. Emoticon, emoji, and sticker use in computer-mediated communication: A review of theories and research findings. International Journal of Communication, 13:27.

Andree Thieltges, Florian Schmidt, and Simon Hegelich. 2016. The devil's triangle: Ethical considerations on developing bot detection methods. In 2016 AAAI Spring Symposium Series.

Jackson Tolins and Patrawat Samermit. 2016. Gifs as embodied enactments in text-mediated conversation. Research on Language and Social Interaction, 49(2):75–91.

Liwei Wang, Yin Li, and Svetlana Lazebnik. 2016. Learning deep structure-preserving image-text embeddings. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pages 5005–5013. IEEE Computer Society.

Zihao Wang, Xihui Liu, Hongsheng Li, Lu Sheng, Junjie Yan, Xiaogang Wang, and Jing Shao. 2019. CAMP: Cross-modal adaptive message passing for text-image retrieval. In 2019 IEEE/CVF International Conference on Computer Vision, ICCV 2019, Seoul, Korea (South), October 27 - November 2, 2019, pages 5763–5772. IEEE.

Miaomiao Wen, Nancy Baym, Omer Tamuz, Jaime Teevan, Susan T Dumais, and Adam Kalai. 2015. Omg ur funny! Computer-aided humor with an application to chat. In ICCC, pages 86–93.

Jun Xu, Haifeng Wang, Zheng-Yu Niu, Hua Wu, Wanxiang Che, and Ting Liu. 2020. Conversational graph grounded policy learning for open-domain conversation generation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1835–1845, Online. Association for Computational Linguistics.

Tao Xu, Pengchuan Zhang, Qiuyuan Huang, Han Zhang, Zhe Gan, Xiaolei Huang, and Xiaodong He. 2018. AttnGAN: Fine-grained text to image generation with attentional generative adversarial networks. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, pages 1316–1324. IEEE Computer Society.

Rui Yan, Yiping Song, and Hua Wu. 2016. Learning to respond with deep neural networks for retrieval-based human-computer conversation system. In Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2016, Pisa, Italy, July 17-21, 2016, pages 55–64. ACM.

Shi Yu, Zhenghao Liu, Chenyan Xiong, Tao Feng, and Zhiyuan Liu. 2021. Few-shot conversational dense retrieval. ArXiv preprint, abs/2105.04166.

Li Zhou, Jianfeng Gao, Di Li, and Heung-Yeung Shum. 2020. The design and implementation of XiaoIce, an empathetic social chatbot. Computational Linguistics, 46(1):53–93.
 Category               Subcategory

 Cartoons & Comics      aqua teen hunger force
 Celebrities            richard pryor
 Reactions              angry
 Emotions               happy
 Anime                  bleach
 Art & Design           psychedelic
 Nature                 sunrise
 Transportation         bicycle

Table 5: Examples of GIF categories on GIPHY


A    Additional Details on Model Training

In the following, we provide additional details on how each of the three models was trained.
A.1    Tag-based Model

EfficientNet-based Tag Classifier   Gifs are reshaped to 224 by 224 pixels, keeping the aspect ratio by padding, and normalized to a mean of 0.5 and a standard deviation of 0.5 for each channel before being fed into the EfficientNet-based model. We selected the unique GIFs from the finalized dataset that have at least one associated tag and used the iterative train-test split on the k-hot tag representation to select 5% of those GIFs for validation. The EfficientNet tag classifier was trained for 100 epochs with a batch size of 32, using the AdamW optimizer with a learning rate of 1e-5 and a weight decay of 1e-3. The best validation performance was achieved at the 40th epoch with a macro-f1 of 0.30 in predicting 241 multi-label classes. Early experiments showed that a transformer encoder layer (macro-f1 of 0.30) outperforms a linear layer (macro-f1 of 0.19) in fusing multi-frame gif features on the development set; we therefore use a transformer encoder layer to fuse the features of different frames in our implementation.
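For concreteness, a minimal sketch of this frame preprocessing is shown below; the black padding color and the use of torchvision are our own assumptions for illustration rather than the released code.

```python
import torch
from PIL import Image
from torchvision import transforms

def preprocess_frame(frame: Image.Image, size: int = 224) -> torch.Tensor:
    """Pad a gif frame to a square (preserving aspect ratio), resize to 224x224,
    and normalize each channel to mean 0.5 and standard deviation 0.5."""
    frame = frame.convert("RGB")
    w, h = frame.size
    side = max(w, h)
    canvas = Image.new("RGB", (side, side), (0, 0, 0))    # pad with black borders
    canvas.paste(frame, ((side - w) // 2, (side - h) // 2))
    to_tensor = transforms.Compose([
        transforms.Resize((size, size)),
        transforms.ToTensor(),                             # scales pixels to [0, 1]
        transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),
    ])
    return to_tensor(canvas)                               # shape: (3, 224, 224)
```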
Tweet-to-tag classifier   Using the finalized dataset mentioned in §3, we use the tweet as input and the k-hot tag representation of that tweet instance as the ground-truth label to train the multi-label classifier, along with the tweet encoder, over 241 classes. Additionally, we filter out tweets from the finalized dataset that do not have corresponding Twitter tags before training. The model with the best validation performance is selected for the subsequent evaluation and field experiments. The tweet encoder was trained for 100 epochs with a batch size of 32. The learning rate was set to 1e-5 with a weight decay of 1e-3 using the AdamW optimizer. The best validation macro-f1 was 0.07, achieved at the 70th epoch.
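A minimal sketch of the multi-label training step is shown below; the 768-dimensional input and the single linear layer are stand-ins for the actual tweet encoder and are purely illustrative.

```python
import torch
from torch import nn

NUM_TAGS = 241

# Stand-in for the actual tweet encoder: any module producing one logit per tag class.
tweet_encoder = nn.Linear(768, NUM_TAGS)

criterion = nn.BCEWithLogitsLoss()   # multi-label objective over k-hot tag targets
optimizer = torch.optim.AdamW(tweet_encoder.parameters(), lr=1e-5, weight_decay=1e-3)

def training_step(tweet_features: torch.Tensor, khot_tags: torch.Tensor) -> float:
    """One optimization step; khot_tags has shape (batch, NUM_TAGS) with 0/1 entries."""
    logits = tweet_encoder(tweet_features)        # (batch, NUM_TAGS)
    loss = criterion(logits, khot_tags.float())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```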
A.2   CLIP variant

The evaluation performance for model selection is measured by nDCG. For every tweet-gif pair in the validation set, we take the model's top 30 predicted GIFs using the tweet as input. The relevance of the ground-truth gif, when it occurs in the top 30 predictions for a tweet, is set to 1 for the nDCG calculation.
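Because each tweet has a single relevant gif with binary relevance, the nDCG computation reduces to a position-discounted hit score; the sketch below is our own illustration of the metric, not the evaluation script itself.

```python
import math

def ndcg_at_k(ranked_gif_ids, gold_gif_id, k=30):
    """nDCG with one relevant gif: relevance 1 if the ground-truth gif appears
    in the top-k ranking, discounted by its position; 0 otherwise."""
    top_k = list(ranked_gif_ids)[:k]
    if gold_gif_id not in top_k:
        return 0.0
    rank = top_k.index(gold_gif_id) + 1    # 1-indexed rank of the gold gif
    dcg = 1.0 / math.log2(rank + 1)
    idcg = 1.0                             # the ideal ranking places the gold gif first
    return dcg / idcg
```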
The CLIP variant is trained on the same finalized dataset using a contrastive loss. It was trained for 16 epochs with a batch size of 16 using the AdamW optimizer with a learning rate of 1e-5 and a weight decay of 1e-3. The best validation performance is achieved at epoch 6 with an nDCG value of 0.015.

We replace the transformer encoder layer with a linear layer in the EfficientNet GIF encoder from Figure 3, and use this as the GIF encoder for the CLIP variant. Image inputs to the GIF encoder are normalized following the official CLIP implementation.
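The appendix does not spell out the exact contrastive objective; a standard CLIP-style symmetric loss over a batch of paired tweet and gif embeddings is one plausible formulation, sketched below (the temperature value is our assumption).

```python
import torch
import torch.nn.functional as F

def contrastive_loss(tweet_emb: torch.Tensor, gif_emb: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE: matched tweet-gif pairs lie on the diagonal of the
    batch similarity matrix and are contrasted against all other pairings."""
    tweet_emb = F.normalize(tweet_emb, dim=-1)
    gif_emb = F.normalize(gif_emb, dim=-1)
    logits = tweet_emb @ gif_emb.t() / temperature            # (batch, batch)
    targets = torch.arange(logits.size(0), device=logits.device)
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2
```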
A.3   PEPE

The PEPE model follows most configurations of the CLIP variant, but replaces the EfficientNet GIF encoder with an Oscar GIF encoder based on the Oscar pre-trained multi-modal transformer (Li et al., 2020).

Extra metadata are extracted from the GIFs in the finalized dataset for further training. Captions within the GIF are extracted using PaddleOCR (Du et al., 2020), and only extracted text with a probability greater than 0.9 is kept as caption metadata.
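A rough sketch of this caption-extraction filter is shown below; it assumes the paddleocr package, whose per-line output format of (box, (text, confidence)) varies slightly across versions, so the parsing here is only illustrative.

```python
from paddleocr import PaddleOCR   # pip install paddleocr

ocr = PaddleOCR(lang="en")        # English text detection and recognition

def extract_caption(frame_path: str, min_prob: float = 0.9) -> str:
    """Keep only recognized text whose confidence exceeds min_prob."""
    result = ocr.ocr(frame_path)
    # Some paddleocr versions nest the per-line results one level deeper per image.
    lines = result[0] if result and isinstance(result[0], list) else result
    kept = []
    for line in lines or []:
        if not line:
            continue
        box, (text, prob) = line
        if prob > min_prob:
            kept.append(text)
    return " ".join(kept)
```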

Object tags and their corresponding features are extracted with bottom-up attention (Anderson et al., 2018) using the py-bottom-up-attention package. Object instances are filtered to keep only those with a score higher than 0.5; object tags and their corresponding features are then extracted from these instances. Final object features of dimension 2054 are obtained by concatenating the 2048-dimensional feature output from Faster R-CNN with the scaled box position coordinates of the object, following Li et al. (2020).

The PEPE model is trained on the finalized dataset with the extracted caption and object metadata. It was trained for 16 epochs with a batch size of 8 using the AdamW optimizer with a learning rate of 1e-6 and a weight decay of 1e-3. Preprocessing for GIFs is the same as for the tag-based model. The maximum sequence length is set to 256 tokens for the Oscar transformer. The best evaluation performance is achieved at epoch 12 with an nDCG score of 0.007.
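As an illustration of how the 2054-dimensional object features described above can be assembled (2048 region features plus 6 scaled box coordinates), the sketch below assumes an Oscar-style position encoding of normalized corners, width, and height; the exact layout is our assumption.

```python
import numpy as np

def build_object_feature(region_feature: np.ndarray, box: np.ndarray,
                         img_w: float, img_h: float) -> np.ndarray:
    """Concatenate a 2048-d Faster R-CNN region feature with 6 scaled box coordinates."""
    x1, y1, x2, y2 = box
    scaled_box = np.array([
        x1 / img_w, y1 / img_h, x2 / img_w, y2 / img_h,    # normalized corner coordinates
        (x2 - x1) / img_w, (y2 - y1) / img_h,              # normalized width and height
    ], dtype=np.float32)
    return np.concatenate([region_feature.astype(np.float32), scaled_box])  # shape: (2054,)
```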
Figure 7: A diagram of the pipeline used to collect, canonicalize, and filter gif-reply data from Twitter: 42,096,566 tweet-gif pairs (39,401,680 unique GIF files) are matched against 2,095,993 GIPHY GIF IDs by average hash and filtered by GIPHY tag selection, yielding a finalized dataset of 1,562,701 pairs (605,063 pairs associated with selected tags) covering 115,586 unique GIF IDs.
Figure 8: Matching animated GIFs from Twitter with GIPHY gifs using the image average hash: a Twitter GIF and a GIPHY GIF are matched when the Hamming distance between their hashes is small (e.g., 9) and not matched when it is large (e.g., 51).
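A simplified, single-frame sketch of this matching step using the imagehash package is shown below; hashing only the first decoded frame and the exact distance threshold are assumptions on our part, not the pipeline's released implementation.

```python
import imagehash                  # pip install ImageHash
from PIL import Image

HAMMING_THRESHOLD = 10            # illustrative cutoff: distance 9 matched, 51 did not

def frame_hash(frame: Image.Image) -> imagehash.ImageHash:
    """Perceptual average hash of a single decoded frame."""
    return imagehash.average_hash(frame.convert("L"))

def is_same_gif(twitter_frame: Image.Image, giphy_frame: Image.Image) -> bool:
    """Treat two gifs as the same when their frame hashes are within the threshold."""
    distance = frame_hash(twitter_frame) - frame_hash(giphy_frame)   # Hamming distance
    return distance <= HAMMING_THRESHOLD
```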

B    GIF categories on GIPHY

 Category      Subcategory
 Reactions     what
 Reactions     hair flip
 Reactions     bored
 Reactions     frown
 Reactions     slow clap
 Reactions     mic drop
 Reactions     goodbye
 Reactions     meh
 Reactions     scared
 Reactions     do not want
 Reactions     confused
 Reactions     drunk
 Reactions     wow
 Reactions     mad
 Reactions     awesome
 Reactions     please
 Reactions     thumbs down
 Reactions     frustrated
 Reactions     oh snap
 Reactions     disgusted
 Reactions     rejected
 Reactions     embarrassed
 Reactions     hug
 Reactions     yolo
 Reactions     interested
 Reactions     thank you
 Reactions     sarcastic
 Reactions     shocked
 Reactions     cool story bro
 Reactions     middle finger
 Reactions     you got this
 Reactions     whatever
 Reactions     omg
 Reactions     deal with it
 Reactions     sigh
 Reactions     oops
 Reactions     angry
 Reactions     finger guns
 Reactions     good luck