Understanding Chat Messages for Sticker Recommendation in Hike Messenger - arXiv

Understanding Chat Messages for Sticker Recommendation in Hike Messenger

Abhiskek Laddha, Mohamed Hanoosh, Debdoot Mukherjee, Parth Patwa
Hike Messenger
{abhiskekl, moh.hanoosh, debdoot, parthp}@hike.in

arXiv:1902.02704v3 [cs.CL] 19 Jul 2019

ABSTRACT
Stickers are popularly used in messaging apps such as Hike to visually express a nuanced range of thoughts and utterances to convey exaggerated emotions. However, discovering the right sticker from a large and ever expanding pool of stickers while chatting can be cumbersome. In this paper, we describe a system for recommending stickers in real time as the user is typing based on the context of the conversation. We decompose the sticker recommendation problem into two steps. First, we predict the message that the user is likely to send in the chat. Second, we substitute the predicted message with an appropriate sticker. The majority of Hike’s messages are in the form of text which is transliterated from users’ native language to the Roman script. This leads to numerous orthographic variations of the same message and complicates message prediction. To address this issue, we learn dense representations of chat messages and use them to cluster the messages that have the same meaning. In the subsequent steps we predict the message cluster instead of the message. Our model employs a character level convolution network to capture the similar intents in orthographic variants of chats. We validate our approach using manually labelled data on two tasks. We also propose a novel hybrid message prediction model, which can run with low latency on low end phones that have severe computational limitations.

Figure 1: Sticker Recommendation UI on Hike and a high level flow of our two step sticker recommendation system.

1 INTRODUCTION
In messaging apps such as Facebook Messenger, WhatsApp, Line and Hike, new modalities are extensively used to visually express thoughts and emotions (e.g., emojis, GIFs and stickers). Emojis (e.g., ☺, ☹) are used in conjunction with text to convey emotions in a message [7]. Unlike emojis, stickers provide a graphic alternative for text messages. Hike stickers are composed of an artwork (e.g., cartoonized characters and objects) and a stylized text (see Figure 1). They help to convey rich expressions along with the message. Hundreds of thousands of stickers are available for free download or purchase on sticker stores of popular messaging apps. Once a user downloads a sticker pack, it gets added to a palette, which can be accessed from the chat input box. However, discovering the right sticker while chatting can be cumbersome because it’s not easy to think of the best sticker that can substitute your utterance. Even if you do recall a sticker, browsing the palette to find it can slow you down. Apps like Hike and Line offer type-ahead sticker suggestions while you type a message (as shown in Figure 1) in order to alleviate this problem.

The goal of type-ahead sticker recommendation is to help users discover a suitable sticker which can be shared in lieu of the message that they want to send. The latency of generating such sticker recommendations should be in the tens of milliseconds in order to avoid any perceivable delay during typing. This is possible only if the system runs end-to-end on the mobile device without any network calls. Furthermore, a large fraction of Hike users use low end mobile phones, so we need a solution which is efficient both in terms of CPU load and memory requirements.

Prior to this work, sticker recommendation (SR) on the Hike app was based on string matching of the typed text to the tags that were manually assigned to each sticker. Simple string matching algorithms fail to capture the importance of phrases while scoring candidates; e.g., for the input text “so”, the algorithm would rank the phrase “so” higher than “sorry”, while the latter could have been a more useful recommendation to the user. The recall of a string matching based approach is limited by the exhaustiveness of the tagging of the stickers. Tags of a sticker are supposed to contain all the phrases that the sticker can possibly substitute, but there are many ways of expressing the same message in a chat. For instance, people often skip vowels in words as they type; e.g., “where are you” → “whr r u”, “ok” → “k”. This is further exacerbated when people transliterate messages from their native language to Roman script; e.g., the phrase “acchha” (Hindi for “good”) is written in many variants: “accha”, “acha”, “achha”, etc. Another reason for the proliferation of such variants is that certain words are pronounced differently in different regions. We observe 343 orthographic variants of “kya kar raha hai” (Hindi for “What are you doing”) in our dataset. Hence, it is hard for any person to capture all variants of an utterance as tags.

Decomposing Sticker Recommendation: One can potentially think of setting up sticker recommendations with the help of a supervised model, which directly learns the most relevant stickers for a given context defined by the previous message in the chat and the text typed in the chat input box. However, due to updates in the set of
available stickers and a massive skew in historical usage toward a handful of popular stickers, it becomes difficult to collect reliable, unbiased data to train such an end-to-end model. Moreover, an end-to-end model will need to be retrained frequently to support new stickers and the updated model would have to be synced to all devices. Such regular updates of the model, which is possibly >10 MB in size, will be prohibitively expensive in terms of data costs. Thus, instead of creating an end-to-end model, we decompose the SR task into two steps. First, we predict the message that a user is likely to send based on the chat context and what the user may have typed in the message input box. This model does not require frequent updates because the distribution of chat messages does not change much with time. Second, we recommend stickers by mapping the predicted message to stickers that can substitute it. We make use of manual tags added to stickers to bootstrap the mapping of message to sticker. This step uses a lightweight message-to-sticker mapping, which is less than 1 MB in size. We update the mapping frequently based on relevance feedback observed on recommendations and incorporate new stickers as they are launched.

For message prediction, we train a classification model with the help of historical chat data. Due to the aforementioned problem of numerous variations in messages, we cluster different messages that have the same meaning and use them as classes in the prediction model. This helps us drastically reduce the number of classes in the classifier while retaining the most frequent message intents of our corpus. It lowers the classifier’s complexity as well as its size.

In this paper, we focus on how to efficiently set up the message prediction task. In doing so, we specifically deal with two sets of challenges:

(1) Orthographic variations of chat messages: As described above, a large fraction of chat messages are simply variants of each other. We investigate different architectures to learn embeddings for chat messages such that the messages with the same intent take up similar representations. We empirically show that the use of charCNN [13] with transformer [23] for encoding chat messages can be highly effective to capture semantics of chat words and phrases. Next, we perform a fine grained clustering on the representations of the frequent chat messages with the help of the HDBSCAN algorithm. These clusters are used as classes for our message prediction model. Though we construct the message clusters to reduce complexity of the message prediction task, we observe that these message clusters are generic enough to be used in other applications such as conversational response generation, intent classification, etc. [22, 24].
(2) Computational limitations on low end smart-phones: Most of our users use Hike on low end mobile devices with severe limitations on memory and compute power. Running inference with a neural network model for message prediction proves to be challenging on such devices [8]. The size of a neural network model trained for message prediction exceeds the memory limitation even after quantization. We present a novel hybrid model, which can run efficiently on low end devices without significantly compromising accuracy. Our system is a combination of a neural network based model that processes chat context and runs on the server, and a Trie [20] based model that processes typed text input on the client. The first component is not limited by memory and CPU. Trie search is an efficient mechanism to retrieve message groups based on typed text and can be executed for each character typed. Hence, the system is able to easily meet the latency constraints with this setup. Scores from these two components are combined to produce final scores for message prediction.

In summary, our contributions include:
• We present a novel application of type-ahead Sticker Recommendation within a messaging application. We decompose the recommendation task into two steps: message prediction and sticker substitution.
• We describe an approach to cluster chat messages that have the same meaning. We evaluate different methods to create dense vector representations for chat messages, which preserve message semantics.
• We propose a novel hybrid message prediction model, which can run with low latency on low end phones that have severe computational limitations. We show that the hybrid model has performance comparable to a neural network based message prediction model.

The rest of the paper is organized as follows. In Section 2, we explore methods to learn effective representations for chat messages and describe creation of message clusters, which can be used as classes in the message prediction task. In Section 3, we discuss different message prediction models, including our Hybrid model. In Section 4, we briefly mention our approach for mapping stickers to message clusters. In Section 5, we demonstrate the efficacy of the proposed message embeddings in identifying synonymous chat messages and also evaluate the performance of a quantized neural network and the hybrid model on our message prediction task. Finally, we describe how we have deployed the system on Hike and discuss related literature before concluding the paper.

2 CHAT MESSAGE CLUSTERING
As mentioned earlier, we cluster frequent messages in our chat and use them as classes in the message prediction model. For covering a large fraction of messages in our chat corpus with a limited number of clusters, we need to group all messages with the same intent into a single cluster. This has to be done without compromising the semantics of each cluster. In order to efficiently cluster messages, it is critical to obtain dense vector representations of messages that effectively capture their meaning.

In this section, we describe an encoder which is used to convert chat phrases into dense vectors [19]. Then, we explain the network architecture which is used to train the message embeddings such that we learn similar representations for messages that have the same meaning.

2.1 Encoder
The architecture of the encoder is shown in Figure 2. Input to the encoder is a message $m_i$, consisting of a sequence of $N$ words $\{w_{i,1}, w_{i,2}, \ldots, w_{i,N}\}$, and the output is a dense vector $e_i \in \mathbb{R}^{d \times 1}$.

To represent a word, we use an embedding composed of two parts: a character based embedding aggregated from character representations, and a word level embedding. To generate the character based embedding, we use a character CNN [13] that leverages sub-word
information to learn similar representations for orthographic variants of the same word. Let $d_c$ be the dimension of character embedding. For a word $w_t$ we have a sequence of $l$ characters $[c_{t,1}, c_{t,2}, \ldots, c_{t,l}]$. Then, the character representation of $w_t$ can be obtained by stacking the character level embeddings in a matrix $C_t \in \mathbb{R}^{l \times d_c}$ and applying a narrow convolution with a filter $H \in \mathbb{R}^{k \times d_c}$ of width $k$ with ReLU non-linearity, and a max pooling layer. A filter of width $k$ can be assumed to capture $k$-gram features of a word. We have multiple filters for a particular width $k$, that are concatenated to obtain a character level embedding $e_w^{char}$ for word $w_t$. The character level representation of the word is concatenated with the word-level embedding $e_w^{word}$ of $w_t$ to get the final word representation $e_w$. Let $d_w$ be the dimension of word embedding. So, the representation of a word $w_t$ becomes $e_w \in \mathbb{R}^{(d_w + d_c) \times 1}$.

To capture the sequential properties of a message, we explore two architectures.

Figure 2: Illustration of the encoder, which comprises a character based CNN layer for each word and a GRU/Transformer layer at word level.

2.1.1 GRU. Gated Recurrent Unit (GRU) [4] is an improved version of RNN to solve the vanishing gradient problem. At each step $t$ it takes the output of the previous step $(t - 1)$ and the current word representation to generate the representation of a message till time step $t$. The final step representation in the GRU is taken to be the message embedding $e_i$.

2.1.2 Transformer. [23] propose an architecture solely based on attention to curb the recurrence step in RNN. It uses multi-headed attention to capture various relationships among the words. It adds a positional embedding with word embedding to encode the sequence information present in the message. Similar to [26], we use only the encoder part of the transformer to create our message embeddings. Unlike GRU, which gives the final representation at step $t$, the transformer gives a representation of each word after attending to the different words in the sentence using a series of attention layers. We compute the message embedding $e_i$ by averaging the word vectors in the final layer of the transformer.

2.2 Model Architecture
Akin to [26], we use the model architecture shown in Fig. 3 to train the encoders described in Section 2.1. We use a dot-product scoring model [9, 26] to train the sentence level representation from input-response pairs in an unsupervised manner. It is a discriminative approach using which the network is optimized to score the gold reply message higher in comparison to other reply messages. The input to our model is a tuple, $(m_i, m_{i+1})$, extracted from a conversation between two users. Messages $m_i$ and $m_{i+1}$ are encoded using the encoder described in Section 2.1 and represented as $e_i$ and $e_{i+1}$ respectively. Note that the parameters of both the encoders are tied, i.e., they share weights. Hence, both $e_i$ and $e_{i+1}$ represent the encoding of a message in the same space. We transform the input space embedding $e_{i+1}$ to reply space by applying two fully connected layers to obtain the final response embedding $e'_{i+1}$. Finally, the dot product of the input message embedding $e_i$ and the response embedding $e'_{i+1}$ is used to score the replies.

Figure 3: Model architecture for learning message embedding. Input-Response Chat (IRC) maximizes the similarity of the transformed next message embedding ($e'_{i+1}$) with the current message embedding ($e_i$). Both encoders share the parameters.

We train the model using batches of randomly shuffled tuples. Within a batch, each $m_{i+1}$ serves as the correct response to its corresponding input $m_i$ and all other instances are assumed to be negative replies. It is computationally efficient to consider all other instances as negative during training because we don’t have to explicitly encode the specific negative example for each instance.
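
The in-batch negative scheme of Section 2.2 can be illustrated with a small sketch. It is not the authors' training code: the text only states that the gold reply should score higher than the other replies, so the softmax cross-entropy objective used here is an assumption, as are the function names and the toy vectors. The vectors stand in for $e_i$ and for the already transformed $e'_{i+1}$; the encoder and the two fully connected layers are omitted.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def in_batch_softmax_loss(inputs, responses):
    """Mean cross-entropy over a batch with in-batch negatives.

    inputs[i] plays the role of e_i and responses[i] the role of e'_{i+1}.
    For row i, responses[i] is the gold reply and every other response in
    the batch is treated as a negative. The softmax objective is an
    assumption made for this sketch.
    """
    n = len(inputs)
    total = 0.0
    for i in range(n):
        # Score input i against every response in the batch.
        scores = [dot(inputs[i], responses[j]) for j in range(n)]
        top = max(scores)  # subtract the max for numerical stability
        log_z = top + math.log(sum(math.exp(s - top) for s in scores))
        total += log_z - scores[i]  # negative log-probability of the gold reply
    return total / n


# Hypothetical toy batch: each reply points in the same direction as its input.
toy_inputs = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
toy_replies = [[2.0, 0.0], [0.0, 2.0], [-2.0, 0.0]]
# Rotating the replies pairs every input with a wrong reply.
rotated_replies = toy_replies[1:] + toy_replies[:1]

loss_aligned = in_batch_softmax_loss(toy_inputs, toy_replies)
loss_shuffled = in_batch_softmax_loss(toy_inputs, rotated_replies)
print(loss_aligned, loss_shuffled)
```

The loss is lower when each input is paired with its own reply than when the replies are rotated, which is the behaviour the discriminative objective rewards. No extra encoding pass is needed for negatives because the other rows of the batch are reused.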
2.3 Message Clustering
We cluster frequent messages with the help of the embeddings learned as above. Our goal is to have a single cluster for all the orthographical variations and acronyms of a message. We observe that the number of different variants of a phrase increases with the ubiquity of that phrase. For example, “good morning” has ∼300 variants while relatively less frequent phrases tend to have significantly fewer variants. This poses a challenge when applying a standard density based clustering algorithm such as DBSCAN because it is difficult to decide a single threshold for drawing cluster boundaries. To handle this uncertainty, we choose HDBSCAN [18] to cluster chat messages. This algorithm builds a hierarchy of clusters and handles the variable density of clusters in a time efficient manner. After building the hierarchy, it condenses the tree, extracts the useful clusters and assigns noise to the points that do not fit into any cluster. Further, it doesn’t require parameters such as the number of clusters or the distance between pairs of points to be considered as a neighbour. The only necessary parameter is the minimum number of points required for a cluster. All the clusters including the noise points are taken to be the different classes in our message prediction step.

3 MESSAGE CLUSTER PREDICTION
In this section, we describe our approach of predicting the message (its cluster to be precise) that a user is going to send. In chat, the next message to be sent heavily depends on the conversational context. In this paper, we use the previous message received from the other party in chat as the only context signal. Upon receiving a message, we predict a message that is a likely response. In case the user starts typing, we update our message prediction after every key press taking into account the text entered so far. The response time of our message prediction model needs to be of the order of tens of milliseconds so that the user’s typing experience is not adversely affected and the sticker recommendations change in real time as the user types in the message input box.

We pose message prediction as a classification problem where we train a model to score each message cluster based on the likelihood of the event that the user sends the next message from that cluster. A neural network (NN) based classification model that accepts the last received message and the typed text as inputs can be a possible solution, but running inference on such a model on mobile devices proves to be a challenge [8]. The size of the NN model that needs to be shipped explodes since we have a large input vocabulary and a large number of output classes. The embedding layer that transforms the one-hot input to a dense vector usually has a size of the order of tens of megabytes, which exceeds the memory limits that we set for our application. To overcome this issue, we evaluate two orthogonal approaches. One is to apply a quantization scheme to reduce the model size and the other is to build a hybrid model, composed of a neural network component and a trie-based search. In this section, we detail these two approaches.

3.1 Quantized Message Prediction Model
We train a neural network for the message prediction task where the input is the last received message and the typed text, and the output is the message cluster from which the user sends the message. The G most frequent message clusters that were prepared in Section 2.3 are chosen as classes for this model. We encode the last received message and typed text into a dense vector, which is fed as input features to a classification layer that is a fully connected layer having sigmoid non-linearity and G output neurons. Details of the training procedure are described in Section 5.4.

An active area of research is to reduce the model size and the inference times of neural network models with minimum accuracy loss so that they can run efficiently on mobile devices [8, 10, 11, 21]. One approach is to quantize the floating point representations of the weights and the activations of the neural network from 32 bits to a lower number of bits. We use a quantization scheme detailed in [11] that converts both weights and activations to 8-bit integers and we use a few 32-bit integers to represent biases. This reduces the model size by a factor of 4. When we quantize the weights of the network after training the model with full precision floats, the inference times reduce significantly. We employ quantization-aware training¹ to reduce the effect of quantization on model accuracy. Under this scheme, we use quantized weights and activations to compute the loss function of the network during training as well. This ensures parity during training and inference time. While performing back propagation, we use full precision float numbers because minor adjustments to the parameters are necessary to effectively train the model. After weight pruning, we obtain a quantized model of size ∼9 MB.

¹ https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/quantize/README.md

Figure 4: Hybrid architecture for message cluster prediction. 1) Predicts the next message cluster using a neural network (NN) based classification model on server and sends the result to mobile device. 2) For each character typed, client predicts the message cluster based on the typed text using trie. 3) Combiner aggregates scores from both models to obtain a final score for each message cluster.

3.2 Hybrid Message Prediction Model
We build a hybrid message prediction model, where a resource intensive component is run on the server and its output is combined with that of a lightweight on-device model to obtain message predictions under tight resource constraints on memory and CPU. Algorithm 1 outlines the score computation of the hybrid model. Figure 4 demonstrates the high level flow with an example. The three components in our hybrid model are detailed below.

3.2.1 Reply Model: Similar to the prediction model described in Section 3.1, we build an NN model which takes the last received
Algorithm 1 Message Cluster Prediction                                                 To score the retrieved message clusters, we add the frequency of
Input: Last message received prev, Typed text typ                                  the phrases in our chat corpus as additional information in the trie.
                                                                                   So, the records of the form < phrase , ( message cluster id, frequency,
 1:   For each message cluster $g \in G$, compute $P_{reply}(G = g \mid prev)$
                                                                                   min prefix length ) > are stored as key-value pairs in the trie. The
      with the reply NN model
                                                                                   score of a message cluster retrieved is the ratio of the aggregated
 2:   Send the message clusters with scores higher than a threshold,
                                                                                   frequency of the retrieved records from that message cluster to the
      $P_{reply} > t_{reply}$, to the client.
                                                                                   sum of frequencies of all entries that were matching, irrespective of
 3:   Using text typ as input to Trie model, fetch set of phrases S that
                                                                                   whether these records are retrieved or not after the minimum prefix
      start with typ and the set of phrases $S_m \subseteq S$ that start with typ
                                                                                   filter is applied. This is shown in Equation 1. For example, suppose
      and also meet the minimum prefix length constraint.
                                                                                   the typed text is string “go”, and the sum of frequencies of all phrases
 4:   Compute trie scores for all message clusters mapped to phrases
                                                                                   starting with “go” is 10000, and there were three different phrases in
      in Sm as
                                         Í                                         the message cluster for “good morning”, which were starting with
                                                freq(ph)                           “go” (e.g., “good morning”,“good mrng”, “good mong”), having a
                                        ph ∈Sд ∩Sm
                Pt r ie (G = д|typ) =        Í                           (1)       combined frequency of 1500, then the score of the “good morning”
                                                    freq(ph)                       cluster is calculated to be 0.15.
                                            ph ∈S
                                                                                       For building the trie, we pick the top frequent phrases and the
      where Sд is the set of all phrases present in message cluster д              frequency of a phrase ph in our chat corpus after some normalization
 5:   For message clusters obtained from reply model or trie model,                is set as f req(ph). The length of the smallest prefix pre f of phrase
      combine the scores from both models as:                                      ph that gives P(ph|pre f ) ≥ θ is kept as the minimum prefix length
        Q(G = д|prev, typ) = w 0 + w t ∗ Pt r ie (G = д|typ)+                      of the phrase ph.
                            Pr eply (G = д|prev)∗                        (2)           At Hike, we created a trie with 34k phrases and around 7500
                                     −λnc                                          message clusters. In the serialized form, the size of the trie was
                            (wp0 e          + wp1 Pt r ie (G = д|typ))
                                                                                   around 700KB. This is small enough for us to ship to client devices.
Output: Message cluster scores Q                                                   Another advantage of a trie is that it is interpretable and makes it easy
                                                                                   for us to incorporate any new phrase, even if they are not observed
                                                                                   in our historical data.
                                                                                   3.2.3 Combiner: After receiving relevant message clusters and
message as input and scores each message cluster by their likelihood
                                                                                   corresponding scores from reply model and trie model, a final
of being a reply to the input message. Given the last message prev,
                                                                                   score for a message cluster Q(G = д|prev, typed) is computed from
the model computes the reply probability Pr eply (G = д|prev) for
                                                                                   Pr eply (G = д|prev) , Pt r ie (G = д|typed) and length of string typed
each message cluster д in the set of message clusters G.
                                                                                   nc as mentioned in Equation 2. The weighting of terms is designed in
   When a message gets routed to its recipient, the reply model
                                                                                   such a way that as the user types more characters, contribution from
is queried on the server and the response predictions are sent to
                                                                                   the reply model vanishes unless the message cluster is not predicted
the client along with the message. Since this inference runs on the
                                                                                   from the trie. The weights used were manually chosen after some
server, we can afford to use large sized NN models without adversely
                                                                                   trial and error. A threshold tcombined is applied on the final score to
affecting user experience. Furthermore, we pre-compute and cache
                                                                                   filter out recommendations with low scores.
the replies for frequent messages to improve the average response
time for reply prediction. The message clusters that have a reply
                                                                                   4    MESSAGE CLUSTER TO STICKERS
probability Pr eply (G = д|prev) above a threshold tr eply are sent to
the clients.                                                                            MAPPING
                                                                                   As mentioned in Section 1, we make use of a message cluster to
3.2.2 Typed Model: Trie [20] is an efficient data structure for                    sticker mapping to suggest suitable stickers from the predicted mes-
prefix searches. We retrieve the relevant message clusters for a given             sage clusters. When a sticker is created, it is tagged with conversa-
typed text by querying a trie that stores  tuples as key-value pairs. These  pairs                meta-data in order to map the message clusters to stickers. We com-
are the chat messages and their corresponding message clusters that                pute similarity between the tag phrases of a sticker and the phrases
we prepare from our corpus, as described in Section 2 . In order                   present in each message cluster, after converting them into vectors
to reduce false positives, we assign a minimum prefix length for                   using the encoder mentioned in Section 2. Compared to the histor-
each phrase in the trie, so that a matching record is retrieved only if            ical approach of suggesting a sticker when the user’s typed input
the typed text meets this minimum length or its message cluster is                 matches one of its tag phrases, the current system is able to suggest
present in the recommendations available from the reply model. For                 stickers even if different variations of the tag phrase are typed by
example, suppose the phrase “hi” has a min prefix length set to 2,                 user, as we have many variations of a message already captured
then a typed text “h” will not retrieve the record corresponding to                in the message clusters. We regularly refresh the message cluster
“hi”, though it is matching the typed text. But, in case the message               to sticker mapping by taking into account the relevance feedback
cluster of “hi” is present in the predictions of the reply model, then             observed on our recommendations. Since generation of message to
“hi” is retrieved even if length of the prefix is below the required               sticker mapping is not the main focus of paper, we skip the details
minimum length.                                                                    of these algorithms.
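As an illustration of Algorithm 1 (Section 3.2), the trie score of Equation 1 and the combiner of Equation 2 can be sketched roughly as below. This is a minimal sketch under stated assumptions, not the deployed implementation: the phrase records, their frequencies and minimum prefix lengths, the weights `w0`, `wt`, `wp0`, `wp1`, `lam` and the threshold `t_combined` are all made-up values, the function names are hypothetical, and a plain dictionary scan stands in for a real trie.

```python
import math

# Records mirror the form <phrase, (cluster id, frequency, min prefix length)>.
# All phrases and numbers here are made up for illustration.
RECORDS = {
    "good morning": ("good_morning", 1000, 2),
    "good mrng": ("good_morning", 300, 2),
    "good mong": ("good_morning", 200, 2),
    "good night": ("good_night", 2500, 2),
    "going home": ("going_home", 6000, 3),
}


def trie_scores(typ, reply_clusters=()):
    """Equation 1: share of matching frequency mass retrieved per cluster."""
    # S: every record whose phrase starts with the typed text.
    matching = [rec for ph, rec in RECORDS.items() if ph.startswith(typ)]
    total = sum(freq for _, freq, _ in matching)
    retrieved = {}
    for cluster, freq, min_len in matching:
        # S_m: keep a record if the typed text is long enough, or if the
        # reply model already predicted its cluster (Section 3.2.2).
        if len(typ) >= min_len or cluster in reply_clusters:
            retrieved[cluster] = retrieved.get(cluster, 0) + freq
    return {g: f / total for g, f in retrieved.items()}


def combine(p_reply, p_trie, n_c, w0=0.0, wt=1.0, wp0=1.0, wp1=1.0, lam=1.0):
    """Equation 2 for one cluster; the default weights are assumptions."""
    return w0 + wt * p_trie + p_reply * (wp0 * math.exp(-lam * n_c) + wp1 * p_trie)


def predict(p_reply_by_cluster, typ, t_combined=0.1):
    """Steps 3 to 5 of Algorithm 1, given reply scores from the server."""
    p_trie_by_cluster = trie_scores(typ, reply_clusters=p_reply_by_cluster.keys())
    clusters = set(p_reply_by_cluster) | set(p_trie_by_cluster)
    q = {
        g: combine(p_reply_by_cluster.get(g, 0.0),
                   p_trie_by_cluster.get(g, 0.0),
                   len(typ))
        for g in clusters
    }
    # Rank by combined score and drop low-scoring clusters.
    ranked = sorted(q.items(), key=lambda kv: -kv[1])
    return {g: s for g, s in ranked if s >= t_combined}
```

With these toy records, `trie_scores("go")` gives 0.15 for the `good_morning` cluster, matching the worked example in Section 3.2.2, while `going_home` is held back by its minimum prefix length unless the reply model has already predicted it. In `combine`, the term `wp0 * exp(-lam * n_c)` shrinks as more characters are typed, so a cluster suggested only by the reply model fades out unless the trie also supports it.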
                     Spell            Syn              Overall
Models               P@10    R@10     P@10    R@10     P@10    R@10
Trans+CharCNN        73.3    67.1     89.2    51.3     77.6    62.9
Trans                70.8    67.6     86.7    47.4     75.0    62.2
GRU+CharCNN          71.6    66.7     86.6    48.7     75.6    61.9
GRU                  70.7    67.6     86.2    44.9     74.8    61.5

Table 1: Evaluation of different architectures of the message embedding model on the chat slang task

Figure 5: Performance of the message embedding models on the phrase similarity task

5 EXPERIMENTS
In this section, we first describe the dataset used for training the Sticker Recommendation (SR) system. Next, we quantitatively evaluate different message embeddings on two manually curated datasets. We show qualitative results after clustering to show the effectiveness of the message embeddings. Then, we present a comparison of the NN model and the hybrid model for message cluster prediction. Lastly, we discuss the deployed system.

5.1 Dataset and preprocessing
We collected conversations of anonymous users for a period of 5 months. This dataset is pre-processed to extract useful conversations from the chat corpus. The first step is tokenization, which includes accurate detection of emoticons and stickers, and reducing more than 3 consecutive repetitions of the same character to 2 characters. We also add a special symbol at the start and the end of every conversation. After pre-processing the data, we create tuples of the current and the next message for training the message embedding models. Since stickers are mostly used to convey short messages, we filter out all the tuples which have at least one message with more than 5 words. Both our input and output vocabularies consist of the top 50k most frequent words in our corpus. The dataset is randomly split into training and validation sets. Messages shorter than 5 words are padded up to the pre-defined maximum length of 5. Similarly, we pad characters in words if the character length of a word is less than 10 characters.

5.2 Message Embedding Evaluation
In this subsection, we evaluate the embeddings generated from different architectures on two manually labeled datasets. First, we describe the baselines, training and hyper-parameter tuning.

5.2.1 Message embeddings models. We compare the embeddings obtained from our architecture (Section 2.2) and its variants, which are as follows:
(1) Trans+CharCNN - Encoder as shown in Figure 2 with transformer.
(2) Trans - Uses only the word-level embedding $e_m^{word}$ as input to the transformer.
(3) GRU+CharCNN - Encoder as shown in Figure 2 with GRU.
(4) GRU - Uses only the word-level embedding $e_m^{word}$ as input to the GRU.

5.2.2 Training and Hyper-parameters: We train our GRU based models using the Adam optimizer with a learning rate of 0.0001 and step exponential decay every 10k steps. We apply gradient clipping (with a value of 5) and dropout at the GRU and fully connected layers to reduce overfitting. For the GRU+CharCNN model, we use filters of width 1, 2, 3, 4 with 50, 50, 75, 75 filters respectively for the convolution layers. The GRU models use an embedding size of 300.

Transformer models are trained using the RMSprop optimizer with a constant learning rate of 0.0001. They use 2 attention layers and 8 attention heads. Within each attention layer, the feed forward network has input and output size of 256, and has a 512 unit inner layer. To avoid overfitting, we apply attention dropout and embedding dropout of 0.1. The convolution layers of the Trans+CharCNN model have the same number of filters and filter widths as those of the GRU+CharCNN model.

5.2.3 Phrase Similarity Task. We created a dataset which consists of phrase pairs and labelled them as similar or not-similar. For example, (“ghr me hi”, “room me h”) and (“majak kr rha hu”, “mjak kr rha hu”) are labelled as similar, while (“network nhi tha”, “washroom gyi thi”) and (“it’s normal”, “its me”) are labelled as not-similar. Possible similar examples are sampled from the top 50k frequent phrases using 2 methods: a) pairs close to each other in terms of minimum edit distance; b) pairs having words with similar word embeddings. Possible negative examples are sampled randomly. These pairs were annotated by 3 annotators. Pairs on which the annotators disagreed were not considered during evaluation. Finally, the dataset has 3341 similar pairs and 2437 non-similar pairs.

To evaluate the models, we calculate the angular similarity [3] between the embeddings of the phrases in a pair. The ROC curves of the models are shown in Figure 5. We observe that the transformer shows a significant improvement over the GRU. This is due to better capturing of the semantics of phrases using self-attention. The Trans+CharCNN model performs the best and shows ≈6% absolute improvement in terms of area under the curve over the GRU baseline.

5.2.4 Chat Slang Task: We manually labelled a set of words which are common slang or variants used in conversation and mapped them to their corresponding correct form. We considered two types of errors. 1) Spelling variants (spell) - In this class we considered words which can be converted to their correct form by either replacing characters with nearby keys on the keyboard (‘hsn’ → ‘han’) or adding vowels (‘phn’ → ‘phone’, ‘yr’ → ‘yaar’). These errors
occur either due to unintentional mistakes or to shorten the word length to increase the speed of typing. 2) Synonym words (syn) - we considered words which have the same meaning and are phonetically similar but cannot be converted into their correct form by spelling correction, e.g. (‘plzz’ → ‘please’, ‘n8’ → ‘night’). Most users are familiar with these forms of word transformation and use them widely in conversation. In total, we labelled 213 words for spelling variants and 78 synonym words² to their correct word form.

² We plan to release the labelled data and baseline results

To evaluate the message embedding, we retrieve the top 10 nearest neighbours from the top frequent phrases for each query of the labelled dataset and use two evaluation metrics. 1) Precision@10 (P@10) counts the number of correct phrases in the top 10 (retrieved phrases that are semantically similar to the query). 2) Recall@10 (R@10) checks whether the labelled phrase of the query is present in the top 10 or not. Two human judges were used in the annotation and disagreements were resolved by consensus among the judges.

Table 1 shows the performance of different variants of the encoder used to learn message embeddings. Transformer and GRU have comparable recall on the spell data because it contains single words. Hence, the transformer doesn’t have much advantage over the GRU. CharCNN improves the recall on synonym word variants significantly for both architectures. CharCNN is able to capture certain chat characteristics which occur due to similar sounding sub-words. For example, ‘night’ and ‘n8’, ‘see’ and ‘c’, ‘what’ and ‘wt’ are used interchangeably during conversation. CharCNN learns such chat nuances and performs well in matching those words to semantically similar words.

5.3 Message Clustering Evaluation
Figure 6 shows a qualitative evaluation of the clusters obtained from the HDBSCAN algorithm. We select the top 100 clusters based on the number of phrases present in each cluster. We projected the phrase embeddings learned from our model onto 2 dimensions using t-SNE [17] for visualization. Figure 6 shows some randomly picked phrases from the clusters; phrases from the same cluster are represented with the same color. We observe that the various spelling variants and synonyms for a phrase are grouped in a single cluster. For example, the ‘good night’ cluster includes phrases like ‘good nighy’, ‘good nyt’, ‘good nght’, ‘gud nyr’, ‘gud nite’. Our clustering algorithm is able to capture fine-grained semantics of phrases. For instance, the phrases ‘i love u’, ‘i love you’, ‘i luv u’ belong to one cluster while ‘love u too’, ‘love u 2’, ‘luvv uhh 2’ belong to a different cluster, which helps us in showing accurate sticker recommendations for both clusters. If a user has been sent ‘i love u’, then showing stickers related to the ‘i love u too’ cluster makes for more relevant replies than showing ‘i love u’ stickers. Our message embedding is able to cluster phrases like ‘i love u baby’ and ‘i luv u shona’ in a cluster distinct from the ‘i love u’ cluster. This helps us deliver more precise sticker recommendations since we have different stickers in our database for phrases like ‘i love u’ and ‘i love u baby’.

5.4 Message Cluster Prediction Evaluation
Our aim is to show the right stickers with minimal effort from the user. So, we compare our hybrid model and quantized models mainly on the following metrics. 1) The number of characters that a user needs to type to see the correct message cluster in the top 3 positions (# of characters to be typed). 2) How many times the model has shown a wrong message before predicting the correct message cluster in the first 3 positions (# of times inaccurate predictions shown). 3) The fraction of messages that could be retrieved by a model in the first 3 positions, with some prefix of that message (Fraction of msg retrieved). Lower values are better for the first two metrics, while the fraction of messages retrieved should be as high as possible. For training the message prediction NN model, we collected pairs of current and next messages from the complete conversation data. Our training data had around 10M such pairs. We randomly sampled 38k pairs for testing. We curate the training data by treating every prefix of the next message as a typed text and the message cluster of the next message as the class label. We restrict our next message clusters to the top 7500 message clusters (clusters are picked on the basis of the summed frequency of their phrases). The selected clusters covered the 34k top frequent reply messages in our data set. For the hybrid model we train two models. The first predicts the next message based on the current message. It is trained directly from current and next message pairs, with the next message mapped to its corresponding message cluster. The second is a trie model based on the typed message. We prepare the phrase frequencies and minimum prefix lengths of phrases as mentioned in Sec. 3.2.

                   # of characters    # of times inaccurate    Fraction of
Method             to be typed        predictions shown        msg retrieved
Quantized NN
(100d)             1.58               1.47                     0.978
Hybrid             2.84               1.22                     0.991

Table 2: Performance of different message prediction models

Results of the various models are shown in Table 2. The quantized NN model performs better in terms of # of characters to be typed. This is expected because the NN model learns to use both inputs simultaneously, whereas the combiner (the last function in the hybrid model) is a simple weighted linear combination of just three features. The hybrid model performs slightly better in terms of # of times inaccurate recommendations are shown and the fraction of messages retrieved. Compared to the quantized model, the hybrid model needs slightly more than one additional character to be typed on average to get the required predictions. It has to be noted that the quantized NN is of size ∼9.2 MB, which is not practically feasible in our use case. The on-device footprint of the hybrid model is below 1 MB. Another important point is that a trie based model can guarantee retrieval of all message clusters, with some prefix of the message. If the message is not interfered with by another longer but more frequent message, the message class can be featured in the top position itself, with some prefix of the message or the full message as input. A pure NN based model cannot make such guarantees. This is reflected in the third metric shown in Table 2: more than 2% of the time, the intended message class is not retrievable by the NN. Given that this SR interface is one of the most heavily used interfaces for sticker discovery, if a sticker is not retrieved through this recommendation, the user might perceive it as the app doesn’t support the message class or the
app does not have such stickers. So making the retrieval of message clusters close to 100% is critical for our SR models.

Figure 6: HDBSCAN clusters of top frequent phrases using message embeddings produced from our GRU-CharCNN model. The points represent chat message vectors projected into 2D using t-SNE [17]

5.5 Real World Deployment
The proposed system has been deployed in production on Hike for more than 6 months. A/B tests to compare this system with a previous implementation showed an 8% improvement in the fraction of messaging users who use stickers. There are multiple regional languages spoken in India. We segregate our data based on geography and train separate models (both message embedding and message prediction models) for each state. This ensures that the distribution of frequent chat phrases doesn’t get skewed towards one majority spoken language across India. We obtain the primary language of a user from the language preferences explicitly set by them or infer it from their sticker usage.

6 RELATED WORK
To the best of our knowledge there is no prior art which studies the problem of type-ahead sticker suggestions. However, there are a few closely related research threads that we describe here.

The use of emojis is widespread in social media. [7] classify whether an emoji, which is used in a tweet, emphasizes something or adds new information. [1, 2] predict which of the top-20 emojis are likely to be used in an Instagram post based on the post’s text and image. Unlike emojis, which are mostly used in conjunction with text, stickers are independent messages that substitute text. Thus, in order to have effective sticker suggestions for a given conversational context, we need to predict the likely utterance and not just the emotion. Since the possible utterances are more numerous than emotions, our problem is more nuanced than emoji prediction.

There exists a large body of research on conversational response generation. [22, 24] design an end-to-end model leveraging a hierarchical RNN to encode the input utterances and another RNN to decode possible responses. The approach suffers from the limitation that it often generates generic uninformative responses (e.g., “i don’t know”). [28] describe a model that explicitly optimizes for improving diversity of responses. [25] propose a retrieval based approach with the help of a DNN based ranker that combines multiple evidences around queries, contexts, candidate postings and replies. Smart Reply [12] is a system that suggests short replies to e-mails, using an approach that selects high quality responses with diversity to increase the options for the user to choose from. Akin to our system, Smart Reply also generates clusters of responses. They apply semi-supervised learning to expand the set of responses starting from a small number of manually labeled responses for each semantic
intent. In contrast, we follow an unsupervised approach to discover message clusters since the set of all intents that may correspond to stickers wasn’t readily available. A unique aspect of our system is that we update the message prediction by incorporating whatever the user has typed so far; we need to do this in order to deliver type-ahead sticker suggestions.

There is a parallel research thread around learning effective representations for sentences that can capture sentence semantics [5, 6, 14–16, 26]. Skip Thought [14] is an encoder-decoder model that learns to generate the context sentences for a given sentence. It includes sequential generation of the words of the target sentences, which limits the target vocabulary and increases training time. Quick Thought [16] circumvents this problem by replacing the generative objective with a discriminative approximation, where the model attempts to classify the embedding of a correct target sentence given a set of sentence candidates. Recently, BERT [6] has proposed a model which predicts the bidirectional context to learn sentence representations using a transformer [23]. [26] proposes the Input-Response model that we have evaluated in this paper. However, unlike these works, we apply a CharCNN [13, 27] in our encoder to deal with the problem of learning similar representations for phrases that are orthographic variants of each other.

[4] Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. 2014. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555 (2014).
[5] Alexis Conneau, Douwe Kiela, Holger Schwenk, Loic Barrault, and Antoine Bordes. 2017. Supervised learning of universal sentence representations from natural language inference data. arXiv preprint arXiv:1705.02364 (2017).
[6] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv preprint arXiv:1810.04805 (2018).
[7] Giulia Donato and Patrizia Paggio. 2017. Investigating redundancy in emoji use: Study on a twitter based corpus. In Proceedings of the 8th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis. 118–126.
[8] Philipp Gysel, Mohammad Motamedi, and Soheil Ghiasi. 2016. Hardware-oriented approximation of convolutional neural networks. arXiv preprint arXiv:1604.03168 (2016).
[9] Matthew Henderson, Rami Al-Rfou, Brian Strope, Yun-hsuan Sung, Laszlo Lukacs, Ruiqi Guo, Sanjiv Kumar, Balint Miklos, and Ray Kurzweil. 2017. Efficient natural language response suggestion for smart reply. arXiv preprint arXiv:1705.00652 (2017).
[10] Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. 2017. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861 (2017).
[11] Benoit Jacob, Skirmantas Kligys, Bo Chen, Menglong Zhu, Matthew Tang, Andrew Howard, Hartwig Adam, and Dmitry Kalenichenko. 2018. Quantization and training of neural networks for efficient integer-arithmetic-only inference. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2704–2713.
[12] Anjuli Kannan, Karol Kurach, Sujith Ravi, Tobias Kaufmann, Andrew Tomkins, Balint Miklos, Greg Corrado, Laszlo Lukacs, Marina Ganea, Peter Young, et al. 2016. Smart reply: Automated response suggestion for email. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 955–964.
[13] Yoon Kim, Yacine Jernite, David Sontag, and Alexander M Rush. 2016. Character-

7 CONCLUSION
In this paper, we present a novel system for deriving contextual, type-ahead sticker recommendation within a chat application. We
                                                                                                Aware Neural Language Models.. In AAAI. 2741–2749.
decompose the recommendation task into two steps. First, we pre-                           [14] Ryan Kiros, Yukun Zhu, Ruslan R Salakhutdinov, Richard Zemel, Raquel Urtasun,
dict the next message that a user is likely to send based on the last                           Antonio Torralba, and Sanja Fidler. 2015. Skip-thought vectors. In Advances in
                                                                                                neural information processing systems. 3294–3302.
message received in the chat. As the user types the message, we                            [15] Quoc V Le and Tomas Mikolov. 2014. Distributed Representations of Sentences
continuously update our prediction by taking into account the char-                             and Documents.. In ICML, Vol. 14. 1188–1196.
acter sequence typed in the chat input box. Second, we substitute the                      [16] Lajanugen Logeswaran and Honglak Lee. 2018. An efficient framework for
                                                                                                learning sentence representations. arXiv preprint arXiv:1803.02893 (2018).
predicted message with relevant stickers. We discuss how numerous                          [17] Laurens van der Maaten and Geoffrey Hinton. 2008. Visualizing data using t-SNE.
orthographic variations for the same utterance exist in Hike’s chat                             Journal of machine learning research 9, Nov (2008), 2579–2605.
corpus, which mostly contains messages in a transliterated form. We                        [18] Leland McInnes, John Healy, and Steve Astels. 2017. hdbscan: Hierarchical
                                                                                                density based clustering. The Journal of Open Source Software 2, 11 (2017), 205.
describe a clustering solution that can identify such variants with                        [19] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013.
the help of a highly effective message embedding, which learns                                  Distributed representations of words and phrases and their compositionality. In
                                                                                                Advances in neural information processing systems. 3111–3119.
similar representations for semantically similar messages. Message                         [20] Donald R Morrison. 1968. PATRICIA—practical algorithm to retrieve information
clustering reduces the complexity of the classifier used in message                             coded in alphanumeric. Journal of the ACM (JACM) 15, 4 (1968), 514–534.
prediction, e.g., by predicting one of 7500 classes (message clusters).                    [21] Sujith Ravi. 2017. Projectionnet: Learning efficient on-device deep networks
                                                                                                using neural projections. arXiv preprint arXiv:1708.00630 (2017).
We are able to cover all intents expressed in 34k frequent messages.                       [22] Iulian Vlad Serban, Alessandro Sordoni, Yoshua Bengio, Aaron C Courville, and
Furthermore, it allows us to collect relevance feedback from related                            Joelle Pineau. 2016. Building End-To-End Dialogue Systems Using Generative
message contexts to improve our mapping of message to stickers.                                 Hierarchical Neural Network Models.. In AAAI, Vol. 16. 3776–3784.
                                                                                           [23] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones,
For message prediction on low-end mobile phones, we use a hybrid                                Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you
model that combines a neural network on the server and a memory                                 need. In Advances in neural information processing systems. 5998–6008.
efficient trie search on the device for effective sticker recommen-                        [24] Chen Xing, Wei Wu, Yu Wu, Ming Zhou, Yalou Huang, and Wei-Ying Ma. 2017.
                                                                                                Hierarchical recurrent attention network for response generation. arXiv preprint
dation. We show experimentally that the hybrid model is able to                                 arXiv:1701.07149 (2017).
predict a higher fraction of overall messages clusters compared to                         [25] Rui Yan, Yiping Song, and Hua Wu. 2016. Learning to respond with deep neural
                                                                                                networks for retrieval-based human-computer conversation system. In Proceedings
a quantized neural network. This model also helps in better sticker                             of the 39th International ACM SIGIR conference on Research and Development
discovery for rare message clusters.                                                            in Information Retrieval. ACM, 55–64.
                                                                                           [26] Yinfei Yang, Steve Yuan, Daniel Cer, Sheng-yi Kong, Noah Constant, Petr Pilar,
                                                                                                Heming Ge, Yun-Hsuan Sung, Brian Strope, and Ray Kurzweil. 2018. Learning
REFERENCES                                                                                      Semantic Textual Similarity from Conversations. arXiv preprint arXiv:1804.07754
 [1] Francesco Barbieri, Miguel Ballesteros, Francesco Ronzano, and Horacio Saggion.            (2018).
     2018. Multimodal emoji prediction. arXiv preprint arXiv:1803.02392 (2018).            [27] Xiang Zhang, Junbo Zhao, and Yann LeCun. 2015. Character-level convolutional
 [2] Francesco Barbieri, Miguel Ballesteros, and Horacio Saggion. 2017. Are Emojis              networks for text classification. In Advances in neural information processing
     Predictable? arXiv preprint arXiv:1702.07285 (2017).                                       systems. 649–657.
 [3] Daniel Cer, Yinfei Yang, Sheng-yi Kong, Nan Hua, Nicole Limtiaco, Rhomni St.          [28] Yizhe Zhang, Michel Galley, Jianfeng Gao, Zhe Gan, Xiujun Li, Chris Brockett,
     John, Noah Constant, Mario Guajardo-Cespedes, Steve Yuan, Chris Tar, Yun-                  and Bill Dolan. 2018. Generating informative and diverse conversational responses
     Hsuan Sung, Brian Strope, and Ray Kurzweil. 2018. Universal Sentence Encoder.              via adversarial information maximization. arXiv:1809.05972 (2018).
     CoRR abs/1803.11175 (2018). arXiv:1803.11175 http://arxiv.org/abs/1803.11175