Journal of Computational Information Systems 9: 6 (2013) 2391–2398
Available at http://www.Jofcis.com

        Predicting the Semantic Orientation of Emoticons

Fei WANG 1, Yifeng XU 1, Yunfang WU 1,*, Xueqiang LV 2

1 Institute of Computational Linguistics, Peking University, Beijing 100871, China
2 Beijing Key Laboratory of Internet Culture and Digital Dissemination Research, Beijing Information Science and Technology University, Beijing 100192, China

                                                   Abstract

Emoticons are widely used in blogs, instant messages and social networks. They often imply a positive, negative or neutral semantic orientation, and are therefore widely exploited in sentiment analysis. In this paper, we take a pictographic view of emoticons, and propose a symbol-pair method and a substring method to automatically predict the semantic orientation of out-of-vocabulary emoticons. The experimental results are promising, achieving an accuracy of 78%.

Keywords: Emoticons; Sentiment Orientation; Symbol-pairs; Pair-based Method; Substring Method

1       Introduction
With the popularity of microblogging services like Twitter, the microblog is now considered the intersection of blog, instant message and social network. It provides considerable data for business and political organizations to track public opinion. Microblogging also boosts the development and popularity of network language, which contains many interesting linguistic phenomena. Among them, emoticons are a notable characteristic of microblogs and have attracted the attention of many researchers in the field of sentiment analysis.
  According to Wikipedia1, there are five types of emoticons: (1) Western emoticons2 (ASCII strings like “:)smile” and “>:[frown”); (2) Eastern emoticons (pictographic Unicode strings like “(T T)crying”); (3) 2channel emoticons (complicated multiline Unicode blocks); (4) Graph emoticons (widely used in instant messages); (5) Unicode emoticons (Unicode characters like © §). Table 1 gives a detailed description.
  According to our investigation of Sina Microblogging3, a popular Chinese microblogging platform, graph emoticons and eastern emoticons are the most widely used. Graph emoticons are
* Corresponding author. Email address: wuyf@pku.edu.cn (Yunfang WU).
1 http://en.wikipedia.org/wiki/List_of_emoticons
2 In this paper, we mark the meaning of an emoticon in the superscript. For example, :)smile tells that :) is an emoticon and it means smile.
3 http://weibo.com/


Table 1: Types of emoticons

Emoticon Type         Examples                                           Characters
Western Emoticons     :)smile   :(unhappy   >:[frown                     ASCII strings
Eastern Emoticons     (T T)crying   orzkowtow   (graphic)cheer           Pictographic
2channel Emoticons    (multiline blocks, e.g. “somebody wearing          Stereotypical (multiline), graphic, complicated
                      a tall hat”, “do it”)
Graph Emoticons       (graphic images)                                   Used in MSN, etc.
Unicode Emoticons     © §                                                Unicode characters

provided by a popup window, from which the user can choose one or more emoticons. This means that the number of graph emoticons is fixed. On the contrary, eastern emoticons (like “(T T)”) can be either provided by an input method or created by users, making them much more irregular and complex. As a result, eastern emoticons are expanding rapidly, and there are a lot of out-of-vocabulary emoticons that need to be assigned proper polarity tags.
  However, there is little work devoted to predicting the semantic orientation (SO) of emoticons themselves. On Chinese microblogging platforms, new emoticons are produced almost every day, and consequently many emoticons are out of vocabulary. Predicting the SO of unknown emoticons is therefore vital to sentiment analysis as well as social computing in microblogs.
  In this paper, we focus on predicting the SO of eastern emoticons as well as a few frequently used western emoticons like “:-)” and “:-(”. We propose two methods to tackle this task: a symbol-pair method and a substring method. Given 28 positive and 25 negative emoticons as seed emoticons, our method achieves an accuracy of 78%, extracting 134 true positive and 158 true negative emoticons. To the best of our knowledge, this is the first work to automatically predict the SO of emoticons.

2      Related Work
Emoticons are mainly used in two ways: acting as class labels and working as features.
  Recently, researchers have used emoticons to label tweets as positive or negative and thus avoid the labor-intensive work of annotation. Pak and Paroubek [1, 2] search Twitter using two types of emoticons, i.e., happy emoticons like :) and sad emoticons like :(, to generate positive and negative datasets. Davidov et al. [3] propose a new method to classify tweets into multiple sentiment categories by using hashtags and smileys as clues. Yang et al. [4] use blog posts containing certain emoticons to form a readily available corpus for emotion classification. Read [5] uses emoticons as indicators of emotions to automatically classify texts as positive or negative. By utilizing emoticons as sentiment labels, these frameworks avoid the need for labor-intensive manual annotation, allowing identification and classification of the sentiment/emotion types of texts.
  Besides acting as labels, emoticons have also proved to be important features in Twitter sentiment analysis [6]. Instead of using a raw word representation, Barbosa and Feng [6] create a more abstract representation of tweets, utilizing meta-features and tweet syntax features. The latter include retweets, hashtags, replies, whether the tweet contains a link, punctuation, emoticons and upper case. Moreover, emoticons have been employed to predict retweet events and to analyze user behavior. Naveed et al. [7] find that negative emoticons support the observation that “bad news travels fast” and can be used in retweet prediction. Rao et al. [8] demonstrate that the presence of certain kinds of emoticons can be a strong indicator of female users when classifying latent user attributes. Knowing more about the sentiment orientation of emoticons may be of great help in the works above.

3      A Pictographic View on Emoticons
In this paper, we prefer to look at emoticons from a pictographic view. Table 2 illustrates the pictographic character of emoticons, where the color black denotes eyes, red denotes the mouth, purple denotes face contours, green denotes hands and arms, and blue denotes other elements.

Table 2: The pictographic feature of emoticons

Types       Emoticons    Descriptions
positive    (graphic)    smiling eyes, happy face
            (graphic)    happy face, waving arms
            (graphic)    happy face, pout, kissing
negative    (graphic)    weeping eyes, sobbing nose
            (graphic)    sweating
            (graphic)    helplessness, shrugged shoulders

    From Table 2, we can identify the following characteristics of emoticons.

     • An emoticon consists of several symbols, each representing a facial part or a body movement, such as the nose, eyes, face contour and arms.

     • The symbols often occur in parallel pairs. For instance, one pair of symbols represents the two eyes, another pair the face contour, and another pair the two arms.

     • Given an emoticon, we can extract each pair from the outside in. According to our observations, we define: the innermost pair as the eyes-pair; the symbol between the two eyes as the mouth-element; the outermost pair as the out-pair, which may represent the face contour, hands or other parts; and the remaining elements outside the out-pair as odd-elements. When there is only one pair in an emoticon, we regard it as the eyes-pair instead of the out-pair. (A minimal decomposition sketch is given after this list.)

     • The SO of an emoticon can be computed from the polarities of its component symbols. For instance, from waving arms and happy eyes we can infer that an emoticon has a positive meaning; from a sweat symbol we can infer a negative meaning.
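
To make the decomposition described above concrete, the following is a minimal sketch in Python. The notion of a matching pair (identical symbols or known mirror images such as “(” and “)”) and the simplification that unmatched odd-elements occur only at the end of the string are our own assumptions for illustration; the paper does not fix these details.

# Mirror-image symbols that we assume can form a pair with each other.
MIRRORS = {"(": ")", "[": "]", "{": "}", "<": ">", "\\": "/"}

def is_pair(left: str, right: str) -> bool:
    """Two symbols form a pair if they are identical or mirror images."""
    return left == right or MIRRORS.get(left) == right

def decompose(emoticon: str) -> dict:
    """Split an emoticon into odd-element, out-pair, eyes-pair and mouth-element."""
    chars = [c for c in emoticon if not c.isspace()]
    left, right, odd, pairs = 0, len(chars) - 1, [], []
    # Unmatched trailing symbols (e.g. the '|||' in '- -|||') become the odd-element.
    while left < right and not is_pair(chars[left], chars[right]):
        odd.append(chars[right])
        right -= 1
    # Collect matching pairs from the outside in.
    while left < right and is_pair(chars[left], chars[right]):
        pairs.append(chars[left] + chars[right])
        left += 1
        right -= 1
    mouth = "".join(chars[left:right + 1])        # whatever sits between the eyes
    if len(pairs) == 1:                           # a single pair counts as the eyes-pair
        out_pair, eyes_pair = None, pairs[0]
    else:
        out_pair = pairs[0] if pairs else None    # outermost pair
        eyes_pair = pairs[-1] if pairs else None  # innermost pair
    return {"odd": "".join(reversed(odd)), "out": out_pair,
            "eyes": eyes_pair, "mouth": mouth}

For example, decompose("- -|||") would yield the eyes-pair "--", an empty mouth-element and the odd-element "|||" under these assumptions.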

4         Determining Candidate Emoticons
In order to predict the SO of emoticons, we must first determine whether a string constitutes an emoticon at all. For instance, “∼∼!!!” is not an emoticon.
 In our experiments, the symbols that ever appear in the seed emoticons are called seed-symbols.
We define candidate emoticons as follows:
    (i) emoticon → P (S|P )∗ P
    (ii) P → seed-symbols
    (iii) S → white-space
  In addition, an emoticon should obey the following principles. It should not (1) begin with punctuation such as “?” or “!”; (2) be longer than 15 Unicode characters; (3) contain too many successive spaces; (4) contain too many consecutive appearances of the same character. Principle (4) is used to filter noise like “!!!!!!!!”, “∼∼∼!” and “∼ ∼∼”, which frequently appear in microblogs but are rarely emoticons.
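
As an illustration, a hedged sketch of this candidate filter is given below. The seed-symbol inventory, the space-run threshold for principle (3) and the repetition threshold for principle (4) are our own guesses, since the paper does not report them, and the third forbidden leading symbol in principle (1) is unclear in the source.

import re

# Hypothetical seed-symbol inventory; in the paper it is collected from the
# symbols that appear in the seed emoticons.
SEED_SYMBOLS = set("()TZorz^-_|~.oO><=;:vY\\/")

def is_candidate(s: str) -> bool:
    """Check rules (i)-(iii) and principles (1)-(4) of Section 4 (sketch)."""
    if not s or s[0] not in SEED_SYMBOLS or s[-1] not in SEED_SYMBOLS:
        return False                          # (i)/(ii): must start and end with a seed-symbol
    if any(c not in SEED_SYMBOLS and not c.isspace() for c in s):
        return False                          # (ii)/(iii): only seed-symbols and white-space inside
    if s[0] in "?!":
        return False                          # principle (1): forbidden leading punctuation
    if len(s) > 15:
        return False                          # principle (2): at most 15 Unicode characters
    if re.search(r" {3,}", s):
        return False                          # principle (3): too many successive spaces
    if re.search(r"(.)\1{3,}", s):
        return False                          # principle (4): same character repeated too often
    return True

Under these assumptions, is_candidate("(T T)") returns True, while a noise string like "~~~~~~" is rejected by principle (4).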

5         Predicting the SO of Emoticons
Following the observations of Section 3 and the processing of Section 4, we predict the SO of emoticons using the following methods.

5.1         Symbol-pair method
The symbol-pair method is designed to capture the pair-based pictographic character of emoticons. It consists of two steps: Step 1 determines the SO of the seed symbols, and Step 2 computes the SO of an emoticon from its component seed symbols.

     • Step 1. We build a seed emoticon set from the Sogou pinyin input method4, which provided 108 emoticons at the time of our experiment, including 102 eastern emoticons and 6 western emoticons. We delete the spaces and Chinese characters from each emoticon, remove the duplicates produced by this transformation, and are left with 80 emoticons. We manually label each emoticon as positive, negative or neutral. Then we automatically identify the mouth-element, eyes-pair, out-pair and odd-element of each emoticon. If a pair or element appears exclusively in emoticons of only one SO category, we assign the corresponding SO to that pair or element; otherwise we assign neutral to it. For example, one eyes-pair appears exclusively in positive emoticons, so we assign positive to that pair.
     • Step 2. For each candidate emoticon, we extract all the symbol pairs from the outside in and identify the odd-element, out-pair, eyes-pair and mouth-element. Then we compute the SO of the emoticon from the identified pairs and elements using a voting method. It returns neutral when positive and negative get the same vote. If there is no symbol pair, it returns false, indicating that the candidate cannot be judged a real emoticon.
4 A popular input method in China; it can be downloaded from http://pinyin.sogou.com/.

  Following the intuition that symbols representing the eyes and mouth play a more important role in determining the SO of an emoticon, we use a weighted vote strategy: the out-pair and odd-element get weight 1, the eyes-pair gets weight 5, and the mouth-element gets weight 3.
  When dealing with odd-elements, we employ partial match instead of exact match, to handle instances like ‘- -|||’, ‘- -|’ and ‘- -||||||||||’, where all the odd-elements follow the pattern ‘|+’ symbolizing sweat. When a partial match occurs, it gets 0.5 times the weight of an exact match, to reduce the impact of noise from partial matching.
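
A hedged sketch of this weighted vote is given below, reusing the decompose() helper sketched in Section 3. The seed_polarity lookup (built in Step 1) is assumed to map each pair or element to 'positive', 'negative' or 'neutral'; its exact structure and the partial-match test are our simplifications.

# Weights follow the paper: eyes-pair 5, mouth-element 3, out-pair and odd-element 1.
WEIGHTS = {"eyes": 5, "mouth": 3, "out": 1, "odd": 1}

def symbol_pair_so(emoticon: str, seed_polarity: dict) -> str:
    """Predict the SO of an emoticon by a weighted vote over its parts (sketch)."""
    parts = decompose(emoticon)
    if not parts["eyes"] and not parts["out"]:
        return "false"                        # no symbol pair: not judged a real emoticon
    votes = {"positive": 0.0, "negative": 0.0}
    for role in ("eyes", "mouth", "out", "odd"):
        part, weight = parts[role], WEIGHTS[role]
        if not part:
            continue
        polarity = seed_polarity.get(part)
        if polarity is None and role == "odd":
            # Partial match for odd-elements (e.g. '|||' against the seed '|'), at half weight.
            for seed, pol in seed_polarity.items():
                if seed and (seed in part or part in seed):
                    polarity, weight = pol, 0.5 * weight
                    break
        if polarity in votes:                 # neutral parts cast no vote
            votes[polarity] += weight
    if votes["positive"] > votes["negative"]:
        return "positive"
    if votes["negative"] > votes["positive"]:
        return "negative"
    return "neutral"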

5.2     Substring method
The substring method is designed to capture a few frequently used western emoticons like :) and a few emoticons in which pairs are not obvious. For example, a winking emoticon cannot be captured by the symbol-pair method, because the symbol representing the eyes is not a pair; but it can be captured by the substring method, which finds it quite similar to a seed emoticon. This method also consists of two steps.

   • Step 1. Compute the similarity between each candidate emoticon and each seed emoticon by calculating their longest common subsequence (LCS). The similarity score is defined as:

        similarity score = Len(LCS) / Len(seed emoticon)

   • Step 2. The SO of the seed emoticon with the highest similarity score is assigned to the candidate emoticon. The similarity score must be larger than a predefined threshold (set to 0.6 in our experiment); otherwise the method returns false. If more than one seed emoticon gets the highest similarity score and these seeds have different SO, the method also returns false.
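
A minimal sketch of both steps is given below, using the 0.6 threshold stated above; the seed_so argument is a hypothetical mapping from each seed emoticon to its SO label.

def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b (dynamic programming)."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, ca in enumerate(a, 1):
        for j, cb in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if ca == cb else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def substring_so(candidate: str, seed_so: dict, threshold: float = 0.6) -> str:
    """Assign the SO of the most similar seed emoticon to the candidate (sketch)."""
    best_score, best_labels = 0.0, set()
    for seed, so in seed_so.items():
        score = lcs_length(candidate, seed) / len(seed)   # Len(LCS) / Len(seed emoticon)
        if score > best_score:
            best_score, best_labels = score, {so}
        elif score == best_score:
            best_labels.add(so)
    if best_score <= threshold or len(best_labels) != 1:
        return "false"            # below the threshold, or tied seeds disagree on SO
    return best_labels.pop()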

5.3     Combining method
The symbol-pair and substring methods have complementary properties, so we combine them to obtain better performance. Let SO1 and SO2 be the orientations predicted by the symbol-pair method and the substring method, respectively; we use the following decision matrix to determine the final SO.

                                   Table 3: Decision matrix

                                        SO2
          SO1          Positive   Negative   Neutral    False
          Positive     Positive   Neutral    Positive   Positive
          Negative     Neutral    Negative   Negative   Negative
          Neutral      Positive   Negative   Neutral    Neutral
          False        Positive   Negative   Neutral    False
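
For completeness, the decision matrix can be encoded directly as a lookup table; the sketch below mirrors Table 3 exactly, with "false" standing for the case where a method makes no prediction.

# The combining rule of Table 3 as a lookup over (SO1, SO2).
DECISION = {
    ("positive", "positive"): "positive", ("positive", "negative"): "neutral",
    ("positive", "neutral"):  "positive", ("positive", "false"):    "positive",
    ("negative", "positive"): "neutral",  ("negative", "negative"): "negative",
    ("negative", "neutral"):  "negative", ("negative", "false"):    "negative",
    ("neutral",  "positive"): "positive", ("neutral",  "negative"): "negative",
    ("neutral",  "neutral"):  "neutral",  ("neutral",  "false"):    "neutral",
    ("false",    "positive"): "positive", ("false",    "negative"): "negative",
    ("false",    "neutral"):  "neutral",  ("false",    "false"):    "false",
}

def combine(so1: str, so2: str) -> str:
    """Combine the symbol-pair prediction (SO1) with the substring prediction (SO2)."""
    return DECISION[(so1, so2)]

Under the earlier sketches, combine(symbol_pair_so(e, seed_polarity), substring_so(e, seed_so)) would give the combined SO of a candidate emoticon e.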

6      Experimental Results

6.1    Data
We downloaded 133,065 microblogs using the Sina API. We extracted 3,225 strings following rules (i)–(iii), of which 1,764 were filtered out by principles (1)–(4) described in Section 4. We applied the three methods to the remaining 1,461 candidate emoticons, and 1,075 of them were finally judged to be real emoticons. Two annotators separately labeled these 1,075 emoticons as positive, negative or neutral, with an inter-annotator agreement of κ = 0.743. A third annotator labeled the emoticons on which the first two disagreed, and the final judgment was decided by voting among the three. When the three labels all differed, we assigned neutral to the emoticon. Figure 1 shows the distribution of positive/negative/neutral emoticons in our labeled data.

                   Fig. 1: The distribution of SO types in our labeled emoticons

6.2    Results
Figure 2 illustrates the number of positive/negative/neutral emoticons extracted by the three methods. Table 4 lists the contingency table of the extracted emoticons (Ext) against the true answers (Ans), as well as the accuracy of each method. Note that among the 162/283/630 positive/negative/neutral emoticons extracted by the combining method, 51/51/92 can be extracted by both methods.

                             Table 4: The performance of different methods

                symbol-pair method          substring method           combining method
  Ans \ Ext   Pos.   Neg.   Neutral      Pos.   Neg.   Neutral      Pos.   Neg.   Neutral
  Pos.         106      5        37        79      1         5       134      5        21
  Neg.           2    130        85         0    102        48         2    158        65
  Neutral       15    107       558        15     40        61        26    120       544
  Accuracy            0.76                       0.69                       0.78

                  Fig. 2: The number of emoticons extracted by different methods

  The symbol-pair method extracts more emoticons and obtains higher accuracy than the substring method, and the combining method performs better than either method in isolation, achieving an accuracy of 78%. The symbol-pair method performs better on positive emoticons, with a precision of 86.2%, while the substring method performs better on negative emoticons, with a precision of 71.3%.
  The experimental results are generally promising. There are 28/25 positive/negative emoticons in our seed set, and the numbers expand to 134/158 after our work. Among all the extracted emoticons, only 41 exactly match emoticons in the seed set, so we correctly predict 251 new sentiment-bearing emoticons. Table 5 gives some examples of new sentiment-bearing emoticons extracted by our method.

                      Table 5: Some examples extracted by the combining method

                       Positive Examples              Negative Examples
                       (∧ ∼)v       ◦∩ ∩◦             ORZZZZZ      = =b
                       Y(∧ ∧)Y      (∧ ◦ ∧)/ ∼        T T · ··     - -...
                       ◦(∧ 5 ∧)◦    ;-)               >
However, some emoticons are wrongly computed as neutral. Their odd-elements have not been seen in the seed set, but are in fact sentiment bearing, implying an “I doubt” or “what the heck” emotion.

7      Conclusions
This paper attempts to predict the SO of out-of-vocabulary emoticons. Capturing the pair-based pictographic character of emoticons, we propose two new methods, a symbol-pair method and a substring method, to compute the SO of emoticons automatically. The experimental results are generally promising: 251 new sentiment-bearing emoticons are correctly predicted.
  The sentiment of emoticons and their context are expected to be consistent. In future work, we would like to employ the sentiment of detected emoticons to improve sentiment analysis of the text containing them, and, in turn, to correct the SO of emoticons using their context.

Acknowledgement
This work is supported by the 2009 Chiang Ching-kuo Foundation for International Scholarly Exchange (under Grant No. RG013-D-09), the National High Technology Research and Development Program of China (863 Program) (No. 2012AA011101) and the Opening Project of the Beijing Key Laboratory of Internet Culture and Digital Dissemination Research (No. ICDD201202).

References
 [1]   Alexander Pak and Patrick Paroubek, Twitter as a Corpus for Sentiment Analysis and Opinion Mining. In Proceedings of LREC, 2010.
 [2]   Alexander Pak and Patrick Paroubek, Twitter Based System: Using Twitter for Disambiguating Sentiment Ambiguous Adjectives. In Proceedings of the 5th International Workshop on Semantic Evaluation, ACL, 2010, pp. 436–439.
 [3]   Dmitry Davidov, Oren Tsur and Ari Rappoport, Enhanced Sentiment Learning Using Twitter Hashtags and Smileys. In Proceedings of COLING, 2010, pp. 241–249.
 [4]   Changhua Yang, Kevin Hsin-Yih Lin and Hsin-Hsi Chen, Building Emotion Lexicon from Weblog Corpora. In Proceedings of ACL, 2007, pp. 133–136.
 [5]   Jonathon Read, Using Emoticons to Reduce Dependency in Machine Learning Techniques for Sentiment Classification. In Proceedings of the ACL 2005 Student Research Workshop, pp. 43–48.
 [6]   Luciano Barbosa and Junlan Feng, Robust Sentiment Detection on Twitter from Biased and Noisy Data. In Proceedings of COLING, 2010, pp. 36–44.
 [7]   Nasir Naveed, Thomas Gottron, Jérôme Kunegis and Arifah Che Alhadi, Bad News Travel Fast: A Content-based Analysis of Interestingness on Twitter. In Proceedings of ACM WebSci, 2011, pp. 1–7.
 [8]   Delip Rao, David Yarowsky, Abhishek Shreevats and Manaswi Gupta, Classifying Latent User Attributes in Twitter. In Proceedings of SMUC, 2010.