An Approach to Mining Social Networks in Chat Room

Journal of Computational Information Systems 7:1 (2011) 135-143
Available at http://www.Jofcis.com

Faliang HUANG 1,2,†, Nanfeng XIAO 1, XinGuo CHENG 1, Ruliang XIAO 2
1 School of Computer Sci. and Eng., South China University of Technology, Guangzhou 510641, China
2 Faculty of Software, Fujian Normal University, Fuzhou 350007, China

Abstract
Mining social networks in a chat room is valuable because it makes it possible to discover essential relations among chatters and to monitor chat rooms effectively. Among existing works, some focus on message content analysis and some emphasize the underlying thread structure of chatter dialogs, but few report approaches to mining social networks in a chat room. In this paper, we propose a novel mining approach that discovers social networks by integrating dialog thread structure association with message content similarity. We improve the traditional vector space model (VSM) with semantic similarity of terms, refine the existing heuristics in PieSpy and give novel rules derived from a large amount of observation. We experimentally evaluate the proposed approach and demonstrate that our algorithm is promising and efficient.

       Keywords: Social Networks Mining; Message Content Similarity; Thread Structure Association

1. Introduction
The rise and development of computer-mediated communication (CMC) has rapidly turned the world into a global village in recent years. Chat programs such as ICQ, MSN and mIRC let users communicate freely with each other. As the old saying goes, every coin has two sides. On the one hand, the proliferation of IRC chatting offers many opportunities for people to exchange ideas and discuss problems; on the other hand, IRC rooms, which are public and allow virtual identities, can be used as a forum for discussing dangerous activities, such as recruiting and training new terrorists, committing corporate and homeland espionage [1] or disseminating pornography to commit juvenile sex crimes [2]. How to monitor chat rooms effectively is attracting much attention from academia, industry and governments.
   An immediate answer is to mine the chatting data logged on the web servers. Indeed, the popularization of various chatting tools has resulted in the accumulation of large amounts of data containing useful information. Unlike traditional documents, chatting data flow in and out of a computer system continuously and with varying update rates, and they exhibit language irregularities such as poor spelling and grammar. These two features make existing text mining techniques such as document representation, document clustering and dimensionality reduction inappropriate for chatting data analysis.
   To achieve successful surveillance of chat rooms, researchers focus on the automatic discovery of social interactions and contextual topics among the relevant chatters, which can give rise to a better, computer-generated understanding of human relations and interactions, a process that would otherwise require a significant commitment of manual effort. Butterfly [3] samples chatting groups and recommends interesting ones to users. Based on text classification, ChatTrack [4] creates a concept-based profile that summarizes the topics discussed in a chat room or by an individual participant. Motivated by the time-orderedness of chatting data, Mutton [5] develops a software bot (PieSpy) to infer and visualize social networks on Internet Relay Chat. Based on the heuristics in PieSpy, the authors of [6] give three modified heuristics, i.e., explicit reference, immediate reaction and dialog. All these systems for mining social networks in chat rooms consider either the chatting content or the thread structure of the chatting data stream, but none takes both aspects into consideration.

† Corresponding author.
  Email addresses: faliang.huang@gmail.com (Faliang HUANG)

1553-9105/ Copyright © 2011 Binary Information Press
January, 2011

   In this paper, we propose a novel mining approach that detects social networks in chat rooms by integrating thread structure association with message content similarity. For message content analysis, we improve the traditional vector space model (VSM) with semantic similarity of terms and analyze message content with the improved model, which can better capture the characteristics of chatting data, for example, that each message has a very small number of terms. For thread structure association, we not only refine the heuristics in PieSpy but also propose novel heuristics that better capture the inherent thread structure of the message stream. On this basis, the weights of the mined social network, represented as a graph matrix, are adaptively adjusted. Experimental results show that our approach can discover some social networks that PieSpy cannot, and that these networks better reflect the essential relations among chatters.
   The rest of this paper is organized as follows. Section 2 reviews the related studies in this field. Section 3 presents simple statistical analyses of chatting data. Section 4 describes the proposed algorithm in detail. In Section 5, we present the experimental results on a real dataset together with a discussion of the results, and finally we summarize our work.

2. Related Works
Our work is closely related to topic detection and tracking (TDT), which is a longstanding problem. TDT researchers have proposed algorithms to detect topics hidden in stream-like data such as emails and blogs. Sun et al. [7] propose an approach to detect hot topics in mobile short messages by analyzing statistical properties of message characters. BuzzTrack [8] creates topic-based email groups with a clustering algorithm that integrates thread similarity, people similarity, text similarity and subject similarity. Wang et al. [9] propose a dynamic message representation that combines text content information and linguistic features in the message stream, making fuller use of stream features. The authors of [10] describe a method to detect topic words in blog documents by defining "topic words" as words frequently used by people who share the same interests. However, our work differs from these in the following two aspects: (1) The basic element in TDT is a story about a certain topic in a news stream, whereas the objects studied in our work are mainly short messages conveying certain information. TDT assumes that the content of each story is rich enough to reflect a specific topic, whereas in our problem it is difficult to extract the topic from a single message. (2) The temporal information in our work plays an important role in discovering relations among chatters.
   Another line of related work is thread structure recovery. Wang et al. [11] first define the thread structure recovery task as follows: “thread structure recovery is the process whereby a parent message is explicitly linked to one or more responding child messages”. The task mainly contains two subtasks: 1) constructing a connectivity matrix by leveraging a shallow message similarity measure between messages in a chatting stream, and 2) determining parent-child relationships within the connectivity matrix. Achievements in applying explicit thread structure to the analysis of social media are drawing more researchers into this area. Adams and Martell [12] present three different strategies to establish parent-child relationships between posts, i.e., hypernym augmentation, nickname augmentation and time-distance penalization. Shen et al. [13] propose a single-pass clustering algorithm to detect threads in text message streams based on temporal information and linguistic features such as sentence type and personal pronouns.
   The last notable line of work is social network analysis (SNA), which is the mapping and measuring of relationships and flows between people, groups, organizations, computers, URLs, and other connected information entities. The nodes in the network are the people and information entities, while the links show relationships between the nodes. SNA provides both a visual and a mathematical analysis of node relationships.

3. Statistical Analysis of Chatting Data
To better understand the characteristics of chatting data, we collected 10000 messages from each of the following channels: students, computers, music, movie, #linux, #fedora. An example snippet from the students channel is shown in Figure 1. The first 4 channels are from ICQ, and the remaining channels are from mIRC. We measure the datasets with Average Sentence Length (ASL) and Vocabulary Variety (VV), formulated as follows:

$$\mathrm{ASL} = \frac{\#\ \text{of tokens}}{\#\ \text{of sentences}} \tag{1}$$

$$\mathrm{VV} = \frac{\#\ \text{of types}}{\#\ \text{of tokens}} \tag{2}$$

  From Table 1, we can see that although the ASL of messages differs somewhat across channels, all values but one (#linux, 5.2) are less than 5, and all VVs are relatively small. These results indicate that short sentences occur frequently in chatting.
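
As an illustration of formulas (1) and (2), the following minimal Python sketch computes ASL and VV for a handful of messages. It is not the authors' measurement code: the tokenizer, the choice to count each message as one sentence, and the sample messages are all assumptions made for the example.

```python
import re


def asl_and_vv(messages):
    """Return (ASL, VV) for a list of chat messages.

    Assumptions: each message counts as one sentence, and tokens are
    lowercase runs of letters, digits, '#' and apostrophes.
    """
    tokens = []
    for message in messages:
        tokens.extend(re.findall(r"[a-z0-9#']+", message.lower()))
    if not messages or not tokens:
        raise ValueError("need at least one non-empty message")
    asl = len(tokens) / len(messages)    # formula (1): tokens per sentence
    vv = len(set(tokens)) / len(tokens)  # formula (2): types per token
    return asl, vv


# Made-up messages, for illustration only.
sample = ["hi all", "anyone using fedora", "hi yes fedora here"]
asl, vv = asl_and_vv(sample)
print(f"ASL = {asl:.2f}, VV = {vv:.2f}")
```

A different tokenizer or sentence splitter would change the absolute values, so the numbers produced by this sketch are not directly comparable with Table 1.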

                                        Fig.1 An Example Snippet of Chatting Data

                                          Table 1 ASL and VV of Chatting Data
              Students    Computers    Music    Movie    #linux    #fedora
       ASL    4.3         4.5          4.2      4.6      5.2       4.9
       VV     0.28        0.25         0.3      0.22     0.155     0.16

4. Social Network Construction
As argued in Section 1, most existing works on inferring social networks in chat rooms consider either message content or message thread structure alone. Here we consider both aspects.

4.1. Preprocessing
To compute message similarity, we first preprocess the chatting data in the following steps.
  (1) Reconstructing abbreviations. Abbreviations such as cyber slang, acronyms and shortened words are pervasive in chatting data, so we expand them to their full forms, which we believe helps to expose the full content of the messages. We manually construct a lookup table, AbbrList, implemented as a map data structure. Table 2 gives an example of internet slang and the corresponding meanings.
  (2) Modifying the stoplist. Because the ASL of messages is rather small, it is impractical to eliminate “stopwords” with the stoplist widely used in text mining: doing so would aggravate the occurrence of “zero-valued” similarity [14]. We therefore modify the traditional stoplist by removing from it words with specific parts of speech, such as verbs and personal pronouns.
  (3) Tagging parts of speech. We use the Brill tagger [15] from the NLTK Lite toolkit to assign a part of speech to each word in a message.
  (4) Constructing the bag-of-words. This step selects the verbs and nouns from the messages.

                              Table 2 An Example of Internet Slang and Corresponding Meaning

Slang     Meaning
PLS       please
ASAP      as soon as possible
F2F       face to face
ATST      at the same time
BBL       be back later
BTW       by the way
KIT       keep in touch
CYL       see you later
THX       thanks
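
Steps (1) and (2) of the preprocessing can be sketched in Python as follows. This is only an illustration under stated assumptions: the map reuses the Table 2 entries, while the tokenizer, the function names and the tiny reduced stoplist are hypothetical. Steps (3) and (4), which need a part-of-speech tagger, are left out.

```python
import re

# Lookup table in the spirit of AbbrList, filled with the Table 2 entries.
ABBR_LIST = {
    "pls": "please",
    "asap": "as soon as possible",
    "f2f": "face to face",
    "atst": "at the same time",
    "bbl": "be back later",
    "btw": "by the way",
    "kit": "keep in touch",
    "cyl": "see you later",
    "thx": "thanks",
}

# Hypothetical reduced stoplist: verbs and personal pronouns are deliberately
# absent, so they survive filtering in short messages.
STOPLIST = {"a", "an", "the", "of", "and", "or"}


def expand_abbreviations(message):
    """Lowercase, tokenize, and replace each known abbreviation by its words."""
    tokens = re.findall(r"[a-z0-9']+", message.lower())
    expanded = []
    for token in tokens:
        expanded.extend(ABBR_LIST.get(token, token).split())
    return expanded


def preprocess(message):
    """Expand abbreviations, then drop words found in the reduced stoplist."""
    return [t for t in expand_abbreviations(message) if t not in STOPLIST]


print(preprocess("BTW pls reboot the server ASAP"))
```

Because the expansion happens before stoplist filtering, function words introduced by an expansion are filtered in the same way as words typed by the chatter.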

4.2. Content Based Similarity
The Vector Space Model, a widely used data model for text classification and clustering, has some intrinsic limitations such as the frequent occurrence of “zero-valued” similarity, which we previously attempted to overcome with tolerant rough sets [14]. In Section 3, we observed that short sentences account for an overwhelming proportion of dialogs in chat rooms, which means that dialogs (our clustering objects) with “zero-valued” similarity occur more frequently than in traditional text clustering. We therefore abandon our previous rough set based strategy [14] and adapt the semantic similarity retrieval model (SSRM), originally used in information retrieval [16], to our scenario. Suppose we have messages $m_i$ and $m_j$, with $m_i = \{w_1^i, w_2^i, \ldots, w_n^i\}$ and $m_j = \{w_1^j, w_2^j, \ldots, w_m^j\}$. Term weights are initialized based on information entropy, i.e., with formula (3), which gives the weight vectors $m_i = \langle v_1^i, v_2^i, \ldots, v_n^i \rangle$ and $m_j = \langle v_1^j, v_2^j, \ldots, v_m^j \rangle$.

$$v_i = \mathrm{entropy}(w_i) = -p_i \log p_i \tag{3}$$
  The weight of each term $w_i$ in a message is adjusted based on its relationships with the other semantically similar terms $w_j$ within the same vector, which can be formulated as below:

$$v_i = v_i + \sum_{\substack{i \neq j \\ wsim(w_i, w_j) \geq T}} v_j \cdot wsim(w_i, w_j) \tag{4}$$

 where $wsim(w_i, w_j)$ denotes the semantic similarity between term $w_i$ and term $w_j$.
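
The reweighting of formula (4) can be illustrated with a short Python sketch. It assumes that each summand is the product of $v_j$ and $wsim(w_i, w_j)$, as in formula (5), and that all updates read the original weights rather than already-updated ones. The toy similarity table, the helper names and the threshold value are hypothetical; in the paper the similarities come from WordNet-based measures [16].

```python
def make_wsim(pairs):
    """Build a symmetric term-similarity function from a dict of pair scores."""
    table = {frozenset(pair): score for pair, score in pairs.items()}

    def wsim(a, b):
        if a == b:
            return 1.0
        return table.get(frozenset((a, b)), 0.0)

    return wsim


def reweight(weights, wsim, threshold):
    """Add to each weight the weights of sufficiently similar other terms."""
    new_weights = {}
    for wi, vi in weights.items():
        boost = 0.0
        for wj, vj in weights.items():
            if wj != wi and wsim(wi, wj) >= threshold:
                boost += vj * wsim(wi, wj)
        new_weights[wi] = vi + boost
    return new_weights


# Toy similarity scores and weights, for illustration only.
wsim = make_wsim({("car", "automobile"): 0.9})
weights = {"car": 0.5, "automobile": 0.3, "music": 0.2}
adjusted = reweight(weights, wsim, threshold=0.8)
print(adjusted)
```

In this toy vector "car" and "automobile" reinforce each other, while "music" has no similar neighbour above the threshold and keeps its weight.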
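 The message vector is augmented by synonyms, hyponyms and hypernyms, which can be looked up in WordNet, and so we have

$$v_i = \begin{cases} \displaystyle\sum_{\substack{i \neq j \\ wsim(w_i, w_j) > T}} \frac{1}{n}\, v_j \, wsim(w_i, w_j), & w_i \text{ is a new term} \\[2ex] \displaystyle\sum_{\substack{i \neq j \\ wsim(w_i, w_j) > T}} \frac{1}{n}\, v_j \, wsim(w_i, w_j) + v_i, & w_i \text{ had weight } v_i \end{cases} \tag{5}$$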
  Finally, we can compute the content-based similarity between messages $m_1$ and $m_2$ as follows:

$$csim(m_1, m_2) = \frac{\sum_i \sum_j v_i^1 \cdot v_j^2 \cdot wsim(w_i^1, w_j^2)}{\sum_i \sum_j v_i^1 \cdot v_j^2} \tag{6}$$
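
A minimal Python sketch of the content-based similarity of formula (6) is given below: a weighted average of pairwise term similarities between two messages. The term weights, the toy similarity table and the helper names are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def make_wsim(pairs):
    """Build a symmetric term-similarity function from a dict of pair scores."""
    table = {frozenset(pair): score for pair, score in pairs.items()}

    def wsim(a, b):
        if a == b:
            return 1.0
        return table.get(frozenset((a, b)), 0.0)

    return wsim


def csim(m1, m2, wsim):
    """Content similarity of two messages given as {term: weight} dicts."""
    numerator = 0.0
    denominator = 0.0
    for w1, v1 in m1.items():
        for w2, v2 in m2.items():
            numerator += v1 * v2 * wsim(w1, w2)
            denominator += v1 * v2
    # Returning 0.0 for an empty or all-zero message is an assumption.
    return numerator / denominator if denominator else 0.0


# Toy weights and similarity, for illustration only.
wsim = make_wsim({("linux", "fedora"): 0.8})
m1 = {"linux": 0.4, "kernel": 0.6}
m2 = {"fedora": 0.5, "kernel": 0.5}
score = csim(m1, m2, wsim)
print(round(score, 2))
```

The two messages share only the term "kernel", yet the related pair "linux" and "fedora" also contributes, which is how this measure avoids "zero-valued" similarity for short messages with little literal overlap.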

4.3. Response Structure based Similarity
Some studies demonstrate that the response structure information contained in discussions helps to detect the underlying social networks in chat rooms, so we use a few rather simple heuristics to infer response structure based message similarity. The heuristics and the corresponding scenarios are listed below.
   Explicit addressing: a message has an explicit receiver, that is, a chatter makes it clear whom he or she wants to chat with.
   Linguistic feature respondence: a chatter sends an interrogative message and another chatter follows with a declarative message.
   Immediate reaction: a chatter sends a message after a longish period of silence, and within a short time span another chatter sends a message. Unlike PieSpy, we take the length of the latter message into account and assume that the probability that the first chatter is its receiver is inversely proportional to that length.
   Dialog density: two chatters talk alternately and frequently within a short duration; in other words, the larger the dialog density, the less likely it is that other people are interleaved in the exchange.
   Let m1 and m2 be two messages. We assign the weights α, β, γ and θ to the similarity between m1 and m2 in the explicit addressing, immediate reaction, dialog density and linguistic feature respondence scenarios, respectively. The procedure can be described as follows:
  Procedure RsimComputation
  Initialize rsim(m1, m2)
  If m1 and m2 satisfy the explicit addressing scenario
    rsim(m1, m2) = rsim(m1, m2) + α
  Else if m1 and m2 satisfy the immediate reaction scenario
    rsim(m1, m2) = rsim(m1, m2) + β
  Else if m1 and m2 satisfy the dialog density scenario
    rsim(m1, m2) = rsim(m1, m2) + γ
  Else if m1 and m2 satisfy the linguistic feature respondence scenario
    rsim(m1, m2) = rsim(m1, m2) + θ
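
The procedure above can be rendered in Python roughly as follows. This is a sketch, not the authors' implementation: scenario detection is not specified at this point, so the four scenarios are passed in as booleans, the initial value 0 is an assumption, and the default values of α, β, γ and θ are hypothetical.

```python
def rsim(explicit_addressing, immediate_reaction, dialog_density,
         linguistic_respondence,
         alpha=0.4, beta=0.3, gamma=0.2, theta=0.1):
    """Response-structure similarity of one message pair.

    Each boolean says whether the pair satisfies the named scenario.
    The weight defaults are placeholders, not values from the paper.
    """
    score = 0.0  # assumed initial value; the procedure only says "Initialize"
    # The else-if chain means only the first matching scenario contributes.
    if explicit_addressing:
        score += alpha
    elif immediate_reaction:
        score += beta
    elif dialog_density:
        score += gamma
    elif linguistic_respondence:
        score += theta
    return score


# A pair that is both explicitly addressed and an immediate reaction
# receives only the explicit-addressing weight.
print(rsim(True, True, False, False))
```

Because of the else-if chain, the order of the tests acts as a priority ranking of the heuristics, with explicit addressing taking precedence.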

4.4. Network Construction
In this section, we describe techniques to construct social networks based on the content similarity and response structure similarity of messages [5] [6]. Considering the random entrance and exit of chatters in a chat room and the characteristics of the message data stream, we develop a partially dynamic strategy to construct the corresponding social networks.
   In partially dynamic construction, a sliding window technique is used to split the dataset into a number of small datasets. In each small dataset we then create social networks as follows.
   Given chatters P = {Pi | 1 ≤ i ≤ m} in a chat room and a time-ordered sequence of messages M = {mi | 1 ≤ i ≤ n} from the chatters, a social network is dynamically created as follows: a chatter corresponds to a node in the social network, and an edge from chatter Pi to chatter Pj denotes the relevance of the corresponding two chatters. The weight of the edge from P1 to P2 can be computed with formula (7).

$$weight(P_1, P_2) = \sum_{i,j} \left( \lambda \cdot csim(m_i^1, m_j^2) + (1 - \lambda) \cdot rsim(m_i^1, m_j^2) \right) \tag{7}$$

where $m_i^1$ and $m_j^2$ range over the messages of $P_1$ and $P_2$, and $\lambda$ ($0 \leq \lambda \leq 1$) trades off the message content factor against the message thread structure factor.
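
As a sketch of formula (7), the Python function below sums the λ-weighted combination of content similarity and response structure similarity over all message pairs of two chatters. The similarity functions are passed in as parameters; the constant toy similarities and the message strings are placeholders that only serve to make the arithmetic visible.

```python
def edge_weight(messages_p1, messages_p2, csim, rsim, lam=0.4):
    """Edge weight between two chatters from all pairs of their messages."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    total = 0.0
    for m1 in messages_p1:
        for m2 in messages_p2:
            total += lam * csim(m1, m2) + (1.0 - lam) * rsim(m1, m2)
    return total


# Placeholder similarity functions returning constants (illustration only).
def toy_csim(m1, m2):
    return 0.5


def toy_rsim(m1, m2):
    return 1.0


weight = edge_weight(["any fedora users?", "kernel panic again"],
                     ["try the new kernel"],
                     toy_csim, toy_rsim, lam=0.4)
print(round(weight, 2))
```

With λ = 0 the weight depends on thread structure only, and with λ = 1 on message content only, which matches the role of λ as a balance factor.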
   Note that the sliding window size has a great impact on the constructed social networks, as shown in the next section.
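
The window-based splitting can be sketched as follows. The sketch assumes consecutive, non-overlapping windows measured from the first message, which is one plausible reading of the sliding window technique rather than a confirmed detail; the tuple layout of a message and the sample log are hypothetical.

```python
def split_into_windows(messages, window_minutes):
    """Group time-ordered messages into consecutive fixed-size windows.

    Each message is a (minute, sender, text) tuple; windows that contain
    no message are skipped.
    """
    if window_minutes <= 0:
        raise ValueError("window_minutes must be positive")
    if not messages:
        return []
    start = messages[0][0]
    buckets = {}
    for message in messages:
        index = int((message[0] - start) // window_minutes)
        buckets.setdefault(index, []).append(message)
    return [buckets[key] for key in sorted(buckets)]


# Made-up log, for illustration only.
log = [
    (0, "ann", "hi"),
    (5, "bob", "hello"),
    (19, "ann", "anyone here?"),
    (20, "cy", "yes"),
    (45, "bob", "brb"),
    (119, "ann", "bye"),
]
windows = split_into_windows(log, window_minutes=20)
print([len(window) for window in windows])
```

A social network would then be built separately for each window, so a smaller window yields more but sparser networks.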

5. Experiments
In the previous sections we described our approach to constructing social networks based on message content similarity and thread structure similarity. In this section, we empirically compare our approach with the thread structure driven construction approach (abbr. TS approach) and the message content driven construction approach (abbr. MS approach).

5.1. Dataset and Evaluation Methods
We collected messages from channel #Linux by running mIRC for two hours. The dataset contains 1327 messages from 150 chatters. We use precision, recall, and F-measure to evaluate our results, which can be formulated as below:

$$Precision = \frac{|(\text{real links}) \cap (\text{discovered links})|}{|\text{discovered links}|} \tag{8}$$

$$Recall = \frac{|(\text{real links}) \cap (\text{discovered links})|}{|\text{real links}|} \tag{9}$$

$$F = \frac{2 \cdot Precision \cdot Recall}{Precision + Recall} \tag{10}$$

where real links denote the manually identified links between chatters, and discovered links denote the links between chatters discovered by software.
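
Formulas (8) to (10) can be computed with a few lines of Python, as sketched below. Treating a link as an unordered pair of chatters is an assumption of this sketch, and the chatter names and links are invented for the example.

```python
def evaluate(real_links, discovered_links):
    """Return (precision, recall, F-measure) for two collections of links."""
    # Assumption: a link is an unordered pair, so (a, b) equals (b, a).
    real = {frozenset(link) for link in real_links}
    found = {frozenset(link) for link in discovered_links}
    hits = len(real & found)
    precision = hits / len(found) if found else 0.0
    recall = hits / len(real) if real else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Invented links, for illustration only.
real = [("ann", "bob"), ("ann", "cy"), ("bob", "dee")]
found = [("bob", "ann"), ("cy", "dee")]
precision, recall, f_measure = evaluate(real, found)
print(precision, round(recall, 3), round(f_measure, 3))
```

If edges are meant to be directed, the `frozenset` wrapping would be replaced by plain tuples so that (a, b) and (b, a) count as different links.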

5.2. Experimental Results
In this section, we evaluate our proposed approach from the following three aspects:
   1) Performance
   We use the F-measure to evaluate the discovered social networks by comparing our approach with the TS approach and the MS approach. We set λ=0.4 and window size=20 min, which gives 6 small datasets. Table 3 compares the quality of the social networks discovered by the three approaches. From Table 3 we can see that the TS approach performs neck and neck with the MS approach (average F-measure 0.58 versus 0.585), while our approach performs considerably better than both (average 0.6633). Figure 2 and Figure 3 are two examples of social networks constructed from chat data. Comparing the social networks in Figure 2 and Figure 3, we find that, in the social network discovered with our approach, the connection between chatter J8a and chatter grawity is removed but a connection between chatter J8a and chatter lilzeus is added. Checking the chatting data of these chatters, we find that the difference between the social networks discovered by our approach and by PieSpy can be explained as follows: the tie between J8a and grawity is strong from the viewpoint of thread structure; however, the topic relevance between J8a’s talk and lilzeus’s talk is much stronger.
   2) Sensitiveness to balance factor parameter
   As a balance factor, the parameter λ plays an important role in our approach, so we conduct a group of experiments with different values of λ. From Section 4 it is not difficult to deduce that an inadequate balance factor can decrease the quality of the discovered social networks: on the one hand, too small a λ can lead to loss of message content information; on the other hand, too large a λ can cause loss of thread structure information. Both cases can lead to worse performance. Figure 4 shows that our experimental result agrees with this deduction: the performance is best when λ=0.4.

                                     Table 3 F-measure of Discovered Social Networks
                       #Dataset        Our Approach         TS Approach             MS Approach
                          1                0.64                 0.54                    0.58
                          2                0.66                 0.56                    0.59
                          3                0.65                 0.6                     0.58
                          4                0.68                 0.58                    0.56
                          5                0.67                 0.58                    0.59
                          6                0.68                 0.62                    0.61
                         Avg              0.6633                0.58                   0.585
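
The Avg row of Table 3 can be rechecked with a few lines of Python. The F-measure values below are copied from the table; only the variable names are introduced here.

```python
from statistics import mean

# F-measure per small dataset (1 to 6), copied from Table 3.
f_scores = {
    "Our Approach": [0.64, 0.66, 0.65, 0.68, 0.67, 0.68],
    "TS Approach": [0.54, 0.56, 0.60, 0.58, 0.58, 0.62],
    "MS Approach": [0.58, 0.59, 0.58, 0.56, 0.59, 0.61],
}

averages = {name: mean(values) for name, values in f_scores.items()}
for name, average in averages.items():
    print(f"{name}: {average:.4f}")
```

The computed means agree with the Avg row, and our approach has the highest F-measure on every one of the six small datasets.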

   3) Sensitiveness to slide window size
   We conduct a group of experiments with different sliding window sizes. From Figure 5 we can see that the sliding window size has much influence on the performance. When the window size increases from 5 to 15, the performance climbs rapidly, while from 15 to 30 the F-measure fluctuates between 0.65 and 0.67.

                                  Fig.2 Social Network Constructed with Our Approach

6. Conclusion
In this paper, we have proposed an approach to mining social networks in chat rooms that considers both chatting content features and chatting thread structure. Statistical analysis of chatting data and the intrinsic limitations of VSM motivated us to introduce semantic similarity. We also improved the PieSpy heuristics and proposed novel heuristics. To evaluate our approach, we conducted experiments on a real dataset. The experimental results show that our approach discovers more meaningful underlying social networks than the other two approaches, which suggests that it is effective and promising.

Acknowledgement
This work was supported by the National Natural Science Foundation of China and the Civil Aviation Administration of China (No. 60776816), the Natural Science Foundation of Guangdong Province (No. 8251064101000005), the Foundation of Fujian Educational Committee (No. JA10076) and the Natural Science Foundation of Fujian Province (No. 2009J01272).

                                      Fig.3 Social Network Constructed with PieSpy

                                       Fig.4 Sensitiveness to Balance Factor Parameter

References
[1] Wolak, J., Mitchell, K., Finkelhor, D. Internet Sex Crimes Against Minors: The Response of Law Enforcement.
     National Center for Missing and Exploited Children, 2003.
[2] http://news.xinhuanet.com/english/2007-04/06/content_5940180.htm
[3] Van Dyke N W, Lieberman H, Maes P. Butterfly: A Conversation-Finding Agent for Internet Relay Chat. In
    Proceedings of the International Conference on Intelligent User Interfaces, pages 39-41, 1999.
[4] Bengel, J., Gauch, S., et al. ChatTrack: chat room topic detection using classification. In Proceedings of the
    2nd Symposium on Intelligence and Security Informatics, pages 266-277, 2004.
[5] Mutton P. Inferring and visualizing social networks on Internet relay chat. In Proceedings of the 8th International
    Conference on Information Visualization, pages 35-43, 2004.
[6] V. H. Tuulos and H. Tirri. Combining Topic Models and Social Networks for Chat Data Mining. In Proceedings
    of the 2004 IEEE/WIC/ACM International Conference on Web Intelligence(WI-2004), pages 206-213, 2004.
[7] Sun Q., Wang Q. and Qiao H. The Algorithm of Short Message Hot Topic Detection Based on Feature
    Association. Inform. Technol. J., 8:236-240, 2009.

                                        Fig.5 Sensitiveness to Slide Window Size.

[8] Gabor Cselle, Keno Albrecht, Roger Wattenhofer. BuzzTrack: topic detection and tracking in email. In
     Proceedings of the 12th international conference on Intelligent user interfaces(IUI-2007), pages 190-197, 2006.
[9] Le Wang, Yan Jia, Yingwen Chen. Conversation extraction in dynamic text message stream. Journal of
     Computers. 3(10): 86-93, 2008.
[10] Yuichiro Sekiguchi, Harumi Kawashima, Hidenori Okuda, Masahiro Oku. Topic Detection from Blog Documents
     Using Users’ Interests. In Proceedings of the 7th International Conference on Mobile Data
     Management(MDM’06), pages 108, 2006.
[11] Wang, Y. C., Joshi, M., Cohen, W. W., Rosé, C. P. Recovering Implicit Thread Structure in Newsgroup Style
     Conversations. In Proceedings of the 2nd International Conference on Weblogs and Social Media (ICWSM II),
     2008.
[12] P. Adams and C. Martell. Topic Detection and Extraction in Chat. In Proceedings of 2008 IEEE International
     Conference on Semantic Computing, pages 581-588, 2008.
[13] D. Shen, Q. Yang, J. Sun, Z. Chen. Thread Detection in Dynamic Text Message Streams. In Proceedings of
     the 29th Annual International ACM SIGIR Conference on Research and Development in
     Information Retrieval (SIGIR-2006), pages 35-42, 2006.
[14] Faliang Huang, Shichao Zhang. Clustering Web Documents Based on Knowledge Granularity. In Proceedings of
     the 8th Asia Pacific Web Conference (APWeb 2006), pages 85-96, 2006.
[15] Brill, E. A simple rule-based part of speech tagger. In Proceedings of the Third Annual Conference on Applied
     Natural Language Processing, pages 152-155, 1992.
[16] Giannis Varelas, Epimenidis Voutsakis, Paraskevi Raftopoulou, Euripides G. M. Petrakis, Evangelos E. Milios.
     Semantic similarity methods in WordNet and their application to information retrieval on the web. In Proceedings
     of the 7th ACM International Workshop on Web Information and Data Management (WIDM 2005), pages 10-16,
     2005.