WHATSANALYZER: A TOOL FOR COLLECTING AND ANALYZING WHATSAPP MOBILE MESSAGING COMMUNICATION DATA - ITC-CONFERENCE.ORG

2018 30th International Teletraffic Congress

WhatsAnalyzer: a Tool for Collecting and Analyzing WhatsApp Mobile Messaging Communication Data

Anika Schwind, Michael Seufert*
Institute of Computer Science, University of Würzburg
Würzburg, Germany
anika.schwind@informatik.uni-wuerzburg.de, michael.seufert.fl@ait.ac.at

* Michael Seufert is now at AIT Austrian Institute of Technology GmbH, Vienna, Austria

978-0-9883045-5-0/18/$31.00 ©2018 ITC
DOI 10.1109/ITC30.2018.00020

Abstract: WhatsAnalyzer is a web-based service that collects and analyzes chat histories of the mobile messaging application WhatsApp. It leverages the e-mail export feature of WhatsApp to obtain the chat histories, which cannot be accessed otherwise due to encrypted storage on the mobile device and end-to-end encrypted transmission over the Internet. Thus, the major asset of the service is that real communication data can be collected without the bias introduced by observing or surveying participants. The collected communication data can be analyzed and provide valuable insights into the communication in WhatsApp and the resulting network traffic. To incentivize users to send chat histories, the privacy of users is respected by anonymizing all communication data. Moreover, some analyses of each chat history can be accessed on a web page by the sender of the chats.

Index Terms: Communication model; WhatsApp; Mobile messaging application; Mobile instant messaging; Mobile networks.

I. INTRODUCTION AND RELATED WORK

Mobile messaging applications (MMAs) offer real-time message transmission over the Internet. These apps are a free or low-cost alternative to operator-based messaging via SMS or MMS, and thus are growing in popularity. In 2017, 1.82 billion people used MMAs at least once a month, and an increase to 2.48 billion in 2021 is expected [1]. WhatsApp and Facebook Messenger are the most popular apps with 1.2 billion monthly active users in 2017. They are followed by WeChat (938 million) and QQ Mobile (678 million). Other popular MMAs are Skype, Snapchat, Kik, Viber, Line, BlackBerry Messenger, Telegram, and KakaoTalk, which are reported to have 300 million or fewer monthly active users [1], [2]. In [3], it was predicted that messaging traffic will reach up to 100 trillion MMA messages in 2019, which is 62.5% of global message traffic including MMAs, SMS, MMS, e-mail, rich communications suite, and social messaging. The revenue generated from each MMA message is forecast to be less than 1% of that from SMS and MMS.

These statistics show that the network traffic created by ubiquitous communication through MMAs is increasing and puts a lot of load on mobile networks. To efficiently handle this traffic and provide a proper management of the cellular resources, it is necessary to understand how MMAs are used.

Nowadays, MMAs are no longer purely text-based. Several MMAs allow the transmission of (media) files, such as images or videos, and some even feature voice calls or videoconferencing. Additionally, most apps are not limited to one-to-one communication, but allow many-to-many communication through the creation of chat groups. In contrast to regular chatting, a post in a group has to be transmitted to multiple recipients, and thus multiplies the traffic on the network. While compression of media content is the default way in which MMAs cope with huge amounts of data, the users’ demand for high quality content and the multiplication of recipients might require additional network traffic management to cope with the traffic load.

Thus, it might not be obvious yet which Internet technology will be employed to cope with the new challenges and demands of (group-based) messaging communication on the Internet. Still, it is the user behavior that dictates the path of technology through service acceptance, adoption, and usage. Therefore, it is important to analyze the way people communicate with each other using MMAs in order to develop effective traffic management algorithms that efficiently deliver the generated data.

There is some literature on how people communicate with each other and how this communication has changed in the last decade due to mobile messaging applications. Here, social aspects as well as technical aspects regarding WhatsApp and other applications were investigated [4]–[6]. Only very few papers deal with the abstract modeling of the communication within MMAs [7]–[9]. However, a comprehensive analysis and modeling of communication in MMAs is still missing.

For this reason, this paper presents the web-based service WhatsAnalyzer, which can receive WhatsApp chat histories by e-mail and analyze the communication within the chat. It leverages the e-mail export feature of WhatsApp to obtain a text-based version of the chat histories. As chat histories are stored in an encrypted database on the mobile device and messages are transmitted over the Internet with end-to-end encryption, this is currently the only option to access the chat data. Thus, the major asset of the WhatsAnalyzer service is that real communication data can be collected without the bias introduced by observing or surveying participants. When collecting the chat histories, the privacy of the users is respected, such that only timestamps, anonymized user names, message types, and message lengths are extracted from the chat history. These communication data can be analyzed to understand the communication in WhatsApp and the resulting network traffic, both in terms of frequency and volume. To give an incentive to use the service, some analyses of each chat history can be accessed on a web page with an individual link, which is given to the sender of the chat history by e-mail together with a mapping of real and anonymized user names. In this way, the senders of the chat history can also get interesting insights into their own communication.

II. WHATSANALYZER

This section describes the WhatsAnalyzer application. Figure 1 illustrates the general procedure. First, the user sends his or her WhatsApp chat history via e-mail to WhatsAnalyzer (1). The incoming mail is noticed by WhatsAnalyzer, which then automatically starts the processing procedure. The chat history is anonymized and statistically evaluated, and the anonymized chat and the statistics are stored in a database (2). Afterwards, WhatsAnalyzer sends an e-mail to the user (3) including a link to his or her analysis visualized on a webpage (4). In the following, the individual components of WhatsAnalyzer are described in detail.

Fig. 1: Processing procedure of WhatsAnalyzer

A. Mail Handling

For mail handling, WhatsAnalyzer uses the mail server of the University of Würzburg. The communication with the server is done via standard protocols: The incoming e-mails are fetched from the server via the Internet Message Access Protocol (IMAP). To send e-mails, the Simple Mail Transfer Protocol (SMTP) is used.

To trigger an analysis of a WhatsApp chat, the user has to utilize a WhatsApp feature that allows sending a copy of the chat history by e-mail. Internally, WhatsApp then generates a text document with the chat history, which is attached to an e-mail and can be sent via the device’s e-mail application. The user has to send this text document of an individual chat or group chat without any other media files to the specific e-mail address of WhatsAnalyzer. In the following, this user will be referred to as chat owner. As soon as an e-mail arrives at WhatsAnalyzer’s inbox, it is noticed by the mail handling module and processed. As a first step, the module checks if the e-mail contains a valid WhatsApp chat history. In this case, the file is parsed, anonymized, evaluated, and stored as described below. Afterwards, the received e-mail is deleted automatically. The mail system also handles outbound e-mails to users and the administrator. In the main use case, WhatsAnalyzer replies to chat owners after their chat was analyzed. They will receive an e-mail containing a link to the web page on which the evaluation of their chat is displayed. In addition, this e-mail contains an assignment of the chat members’ real names to their anonymized names.

B. Anonymization

While bringing the chat history into a standardized format, WhatsAnalyzer anonymizes the chat and analyzes it afterwards. In this context, it should be noted that after the anonymization, the original chat history is deleted and only the anonymized version of it is used for further investigations.

Figure 2 shows an original chat history as it can be sent from WhatsApp and its anonymized version. The anonymization step not only protects users’ private data (real names of communication partners and content of messages), but also transforms the chat histories into a standardized format. This is necessary because WhatsApp chat histories occur in a variety of formats (e.g., date, separator, system messages) depending on operating system, system language, and WhatsApp version, which makes parsing the histories challenging. Each line of a chat history represents a message of a user. Despite the various formats, each post starts with a timestamp followed by the name of the author and the content of the message.

The format of the timestamp varies considerably depending on the system language. To cope with this, the timestamps are parsed and normalized for post-processing in the following format: dd.mm.yyyy, HH:MM.

The second part of a line contains the name of the author of the post as stored in the contact list of the device. Each of these names is replaced by a unique user ID in order to be able to keep track of the individual behavior of different users. A list of the original names of the users and their IDs is temporarily stored to be sent to the chat owner in the response e-mail and will be deleted afterwards. With this list, the chat owner, i.e., the user who sent the chat history, is able to identify all participants, while nobody else is.

The last part of a line contains the actual message that has been sent. This can be some text written by a participant, a placeholder indicating that a media file was sent, or a system message (e.g., informing the participants that somebody has changed his or her phone number or changed the chat’s name). To protect the users’ privacy, the content of text messages is discarded and only the number of written characters is saved.
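The timestamp normalization to dd.mm.yyyy, HH:MM can be illustrated with a minimal Python sketch. The list of candidate input formats is an assumption made for illustration, since the formats that WhatsApp actually produces are not enumerated here, and `normalize_timestamp` is a hypothetical helper rather than WhatsAnalyzer's own code.

```python
from datetime import datetime

# Candidate input formats: assumed for illustration only.
CANDIDATE_FORMATS = [
    "%d/%m/%y, %H:%M",     # e.g. 02/08/17, 17:29 (the layout shown in Fig. 2)
    "%d.%m.%y, %H:%M",     # e.g. 02.08.17, 17:29
    "%d.%m.%Y, %H:%M",     # e.g. 02.08.2017, 17:29
    "%m/%d/%y, %I:%M %p",  # e.g. 8/2/17, 5:29 PM
]
NORMALIZED_FORMAT = "%d.%m.%Y, %H:%M"  # dd.mm.yyyy, HH:MM


def normalize_timestamp(raw: str) -> str:
    """Try each candidate format in order and return the normalized timestamp."""
    for fmt in CANDIDATE_FORMATS:
        try:
            parsed = datetime.strptime(raw.strip(), fmt)
        except ValueError:
            continue  # this format does not fit, try the next one
        return parsed.strftime(NORMALIZED_FORMAT)
    raise ValueError(f"unrecognized timestamp format: {raw!r}")
```

Note that the order of the candidate list matters in such a design: a timestamp like 02/08/17 fits both a day-first and a month-first layout, so a single line cannot resolve the ambiguity, and a real parser would need further hints such as the system language or the other timestamps in the same file.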

02/08/17, 17:29 - Michael created group "Test"      02.08.2017, 17:29 - User1: created group
02/08/17, 17:30 - Michael added you                 02.08.2017, 17:30 - You: were added
02/08/17, 17:32 - Marco: Hi Anika                   02.08.2017, 17:32 - User2: 8 chars
02/08/17, 17:57 - +49 1234 12345678: Hey            02.08.2017, 17:57 - User3: 3 chars
02/08/17, 18:43 - Anika: Hello everybody            02.08.2017, 18:43 - User4: 15 chars
03/08/17, 08:29 - Michael:                          03.08.2017, 08:29 - User1:

                                  Fig. 2: Original WhatsApp chat history and its anonymized version
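The transformation illustrated in Fig. 2 can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the actual WhatsAnalyzer implementation: the regular expression covers only the single input layout shown in the figure, the names `LINE_RE` and `anonymize` are hypothetical, and system messages are skipped, so the assigned user IDs can differ from those in the figure.

```python
import re
from datetime import datetime

# Assumed input layout, taken from the example in Fig. 2 only:
#   dd/mm/yy, HH:MM - Author: text
LINE_RE = re.compile(r"^(\d{2}/\d{2}/\d{2}, \d{2}:\d{2}) - ([^:]+): (.*)$")


def anonymize(lines):
    """Return (anonymized_lines, name_to_id) for a list of chat lines."""
    name_to_id = {}  # real name -> user ID, to be sent to the chat owner only
    anonymized = []
    for line in lines:
        match = LINE_RE.match(line)
        if match is None:
            # Simplification: lines without "Author: text" (e.g. system
            # messages) are skipped instead of being rewritten.
            continue
        raw_time, author, text = match.groups()
        stamp = datetime.strptime(raw_time, "%d/%m/%y, %H:%M")
        # Assign IDs in order of first appearance: User1, User2, ...
        user_id = name_to_id.setdefault(author, f"User{len(name_to_id) + 1}")
        # The message text is discarded; only its length is kept.
        anonymized.append(
            f"{stamp.strftime('%d.%m.%Y, %H:%M')} - {user_id}: {len(text)} chars"
        )
    return anonymized, name_to_id
```

Returning the name-to-ID mapping separately from the anonymized lines mirrors the description above: the mapping is only needed for the response e-mail to the chat owner and can be discarded afterwards.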

  After the anonymization, the original chat history is deleted.
For all further analysis, only the anonymized version of the
chat is used.
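Because only the anonymized data are kept, every later analysis has to work from timestamps and user IDs alone. As an illustration, the following Python sketch applies two definitions from Section II-C to such data: a session as a run of posts in which consecutive posts are at most t minutes apart, and each post counted as an answer to the previous one. The function names, the input layout (a time-sorted list of `(timestamp, user_id)` tuples), and the choice not to count consecutive posts by the same user as answers are assumptions made for this sketch.

```python
from collections import Counter
from datetime import timedelta


def session_starts(posts, t_minutes):
    """Count how often each user starts a session.

    posts: list of (timestamp, user_id) tuples, sorted by timestamp.
    A post starts a new session if it is the first post or if it was sent
    more than t_minutes after the previous post.
    """
    starts = Counter()
    previous_time = None
    threshold = timedelta(minutes=t_minutes)
    for timestamp, user in posts:
        if previous_time is None or timestamp - previous_time > threshold:
            starts[user] += 1
        previous_time = timestamp
    return starts


def communication_matrix(posts):
    """Count (answering_user, answered_user) pairs over consecutive posts."""
    matrix = Counter()
    for (_, previous_user), (_, user) in zip(posts, posts[1:]):
        # Assumption: a user following up on his or her own post is not
        # counted as an answer.
        if user != previous_user:
            matrix[(user, previous_user)] += 1
    return matrix
```

Calling `session_starts` once per threshold (30, 60, and 1440 minutes) reproduces the repeated analysis described in Section II-C without changing the function itself.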

C. Standard Evaluation

In the next step, the anonymized chat history is statistically evaluated and stored in a database. First, a unique ID for the respective chat is generated. This ID is provided to the chat owner via the response e-mail so that he or she can access the associated visualization. Next, the anonymized chat history is analyzed and different types of statistics are produced:

Temporal Characteristics: In this analysis, the dates of the first post and the last post of the chat are saved. Note that this timespan does not necessarily cover the complete conversation of the chat. Parts of the conversation can be lost when the chat owner changed or reset his or her device, or when the chat owner was added to or removed from the chat. In addition, it is investigated how many posts were sent in specific time intervals, i.e., the number of messages per day, per weekday, and per time of day.

Chat Characteristics: For the whole chat conversation, WhatsAnalyzer counts the number of members, the number of sent messages, and the number of sent media files like photos or videos. Moreover, every post is analyzed with respect to its length. In particular, the numbers of characters in the shortest and in the longest text message are counted and the corresponding users are identified.

User Characteristics: For each user, the number of sent posts is calculated, differentiating between text posts and media posts. A communication matrix is generated, which counts how often each user answers any other user. In this context, each message is considered to be an answer to the previous message. WhatsAnalyzer also determines the frequency of starting a new session per user. A session is defined via a fixed pause threshold t, i.e., a session is a sequence of posts such that any two consecutive posts have not been sent more than t minutes apart. This analysis is repeated with t set to 30, 60, and 1440 minutes (24 hours). For every participant in the chat, it is counted how often he or she had the final say. A message is counted as final say if it was the last message before a discussion break of at least one hour.

After the statistical analyses, the data are stored in a database for later visualization and further evaluations. Once the data are stored, the response e-mail containing the link to the visualized statistics of the chat is sent as described above.

D. Visualization

The last step in the WhatsAnalyzer process is the visualization of the chat analysis. This web page only shows statistics using the anonymized user IDs. However, the chat owner is able to identify all participants by using the list he or she also received in the response e-mail.

Figure 3 shows a screenshot of an exemplary evaluation webpage. In the upper left part, a listing of important statistics can be seen. Here, the date of the first and the last post of the chat, the number of members of the chat, and the number of sent text and media messages are shown. Additionally, the member who wrote the longest and the member who wrote the shortest message are displayed. On the right of the screenshot, a pie chart shows the percentage of messages each member posted during the conversation. In the lower part, the left pie chart shows how often a user answered every other member, while the right chart indicates the number of sent media messages per person as a bar plot. Note that the color assigned to each user is the same for every chart on the web page, so that the identification of particular chat members is simplified. The whole exemplary evaluation webpage can be found at https://goo.gl/hvV8eF.

Fig. 3: Screenshot of the evaluation website

III. DEMONSTRATION

The demonstration shows the capabilities of WhatsAnalyzer. Here, the procedure presented in Figure 1 is shown. For the demonstration, WhatsAnalyzer will run on a local server and give insights into each step of the procedure. The users can test the tool either by sending a personal WhatsApp chat from their smartphone or by sending an exemplary chat from a given demo smartphone.

Figure 4 shows how WhatsAnalyzer can be used in this demonstration. First, as can be seen in the left and the middle part of the figure, the user has to select a WhatsApp chat on the smartphone and send it to WhatsAnalyzer via e-mail. To do so, the user has to open a chat and click the menu button at the top right. Then the user has to select ‘More’, choose ‘E-mail chat’, and click ‘Without media’. Next, an e-mail app is opened and the e-mail with the attached chat history has to be sent to whatsanalyzer@uni-wuerzburg.de. As soon as the e-mail arrives at WhatsAnalyzer’s mail server, the chat is automatically anonymized and evaluated. Afterwards, WhatsAnalyzer replies to the user via e-mail, sending a link and an assignment of the anonymized names to the real names. The link leads to a web page that shows several statistics of the sent chat, as can be seen in the right part of the figure.

(a) Exporting chat    (b) Sending chat    (c) Getting the analysis
Fig. 4: Three steps of how to get a WhatsApp chat analyzed using WhatsAnalyzer

IV. CONCLUSION AND OUTLOOK

Due to increasingly popular mobile messaging applications, the way people communicate has evolved in recent years. To analyze this development, we presented WhatsAnalyzer, a web-based tool to collect and analyze chat histories of the mobile messaging application WhatsApp. Chat histories can be extracted in WhatsApp and sent by e-mail to WhatsAnalyzer, which is currently the only option to access chat data. Users are encouraged to send their chat histories by emphasizing the protection of the users’ privacy. This means that all analyzed and stored data are completely anonymized, keeping only timestamps, anonymized user names, message types, and message lengths. These communication data suffice to analyze the properties of WhatsApp chats, the users, and the communication within. Moreover, some data are visualized and presented to the chat senders to show them some basic insights into their communication.

In future work, it is planned to extend WhatsAnalyzer to add the possibility of analyzing chat histories without anonymization. If the users give their approval, an evaluation of the plain text of the chat can be done. Thus, linguistic analyses could be carried out to get more insights into the communication within WhatsApp. For example, it could be evaluated how the language used evolves during chatting in mobile messaging applications. Other user and group characteristics could also be extracted, such as the ratio of emoticon usage or the mood of conversations.

REFERENCES

[1] C. Boyle, “Messaging App Usage Worldwide: eMarketer’s Updated Forecast, Leaderboard and Behavioral Analysis,” eMarketer, Tech. Rep., 2017. [Online]. Available: http://www.emarketer.com/Chart/Mobile-Phone-Messaging-App-Users-Worldwide-2016-2021-billions-change/209369, http://www.emarketer.com/Chart/Users-of-Select-Mobile-Messaging-Apps-Worldwide-2016-2017-millions/209534
[2] Statista, “Most popular mobile messaging apps worldwide as of January 2017, based on number of monthly active users (in millions),” 2017. [Online]. Available: https://www.statista.com/statistics/258749/most-popular-global-mobile-messenger-apps/
[3] L. Foye, “Mobile & Online Messaging: SMS, RCS & IM Markets 2015-2019,” Juniper Research, Tech. Rep., 2015. [Online]. Available: https://www.juniperresearch.com/press/press-releases/messaging-revenues-down-600m-traffic-up-200pc, https://www.juniperresearch.com/press/press-releases/traffic-from-messaging-reach-438bn-per-day-by-2019
[4] L. Piwek and A. Joinson, “‘What do they Snapchat about?’ Patterns of Use in Time-limited Instant Messaging Service,” Computers in Human Behavior, vol. 54, pp. 358–367, 2016.
[5] P. Fiadino, M. Schiavone, and P. Casas, “Vivisecting WhatsApp through Large-scale Measurements in Mobile Networks,” in ACM SIGCOMM Computer Communication Review, vol. 44, no. 4. ACM, 2014, pp. 133–134.
[6] K. Church and R. de Oliveira, “What’s up with WhatsApp? Comparing Mobile Instant Messaging Behaviors with Traditional SMS,” in Proceedings of the 15th International Conference on Human-computer Interaction with Mobile Devices and Services (MOBILE HCI), Munich, Germany, 2013.
[7] A. Rosenfeld, S. Sina, D. Sarne, O. Avidov, and S. Kraus, “A study of whatsapp usage patterns and prediction models without message content,” arXiv preprint arXiv:1802.03393, 2018.
[8] M. Seufert, T. Hoßfeld, A. Schwind, V. Burger, and P. Tran-Gia, “Group-based communication in whatsapp,” in IFIP Networking Conference (IFIP Networking) and Workshops, 2016. IEEE, 2016, pp. 536–541.
[9] M. Seufert, A. Schwind, T. Hoßfeld, and P. Tran-Gia, “Analysis of group-based communication in whatsapp,” in International Conference on Mobile Networks and Management. Springer, 2015, pp. 225–238.
