ARCOV-19: THE FIRST ARABIC COVID-19 TWITTER DATASET WITH PROPAGATION NETWORKS

Page created by Gerald Nichols
 
CONTINUE READING
ArCOV-19: The First Arabic COVID-19 Twitter Dataset
                         with Propagation Networks

          Fatima Haouari, Maram Hasanain, Reem Suwaileh, Tamer Elsayed
             Computer Science and Engineering Department, Qatar University
      {200159617, maram.hasanain, rs081123, telsayed}@qu.edu.qa

                      Abstract                             Since the first reported case of Novel Coron-
                                                        avirus (COVID-19) in China, in November 2019,
    In this paper, we present ArCOV-19, an Arabic
    COVID-19 Twitter dataset that spans one year,       the COVID-19 topic has drawn the interest of many
    covering the period from 27th of January 2020       Arab users over Twitter. Their interest, reflected
    till 31st of January 2021. ArCOV-19 is the first    in the Arabic content on the platform, has reached
    publicly-available Arabic Twitter dataset cov-      a peak after two months when the first case was
    ering COVID-19 pandemic that includes about         reported in the United Arab Emirates late in Jan-
    2.7M tweets alongside the propagation net-          uary 2020. This ongoing pandemic has, unsur-
    works of the most-popular subset of them (i.e.,
                                                        prisingly, spiked discussions on Twitter covering a
    most-retweeted and -liked). The propagation
    networks include both retweets and conversa-
                                                        wide range of topics such as general information
    tional threads (i.e., threads of replies). ArCOV-   about the disease, preventive measures, procedures
    19 is designed to enable research under several     and newly-enforced decisions by governments, up-
    domains including natural language process-         to-date statistics of the spread in the world, and
    ing, information retrieval, and social comput-      even the change in our daily habits and work styles.
    ing. Preliminary analysis shows that ArCOV-            In this work, we aim to support future research
    19 captures rising discussions associated with      on social media during this historical period of our
    the first reported cases of the disease as they
                                                        time by curating an Arabic dataset (ArCOV-19)
    appeared in the Arab world. In addition to the
    source tweets and propagation networks, we          that exclusively covers tweets about COVID-19.
    also release the search queries and language-       We limit the dataset to Arabic since it is among
    independent crawler used to collect the tweets      the most dominant languages in Twitter (Alshaabi
    to encourage the curation of similar datasets.      et al., 2020a), yet under-studied in general.
                                                           ArCOV-19 is the first Arabic Twitter dataset
1   Introduction                                        designed to capture tweets discussing COVID-19
Twitter streams hundreds of millions of tweets          starting from January 27th 2020 till end of January
daily. In addition to being a medium for the spread     20211 , constituting about 2.7M tweets, alongside
and consumption of news, it has been shown to           the propagation networks of the most-popular sub-
capture the dynamics of real-world events includ-       set of them. To our knowledge, there is no publicly-
ing the spread of diseases such as the seasonal         available Arabic Twitter dataset for COVID-19 that
influenza (Kagashe et al., 2017) or more severe         includes the propagation networks of a good subset
epidemics like Zika (Vijaykumar et al., 2018),          of its tweets. Some existing efforts have already
Ebola (Roy et al., 2020), and H1N1 (McNeill et al.,     started to curate COVID-19 datasets including Ara-
2016). Moreover, collective conversations on Twit-      bic tweets, but Arabic is severely under-represented
ter about an event can have a great influence on the    (e.g., (Chen et al., 2020a; Singh et al., 2020)) or
event’s outcomes, e.g., US 2016 presidential elec-      represented by a random sample that is not specifi-
tions (Grover et al., 2019). Analyzing tweets about     cally focused on COVID-19 (e.g., (Alshaabi et al.,
an event, as it evolves, offers a great opportunity     2020b)), or the dataset is limited in the period it
to understanding its structure and characteristics,     covers and does not include the propagation net-
informing decisions based on its development, and       works (e.g., (Alqurashi et al., 2020)).
anticipating its outcomes as represented in Twitter        1
                                                             The dataset will be continuously augmented with new
and, more importantly, in the real world.               tweets over the coming months.
The contribution of this paper is three-fold:        constituting 65% and 12% of the tweets respec-
                                                         tively. Chen et al. (2020b) are continously collect-
    • We release ArCOV-19,2 the first Arabic Twit-       ing tweets since 21st of January 2020, where at
      ter dataset about COVID-19 that comprises          the time of writing, around 66% and 11% of the
      about 2.7M tweets collected via Twitter search     tweets are in English and Spanish respectively. An-
      API. It covers a full year allowing for captur-    other multilingual ongoing collection is released
      ing discussions on the topic since it started to   by Lopez et al. (2020). Qazi et al. (2020) enriched
      be popular in the Arab world. In addition to       their large-scale multilingual Twitter dataset with
      the tweets, ArCOV-19 includes propagation          geolocation information.
      networks of the most popular subset, search           The only raw Arabic Twitter dataset available is
      queries, and documented implementation of          the one released by Alqurashi et al. (2020), how-
      our language-independent tweets crawler.           ever it covers the period from 1st of January to 15th
    • We present a preliminary analysis on ArCOV-        April 2020 only. Moreover, it does not include the
      19, which reveals insights from the dataset        propagation networks of tweets.
      concerning temporal, geographical, and topi-          Compared to existing datasets, we release an
      cal aspects.                                       only-Arabic collection of tweets, where we exclude
    • We suggest several use cases to enable re-         retweets to avoid having copies of the same tweet.
      search on Arabic tweets in different research      Moreover, we also release the most-popular sub-
      areas including, but not limited to, emergency     set of them (i.e., most-retweeted and -liked and
      management, misinformation detection, and          spam free tweets). Furthermore, we collect the
      social analytics.                                  propagation networks including both retweets and
                                                         conversational threads (i.e., threads of replies) for
   The reminder of the paper is organized as fol-        the most-popular subset.
lows. We present the crawling process followed
to acquire ArCOV-19 in Section 3. We then thor-          3       Data Collection
oughly discuss the analysis that we conducted on
the dataset in Section 4. We suggest some use cases      ArCOV-19 includes two major components: the
to enable research on Arabic tweets in Section 5.        source tweets (i.e., tweets collected via Twitter
We finally conclude in Section 6.                        search API every day) and the propagation net-
                                                         works (i.e., retweets and conversational threads
2       Related Work                                     of a subset of the source tweets). In this section,
                                                         we present how we collected each in Sections 3.1,
Social media platforms and Twitter specifically          and 3.2 respectively, and summarize the released
showed to be an indispensable medium for sharing         data in Section 3.3.
information and discussions about the COVID-19
pandemic since December 2019. A considerable             3.1      Tweets Collection
body of raw social media datasets were released          To collect the source tweets, we used our tweet
to facilitate analyzing these discussions including      crawler3 that uses Twitter search API4 . The crawler
Twitter datasets (Banda et al., 2020; Lopez et al.,      takes a set of manually-crafted queries, comprising
2020; Chen et al., 2020b; Qazi et al., 2020; Gao         a target topic, as input. At the end of each day, the
et al., 2020; Alqurashi et al., 2020; Dashtian and       crawler issues a search request for each of those
Murthy, 2021), Instagram (Zarei et al., 2020), or        queries. Queries can be keywords (e.g., “Corona”),
the Chinese social media platform Weibo (Hu et al.,      phrases (e.g., “the killing virus”), or hashtags (e.g.,
2020; Gao et al., 2020).                                 “#the new coronavirus”). Twitter returns a maxi-
   Banda et al. (2020) released a multilingual           mum of 3,200 tweets per query. We customized
dataset that includes over 800M tweets, the data is      the search requests to return only Arabic tweets,5
collected since 1st of January 2020 using Twitter        and to exclude all retweets to avoid having copies
streaming API. Dashtian and Murthy (2021), col-
                                                             3
lected multilingual tweets posted between March               https://gitlab.com/bigirqu/ArCOV-19/
                                                         -/tree/master/code/crawler
and July 2020. The most dominant languages                  4
                                                              https://developer.twitter.com/en/
covered by their dataset are English and Spanish         docs/tweets/search/api-reference/
                                                         get-search-tweets
    2                                                       5
        https://gitlab.com/bigirqu/ArCOV-19/                  Language of tweets to return is a configurable parameter.
of the same tweet. Additionally, duplicate tweets                retweets using Pickaw,8 a platform for organiz-
(returned by different queries) are removed in each              ing contests on social media, and the replies using
day. Finally, tweets are sorted chronologically.                 PHEME (Zubiaga et al., 2016) Twitter conversation
   We started collecting our data since 27th of Jan-             collection script.9
uary 2020 using a set of queries that we manually-
updated based on our daily tracking of trending                  3.3    Data Release
keywords and hashtags. For example, starting from                In summary, we release the following resources
16th of December 2020, we extended our queries                   as ArCOV-19 dataset, taking into consideration
to cover tweets related to COVID-19 vaccines as                  Twitter content redistribution policy:10
that sub-topic was gaining interest on Arabic social                • Source Tweets: IDs of the tweets crawled in
media. The full list of queries used in each day                      each day.
is released alongside our dataset. We denote all                    • Search Queries: the list of search queries, in-
collected tweets as source tweets.                                    cluding keywords, phrases, and hashtags, used
   Due to technical reasons, we missed collecting                     in each period to collect our source tweets.
tweets for a few days. Since Twitter search API                     • Top Subset: IDs of the top 1,000 most popu-
limits the search results to the past 7 days, we                      lar tweets for each day.
missed the old tweets. To overcome that, we used                    • Propagation Networks: the propagation net-
GetOldTweets36 python library to download the                         works for the top subset which include for
search results using the same trending keywords                       each tweet in the top subset:
and hashtags we selected around those days.                              – Retweets: IDs of the full retweet set.
                                                                         – Conversational Threads: tweet IDs of
3.2   Propagation Networks Collection
                                                                            the full reply thread (including direct and
In addition to the source tweets, we also collected                         indirect replies).
the propagation networks (i.e., retweets and con-                Along with the dataset, we provide some pointers
versational threads) of the top 1000 most popular                to publicly-available crawlers that users can easily
(i.e., top-retweeted & -liked) tweets each day. To               use to crawl the tweets given their IDs.
our knowledge, this is the first Arabic tweet dataset
to include such propagation networks.                            4     ArCOV-19 in Numbers
   Before getting the most popular tweets on any
                                                                 In this section, we present a statistical summary and
day, we started from source tweets collected in
                                                                 conduct an analysis on ArCOV-19 to shed some
that day and applied a qualification pipeline. We
                                                                 light on its major characteristics.
first excluded tweets containing any inappropriate
word (using a list of inappropriate words we con-                4.1    Tweets and Users Distribution
structed). Next, tweets with more than two URLs
or four hashtags, or shorter than four tokens (all               Table 1 presents an overall summary of the tweets
are potentially spam) were dropped. Additionally,                and users statistics in ArCOV-19. It indicates that
duplicate tweets that have exact textual content are             the total number of source tweets in the dataset
also dropped to avoid redundancy; only the most                  is about 2.7M posted by over 690k unique users.
popular of them (according to our scoring criterion              We note that 18.66% of the tweets were posted by
below) is kept. Qualified tweets are then scored by              verified users (who constitute only 0.81% of the
popularity defined by the sum of the tweet’s retweet             unique users). This is a relatively large percentage,
and favorite counts. We finally sort the qualified               showing that a good portion of the source tweets
tweets by their scores and select the most popular               are popular. The average numbers of followers,
1,000 tweets. We denote the set of all such tweets               friends, and statuses of users are also relatively
over all days as the top subset.                                 large, showing that users observed in ArCOV-19
   For those 1K top tweets per day, we then col-                 are also popular (and possibly more influential).
lected all retweets and conversational threads (i.e.,            rest; we will make all available in the near future.
direct and indirect replies).7 We collected the                      8
                                                                       https://pickaw.com/en/
                                                                 twrench-becomes-pickaw
   6                                                                 9
     https://github.com/Mottl/                                         https://github.com/azubiaga/
GetOldTweets3                                                    pheme-twitter-conversation-collection
   7                                                                10
     At the time of writing, we have collected the replies for         https://developer.twitter.com/en/
tweets until end of April 2020 and we are still collecting the   developer-terms/agreement-and-policy
550,000
The table also indicates that 25.40% of the tweets         500,000
                                                                                                                                  Posted by non-verified users                          Posted by verified users

include URLs. We anticipate that this is due to the        450,000
                                                           400,000
extensive spreading of news (linked from tweets)           350,000
during this period. The numbers of tweets that are         300,000
                                                           250,000
geotagged and geolocated are also indicated in the         200,000

table; however, we defer the discussion of such            150,000
                                                           100,000
types of tweets to Section 4.3.                             50,000

   Figure 1 illustrates the monthly distribution of                 0

                                                                                                                                                                                                                    Dec 2020
                                                                                                    Mar 2020

                                                                                                                              May 2020
                                                                                                               Apr 2020

                                                                                                                                                                                                      Nov 2020
                                                                                                                                                                  Aug 2020
                                                                          Jan 2020

                                                                                                                                                                             Sep 2020

                                                                                                                                                                                                                               Jan 2021
                                                                                                                                                                                         Oct 2020
                                                                                                                                                       Jul 2020
                                                                                                                                         Jun 2020
                                                                                     Feb 2020
tweets in ArCOV-19. The volume of tweets dras-
tically increased in March when the virus started
to spread in several Arab countries. Then, tweets         Figure 1: Monthly distribution of source tweets of
number started to decrease monthly. We speculate          ArCOV-19.
this is due to the fact that the topic was not heavily
                                                                  Akhbar | ‫أﺧﺑﺎر اﻵن‬                                                                                                                                           11,086
discussed as in the first days. In December, we                          ‫ﺻﺣﯾﻔﺔ اﻟﺑﯾﺎن‬                                                                                                                                     10,223
                                                                   Shorouk News                                                                                                                         9,031
expanded our queries list to include hashtags and         Bawabaa News | ‫ﺑواﺑﺔ اﻷﺧﺑﺎر‬                                                                                                               8,453
                                                                Assawsana News                                                                                           6,962
keyword related to more current sub-topics such as                         ‫اﻟﯾوم اﻟﺳﺎﺑﻊ‬
                                                                           ‫ﺑواﺑﺔ اﻟوطن‬
                                                                                                                                                                     6,224
                                                                                                                                                                  5,810
vaccines and the new variant of the virus, which                      NEWS KTV
                                                                                ‫ﻛوروﻧﺎ‬                                                                          5,517
                                                                                                                                                              5,263

resulted in an increase in the volume of tweets.                            Mulhak
                                                                               ‫ﻗﻧﺎة اﻟﻐد‬
                                                                                                                                                            5,019
                                                                                                                                                            4,953
                                                                    AlMamlakaTV                                                                            4,890
   Figure 2 presents the top 25 tweeting users in                       ‫ﺻﺣﯾﻔﺔ اﻟﺧﻠﯾﺞ‬                                                                       4,844
                                                                    ALAYAM - ‫اﻷﯾﺎم‬                                                                        4,699
ArCOV-19. We notice that several of them are news                           24‫اﻟﺷﺎرﻗﺔ‬                                                                    4,584
                                                                        ‫ﺟرﯾدة اﻟدﺳﺗور‬                                                                    4,536
sources, and 20 are verified Twitter accounts.               MENAFN.com Arabic
                                                                     ‫ﺳﻛﺎي ﻧﯾوز ﻋرﺑﯾﺔ‬
                                                                                                                                                      4,137
                                                                                                                                                      4,124
                                                                         ‫اﻟﻌرﺑﻲ اﻟﺟدﯾد‬                                                               4,068
                                                                        ‫ﺻﺣﯾﻔﺔ اﻻﺗﺣﺎد‬                                                                 4,051
 Tweets Statistics                                                        ‫ﺟرﯾدة اﻟوطن‬
                                                                Independent ‫ﻋرﺑﯾﺔ‬
                                                                                                                                                     4,021
                                                                                                                                                    3,989
                                                                            ‫ﺻدى اﻟﺑﻠد‬                                                               3,975
 Source Tweets                2,675,049                               ‫اﻟوﻗﺎﺋﻊ اﻻﺧﺑﺎرﯾﺔ‬
                                                                         ‫ﺻﺣﯾﻔﺔ ﻋﻛﺎظ‬
                                                                                                                                                    3,968
                                                                                                                                                    3,912
 Geotagged                        2,078    (0.08%)                                              0                         2,000             4,000                 6,000                 8,000                    10,000             12,000
                                                                                                                                                             Number of tweets
 Geolocated                      60,873    (2.28%)
 Posted by verified users       499,351    (18.66%)       Figure 2: Number of tweets posted by the top 25 tweet-
 Include URL                    679,482    (25.40%)       ers in ArCOV-19 for the period of 27th Jan 2020 to 31st
 Include Media                  973,952    (36.41%)       Jan 2021.
 Top Subset                     370,132    (13.84%)
 Retweets of Top Subset       7,925,821
 Replies of Top Subset        1,476,950                   country names for the tweets posted in the period
                                                          of 27th Jan to 31st Mar 2020. We focus on this
 Users Statistics                                         critical period since the virus started to spread in
 Unique                         690,339                   the Arab world during that time. We then tracked
 Verified                         5,575    (0.81%)        their frequency over time. Figure 3 shows the time
 Average followers count          4,630                   series (over days) of the three types of keywords.11
 Average friends count              918                      The 10 most-frequent words shown in Fig-
 Average statuses count           9,054                   ure 3(a) indicate two different types of words:
                                                          those that are directly related to COVID-19 (e.g.,
Table 1: Tweets and users statistics of ArCOV-19. Stats
                                                          “health”) and those that are not but related to prayers
for replies are for tweets until end of April 2020.
                                                          and supplications (e.g., “Allah/God”)). It is inter-
                                                          esting to see the word “Allah” is very frequent
4.2   Tweets Content & Topics                             early on when the news about the virus started to
It is important to demonstrate that the tweets in         spread (probably over discussions around whether
ArCOV-19 constitute a good representative sample          the pandemic is a punishment from God or not, we
of the Arabic tweets posted during the target period      believe), then declines over time only to become
on COVID-19, and that they cover the prevalent            frequent again when the virus started to widely
topics discussed over Twitter during that period.         spread in the Arab world.
To help examine this hypothesis, we analyzed the               11
                                                               When identifying the most frequent words and hashtags,
textual content of the tweets; in particular, we iden-    we excluded the ones we used in our search queries since they
tified the most-frequent words, hashtags, and Arab        are expected to be very frequent by definition.
30%
                                                                 ‫هللا‬                                                                                                                                                                     ‫هللا‬                       ‫حالة‬                             ‫بفيروس‬                   ‫الصحة‬                  ‫بسبب‬
                    25%                                                                                                                                                                                                                   ‫اصابة‬                      ‫العالم‬                           ‫انتشار‬                   ‫اللهم‬                  ‫عدد‬

                                                                                                                                                       ‫اصابة‬
                    20%                                                                               ‫بفيروس‬
                                                                                                      ‫حالة‬
         % tweets

                                                                                                                ‫الصحة‬                      ‫بسبب‬
                    15%
                                                                                                                                                                                                                                                                                                             ‫اللهم‬                       ‫انتشار‬
                    10%
                                                                                                                                                                                                                                                                                                                                                        ‫العالم‬

                    5%

                    0%                                                                                                                                                                                                                    ‫عدد‬

                                                                                                                                                                                                        01-Mar
                                                                                                                                                                                                                       03-Mar
                                                                                                                                                                                                                                 05-Mar
                                                                                                                                                                                                                                          07-Mar
                                                                                                                                                                                                                                                   09-Mar
                                                                                                                                                                                                                                                            11-Mar
                                                                                                                                                                                                                                                                        13-Mar
                                                                                                                                                                                                                                                                                           15-Mar
                                                                                                                                                                                                                                                                                                    17-Mar
                                                                                                                                                                                                                                                                                                             19-Mar
                                                                                                                                                                                                                                                                                                                      21-Mar
                                                                                                                                                                                                                                                                                                                                23-Mar
                                                                                                                                                                                                                                                                                                                                         25-Mar
                                                                                                                                                                                                                                                                                                                                                  27-Mar
                                                                                                                                                                                                                                                                                                                                                            29-Mar
                                                                                                                                                                                                                                                                                                                                                                     31-Mar
                          27-Jan
                                    29-Jan
                                               31-Jan
                                                        02-Feb
                                                                  04-Feb
                                                                           06-Feb
                                                                                    08-Feb
                                                                                             10-Feb
                                                                                                      12-Feb
                                                                                                                14-Feb
                                                                                                                         16-Feb
                                                                                                                                  18-Feb
                                                                                                                                           20-Feb
                                                                                                                                                    22-Feb
                                                                                                                                                               24-Feb
                                                                                                                                                                                26-Feb
                                                                                                                                                                                            28-Feb
                                                                                                                         (a) Top 10 most frequent words

                    20%
                                                                                                                              _‫كورونا‬#                                                                                                                                           #‫الكويت‬                                           #‫الصين‬
                                                                                                                                 ‫الكويت‬                                                                     _‫كورونا‬#                                                             #‫إيران‬                                            #‫عاجل‬
                                                   ‫الصين‬#
                                                                                                                                   _‫كورونا‬#                                                                 ‫في_الكويت‬                                                            #‫كورونا_في_الكويت‬                                 #‫السعودية‬
                    15%
                                                                                                                                        ‫لبنان‬                                                                                                                                    #‫كورونا_الكويت‬                                    #‫كورونا_لبنان‬
                                                                                                                                                                                                                                                                                 #‫لبنان‬                                            #covid19
                                                                                                                            ‫لبنان‬#
         % tweets

                                                                                                                                                                                                        ‫الكويت‬#                                                                                     ‫عاجل‬#
                    10%
                                                                                                                    ‫إيران‬#
                                                                                                                                                                                                                                                                                   #covid19
                    5%
                                             ‫السعودية‬#

                    0%
                                                                                                                                                                                                                                                            11-Mar
                                                                                                                                                                                                                                                                        13-Mar
                                                                                                                                                                                                                                                                                           15-Mar
                                                                                                                                                                                                                                                                                                    17-Mar
                                                                                                                                                                                                                                                                                                             19-Mar
                                                                                                                                                                                                                                                                                                                      21-Mar
                                                                                                                                                                                                                                                                                                                                23-Mar
                                                                                                                                                                                                                                                                                                                                         25-Mar
                                                                                                                                                                                                                                                                                                                                                   27-Mar
                                                                                                                                                                                                                                                                                                                                                            29-Mar
                                                                                                                                                                                                                                                                                                                                                                     31-Mar
                                                        2-Feb
                                                                  4-Feb
                                                                           6-Feb
                                                                                    8-Feb
                          27-Jan
                                    29-Jan
                                               31-Jan

                                                                                             10-Feb
                                                                                                      12-Feb
                                                                                                                14-Feb
                                                                                                                         16-Feb
                                                                                                                                  18-Feb
                                                                                                                                           20-Feb
                                                                                                                                                    22-Feb
                                                                                                                                                               24-Feb
                                                                                                                                                                                 26-Feb
                                                                                                                                                                                            28-Feb
                                                                                                                                                                                                        1-Mar
                                                                                                                                                                                                                       3-Mar
                                                                                                                                                                                                                                 5-Mar
                                                                                                                                                                                                                                          7-Mar
                                                                                                                                                                                                                                                   9-Mar

                                                                                                                   (b) Top 10 most frequent hashtags
                                                                                                                         Time of first reported infected case
                                                                                                                                                                                                                                                                                                                                    ‫ليبيا‬
                                                                                                                ‫مصر‬

                                                                                                                                              ‫لبنان‬

                                                                                                                                                             ‫الكويت عمان البحرين العراق‬

                                                                                                                                                                                                                      ‫قطر‬
                                                                                                                                                                                               ‫السعودية تونس المغرب األردن‬

                                                                                                                                                                                                                                                                                                                         ‫سوريا‬
                                    ‫اإلمارات‬

                                                                                                                                                                                 ‫الجزائر‬

                                                                                                                                                                                                                                                                       ‫السودان موريتانيا‬
                                                                                                                                                                                                                                 ‫فلسطين‬

                    12%

                                                                                     ‫مصر‬                             ‫لبنان‬                                                                 ‫الكويت‬                                                                     ‫الكويت‬                           ‫مصر‬                      ‫لبنان‬                   ‫السعودية‬
                                                                                                                                                                                                                                                                      ‫االمارات‬                         ‫العراق‬                   ‫البحرين‬                 ‫قطر‬
                    10%                                                                                                                                                                    ‫البحرين‬
                                                                                                                         ‫العراق‬                                                                                                                                       ‫اليمن‬                            ‫االردن‬                   ‫الجزائر‬                 ‫سوريا‬

                     8%                                                                                                                                                                                                                                               ‫المغرب‬                           ‫السودان‬                  ‫تونس‬                    ‫فلسطين‬
         % tweets

                                                                                                                                                                                                                                                                      ‫ليبيا‬                            ‫عمان‬                     ‫موريتانيا‬               ‫الصومال‬

                     6%                                                                                                                                                                                                         ‫الجزائر‬
                                                    ‫اإلمارات‬                                                                                                                       ‫عمان‬
                     4%                                              ‫السعودية‬                                                                                                                                                                                                                          ‫قطر‬
                                                                                                                                                                                                                                                                                                                                                  ‫ليبيا‬
                     2%

                     0%
                                                                                                                                                                                                                                                                                                    17-Mar

                                                                                                                                                                                                                                                                                                                                                                     31-Mar
                                                                                                                                                                                                        01-Mar
                                                                                                                                                                                                                       03-Mar
                                                                                                                                                                                                                                 05-Mar
                                                                                                                                                                                                                                          07-Mar
                                                                                                                                                                                                                                                   09-Mar
                                                                                                                                                                                                                                                            11-Mar
                                                                                                                                                                                                                                                                        13-Mar
                                                                                                                                                                                                                                                                                           15-Mar

                                                                                                                                                                                                                                                                                                             19-Mar
                                                                                                                                                                                                                                                                                                                      21-Mar
                                                                                                                                                                                                                                                                                                                                23-Mar
                                                                                                                                                                                                                                                                                                                                         25-Mar
                                                                                                                                                                                                                                                                                                                                                  27-Mar
                                                                                                                                                                                                                                                                                                                                                            29-Mar
                           27-Jan
                                     29-Jan
                                               31-Jan
                                                        02-Feb
                                                                  04-Feb
                                                                           06-Feb
                                                                                    08-Feb
                                                                                             10-Feb
                                                                                                       12-Feb
                                                                                                                14-Feb
                                                                                                                         16-Feb
                                                                                                                                  18-Feb
                                                                                                                                           20-Feb
                                                                                                                                                    22-Feb
                                                                                                                                                                24-Feb
                                                                                                                                                                                 26-Feb
                                                                                                                                                                                            28-Feb

                                                                                                                (c) Word frequency of Arab countries

Figure 3: Time series of the frequency (in percentage of tweets) of top general keywords, hashtags, and Arab
country words for the period of 27th Jan to 31st Mar. The time of first reported cases in Arab countries is also
indicated and aligned with the time series.
youm7.com                                                                5420
   Figure 3(b) demonstrates how “#China” hashtag                              alarabiya.net
                                                                              youtube.com                                        2984
                                                                                                                                        3405

                                                                         elwatannews.com
was the most trending one from 27 of January until                            alaraby.co.uk
                                                                                                                              2769
                                                                                                                            2569
                                                                               aawsat.com                                2348
20 of February, as COVID-19 was prevalent only                                     sabq.org
                                                                   independentarabia.com
                                                                                                                         2342
                                                                                                                       2143
in China and still not widely spread (at that time) in                   shorouknews.com
                                                                                    cnn.com
                                                                                                                      2041
                                                                                                                     1989
                                                                           akhbaralaan.net                          1944
other countries. We can see how it started to be less                                 rt.com
                                                                                alittihad.ae
                                                                                                                    1913
                                                                                                                    1901
trending as the number of cases started to decline                      skynewsarabia.com
                                                                               argaam.com
                                                                                                                   1792
                                                                                                                  1705
                                                                                      ajel.sa                  1432
in China by that date.12 On the other hand, when                               royanews.tv
                                                                                  aa.com.tr
                                                                                                              1405
                                                                                                              1379
COVID-19 started to spread in the Arab world,                                          ow.ly
                                                                             euronews.com
                                                                                                             1346
                                                                                                             1306
                                                                                       24.ae                1239
other hashtags started to become viral. Further-                            ampproject.org
                                                                                 al-ain.com
                                                                                                           1231
                                                                                                           1218
more, Figure 3(b) shows that frequency spikes of                                     pscp.tv
                                                                     forbesmiddleeast.com
                                                                                                           1200
                                                                                                           1185

“#Iran”, “#Lebanon”, and “#Kuwait” exactly match                                                0   1000         2000        3000          4000   5000     6000
                                                                                                                  Number of occurrences
the confirmation dates of first reported cases in
Iran,13 Lebanon,14 and Kuwait,15 respectively.                     Figure 4: Most frequently-linked domains in the top
   To further analyze trending topics, Figure 3 fea-               tweets subset for the period of 27th Jan 2020 to 31st
tures the timeline of the first reported cases in the              Jan 2021.
Arab countries and aligns them with the time se-
ries throughout the figure. Aligning the timeline                  ArCOV-19 by considering the domains of URLs
with the series in Figure 3(c) reveals a significant               shared in the tweets. We focused on the top tweets
match between the frequency peaks of several coun-                 subset of ArCOV-19 (constructed as detailed in Sec-
try names and the corresponding dates of first re-                 tion 3.2) and identified the URLs posted in those
ported cases in those countries, most notably in                   tweets. We then expanded them (since Twitter
UAE, Egypt, Lebanon, Kuwait, Oman, Bahrain,                        URLs are shortened) and dropped URLs of images
Algeria, and Libya.                                                and videos uploaded to Twitter, to focus only on
   Figure 3 also demonstrates the power of ArCOV-                  URLs referring to external sources. We extracted
19 in capturing further controversial and trending                 110,220 URLs from 2,617 unique domains. Fig-
topics. Table 2 shows a timeline covering dates                    ure 4 shows that URLs from news websites are
of specific topics of discussion trending on social                dominantly the most-commonly shared. We ob-
media around times of peaks in tweeting frequency                  serve that these news websites mainly originate
in Figure 3(c).                                                    from three Arab countries, namely, Egypt, Saudi
 Country   Date      Topic                 Related News            Arabia, and United Arab Emirates. Interestingly,
 UAE       Feb 19    Yemeni Foreign        http://tiny.cc/71qtmz   videos from YouTube are among the most com-
                     Affairs Minister
                     thanks UAE for
                                                                   monly shared media. Coupled with the fact that
                     evacuating Yemeni                             36% of the tweets in ArCOV-19 include embedded
                     students from
                     China
                                                                   images and videos (Table 1), we believe this en-
 Iraq      Feb 20    Iraq announces        http://tiny.cc/yzrtmz   ables ArCOV-19 to be a potential dataset to further
                     closure of borders
                     with Iran
                                                                   support the evaluation of multi-modal retrieval and
 Egypt     Mar 2     Egyptian health       http://tiny.cc/7iqtmz   classification systems.
                     minister announces
                     a visit to China
           Mar 7     Kuwait bans travels   http://tiny.cc/lqstmz   4.3         Geographic Distribution of Tweets
                     with Egypt
                     including entry of                            Although Twitter provides automatic geo-reference
                     Egyptian residents                            functionally, few users (solely around 1-3% (Mur-
 KSA       Mar 4-5   Closure of The        http://tiny.cc/iqrtmz
                     Grand Mosque in                               dock, 2011; Jurgens et al., 2015)) opt to enable it
                     Mecca                                         due to privacy and safety reasons. Alternatively, to
                                                                   have an insight about the geographical distribution
Table 2: Examples of trending topics in social media
matched by spikes in tweeting frequency in ArCOV-19                and diversity of the tweets in our dataset, we ex-
                                                                   amined the place and coordinates attributes of the
   We further explore the topics discussed in                      Tweet object.16 We note that the place attribute is
                                                                   an optional attribute that allows the user to select a
  12
     https://tinyurl.com/rfj5khn
  13                                                                   16
     https://tinyurl.com/yykgwbox                                     https://developer.twitter.com/en/
  14
     https://tinyurl.com/yxkg78g5                                  docs/tweets/data-dictionary/overview/
  15
     https://tinyurl.com/y4nlgz5h                                  geo-objects
105
location from a list provided by Twitter (therefore,
the location might not necessarily show the actual                             104
location from where the tweet is posted), whereas

                                                          Number of retweets
the coordinates attribute represents the geographic                            103

location of the tweet as reported by the user or
client application.                                                            102

   Table 1 shows that ArCOV-19 has 60,873 ge-
                                                                               101
olocated tweets (i.e., having values in the place
attribute) and 2,078 geotagged tweets (i.e., having                            100
values in the coordinates attribute). Those tweets

                                                                                 27/01
                                                                                 29/01
                                                                                 31/01
                                                                                 02/02
                                                                                 04/02
                                                                                 06/02
                                                                                 08/02
                                                                                 10/02
                                                                                 12/02
                                                                                 14/02
                                                                                 16/02
                                                                                 18/02
                                                                                 20/02
                                                                                 22/02
                                                                                 24/02
                                                                                 26/02
                                                                                 28/02
                                                                                 01/03
                                                                                 03/03
                                                                                 05/03
                                                                                 07/03
                                                                                 09/03
                                                                                 11/03
                                                                                 13/03
                                                                                 15/03
                                                                                 17/03
                                                                                 19/03
                                                                                 21/03
                                                                                 23/03
                                                                                 25/03
                                                                                 27/03
                                                                                 29/03
                                                                                 31/03
                                                                                 02/04
                                                                                 04/04
                                                                                 06/04
                                                                                 08/04
                                                                                 10/04
                                                                                 12/04
                                                                                 14/04
                                                                                 16/04
                                                                                 18/04
                                                                                 20/04
                                                                                 22/04
                                                                                 24/04
                                                                                 26/04
                                                                                 28/04
                                                                                 30/04
were posted by 24,072 and 256 unique users re-
spectively. The geolocated tweets constitute about           Figure 6: Distribution of retweets per day for the top
2.28% of the total source tweets, which is con-              subset for the period of 27th Jan 2020 to 30th April
sistent with previous studies (Huang and Carley,             2020.
2019).
                                                             among the tracking keywords in the period between
  Libya
  1.7%
  Bahrain
                                                             22 of February and 13 of March. Furthermore, it is
  1.9%
  Qatar
                                                             not surprising to see the countries that have a few
  2.1%
  Jordan
                                                             cases have the smallest portions of contribution to
  3.0%
  Iraq                                           KSA         the content. Others have small audience userbases
  4.2%                                          41.7%
  Lebanon                                                    on Twitter (e.g., Tunisia).
  5.3%
  Egypt
  5.4%                                                       4.4                      Propagation Networks
  Oman
  6.6%                                                     As discussed earlier, we collected the propagation
  UAE
                                                           networks of the top subset. Overall, the number of
  6.8%                                                     retweets is 7,925,821 for the entire subset, and the
  Other                                        Kuwait
  7.3%                                          9.6%       number of replies is 1,476,950 18 .
                                                              At the time of collecting the retweets, we were
Figure 5: Distribution of geolocated tweets for the pe-    not able to get the retweets of some of the tweets ei-
riod of 27th Jan 2020 to 31st Jan 2021.                    ther because they were deleted or they were posted
                                                           from private accounts. For those collected suc-
   The geolocated tweets were indeed posted by             cessfully, Figure 6 illustrates the distribution of
users from 102 countries from around the globe.            retweets per day using boxplots. It shows that the
We found that 92.75% of them were posted from              average (and median) number of retweets follows
the Arab world. The largest contribution of the ge-        a similar pattern to what was shown in Figure 1
olocated content (about 41.7%) comes from Saudi            (notice that the Y-axis here is in log scale). We also
Arabia; this is somewhat expected as Saudi users           notice that a good number of tweets got more than
represent the highest number of active Twitter users       100 retweets; some of them got even larger num-
in the Arab world.17 Surprisingly, Kuwait comes            bers reaching about 10k retweets or more, showing
second with 9.6% of the geolocated content. We             highly propagated content. As shown in Figure 7,
believe the rationale is that it was among the first       we applied a similar analysis to the replies per day,
countries that reported COVID-19 cases in the Mid-         and found similar patterns to the retweets.
dle East. Since then, Kuwait started a series of strict
precautions such as a wide lock-down in many vi-             5                       Enabling Research
tal facilities until the government had imposed a
                                                            With the spread of Novel Coronavirus in several
nationwide curfew. We think, after the curfew, the
                                                            Arab countries and the subsequent procedural mea-
people become more active on Twitter as a plat-
                                                            sures taken by the local governments, it continued
form to break news and discuss developments of
                                                            to dominate discussions over social media (and
the virus. Additionally, we used related phrases
                                                                                18
and hashtags to “Kuwait”, “Lebanon”, and “UAE”                    The stats are for the replies for data until end of April
                                                             2020 and we are still collecting the rest; we will make all
  17
       https://tinyurl.com/jmkttt3                           available in the near future.
Misinformation Detection: With the sheer
                    104
                                                                 amount of information shared about COVID-19,
                                                                 many rumors are disseminated and getting high at-
                    103
Number of replies

                                                                 tention from the community, which causes a fast
                                                                 propagation of misinformation. This hinders the
                    102
                                                                 efforts of the health and governmental organiza-
                    101
                                                                 tions on fighting the pandemic as such rumors
                                                                 spread panic and may lead to undesirable conse-
                    100
                                                                 quences (e.g., increase of cases and mortality rate,
                                                                 or lack of supplies due to hoarding). ArCOV-19
                      27/01
                      29/01
                      31/01
                      02/02
                      04/02
                      06/02
                      08/02
                      10/02
                      12/02
                      14/02
                      16/02
                      18/02
                      20/02
                      22/02
                      24/02
                      26/02
                      28/02
                      01/03
                      03/03
                      05/03
                      07/03
                      09/03
                      11/03
                      13/03
                      15/03
                      17/03
                      19/03
                      21/03
                      23/03
                      25/03
                      27/03
                      29/03
                      31/03
                      02/04
                      04/04
                      06/04
                      08/04
                      10/04
                      12/04
                      14/04
                      16/04
                      18/04
                      20/04
                      22/04
                      24/04
                      26/04
                      28/04
                      30/04
                                                                 supports studying information/claims propagation,
    Figure 7: Distribution of replies per day for the top sub-   claims check-worthiness detection and verification
    set for the period of 27th Jan 2020 to 30th April 2020.      tasks on the most popular tweets. Furthermore,
                                                                 the retweet networks and conversational threads
                                                                 provide a valuable resource for early detection of
   Twitter in particular) to the time of this writing. To        fake news and identification of malicious rumor-
   enable research on different tasks on Arabic Tweets,          spreading accounts. To further facilitate work in
   we make ArCOV-19 publicly-available for the re-               this domain of problems, we recently constructed
   search community. Having quantified the topical,              ArCOV19-Rumors, an annotated dataset on top of
   geographical, and community representativeness                ArCOV-19 to support claim and tweet verification
   of ArCOV-19, we envision that it is suitable for              (Haouari et al., 2021).
   diverse natural language processing, information
   retrieval, and computational social science research          6   Conclusion
   tasks including, but not limited to, emergency man-
   agement, misinformation detection, and social ana-            In this paper, we presented ArCOV-19, the first
   lytics, as we discuss below. Furthermore, we pro-             Arabic Twitter dataset about the Novel Coronavirus
   vide propagation networks of daily highly popular             (COVID-19) that includes propagation networks
   subset (the most popular 1K) to ensure the quality            of a large subset of tweets. We release all source
   of tweets. This sample is drawn after filtering out           tweets, top subset, search queries, and the propa-
   potential low-quality tweets (e.g., spam) and con-            gation networks. Preliminary analysis showed that
   tent duplicates. We anticipate the top subset and             ArCOV-19 captured spikes in tweeting frequency
   the network to support these tasks on the popular             for country-specific tweets that are consistent with
   Arabic tweets that we assume have the most effect             the first reported cases of COVID-19 in several
   on shaping public opinion and understanding.                  Arab countries. We also found dominance of news
      Emergency Management: As COVID-19 is an                    agencies among top tweeting users and among most
   international pandemic, national and international            shared URLs. ArCOV-19 enables research under
   health organizations need to analyze the effects of           many domains including natural language process-
   the outbreak. We believe ArCOV-19 can support                 ing, information retrieval, and social computing.
   several tasks in that domain, such as filtering of in-        We plan to continue collecting tweets for the fore-
   formative content, events and sub-events detection,           seeable future and the dataset will be continuously
   summarization, identification of eyewitnesses, ge-            updated with newly collected tweets and propaga-
   olocation, and studying information and situational           tion networks.
   awareness propagation, to name a few.
                                                                 Acknowledgments
      Social Analytics: In addition to sharing reports
   and awareness during the outbreak, people tend to             The work of Tamer Elsayed and Maram Hasanain
   discuss their opinions (e.g., stance towards the situ-        was made possible by NPRP grant# NPRP 11S-
   ation and its consequences) and express their emo-            1204-170060 from the Qatar National Research
   tions (e.g., big changes in lifestyle, social distanc-        Fund (a member of Qatar Foundation). The work
   ing, loss of their beloved ones, etc.). Therefore, an-        of Reem Suwaileh was supported by GSRA grant#
   alyzing the tweets to detect sentiment, stance, hate          GSRA5-1-0527-18082 from the Qatar National Re-
   speech, and generally, offensive language, among              search Fund and the work of Fatima Haouari was
   other aspects, is of interest to many stakeholders.           supported by GSRA grant# GSRA6-1-0611-19074
from the Qatar National Research Fund. The state-        Yong Hu, He-Yan Huang, Anfan Chen, and Xian-Ling
ments made herein are solely the responsibility of         Mao. 2020. Weibo-cov: A large-scale covid-19 so-
                                                           cial media dataset from weibo. In Proceedings of
the authors.
                                                           the 1st Workshop on NLP for COVID-19 (Part 2) at
                                                           EMNLP 2020.
References                                               Binxuan Huang and Kathleen M Carley. 2019. A
                                                           large-scale empirical study of geotagging behavior
Sarah Alqurashi, Ahmad Alhindi, and Eisa Alanazi.          on Twitter. In Proceedings of the 2019 IEEE/ACM
  2020. Large Arabic Twitter Dataset on COVID-19.          International Conference on Advances in Social Net-
  arXiv preprint arXiv:2004.04315.                         works Analysis and Mining, pages 365–373.
Thayer Alshaabi, David R Dewhurst, Joshua R Minot,       David Jurgens, Tyler Finethy, James McCorriston,
  Michael V Arnold, Jane L Adams, Christopher M            Yi Tian Xu, and Derek Ruths. 2015. Geolocation
  Danforth, and Peter Sheridan Dodds. 2020a. The           prediction in twitter using social networks: A critical
  growing echo chamber of social media: Measuring          analysis and review of current practice. In Proceed-
  temporal and social contagion dynamics for over          ings of the Ninth International AAAI Conference on
  150 languages on twitter for 2009–2020. arXiv            Web and Social Media.
  preprint arXiv:2003.03667.
                                                         Ireneus Kagashe, Zhijun Yan, and Imran Suheryani.
Thayer Alshaabi, JR Minot, MV Arnold, Jane Lydia            2017. Enhancing seasonal influenza surveillance:
  Adams, David Rushing Dewhurst, Andrew J Reagan,           topic analysis of widely used medicinal drugs using
  Roby Muhamad, Christopher M Danforth, and Pe-             twitter data. Journal of medical Internet research,
  ter Sheridan Dodds. 2020b. How the world’s collec-        19(9):e315.
  tive attention is being paid to a pandemic: Covid-19
  related 1-gram time series for 24 languages on twit-   Christian E Lopez, Malolan Vasu, and Caleb Galle-
  ter. arXiv preprint arXiv:2003.12614.                    more. 2020.      Understanding the perception of
                                                           covid-19 policies by mining a multilanguage twitter
Juan M Banda, Ramya Tekumalla, Guanyu Wang,                dataset. arXiv preprint arXiv:2003.10359.
  Jingyuan Yu, Tuo Liu, Yuning Ding, and Ger-
  ardo Chowell. 2020. A large-scale covid-19 twit-       Andrew McNeill, Peter R. Harris, and Pam Briggs.
  ter chatter dataset for open scientific research–        2016. Twitter influence on uk vaccination and an-
  an international collaboration.  arXiv preprint          tiviral uptake during the 2009 h1n1 pandemic. Fron-
  arXiv:2004.03688.                                        tiers in Public Health, 4:26.
Emily Chen, Kristina Lerman, and Emilio Ferrara.         Vanessa Murdock. 2011. Your mileage may vary: On
  2020a. Covid-19: The first public coronavirus twit-      the limits of social media. SIGSPATIAL Special,
  ter dataset. arXiv preprint arXiv:2003.07372.            3(2):62–66.
Emily Chen, Kristina Lerman, and Emilio Ferrara.         Umair Qazi, Muhammad Imran, and Ferda Ofli. 2020.
  2020b. Tracking social media discourse about the        Geocov19: a dataset of hundreds of millions of mul-
  covid-19 pandemic: Development of a public coro-        tilingual covid-19 tweets with location information.
  navirus twitter data set. JMIR Public Health and        SIGSPATIAL Special, 12(1):6–15.
  Surveillance, 6(2):e19273.
                                                         Melissa Roy, Nicolas Moreau, Cécile Rousseau, Ar-
Hassan Dashtian and Dhiraj Murthy. 2021. Cml-covid:       naud Mercier, Andrew Wilson, and Laëtitia Atlani-
  A large-scale covid-19 twitter dataset with latent      Duault. 2020. Ebola and localized blame on social
  topics, sentiment and location information. arXiv       media: Analysis of twitter and facebook conversa-
  preprint arXiv:2101.12202.                              tions during the 2014–2015 ebola epidemic. Culture,
                                                          Medicine, and Psychiatry, 44(1):56–79.
Zhiwei Gao, Shuntaro Yada, Shoko Wakamiya, and
  Eiji Aramaki. 2020. Naist covid: Multilingual          Lisa Singh, Shweta Bansal, Leticia Bode, Ceren Budak,
  covid-19 twitter and weibo dataset. arXiv preprint        Guangqing Chi, Kornraphop Kawintiranon, Colton
  arXiv:2004.08145.                                         Padden, Rebecca Vanarsdall, Emily Vraga, and
                                                           Yanchen Wang. 2020. A first look at covid-19 infor-
Purva Grover, Arpan Kumar Kar, Yogesh K. Dwivedi,           mation and misinformation sharing on twitter. arXiv
  and Marijn Janssen. 2019. Polarization and accultur-      preprint arXiv:2003.13907.
  ation in us election 2016 outcomes – can twitter an-
  alytics predict changes in voting preferences. Tech-   Santosh Vijaykumar, Glen Nowak, Itai Himelboim, and
  nological Forecasting and Social Change, 145:438 –       Yan Jin. 2018. Virtual zika transmission after the
  460.                                                     first us case: who said what and how it spread
                                                           on twitter. American journal of infection control,
Fatima Haouari, Maram Hasanain, Reem Suwaileh,             46(5):549–557.
  and Tamer Elsayed. 2021. Arcov19-rumors: Ara-
  bic covid-19 twitter dataset for misinformation de-    Koosha Zarei, Reza Farahbakhsh, Noel Crespi, and
  tection. In Proceedings of the Sixth Arabic Natural      Gareth Tyson. 2020. A first instagram dataset on
  Language Processing Workshop, WANLP 2021.                covid-19. arXiv preprint arXiv:2004.12226.
Arkaitz Zubiaga, Maria Liakata, Rob Procter, Geral-
  dine Wong Sak Hoi, and Peter Tolmie. 2016.
  Analysing how people orient to and spread rumours
  in social media by looking at conversational threads.
  PloS one, 11(3).
You can also read