ARCOV-19: THE FIRST ARABIC COVID-19 TWITTER DATASET - ARXIV

Page created by Danielle Hill

Health & Fitness

English

Like
Share
Embed
Fullscreen
Slides
Download HTML
Download PDF
Abuse

←

→

Page content transcription

If your browser does not render page correctly, please read the page content below

ArCOV-19: The First Arabic COVID-19 Twitter Dataset
 with Propagation Networks
 Fatima Haouari, Maram Hasanain, Reem Suwaileh, Tamer Elsayed
 Computer Science and Engineering Department
 Qatar University
 Doha, Qatar
 {200159617, maram.hasanain, rs081123, telsayed}@qu.edu.qa
 Abstract
 In this paper, we present ArCOV-19, an Arabic COVID-19 Twitter dataset that covers the period from 27th of January
 till 31st of March 2020. ArCOV-19 is the first publicly-available Arabic Twitter dataset covering COVID-19 pandemic
 that includes around 748k popular tweets (according to Twitter search criterion) alongside the propagation networks
 of the most-popular subset of them. The propagation networks include both retweets and conversational threads (i.e.,
 threads of replies). ArCOV-19 is designed to enable research under several domains including natural language processing,
arXiv:2004.05861v2 [cs.CL] 18 Apr 2020

 information retrieval, and social computing, among others. Preliminary analysis shows that ArCOV-19 captures rising
 discussions associated with the first reported cases of the disease as they appeared in the Arab world. In addition to the
 source tweets and the propagation networks, we also release the search queries and the language-independent crawler
 used to collect the tweets to encourage the curation of similar datasets.

 Keywords: Coronavirus pandemic, popular tweets, spread analysis, misinformation, conversational threads, retweets,
 social analytics.

 1. Introduction dominant languages in Twitter (Alshaabi et al., 2020a),
 yet still under-studied in general.
 Twitter streams hundreds of millions of tweets daily. ArCOV-19 is the first Arabic Twitter dataset designed
 In addition to being a medium for the spread and con- to capture popular tweets (according to Twitter search
 sumption of news, it has been shown to capture the criterion) discussing COVID-19 starting from January
 dynamics of real-world events including the spread of 27th till the end of March 2020, constituting about
 diseases such as the seasonal influenza (Kagashe et al., 748k tweets, alongside the propagation networks of
 2017) or more severe epidemics like Zika (Vijaykumar the most-popular subset of them.1 To our knowledge,
 et al., 2018), Ebola (Roy et al., 2020), and H1N1 (Mc- there is no publicly-available Arabic Twitter dataset
 Neill et al., 2016). Moreover, collective conversations for COVID-19 that includes the propagation network
 on Twitter about an event can have a great influence on of a good subset of its tweets. Some existing efforts
 the event’s outcomes, e.g., US 2016 presidential elec- have already started to curate COVID-19 datasets in-
 tions (Grover et al., 2019). Analyzing tweets about an cluding Arabic tweets, but Arabic is severely under-
 event, as it evolves, offers a great opportunity to un- represented (e.g., (Chen et al., 2020; Singh et al.,
 derstanding its structure and characteristics, inform- 2020)) or represented by a random sample that is not
 ing decisions based on its development, and anticipat- specifically focused on COVID-19 (e.g., (Alshaabi et
 ing its outcomes as represented in Twitter and, more al., 2020b)), or the dataset is limited in the period it
 importantly, in the real world. covers and does not include the propagation networks
 Since the first reported case of Novel Coronavirus (e.g., (Alqurashi et al., 2020)). Furthermore, ArCOV-19
 (COVID-19) in China, in November 2019, the COVID- includes only popular tweets as we hypothesize that
 19 topic has drawn the interest of many Arab users over such tweets are more likely to be influential.
 Twitter. Their interest, reflected in the Arabic content The contribution of this paper is three-fold:
 on the platform, has reached a peak after two months
 • We release ArCOV-19, the first Arabic Twitter
 when the first case was reported in the United Arab
 dataset about COVID-19 that comprises 748k
 Emirates late in January 2020. This ongoing pandemic
 popular tweets per Twitter search criterion.2 In
 has, unsurprisingly, spiked discussions on Twitter cov-
 addition to the tweets, ArCOV-19 includes propa-
 ering a wide range of topics such as general informa-
 gation networks of a subset of 65k tweets, search
 tion about the disease, preventive measures, proce-
 queries, and documented implementation of our
 dures and newly-enforced decisions by governments,
 language-independent tweets crawler.
 up-to-date statistics of the spread in the world, and
 • We present a preliminary analysis on ArCOV-19,
 even the change in our daily habits and work styles.
 which reveals insights from the dataset concerning
 In this work, we aim to facilitate future research on so- temporal, geographical, and topical aspects.
 cial media during this complex and historical period of
 our history by curating an Arabic dataset (ArCOV-19) 1
 The dataset will be continuously augmented with new
 that exclusively covers tweets about COVID-19. We tweets over the coming few months.
 2
 limit the dataset to Arabic since it is one of the most https://gitlab.com/bigirqu/ArCOV-19

 1

• We suggest several use cases to enable research on 2.2 Propagation Networks Collection
Arabic tweets in different research areas including, In addition to the source tweets, we also collected the
but not limited to, emergency management, mis- propagation networks (i.e., retweets and conversa-
information detection, and social analytics. tional threads) of the top 1000 most popular tweets
The reminder of the paper is organized as follows. each day. To our knowledge, this is the first Arabic
We present the crawling process followed to acquire tweet dataset to include such propagation networks.
ArCOV-19 in Section 2. We then thoroughly discuss the Before getting the most popular tweets on any day,
preliminary analysis that we conducted on the dataset we started from source tweets collected in that day
in Section 3. We suggest some use cases to enable re- and applied a quality qualification pipeline. We first
search on Arabic tweets in Section 4. We finally con- excluded tweets containing any inappropriate word
clude in Section 5. (using a list of inappropriate words we constructed).
Next, tweets with more than two URLs or four hash-
2. Data Collection tags, or shorter than four tokens (all are potentially
ArCOV-19 includes two major components: the source spam) were dropped. Additionally, duplicate tweets
tweets (i.e., tweets collected via Twitter search API that have exact textual content are also dropped to
every day) and the propagation networks (i.e., avoid redundancy; only the most popular of them (ac-
retweets and conversational threads of a subset of the cording to our scoring criterion below) is kept. Qual-
source tweets). In this section, we present how we col- ified tweets are then scored by popularity defined by
lected each in Sections 2.1, and 2.2 respectively, and the sum of the tweet’s retweet and favorite counts. We
summarize the released data in Section 2.3. finally sort the qualified tweets by their scores and se-
lect the most popular 1,000 tweets. We denote the set
2.1 Tweets Collection of all such tweets over all days as the top subset.
To collect the source tweets, we implemented a For those 1,000 tweets per day, we then collected
popularity-based tweet crawler.3 The crawler takes a all retweets and conversational threads (i.e., direct
set of manually-crafted queries, comprising a target and indirect replies). We collected the retweets using
topic, as input. At the end of each day, the crawler Pickaw,8 a platform for organizing contests on social
issues a search request for each of those queries via media, and the replies9 using PHEME (Zubiaga et al.,
Twitter search API4 to get the top popular tweets as 2016) Twitter conversation collection script.10
defined by Twitter API5 on that day. Queries can be
keywords (e.g., “Corona”), phrases (e.g., “the killing 2.3 Data Release
virus”), or hashtags (e.g., “#the new coronavirus”). In summary, we release the following resources as
Twitter returns a maximum of 3,200 tweets per query. ArCOV-19 dataset, taking into consideration Twitter
We customized the search requests to return only Ara- content redistribution policy:11
bic tweets,6 and to exclude all retweets to avoid having
• Source Tweets: tweet IDs of the tweets crawled
copies of the same popular tweet. Additionally, dupli-
in each day.
cate tweets (returned by different queries) are removed
• Search Queries: the list of search queries, in-
in each day. Finally, tweets are sorted chronologically.
cluding keywords, phrases, and hashtags, used in
We started collecting our data since the 27th of Jan-
each period to collect our source tweets.
uary 2020 using a set of queries that we manually-
• Top Subset: tweet IDs of the top 1,000 most
updated based on our daily tracking of trending key-
popular tweets for each day.
words and hashtags. The full list of queries used in
• Propagation Networks: the propagation net-
each day is released alongside our dataset. We denote
works for the top subset which include for each
all collected tweets as source tweets.
tweet in the top subset:
Due to technical reasons, we missed collecting tweets
for a few days. Due to Twitter search API restric- – Retweets: tweet IDs of the full retweet set.
tion that limits the search results to the past 7 days, – Conversational Threads: tweet IDs of the
we missed the old tweets. To overcome that, we full reply thread (including direct and indi-
used GetOldTweets37 python library to download the rect replies).
search results using the same trending keywords and
hashtags we selected around those days. Along with the dataset, we provide, in our repository,
some pointers to publicly-available crawlers that users
3
https://gitlab.com/bigirqu/ArCOV-19/- can easily use to crawl the tweets given their IDs.
/tree/master/code/crawler
4 8
https://developer.twitter.com/en/docs/tweets/ https://pickaw.com/en/twrench-becomes-pickaw
9
search/api-reference/get-search-tweets By the time of writing, we have collected the replies
5
We are not aware of the criterion used by Twitter to for the first 22 days of the dataset and are still collecting
retrieve the top tweets, but we believe it is based on a mix the rest; we will make all available in the near future.
10
of likes, retweets, impressions, etc. https://github.com/azubiaga/pheme-twitter-
6
The language of tweets to return is a configurable pa- conversation-collection
11
rameter. https://developer.twitter.com/en/developer-
7
https://github.com/Mottl/GetOldTweets3 terms/agreement-and-policy

Tweets number across days (verified/unverified)
22,000
3. ArCOV-19 in Numbers 20,000
Posted by verified user
Posted by unverified user

In this section, we present a statistical summary and 18,000

conduct a preliminary analysis on ArCOV-19 to shed 16,000

Number of tweets
14,000
some light on its major characteristics.
12,000

10,000
3.1 Tweets and Users Distribution
8,000
Table 1 presents an overall summary of the tweets and 6,000

users statistics in ArCOV-19. It indicates that The to- 4,000

tal number of source tweets in the dataset is about 2,000

748k posted by about 249k unique users. We note that 0

01-Mar
03-Mar
05-Mar
07-Mar
09-Mar
11-Mar
13-Mar
15-Mar
17-Mar
19-Mar
21-Mar
23-Mar
25-Mar
27-Mar
29-Mar
31-Mar
27-Jan
29-Jan
31-Jan
02-Feb
04-Feb
06-Feb
08-Feb
10-Feb
12-Feb
14-Feb
16-Feb
18-Feb
20-Feb
22-Feb
24-Feb
26-Feb
28-Feb
14.7% of the tweets were posted by verified users (who
constitute only 1.42% of the unique users). This is a
relatively large percentage, showing that a good por- Figure 1: Distribution of source tweets in ArCOV-19
tion of the source tweets are indeed popular. The aver- over time. Dark blue indicates tweets posted by veri-
age numbers of followers, friends, and statuses of users fied users, while light blue indicates tweets posted by
are also relatively large, showing that users observed unverified users.
in ArCOV-19 are also popular (and possibly more in- Akhbar | ‫أخبار اآلن‬ 2613
‫صحيفة البيان‬ 1855
fluential). The table also indicates that 21.3% of the CGTN Arabic 1473
‫عبدالعزيز الملحم‬ 1189
tweets include URLs. We anticipate that this is due to Mulhak 1177
‫الخليج أونالين‬ 1150
the extensive spreading of news (linked from tweets) NEWS KTV 1146
Bawabaa News 1133
during this period. The numbers of tweets that are ‫قناة الغد‬ 1106
Media News Agency 1053
geotagged and geolocated are also indicated in the ta- ‫اليوم السابع‬ 1043
‫صدى البلد‬ 1036
ble; however, we defer the discussion of such types of ‫الجوازات السعودية‬ 955
‫العربية‬ 954
tweets to Section 3.3. ‫الحدث‬ 924
‫القبس اإللكتروني‬ 924
Figure 1 illustrates the distribution of tweets over ‫صحيفة الخليج‬ 911
‫بوابة الوطن‬ 900
time in ArCOV-19. It is clearly non-uniform. More- ‫الغد الكويتية‬ 897
‫ترند‬ 895
over, the volume of tweets hugely increases when the ‫أخبار كورونا‬ 829
‫جريدة األنباء‬ 822
virus started to spread in several countries in the Arab ‫صحيفة الشرق األوسط‬ 810
‫جريدة النهار الكويتية‬ 794
world. The figure also shows the distribution of the AlMamlakaTV 787
0 500 1000 1500 2000 2500
tweets posted by verified users vs. unverified users, Number of tweets
showing a similar pattern.
Figure 2 presents the top 25 tweeting users in Figure 2: Number of tweets posted by the top 25 tweet-
ArCOV-19. We notice that several of them are news ers in ArCOV-19.
sources, and 20 are verified Twitter accounts. This
is again consistent with our crawling criteria focusing discussed over Twitter during that period. To help ex-
only on popular tweets. amine this hypothesis, we analyzed the textual content
of the tweets; in particular, we identified the most-
Table 1: Tweets and users statistics of ArCOV-19. frequent words, hashtags, and Arab country names in
Tweets Statistics the entire dataset, then tracked their frequency over
time. Figure 3 shows the time series (over days) of the
Source Tweets 747,599 three types of keywords.12
Geotagged 486 (0.07%) The 10 most-frequent words shown in Figure 3(a) in-
Geolocated 20,031 (2.68%) dicate two different types of words: those that are di-
Posted by verified users 110,141 (14.7%) rectly related to COVID-19 (e.g., “health”) and those
Include URL 159,239 (21.3%) that are not but related to prayers and supplications
Top Subset 65,000 (8.69%) (e.g., “Allah/God”)). It is interesting to see the word
Users Statistics “Allah” is very frequent early on when the news about
the virus started to spread (probably over discussions
Unique 248,513
around whether the pandemic is a punishment from
Verified 3,537 (1.42%)
God or not, we believe), then declines over time only
Average followers count 8,334
to become frequent again when the virus started to
Average friends count 1,274
widely spread in the Arab world.
Average statuses count 11,201
Figure 3(b) demonstrates how “#China” hashtag was
the most trending one from 27 of January until 20 of
February, as COVID-19 was prevalent only in China
3.2 Tweets Content & Topics
and still not widely spread (at that time) in other
It is important to demonstrate that the tweets in
ArCOV-19 constitute a good representative sample of 12
When identifying the most frequent words and hash-
the Arabic tweets posted during the target period on tags, we excluded the ones we used in our search queries
COVID-19, and that they cover the prevalent topics since they are expected to be very frequent by definition.

30%
 ‫هللا‬ ‫هللا‬ ‫حالة‬ ‫بفيروس‬ ‫الصحة‬ ‫بسبب‬
 25% ‫اصابة‬ ‫العالم‬ ‫انتشار‬ ‫اللهم‬ ‫عدد‬

 ‫اصابة‬
 20% ‫بفيروس‬
 ‫حالة‬
 % tweets

 ‫الصحة‬ ‫بسبب‬
 15%
 ‫اللهم‬ ‫انتشار‬
 10%
 ‫العالم‬

 5%

 0% ‫عدد‬

 01-Mar
 03-Mar
 05-Mar
 07-Mar
 09-Mar
 11-Mar
 13-Mar
 15-Mar
 17-Mar
 19-Mar
 21-Mar
 23-Mar
 25-Mar
 27-Mar
 29-Mar
 31-Mar
 27-Jan
 29-Jan
 31-Jan
 02-Feb
 04-Feb
 06-Feb
 08-Feb
 10-Feb
 12-Feb
 14-Feb
 16-Feb
 18-Feb
 20-Feb
 22-Feb
 24-Feb
 26-Feb
 28-Feb
 (a) Top 10 most frequent words

 20%
 _‫كورونا‬# #‫الكويت‬ #‫الصين‬
 ‫الكويت‬ _‫كورونا‬# #‫إيران‬ #‫عاجل‬
 ‫الصين‬#
 _‫كورونا‬# ‫في_الكويت‬ #‫كورونا_في_الكويت‬ #‫السعودية‬
 15%
 ‫لبنان‬ #‫كورونا_الكويت‬ #‫كورونا_لبنان‬
 #‫لبنان‬ #covid19
 ‫لبنان‬#
 % tweets

 ‫الكويت‬# ‫عاجل‬#
 10%
 ‫إيران‬#
 #covid19
 5%
 ‫السعودية‬#

 0%
 11-Mar
 13-Mar
 15-Mar
 17-Mar
 19-Mar
 21-Mar
 23-Mar
 25-Mar
 27-Mar
 29-Mar
 31-Mar
 2-Feb
 4-Feb
 6-Feb
 8-Feb
 27-Jan
 29-Jan
 31-Jan

 10-Feb
 12-Feb
 14-Feb
 16-Feb
 18-Feb
 20-Feb
 22-Feb
 24-Feb
 26-Feb
 28-Feb
 1-Mar
 3-Mar
 5-Mar
 7-Mar
 9-Mar

 (b) Top 10 most frequent hashtags
 Time of first reported infected case
 ‫ليبيا‬
 ‫مصر‬

 ‫لبنان‬

 ‫الكويت عمان البحرين العراق‬

 ‫قطر‬
 ‫السعودية تونس المغرب األردن‬

 ‫سوريا‬
 ‫اإلمارات‬

 ‫الجزائر‬

 ‫السودان موريتانيا‬
 ‫فلسطين‬

 12%

 ‫مصر‬ ‫لبنان‬ ‫الكويت‬ ‫الكويت‬ ‫مصر‬ ‫لبنان‬ ‫السعودية‬
 ‫االمارات‬ ‫العراق‬ ‫البحرين‬ ‫قطر‬
 10% ‫البحرين‬
 ‫العراق‬ ‫اليمن‬ ‫االردن‬ ‫الجزائر‬ ‫سوريا‬

 8% ‫المغرب‬ ‫السودان‬ ‫تونس‬ ‫فلسطين‬
 % tweets

 ‫ليبيا‬ ‫عمان‬ ‫موريتانيا‬ ‫الصومال‬

 6% ‫الجزائر‬
 ‫اإلمارات‬ ‫عمان‬
 4% ‫السعودية‬ ‫قطر‬
 ‫ليبيا‬
 2%

 0%
 17-Mar

 31-Mar
 01-Mar
 03-Mar
 05-Mar
 07-Mar
 09-Mar
 11-Mar
 13-Mar
 15-Mar

 19-Mar
 21-Mar
 23-Mar
 25-Mar
 27-Mar
 29-Mar
 27-Jan
 29-Jan
 31-Jan
 02-Feb
 04-Feb
 06-Feb
 08-Feb
 10-Feb
 12-Feb
 14-Feb
 16-Feb
 18-Feb
 20-Feb
 22-Feb
 24-Feb
 26-Feb
 28-Feb

 (c) Word frequency of Arab countries

Figure 3: Time series of the frequency (in percentage of tweets) of top general keywords, hashtags, and Arab
county words over the period of ArCOV-19. The time of first reported cases in Arab countries is also indicated
and aligned with the time series.

 4

youm7.com 1044
countries. We can see how it started to be less trend- alarabiya.net 872
 youtube.com 627
ing as the number of cases started to decline in China cnn.com 508
 sabq.org 441
by that date.13 On the other hand, when COVID- cgtn.com 433
 skynewsarabia.com 431
19 started to spread in the Arab world, other hash- rt.com 404
 aa.com.tr 389
tags started to become viral. Furthermore, Figure 3(b) akhbaralaan.net 372
 elwatannews.com 360
shows that frequency spikes of “#Iran”, “#Lebanon”, aawsat.com 305
 almasryalyoum.com 290
and “#Kuwait” exactly match the confirmation dates euronews.com 276
 oaurl.com
of first reported cases in Iran,14 Lebanon,15 and
 275
 spa.gov.sa 261
 alhurra.com
Kuwait,16 respectively.
 250
 24.ae 241
 okaz.com.sa 226
To further analyze trending topics, Figure 3 features pal4all.live 212
 argaam.com 206
the timeline of the first reported cases in the Arab albayan.ae 195
 alqabas.com 185
countries and aligns them with the time series through- shorouknews.com 182
 alhadath.net 178
out the figure. Aligning the timeline with the series 0 200 400 600 800 1000

in Figure 3(c) reveals a significant match between the Number of occurrences

frequency peaks of several country names and the cor-
responding dates of first reported cases in those coun- Figure 4: Most frequently-linked domains in the top
tries, most notably in UAE, Egypt, Lebanon, Kuwait, tweets subset.
Oman, Bahrain, Algeria, and Libya.
Figure 3 also demonstrates the power of ArCOV-19 We observe that these news websites mainly origi-
in capturing further controversial and trending top- nate from three Arab countries, namely, Egypt, Saudi
ics. Table 2 shows a timeline covering dates of specific Arabia, and United Arab Emirates. Interestingly,
topics of discussion trending on social media around URLs from the China “Global Television Network”
times of peaks in tweeting frequency in Figure 3(c). (cgtn.com) are commonly shared as well, which can be
 Country Date Topic Related News explained by the fact that COVID-19 started in and
 UAE Feb 19 Yemeni Foreign http://tiny.cc/71qtmz heavily affected China.
 Affairs Minister
 thanks UAE for 3.3 Geographic Distribution of Tweets
 evacuating Yemeni
 students from Although Twitter provides automatic geo-reference
 China
 Iraq Feb 20 Iraq announces http://tiny.cc/yzrtmz
 functionally, few users (solely around 1-3% (Murdock,
 closure of borders 2011; Jurgens et al., 2015)) opt to enable it due to pri-
 with Iran vacy and safety reasons. Alternatively, to have an in-
 Egypt Mar 2 Egyptian health http://tiny.cc/7iqtmz
 minister announces
 sight about the geographical distribution and diversity
 a visit to China of the tweets in our dataset, we examined the place and
 Mar 7 Kuwait bans http://tiny.cc/lqstmz coordinates attributes of the Tweet object.17 We note
 travels with Egypt
 including entry of that the place attribute is an optional attribute that
 Egyptian residents allows the user to select a location from a list provided
 KSA Mar 4-5 Closure of The http://tiny.cc/iqrtmz by Twitter (therefore, the location might not necessar-
 Grand Mosque in
 Mecca ily show the actual location from where the tweet is
 posted), whereas the coordinates attribute represents
Table 2: Examples of trending topics in social media the geographic location of the Tweet as reported by
matched by spikes in tweeting frequency in ArCOV-19 the user or client application.
 Table 1 shows that ArCOV-19 has 20,031 geolocated
We further explore the topics discussed in ArCOV-19 by tweets (i.e., having values in the place attribute) and
considering the domains of URLs shared in the tweets. 486 geotagged tweets (i.e., having values in the co-
We focused on the top tweets subset of ArCOV-19 (con- ordinates attribute). Those tweets were posted by
structed as detailed in Section 2.2) and identified the 9,543 and 82 unique users respectively. The geolo-
URLs posted in those tweets. We then expanded them cated tweets constitute about 2.68% of the total source
(since Twitter URLs are shortened) and dropped URLs tweets, which is slightly higher than what is reported
of images and videos to focus only on URLs referring to in previous studies on disaster datasets (Anderson et
external sources. We extracted 17,308 URLs from 759 al., 2019). We anticipate that this is due to the cri-
unique domains. Figure 4 shows that URLs from news teria we followed to collect the tweets, which relies on
websites are dominantly the most-commonly shared. Twitter’s popularity criteria.
 13
 Focusing our attention on the geolocated tweets, Fig-
 https://www.worldometers.info/coronavirus/
 ures 5 and 6 depict the overall and daily distribu-
country/china/
 14 tions (respectively) of those tweets over Arab and other
 https://en.wikipedia.org/wiki/2020_
coronavirus_pandemic_in_Iran#February_2020 countries. The geolocated tweets were indeed posted
 15
 https://www.the961.com/first-coronavirus- by users from 81 countries from around the globe. We
case-confirmed-in-lebanon/
 16 17
 https://www.kuna.net.kw/ArticleDetails.aspx? https://developer.twitter.com/en/docs/tweets/
id=2865550&language=en data-dictionary/overview/geo-objects

 5

Bahrain
1.8%
3.4 Propagation Networks
Libya
2.0%
As discussed earlier, we collected the propagation net-
Jordan works of the top subset. Overall, the collected number
2.6%
Qatar KSA of retweets is 2,697,951 for the entire subset, and the
2.6% 35.3%
Iraq collected number of replies is 301,535 for only 22 days
3.7%
Egypt
of the time period.19 Both are released in ArCOV-19.
5.2% At the time of collecting the retweets, we were not
Oman
6.1% able to get the retweets of some of the tweets either
because they were deleted or they were posted from
UAE
6.8% private accounts. For those collected successfully, Fig-
ure 7 illustrates the distribution of retweets per day
Lebanon using boxplots. It shows that the average (and me-
8.2% Kuwait
Other 12.6% dian) number of retweets follows a similar pattern to
9.0% what was shown in Figure 1 (notice that the Y-axis
here is in log scale). We also notice that a good num-
Figure 5: Distribution of geolocated tweets. ber of tweets got more than 100 retweets; some of them
got even larger numbers reaching about 10k retweets
800 Somalia
Palestine or more, showing highly propagated content.
Mauritania
Syria
105
Yemen
600 Tunisia
Sudan
Algeria 104
Morocco
Number of tweets

Bahrain
Number of retweets

400
Libya
103
Jordan
Qatar
Iraq
Egypt 102
200
Oman
UAE
Lebanon
101
Other

0 Kuwait
27/01
29/01
31/01
02/02
04/02
06/02
08/02
10/02
12/02
14/02
16/02
18/02
20/02
22/02
24/02
26/02
28/02
01/03
03/03
05/03
07/03
09/03
11/03
13/03
15/03
17/03
19/03
21/03
23/03
25/03
27/03
29/03
31/03

KSA
100
27/01
29/01
31/01
02/02
04/02
06/02
08/02
10/02
12/02
14/02
16/02
18/02
20/02
22/02
24/02
26/02
28/02
01/03
03/03
05/03
07/03
09/03
11/03
13/03
15/03
17/03
19/03
21/03
23/03
25/03
27/03
29/03
31/03
Figure 6: Distribution of geolocated tweets per day.

Figure 7: Distribution of retweets per day for the top
found that 91% of them were posted from the Arab subset.
world. The largest contribution of the geolocated con-
tent (about 35.3%) comes from Saudi Arabia; this is
somewhat expected as Saudi users represent the high-
104
est number of active Twitter users in the Arab world.18
Surprisingly, Kuwait comes second with 12.6% of the
geolocated content. We believe the rationale is that it 103
Number of replies

was among the first countries that reported the cases
in the Middle East. Since then, Kuwait started a se- 102
ries of strict precautions such as a wide lock-down in
many vital facilities until the government had imposed 101
a nationwide curfew. We think, after the curfew, the
people become more active on Twitter as a platform
100
to break news and discuss developments of the virus.
27/01
29/01
31/01
02/02
04/02
06/02
08/02
10/02
12/02
14/02
16/02
18/02
20/02
22/02
24/02
26/02
28/02
01/03
03/03
05/03
07/03
09/03
11/03
13/03
15/03
17/03
19/03
21/03
23/03
25/03
27/03
29/03
31/03

Additionally, we used related phrases and hashtags to
“Kuwait”, “Lebanon”, and “UAE” among the tracking
keywords in the period between 22 of February and Figure 8: Distribution of replies per day for the top
13 of March which demonstrates the reason behind subset, for only 22 days.
their high percentages during that period, as shown in
Figure 6. Furthermore, it is not surprising to see the Figure 8 illustrates the distribution of replies per day
countries that have a few or no cases have the smallest using boxplots, but only for 22 days that we managed
portions of contribution to the content. Others have to crawl by the time of writing. However, the smaller
small audience userbases on Twitter (e.g., Tunisia). period shows similar pattern, with a bit lower numbers,
to the pattern shown in Figure 7.
18
https://www.statista.com/statistics/242606/
19
number-of-active-twitter-users-in-selected- Crawling is still going on for the rest of the days and
countries/ the repository will be updated accordingly.

4. Enabling Research versational threads available in ArCOV-19 provide a
With the spread of Novel Coronavirus in several Arab valuable resource for early detection of fake news and
countries and the subsequent procedural measures identification of malicious rumor spreading accounts.
taken by the local governments, it continued to domi- 4.3 Social Analytics
nate discussions over social media (and Twitter in par-
ticular) to the time of this writing. In addition to sharing reports and awareness, people
tend to express their emotions during the outbreak
To enable research on different tasks on Arabic Tweets,
(e.g., big change of lifestyle, social distancing, losing
we make ArCOV-19 publicly-available for the research
their beloved ones, distance learning, etc.). Therefore,
community. ArCOV-19 is composed of the popu-
analyzing the sentiment of the tweets to understand
lar tweets, according to Twitter popularity criteria.
the effects of the outbreak on people is of interest to
Therefore, we envision that it contains the tweets that
many stakeholders.
had received the maximum social interactions (e.g.,
Additionally, lots of discussions focused on the causes
impressions) which implies they are useful tweets to
of Novel Coronavirus and its rapid spread, which can
analyze as they got the most attention of the com-
be studied to understand the stance of users toward dif-
munity. We further provide propagation networks of
ferent aspects of the situation such as claims, and the
daily highly popular subset (the top 1,000 most pop-
consequences of the spread. For example, the catas-
ular) to ensure the quality of tweets. This sample is
trophic situation caused by the rapid spread of the
drawn after filtering out potential low-quality tweets
virus made hospitals go beyond their capacity, which
(e.g., spam) and content duplicates.
forced doctors to deal with agonizing life-death deci-
This careful design and architecture of ArCOV-19
sions triggering heated discussions on such issue.
makes it suitable for diverse natural language pro-
Moreover, in some countries, people are demanding to
cessing, information retrieval, and computational so-
return resident labors back to their countries to re-
cial science research tasks including, but not limited
duce the spread of the virus and limit the country’s
to, emergency management, misinformation detection,
resources to only citizens. This stance sparked contro-
and social analytics, as we discuss below. We antici-
versial discussion on social media on the legal, human-
pate ArCOV-19 to support these tasks on the popular
itarian, and ethical aspects of these demands. This re-
Arabic tweets that we assume have the most effect on
quires analytical tools to detect hate speech and, gen-
shaping public opinion and understanding.
erally, offensive language.
4.1 Emergency Management
As COVID-19 is an international pandemic, national
5. Conclusion
and international health organizations need to analyze In this paper, we presented ArCOV-19, the first Arabic
the effects of the outbreak. We envision ArCOV-19 Twitter dataset about the Novel Coronavirus (COVID-
to support several tasks such as filtering informative 19) that includes popular tweets and propagation net-
content, summarization, eyewitness identification, ge- works, focusing on popular discussions on Twitter. We
olocation, and studying information (and situational release all source tweets, top subset, search queries,
awareness) propagation. The popular tweets provide and the propagation networks. Preliminary analysis
key insights into different aspects that are of inter- showed that ArCOV-19 captured spikes in tweeting fre-
est to many users. For example, recovered users write quency for country-specific tweets that are consistent
about the symptoms of the virus they experienced, in- with the first reported cases of COVID-19 in several
fected users search for possible medications, and non- Arab countries. We also found dominance of news
infected people look for protective measures. Other agencies among top tweeting users and among most
types of information of interest that help for situa- shared URLs. ArCOV-19 enables research under many
tional awareness include reports (tolls of cases, deaths, domains including natural language processing, infor-
recovered), development of precautions (e.g., quaran- mation retrieval, and social computing. We plan to
tine, lockdown, curfew, etc.), among others. continue collecting tweets for the foreseeable future
and the dataset will be continuously updated with
4.2 Misinformation Detection newly collected tweets and propagation networks.
With the sheer amount of information shared about
COVID-19, many rumors are disseminated and get- Acknowledgments
ting high attention from the community, which causes The work of Tamer Elsayed and Maram Hasanain
a fast propagation of misinformation. This hinders the was made possible by NPRP grant# NPRP 11S-1204-
efforts of the health and governmental organizations 170060 from the Qatar National Research Fund (a
on fighting the pandemic as such rumors spread panic member of Qatar Foundation). The work of Reem
and may lead to undesirable consequences (e.g., in- Suwaileh was supported by GSRA grant# GSRA5-1-
crease of cases and mortality rate, or lack of supplies 0527-18082 from the Qatar National Research Fund
due to hoarding). ArCOV-19 supports studying infor- and the work of Fatima Haouari was supported by
mation/claims propagation, claims check-worthiness GSRA grant# GSRA6-1-0611-19074 from the Qatar
detection and verification tasks on the most popular National Research Fund. The statements made herein
tweets. Furthermore, the retweet networks and con- are solely the responsibility of the authors.

References American journal of infection control, 46(5):549–
 557.
Alqurashi, S., Alhindi, A., and Alanazi, E. (2020). Zubiaga, A., Liakata, M., Procter, R., Hoi, G. W. S.,
 Large arabic twitter dataset on covid-19. and Tolmie, P. (2016). Analysing how people orient
Alshaabi, T., Dewhurst, D. R., Minot, J. R., Arnold, to and spread rumours in social media by looking at
 M. V., Adams, J. L., Danforth, C. M., and Dodds, conversational threads. PloS one, 11(3).
 P. S. (2020a). The growing echo chamber of social
 media: Measuring temporal and social contagion dy-
 namics for over 150 languages on twitter for 2009–
 2020.
Alshaabi, T., Minot, J., Arnold, M., Adams, J. L.,
 Dewhurst, D. R., Reagan, A. J., Muhamad, R.,
 Danforth, C. M., and Dodds, P. S. (2020b). How
 the world’s collective attention is being paid to
 a pandemic: Covid-19 related 1-gram time se-
 ries for 24 languages on twitter. arXiv preprint
 arXiv:2003.12614.
Anderson, J., Casas Saez, G., Anderson, K. M., Palen,
 L., and Morss, R. (2019). Incorporating Context
 and Location Into Social Media Analysis: A Scal-
 able, Cloud-Based Approach for More Powerful Data
 Science, volume 6. IEEE.
Chen, E., Lerman, K., and Ferrara, E. (2020). Covid-
 19: The first public coronavirus twitter dataset.
 arXiv preprint arXiv:2003.07372.
Grover, P., Kar, A. K., Dwivedi, Y. K., and Janssen,
 M. (2019). Polarization and acculturation in us elec-
 tion 2016 outcomes âĂŞ can twitter analytics predict
 changes in voting preferences. Technological Fore-
 casting and Social Change, 145:438 – 460.
Jurgens, D., Finethy, T., McCorriston, J., Xu, Y. T.,
 and Ruths, D. (2015). Geolocation prediction in
 twitter using social networks: A critical analysis and
 review of current practice.
Kagashe, I., Yan, Z., and Suheryani, I. (2017). En-
 hancing seasonal influenza surveillance: Topic analy-
 sis of widely used medicinal drugs using twitter data.
 J Med Internet Res, 19(9):e315.
McNeill, A., Harris, P. R., and Briggs, P. (2016). Twit-
 ter influence on uk vaccination and antiviral uptake
 during the 2009 h1n1 pandemic. Frontiers in Public
 Health, 4:26.
Murdock, V. (2011). Your mileage may vary: on
 the limits of social media. SIGSPATIAL Special,
 3(2):62–66.
Roy, M., Moreau, N., Rousseau, C., Mercier, A., Wil-
 son, A., and Atlani-Duault, L. (2020). Ebola and
 localized blame on social media: analysis of twit-
 ter and facebook conversations during the 2014–2015
 ebola epidemic. Culture, Medicine, and Psychiatry,
 44(1):56–79.
Singh, L., Bansal, S., Bode, L., Budak, C., Chi,
 G., Kawintiranon, K., Padden, C., Vanarsdall, R.,
 Vraga, E., and Wang, Y. (2020). A first look at
 covid-19 information and misinformation sharing on
 twitter. arXiv preprint arXiv:2003.13907.
Vijaykumar, S., Nowak, G., Himelboim, I., and Jin, Y.
 (2018). Virtual zika transmission after the first us
 case: who said what and how it spread on twitter.

 8

You can also read