INFORMATION MINING FOR COVID-19 RESEARCH FROM A
←
→
Page content transcription
If your browser does not render page correctly, please read the page content below
I NFORMATION M INING FOR COVID-19 R ESEARCH F ROM A
L ARGE VOLUME OF S CIENTIFIC L ITERATURE
Sabber Ahamed Manar D. Samad
Asurion Department of Computer Science
Nashville, TN 38152 Tennessee State University
sabbers@gmail.com Nashville, TN 37209
arXiv:2004.02085v1 [cs.IR] 5 Apr 2020
msamad@tnstate.edu
April 7, 2020
A BSTRACT
The year 2020 has seen an unprecedented COVID-19 pandemic due to the outbreak of a novel
strain of coronavirus in 180 countries. In a desperate effort to discover new drugs and vaccines for
COVID-19, many scientists are working around the clock. Their valuable time and effort may benefit
from computer-based mining of a large volume of health science literature that is a treasure trove
of information. In this paper, we have developed a graph-based model using abstracts of 10,683
scientific articles to find key information on three topics: transmission, drug types, and genome
research related to coronavirus. A subgraph is built for each of the three topics to extract more
topic-focused information. Within each subgraph, we use a betweenness centrality measurement
to rank order the importance of keywords related to drugs, diseases, pathogens, hosts of pathogens,
and biomolecules. The results reveal intriguing information about antiviral drugs (Chloroquine,
Amantadine, Dexamethasone), pathogen-hosts (pigs, bats, macaque, cynomolgus), viral pathogens
(zika, dengue, malaria, and several viruses in the coronaviridae virus family), and proteins and
therapeutic mechanisms (oligonucleotide, interferon, glycoprotein) in connection with the core topic
of coronavirus. The categorical summary of these keywords and topics may be a useful reference to
expedite and recommend new and alternative directions for COVID-19 research.
Keywords coronavirus · COVID-19 · text mining · graph theory · topic modeling · betweenness centrality
1 Introduction
Severe acute respiratory syndrome (SARS) emerged in 2002 following the outbreak of a new SARS-related coronavirus
(SCV), which has resulted in hundreds of deaths world-wide after infecting thousands of individuals [Fouchier et al.,
2003]. Since then, there has been a plethora of research aiming at understanding the etiology of the disease [Weiss and
Navas-Martin, 2005], developing more reliable and faster diagnostic solutions [Yamaoka et al., 2016], and innovating
antiviral drugs [Keyaerts et al., 2009] and vaccines [Yong et al., 2019]. Many of these studies have successfully
repurposed previous findings to solve new challenges in biology and medicine. For example, if a group of viruses
shares a common protein structure, then therapies for one viral infection can be repurposed for new diseases like
COVID-19. Unfortunately, viruses often mutate to several challenging strains that are capable of infecting humans
in a highly contagious manner. This eventually triggers the onset of zoonotic infections, i.e., animal-to-human and
human-to-human transmission of the virus strain [Ozawa and Kawaoka, 2013].
The ongoing global pandemic of coronavirus disease (COVID-19) has emerged due to a new strain of SCV named as
the SARS coronavirus 2 (SARS-CoV-2) [Zhu et al., 2020] that is arguably hosted by bats [Wu et al., 2020]. As of
March 31st, 2020, 857,487 people of 180 countries have been infected by the virus in about three months, taking a toll
of 42,107 human lives as the numbers continue to grow exponentially [Dong et al., 2020]. This outbreak has already
started to cause an unprecedented impact on humankind at a scale that threatens to reshape the global economy, human
lives, and the world beyond. Unfortunately, it entails substantial time and effort to study these new strains of virusesA PREPRINT - A PRIL 7, 2020
before developing a targeted and fail-safe drug or vaccine. Without any immediate solution, several countries have
recently approved the use of hydroxychloroquine to treat critically ill patients diagnosed with COVID-19 despite its
known side-effects [Liu et al., 2020, Masuelli et al., 2017]. Notably, hydroxychloroquine is a drug for treating patients
diagnosed with malaria [Ben-Zvi et al., 2012]. This repurposing of malaria drug to treat coronavirus related patients
suggests that topics of virology, genetic engineering, bio-informatics, pharmacology in the literature may have valuable
recommendations for COVID-19 research.
Since the 2002 outbreak of coronavirus, many scholarly articles have been published on the topics of SARS and SCV. It
is not feasible for scientists to survey such a massive volume of diverse literature to derive useful cross-disciplinary
insights within a short period of time. Computer-based modeling of text documents can be a powerful approach in search
of valuable information from a large collection of diverse literature. In this paper, we have developed a graph-based
model using a large volume of scientific literature to discover latent keywords that may recommend new information to
combat the battle against COVID-19.
2 Technical Background
In text analytics, keywords searching plays an important role in document summarization [Hernández-castañeda et al.,
2020], classification [Diaz-Manriquez et al., 2018], and topic modeling [Jelodar et al., 2019]. However, the task of
preserving semantic and syntactic information of a language in numerical models is not trivial. There are three broad
categories of text-to-numeric representation models. First, term-frequency-inverse-document frequency (TF-IDF) [Yun-
tao et al., 2005], is one of the earliest methods for text representation based on word count. However, such models
do not capture semantic relevance and syntactic ordering of the words. Second, manifold learning techniques, such
as skip gram [Cheng et al., 2006] and GLOVe [Pennington et al., 2014], have been developed to efficiently capture
semantic relevance between words in numeric data vectors. These numeric embeddings do not directly map the intricate
relationship between words within a document. Third, graphs can be used to develop an interconnected network of
words of a document to directly model the syntactic ordering and connectivity between words. A graph-based keyword
extraction model using collective node weight (KECNW) has been proposed based on node-edge rank centrality
measure [Biswas et al., 2018]. Several attributes, such as the distance from the central node, the selectivity of each node
based on a number of its connecting edges, and the importance of neighboring nodes are taken into account to determine
the weight of each node. The node weights are used to rank individual words for keyword selection. Additionally,
graph-based models provide informative and interactive visualization of complex interconnected network of words in a
document [Rusu et al., 2009]. The goal of this paper is to develop a dense graphical network of words to summarize
thousands of research articles into topics related to coronavirus. The collection of influential nodes or keywords may
provoke new research topics for the ongoing COVID-19 research.
3 Methodology
The steps involved in the proposed research methodology are discussed below.
3.1 Dataset
In this work, we use the publicly available COVID-19 Open Research Dataset (CORD-19) released by a collaborative
initiative of the Allen Institute for AI, Chan Zuckerberg Initiative (CZI), Georgetown University’s Center for Security
and Emerging Technology (CSET), Microsoft, and the National Library of Medicine (NLM) at the National Institutes
of Health (NIH) at the request of the White House Office of Science and Technology Policy. The dataset includes
about 13,000 scholarly articles with full texts in the areas of life science, including pharmacology, immunology,
physiology, microbiology, clinical medicine, genetics, veterinary, and many more. The articles are related to the study
of COVID-19, SARS-CoV-2, and similar other coronaviruses. The dataset comes with metadata and actual article texts
in structured data fields that identify title, paper and publication information, abstract, and the main body texts. More
details of the dataset can be found on the Kaggle website (https://www.kaggle.com/allen-institute-for-ai/
CORD-19-research-challenge), which is a community-based data science platform for solving problems around
the world.
3.2 Proposed Graph Model
We propose a directed graph G(N, E) based model that is represented by a set of nodes N and a set of edges E with
directions. Each node n ∈ N represents a unique word. A directed edge e ∈ E connects two adjacent words based on
their positioning in a sentence. The direction of an edge points from the preceding word to the succeeding word in a
2A PREPRINT - A PRIL 7, 2020
Gene
Drug
Corona
virus
Trans
mission
edge (E)
word W1 word W2
Figure 1: Left: A pair of graph nodes with a directed edge between words w1 and w2. Right: A schematic diagram
to illustrate the central coronavirus node. The size of neighoboring nodes varies with BC measurements of their
corresponding words. Three influential nodes with high BC measurements and their subgraphs are color coded.
sentence as shown in Figure 1. Using this principle, a sequence-based text graph is first generated using texts of the
articles. In the next step, we determine the importance of each node in the graph based on the concept of ’centrality’. In
general, a node with a higher centrality has more central positioning in the graph. For example, the node with relatively
shortest paths to all other nodes in the graph has the highest measure of centrality [Hage and Harary, 1995]. In this
paper, we measure the centrality of a node using the betweenness centrality (BC) algorithm that is widely applied in
social network analysis. However, the BC algorithm has not been well studied in topic selection and modeling with a
large volume of texts, as we propose in this article. The BC algorithm is originally proposed by Freeman [1977] and
later upgraded by Brandes [2001] to improve its computational performance.
In BC measurement, the node with highest centrality will be located in the highest number of shortest paths between
all other node pairs in the graph. In other words, if a word frequently appears between the shortest paths of other
word pairs, then the word will have a high value of centrality. Thus, the BC algorithm finds influential words that are
imperative for understanding the relationship between word pairs. Such transitive words are weighted as representative
nodes in the graph and selected as key topics for subsequent analysis. Intuitively, this approach is more sophisticated
and informative than using a distance-based centrality measurement. If (u, v, w)∈ N such that u 6= v 6= w, the shortest
path between any two nodes can be calculated using a standard Dijkstra’s algorithm. To measure BC of a node w in the
graph, the number of shortest paths between all other node pairs is first obtained and denoted by σuv . Next, the number
of shortest paths (σuv (w)) is counted between all other node pairs that include the target node w on the path. The ratio
of these two counting numbers is calculated for each pair of nodes as the pair-dependency (δu,v ) below.
σuv (w)
δu,v = . (1)
σuv
The sum of pair-dependency across all possible pairs (excluding w) in the graph yields the BC of the target node w as
below.
X σuv (w)
BC(w) = . (2)
σuv
u6=v6=w
Each node representing a unique word can be a representative topic by itself. The corpus words are ranked by BC
measurements to select semantically important key topics with relatively high BC values. The key topics are then used
to create their corresponding subgraphs. Each topic word is the center of the subgraph, and its neighboring words are
the constituents of the topic. The proposed method does not require the number of topics in advance compared to other
topic modeling algorithms such as latent Dirichlet allocation (LDA) [Hoffman et al., 2010] and latent semantic analysis
(LSA) [Landauer et al., 1998].
3.3 Data Processing
We develop the proposed model and perform the text analysis on a Python programming environment using several
Python packages. These packages include natural language tool kit (NLTK) for basic text analysis and regular expressions
package (re) for text preprocessing. A graph database Neo4J and a Python to database connector library (py2neo) are
3A PREPRINT - A PRIL 7, 2020
Table 1: Neighboring words in the transmission subgraph linked to coronavirus.
Topic Connected nodes in the transmission subgraph
Host ’Rats’, ’Livestock’, ’Mammals’, ’Goats’, ’Camels’, ’Mosquito’, ’Pigs’, ’Monkeys’, ’Dogs’, ’Poul-
try’, ’Rodents’, ’Horses’, ’Bird’, ’Chimpanzee’, ’Honeybees’, ’Mites’, ’Puma’, ’Snails’, ’Infants’,
’Ants’, ’Flies’ , Rodents, Drosophila
Disease ’Malaria’, ’Rabies’, ’Dengue’, ’Chikungunya’, ’Measles’, ’Schistosomiasis’, ’Monkeypox’, ’Po-
liovirus’, ’Cholera’, ’Lyssavirus’, ’Tuberculosis’, ’Rhinovirus’, ’Adenovirus’, ’HFMD’, ’HIV’,
’Paramyxovirus’, ’Zikavirus’, ’Fungal’, ’Leishmania’, ’Myocarditis’, ’Venereal’
Miscellaneous ’Distance’, ’Fever’, ’Traveler’, ’Contacts’, ’Exposures’, ’Humidity’, ’Infants’, ’Pregnancy’, ’In-
terspecies’, ’Mucosal’, ’Reticulum’, ’Zoonosis’, ’Sanitization’, ’Bioaerosols’, ’Aerobiology’,
’Tropisms’, ’Human-to-human’, ’Transspecies’, ’Touches’, ’MDRO’, ’Fluoroquinolone’.
used to build the graph. The first step in text processing is to identify and remove ’stop words’ (words with little or no
merit as keywords) from the corpus. This step not only reduces the complexity in the graph modeling but also puts
more weights on relevant candidate words in the graph. The dataset includes research articles composed in Chinese,
Russian, Arabic, and Korean languages. We confine our analysis to English articles by excluding all non-English words
using regular expression (regex). Regex is also used to filter out undesired symbols and numeric data. The graph data,
including nodes and edges, are exported to Gephi [Bastian et al., 2009] for better visualization and manipulation of the
graph models and model-driven results. All the notebooks with source code and data are publicly shared in a Github
project repository (https://github.com/msahamed/covid19-text-network).
3.4 Modeling and Analysis
The proposed graph model is developed using all available article abstracts in the dataset. We assume that an abstract
has succinct and sufficient information to represent the overall content of a research article. Once the global graph
is developed using the article abstracts, we find several influential words based on their BC values that are linked to
coronavirus. Next, we build separate subgraph for each of the selected key topic as shown in Figure 1. Within each
subgraph, neighboring nodes of a key topic are expected to reveal a variety of information about the topic word. The
neighboring words are further grouped into several categories. Within each category, we rank order the neighboring
words using BC values obtained from the subgraph model to identify their relative importance. We discuss our findings
of this approach in the next section.
4 Results and Discussion
The CORD-19 dataset has a substantial number of articles with missing abstracts. Therefore, we use available abstracts
for 10,683 research articles to build our proposed graph model. In the data processing step, we have created a list of 2796
stop words based on the word frequency and visual scrutiny to exclude them from the corpus. We first find the word
coronavirus in the global graph to select its three neighboring nodes with high BC values. Since BC values can be
exponentially very large, we take natural logarithm of these values (log BC). We select three key topics: transmission,
drug, and gene with log BC values of 16.65, 16.75, and 18.27, respectively. In the following subsections, we discuss
our findings on the three subgraph models corresponding to transmission, drug, and gene topics, respectively.
4.1 Transmission Subgraph
Studies related to infectious disease transmission are important in understanding the spread and growth of pathogens
in different media. Table 1 shows the neighboring words of the transmission subgraph. The subgraph reveals a
number of host-pathogen interactions. Animal hosts such as poultry, camel, dogs, pigs, monkeys, chimpanzees may
be candidates for studying the transmission mechanism of different zoonotic pathogens related coronavirus. The
coronavirus related topic transmission also connects to a number of diseases and pathogens, including rabies, dengue,
tuberculosis, adenovirus, malaria, measles, and hand-foot-and-mouth disease (HFMD). These pathogens may have
interesting traits in common with the transmission of coronavirus. Keywords such as multi-drug resistant organism
(MDRO), tropism (the growth or movement of organisms), zoonosis (the topic of disease transmission from animals
to humans), sanitization, bioaerosols have expectedly appeared in the subgraph of transmission topic. Additionally,
the subgraph includes words related to different transmission media such as blood, respiratory, lung, water, aerosol,
4A PREPRINT - A PRIL 7, 2020
(a) Host keywords (b) Disease keywords
10.0 10.0
Log of BC
Log of BC
7.5 7.5
5.0 5.0
2.5 2.5
0.0 0.0
goats
rats
flies
bird
horses
camels
dogs
pigs
honeybees
rodents
monkeys
cholera
measles
hfmd
rabies
dengue
mosquito
mammals
livestock
infants
poultry
malaria
monkeypox
lyssavirus
schistosomiasis
chikungunya
poliovirus
tuberculosis
rhinovirus
adenovirus
Figure 2: Rank ordering of the most important keywords in the transmission subgraph that is originally linked to the
coronavirus node. The subgraph keywords are grouped into (a) host and (b) disease categories. Log of BC is the natural
logarithm (log) of betweenness centrality.
Table 2: Neighboring words in the drug subgraph linked to coronavirus.
Topic Connected nodes in the drug subgraph
Drug ’Oseltamivir’, ’Amantadine’, ’Viagra’, ’Tunicamycin’, ’Aspirin’, ’Paracetamol’, ,’Chloroquine’,
’Meropenem’,’Troglitazone’, ’Bleomycin’, ’Dexamethasone’, ’Antimoniate’, ’Nitazoxanide’,
’Niclosamide’, ’Resveratrol’, ’Gemcitabine’, ’Rivastigmine’, ’Amphotericin’, ’Paromomycin’,
’Miltefosine’, ’Sitamaquine’, ’Berenil’, ’Daunorubicin’, ’Artemether’, ’Buprenorphine’, ’Ac-
etaminophen’, ’Verdinexor’, ’Mizoribine’, ’Amodiaquine’, ’Sunitinib’, ’Carbapenem’, ’Amio-
darone’, ’Boceprevir’, ’Pranobex’, ’Dapsone’, ’Betalactam’, ’Tanespimycin’, ’Saracatinib’,
’Vildagliptin’, ’Ciclosporin’, ’Cyclosporin’
Chemical ’Glycoprotein’, ’Oligonucleotides’, ’Glycosylation’, ’Cyclophilins’, ’Glucocorticoid’, ’Acetyl-
cholinesterase’, ’Estrogen’, ’Cyclooxygenase’, ’Glycyrrhizin’, ’Hexachlorophene’, ’Myricetin’,
’Glycosides’, ’Hexamethylene’, ’Phytomedicines’, ’Phytoconstituents’, ’Farnesyltransferase’,
’Imbricataloic’, ’Salicylanilide’, ’Quinoxalines’, ’Amidohydrolases’, ’Enantiomers’, ’Ectoen-
zyme’,’Isoleucin’, ’Hydrochloride’, ’Calcium’, Glutamate, Phytochemicals’, ’Diphosphate’,
’Phisohex’, ’Monoammonium’
Disease ’HIV’, ’Pneumonia’, ’Flavivirus’, ’Chikungunya’, ’Zikavirus’, ’Monkeypox’,
’Melanoma’,’Dengue’,’Tuberculosis’, ’BTCOV’, ’Hypertension’, ’Phospholipidosis’, ’He-
matemesis’, ’Acinetobacter’, ’Baumannii’
Miscellaneous ’Metabolism’, ’Pharmacokinetics’, ’Seroprevalence’, ’Bactericidal’, ’Antifungals’, ’Macropinocy-
tosis’, ’Toxoplasmosis’, ’Syncytia’, ’Pyrenees’, ’Decoction’,’Nutraceuticals’, ’Chemoprevention’,
’Clerodendrum’ , ’Molluscicide’, ’Myelosuppression’, ’Prophylaxis’, ’Macrocyclization’,
oronasal, gut, mouth, trachea, glands, droplet, saliva, airborne, and semen. Other miscellaneous keywords are presented
in Table 1 that includes infants and pregnancy that are important topics for transmission studies.
Figure 2 shows the most important keywords in the transmission subgraph based on their BC measurements. Among
the pathogen hosts, pigs, dogs, and camels have topped the list. Although bats do not appear in this list of hosts, they do
appear later in the gene subgraph. The top diseases in the rank ordering, such as dengue, rabies, malaria, adenovirus,
may have some topics common with the study of coronavirus transmission. The rhinovirus related to common cold
and lyssavirus often hosted by bats have appeared in the ranking. The adenovirus, ranked in the third position, causes
symptoms similar to those of coronavirus infection.
4.2 Drug Subgraph
The study of drugs (pharmacology) plays a big role in the battle against diseases. Table 2 shows the connected words of
the drug subgraph that is linked to the coronavirus node. A long list of drugs have appeared in the literature, including
Chloroquine. Several antiviral drugs such as ’Verdinexor’, ’Oseltamivir’, ’Amantadine’, ’Pranobex’ have appeared in
the drug subgraph. Other important drugs that appear in the subgraph are drugs for cancer treatment (Gemcitabine,
Berenil, Daunorubicin, Tanespimycin), anti-fungal drug (Amphotericin), antibiotics (Paromomycin, Carbapenem,
5A PREPRINT - A PRIL 7, 2020
(a) Drug keywords (b) Chemical compound keywords
10.0 10.0
Log of BC
Log of BC
7.5 7.5
5.0 5.0
2.5 2.5
0.0 0.0
estrogen
calcium
artemether
resveratrol
glutamate
sunitinib
cyclosporin
carbapenem
gemcitabine
paromomycin
oseltamivir
paracetamol
dexamethasone
chloroquine
amantadine
cyclooxygenase
enantiomers
myricetin
glycosides
amodiaquine
saracatinib
niclosamide
amphotericin
nitazoxanide
acetaminophen
cyclophilins
glycosylation
glycoprotein
buprenorphine
farnesyltransferase
phytochemicals
hydrochloride
hexachlorophene
glucocorticoid
glycyrrhizin
acetylcholinesterase
oligonucleotides
(c) Disease keywords (d) Micellaneous keywords
10.0 10.0
Log of BC
Log of BC
7.5 7.5
5.0 5.0
2.5 2.5
0.0 0.0
hiv
dengue
melanoma
syncytia
decoction
acinetobacter
flavivirus
pneumonia
zikavirus
chikungunya
baumannii
hypertension
tuberculosis
antifungals
metabolism
toxoplasmosis
bactericidal
seroprevalence
prophylaxis
macrocyclization
macropinocytosis
pharmacokinetics
Figure 3: Rank ordering of the most influential keywords in the drugs subgraph that is linked to the coronavirus
node. The keywords of the subgraph are then grouped into drugs, chemical compounds, diseases, and miscellaneous
categories. Log of BC is the natural logarithm (log) of betweenness centrality.
Betalactam), immunosuppressive drugs (Mizoribine, Cyclosporin), drugs for malaria (Amodiaquine) and Hepatitis
C virus treatment (boceprevir). Many chemical and organic compounds, proteins and enzymes are mentioned in the
literature that may have important pharmaceutical properties. Plants (Clerodendrum), plant-based organic compounds
(Resveratrol, Glycyrrhizin), and medicine derived from plants (Phytomedicines, Phytochemicals, Phytoconstituents)
have appeared in the literature. Bat coronavirus (BTCOV) appears on the list as well. Interestingly, ’Chikungunya’,
’Monkeypox’, ’Dengue’, and ’Tuberculosis’ appeared both in the subgraphs of drug and transmission. The pathogens
of these diseases may have useful links to the study of drug and transmission of coronavirus.
We group the neighboring words within the drug subgraph into four categories: drug, chemical compound, disease, and
miscellaneous. Figure 3 shows the most important keywords under four different categories. Interestingly, antiviral drug
Amantadine has the highest rank even before Chloroquine. Amantadine is administered to treat symptoms related
to flu viruses although it may not be effective for treating illness due to all types of virus strains. Dexamethasone
has almost similar BC measurement as Chloroquine, which is a drug to aid the immune system in the fight against
inflammation. In the disease category of the drug subgraph, zikavirus and pneumonia appear as the top two keywords
followed by HIV and dengue. This rank ordering is quite different from that of the transmission subgraph (Figure 2)
where dengue topped the list. There may be common pharmaceutical interests shared by the drugs used for zikavirus
and coronavirus treatment.
In the chemical category, Glycoprotein and a chemical reaction known as Glycosylation have topped the list. Oligonu-
cleotides, single strands of RNA, have ranked fourth in this category presumably because of their known effect in
inhibiting the replication of influenza virus [Kumar et al., 2013]. Interestingly, plant-based compound Glycyrrhizin
ranks fifth in the list, which has been used to cure bronchitis, gastritis, and jaundice. There are other proteins and
enzymes in the list that will require further discussion by the field experts. The miscellaneous category suggests research
studies focusing on ’metabolism’, ’prophylaxis’ (actions to prevent disease), ’pharmacokinetics’ (movement of drugs in
the body), and ’seroprevalence’ (number of individuals tested positive based on blood serum).
6A PREPRINT - A PRIL 7, 2020
Table 3: Neighboring words in the gene subgraph linked to coronavirus.
Topic Connected nodes in the gene subgraph
Disease ’Cancer’, ’Zikavirus’, ’HCV’, ’RSV’, ’Pneumonia’, ’Influenza’,’SARS’, ’PEDV’, ’Dengue’,
’HIV’, ’IBV’, Adenovirus’, ’Rabies’, ’Hepatitis’, ’PRRSV’, ’H1N1’, ’H5N1’, ’CHIKV’, ’Rhe-
sus’, ’HCOV’, ’ARDS’, ’Denv’, ’Flavivirus’, ’HRV’, ’Rhinovirus’, ’Malaria’, ’Rotavirus’,
’HMPV’, ’Tuberculosis’, ’Enterovirus’, ’HBV’,’HEV’,’COPD’, ’Asthma’, ’Ebolavirus’, ’Her-
pesvirus’ ,’Measles’, ’MHV’, ’Cytomegalovirus’,’Sepsis’, ’Astrovirus’, ’Bronchitis’, ’FIP’,
’Alphavirus’, ’EVB’, ’H7N9’, ’PVC2’, ’CMV’, ’Paramyxovirus’, ’Betacoronavirus’, ’Re-
ovirus’, ’VSV’, ’Polyomavirus’, ’Atherosclerosis’,’HSV’, ’BRSV’, ’KSHV’, ’CPV’, ’Torovirus’,
’HKU1’, ’Circovirus’, ’BRD’, ’GBS’, ’RRV’, ’Cirrhosis’, ’Orthoreovirus’,’Hypoxia’, ’Bunyavirus’,
’PHEV’,’Hartmanivirus’,’PVX’, ’Dyskinesia’, ’Lassa’, ’PR8’, ’VWD’, ’HPYV6’
Host ’Macaque’, ’Pigs’, ’Cats’,’Bat’, ’Mice’, ’Swine’, ’Poultry’, ’Birds’, ’Porcine’, ’Cattle’, ’Camels’,
’Cynomolgus’, ’Horses’, ’Ades’
Biomolecular ’PCR’, ’RNA’, ’DNA’, ’ACE2’, ’NGS’, ’Aptamers’, ’Metagenomics’, ’QPCR’, ’SIRNA’, ’CD8’,
’Ceacam1’, ’Haplotypes’, ’MIRNA’,’DPP4’, ’KDA’, ’STAT3’, ’APOD’,’IFNAR’, ’SHRNAS,
’APOE’, ’vsiRNAs’, ’APOM’, ’IFN’, ’Glycoprotein’, ’Corticosteroids’,’PRP’, ’Macrolide’,
’Macrophages’, ’Cytokine’, ’Interferon’, ’Nucleotide’, ’Polymerase’, ’Demyelination’, ’Im-
munoglobulin’, ’Astrocytes’, ’Chemokines’, ’Myelin’,’PRF’,’TH2’, ’Kinases’, ’Pleiotropic’,
’ORF3’, ’Phosphoprotein’, ’Kegg’, ’Caspase’, ’Peptidase’, ’Eukaryotes’, ’Mycoplasma’, ’Cryp-
tosporidium’, ’Mycobacterium’, ’MBL2’
4.3 Gene Subgraph
The study of genes is imperative as it distinguishes different virus strains and identifies target proteins for successful
discovery of drugs and vaccines. The gene subgraph includes a large number of diseases and pathogens as shown in
Table 3. The list includes Hepatitis viruses (HCV, HBV, HEV), influenza viruses (H1N1, H5N1, H7N9, PR8), other
viruses related to Dengue (Denv), Chikungunya (CHIKV), HIV, zikavirus, and malaria. Many of these viruses can share
similar genomic structure and replication strategy that can be repurposed in the pathogenic study and drug discovery of
novel virus strains such as COV-2. Coronaviruses related to pigs, such as porcine epidemic diarrhea virus (PEDV),
porcine hemagglutinating encephalomyelitis virus (PHEV), mouse hepatitis virus (MHV), feline infectious peritonitis
(FIP) virus, mice originated human coronavirus (HKU1), alpha- and Betacoronaviruses (mainly infect bats) appear
in the subgraph as they belong to the same coronaviridae virus family. In the host list, several species of monkey
(Macaque and Cynomolgus) have appeared in addition to pigs and bats. The result is a proof-of-concept that bats are
subjects of genetic research related to coronavirus. The discovery of coronavirus strains in hosts like pangolin is a very
recent discovery and is still under active research [Lam et al., 2020].
We further group the neighboring words of the gene subgraph into four categories: host, disease, bio-molecular, and
miscellaneous. The biomolecular category includes proteins, cellular mechanisms, drugs, and biochemical compounds.
Figure 4 shows the rank ordering of keywords of each category based on BC measurements. Interestingly, cancer
appears after SARS, influenza, pneumonia, which suggests a link between the genomic studies of cancer and virus. The
porcine epidemic diarrhea virus (PEDV) appears in the fifth position suggesting an important pathogen in the genetic
study of flu viruses. In the bimolecular category, polymerase chain reaction (PCR), a mechanism used in molecular
biology for testing coronavirus, has ranked second after RNA. The abbreviation of interferon (IFN) appears fourth
in this category and second in the miscellaneous keyword category, which is a set of proteins released by the host in
response to a virus attack. In the host category, mice rank the top followed by bat, cat, pig, and a species of monkey
(macaque). There are other potentially intriguing keywords in the ranking that will require more space and scope for
further discussion.
4.4 Summary and Limitations
We have developed a graph-based model to mine information for COVID-19 research. The major findings are as
follows. 1) The proposed graph model is effective in searching informative and connected keywords from a large
volume of documents. The BC measurement of individual words can rank their importance that has shown intuitive
and informative results. 2) pigs appear in both transmission and gene subgraphs as an important host of pathogens
related to the coronaviridae virus family (PEDV and PHEV), 3) in addition to dengue and malaria, zikavirus may
have a connection with the development of drugs for flu viruses, 4) our model finds a potential link between genome
7A PREPRINT - A PRIL 7, 2020
(a) Disease keywords (b) Biomolecular keywords
Log of BC
10.0 10.0
Log of BC
7.5 7.5
5.0 5.0
2.5 2.5
0.0 hrv 0.0
cd8
ace2
denv
ards
hcov
rhesus
h5n1
hiv
ibv
hcv
rsv
sars
prp
orf3
kda
ifn
dna
pcr
rna
chikv
h1n1
prrsv
rabies
pedv
cancer
dengue
dpp4
ceacam1
mirna
sirna
qpcr
ngs
malaria
aptamers
hepatitis
influenza
chemokines
flavivirus
adenovirus
pneumonia
haplotypes
pleiotropic
metagenomics
corticosteroids
glycoprotein
(c) Host keywords (d) Micellaneous keywords
10.0 10.0
Log of BC
Log of BC
7.5 7.5
5.0 5.0
2.5 2.5
0.0 0.0
horses
camels
swine
cats
cattle
pigs
bat
mice
caspase
kegg
th2
prf
porcine
birds
macaque
kinases
myelin
poultry
eukaryotes
peptidase
astrocytes
cynomolgus
chemokines
cytokine
polymerase
nucleotide
interferon
macrophages
mycobacterium
demyelination
cryptosporidium
immunoglobulin
Figure 4: Rank ordering of the most influential keywords in the gene subgraph linked to the coronavirus node. The
subgraph keywords are grouped into disease, host, biomolecular, and miscellaneous categories. Log of BC is the natural
logarithm of betweenness centrality.
research of cancer and flu virus, 5) a list of antiviral drugs have been identified where ’Amantadine’ ranks above
’Chloroquine’ in the treatment of flu virus infection. More importantly, the authors of a recent review on finding
therapies for COVID-19 [Li and De Clercq, 2020] highlight the importance of a) repurposing antiviral agents used
in treating HIV, HBV, HCV, and Influenza, b) oligonucleotide and interferon-based therapies, c) glycoprotein as a
target for an antiviral agent, d) reference to RNA viruses (Influenza, Ebola, chikungunya, and enterovirus), e) a drug
’Nitazoxanide’ to inhibit COV-2 - all of which appear in our subgraph models and rank order. We believe similar other
keywords will intrigue the scientists to find repurposing candidates for COVID-19 research.
Despite promising findings, there are several limitations of this study. First, high BC values of top ranked keywords
may be partly attributed to their high prevalence in the literature. There is a possibility that keywords in the lower
rank order are less prevalent in the literature. However, low-ranked keywords can still be strong candidates for future
COVID-19 research. Second, our analysis is limited to article abstracts without incorporating full article texts that can
be a richer source of information. Third, we do not use any explicit metric to evaluate our graph model performance
and findings since this study is entirely exploratory. Fourth, our graph model includes a broad array of diverse health
science literature. A grouping of research articles and subsequent analysis within individual subgroups may yield more
focused results. We leave these tasks for our future work.
5 Conclusions
This paper demonstrates one of the first studies on mining a large volume of health science literature in search of
information for the ongoing battle against COVID-19. The results presented in tables and figures under different topics
effectively summarize the key information contents of the literature with the aid of a graph-based computational model
and betweenness centrality based keyword ranking. The proposed model has been able to recommend informative
drug types, pathogens, hosts, chemical and biomolecules for COVID-19 research that may not always be evident to the
concerned researchers. The findings may be a useful reference guide to provoke new ideas and expedite new directions
of research in the fight against the COVID-19 pandemic.
8A PREPRINT - A PRIL 7, 2020
References
Ron A.M. Fouchier, Thijs Kuiken, Martin Schutten, Geert Van Amerongen, Gerard J.J. Van Doornum, Bernadette G.
Van Den Hoogen, Malik Peiris, Wilina Lim, Klaus Stöhr, and Albert D.M.E. Osterhaus. Koch’s postulates fulfilled
for SARS virus. Nature, 423(6937):240, may 2003. ISSN 00280836. doi: 10.1038/423240a.
S. R. Weiss and S. Navas-Martin. Coronavirus Pathogenesis and the Emerging Pathogen Severe Acute Respiratory
Syndrome Coronavirus. Microbiology and Molecular Biology Reviews, 69(4):635–664, dec 2005. ISSN 1092-2172.
doi: 10.1128/mmbr.69.4.635-664.2005.
Yutaro Yamaoka, Shutoku Matsuyama, Shuetsu Fukushi, Satoko Matsunaga, Yuki Matsushima, Hiroyuki Kuroyama,
Hirokazu Kimura, Makoto Takeda, Tomoyuki Chimuro, and Akihide Ryo. Development of Monoclonal Antibody
and Diagnostic Test for Middle East Respiratory Syndrome Coronavirus Using Cell-Free Synthesized Nucleocapsid
Antigen. Frontiers in Microbiology, 7(APR):509, apr 2016. ISSN 1664-302X. doi: 10.3389/fmicb.2016.00509. URL
http://journal.frontiersin.org/Article/10.3389/fmicb.2016.00509/abstract.
Els Keyaerts, Sandra Li, Leen Vijgen, Evelien Rysman, Jannick Verbeeck, Marc Van Ranst, and Piet Maes. Antiviral
activity of chloroquine against human coronavirus OC43 infection in newborn mice. Antimicrobial Agents and
Chemotherapy, 53(8):3416–3421, aug 2009. ISSN 00664804. doi: 10.1128/AAC.01509-08.
Chean Yeah Yong, Hui Kian Ong, Swee Keong Yeap, Kok Lian Ho, and Wen Siang Tan. Recent Advances in the
Vaccine Development Against Middle East Respiratory Syndrome-Coronavirus. Frontiers in Microbiology, 10, aug
2019. ISSN 1664-302X. doi: 10.3389/fmicb.2019.01781.
Makoto Ozawa and Yoshihiro Kawaoka. Cross Talk Between Animal and Human Influenza Viruses. Annual Review of
Animal Biosciences, 1(1):21–42, jan 2013. ISSN 2165-8102. doi: 10.1146/annurev-animal-031412-103733.
Na Zhu, Dingyu Zhang, Wenling Wang, Xingwang Li, Bo Yang, Jingdong Song, Xiang Zhao, Baoying Huang, Weifeng
Shi, Roujian Lu, Peihua Niu, Faxian Zhan, Xuejun Ma, Dayan Wang, Wenbo Xu, Guizhen Wu, George F. Gao, and
Wenjie Tan. A novel coronavirus from patients with pneumonia in China, 2019. New England Journal of Medicine,
382(8):727–733, feb 2020. ISSN 15334406. doi: 10.1056/NEJMoa2001017. URL http://www.nejm.org/doi/
10.1056/NEJMoa2001017.
Fan Wu, Su Zhao, Bin Yu, Yan-Mei Chen, Wen Wang, Zhi-Gang Song, Yi Hu, Zhao-Wu Tao, Jun-Hua Tian, Yuan-Yuan
Pei, Ming-Li Yuan, Yu-Ling Zhang, Fa-Hui Dai, Yi Liu, Qi-Min Wang, Jiao-Jiao Zheng, Lin Xu, Edward C. Holmes,
and Yong-Zhen Zhang. A new coronavirus associated with human respiratory disease in China. Nature, 579(7798):
265–269, mar 2020. ISSN 0028-0836. doi: 10.1038/s41586-020-2008-3.
Ensheng Dong, Hongru Du, and Lauren Gardner. An interactive web-based dashboard to track COVID-19 in real time.
The Lancet. Infectious diseases, 0(0), 2020. ISSN 1474-4457. doi: 10.1016/S1473-3099(20)30120-1.
Jia Liu, Ruiyuan Cao, Mingyue Xu, Xi Wang, Huanyu Zhang, Hengrui Hu, Yufeng Li, Zhihong Hu, Wu Zhong, and
Manli Wang. Hydroxychloroquine, a less toxic derivative of chloroquine, is effective in inhibiting SARS-CoV-2
infection in vitro. Cell Discovery, 6(1):16, mar 2020. ISSN 2056-5968. doi: 10.1038/s41421-020-0156-0. URL
http://www.nature.com/articles/s41421-020-0156-0.
L. Masuelli, M. Granato, M. Benvenuto, R. Mattera, R. Bernardini, M. Mattei, G. D’Amati, G. D’Orazi, A. Faggioni,
R. Bei, and M. Cirone. Chloroquine supplementation increases the cytotoxic effect of curcumin against Her2/neu
overexpressing breast cancer cells in vitro and in vivo in nude mice while counteracts it in immune competent mice.
OncoImmunology, 6(11), nov 2017. ISSN 2162402X. doi: 10.1080/2162402X.2017.1356151.
Ilan Ben-Zvi, Shaye Kivity, Pnina Langevitz, and Yehuda Shoenfeld. Hydroxychloroquine: from malaria to autoimmu-
nity. Clinical reviews in allergy & immunology, 42(2):145–153, 2012.
Ángel Hernández-castañeda, René Arnulfo García-hernández, Yulia Ledeneva, and Christian Eduardo Millán-hernández.
Extractive Automatic Text Summarization Based on Lexical-Semantic Keywords. 8, 2020.
Alan Diaz-Manriquez, Ana Bertha Rios-Alvarado, Jose Hugo Barron-Zambrano, Tania Yukary Guerrero-Melendez, and
Juan Carlos Elizondo-Leal. An Automatic Document Classifier System Based on Genetic Algorithm and Taxonomy.
IEEE Access, 6:21552–21559, 2018. ISSN 21693536. doi: 10.1109/ACCESS.2018.2815992.
Hamed Jelodar, Yongli Wang, Chi Yuan, Xia Feng, Xiahui Jiang, Yanchao Li, and Liang Zhao. Latent Dirichlet
allocation (LDA) and topic modeling: models, applications, a survey. Multimedia Tools and Applications, 78(11):
15169–15211, jun 2019. ISSN 15737721. doi: 10.1007/s11042-018-6894-4.
9A PREPRINT - A PRIL 7, 2020
Zhang Yun-tao, Gong Ling, and Wang Yong-cheng. An improved TF-IDF approach for text classification. Journal of
Zhejiang University-SCIENCE A 2005 6:1, 6(1):49–55, aug 2005. ISSN 1862-1775. doi: 10.1007/BF02842477.
Winnie Cheng, Chris Greaves, and Martin Warren. From n-gram to skipgram to concgram. International Journal of
Corpus Linguistics, 11(4):411–433, jan 2006. ISSN 1384-6655. doi: 10.1075/ijcl.11.4.04che.
Jeffrey Pennington, Richard Socher, and Christopher D. Manning. Glove: Global Vectors for Word Representation.
pages 1532–1543, 2014. doi: 10.3115/V1/D14-1162.
Saroj Kr Biswas, Monali Bordoloi, and Jacob Shreya. A graph based keyword extraction model using collective node
weight. Expert Systems with Applications, 97:51–59, may 2018. ISSN 09574174. doi: 10.1016/j.eswa.2017.12.025.
Delia Rusu, Blaž Fortuna, Dunja Mladenic, Marko Grobelnik, and Ruben Sipoš. Document visualization based on
semantic graphs. Proceedings of the International Conference on Information Visualisation, pages 292–297, 2009.
ISSN 10939547. doi: 10.1109/IV.2009.57.
Per Hage and Frank Harary. Eccentricity and centrality in networks. Social Networks, 17(1):57–63, jan 1995. ISSN
03788733. doi: 10.1016/0378-8733(94)00248-9.
Linton C Freeman. A set of measures of centrality based on betweenness. Sociometry, pages 35–41, 1977.
Ulrik Brandes. A faster algorithm for betweenness centrality. Journal of mathematical sociology, 25(2):163–177, 2001.
Matthew Hoffman, Francis R Bach, and David M Blei. Online Learning for Latent Dirichlet Allocation. In J D
Lafferty, C K I Williams, J Shawe-Taylor, R S Zemel, and A Culotta, editors, Advances in Neural Information
Processing Systems 23, pages 856–864. Curran Associates, Inc., 2010. URL http://papers.nips.cc/paper/
3902-online-learning-for-latent-dirichlet-allocation.pdf.
Thomas K Landauer, Peter W. Foltz, and Darrell Laham. An introduction to latent semantic analysis. Discourse
Processes, 25(2-3):259–284, jan 1998. ISSN 0163-853X. doi: 10.1080/01638539809545028.
Mathieu Bastian, Sebastien Heymann, and Mathieu Jacomy. Gephi: An Open Source Software for Exploring and
Manipulating Networks Visualization and Exploration of Large Graphs. Technical report, mar 2009. URL www.
aaai.org.
Prashant Kumar, Binod Kumar, Roopali Rajput, Latika Saxena, Akhil C. Banerjea, and Madhu Khanna. Cross-protective
effect of antisense oligonucleotide developed against the common 3 NCR of influenza A virus genome. Molecular
Biotechnology, 55(3):203–211, nov 2013. ISSN 10736085. doi: 10.1007/s12033-013-9670-8.
Tommy Tsan-Yuk Lam, Marcus Ho-Hin Shum, Hua-Chen Zhu, Yi-Gang Tong, Xue-Bing Ni, Yun-Shi Liao, Wei
Wei, William Yiu-Man Cheung, Wen-Juan Li, Lian-Feng Li, Gabriel M. Leung, Edward C. Holmes, Yan-Ling
Hu, and Yi Guan. Identifying SARS-CoV-2 related coronaviruses in Malayan pangolins. Nature, pages 1–6,
mar 2020. ISSN 0028-0836. doi: 10.1038/s41586-020-2169-0. URL http://www.nature.com/articles/
s41586-020-2169-0.
Guangdi Li and Erik De Clercq. Therapeutic options for the 2019 novel coronavirus (2019-nCoV), mar 2020. ISSN
14741784.
10You can also read